# NCCL Mesh Plugin
Custom NCCL network plugin enabling distributed ML over direct-connect RDMA mesh topologies.
## 🎯 What This Does
This plugin enables NCCL (NVIDIA Collective Communications Library) to work with direct-connect mesh topologies where each node pair is on a different subnet. Standard NCCL plugins assume either:
- A switched InfiniBand fabric (all nodes on same subnet)
- TCP/IP networking (slow, high latency)
Neither works for direct-cabled RDMA meshes. This plugin does.
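For orientation, an NCCL network plugin is a shared library that hands NCCL a table of connection and transfer callbacks. The sketch below shows the general shape of those entry points; the real interface is the versioned `ncclNet_t` vtable from NCCL's ext-net headers, with more functions and different signatures, so treat this as illustrative only.

```c
/* Simplified sketch of the entry points a NCCL net plugin implements
 * (illustrative; the real ncclNet_t vtable is versioned and richer). */
#include <stddef.h>

typedef struct {
    const char *name;                                         /* e.g. "mesh" */
    int (*init)(void);                                        /* enumerate RDMA NICs once */
    int (*devices)(int *ndev);                                /* how many NICs we expose */
    int (*listen)(int dev, void *handle, void **listen_comm); /* create QPs, fill handle with ALL local IPs */
    int (*connect)(int dev, void *handle, void **send_comm);  /* pick the NIC on the peer's subnet, dial it */
    int (*accept)(void *listen_comm, void **recv_comm);       /* take a handshaked connection off the queue */
    int (*isend)(void *send_comm, void *data, size_t size, void **request);
    int (*irecv)(void *recv_comm, void *data, size_t size, void **request);
    int (*test)(void *request, int *done, size_t *size);      /* poll RDMA completions */
} mesh_net_vtable_t;
```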
## 🔧 The Problem We Solved

```
          ┌─────────────┐
          │   Spark-A   │
          │  (titanic)  │
          └──────┬──────┘
  192.168.101.x  │  192.168.100.x
    (100Gbps)    │    (100Gbps)
          ┌──────┴──────┐
          │             │
    ┌─────┴─────┐ ┌─────┴─────┐
    │  Spark-B  │ │  Spark-C  │
    │ (iceberg) │ │(carpathia)│
    └─────┬─────┘ └─────┬─────┘
          │             │
          └──────┬──────┘
           192.168.102.x
             (100Gbps)
```
Three NVIDIA DGX Spark workstations connected in a triangle mesh with direct 100Gbps RDMA cables. Each link is on a different subnet - a configuration NVIDIA never intended to support.
## 🚀 Results
| Metric | Value |
|---|---|
| Effective Bandwidth | 8+ GB/s |
| Line Rate Utilization | ~64% |
| Topology | 3-node triangle mesh |
| Link Speed | 100 Gbps per link |
Successfully ran distributed LLM inference (Mistral-7B) across all 3 nodes using NCCL over this custom topology.
## ⚡ Unified Memory Advantage
On Grace Hopper / DGX Spark systems, the GPU and CPU share the same physical memory via NVLink-C2C. This unified memory architecture means:
- No staging copies - RDMA operates directly on GPU-accessible memory
- GPUDirect-like performance - Without additional kernel modules or configuration
- Simplified memory management - Register once, use everywhere
The 8+ GB/s bandwidth is the real deal, not bottlenecked by GPU↔Host transfers.
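As a minimal illustration of the "register once" point, RDMA registration on such a system is a single `ibv_reg_mr()` call over the buffer NCCL hands to the plugin; because CPU and GPU share the same physical memory, no separate bounce buffer has to be registered or copied through. This is a hedged sketch, not the plugin's actual registration path; error handling and `ibv_dereg_mr()` are omitted.

```c
#include <infiniband/verbs.h>
#include <stddef.h>

/* Register a buffer for RDMA. On a unified-memory node (Grace + Blackwell
 * over NVLink-C2C) the same call covers GPU-visible allocations, so the NIC
 * can read and write them directly with no staging copy. */
struct ibv_mr *register_buffer(struct ibv_pd *pd, void *buf, size_t len) {
    int access = IBV_ACCESS_LOCAL_WRITE |   /* NIC may write locally (receives)  */
                 IBV_ACCESS_REMOTE_WRITE;   /* peer may RDMA-write into this MR  */
    return ibv_reg_mr(pd, buf, len, access);
}
```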
## 🏗️ Architecture

### Key Innovations

- **Multi-Address Handle Exchange**
  - Each node advertises ALL its subnet IPs in the NCCL handle (see the handle sketch after this list)
  - Connector searches for reachable addresses by subnet matching
- **Subnet-Aware NIC Selection**
  - `connect()` finds the local NIC on the same subnet as the peer
  - Automatic routing without IP forwarding or bridges
- **Background Handshake Thread**
  - Eliminates deadlock when both ranks call `connect()` simultaneously
  - TCP-based QP info exchange runs asynchronously
- **Bidirectional QP Exchange**
  - Each connection creates fresh Queue Pairs on both sides
  - No QP reuse across multiple NCCL channels
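To make the multi-address handle concrete, here is a hypothetical sketch of what the exchanged handle could carry. All names and sizes are illustrative assumptions, not the plugin's actual definitions.

```c
/* Hypothetical layout of the connection handle (illustrative only).
 * NCCL copies this opaque blob from the listening rank to the connecting
 * rank, which scans peer_addrs[] for an IP that falls on one of its own
 * local subnets. */
#include <stdint.h>

#define MESH_MAX_ADDRS 4          /* assumed cap; a 3-node triangle needs 2 per node */

typedef struct {
    uint32_t ip;                  /* IPv4 address, network byte order */
    uint32_t netmask;             /* e.g. 0xffffff00 for a /24 */
    uint16_t tcp_port;            /* port of the background handshake thread */
} mesh_addr_t;

typedef struct {
    uint32_t    num_addrs;                   /* how many NICs this node advertised */
    mesh_addr_t peer_addrs[MESH_MAX_ADDRS];  /* one entry per RDMA-capable NIC */
} mesh_handle_t;
```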
### RDMA Implementation
- Raw InfiniBand Verbs API (libibverbs)
- Reliable Connected (RC) Queue Pairs
- RoCE v2 over Ethernet
- Zero-copy on unified memory systems
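As a minimal sketch of what "RC Queue Pairs via libibverbs" means in practice, the snippet below creates one RC queue pair against an existing protection domain and completion queue. Queue depths are assumed values and error handling is omitted; this is not the plugin's actual code.

```c
#include <infiniband/verbs.h>

/* Create one Reliable Connected (RC) queue pair on an already-opened device.
 * pd and cq come from ibv_alloc_pd() / ibv_create_cq() on the selected NIC. */
struct ibv_qp *create_rc_qp(struct ibv_pd *pd, struct ibv_cq *cq) {
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,      /* reliable connected transport */
        .cap = {
            .max_send_wr  = 128,    /* outstanding work requests (assumed depth) */
            .max_recv_wr  = 128,
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };
    return ibv_create_qp(pd, &attr);   /* NULL on failure */
}
```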
## 📦 Installation

### Prerequisites

```bash
# Ubuntu/Debian
sudo apt-get install libibverbs-dev librdmacm-dev

# Verify RDMA devices
ibv_devices
```
### Build

```bash
git clone https://github.com/autoscriptlabs/nccl-mesh-plugin.git
cd nccl-mesh-plugin
make
```
### Use

```bash
export LD_LIBRARY_PATH=$(pwd):$LD_LIBRARY_PATH
export NCCL_NET_PLUGIN=mesh
export NCCL_DEBUG=INFO   # or WARN for less output

# Run your distributed job
python your_distributed_script.py
```
## 🧪 Testing

### Basic All-Reduce Test

```python
import torch
import torch.distributed as dist

dist.init_process_group('nccl', rank=RANK, world_size=3,
                        init_method='tcp://MASTER_IP:29500')

t = torch.ones(1000, device='cuda')
dist.all_reduce(t)
print(f'Result: {t[0]}')  # Should print 3.0

dist.destroy_process_group()
```
### Bandwidth Benchmark

```python
import torch
import torch.distributed as dist
import time

dist.init_process_group('nccl', rank=RANK, world_size=3,
                        init_method='tcp://MASTER_IP:29500')

t = torch.ones(1024*1024*64, device='cuda')  # 256MB

# Warmup
for _ in range(5):
    dist.all_reduce(t)
torch.cuda.synchronize()

# Benchmark
start = time.time()
for _ in range(20):
    dist.all_reduce(t)
torch.cuda.synchronize()
elapsed = time.time() - start

print(f'Bandwidth: {(256*20/1024)/elapsed:.2f} GB/s')
```
## 🔬 How It Works

### Connection Flow

```
Rank 0 (listen)               Rank 1 (connect)
    │                                 │
    ▼                                 │
listen()                              │
  ├─ Create QPs on ALL NICs           │
  ├─ Start handshake thread           │
  ├─ Return handle with all IPs       │
    │                                 │
    │◄─────── handle exchange ───────►│
    │                                 │
    │                                 ▼
    │                             connect()
    │                               ├─ Find matching subnet
    │                               ├─ Create QP on that NIC
    │                               ├─ TCP handshake ────────►│
    │                                 │                       │
    │◄───────────────────────────────────────── QP info ──────┤
    │                                 │                       │
    ▼                                 ▼                       ▼
accept()                         Connect QP          [handshake thread]
  ├─ Get QP from queue          to peer's QP           ├─ Accept TCP
  └─ Return recv_comm                 │                ├─ Create new QP
                                      │                ├─ Connect QPs
                                      │                └─ Queue for accept()
                                      │
                                 ┌────┴────┐
                                 │ RDMA OK │
                                 └─────────┘
```
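The "QP info" exchanged over that TCP socket is, for RC queue pairs on RoCE v2, essentially the remote QP number, a starting packet sequence number, and the GID used to address the peer's port. A hypothetical record (field names are illustrative, not the plugin's actual wire format):

```c
#include <stdint.h>

/* Hypothetical bootstrap record sent over the handshake TCP socket before
 * each side transitions its QP through INIT -> RTR -> RTS. */
typedef struct {
    uint32_t qp_num;      /* ibv_qp->qp_num of the sender's QP           */
    uint32_t psn;         /* initial packet sequence number              */
    uint8_t  gid[16];     /* RoCE v2 GID of the sender's port (holds IP) */
    uint8_t  gid_index;   /* GID table entry, cf. NCCL_MESH_GID_INDEX    */
} mesh_qp_info_t;
```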
### Subnet Matching

```c
// For each peer address in handle
for (int i = 0; i < handle->num_addrs; i++) {
    uint32_t peer_ip = handle->addrs[i].ip;

    // Find local NIC on same subnet
    for (int j = 0; j < num_nics; j++) {
        if ((peer_ip & nic[j].netmask) == nic[j].subnet) {
            // Found matching NIC!
            selected_nic = &nic[j];
            break;
        }
    }
}
```
## ⚙️ Configuration

| Environment Variable | Default | Description |
|---|---|---|
| `NCCL_NET_PLUGIN` | - | Set to `mesh` to use this plugin |
| `NCCL_DEBUG` | `WARN` | Set to `INFO` for detailed logs |
| `NCCL_MESH_GID_INDEX` | `3` | RoCE GID index to use |
| `NCCL_MESH_DEBUG` | `0` | Enable plugin debug output |
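`NCCL_NET_PLUGIN` and `NCCL_DEBUG` are interpreted by NCCL itself; the `NCCL_MESH_*` variables are plugin-specific knobs. A sketch of how such knobs are typically read inside a plugin (illustrative helper, not the plugin's actual code):

```c
#include <stdlib.h>

/* Read an integer knob from the environment, falling back to a default. */
static int env_int(const char *name, int fallback) {
    const char *v = getenv(name);
    return v ? atoi(v) : fallback;
}

/* e.g. int gid_index = env_int("NCCL_MESH_GID_INDEX", 3);
 *      int debug     = env_int("NCCL_MESH_DEBUG", 0);     */
```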
## 🚧 Current Limitations
- Single QP per connection - No multi-rail aggregation yet
- No relay routing - Non-adjacent nodes can't communicate (fine for fully-connected mesh)
- RoCE v2 only - Ethernet-based RDMA, no native InfiniBand support
## 🗺️ Roadmap
- Multi-QP per connection for higher bandwidth
- Adaptive routing for partial mesh topologies
- Performance tuning (inline data, selective signaling)
- Support for non-unified-memory systems with explicit GPUDirect RDMA
## 🛠️ Hardware Tested
| Component | Specification |
|---|---|
| Nodes | 3x NVIDIA DGX Spark |
| CPU | NVIDIA Grace (ARM64) |
| GPU | NVIDIA Blackwell |
| Memory | Unified (NVLink-C2C) |
| NICs | ConnectX-7 (100GbE) |
| Cables | Direct-attach QSFP56 |
## 📄 License
MIT License - see LICENSE file.
## 🙏 Acknowledgments
Built to connect three DGX Spark workstations that NVIDIA never intended to cluster. Sometimes the best solutions come from ignoring "supported configurations."
"The future of distributed AI computing is here." — Mistral-7B, running distributed inference on this very plugin