# NCCL Mesh Plugin
Custom NCCL network plugin enabling distributed ML over direct-connect RDMA mesh topologies.
## 🎯 What This Does
This plugin enables NCCL (NVIDIA Collective Communications Library) to work with direct-connect mesh topologies where each node pair is on a different subnet. Standard NCCL plugins assume either:
- A switched InfiniBand fabric (all nodes on same subnet)
- TCP/IP networking (slow, high latency)
Neither works for direct-cabled RDMA meshes. This plugin does.
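For orientation, an NCCL network plugin is a shared library that hands NCCL a table of connection and transfer callbacks. The sketch below shows the general shape of those entry points; the real interface is the versioned `ncclNet_t` vtable from NCCL's ext-net headers, with more functions and different signatures, so treat this as illustrative only.

```c
/* Simplified sketch of the entry points a NCCL net plugin implements
 * (illustrative; the real ncclNet_t vtable is versioned and richer). */
#include <stddef.h>

typedef struct {
    const char *name;                                         /* e.g. "mesh" */
    int (*init)(void);                                        /* enumerate RDMA NICs once */
    int (*devices)(int *ndev);                                /* how many NICs we expose */
    int (*listen)(int dev, void *handle, void **listen_comm); /* create QPs, fill handle with ALL local IPs */
    int (*connect)(int dev, void *handle, void **send_comm);  /* pick the NIC on the peer's subnet, dial it */
    int (*accept)(void *listen_comm, void **recv_comm);       /* take a handshaked connection off the queue */
    int (*isend)(void *send_comm, void *data, size_t size, void **request);
    int (*irecv)(void *recv_comm, void *data, size_t size, void **request);
    int (*test)(void *request, int *done, size_t *size);      /* poll RDMA completions */
} mesh_net_vtable_t;
```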
## 🔧 The Problem We Solved

```
          ┌─────────────┐
          │   Spark-A   │
          │  (titanic)  │
          └──────┬──────┘
  192.168.101.x  │  192.168.100.x
    (100Gbps)    │    (100Gbps)
          ┌──────┴──────┐
          │             │
    ┌─────┴─────┐ ┌─────┴─────┐
    │  Spark-B  │ │  Spark-C  │
    │ (iceberg) │ │(carpathia)│
    └─────┬─────┘ └─────┬─────┘
          │             │
          └──────┬──────┘
           192.168.102.x
             (100Gbps)
```
Three NVIDIA DGX Spark workstations connected in a triangle mesh with direct 100Gbps RDMA cables. Each link is on a different subnet - a configuration NVIDIA never intended to support.
## 🚀 Results
| Metric | Value |
|---|---|
| Effective Bandwidth | 8+ GB/s |
| Line Rate Utilization | ~64% |
| Topology | 3-node triangle mesh |
| Link Speed | 100 Gbps per link |
Successfully ran distributed LLM inference (Mistral-7B) across all 3 nodes using NCCL over this custom topology.
## ⚡ Unified Memory Advantage
On Grace Hopper / DGX Spark systems, the GPU and CPU share the same physical memory via NVLink-C2C. This unified memory architecture means:
- No staging copies - RDMA operates directly on GPU-accessible memory
- GPUDirect-like performance - Without additional kernel modules or configuration
- Simplified memory management - Register once, use everywhere
The 8+ GB/s bandwidth is the real deal, not bottlenecked by GPU↔Host transfers.
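As a minimal illustration of the "register once" point, RDMA registration on such a system is a single `ibv_reg_mr()` call over the buffer NCCL hands to the plugin; because CPU and GPU share the same physical memory, no separate bounce buffer has to be registered or copied through. This is a hedged sketch, not the plugin's actual registration path; error handling and `ibv_dereg_mr()` are omitted.

```c
#include <infiniband/verbs.h>
#include <stddef.h>

/* Register a buffer for RDMA. On a unified-memory node (Grace + Blackwell
 * over NVLink-C2C) the same call covers GPU-visible allocations, so the NIC
 * can read and write them directly with no staging copy. */
struct ibv_mr *register_buffer(struct ibv_pd *pd, void *buf, size_t len) {
    int access = IBV_ACCESS_LOCAL_WRITE |   /* NIC may write locally (receives)  */
                 IBV_ACCESS_REMOTE_WRITE;   /* peer may RDMA-write into this MR  */
    return ibv_reg_mr(pd, buf, len, access);
}
```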
## 🏗️ Architecture

### Key Innovations

- **Multi-Address Handle Exchange**
  - Each node advertises ALL its subnet IPs in the NCCL handle (see the handle sketch after this list)
  - Connector searches for reachable addresses by subnet matching
- **Subnet-Aware NIC Selection**
  - `connect()` finds the local NIC on the same subnet as the peer
  - Automatic routing without IP forwarding or bridges
- **Background Handshake Thread**
  - Eliminates deadlock when both ranks call `connect()` simultaneously
  - TCP-based QP info exchange runs asynchronously
- **Bidirectional QP Exchange**
  - Each connection creates fresh Queue Pairs on both sides
  - No QP reuse across multiple NCCL channels
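To make the multi-address handle concrete, here is a hypothetical sketch of what the exchanged handle could carry. All names and sizes are illustrative assumptions, not the plugin's actual definitions.

```c
/* Hypothetical layout of the connection handle (illustrative only).
 * NCCL copies this opaque blob from the listening rank to the connecting
 * rank, which scans peer_addrs[] for an IP that falls on one of its own
 * local subnets. */
#include <stdint.h>

#define MESH_MAX_ADDRS 4          /* assumed cap; a 3-node triangle needs 2 per node */

typedef struct {
    uint32_t ip;                  /* IPv4 address, network byte order */
    uint32_t netmask;             /* e.g. 0xffffff00 for a /24 */
    uint16_t tcp_port;            /* port of the background handshake thread */
} mesh_addr_t;

typedef struct {
    uint32_t    num_addrs;                   /* how many NICs this node advertised */
    mesh_addr_t peer_addrs[MESH_MAX_ADDRS];  /* one entry per RDMA-capable NIC */
} mesh_handle_t;
```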
### RDMA Implementation
- Raw InfiniBand Verbs API (libibverbs)
- Reliable Connected (RC) Queue Pairs
- RoCE v2 over Ethernet
- Zero-copy on unified memory systems
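As a minimal sketch of what "RC Queue Pairs via libibverbs" means in practice, the snippet below creates one RC queue pair against an existing protection domain and completion queue. Queue depths are assumed values and error handling is omitted; this is not the plugin's actual code.

```c
#include <infiniband/verbs.h>

/* Create one Reliable Connected (RC) queue pair on an already-opened device.
 * pd and cq come from ibv_alloc_pd() / ibv_create_cq() on the selected NIC. */
struct ibv_qp *create_rc_qp(struct ibv_pd *pd, struct ibv_cq *cq) {
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,      /* reliable connected transport */
        .cap = {
            .max_send_wr  = 128,    /* outstanding work requests (assumed depth) */
            .max_recv_wr  = 128,
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };
    return ibv_create_qp(pd, &attr);   /* NULL on failure */
}
```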
## 📦 Installation

### Prerequisites

```bash
# Ubuntu/Debian
sudo apt-get install libibverbs-dev librdmacm-dev

# Verify RDMA devices
ibv_devices
```
### Build

```bash
git clone https://github.com/autoscriptlabs/nccl-mesh-plugin.git
cd nccl-mesh-plugin
make
```
### Use

```bash
export LD_LIBRARY_PATH=$(pwd):$LD_LIBRARY_PATH
export NCCL_NET_PLUGIN=mesh
export NCCL_DEBUG=INFO   # or WARN for less output

# Run your distributed job
python your_distributed_script.py
```
## 🧪 Testing

### Basic All-Reduce Test

```python
import torch
import torch.distributed as dist

dist.init_process_group('nccl', rank=RANK, world_size=3,
                        init_method='tcp://MASTER_IP:29500')

t = torch.ones(1000, device='cuda')
dist.all_reduce(t)
print(f'Result: {t[0]}')  # Should print 3.0

dist.destroy_process_group()
```
### Bandwidth Benchmark

```python
import torch
import torch.distributed as dist
import time

dist.init_process_group('nccl', rank=RANK, world_size=3,
                        init_method='tcp://MASTER_IP:29500')

t = torch.ones(1024*1024*64, device='cuda')  # 256MB

# Warmup
for _ in range(5):
    dist.all_reduce(t)
torch.cuda.synchronize()

# Benchmark
start = time.time()
for _ in range(20):
    dist.all_reduce(t)
torch.cuda.synchronize()
elapsed = time.time() - start

print(f'Bandwidth: {(256*20/1024)/elapsed:.2f} GB/s')
```
## 🔬 How It Works

### Connection Flow

```
Rank 0 (listen)               Rank 1 (connect)
    │                                 │
    ▼                                 │
listen()                              │
  ├─ Create QPs on ALL NICs           │
  ├─ Start handshake thread           │
  ├─ Return handle with all IPs       │
    │                                 │
    │◄─────── handle exchange ───────►│
    │                                 │
    │                                 ▼
    │                             connect()
    │                               ├─ Find matching subnet
    │                               ├─ Create QP on that NIC
    │                               ├─ TCP handshake ────────►│
    │                                 │                       │
    │◄───────────────────────────────────────── QP info ──────┤
    │                                 │                       │
    ▼                                 ▼                       ▼
accept()                         Connect QP          [handshake thread]
  ├─ Get QP from queue          to peer's QP           ├─ Accept TCP
  └─ Return recv_comm                 │                ├─ Create new QP
                                      │                ├─ Connect QPs
                                      │                └─ Queue for accept()
                                      │
                                 ┌────┴────┐
                                 │ RDMA OK │
                                 └─────────┘
```
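The "QP info" exchanged over that TCP socket is, for RC queue pairs on RoCE v2, essentially the remote QP number, a starting packet sequence number, and the GID used to address the peer's port. A hypothetical record (field names are illustrative, not the plugin's actual wire format):

```c
#include <stdint.h>

/* Hypothetical bootstrap record sent over the handshake TCP socket before
 * each side transitions its QP through INIT -> RTR -> RTS. */
typedef struct {
    uint32_t qp_num;      /* ibv_qp->qp_num of the sender's QP           */
    uint32_t psn;         /* initial packet sequence number              */
    uint8_t  gid[16];     /* RoCE v2 GID of the sender's port (holds IP) */
    uint8_t  gid_index;   /* GID table entry, cf. NCCL_MESH_GID_INDEX    */
} mesh_qp_info_t;
```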
### Subnet Matching

```c
// For each peer address in handle
for (int i = 0; i < handle->num_addrs; i++) {
    uint32_t peer_ip = handle->addrs[i].ip;

    // Find local NIC on same subnet
    for (int j = 0; j < num_nics; j++) {
        if ((peer_ip & nic[j].netmask) == nic[j].subnet) {
            // Found matching NIC!
            selected_nic = &nic[j];
            break;
        }
    }
}
```
## ⚙️ Configuration

| Environment Variable | Default | Description |
|---|---|---|
| `NCCL_NET_PLUGIN` | - | Set to `mesh` to use this plugin |
| `NCCL_DEBUG` | `WARN` | Set to `INFO` for detailed logs |
| `NCCL_MESH_GID_INDEX` | `3` | RoCE GID index to use |
| `NCCL_MESH_DEBUG` | `0` | Enable plugin debug output |
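`NCCL_NET_PLUGIN` and `NCCL_DEBUG` are interpreted by NCCL itself; the `NCCL_MESH_*` variables are plugin-specific knobs. A sketch of how such knobs are typically read inside a plugin (illustrative helper, not the plugin's actual code):

```c
#include <stdlib.h>

/* Read an integer knob from the environment, falling back to a default. */
static int env_int(const char *name, int fallback) {
    const char *v = getenv(name);
    return v ? atoi(v) : fallback;
}

/* e.g. int gid_index = env_int("NCCL_MESH_GID_INDEX", 3);
 *      int debug     = env_int("NCCL_MESH_DEBUG", 0);     */
```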
## 🚧 Current Limitations
- Single QP per connection - No multi-rail aggregation yet
- No relay routing - Non-adjacent nodes can't communicate (fine for fully-connected mesh)
- RoCE v2 only - Ethernet-based RDMA, no native InfiniBand support
## 🗺️ Roadmap
- Multi-QP per connection for higher bandwidth
- Adaptive routing for partial mesh topologies
- Performance tuning (inline data, selective signaling)
- Support for non-unified-memory systems with explicit GPUDirect RDMA
## 🛠️ Hardware Tested
| Component | Specification |
|---|---|
| Nodes | 3x NVIDIA DGX Spark |
| CPU | NVIDIA Grace (ARM64) |
| GPU | NVIDIA Blackwell |
| Memory | Unified (NVLink-C2C) |
| NICs | ConnectX-7 (100GbE) |
| Cables | Direct-attach QSFP56 |
## 📄 License
MIT License - see LICENSE file.
## 🙏 Acknowledgments
Built to connect three DGX Spark workstations that NVIDIA never intended to cluster. Sometimes the best solutions come from ignoring "supported configurations."
"The future of distributed AI computing is here." — Mistral-7B, running distributed inference on this very plugin