# NCCL Mesh Plugin
Custom NCCL network plugin enabling distributed ML over direct-connect RDMA mesh topologies.
## 🎯 What This Does
This plugin enables NCCL (NVIDIA Collective Communications Library) to work with direct-connect mesh topologies where each node pair is on a different subnet. Standard NCCL plugins assume either:
- A switched InfiniBand fabric (all nodes on same subnet)
- TCP/IP networking (slow, high latency)
Neither works for direct-cabled RDMA meshes. This plugin does.
## 🔧 The Problem We Solved

```
              ┌─────────────┐
              │   Spark-A   │
              │  (titanic)  │
              └──────┬──────┘
192.168.101.x        │        192.168.100.x
  (100Gbps)          │          (100Gbps)
              ┌──────┴──────┐
              │             │
        ┌─────┴─────┐ ┌─────┴─────┐
        │  Spark-B  │ │  Spark-C  │
        │ (iceberg) │ │(carpathia)│
        └─────┬─────┘ └─────┬─────┘
              │             │
              └──────┬──────┘
               192.168.102.x
                 (100Gbps)
```
Three DGX Spark workstations connected in a triangle mesh with direct 100Gbps RDMA cables. Each link is on a different subnet - a configuration NVIDIA never intended to support.
## 🚀 Results
| Metric | Value |
|---|---|
| Effective Bandwidth | 8+ GB/s |
| Line Rate Utilization | ~64% |
| Topology | 3-node triangle mesh |
| Link Speed | 100 Gbps per link |
Successfully ran distributed LLM inference (Mistral-7B) across all 3 nodes using NCCL over this custom topology.
## 🏗️ Architecture

### Key Innovations

1. **Multi-Address Handle Exchange**
   - Each node advertises ALL of its subnet IPs in the NCCL handle (see the sketch after this list)
   - The connector searches the handle for a reachable address by subnet matching

2. **Subnet-Aware NIC Selection**
   - `connect()` finds the local NIC on the same subnet as the peer
   - Automatic routing without IP forwarding or bridges

3. **Background Handshake Thread**
   - Eliminates deadlock when both ranks call `connect()` simultaneously
   - TCP-based QP info exchange runs asynchronously

4. **Bidirectional QP Exchange**
   - Each connection creates fresh Queue Pairs on both sides
   - No QP reuse across multiple NCCL channels
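Here is a minimal sketch of what such a multi-address handle could look like. The struct and field names (`mesh_handle`, `mesh_addr`, `MESH_MAX_ADDRS`) are illustrative assumptions rather than the plugin's actual definitions; the only hard requirement from NCCL is that the handle fits in the fixed-size opaque buffer NCCL ships to the peer out-of-band.

```c
/* Illustrative handle layout (names are hypothetical, not the plugin's own).
 * NCCL treats the handle as an opaque, fixed-size blob exchanged out-of-band,
 * so everything the connecting side needs must be packed in here. */
#include <stdint.h>

#define MESH_MAX_ADDRS 4          /* plenty for a 3-node triangle mesh */

struct mesh_addr {
    uint32_t ip;                  /* IPv4 address of one local RDMA NIC */
    uint32_t netmask;             /* netmask of that NIC's subnet */
    uint16_t port;                /* TCP port of the handshake listener */
    uint16_t pad;
};

struct mesh_handle {
    uint32_t num_addrs;                      /* how many NICs this rank advertises */
    struct mesh_addr addrs[MESH_MAX_ADDRS];  /* one entry per local subnet */
};
```

Because every rank advertises every local subnet, the connecting side can pick whichever entry shares a subnet with one of its own NICs; the matching loop is shown under Subnet Matching below.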
### RDMA Implementation

- Raw InfiniBand Verbs API (libibverbs)
- Reliable Connected (RC) Queue Pairs
- RoCE v2 over Ethernet
- Host memory staging (GPU→Host→RDMA→Host→GPU); see the send-side sketch below
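To make the staging path concrete, here is a minimal sketch of the send side under stated assumptions: the pinned host buffer, its registered memory region, and the RC queue pair already exist, and the function name `stage_and_send` is hypothetical rather than a symbol from the plugin.

```c
/* Hypothetical host-staging send: copy GPU data into a pinned host buffer,
 * then post an RDMA send from that buffer. Setup of qp/mr/host_buf is assumed. */
#include <stdint.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

int stage_and_send(struct ibv_qp *qp, struct ibv_mr *mr,
                   void *host_buf, const void *gpu_buf, size_t len)
{
    /* GPU -> Host: stage the payload into the pre-registered, pinned host buffer */
    if (cudaMemcpy(host_buf, gpu_buf, len, cudaMemcpyDeviceToHost) != cudaSuccess)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)host_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = (uintptr_t)host_buf,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,   /* ask for a completion we can poll */
    };
    struct ibv_send_wr *bad_wr = NULL;

    /* Host -> wire: the receiver does the mirror-image copy back into its GPU */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

The receive side mirrors this: poll the completion queue, then copy from the host buffer back into GPU memory, which is exactly the extra hop that GPUDirect RDMA on the roadmap would remove.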
## 📦 Installation

### Prerequisites

```bash
# Ubuntu/Debian
sudo apt-get install libibverbs-dev librdmacm-dev

# Verify RDMA devices
ibv_devices
```
### Build

```bash
git clone https://github.com/yourusername/nccl-mesh-plugin.git
cd nccl-mesh-plugin
make
```
### Use

```bash
export LD_LIBRARY_PATH=$(pwd):$LD_LIBRARY_PATH
export NCCL_NET_PLUGIN=mesh
export NCCL_DEBUG=INFO   # or WARN for less output

# Run your distributed job
python your_distributed_script.py
```
## 🧪 Testing

### Basic All-Reduce Test

```python
import torch
import torch.distributed as dist

# RANK is this node's rank (0, 1, or 2); MASTER_IP is the address of rank 0
dist.init_process_group('nccl', rank=RANK, world_size=3,
                        init_method='tcp://MASTER_IP:29500')

t = torch.ones(1000, device='cuda')
dist.all_reduce(t)
print(f'Result: {t[0]}')  # Should print 3.0

dist.destroy_process_group()
```
### Bandwidth Benchmark

```python
import time

import torch
import torch.distributed as dist

# RANK is this node's rank (0, 1, or 2); MASTER_IP is the address of rank 0
dist.init_process_group('nccl', rank=RANK, world_size=3,
                        init_method='tcp://MASTER_IP:29500')

t = torch.ones(1024 * 1024 * 64, device='cuda')  # 64M float32 values = 256 MB

# Warmup
for _ in range(5):
    dist.all_reduce(t)
torch.cuda.synchronize()

# Benchmark
start = time.time()
for _ in range(20):
    dist.all_reduce(t)
torch.cuda.synchronize()
elapsed = time.time() - start

# 20 iterations x 256 MB of payload, reported in GB/s
print(f'Bandwidth: {(256 * 20 / 1024) / elapsed:.2f} GB/s')
```
## 🔬 How It Works

### Connection Flow

```
Rank 0 (listen)                       Rank 1 (connect)
      │                                     │
      ▼                                     │
  listen()                                  │
  ├─ Create QPs on ALL NICs                 │
  ├─ Start handshake thread                 │
  ├─ Return handle with all IPs             │
      │                                     │
      │◄────────── handle exchange ────────►│
      │                                     │
      │                                     ▼
      │                                 connect()
      │                                 ├─ Find matching subnet
      │                                 ├─ Create QP on that NIC
      │                                 ├─ TCP handshake ──────────────►│
      │                                 │                               │
      │◄─────────────────────────────────────────────────── QP info ───┤
      │                                 │                               │
      ▼                                 ▼                               ▼
  accept()                          Connect QP                [handshake thread]
  ├─ Get QP from queue              to peer's QP                ├─ Accept TCP
  └─ Return recv_comm                   │                       ├─ Create new QP
                                        │                       ├─ Connect QPs
                                        │                       └─ Queue for accept()
                                        │
                                   ┌────┴────┐
                                   │ RDMA OK │
                                   └─────────┘
```
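The `[handshake thread]` branch above can be sketched roughly as follows. The listening TCP socket is assumed to already exist, and the names (`handshake_loop`, `qp_info`, `create_and_connect_qp`, `queue_for_accept`) are hypothetical stand-ins for the plugin's real symbols; error handling is abbreviated.

```c
/* Hypothetical handshake thread: accept TCP connections from connecting peers,
 * swap QP parameters, bring up a fresh QP, and queue it for accept().
 * Launched from listen(), e.g. pthread_create(&tid, NULL, handshake_loop, &fd). */
#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

struct qp_info {              /* parameters exchanged over TCP */
    uint32_t qp_num;          /* remote QP number */
    uint32_t psn;             /* initial packet sequence number */
    uint8_t  gid[16];         /* RoCE v2 GID of the remote port */
};

/* Assumed helpers provided elsewhere in the plugin (hypothetical): */
int  create_and_connect_qp(const struct qp_info *remote, struct qp_info *local_out);
void queue_for_accept(int conn_fd);

static void *handshake_loop(void *arg)
{
    int listen_fd = *(int *)arg;

    for (;;) {
        int fd = accept(listen_fd, NULL, NULL);      /* peer called connect() */
        if (fd < 0)
            continue;

        struct qp_info remote, local;
        if (recv(fd, &remote, sizeof remote, MSG_WAITALL) != sizeof remote) {
            close(fd);
            continue;
        }

        /* Create a fresh RC QP and transition it to RTS against the peer's info */
        if (create_and_connect_qp(&remote, &local) != 0) {
            close(fd);
            continue;
        }

        /* Send our QP parameters back so the peer can finish its side */
        send(fd, &local, sizeof local, 0);

        /* Hand the ready connection to accept() running on the main path */
        queue_for_accept(fd);
    }
    return NULL;
}
```

Because this loop runs on its own thread, a rank's `accept()` path never blocks waiting for the peer's `connect()` to finish its TCP exchange, which is what removes the deadlock when both ranks connect to each other simultaneously.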
### Subnet Matching

```c
// selected_nic starts out NULL; stop as soon as a reachable peer address is found.
// For each peer address in the handle...
for (int i = 0; i < handle->num_addrs && selected_nic == NULL; i++) {
    uint32_t peer_ip = handle->addrs[i].ip;

    // ...look for a local NIC on the same subnet.
    // Example: peer 192.168.100.2 & mask 255.255.255.0 == 192.168.100.0,
    // which matches the local NIC configured on 192.168.100.x.
    for (int j = 0; j < num_nics; j++) {
        if ((peer_ip & nic[j].netmask) == nic[j].subnet) {
            selected_nic = &nic[j];   // found a matching NIC
            break;
        }
    }
}
```
## ⚙️ Configuration

| Environment Variable | Default | Description |
|---|---|---|
| `NCCL_NET_PLUGIN` | - | Set to `mesh` to use this plugin |
| `NCCL_DEBUG` | `WARN` | Set to `INFO` for detailed logs |
| `NCCL_MESH_GID_INDEX` | `3` | RoCE GID index to use (see the sketch below) |
| `NCCL_MESH_DEBUG` | `0` | Enable plugin debug output |
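As an illustration of where `NCCL_MESH_GID_INDEX` matters, the sketch below reads the variable (defaulting to 3, as in the table) and applies it to the address handle used when bringing a QP up for RoCE v2. The helper names are hypothetical, not the plugin's actual code.

```c
/* Hypothetical: read NCCL_MESH_GID_INDEX and use it as the source GID index
 * in the address handle attributes for an RC QP over RoCE v2. */
#include <infiniband/verbs.h>
#include <stdlib.h>
#include <string.h>

static int mesh_gid_index(void)
{
    const char *s = getenv("NCCL_MESH_GID_INDEX");
    return s ? atoi(s) : 3;            /* default from the table above */
}

static void fill_ah_attr(struct ibv_ah_attr *ah, const union ibv_gid *remote_gid,
                         uint8_t port_num)
{
    memset(ah, 0, sizeof *ah);
    ah->is_global      = 1;            /* RoCE always uses the GRH */
    ah->grh.dgid       = *remote_gid;  /* peer's GID, learned in the handshake */
    ah->grh.sgid_index = mesh_gid_index();
    ah->grh.hop_limit  = 64;
    ah->port_num       = port_num;
}
```

In a real RTR transition this attribute is placed in `struct ibv_qp_attr.ah_attr` before calling `ibv_modify_qp`; on RoCE v2 the GID index selects which local IP/GID the NIC stamps into the packet headers.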
## 🚧 Limitations
- Host memory staging: GPU memory goes through host (no GPUDirect RDMA yet)
- Single QP per connection: No multi-rail aggregation
- No relay routing: Non-adjacent nodes can't communicate (fine for fully-connected mesh)
- RoCE v2 only: No InfiniBand support (Ethernet only)
## 🗺️ Roadmap
- GPUDirect RDMA support (bypass host memory)
- Multi-QP per connection for higher bandwidth
- Adaptive routing for partial meshes
- Performance tuning (inline data, signaling)
## 📄 License
MIT License - see LICENSE file.
## 🙏 Acknowledgments
Built to connect three DGX Spark workstations that NVIDIA never intended to be clustered. Sometimes the best solutions come from ignoring "supported configurations."
"The future of distributed AI computing is here." - Mistral-7B, running on this very plugin