# NCCL Mesh Plugin
**Custom NCCL network plugin enabling distributed ML over direct-connect RDMA mesh topologies.**
[License: MIT](https://opensource.org/licenses/MIT)
## 🎯 What This Does
This plugin enables NCCL (NVIDIA Collective Communications Library) to work with **direct-connect mesh topologies** where each node pair is on a different subnet. Standard NCCL plugins assume either:
- A switched InfiniBand fabric (all nodes on same subnet)
- TCP/IP networking (slow, high latency)
Neither works for direct-cabled RDMA meshes. This plugin does.
## 🔧 The Problem We Solved
```
        ┌─────────────┐
        │   Spark-A   │
        │  (titanic)  │
        └──────┬──────┘
192.168.101.x  │  192.168.100.x
   (100Gbps)   │   (100Gbps)
        ┌──────┴──────┐
        │             │
  ┌─────┴─────┐ ┌─────┴─────┐
  │  Spark-B  │ │  Spark-C  │
  │ (iceberg) │ │(carpathia)│
  └─────┬─────┘ └─────┬─────┘
        │             │
        └──────┬──────┘
         192.168.102.x
           (100Gbps)
```
**Three NVIDIA DGX Spark workstations** connected in a triangle mesh with direct 100Gbps RDMA cables. Each link is on a **different subnet** - a configuration NVIDIA never intended to support.
## 🚀 Results
| Metric | Value |
|--------|-------|
| Effective Bandwidth | **8+ GB/s** |
| Line Rate Utilization | ~64% |
| Topology | 3-node triangle mesh |
| Link Speed | 100 Gbps per link |
Successfully ran **distributed LLM inference** (Mistral-7B) across all 3 nodes using NCCL over this custom topology.
## ⚡ Unified Memory Advantage
On **Grace-based systems** such as Grace Hopper and the DGX Spark, the GPU and CPU share the same physical memory via NVLink-C2C. This unified memory architecture means:
- **No staging copies** - RDMA operates directly on GPU-accessible memory
- **GPUDirect-like performance** - Without additional kernel modules or configuration
- **Simplified memory management** - Register once, use everywhere
The 8+ GB/s bandwidth is the real deal, not bottlenecked by GPU↔Host transfers.
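
As a concrete illustration, a single verbs registration is all it takes. The sketch below is illustrative rather than the plugin's actual code, and assumes an already-created protection domain `pd`:

```c
/* Illustrative sketch: on a unified-memory system, one ibv_reg_mr() on an
 * ordinary allocation is enough for zero-copy RDMA, because the same
 * physical pages are GPU-accessible over NVLink-C2C. Error handling
 * trimmed for brevity. */
#include <infiniband/verbs.h>
#include <stdlib.h>

struct ibv_mr *register_unified_buffer(struct ibv_pd *pd, size_t len) {
    void *buf = malloc(len);   /* visible to both CPU and GPU here */
    if (!buf)
        return NULL;
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_REMOTE_READ);
}
```

On a discrete-GPU system, the same flow would instead require GPUDirect RDMA (e.g. the nvidia-peermem module) or an explicit host staging buffer.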
## 🏗️ Architecture
### Key Innovations
1. **Multi-Address Handle Exchange**
   - Each node advertises ALL its subnet IPs in the NCCL handle (see the sketch after this list)
   - Connector searches for reachable addresses by subnet matching

2. **Subnet-Aware NIC Selection**
   - `connect()` finds the local NIC on the same subnet as the peer
   - Automatic routing without IP forwarding or bridges

3. **Background Handshake Thread**
   - Eliminates deadlock when both ranks call `connect()` simultaneously
   - TCP-based QP info exchange runs asynchronously

4. **Bidirectional QP Exchange**
   - Each connection creates fresh Queue Pairs on both sides
   - No QP reuse across multiple NCCL channels
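
A hypothetical sketch of what such a multi-address handle might look like (names and sizes here are illustrative assumptions, not the plugin's real definitions):

```c
/* Hypothetical handle layout: every local subnet IP is advertised so the
 * connecting side can pick whichever address its own NICs can reach.
 * Field names and sizes are illustrative, not the plugin's actual code. */
#include <stdint.h>

#define MESH_MAX_ADDRS 8

struct mesh_addr {
    uint32_t ip;        /* IPv4 address (network byte order) */
    uint32_t netmask;   /* used for subnet-reachability matching */
    uint16_t tcp_port;  /* handshake listener port on that NIC */
};

struct mesh_handle {
    int num_addrs;                          /* NICs this node listens on */
    struct mesh_addr addrs[MESH_MAX_ADDRS]; /* one entry per subnet */
};
```

Whatever the real layout, it has to fit in NCCL's fixed-size connection handle (`NCCL_NET_HANDLE_MAXSIZE`), which bounds how many addresses a node can advertise.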
### RDMA Implementation
- Raw InfiniBand Verbs API (libibverbs)
- Reliable Connected (RC) Queue Pairs (bring-up sketched below)
- RoCE v2 over Ethernet
- Zero-copy on unified memory systems
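
For reference, creating such an RC queue pair with raw verbs looks roughly like the sketch below (queue depths are illustrative; error handling and the INIT → RTR → RTS transitions are elided):

```c
/* Minimal RC queue-pair creation over libibverbs. A real implementation
 * must then move the QP through INIT -> RTR -> RTS with ibv_modify_qp()
 * once the peer's QP number, PSN, and GID arrive over the TCP handshake. */
#include <infiniband/verbs.h>
#include <string.h>

struct ibv_qp *create_rc_qp(struct ibv_pd *pd, struct ibv_cq *cq) {
    struct ibv_qp_init_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.send_cq = cq;                 /* completions for sends */
    attr.recv_cq = cq;                 /* completions for receives */
    attr.qp_type = IBV_QPT_RC;         /* Reliable Connected */
    attr.cap.max_send_wr  = 128;       /* illustrative queue depths */
    attr.cap.max_recv_wr  = 128;
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;
    return ibv_create_qp(pd, &attr);
}
```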
## 📦 Installation
### Prerequisites
```bash
# Ubuntu/Debian
sudo apt-get install libibverbs-dev librdmacm-dev

# Verify RDMA devices
ibv_devices
```
### Build
```bash
git clone https://github.com/autoscriptlabs/nccl-mesh-plugin.git
cd nccl-mesh-plugin
make
```
### Use
```bash
export LD_LIBRARY_PATH=$(pwd):$LD_LIBRARY_PATH
export NCCL_NET_PLUGIN=mesh
export NCCL_DEBUG=INFO   # or WARN for less output

# Run your distributed job
python your_distributed_script.py
```
## 🧪 Testing
### Basic All-Reduce Test
```python
import os

import torch
import torch.distributed as dist

# Rank and master address come from the environment here (torchrun sets
# RANK and MASTER_ADDR); adjust to however you launch your jobs.
RANK = int(os.environ['RANK'])
MASTER_IP = os.environ['MASTER_ADDR']

dist.init_process_group('nccl', rank=RANK, world_size=3,
                        init_method=f'tcp://{MASTER_IP}:29500')

t = torch.ones(1000, device='cuda')
dist.all_reduce(t)
print(f'Result: {t[0]}')  # Should print 3.0

dist.destroy_process_group()
```
### Bandwidth Benchmark
```python
import os
import time

import torch
import torch.distributed as dist

# Rank and master address come from the environment here (torchrun sets
# RANK and MASTER_ADDR); adjust to however you launch your jobs.
RANK = int(os.environ['RANK'])
MASTER_IP = os.environ['MASTER_ADDR']

dist.init_process_group('nccl', rank=RANK, world_size=3,
                        init_method=f'tcp://{MASTER_IP}:29500')

t = torch.ones(1024 * 1024 * 64, device='cuda')  # 64M floats = 256 MB

# Warmup
for _ in range(5):
    dist.all_reduce(t)
torch.cuda.synchronize()

# Benchmark
start = time.time()
for _ in range(20):
    dist.all_reduce(t)
torch.cuda.synchronize()
elapsed = time.time() - start

# Naive payload throughput (20 x 256 MB), not NCCL bus bandwidth
print(f'Bandwidth: {(256 * 20 / 1024) / elapsed:.2f} GB/s')

dist.destroy_process_group()
```
## 🔬 How It Works
### Connection Flow
```
Rank 0 (listen)               Rank 1 (connect)
   │                                 │
   ▼                                 │
 listen()                            │
 ├─ Create QPs on ALL NICs           │
 ├─ Start handshake thread           │
 ├─ Return handle with all IPs       │
   │                                 │
   │◄─────── handle exchange ───────►│
   │                                 │
   │                                 ▼
   │                              connect()
   │                              ├─ Find matching subnet
   │                              ├─ Create QP on that NIC
   │                              ├─ TCP handshake ──────────►│
   │                                 │                        │
   │◄──────────────────────────────────── QP info ───────────┤
   │                                 │                        │
   ▼                                 ▼                        ▼
 accept()                        Connect QP          [handshake thread]
 ├─ Get QP from queue            to peer's QP        ├─ Accept TCP
 └─ Return recv_comm                 │               ├─ Create new QP
                                     │               ├─ Connect QPs
                                     │               └─ Queue for accept()
                                     │
                                ┌────┴────┐
                                │ RDMA OK │
                                └─────────┘
```
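
The QP info traded over that TCP handshake is small. A hypothetical payload might look like the struct below (field names are assumptions for illustration, not the plugin's actual wire format):

```c
/* Hypothetical handshake payload for RoCE v2 RC setup: each side needs
 * the other's QP number, starting PSN, and GID before it can move its
 * own QP to RTR/RTS. Not the plugin's actual wire format. */
#include <stdint.h>

struct qp_exchange {
    uint32_t qp_num;   /* remote QP number */
    uint32_t psn;      /* initial packet sequence number */
    uint8_t  gid[16];  /* sender's RoCE v2 GID (embeds its IP) */
};
```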
### Subnet Matching
```c
// For each peer address in the handle, look for a local NIC on the
// same subnet; stop at the first match.
struct mesh_nic *selected_nic = NULL;

for (int i = 0; i < handle->num_addrs && !selected_nic; i++) {
    uint32_t peer_ip = handle->addrs[i].ip;

    // Find a local NIC on the same subnet as this peer address
    for (int j = 0; j < num_nics; j++) {
        if ((peer_ip & nic[j].netmask) == nic[j].subnet) {
            // Found matching NIC!
            selected_nic = &nic[j];
            break;
        }
    }
}
```
## ⚙️ Configuration
| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `NCCL_NET_PLUGIN` | - | Set to `mesh` to use this plugin |
| `NCCL_DEBUG` | `WARN` | Set to `INFO` for detailed logs |
| `NCCL_MESH_GID_INDEX` | `3` | RoCE GID index to use |
| `NCCL_MESH_DEBUG` | `0` | Enable plugin debug output |
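
A small sketch of how a plugin can read tunables like these (plain standard C; the helper name is illustrative):

```c
/* Read an integer tunable from the environment, with a default.
 * getenv()/atoi() are standard C; the helper name is made up. */
#include <stdlib.h>

static int env_int(const char *name, int fallback) {
    const char *val = getenv(name);
    return val ? atoi(val) : fallback;
}

/* e.g.: int gid_index = env_int("NCCL_MESH_GID_INDEX", 3); */
```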
## 🚧 Current Limitations
- **Single QP per connection** - No multi-rail aggregation yet
- **No relay routing** - Non-adjacent nodes can't communicate (fine for a fully-connected mesh)
- **RoCE v2 only** - Ethernet-based RDMA, no native InfiniBand support
## 🗺️ Roadmap
- [ ] Multi-QP per connection for higher bandwidth
- [ ] Adaptive routing for partial mesh topologies
- [ ] Performance tuning (inline data, selective signaling)
- [ ] Support for non-unified-memory systems with explicit GPUDirect RDMA
## 🛠️ Hardware Tested
| Component | Specification |
|-----------|--------------|
| Nodes | 3x NVIDIA DGX Spark |
| CPU | NVIDIA Grace (ARM64) |
| GPU | NVIDIA Blackwell |
| Memory | Unified (NVLink-C2C) |
| NICs | ConnectX-7 (100GbE) |
| Cables | Direct-attach QSFP56 |
## 📚 References
- [NCCL Documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/)
- [RDMA Aware Networks Programming User Manual](https://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf)
- [InfiniBand Verbs API](https://github.com/linux-rdma/rdma-core)
## 📄 License
MIT License - see [LICENSE](LICENSE) file.
## 🙏 Acknowledgments
Built to connect three DGX Spark workstations that NVIDIA never intended to cluster. Sometimes the best solutions come from ignoring "supported configurations."
---
*"The future of distributed AI computing is here."* — Mistral-7B, running distributed inference on this very plugin