# NCCL Mesh Plugin
**Custom NCCL network plugin enabling distributed ML over direct-connect RDMA mesh topologies.**
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
## 🎯 What This Does
This plugin enables NCCL (NVIDIA Collective Communications Library) to work with **direct-connect mesh topologies** where each node pair is on a different subnet. Standard NCCL plugins assume either:
- A switched InfiniBand fabric (all nodes on the same subnet)
- TCP/IP networking (slow, high latency)
Neither works for direct-cabled RDMA meshes. This plugin does.
## 🔧 The Problem We Solved
```
         ┌─────────────┐
         │   Spark-A   │
         │  (titanic)  │
         └──────┬──────┘
192.168.101.x   │ 192.168.100.x
  (100Gbps)     │   (100Gbps)
         ┌──────┴──────┐
         │             │
   ┌─────┴─────┐ ┌─────┴─────┐
   │  Spark-B  │ │  Spark-C  │
   │ (iceberg) │ │(carpathia)│
   └─────┬─────┘ └─────┬─────┘
         │             │
         └──────┬──────┘
           192.168.102.x
             (100Gbps)
```
**Three NVIDIA DGX Spark workstations** connected in a triangle mesh with direct 100Gbps RDMA cables. Each link is on a **different subnet** - a configuration NVIDIA never intended to support.
## 🚀 Results
| Metric | Value |
|--------|-------|
| Effective Bandwidth | **8+ GB/s** |
| Line Rate Utilization | ~64% |
| Topology | 3-node triangle mesh |
| Link Speed | 100 Gbps per link |
Successfully ran **distributed LLM inference** (Mistral-7B) across all 3 nodes using NCCL over this custom topology.
## ⚡ Unified Memory Advantage
On **Grace Hopper / DGX Spark** systems, the GPU and CPU share the same physical memory via NVLink-C2C. This unified memory architecture means:
- **No staging copies** - RDMA operates directly on GPU-accessible memory
- **GPUDirect-like performance** - Without additional kernel modules or configuration
- **Simplified memory management** - Register once, use everywhere
The 8+ GB/s bandwidth is measured end to end; it is not bottlenecked by GPU↔host staging transfers (see the sketch below).
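A minimal sketch of what "register once, use everywhere" can look like with libibverbs on a unified-memory node. The helper name and the use of `cudaMallocManaged` are illustrative assumptions, not the plugin's actual registration path, and error handling is omitted:
```c
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

// Illustrative only: on a unified-memory system the same buffer the GPU
// computes into can be registered for RDMA, so no host staging copy is
// needed.  Error handling is omitted for brevity.
static struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t bytes,
                                          void **buf_out) {
    void *buf = NULL;
    cudaMallocManaged(&buf, bytes, cudaMemAttachGlobal);  // GPU- and CPU-visible

    // Register the unified-memory buffer with the RDMA NIC.  The returned
    // memory region (lkey/rkey) is what gets posted in RDMA work requests.
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, bytes,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    *buf_out = buf;
    return mr;
}
```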
## 🏗️ Architecture
### Key Innovations
1. **Multi-Address Handle Exchange** (see the handle sketch after this list)
   - Each node advertises ALL its subnet IPs in the NCCL handle
   - Connector searches for reachable addresses by subnet matching
2. **Subnet-Aware NIC Selection**
   - `connect()` finds the local NIC on the same subnet as the peer
   - Automatic routing without IP forwarding or bridges
3. **Background Handshake Thread**
   - Eliminates deadlock when both ranks call `connect()` simultaneously
   - TCP-based QP info exchange runs asynchronously
4. **Bidirectional QP Exchange**
   - Each connection creates fresh Queue Pairs on both sides
   - No QP reuse across multiple NCCL channels
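A rough sketch of what the multi-address handle could look like. The `num_addrs` and `addrs[i].ip` fields also appear in the subnet-matching snippet further down; the struct names, the `tcp_port` field, and `MESH_MAX_NICS` are illustrative assumptions rather than the plugin's real layout:
```c
#include <stdint.h>

#define MESH_MAX_NICS 4   /* hypothetical cap on RDMA NICs per node */

/* One advertised endpoint: the IP identifies which subnet this NIC sits on,
 * the port is where the background handshake thread accepts the TCP QP-info
 * exchange for connections arriving via this NIC. */
struct mesh_addr {
    uint32_t ip;        /* IPv4 address of a local RDMA NIC */
    uint16_t tcp_port;  /* handshake listener port */
};

/* Opaque handle returned by listen() and carried to the peer by NCCL's
 * bootstrap.  It advertises ALL local addresses so the connecting side can
 * pick whichever one shares a subnet with one of its own NICs. */
struct mesh_handle {
    int              num_addrs;
    struct mesh_addr addrs[MESH_MAX_NICS];
};
```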
### RDMA Implementation
- Raw InfiniBand Verbs API (libibverbs)
- Reliable Connected (RC) Queue Pairs (see the sketch after this list)
- RoCE v2 over Ethernet
- Zero-copy on unified memory systems
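For context, creating one of those RC Queue Pairs with raw libibverbs looks roughly like this. A minimal sketch: the helper name and queue depths are illustrative, and device/port selection, completion handling, and error checks are omitted:
```c
#include <infiniband/verbs.h>

/* Create an RC (Reliable Connected) queue pair on an existing protection
 * domain.  The QP still has to be moved INIT -> RTR -> RTS with the peer's
 * QP number, PSN, and GID before traffic can flow; that is exactly what the
 * TCP handshake described below exchanges. */
static struct ibv_qp *create_rc_qp(struct ibv_pd *pd, struct ibv_cq *cq) {
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,          /* reliable connected transport */
        .cap = {
            .max_send_wr  = 256,        /* illustrative queue depths */
            .max_recv_wr  = 256,
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };
    return ibv_create_qp(pd, &attr);
}
```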
## 📦 Installation
### Prerequisites
```bash
# Ubuntu/Debian
sudo apt-get install libibverbs-dev librdmacm-dev
# Verify RDMA devices
ibv_devices
```
### Build
```bash
git clone https://github.com/autoscriptlabs/nccl-mesh-plugin.git
cd nccl-mesh-plugin
make
```
### Use
```bash
export LD_LIBRARY_PATH=$(pwd):$LD_LIBRARY_PATH
export NCCL_NET_PLUGIN=mesh
export NCCL_DEBUG=INFO # or WARN for less output
# Run your distributed job
python your_distributed_script.py
```
## 🧪 Testing
### Basic All-Reduce Test
```python
import torch
import torch.distributed as dist

# RANK and MASTER_IP are placeholders: set them per node (ranks 0, 1, 2).
dist.init_process_group('nccl', rank=RANK, world_size=3,
                        init_method='tcp://MASTER_IP:29500')
t = torch.ones(1000, device='cuda')
dist.all_reduce(t)
print(f'Result: {t[0].item()}')  # Should print 3.0 (sum across 3 ranks)
dist.destroy_process_group()
```
### Bandwidth Benchmark
```python
import time

import torch
import torch.distributed as dist

# RANK and MASTER_IP are placeholders: set them per node (ranks 0, 1, 2).
dist.init_process_group('nccl', rank=RANK, world_size=3,
                        init_method='tcp://MASTER_IP:29500')
t = torch.ones(1024 * 1024 * 64, device='cuda')  # 64M floats = 256 MB

# Warmup
for _ in range(5):
    dist.all_reduce(t)
torch.cuda.synchronize()

# Benchmark
start = time.time()
for _ in range(20):
    dist.all_reduce(t)
torch.cuda.synchronize()
elapsed = time.time() - start

# 256 MB per all-reduce over 20 iterations, reported as GB/s of payload
print(f'Bandwidth: {(256 * 20 / 1024) / elapsed:.2f} GB/s')
```
## 🔬 How It Works
### Connection Flow
```
Rank 0 (listen)                   Rank 1 (connect)
  │                                       │
  ▼                                       │
listen()                                  │
  ├─ Create QPs on ALL NICs               │
  ├─ Start handshake thread               │
  ├─ Return handle with all IPs           │
  │                                       │
  │◄────────── handle exchange ──────────►│
  │                                       │
  │                                       ▼
  │                                   connect()
  │                                       ├─ Find matching subnet
  │                                       ├─ Create QP on that NIC
  │                                       ├─ TCP handshake ────────►│
  │                                       │                         │
  │◄──────────────────────────────────────────────────── QP info ───┤
  │                                       │                         │
  ▼                                       ▼                         ▼
accept()                             Connect QP              [handshake thread]
  ├─ Get QP from queue              to peer's QP                ├─ Accept TCP
  └─ Return recv_comm                     │                     ├─ Create new QP
                                          │                     ├─ Connect QPs
                                          │                     └─ Queue for accept()
                                     ┌────┴────┐
                                     │ RDMA OK │
                                     └─────────┘
```
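The TCP side of that flow is small. Below is a hedged sketch of the QP info exchange, using hypothetical names (`mesh_qp_info`, `exchange_qp_info`); the plugin's real wire format may differ, but the idea is a symmetric send/receive of the fields each side needs to drive its QP to RTR/RTS:
```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <infiniband/verbs.h>

/* Hypothetical wire format: everything the peer needs to connect its RC QP
 * to ours (QP number, initial PSN, and our GID for RoCE v2 routing). */
struct mesh_qp_info {
    uint32_t      qp_num;   /* local QP number */
    uint32_t      psn;      /* starting packet sequence number */
    union ibv_gid gid;      /* RoCE v2 GID of the chosen NIC port */
};

/* Symmetric exchange over an already-connected TCP socket: send ours,
 * receive theirs.  Running the accept/reply side in a background thread is
 * what keeps two ranks calling connect() at the same time from deadlocking. */
static int exchange_qp_info(int sock, const struct mesh_qp_info *local,
                            struct mesh_qp_info *remote) {
    if (send(sock, local, sizeof(*local), 0) != (ssize_t)sizeof(*local))
        return -1;
    if (recv(sock, remote, sizeof(*remote), MSG_WAITALL) != (ssize_t)sizeof(*remote))
        return -1;
    return 0;
}
```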
### Subnet Matching
```c
// For each peer address advertised in the handle (stop at the first match)
for (int i = 0; i < handle->num_addrs && selected_nic == NULL; i++) {
    uint32_t peer_ip = handle->addrs[i].ip;
    // Find a local NIC on the same subnet as this peer address
    for (int j = 0; j < num_nics; j++) {
        if ((peer_ip & nic[j].netmask) == nic[j].subnet) {
            // Found matching NIC!
            selected_nic = &nic[j];
            break;
        }
    }
}
```
## ⚙️ Configuration
| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `NCCL_NET_PLUGIN` | - | Set to `mesh` to use this plugin |
| `NCCL_DEBUG` | `WARN` | Set to `INFO` for detailed logs |
| `NCCL_MESH_GID_INDEX` | `3` | RoCE GID index to use |
| `NCCL_MESH_DEBUG` | `0` | Enable plugin debug output |
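The plugin-specific variables are plain integers, so reading them boils down to `getenv` plus a fallback. A hedged sketch with the defaults from the table above (function names are assumptions; the real parsing code may differ):
```c
#include <stdlib.h>

/* Illustrative only: resolve an integer environment variable with a default. */
static int env_int(const char *name, int fallback) {
    const char *v = getenv(name);
    return v ? atoi(v) : fallback;
}

/* Pick up the plugin tunables: RoCE GID index 3 and debug off by default. */
static void mesh_read_config(int *gid_index, int *debug) {
    *gid_index = env_int("NCCL_MESH_GID_INDEX", 3);
    *debug     = env_int("NCCL_MESH_DEBUG", 0);
}
```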
## 🚧 Current Limitations
- **Single QP per connection** - No multi-rail aggregation yet
- **No relay routing** - Non-adjacent nodes can't communicate (fine for fully-connected mesh)
- **RoCE v2 only** - Ethernet-based RDMA, no native InfiniBand support
## 🗺️ Roadmap
- [ ] Multi-QP per connection for higher bandwidth
- [ ] Adaptive routing for partial mesh topologies
- [ ] Performance tuning (inline data, selective signaling)
- [ ] Support for non-unified-memory systems with explicit GPUDirect RDMA
## 🛠️ Hardware Tested
| Component | Specification |
|-----------|--------------|
| Nodes | 3x NVIDIA DGX Spark |
| CPU | NVIDIA Grace (ARM64) |
| GPU | NVIDIA Blackwell |
| Memory | Unified (NVLink-C2C) |
| NICs | ConnectX-7 (100GbE) |
| Cables | Direct-attach QSFP56 |
## 📚 References
- [NCCL Documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/)
- [RDMA Aware Networks Programming User Manual](https://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf)
- [InfiniBand Verbs API](https://github.com/linux-rdma/rdma-core)
## 📄 License
MIT License - see [LICENSE](LICENSE) file.
## 🙏 Acknowledgments
Built to connect three DGX Spark workstations that NVIDIA never intended to cluster. Sometimes the best solutions come from ignoring "supported configurations."
---
*"The future of distributed AI computing is here."* — Mistral-7B, running distributed inference on this very plugin