Initial release: NCCL Mesh Plugin for direct-connect RDMA topologies

- Enables NCCL over multi-subnet mesh topologies
- 8+ GB/s bandwidth over 100Gbps RDMA
- Successfully tested with distributed LLM inference (Mistral-7B)
- Custom subnet-aware NIC selection
- Background handshake thread for deadlock-free connection setup
autoscriptlabs 2026-01-09 14:09:33 -05:00
commit 031bc48953
13 changed files with 3074 additions and 0 deletions

docs/ARCHITECTURE.md (new file, 337 lines)
# NCCL Mesh Plugin Architecture
This document provides a deep dive into the architecture and implementation of the NCCL Mesh Plugin.
## Overview
The NCCL Mesh Plugin is a custom network transport that enables NCCL to work with direct-connect RDMA mesh topologies where each node pair is on a different subnet. This is a configuration that standard NCCL plugins cannot handle.
## The Problem
### Standard NCCL Networking
NCCL's built-in network plugins assume one of two scenarios:
1. **InfiniBand Fabric**: All nodes connected through IB switches, sharing a single subnet
2. **TCP/IP Sockets**: Standard IP networking with routing
### Our Topology
```
          Node A (192.168.100.2, 192.168.101.2)
            /                        \
   192.168.100.x                192.168.101.x
          /                            \
     Node C                          Node B
(192.168.100.3,                 (192.168.101.3,
 192.168.102.3)                  192.168.102.2)
          \                            /
           \       192.168.102.x      /
            \------------------------/
```
Each link is on a **different subnet**:
- A↔B: 192.168.101.0/24
- A↔C: 192.168.100.0/24
- B↔C: 192.168.102.0/24
This means:
- No single IP can reach all peers
- Standard IB plugin fails (expects single subnet)
- TCP socket plugin would need IP routing (adds latency)
## Solution Architecture
### Key Insight
Each node has **multiple NICs**, each on a different subnet. When connecting to a peer, we must:
1. Determine which subnet the peer is on
2. Use the local NIC on that same subnet
3. Establish RDMA connection over that specific NIC pair
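A minimal sketch of this matching logic is shown below; the `mesh_nic_info` struct and `find_local_nic_for_peer` helper are illustrative names, not the plugin's actual API, and addresses are assumed to be in network byte order as in the handle structure that follows.
```c
#include <stdint.h>

// Illustrative sketch only: the real plugin keeps equivalent per-NIC state.
struct mesh_nic_info {
    uint32_t ip;    // local NIC address (network byte order)
    uint32_t mask;  // subnet mask (network byte order)
};

// Return the index of the local NIC that shares a subnet with the peer
// address, or -1 if the peer is not directly reachable from this node.
static int find_local_nic_for_peer(const struct mesh_nic_info *nics, int num_nics,
                                   uint32_t peer_ip, uint32_t peer_mask) {
    for (int i = 0; i < num_nics; i++) {
        if ((nics[i].ip & peer_mask) == (peer_ip & peer_mask))
            return i;
    }
    return -1;
}
```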
### Handle Structure
The NCCL handle is expanded to advertise **all** local addresses:
```c
struct mesh_handle {
uint32_t magic; // Validation
uint8_t num_addrs; // Number of addresses
uint16_t handshake_port; // TCP port for QP exchange
struct mesh_addr_entry {
uint32_t ip; // IP address (network order)
uint32_t mask; // Subnet mask
uint32_t qp_num; // Queue Pair number
uint8_t nic_idx; // Index into local NIC array
} addrs[MESH_MAX_ADDRS];
};
```
### Connection Flow
#### Phase 1: Listen
```c
ncclResult_t mesh_listen(int dev, void *opaqueHandle, void **listenComm) {
  struct mesh_handle *handle = (struct mesh_handle *)opaqueHandle;
  // 1. Create QPs on ALL local NICs
  for (int i = 0; i < num_nics; i++) {
    create_qp_on_nic(&nics[i]);
  }
  // 2. Start background handshake thread
  pthread_create(&thread, NULL, handshake_thread_func, lcomm);
  // 3. Fill handle with ALL addresses
  for (int i = 0; i < num_nics; i++) {
    handle->addrs[i].ip      = nics[i].ip_addr;
    handle->addrs[i].mask    = nics[i].netmask;
    handle->addrs[i].qp_num  = qps[i]->qp_num;
    handle->addrs[i].nic_idx = i;
  }
  handle->num_addrs = num_nics;
  handle->handshake_port = handshake_port;  // port the handshake thread listens on
  return ncclSuccess;
}
```
#### Phase 2: Connect
```c
ncclResult_t mesh_connect(int dev, void *opaqueHandle, void **sendComm) {
  struct mesh_handle *handle = (struct mesh_handle *)opaqueHandle;
  // 1. Search the peer's advertised addresses for one we can reach
  struct mesh_nic *selected_nic = NULL;
  struct mesh_addr_entry *selected_peer_addr = NULL;
  for (int i = 0; i < handle->num_addrs && selected_nic == NULL; i++) {
    uint32_t peer_subnet = handle->addrs[i].ip & handle->addrs[i].mask;
    // Find a local NIC on the same subnet
    for (int j = 0; j < num_local_nics; j++) {
      if (local_nics[j].subnet == peer_subnet) {
        selected_nic = &local_nics[j];
        selected_peer_addr = &handle->addrs[i];
        break;
      }
    }
  }
  if (selected_nic == NULL) return ncclSystemError;  // peer not directly reachable
  // 2. Create a QP on the selected NIC
  create_qp_on_nic(selected_nic);
  // 3. Exchange QP info over the peer's TCP handshake port
  send_handshake(selected_peer_addr->ip, handle->handshake_port,
                 &local_qp_info, &remote_qp_info);
  // 4. Connect our QP to the peer's QP
  connect_qp(local_qp, remote_qp_info);
  return ncclSuccess;
}
```
#### Phase 3: Accept
```c
ncclResult_t mesh_accept(void *listenComm, void **recvComm) {
// Get pre-connected QP from handshake thread's queue
pthread_mutex_lock(&queue_mutex);
while (queue_empty) {
pthread_cond_wait(&queue_cond, &queue_mutex);
}
entry = dequeue();
pthread_mutex_unlock(&queue_mutex);
// Return the ready connection
rcomm->qp = entry->local_qp;
rcomm->nic = entry->nic;
}
```
### Background Handshake Thread
The handshake thread solves a critical deadlock problem:
**Without thread:**
```
Rank 0: connect() → TCP connect to Rank 1 → blocks waiting for accept()
Rank 1: connect() → TCP connect to Rank 0 → blocks waiting for accept()
// DEADLOCK: Neither can call accept() because both stuck in connect()
```
**With thread:**
```
Rank 0: listen() starts thread → thread waits for TCP connections
Rank 1: listen() starts thread → thread waits for TCP connections
Rank 0: connect() → TCP connects to Rank 1's thread → gets response → returns
Rank 1: connect() → TCP connects to Rank 0's thread → gets response → returns
Rank 0: accept() → gets QP from queue (filled by thread) → returns
Rank 1: accept() → gets QP from queue (filled by thread) → returns
// SUCCESS: Thread handles incoming connections asynchronously
```
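For reference, here is a hedged sketch of what the handshake thread's body might look like, reusing the `queue_mutex`/`queue_cond` names from the accept() excerpt above; `listen_fd`, `exchange_qp_info()`, `enqueue()`, and `mesh_conn_entry` are assumed names, not the plugin's actual API.
```c
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

// Assumed declarations, matching the accept() excerpt above (illustrative only).
extern int listen_fd;                                // TCP socket bound to handshake_port
extern pthread_mutex_t queue_mutex;
extern pthread_cond_t queue_cond;
struct mesh_conn_entry;                              // connected QP + NIC, dequeued by mesh_accept()
struct mesh_conn_entry *exchange_qp_info(int fd);    // hypothetical: swap QP info, drive QP to RTS
void enqueue(struct mesh_conn_entry *entry);         // hypothetical: push onto the accept queue

static void *handshake_thread_func(void *arg) {
    (void)arg;                                       // the real thread receives the listen comm here
    for (;;) {
        // Block until a peer's connect() reaches us over TCP.
        int fd = accept(listen_fd, NULL, NULL);
        if (fd < 0)
            break;                                   // listen socket closed during shutdown

        // Read the peer's QP info, reply with ours, and bring our QP to RTS.
        struct mesh_conn_entry *entry = exchange_qp_info(fd);
        close(fd);                                   // TCP is only needed for the handshake itself

        // Hand the ready connection to mesh_accept() via the shared queue.
        pthread_mutex_lock(&queue_mutex);
        enqueue(entry);
        pthread_cond_signal(&queue_cond);
        pthread_mutex_unlock(&queue_mutex);
    }
    return NULL;
}
```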
### RDMA Queue Pair Setup
Each connection requires proper QP state transitions:
```
RESET → INIT → RTR → RTS
```
```c
int mesh_connect_qp(struct ibv_qp *qp, struct mesh_nic *nic,
struct mesh_handle *remote) {
// RESET → INIT
qp_attr.qp_state = IBV_QPS_INIT;
qp_attr.pkey_index = 0;
qp_attr.port_num = nic->port_num;
qp_attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE |
IBV_ACCESS_REMOTE_READ |
IBV_ACCESS_LOCAL_WRITE;
ibv_modify_qp(qp, &qp_attr, ...);
// INIT → RTR (Ready to Receive)
qp_attr.qp_state = IBV_QPS_RTR;
qp_attr.path_mtu = IBV_MTU_4096;
qp_attr.dest_qp_num = remote->qp_num;
qp_attr.rq_psn = remote->psn;
qp_attr.ah_attr.dlid = remote->lid; // 0 for RoCE
qp_attr.ah_attr.grh.dgid = remote->gid; // Peer's GID
ibv_modify_qp(qp, &qp_attr, ...);
// RTR → RTS (Ready to Send)
qp_attr.qp_state = IBV_QPS_RTS;
qp_attr.sq_psn = local_psn;
qp_attr.timeout = 14;
qp_attr.retry_cnt = 7;
qp_attr.rnr_retry = 7;
ibv_modify_qp(qp, &qp_attr, ...);
}
```
### Data Transfer
#### Send Path
```c
ncclResult_t mesh_isend(void *sendComm, void *data, int size,
                        void *mhandle, void **request) {
  struct ibv_sge sge = {
    .addr   = (uint64_t)data,
    .length = size,
    .lkey   = mr->lkey,           // mr comes from mhandle (registered earlier)
  };
  struct ibv_send_wr wr = {
    .wr_id      = (uint64_t)req,
    .sg_list    = &sge,
    .num_sge    = 1,
    .opcode     = IBV_WR_SEND,
    .send_flags = IBV_SEND_SIGNALED,
  };
  struct ibv_send_wr *bad_wr = NULL;
  ibv_post_send(comm->qp, &wr, &bad_wr);
  *request = req;                 // NCCL polls this via mesh_test()
  return ncclSuccess;
}
```
#### Receive Path
```c
ncclResult_t mesh_irecv(void *recvComm, int n, void **data,
                        int *sizes, void **mhandles, void **request) {
  struct ibv_sge sge = {
    .addr   = (uint64_t)data[0],
    .length = sizes[0],
    .lkey   = mr->lkey,           // mr comes from mhandles[0]
  };
  struct ibv_recv_wr wr = {
    .wr_id   = (uint64_t)req,
    .sg_list = &sge,
    .num_sge = 1,
  };
  struct ibv_recv_wr *bad_wr = NULL;
  ibv_post_recv(comm->qp, &wr, &bad_wr);
  *request = req;                 // NCCL polls this via mesh_test()
  return ncclSuccess;
}
```
#### Completion Polling
```c
ncclResult_t mesh_test(void *request, int *done, int *sizes) {
struct ibv_wc wc;
int ret = ibv_poll_cq(req->cq, 1, &wc);
if (ret > 0) {
if (wc.status == IBV_WC_SUCCESS) {
*done = 1;
if (sizes) *sizes = wc.byte_len;
} else {
// Handle error
}
} else {
*done = 0; // Not complete yet
}
}
```
## Memory Registration
RDMA requires memory to be registered with the NIC:
```c
ncclResult_t mesh_regMr(void *comm, void *data, size_t size,
int type, void **mhandle) {
int access = IBV_ACCESS_LOCAL_WRITE |
IBV_ACCESS_REMOTE_WRITE |
IBV_ACCESS_REMOTE_READ;
mrh->mr = ibv_reg_mr(nic->pd, data, size, access);
*mhandle = mrh;
}
```
**Note**: Current implementation uses host memory staging. GPU memory is copied to host, sent via RDMA, then copied back to GPU on the receiver. GPUDirect RDMA would eliminate these copies.
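For illustration, here is a rough sketch of the staged send path under the assumption of one pre-registered host bounce buffer per connection; `staged_send`, `bounce_buf`, and `bounce_mr` are hypothetical names, not the plugin's actual API.
```c
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <stdint.h>

// Hypothetical sketch: copy the GPU buffer into a pre-registered host bounce
// buffer, then post it as an RDMA send. The receiver copies host -> GPU after
// the matching receive completes.
static int staged_send(struct ibv_qp *qp, struct ibv_mr *bounce_mr,
                       void *bounce_buf, const void *gpu_data, size_t size) {
    // GPU -> host copy (the copy GPUDirect RDMA would eliminate)
    if (cudaMemcpy(bounce_buf, gpu_data, size, cudaMemcpyDeviceToHost) != cudaSuccess)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)bounce_buf,
        .length = (uint32_t)size,
        .lkey   = bounce_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```
With GPUDirect RDMA (listed under future optimizations below), the GPU buffer itself would be registered and both staging copies would disappear.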
## Performance Considerations
### Current Bottlenecks
1. **Host Memory Staging**: GPU↔Host copies add latency
2. **Single QP**: One Queue Pair per connection limits parallelism
3. **Completion Signaling**: Every operation signals completion
### Achieved Performance
- **8+ GB/s** effective bandwidth
- **~64%** of 100 Gbps line rate
- Sufficient for distributed ML workloads
### Future Optimizations
1. **GPUDirect RDMA**: Register GPU memory directly
2. **Multi-QP**: Multiple QPs per connection
3. **Selective Signaling**: Signal completion only every N operations (see the sketch after this list)
4. **Inline Data**: Small messages in WQE
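As an example of item 3, here is a minimal sketch of selective signaling, assuming the QP is created with `sq_sig_all = 0`; `post_send_selective`, `SIGNAL_INTERVAL`, and `send_count` are illustrative names, not part of the current plugin.
```c
#include <infiniband/verbs.h>
#include <stdint.h>

#define SIGNAL_INTERVAL 16   // request a completion only on every 16th send

// Post a send that asks for a CQE only every SIGNAL_INTERVAL-th call, so the
// CQ is polled far less often than once per message.
static int post_send_selective(struct ibv_qp *qp, struct ibv_sge *sge,
                               uint64_t wr_id, unsigned *send_count) {
    struct ibv_send_wr wr = {
        .wr_id   = wr_id,
        .sg_list = sge,
        .num_sge = 1,
        .opcode  = IBV_WR_SEND,
        // Unsignaled WQEs are retired implicitly when a later signaled
        // completion on the same QP is polled.
        .send_flags = (++(*send_count) % SIGNAL_INTERVAL == 0) ? IBV_SEND_SIGNALED : 0,
    };
    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```
Unsignaled sends still occupy send-queue slots until a later signaled completion is polled, so the interval must stay well below the send-queue depth.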
## File Structure
```
nccl-mesh-plugin/
├── src/
│ └── mesh_plugin.c # Main implementation (~1400 lines)
├── include/
│ └── mesh_plugin.h # Data structures and declarations
├── nccl/
│ ├── net.h # NCCL net plugin interface
│ ├── net_v8.h # v8 properties structure
│ └── err.h # NCCL error codes
└── Makefile
```
## Debugging
Enable debug output:
```bash
export NCCL_DEBUG=INFO
export NCCL_MESH_DEBUG=1
```
Common issues:
1. **"No local NIC found"**: Subnet mismatch, check IP configuration
2. **"Handshake timeout"**: Firewall blocking TCP, check ports
3. **"QP transition failed"**: GID index wrong, try different `NCCL_MESH_GID_INDEX`
4. **"WC error status=12"**: Transport retry exceeded, check RDMA connectivity
## Conclusion
The NCCL Mesh Plugin demonstrates that with careful engineering, NCCL can be extended to support unconventional network topologies. The key innovations—multi-address handles, subnet-aware NIC selection, and asynchronous handshaking—provide a template for other custom NCCL transports.

docs/SETUP.md (new file, 249 lines)
# Hardware Setup Guide
This guide covers setting up a direct-connect RDMA mesh topology with multiple nodes.
## Overview
Our reference setup uses three NVIDIA DGX Spark workstations connected in a triangle mesh topology. Each pair of nodes has a dedicated 100 Gbps RDMA link on its own subnet.
## Hardware Requirements
- 3+ nodes with RDMA-capable NICs (ConnectX-6/7 recommended)
- Direct-attach cables (QSFP56 for 100GbE)
- Each node needs N-1 NICs for N nodes in a fully-connected mesh
## Network Topology
### Triangle Mesh (3 Nodes)
```
              Node A
             /      \
         NIC1        NIC2
           |           |
  192.168.101.x   192.168.100.x
           |           |
         NIC1        NIC1
           |           |
        Node B ----- Node C
               NIC2
          192.168.102.x
```
### IP Address Assignment
| Link | Subnet | Node A | Node B | Node C |
|------|--------|--------|--------|--------|
| A↔B | 192.168.101.0/24 | .2 | .3 | - |
| A↔C | 192.168.100.0/24 | .2 | - | .3 |
| B↔C | 192.168.102.0/24 | - | .2 | .3 |
## Network Configuration
### 1. Identify NICs
```bash
# List RDMA devices
ibv_devices
# List network interfaces with RDMA
ls -la /sys/class/infiniband/*/device/net/
```
### 2. Configure IP Addresses
On **Node A** (example):
```bash
# Link to Node B
sudo ip addr add 192.168.101.2/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up
# Link to Node C
sudo ip addr add 192.168.100.2/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```
On **Node B**:
```bash
# Link to Node A
sudo ip addr add 192.168.101.3/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up
# Link to Node C
sudo ip addr add 192.168.102.2/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```
On **Node C**:
```bash
# Link to Node A
sudo ip addr add 192.168.100.3/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up
# Link to Node B
sudo ip addr add 192.168.102.3/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```
### 3. Make Configuration Persistent
Create netplan config (Ubuntu):
```yaml
# /etc/netplan/99-rdma-mesh.yaml
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.101.2/24   # Adjust per node
    enp1s0f1np1:
      addresses:
        - 192.168.100.2/24   # Adjust per node
```
Apply:
```bash
sudo netplan apply
```
## Verify Connectivity
### 1. Ping Test
From Node A:
```bash
ping 192.168.101.3 # Node B
ping 192.168.100.3 # Node C
```
### 2. RDMA Test
```bash
# On Node B (server)
ib_send_bw -d rocep1s0f0 -x 3
# On Node A (client)
ib_send_bw -d rocep1s0f0 -x 3 192.168.101.3
```
Expected output: ~12 GB/s for 100GbE
### 3. Verify GID Index
```bash
# Show GID table
show_gids
# Find RoCE v2 GID (usually index 3)
ibv_devinfo -v | grep -A5 GID
```
## RoCE Configuration
### Enable RoCE v2
```bash
# Check current mode
cat /sys/class/infiniband/rocep*/ports/1/gid_attrs/types/*
# Enable RoCE v2 (if needed)
echo "RoCE v2" | sudo tee /sys/class/infiniband/rocep1s0f0/ports/1/gid_attrs/types/0
```
### Configure ECN (Optional but Recommended)
```bash
# Enable ECN for RoCE
sudo sysctl -w net.ipv4.tcp_ecn=1
# Configure PFC (Priority Flow Control) on switch if applicable
```
## Firewall Configuration
Open ports for NCCL communication:
```bash
# TCP ports for handshake (dynamic, 40000-50000 range)
sudo ufw allow 40000:50000/tcp
# Or disable firewall for mesh interfaces
sudo ufw allow in on enp1s0f0np0
sudo ufw allow in on enp1s0f1np1
```
## Troubleshooting
### No RDMA Devices Found
```bash
# Load kernel modules
sudo modprobe ib_core
sudo modprobe mlx5_core
sudo modprobe mlx5_ib
# Check dmesg
dmesg | grep -i mlx
```
### Link Not Coming Up
```bash
# Check physical connection
ethtool enp1s0f0np0
# Check for errors
ip -s link show enp1s0f0np0
```
### RDMA Connection Fails
```bash
# Verify GID is populated
cat /sys/class/infiniband/rocep1s0f0/ports/1/gids/3
# Check RDMA CM
rdma link show
```
### Wrong GID Index
Try different GID indices:
```bash
export NCCL_MESH_GID_INDEX=0 # or 1, 2, 3...
```
## Scaling Beyond 3 Nodes
For N nodes in a fully-connected mesh:
- Each node needs N-1 NICs
- Total links: N*(N-1)/2
- Each link on unique subnet
For 4 nodes:
```
  A
 /|\
B-+-C
 \|/
  D
```
- 6 links, 6 subnets
- Each node needs 3 NICs
For larger clusters, consider a **partial mesh** or **fat-tree** topology with relay routing (not yet implemented in this plugin).
## Reference: DGX Spark Mesh
Our tested configuration:
| Hostname | Management IP | Mesh IPs |
|----------|--------------|----------|
| titanic (A) | 10.0.0.170 | 192.168.100.2, 192.168.101.2 |
| iceberg (B) | 10.0.0.171 | 192.168.101.3, 192.168.102.2 |
| carpathia (C) | 10.0.0.172 | 192.168.100.3, 192.168.102.3 |