Initial release: NCCL Mesh Plugin for direct-connect RDMA topologies

- Enables NCCL over multi-subnet mesh topologies
- 8+ GB/s bandwidth over 100Gbps RDMA
- Successfully tested with distributed LLM inference (Mistral-7B)
- Custom subnet-aware NIC selection
- Background handshake thread for deadlock-free connection setup
autoscriptlabs 2026-01-09 14:09:33 -05:00
commit 031bc48953
13 changed files with 3074 additions and 0 deletions

docs/ARCHITECTURE.md (new file, 337 lines)
# NCCL Mesh Plugin Architecture
This document provides a deep dive into the architecture and implementation of the NCCL Mesh Plugin.
## Overview
The NCCL Mesh Plugin is a custom network transport that enables NCCL to work with direct-connect RDMA mesh topologies where each node pair is on a different subnet. This is a configuration that standard NCCL plugins cannot handle.
## The Problem
### Standard NCCL Networking
NCCL's built-in network plugins assume one of two scenarios:
1. **InfiniBand Fabric**: All nodes connected through IB switches, sharing a single subnet
2. **TCP/IP Sockets**: Standard IP networking with routing
### Our Topology
```
          Node A (192.168.100.2, 192.168.101.2)
            /                        \
   192.168.100.x                192.168.101.x
          /                            \
     Node C                          Node B
(192.168.100.3,                 (192.168.101.3,
 192.168.102.3)                  192.168.102.2)
          \                            /
           \       192.168.102.x      /
            \------------------------/
```
Each link is on a **different subnet**:
- A↔B: 192.168.101.0/24
- A↔C: 192.168.100.0/24
- B↔C: 192.168.102.0/24
This means:
- No single IP can reach all peers
- Standard IB plugin fails (expects single subnet)
- TCP socket plugin would need IP routing (adds latency)
## Solution Architecture
### Key Insight
Each node has **multiple NICs**, each on a different subnet. When connecting to a peer, we must:
1. Determine which subnet the peer is on
2. Use the local NIC on that same subnet
3. Establish RDMA connection over that specific NIC pair
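A minimal sketch of this matching logic is shown below; the `mesh_nic_info` struct and `find_local_nic_for_peer` helper are illustrative names, not the plugin's actual API, and addresses are assumed to be in network byte order as in the handle structure that follows.
```c
#include <stdint.h>

// Illustrative sketch only: the real plugin keeps equivalent per-NIC state.
struct mesh_nic_info {
    uint32_t ip;    // local NIC address (network byte order)
    uint32_t mask;  // subnet mask (network byte order)
};

// Return the index of the local NIC that shares a subnet with the peer
// address, or -1 if the peer is not directly reachable from this node.
static int find_local_nic_for_peer(const struct mesh_nic_info *nics, int num_nics,
                                   uint32_t peer_ip, uint32_t peer_mask) {
    for (int i = 0; i < num_nics; i++) {
        if ((nics[i].ip & peer_mask) == (peer_ip & peer_mask))
            return i;
    }
    return -1;
}
```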
### Handle Structure
The NCCL handle is expanded to advertise **all** local addresses:
```c
struct mesh_handle {
uint32_t magic; // Validation
uint8_t num_addrs; // Number of addresses
uint16_t handshake_port; // TCP port for QP exchange
struct mesh_addr_entry {
uint32_t ip; // IP address (network order)
uint32_t mask; // Subnet mask
uint32_t qp_num; // Queue Pair number
uint8_t nic_idx; // Index into local NIC array
} addrs[MESH_MAX_ADDRS];
};
```
### Connection Flow
#### Phase 1: Listen
```c
ncclResult_t mesh_listen(int dev, void *opaqueHandle, void **listenComm) {
  struct mesh_handle *handle = (struct mesh_handle *)opaqueHandle;
  // 1. Create QPs on ALL local NICs
  for (int i = 0; i < num_nics; i++) {
    create_qp_on_nic(&nics[i]);
  }
  // 2. Start background handshake thread
  pthread_create(&thread, NULL, handshake_thread_func, lcomm);
  // 3. Fill handle with ALL addresses
  for (int i = 0; i < num_nics; i++) {
    handle->addrs[i].ip      = nics[i].ip_addr;
    handle->addrs[i].mask    = nics[i].netmask;
    handle->addrs[i].qp_num  = qps[i]->qp_num;
    handle->addrs[i].nic_idx = i;
  }
  handle->num_addrs = num_nics;
  handle->handshake_port = handshake_port;  // port the handshake thread listens on
  return ncclSuccess;
}
```
#### Phase 2: Connect
```c
ncclResult_t mesh_connect(int dev, void *opaqueHandle, void **sendComm) {
  struct mesh_handle *handle = (struct mesh_handle *)opaqueHandle;
  // 1. Search the peer's advertised addresses for one we can reach
  struct mesh_nic *selected_nic = NULL;
  struct mesh_addr_entry *selected_peer_addr = NULL;
  for (int i = 0; i < handle->num_addrs && selected_nic == NULL; i++) {
    uint32_t peer_subnet = handle->addrs[i].ip & handle->addrs[i].mask;
    // Find a local NIC on the same subnet
    for (int j = 0; j < num_local_nics; j++) {
      if (local_nics[j].subnet == peer_subnet) {
        selected_nic = &local_nics[j];
        selected_peer_addr = &handle->addrs[i];
        break;
      }
    }
  }
  if (selected_nic == NULL) return ncclSystemError;  // peer not directly reachable
  // 2. Create a QP on the selected NIC
  create_qp_on_nic(selected_nic);
  // 3. Exchange QP info over the peer's TCP handshake port
  send_handshake(selected_peer_addr->ip, handle->handshake_port,
                 &local_qp_info, &remote_qp_info);
  // 4. Connect our QP to the peer's QP
  connect_qp(local_qp, remote_qp_info);
  return ncclSuccess;
}
```
#### Phase 3: Accept
```c
ncclResult_t mesh_accept(void *listenComm, void **recvComm) {
// Get pre-connected QP from handshake thread's queue
pthread_mutex_lock(&queue_mutex);
while (queue_empty) {
pthread_cond_wait(&queue_cond, &queue_mutex);
}
entry = dequeue();
pthread_mutex_unlock(&queue_mutex);
// Return the ready connection
rcomm->qp = entry->local_qp;
rcomm->nic = entry->nic;
}
```
### Background Handshake Thread
The handshake thread solves a critical deadlock problem:
**Without thread:**
```
Rank 0: connect() → TCP connect to Rank 1 → blocks waiting for accept()
Rank 1: connect() → TCP connect to Rank 0 → blocks waiting for accept()
// DEADLOCK: Neither can call accept() because both stuck in connect()
```
**With thread:**
```
Rank 0: listen() starts thread → thread waits for TCP connections
Rank 1: listen() starts thread → thread waits for TCP connections
Rank 0: connect() → TCP connects to Rank 1's thread → gets response → returns
Rank 1: connect() → TCP connects to Rank 0's thread → gets response → returns
Rank 0: accept() → gets QP from queue (filled by thread) → returns
Rank 1: accept() → gets QP from queue (filled by thread) → returns
// SUCCESS: Thread handles incoming connections asynchronously
```
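For reference, here is a hedged sketch of what the handshake thread's body might look like, reusing the `queue_mutex`/`queue_cond` names from the accept() excerpt above; `listen_fd`, `exchange_qp_info()`, `enqueue()`, and `mesh_conn_entry` are assumed names, not the plugin's actual API.
```c
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

// Assumed declarations, matching the accept() excerpt above (illustrative only).
extern int listen_fd;                                // TCP socket bound to handshake_port
extern pthread_mutex_t queue_mutex;
extern pthread_cond_t queue_cond;
struct mesh_conn_entry;                              // connected QP + NIC, dequeued by mesh_accept()
struct mesh_conn_entry *exchange_qp_info(int fd);    // hypothetical: swap QP info, drive QP to RTS
void enqueue(struct mesh_conn_entry *entry);         // hypothetical: push onto the accept queue

static void *handshake_thread_func(void *arg) {
    (void)arg;                                       // the real thread receives the listen comm here
    for (;;) {
        // Block until a peer's connect() reaches us over TCP.
        int fd = accept(listen_fd, NULL, NULL);
        if (fd < 0)
            break;                                   // listen socket closed during shutdown

        // Read the peer's QP info, reply with ours, and bring our QP to RTS.
        struct mesh_conn_entry *entry = exchange_qp_info(fd);
        close(fd);                                   // TCP is only needed for the handshake itself

        // Hand the ready connection to mesh_accept() via the shared queue.
        pthread_mutex_lock(&queue_mutex);
        enqueue(entry);
        pthread_cond_signal(&queue_cond);
        pthread_mutex_unlock(&queue_mutex);
    }
    return NULL;
}
```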
### RDMA Queue Pair Setup
Each connection requires proper QP state transitions:
```
RESET → INIT → RTR → RTS
```
```c
int mesh_connect_qp(struct ibv_qp *qp, struct mesh_nic *nic,
struct mesh_handle *remote) {
// RESET → INIT
qp_attr.qp_state = IBV_QPS_INIT;
qp_attr.pkey_index = 0;
qp_attr.port_num = nic->port_num;
qp_attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE |
IBV_ACCESS_REMOTE_READ |
IBV_ACCESS_LOCAL_WRITE;
ibv_modify_qp(qp, &qp_attr, ...);
// INIT → RTR (Ready to Receive)
qp_attr.qp_state = IBV_QPS_RTR;
qp_attr.path_mtu = IBV_MTU_4096;
qp_attr.dest_qp_num = remote->qp_num;
qp_attr.rq_psn = remote->psn;
qp_attr.ah_attr.dlid = remote->lid; // 0 for RoCE
qp_attr.ah_attr.grh.dgid = remote->gid; // Peer's GID
ibv_modify_qp(qp, &qp_attr, ...);
// RTR → RTS (Ready to Send)
qp_attr.qp_state = IBV_QPS_RTS;
qp_attr.sq_psn = local_psn;
qp_attr.timeout = 14;
qp_attr.retry_cnt = 7;
qp_attr.rnr_retry = 7;
ibv_modify_qp(qp, &qp_attr, ...);
}
```
### Data Transfer
#### Send Path
```c
ncclResult_t mesh_isend(void *sendComm, void *data, int size,
                        void *mhandle, void **request) {
  struct ibv_sge sge = {
    .addr   = (uint64_t)data,
    .length = size,
    .lkey   = mr->lkey,           // mr comes from mhandle (registered earlier)
  };
  struct ibv_send_wr wr = {
    .wr_id      = (uint64_t)req,
    .sg_list    = &sge,
    .num_sge    = 1,
    .opcode     = IBV_WR_SEND,
    .send_flags = IBV_SEND_SIGNALED,
  };
  struct ibv_send_wr *bad_wr = NULL;
  ibv_post_send(comm->qp, &wr, &bad_wr);
  *request = req;                 // NCCL polls this via mesh_test()
  return ncclSuccess;
}
```
#### Receive Path
```c
ncclResult_t mesh_irecv(void *recvComm, int n, void **data,
                        int *sizes, void **mhandles, void **request) {
  struct ibv_sge sge = {
    .addr   = (uint64_t)data[0],
    .length = sizes[0],
    .lkey   = mr->lkey,           // mr comes from mhandles[0]
  };
  struct ibv_recv_wr wr = {
    .wr_id   = (uint64_t)req,
    .sg_list = &sge,
    .num_sge = 1,
  };
  struct ibv_recv_wr *bad_wr = NULL;
  ibv_post_recv(comm->qp, &wr, &bad_wr);
  *request = req;                 // NCCL polls this via mesh_test()
  return ncclSuccess;
}
```
#### Completion Polling
```c
ncclResult_t mesh_test(void *request, int *done, int *sizes) {
struct ibv_wc wc;
int ret = ibv_poll_cq(req->cq, 1, &wc);
if (ret > 0) {
if (wc.status == IBV_WC_SUCCESS) {
*done = 1;
if (sizes) *sizes = wc.byte_len;
} else {
// Handle error
}
} else {
*done = 0; // Not complete yet
}
}
```
## Memory Registration
RDMA requires memory to be registered with the NIC:
```c
ncclResult_t mesh_regMr(void *comm, void *data, size_t size,
int type, void **mhandle) {
int access = IBV_ACCESS_LOCAL_WRITE |
IBV_ACCESS_REMOTE_WRITE |
IBV_ACCESS_REMOTE_READ;
mrh->mr = ibv_reg_mr(nic->pd, data, size, access);
*mhandle = mrh;
}
```
**Note**: Current implementation uses host memory staging. GPU memory is copied to host, sent via RDMA, then copied back to GPU on the receiver. GPUDirect RDMA would eliminate these copies.
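For illustration, here is a rough sketch of the staged send path under the assumption of one pre-registered host bounce buffer per connection; `staged_send`, `bounce_buf`, and `bounce_mr` are hypothetical names, not the plugin's actual API.
```c
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <stdint.h>

// Hypothetical sketch: copy the GPU buffer into a pre-registered host bounce
// buffer, then post it as an RDMA send. The receiver copies host -> GPU after
// the matching receive completes.
static int staged_send(struct ibv_qp *qp, struct ibv_mr *bounce_mr,
                       void *bounce_buf, const void *gpu_data, size_t size) {
    // GPU -> host copy (the copy GPUDirect RDMA would eliminate)
    if (cudaMemcpy(bounce_buf, gpu_data, size, cudaMemcpyDeviceToHost) != cudaSuccess)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)bounce_buf,
        .length = (uint32_t)size,
        .lkey   = bounce_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```
With GPUDirect RDMA (listed under future optimizations below), the GPU buffer itself would be registered and both staging copies would disappear.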
## Performance Considerations
### Current Bottlenecks
1. **Host Memory Staging**: GPU↔Host copies add latency
2. **Single QP**: One Queue Pair per connection limits parallelism
3. **Completion Signaling**: Every operation signals completion
### Achieved Performance
- **8+ GB/s** effective bandwidth
- **~64%** of 100 Gbps line rate
- Sufficient for distributed ML workloads
### Future Optimizations
1. **GPUDirect RDMA**: Register GPU memory directly
2. **Multi-QP**: Multiple QPs per connection
3. **Selective Signaling**: Signal completion only every N operations (see the sketch after this list)
4. **Inline Data**: Small messages in WQE
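As an example of item 3, here is a minimal sketch of selective signaling, assuming the QP is created with `sq_sig_all = 0`; `post_send_selective`, `SIGNAL_INTERVAL`, and `send_count` are illustrative names, not part of the current plugin.
```c
#include <infiniband/verbs.h>
#include <stdint.h>

#define SIGNAL_INTERVAL 16   // request a completion only on every 16th send

// Post a send that asks for a CQE only every SIGNAL_INTERVAL-th call, so the
// CQ is polled far less often than once per message.
static int post_send_selective(struct ibv_qp *qp, struct ibv_sge *sge,
                               uint64_t wr_id, unsigned *send_count) {
    struct ibv_send_wr wr = {
        .wr_id   = wr_id,
        .sg_list = sge,
        .num_sge = 1,
        .opcode  = IBV_WR_SEND,
        // Unsignaled WQEs are retired implicitly when a later signaled
        // completion on the same QP is polled.
        .send_flags = (++(*send_count) % SIGNAL_INTERVAL == 0) ? IBV_SEND_SIGNALED : 0,
    };
    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```
Unsignaled sends still occupy send-queue slots until a later signaled completion is polled, so the interval must stay well below the send-queue depth.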
## File Structure
```
nccl-mesh-plugin/
├── src/
│ └── mesh_plugin.c # Main implementation (~1400 lines)
├── include/
│ └── mesh_plugin.h # Data structures and declarations
├── nccl/
│ ├── net.h # NCCL net plugin interface
│ ├── net_v8.h # v8 properties structure
│ └── err.h # NCCL error codes
└── Makefile
```
## Debugging
Enable debug output:
```bash
export NCCL_DEBUG=INFO
export NCCL_MESH_DEBUG=1
```
Common issues:
1. **"No local NIC found"**: Subnet mismatch, check IP configuration
2. **"Handshake timeout"**: Firewall blocking TCP, check ports
3. **"QP transition failed"**: GID index wrong, try different `NCCL_MESH_GID_INDEX`
4. **"WC error status=12"**: Transport retry exceeded, check RDMA connectivity
## Conclusion
The NCCL Mesh Plugin demonstrates that with careful engineering, NCCL can be extended to support unconventional network topologies. The key innovations—multi-address handles, subnet-aware NIC selection, and asynchronous handshaking—provide a template for other custom NCCL transports.

docs/SETUP.md (new file, 249 lines)
# Hardware Setup Guide
This guide covers setting up a direct-connect RDMA mesh topology with multiple nodes.
## Overview
Our reference setup uses three NVIDIA DGX Spark workstations connected in a triangle mesh topology. Each pair of nodes has a dedicated 100 Gbps RDMA link on its own subnet.
## Hardware Requirements
- 3+ nodes with RDMA-capable NICs (ConnectX-6/7 recommended)
- Direct-attach cables (QSFP56 for 100GbE)
- Each node needs N-1 NICs for N nodes in a fully-connected mesh
## Network Topology
### Triangle Mesh (3 Nodes)
```
              Node A
             /      \
         NIC1        NIC2
           |           |
  192.168.101.x   192.168.100.x
           |           |
         NIC1        NIC1
           |           |
        Node B ----- Node C
               NIC2
          192.168.102.x
```
### IP Address Assignment
| Link | Subnet | Node A | Node B | Node C |
|------|--------|--------|--------|--------|
| A↔B | 192.168.101.0/24 | .2 | .3 | - |
| A↔C | 192.168.100.0/24 | .2 | - | .3 |
| B↔C | 192.168.102.0/24 | - | .2 | .3 |
## Network Configuration
### 1. Identify NICs
```bash
# List RDMA devices
ibv_devices
# List network interfaces with RDMA
ls -la /sys/class/infiniband/*/device/net/
```
### 2. Configure IP Addresses
On **Node A** (example):
```bash
# Link to Node B
sudo ip addr add 192.168.101.2/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up
# Link to Node C
sudo ip addr add 192.168.100.2/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```
On **Node B**:
```bash
# Link to Node A
sudo ip addr add 192.168.101.3/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up
# Link to Node C
sudo ip addr add 192.168.102.2/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```
On **Node C**:
```bash
# Link to Node A
sudo ip addr add 192.168.100.3/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up
# Link to Node B
sudo ip addr add 192.168.102.3/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```
### 3. Make Configuration Persistent
Create netplan config (Ubuntu):
```yaml
# /etc/netplan/99-rdma-mesh.yaml
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.101.2/24   # Adjust per node
    enp1s0f1np1:
      addresses:
        - 192.168.100.2/24   # Adjust per node
```
Apply:
```bash
sudo netplan apply
```
## Verify Connectivity
### 1. Ping Test
From Node A:
```bash
ping 192.168.101.3 # Node B
ping 192.168.100.3 # Node C
```
### 2. RDMA Test
```bash
# On Node B (server)
ib_send_bw -d rocep1s0f0 -x 3
# On Node A (client)
ib_send_bw -d rocep1s0f0 -x 3 192.168.101.3
```
Expected output: ~12 GB/s for 100GbE
### 3. Verify GID Index
```bash
# Show GID table
show_gids
# Find RoCE v2 GID (usually index 3)
ibv_devinfo -v | grep -A5 GID
```
## RoCE Configuration
### Enable RoCE v2
```bash
# Check current mode
cat /sys/class/infiniband/rocep*/ports/1/gid_attrs/types/*
# Enable RoCE v2 (if needed)
echo "RoCE v2" | sudo tee /sys/class/infiniband/rocep1s0f0/ports/1/gid_attrs/types/0
```
### Configure ECN (Optional but Recommended)
```bash
# Enable ECN for RoCE
sudo sysctl -w net.ipv4.tcp_ecn=1
# Configure PFC (Priority Flow Control) on switch if applicable
```
## Firewall Configuration
Open ports for NCCL communication:
```bash
# TCP ports for handshake (dynamic, 40000-50000 range)
sudo ufw allow 40000:50000/tcp
# Or disable firewall for mesh interfaces
sudo ufw allow in on enp1s0f0np0
sudo ufw allow in on enp1s0f1np1
```
## Troubleshooting
### No RDMA Devices Found
```bash
# Load kernel modules
sudo modprobe ib_core
sudo modprobe mlx5_core
sudo modprobe mlx5_ib
# Check dmesg
dmesg | grep -i mlx
```
### Link Not Coming Up
```bash
# Check physical connection
ethtool enp1s0f0np0
# Check for errors
ip -s link show enp1s0f0np0
```
### RDMA Connection Fails
```bash
# Verify GID is populated
cat /sys/class/infiniband/rocep1s0f0/ports/1/gids/3
# Check RDMA CM
rdma link show
```
### Wrong GID Index
Try different GID indices:
```bash
export NCCL_MESH_GID_INDEX=0 # or 1, 2, 3...
```
## Scaling Beyond 3 Nodes
For N nodes in a fully-connected mesh:
- Each node needs N-1 NICs
- Total links: N*(N-1)/2
- Each link on unique subnet
For 4 nodes:
```
  A
 /|\
B-+-C
 \|/
  D
```
- 6 links, 6 subnets
- Each node needs 3 NICs
For larger clusters, consider a **partial mesh** or **fat-tree** topology with relay routing (not yet implemented in this plugin).
## Reference: DGX Spark Mesh
Our tested configuration:
| Hostname | Management IP | Mesh IPs |
|----------|--------------|----------|
| titanic (A) | 10.0.0.170 | 192.168.100.2, 192.168.101.2 |
| iceberg (B) | 10.0.0.171 | 192.168.101.3, 192.168.102.2 |
| carpathia (C) | 10.0.0.172 | 192.168.100.3, 192.168.102.3 |