Mirror of https://github.com/autoscriptlabs/nccl-mesh-plugin.git, synced 2026-01-11 11:34:06 +00:00

Initial release: NCCL Mesh Plugin for direct-connect RDMA topologies

- Enables NCCL over multi-subnet mesh topologies
- 8+ GB/s bandwidth over 100 Gbps RDMA
- Successfully tested with distributed LLM inference (Mistral-7B)
- Custom subnet-aware NIC selection
- Background handshake thread for deadlock-free connection setup

This commit is contained in: 031bc48953 (13 changed files with 3074 additions and 0 deletions)

docs/ARCHITECTURE.md (new file, 337 lines)

# NCCL Mesh Plugin Architecture

This document provides a deep dive into the architecture and implementation of the NCCL Mesh Plugin.

## Overview

The NCCL Mesh Plugin is a custom network transport that enables NCCL to work with direct-connect RDMA mesh topologies where each node pair is on a different subnet, a configuration that standard NCCL plugins cannot handle.

## The Problem

### Standard NCCL Networking

NCCL's built-in network plugins assume one of two scenarios:

1. **InfiniBand Fabric**: All nodes connected through IB switches, sharing a single subnet
2. **TCP/IP Sockets**: Standard IP networking with routing

### Our Topology

```
        Node A (192.168.100.2, 192.168.101.2)
              /                  \
      192.168.100.x        192.168.101.x
            /                      \
        Node C                   Node B
   (192.168.100.3,          (192.168.101.3,
    192.168.102.3)           192.168.102.2)
            \                      /
             \    192.168.102.x   /
              \                  /
               \----------------/
```

Each link is on a **different subnet**:

- A↔B: 192.168.101.0/24
- A↔C: 192.168.100.0/24
- B↔C: 192.168.102.0/24

This means:

- No single IP can reach all peers
- The standard IB plugin fails (it expects a single subnet)
- The TCP socket plugin would need IP routing (adds latency)

## Solution Architecture

### Key Insight

Each node has **multiple NICs**, each on a different subnet. When connecting to a peer, we must:

1. Determine which subnet the peer is on
2. Use the local NIC on that same subnet (the subnet test is sketched after this list)
3. Establish an RDMA connection over that specific NIC pair
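
The subnet test itself is a mask-and-compare. A minimal sketch (the helper name `same_subnet` is illustrative, not from the plugin):

```c
#include <stdint.h>

// 1 if a local interface (ip/mask) and a peer address fall on the same
// IPv4 subnet. All values are in network byte order, matching the
// handle layout below.
static int same_subnet(uint32_t local_ip, uint32_t local_mask, uint32_t peer_ip)
{
    return (local_ip & local_mask) == (peer_ip & local_mask);
}
```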

### Handle Structure

The NCCL handle is expanded to advertise **all** local addresses:

```c
struct mesh_handle {
    uint32_t magic;            // Validation
    uint8_t  num_addrs;        // Number of addresses
    uint16_t handshake_port;   // TCP port for QP exchange

    struct mesh_addr_entry {
        uint32_t ip;           // IP address (network order)
        uint32_t mask;         // Subnet mask
        uint32_t qp_num;       // Queue Pair number
        uint8_t  nic_idx;      // Index into local NIC array
    } addrs[MESH_MAX_ADDRS];
};
```
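
One constraint worth noting: NCCL copies handles by value and bounds them by `NCCL_NET_HANDLE_MAXSIZE` (128 bytes in the bundled `nccl/net.h`), so `MESH_MAX_ADDRS` has to keep this struct within that limit. A compile-time guard sketch (assuming C11):

```c
#include <assert.h>

// Fails the build if the multi-address handle outgrows NCCL's fixed
// handle buffer.
static_assert(sizeof(struct mesh_handle) <= 128,
              "mesh_handle exceeds NCCL_NET_HANDLE_MAXSIZE");
```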

### Connection Flow

#### Phase 1: Listen

```c
ncclResult_t mesh_listen(int dev, void *handle, void **listenComm) {
    // 1. Create QPs on ALL local NICs
    for (int i = 0; i < num_nics; i++) {
        create_qp_on_nic(&nics[i]);
    }

    // 2. Start the background handshake thread
    pthread_create(&thread, NULL, handshake_thread_func, lcomm);

    // 3. Fill the handle with ALL addresses
    for (int i = 0; i < num_nics; i++) {
        handle->addrs[i].ip     = nics[i].ip_addr;
        handle->addrs[i].mask   = nics[i].netmask;
        handle->addrs[i].qp_num = qps[i]->qp_num;
    }
}
```

#### Phase 2: Connect

```c
ncclResult_t mesh_connect(int dev, void *handle, void **sendComm) {
    // 1. Search the peer's addresses for a reachable one
    for (int i = 0; i < handle->num_addrs; i++) {
        uint32_t peer_subnet = handle->addrs[i].ip & handle->addrs[i].mask;

        // Find a local NIC on the same subnet
        for (int j = 0; j < num_local_nics; j++) {
            if (local_nics[j].subnet == peer_subnet) {
                selected_nic       = &local_nics[j];
                selected_peer_addr = &handle->addrs[i];
                break;
            }
        }
        if (selected_nic) break;   // stop at the first matching pair
    }

    // 2. Create a QP on the selected NIC
    create_qp_on_nic(selected_nic);

    // 3. Exchange QP info via the TCP handshake
    send_handshake(peer_ip, peer_port, &local_qp_info, &remote_qp_info);

    // 4. Connect the QP to the peer's QP
    connect_qp(local_qp, remote_qp_info);
}
```

#### Phase 3: Accept

```c
ncclResult_t mesh_accept(void *listenComm, void **recvComm) {
    // Get a pre-connected QP from the handshake thread's queue
    pthread_mutex_lock(&queue_mutex);
    while (queue_empty) {
        pthread_cond_wait(&queue_cond, &queue_mutex);
    }
    entry = dequeue();
    pthread_mutex_unlock(&queue_mutex);

    // Return the ready connection
    rcomm->qp  = entry->local_qp;
    rcomm->nic = entry->nic;
}
```

### Background Handshake Thread

The handshake thread solves a critical deadlock problem:

**Without the thread:**

```
Rank 0: connect() → TCP connect to Rank 1 → blocks waiting for accept()
Rank 1: connect() → TCP connect to Rank 0 → blocks waiting for accept()
// DEADLOCK: Neither can call accept() because both are stuck in connect()
```

**With the thread:**

```
Rank 0: listen() starts thread → thread waits for TCP connections
Rank 1: listen() starts thread → thread waits for TCP connections
Rank 0: connect() → TCP connects to Rank 1's thread → gets response → returns
Rank 1: connect() → TCP connects to Rank 0's thread → gets response → returns
Rank 0: accept() → gets QP from queue (filled by thread) → returns
Rank 1: accept() → gets QP from queue (filled by thread) → returns
// SUCCESS: The thread handles incoming connections asynchronously
```
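
A minimal sketch of that thread's accept loop (the `mesh_listen_comm` fields and the `handle_handshake` helper are illustrative stand-ins; the real definitions live in `src/mesh_plugin.c`):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <sys/socket.h>
#include <unistd.h>

// Illustrative stand-in for the plugin's listen-comm state.
struct mesh_listen_comm {
    int        listen_fd;   // TCP socket bound to handshake_port
    atomic_int shutdown;    // set during teardown
    /* ... queue of ready QPs, mutex, condvar ... */
};

// Hypothetical helper: reads the peer's QP info, brings a local QP to
// RTS, replies, and enqueues the ready connection for mesh_accept().
void handle_handshake(struct mesh_listen_comm *lcomm, int fd);

static void *handshake_thread_func(void *arg)
{
    struct mesh_listen_comm *lcomm = arg;

    while (!atomic_load(&lcomm->shutdown)) {
        // Peers connect, exchange QP info over TCP, and disconnect;
        // each handshake is short-lived, so a serial accept() suffices.
        int fd = accept(lcomm->listen_fd, NULL, NULL);
        if (fd < 0)
            continue;
        handle_handshake(lcomm, fd);
        close(fd);
    }
    return NULL;
}
```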

### RDMA Queue Pair Setup

Each connection requires proper QP state transitions:

```
RESET → INIT → RTR → RTS
```

```c
int mesh_connect_qp(struct ibv_qp *qp, struct mesh_nic *nic,
                    struct mesh_handle *remote) {
    // RESET → INIT
    qp_attr.qp_state        = IBV_QPS_INIT;
    qp_attr.pkey_index      = 0;
    qp_attr.port_num        = nic->port_num;
    qp_attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE |
                              IBV_ACCESS_REMOTE_READ |
                              IBV_ACCESS_LOCAL_WRITE;
    ibv_modify_qp(qp, &qp_attr, ...);

    // INIT → RTR (Ready to Receive)
    qp_attr.qp_state         = IBV_QPS_RTR;
    qp_attr.path_mtu         = IBV_MTU_4096;
    qp_attr.dest_qp_num      = remote->qp_num;
    qp_attr.rq_psn           = remote->psn;
    qp_attr.ah_attr.dlid     = remote->lid;   // 0 for RoCE
    qp_attr.ah_attr.grh.dgid = remote->gid;   // Peer's GID
    ibv_modify_qp(qp, &qp_attr, ...);

    // RTR → RTS (Ready to Send)
    qp_attr.qp_state  = IBV_QPS_RTS;
    qp_attr.sq_psn    = local_psn;
    qp_attr.timeout   = 14;
    qp_attr.retry_cnt = 7;
    qp_attr.rnr_retry = 7;
    ibv_modify_qp(qp, &qp_attr, ...);
}
```

### Data Transfer

#### Send Path

```c
ncclResult_t mesh_isend(void *sendComm, void *data, int size,
                        void *mhandle, void **request) {
    struct ibv_send_wr wr = {
        .wr_id      = (uint64_t)req,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };

    sge.addr   = (uint64_t)data;
    sge.length = size;
    sge.lkey   = mr->lkey;

    ibv_post_send(comm->qp, &wr, &bad_wr);
}
```

#### Receive Path

```c
ncclResult_t mesh_irecv(void *recvComm, int n, void **data,
                        int *sizes, void **mhandles, void **request) {
    struct ibv_recv_wr wr = {
        .wr_id   = (uint64_t)req,
        .sg_list = &sge,
        .num_sge = 1,
    };

    sge.addr   = (uint64_t)data[0];
    sge.length = sizes[0];
    sge.lkey   = mr->lkey;

    ibv_post_recv(comm->qp, &wr, &bad_wr);
}
```

#### Completion Polling

```c
ncclResult_t mesh_test(void *request, int *done, int *sizes) {
    struct ibv_wc wc;

    int ret = ibv_poll_cq(req->cq, 1, &wc);
    if (ret > 0) {
        if (wc.status == IBV_WC_SUCCESS) {
            *done = 1;
            if (sizes) *sizes = wc.byte_len;
        } else {
            // Handle error
        }
    } else {
        *done = 0;   // Not complete yet
    }
}
```
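
For context, a hedged sketch of how these entry points compose from the caller's side: post a send, then poll `mesh_test` until it reports completion. In reality NCCL's proxy thread drives this loop, so the fragment below is illustrative only:

```c
// Illustrative request lifecycle; sendComm, buf, len, and mhandle are
// assumed to come from earlier mesh_connect()/mesh_regMr() calls.
void *request = NULL;
int done = 0, size = 0;

mesh_isend(sendComm, buf, len, mhandle, &request);  // post the send
while (!done)
    mesh_test(request, &done, &size);               // poll for completion
```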

## Memory Registration

RDMA requires memory to be registered with the NIC:

```c
ncclResult_t mesh_regMr(void *comm, void *data, size_t size,
                        int type, void **mhandle) {
    int access = IBV_ACCESS_LOCAL_WRITE |
                 IBV_ACCESS_REMOTE_WRITE |
                 IBV_ACCESS_REMOTE_READ;

    mrh->mr  = ibv_reg_mr(nic->pd, data, size, access);
    *mhandle = mrh;
}
```

**Note**: The current implementation uses host-memory staging. GPU memory is copied to the host, sent via RDMA, then copied back to the GPU on the receiver. GPUDirect RDMA would eliminate these copies.
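
A sketch of what that staging means on the send side, assuming a pre-registered host bounce buffer (`staging_buf` and `stage_for_send` are illustrative, not the plugin's actual names):

```c
#include <cuda_runtime.h>
#include <stddef.h>

// Hypothetical staging step: copy GPU data into a host buffer that was
// already registered with ibv_reg_mr(), then post the RDMA send from
// there. GPUDirect RDMA would register the GPU buffer itself and skip
// this copy entirely.
static int stage_for_send(void *staging_buf, const void *gpu_data, size_t size)
{
    cudaError_t err = cudaMemcpy(staging_buf, gpu_data, size,
                                 cudaMemcpyDeviceToHost);
    return (err == cudaSuccess) ? 0 : -1;
}
```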

## Performance Considerations

### Current Bottlenecks

1. **Host Memory Staging**: GPU↔Host copies add latency
2. **Single QP**: One Queue Pair per connection limits parallelism
3. **Completion Signaling**: Every operation signals completion

### Achieved Performance

- **8+ GB/s** effective bandwidth
- **~64%** of the 100 Gbps line rate
- Sufficient for distributed ML workloads

### Future Optimizations

1. **GPUDirect RDMA**: Register GPU memory directly
2. **Multi-QP**: Use multiple QPs per connection
3. **Selective Signaling**: Signal only every N operations (sketched after this list)
4. **Inline Data**: Send small messages inline in the WQE
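
Of these, selective signaling (item 3) is the smallest change to the send path; a sketch with an illustrative counter and interval:

```c
#include <infiniband/verbs.h>

#define SIGNAL_INTERVAL 64   /* illustrative; tune against CQ depth */

// Signal only every Nth send. Unsignaled sends generate no CQEs; the
// periodic signaled send, once completed, implicitly retires the
// unsignaled ones queued before it.
static void set_signal_flags(struct ibv_send_wr *wr, unsigned *send_count)
{
    wr->send_flags = (++(*send_count) % SIGNAL_INTERVAL == 0)
                         ? IBV_SEND_SIGNALED : 0;
}
```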

## File Structure

```
nccl-mesh-plugin/
├── src/
│   └── mesh_plugin.c    # Main implementation (~1400 lines)
├── include/
│   └── mesh_plugin.h    # Data structures and declarations
├── nccl/
│   ├── net.h            # NCCL net plugin interface
│   ├── net_v8.h         # v8 properties structure
│   └── err.h            # NCCL error codes
└── Makefile
```

## Debugging

Enable debug output:

```bash
export NCCL_DEBUG=INFO
export NCCL_MESH_DEBUG=1
```

Common issues:

1. **"No local NIC found"**: Subnet mismatch; check the IP configuration
2. **"Handshake timeout"**: Firewall blocking TCP; check the ports
3. **"QP transition failed"**: Wrong GID index; try a different `NCCL_MESH_GID_INDEX`
4. **"WC error status=12"**: Transport retry exceeded; check RDMA connectivity

## Conclusion

The NCCL Mesh Plugin demonstrates that, with careful engineering, NCCL can be extended to support unconventional network topologies. The key innovations (multi-address handles, subnet-aware NIC selection, and asynchronous handshaking) provide a template for other custom NCCL transports.

docs/SETUP.md (new file, 249 lines)

# Hardware Setup Guide

This guide covers setting up a direct-connect RDMA mesh topology with multiple nodes.

## Overview

Our reference setup uses three NVIDIA DGX Spark workstations connected in a triangle mesh topology. Each pair of nodes has a dedicated 100 Gbps RDMA link on its own subnet.

## Hardware Requirements

- 3+ nodes with RDMA-capable NICs (ConnectX-6/7 recommended)
- Direct-attach cables (QSFP56 for 100GbE)
- Each node needs N-1 NICs for N nodes in a fully-connected mesh

## Network Topology

### Triangle Mesh (3 Nodes)

```
                Node A
               /      \
           NIC1        NIC2
             |           |
    192.168.101.x   192.168.100.x
             |           |
           NIC1        NIC1
             |           |
         Node B ------ Node C
                NIC2
           192.168.102.x
```

### IP Address Assignment

| Link | Subnet | Node A | Node B | Node C |
|------|--------|--------|--------|--------|
| A↔B | 192.168.101.0/24 | .2 | .3 | - |
| A↔C | 192.168.100.0/24 | .2 | - | .3 |
| B↔C | 192.168.102.0/24 | - | .2 | .3 |
## Network Configuration
|
||||
|
||||
### 1. Identify NICs
|
||||
|
||||
```bash
|
||||
# List RDMA devices
|
||||
ibv_devices
|
||||
|
||||
# List network interfaces with RDMA
|
||||
ls -la /sys/class/infiniband/*/device/net/
|
||||
```

### 2. Configure IP Addresses

On **Node A** (example):

```bash
# Link to Node B
sudo ip addr add 192.168.101.2/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up

# Link to Node C
sudo ip addr add 192.168.100.2/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```

On **Node B**:

```bash
# Link to Node A
sudo ip addr add 192.168.101.3/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up

# Link to Node C
sudo ip addr add 192.168.102.2/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```

On **Node C**:

```bash
# Link to Node A
sudo ip addr add 192.168.100.3/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up

# Link to Node B
sudo ip addr add 192.168.102.3/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```

### 3. Make the Configuration Persistent

Create a netplan config (Ubuntu):

```yaml
# /etc/netplan/99-rdma-mesh.yaml
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.101.2/24   # Adjust per node
    enp1s0f1np1:
      addresses:
        - 192.168.100.2/24   # Adjust per node
```

Apply:

```bash
sudo netplan apply
```

## Verify Connectivity

### 1. Ping Test

From Node A:

```bash
ping 192.168.101.3   # Node B
ping 192.168.100.3   # Node C
```

### 2. RDMA Test

```bash
# On Node B (server)
ib_send_bw -d rocep1s0f0 -x 3

# On Node A (client)
ib_send_bw -d rocep1s0f0 -x 3 192.168.101.3
```

Expected output: ~12 GB/s for 100GbE

### 3. Verify the GID Index

```bash
# Show the GID table
show_gids

# Find the RoCE v2 GID (usually index 3)
ibv_devinfo -v | grep -A5 GID
```

## RoCE Configuration

### Enable RoCE v2

```bash
# Check the current mode
cat /sys/class/infiniband/rocep*/ports/1/gid_attrs/types/*

# Enable RoCE v2 (if needed)
echo "RoCE v2" | sudo tee /sys/class/infiniband/rocep1s0f0/ports/1/gid_attrs/types/0
```

### Configure ECN (Optional but Recommended)

```bash
# Enable ECN for RoCE
sudo sysctl -w net.ipv4.tcp_ecn=1

# Configure PFC (Priority Flow Control) on the switch, if applicable
```

## Firewall Configuration

Open ports for NCCL communication:

```bash
# TCP ports for the handshake (dynamic, 40000-50000 range)
sudo ufw allow 40000:50000/tcp

# Or disable the firewall for the mesh interfaces
sudo ufw allow in on enp1s0f0np0
sudo ufw allow in on enp1s0f1np1
```

## Troubleshooting

### No RDMA Devices Found

```bash
# Load the kernel modules
sudo modprobe ib_core
sudo modprobe mlx5_core
sudo modprobe mlx5_ib

# Check dmesg
dmesg | grep -i mlx
```

### Link Not Coming Up

```bash
# Check the physical connection
ethtool enp1s0f0np0

# Check for errors
ip -s link show enp1s0f0np0
```

### RDMA Connection Fails

```bash
# Verify the GID is populated
cat /sys/class/infiniband/rocep1s0f0/ports/1/gids/3

# Check RDMA CM
rdma link show
```

### Wrong GID Index

Try different GID indices:

```bash
export NCCL_MESH_GID_INDEX=0   # or 1, 2, 3...
```

## Scaling Beyond 3 Nodes

For N nodes in a fully-connected mesh:

- Each node needs N-1 NICs
- Total links: N*(N-1)/2
- Each link is on its own unique subnet

For 4 nodes:

```
  A
 /|\
B-+-C
 \|/
  D
```

- 6 links, 6 subnets
- Each node needs 3 NICs (a subnet-planning sketch follows)
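
To illustrate the bookkeeping, a small sketch that enumerates each node pair and assigns it its own /24, counting up from 192.168.100.0/24 (the numbering scheme is illustrative; the triangle above was assigned by hand, not by this rule):

```c
#include <stdio.h>

// Print a unique /24 subnet for each of the N*(N-1)/2 links of a
// fully-connected mesh.
int main(void)
{
    const int n = 4;        // number of nodes
    int third_octet = 100;  // third octet of the first link's subnet

    for (int a = 0; a < n; a++)
        for (int b = a + 1; b < n; b++)
            printf("node %c <-> node %c: 192.168.%d.0/24\n",
                   'A' + a, 'A' + b, third_octet++);
    return 0;
}
```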

For larger clusters, consider a **partial mesh** or **fat-tree** topology with relay routing (not yet implemented in this plugin).

## Reference: DGX Spark Mesh

Our tested configuration:

| Hostname | Management IP | Mesh IPs |
|----------|---------------|----------|
| titanic (A) | 10.0.0.170 | 192.168.100.2, 192.168.101.2 |
| iceberg (B) | 10.0.0.171 | 192.168.101.3, 192.168.102.2 |
| carpathia (C) | 10.0.0.172 | 192.168.100.3, 192.168.102.3 |