- Enables NCCL over multi-subnet mesh topologies
- 8+ GB/s bandwidth over 100 Gbps RDMA
- Successfully tested with distributed LLM inference (Mistral-7B)
- Custom subnet-aware NIC selection
- Background handshake thread for deadlock-free connection setup
NCCL Mesh Plugin Architecture
This document provides a deep dive into the architecture and implementation of the NCCL Mesh Plugin.
Overview
The NCCL Mesh Plugin is a custom network transport that enables NCCL to work with direct-connect RDMA mesh topologies where each node pair is on a different subnet. This is a configuration that standard NCCL plugins cannot handle.
The Problem
Standard NCCL Networking
NCCL's built-in network plugins assume one of two scenarios:
- InfiniBand Fabric: All nodes connected through IB switches, sharing a single subnet
- TCP/IP Sockets: Standard IP networking with routing
Our Topology
                  Node A
     (192.168.100.2, 192.168.101.2)
           /                \
   192.168.100.x        192.168.101.x
         /                    \
    Node C                  Node B
 (192.168.100.3,       (192.168.101.3,
  192.168.102.3)        192.168.102.2)
         \                    /
          \--- 192.168.102.x ---/
Each link is on a different subnet:
- A↔B: 192.168.101.0/24
- A↔C: 192.168.100.0/24
- B↔C: 192.168.102.0/24
This means:
- No single IP can reach all peers
- Standard IB plugin fails (expects single subnet)
- TCP socket plugin would need IP routing (adds latency)
Solution Architecture
Key Insight
Each node has multiple NICs, each on a different subnet. When connecting to a peer, we must:
- Determine which subnet the peer is on
- Use the local NIC on that same subnet
- Establish RDMA connection over that specific NIC pair
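In code, the subnet match in the first two steps reduces to a masked comparison. A minimal sketch (the helper name is illustrative, not the plugin's actual function):
/* Illustrative helper: a local NIC can reach a peer address directly
 * when the masked addresses match. Both addresses and the mask are in
 * network byte order, so the bitwise AND needs no conversion. */
static int same_subnet(uint32_t local_ip, uint32_t peer_ip, uint32_t mask) {
    return (local_ip & mask) == (peer_ip & mask);
}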
Handle Structure
The NCCL handle is expanded to advertise all local addresses:
struct mesh_handle {
    uint32_t magic;              // Validation
    uint8_t  num_addrs;          // Number of addresses
    uint16_t handshake_port;     // TCP port for QP exchange
    struct mesh_addr_entry {
        uint32_t ip;             // IP address (network order)
        uint32_t mask;           // Subnet mask
        uint32_t qp_num;         // Queue Pair number
        uint8_t  nic_idx;        // Index into local NIC array
    } addrs[MESH_MAX_ADDRS];
};
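One practical constraint: NCCL treats this handle as an opaque, fixed-size blob (NCCL_NET_HANDLE_MAXSIZE in net.h, 128 bytes in current NCCL versions), so the multi-address handle must stay within that limit. A compile-time guard is a cheap sketch of that check:
/* Guard (sketch): the expanded handle must still fit in NCCL's opaque
 * handle buffer, which is NCCL_NET_HANDLE_MAXSIZE bytes. */
_Static_assert(sizeof(struct mesh_handle) <= NCCL_NET_HANDLE_MAXSIZE,
               "mesh_handle exceeds NCCL's handle size limit");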
Connection Flow
Phase 1: Listen
ncclResult_t mesh_listen(int dev, void *handle, void **listenComm) {
    // 1. Create QPs on ALL local NICs
    for (int i = 0; i < num_nics; i++) {
        create_qp_on_nic(&nics[i]);
    }

    // 2. Start background handshake thread
    pthread_create(&thread, NULL, handshake_thread_func, lcomm);

    // 3. Fill handle with ALL addresses
    for (int i = 0; i < num_nics; i++) {
        handle->addrs[i].ip     = nics[i].ip_addr;
        handle->addrs[i].mask   = nics[i].netmask;
        handle->addrs[i].qp_num = qps[i]->qp_num;
    }
}
Phase 2: Connect
ncclResult_t mesh_connect(int dev, void *handle, void **sendComm) {
    // 1. Search the peer's addresses for one we can reach directly
    for (int i = 0; i < handle->num_addrs; i++) {
        uint32_t peer_subnet = handle->addrs[i].ip & handle->addrs[i].mask;

        // Find a local NIC on the same subnet
        for (int j = 0; j < num_local_nics; j++) {
            if (local_nics[j].subnet == peer_subnet) {
                selected_nic       = &local_nics[j];
                selected_peer_addr = &handle->addrs[i];
                break;
            }
        }
        if (selected_nic) break;   // stop once a reachable address is found
    }

    // 2. Create QP on the selected NIC
    create_qp_on_nic(selected_nic);

    // 3. Exchange QP info via TCP handshake
    send_handshake(peer_ip, peer_port, &local_qp_info, &remote_qp_info);

    // 4. Connect our QP to the peer's QP
    connect_qp(local_qp, remote_qp_info);
}
Phase 3: Accept
ncclResult_t mesh_accept(void *listenComm, void **recvComm) {
    // Get a pre-connected QP from the handshake thread's queue
    pthread_mutex_lock(&queue_mutex);
    while (queue_empty) {
        pthread_cond_wait(&queue_cond, &queue_mutex);
    }
    entry = dequeue();
    pthread_mutex_unlock(&queue_mutex);

    // Return the ready connection
    rcomm->qp  = entry->local_qp;
    rcomm->nic = entry->nic;
}
Background Handshake Thread
The handshake thread solves a critical deadlock problem:
Without thread:
Rank 0: connect() → TCP connect to Rank 1 → blocks waiting for accept()
Rank 1: connect() → TCP connect to Rank 0 → blocks waiting for accept()
// DEADLOCK: Neither can call accept() because both stuck in connect()
With thread:
Rank 0: listen() starts thread → thread waits for TCP connections
Rank 1: listen() starts thread → thread waits for TCP connections
Rank 0: connect() → TCP connects to Rank 1's thread → gets response → returns
Rank 1: connect() → TCP connects to Rank 0's thread → gets response → returns
Rank 0: accept() → gets QP from queue (filled by thread) → returns
Rank 1: accept() → gets QP from queue (filled by thread) → returns
// SUCCESS: Thread handles incoming connections asynchronously
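A minimal sketch of that thread body, assuming a listening TCP socket created in mesh_listen() and hypothetical helpers (mesh_fill_local_qp_info, enqueue) alongside the queue primitives used by mesh_accept() above:
/* Sketch only: accept handshake connections, swap QP info with the peer,
 * and queue the resulting connection for mesh_accept() to pick up. */
static void *handshake_thread_func(void *arg) {
    struct mesh_listen_comm *lcomm = arg;            // hypothetical type
    for (;;) {
        int fd = accept(lcomm->tcp_fd, NULL, NULL);  // blocks until a peer connects
        if (fd < 0) continue;

        struct mesh_qp_info peer_info, local_info;   // hypothetical QP-info struct
        recv(fd, &peer_info, sizeof(peer_info), MSG_WAITALL);   // peer sends first
        mesh_fill_local_qp_info(lcomm, &peer_info, &local_info);
        send(fd, &local_info, sizeof(local_info), 0);           // reply with ours
        close(fd);

        // Hand the now-connected QP to mesh_accept() via the shared queue
        pthread_mutex_lock(&queue_mutex);
        enqueue(lcomm, &peer_info);
        pthread_cond_signal(&queue_cond);
        pthread_mutex_unlock(&queue_mutex);
    }
    return NULL;
}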
RDMA Queue Pair Setup
Each connection requires proper QP state transitions:
RESET → INIT → RTR → RTS
int mesh_connect_qp(struct ibv_qp *qp, struct mesh_nic *nic,
                    struct mesh_handle *remote) {
    // RESET → INIT
    qp_attr.qp_state        = IBV_QPS_INIT;
    qp_attr.pkey_index      = 0;
    qp_attr.port_num        = nic->port_num;
    qp_attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE |
                              IBV_ACCESS_REMOTE_READ |
                              IBV_ACCESS_LOCAL_WRITE;
    ibv_modify_qp(qp, &qp_attr, ...);

    // INIT → RTR (Ready to Receive)
    qp_attr.qp_state         = IBV_QPS_RTR;
    qp_attr.path_mtu         = IBV_MTU_4096;
    qp_attr.dest_qp_num      = remote->qp_num;
    qp_attr.rq_psn           = remote->psn;
    qp_attr.ah_attr.dlid     = remote->lid;   // 0 for RoCE
    qp_attr.ah_attr.grh.dgid = remote->gid;   // Peer's GID
    ibv_modify_qp(qp, &qp_attr, ...);

    // RTR → RTS (Ready to Send)
    qp_attr.qp_state  = IBV_QPS_RTS;
    qp_attr.sq_psn    = local_psn;
    qp_attr.timeout   = 14;
    qp_attr.retry_cnt = 7;
    qp_attr.rnr_retry = 7;
    ibv_modify_qp(qp, &qp_attr, ...);
}
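For reference, the attribute masks elided above ("...") are the standard libibverbs masks for RC queue pairs; each mask also requires the corresponding qp_attr fields (for example max_dest_rd_atomic and min_rnr_timer for RTR) to be filled in:
/* Standard ibv_modify_qp attribute masks for the RC transitions shown above. */
int init_mask = IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT |
                IBV_QP_ACCESS_FLAGS;
int rtr_mask  = IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER;
int rts_mask  = IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC;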
Data Transfer
Send Path
ncclResult_t mesh_isend(void *sendComm, void *data, int size,
                        void *mhandle, void **request) {
    struct ibv_send_wr wr = {
        .wr_id      = (uint64_t)req,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    sge.addr   = (uint64_t)data;
    sge.length = size;
    sge.lkey   = mr->lkey;

    ibv_post_send(comm->qp, &wr, &bad_wr);
}
Receive Path
ncclResult_t mesh_irecv(void *recvComm, int n, void **data,
                        int *sizes, void **mhandles, void **request) {
    struct ibv_recv_wr wr = {
        .wr_id   = (uint64_t)req,
        .sg_list = &sge,
        .num_sge = 1,
    };
    sge.addr   = (uint64_t)data[0];
    sge.length = sizes[0];
    sge.lkey   = mr->lkey;

    ibv_post_recv(comm->qp, &wr, &bad_wr);
}
Completion Polling
ncclResult_t mesh_test(void *request, int *done, int *sizes) {
    struct ibv_wc wc;
    int ret = ibv_poll_cq(req->cq, 1, &wc);
    if (ret > 0) {
        if (wc.status == IBV_WC_SUCCESS) {
            *done = 1;
            if (sizes) *sizes = wc.byte_len;
        } else {
            // Handle error
        }
    } else {
        *done = 0;   // Not complete yet
    }
}
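When the status is not IBV_WC_SUCCESS, libibverbs can turn the numeric code into a readable message, which makes the errors listed under Debugging below easier to diagnose. A sketch of what the error branch above might log:
/* Sketch of the error branch: ibv_wc_status_str() is part of libibverbs
 * and maps, e.g., status 12 to "transport retry counter exceeded". */
if (wc.status != IBV_WC_SUCCESS) {
    fprintf(stderr, "NCCL MESH: work completion failed: %s (status=%d, qp_num=%u)\n",
            ibv_wc_status_str(wc.status), wc.status, wc.qp_num);
}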
Memory Registration
RDMA requires memory to be registered with the NIC:
ncclResult_t mesh_regMr(void *comm, void *data, size_t size,
                        int type, void **mhandle) {
    int access = IBV_ACCESS_LOCAL_WRITE |
                 IBV_ACCESS_REMOTE_WRITE |
                 IBV_ACCESS_REMOTE_READ;
    mrh->mr = ibv_reg_mr(nic->pd, data, size, access);
    *mhandle = mrh;
}
Note: The current implementation uses host-memory staging. GPU buffers are copied to host memory, sent via RDMA, then copied back to the GPU on the receiver. GPUDirect RDMA would eliminate these copies.
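A hedged sketch of that staging path on the send side (mesh_isend_gpu, host_staging, and staging_mhandle are illustrative names, not the plugin's real entry points):
/* Illustrative send-side staging: copy GPU data into a pre-registered
 * host bounce buffer, then post the RDMA send from host memory. */
ncclResult_t mesh_isend_gpu(struct mesh_send_comm *comm, void *gpu_data,
                            int size, void **request) {
    cudaMemcpy(comm->host_staging, gpu_data, size, cudaMemcpyDeviceToHost);
    return mesh_isend(comm, comm->host_staging, size,
                      comm->staging_mhandle, request);
}
/* The receive side mirrors this: the RDMA payload lands in a host buffer,
 * then cudaMemcpy(..., cudaMemcpyHostToDevice) moves it to the GPU. */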
Performance Considerations
Current Bottlenecks
- Host Memory Staging: GPU↔Host copies add latency
- Single QP: One Queue Pair per connection limits parallelism
- Completion Signaling: Every operation signals completion
Achieved Performance
- 8+ GB/s effective bandwidth
- ~64% of 100 Gbps line rate
- Sufficient for distributed ML workloads
Future Optimizations
- GPUDirect RDMA: Register GPU memory directly
- Multi-QP: Multiple QPs per connection
- Selective Signaling: Signal completion only every N operations (see the sketch after this list)
- Inline Data: Small messages in WQE
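Of these, selective signaling is the simplest to sketch. Assuming the QP is created with sq_sig_all = 0 and a per-connection send counter (send_count here is illustrative), only every Nth send requests a completion; on a reliable-connection QP, that completion also confirms all earlier sends on the same queue:
/* Sketch of selective signaling (not in the current plugin). */
#define SIGNAL_INTERVAL 32

static void post_send_selective(struct mesh_send_comm *comm,
                                struct ibv_send_wr *wr) {
    struct ibv_send_wr *bad_wr;
    // Request a CQE only every SIGNAL_INTERVAL sends
    wr->send_flags = (++comm->send_count % SIGNAL_INTERVAL == 0)
                         ? IBV_SEND_SIGNALED : 0;
    ibv_post_send(comm->qp, wr, &bad_wr);
}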
File Structure
nccl-mesh-plugin/
├── src/
│ └── mesh_plugin.c # Main implementation (~1400 lines)
├── include/
│ └── mesh_plugin.h # Data structures and declarations
├── nccl/
│ ├── net.h # NCCL net plugin interface
│ ├── net_v8.h # v8 properties structure
│ └── err.h # NCCL error codes
└── Makefile
Debugging
Enable debug output:
export NCCL_DEBUG=INFO
export NCCL_MESH_DEBUG=1
Common issues:
- "No local NIC found": Subnet mismatch, check IP configuration
- "Handshake timeout": Firewall blocking TCP, check ports
- "QP transition failed": GID index wrong, try different
NCCL_MESH_GID_INDEX - "WC error status=12": Transport retry exceeded, check RDMA connectivity
Conclusion
The NCCL Mesh Plugin demonstrates that with careful engineering, NCCL can be extended to support unconventional network topologies. The key innovations—multi-address handles, subnet-aware NIC selection, and asynchronous handshaking—provide a template for other custom NCCL transports.