# NCCL Mesh Plugin Architecture
This document provides a deep dive into the architecture and implementation of the NCCL Mesh Plugin.
## Overview
The NCCL Mesh Plugin is a custom network transport that lets NCCL run over direct-connect RDMA mesh topologies in which each node pair sits on a different subnet, a configuration that the standard NCCL plugins cannot handle.
## The Problem
### Standard NCCL Networking
NCCL's built-in network plugins assume one of two scenarios:
- InfiniBand Fabric: All nodes connected through IB switches, sharing a single subnet
- TCP/IP Sockets: Standard IP networking with routing
### Our Topology
```
        Node A (192.168.100.2, 192.168.101.2)
               /                 \
      192.168.100.x         192.168.101.x
             /                     \
        Node C                  Node B
  (192.168.100.3,         (192.168.101.3,
   192.168.102.3)          192.168.102.2)
             \                     /
              \   192.168.102.x   /
               \-----------------/
```
Each link is on a different subnet:
- A↔B: 192.168.101.0/24
- A↔C: 192.168.100.0/24
- B↔C: 192.168.102.0/24
This means:
- No single IP can reach all peers
- Standard IB plugin fails (expects single subnet)
- TCP socket plugin would need IP routing (adds latency)
## Solution Architecture
### Key Insight
Each node has multiple NICs, each on a different subnet. When connecting to a peer, we must:
- Determine which subnet the peer is on
- Use the local NIC on that same subnet
- Establish the RDMA connection over that specific NIC pair (a minimal subnet-match check is sketched below)
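A reachable peer address is simply one whose network portion matches the network portion of a local NIC. A minimal sketch of that check, assuming each NIC record exposes its IPv4 address and netmask in network byte order as in the handle below (`mesh_same_subnet` is an illustrative helper, not the plugin's exact API):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true when the peer address lies on the
 * same IPv4 subnet as the given local NIC. Addresses and masks are
 * assumed to be in network byte order, matching mesh_handle. */
static bool mesh_same_subnet(uint32_t local_ip, uint32_t local_mask,
                             uint32_t peer_ip, uint32_t peer_mask)
{
    /* Compare network portions; the masks should match on a correctly
     * configured point-to-point link, but check both to be safe. */
    return (local_ip & local_mask) == (peer_ip & peer_mask) &&
           local_mask == peer_mask;
}
```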
### Handle Structure
The NCCL handle is expanded to advertise all local addresses:
```c
struct mesh_handle {
    uint32_t magic;               // Validation
    uint8_t  num_addrs;           // Number of addresses
    uint16_t handshake_port;      // TCP port for QP exchange
    struct mesh_addr_entry {
        uint32_t ip;              // IP address (network order)
        uint32_t mask;            // Subnet mask
        uint32_t qp_num;          // Queue Pair number
        uint8_t  nic_idx;         // Index into local NIC array
    } addrs[MESH_MAX_ADDRS];
};
```
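NCCL copies the handle by value and treats it as an opaque blob of at most `NCCL_NET_HANDLE_MAXSIZE` bytes (128 in current NCCL releases), which bounds how many addresses can be advertised. A compile-time check is a cheap way to catch an oversized handle; a sketch, assuming the vendored `nccl/net.h` provides the macro:

```c
#include "nccl/net.h"   // defines NCCL_NET_HANDLE_MAXSIZE (128 in current NCCL)

/* The multi-address handle must fit inside NCCL's fixed-size handle blob,
 * so MESH_MAX_ADDRS has to be chosen with this limit in mind. */
_Static_assert(sizeof(struct mesh_handle) <= NCCL_NET_HANDLE_MAXSIZE,
               "mesh_handle exceeds NCCL_NET_HANDLE_MAXSIZE");
```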
## Connection Flow
### Phase 1: Listen
```c
ncclResult_t mesh_listen(int dev, void *handle, void **listenComm) {
    struct mesh_handle *h = (struct mesh_handle *)handle;

    // 1. Create QPs on ALL local NICs
    for (int i = 0; i < num_nics; i++) {
        create_qp_on_nic(&nics[i]);
    }

    // 2. Start the background handshake thread
    pthread_create(&thread, NULL, handshake_thread_func, lcomm);

    // 3. Fill the handle with ALL local addresses
    h->num_addrs = num_nics;
    h->handshake_port = lcomm->handshake_port;
    for (int i = 0; i < num_nics; i++) {
        h->addrs[i].ip     = nics[i].ip_addr;
        h->addrs[i].mask   = nics[i].netmask;
        h->addrs[i].qp_num = qps[i]->qp_num;
        h->addrs[i].nic_idx = i;
    }

    *listenComm = lcomm;
    return ncclSuccess;
}
```
### Phase 2: Connect
```c
ncclResult_t mesh_connect(int dev, void *handle, void **sendComm) {
    struct mesh_handle *peer = (struct mesh_handle *)handle;
    struct mesh_nic *selected_nic = NULL;
    struct mesh_addr_entry *selected_peer_addr = NULL;

    // 1. Search the peer's advertised addresses for one we can reach
    for (int i = 0; i < peer->num_addrs && !selected_nic; i++) {
        uint32_t peer_subnet = peer->addrs[i].ip & peer->addrs[i].mask;
        // Find a local NIC on the same subnet
        for (int j = 0; j < num_local_nics; j++) {
            if (local_nics[j].subnet == peer_subnet) {
                selected_nic = &local_nics[j];
                selected_peer_addr = &peer->addrs[i];
                break;
            }
        }
    }
    if (!selected_nic) return ncclSystemError;

    // 2. Create a QP on the selected NIC
    create_qp_on_nic(selected_nic);

    // 3. Exchange QP info via a TCP handshake with the peer's thread
    send_handshake(peer_ip, peer_port, &local_qp_info, &remote_qp_info);

    // 4. Connect our QP to the peer's QP
    connect_qp(local_qp, remote_qp_info);

    return ncclSuccess;
}
```
### Phase 3: Accept
```c
ncclResult_t mesh_accept(void *listenComm, void **recvComm) {
    // Get a pre-connected QP from the handshake thread's queue
    pthread_mutex_lock(&queue_mutex);
    while (queue_empty) {
        pthread_cond_wait(&queue_cond, &queue_mutex);
    }
    entry = dequeue();
    pthread_mutex_unlock(&queue_mutex);

    // Return the ready connection
    rcomm->qp  = entry->local_qp;
    rcomm->nic = entry->nic;
    *recvComm = rcomm;
    return ncclSuccess;
}
```
## Background Handshake Thread
The handshake thread solves a critical deadlock problem:
Without the thread:
```
Rank 0: connect() → TCP connect to Rank 1 → blocks waiting for accept()
Rank 1: connect() → TCP connect to Rank 0 → blocks waiting for accept()
// DEADLOCK: Neither can call accept() because both are stuck in connect()
```
With the thread:
```
Rank 0: listen()  → starts thread → thread waits for TCP connections
Rank 1: listen()  → starts thread → thread waits for TCP connections
Rank 0: connect() → TCP connects to Rank 1's thread → gets response → returns
Rank 1: connect() → TCP connects to Rank 0's thread → gets response → returns
Rank 0: accept()  → gets QP from queue (filled by thread) → returns
Rank 1: accept()  → gets QP from queue (filled by thread) → returns
// SUCCESS: The thread handles incoming connections asynchronously
```
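The thread itself can be a plain accept loop. The sketch below only illustrates the accept, respond, and enqueue pattern described above; the helper names (`recv_qp_info`, `send_qp_info`, `create_and_connect_qp`, `enqueue`) and the `mesh_listen_comm` / `mesh_qp_info` / `mesh_conn` fields are illustrative assumptions, not the plugin's exact API:

```c
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch of the background handshake loop. */
static void *handshake_thread_func(void *arg)
{
    struct mesh_listen_comm *lcomm = arg;

    for (;;) {
        // Block until a peer's mesh_connect() dials our handshake port
        int fd = accept(lcomm->tcp_listen_fd, NULL, NULL);
        if (fd < 0) break;                 // socket closed on shutdown

        // 1. Read the peer's QP number, PSN, and GID
        struct mesh_qp_info remote;
        recv_qp_info(fd, &remote);

        // 2. Create a local QP on the matching NIC and move it to RTS
        struct mesh_conn *conn = create_and_connect_qp(lcomm, &remote);

        // 3. Reply with our own QP info so the peer can finish connect()
        send_qp_info(fd, &conn->local_info);
        close(fd);

        // 4. Hand the ready connection to a future mesh_accept() call
        pthread_mutex_lock(&queue_mutex);
        enqueue(conn);
        pthread_cond_signal(&queue_cond);
        pthread_mutex_unlock(&queue_mutex);
    }
    return NULL;
}
```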
## RDMA Queue Pair Setup
Each connection requires proper QP state transitions:
RESET → INIT → RTR → RTS
```c
int mesh_connect_qp(struct ibv_qp *qp, struct mesh_nic *nic,
                    struct mesh_handle *remote) {
    // 'remote' carries the peer's QP number, PSN, LID, and GID
    // obtained from the TCP handshake.
    struct ibv_qp_attr qp_attr = {0};

    // RESET → INIT
    qp_attr.qp_state        = IBV_QPS_INIT;
    qp_attr.pkey_index      = 0;
    qp_attr.port_num        = nic->port_num;
    qp_attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE |
                              IBV_ACCESS_REMOTE_READ |
                              IBV_ACCESS_LOCAL_WRITE;
    ibv_modify_qp(qp, &qp_attr, ...);

    // INIT → RTR (Ready to Receive)
    memset(&qp_attr, 0, sizeof(qp_attr));
    qp_attr.qp_state            = IBV_QPS_RTR;
    qp_attr.path_mtu            = IBV_MTU_4096;
    qp_attr.dest_qp_num         = remote->qp_num;
    qp_attr.rq_psn              = remote->psn;
    qp_attr.max_dest_rd_atomic  = 1;
    qp_attr.min_rnr_timer       = 12;
    qp_attr.ah_attr.dlid           = remote->lid;    // 0 for RoCE
    qp_attr.ah_attr.is_global      = 1;              // RoCE requires a GRH
    qp_attr.ah_attr.grh.dgid       = remote->gid;    // Peer's GID
    qp_attr.ah_attr.grh.sgid_index = nic->gid_index;
    qp_attr.ah_attr.port_num       = nic->port_num;
    ibv_modify_qp(qp, &qp_attr, ...);

    // RTR → RTS (Ready to Send)
    memset(&qp_attr, 0, sizeof(qp_attr));
    qp_attr.qp_state      = IBV_QPS_RTS;
    qp_attr.sq_psn        = local_psn;
    qp_attr.timeout       = 14;
    qp_attr.retry_cnt     = 7;
    qp_attr.rnr_retry     = 7;
    qp_attr.max_rd_atomic = 1;
    ibv_modify_qp(qp, &qp_attr, ...);

    return 0;
}
```
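The elided third argument to each `ibv_modify_qp` call is the attribute mask. For a plain RC connection the masks follow the standard verbs state-machine requirements; the sketch below shows what they would typically expand to, using only flags defined in `<infiniband/verbs.h>`:

```c
// Typical attribute masks for each RC QP transition (standard verbs
// requirements); shown here only to illustrate the "..." placeholders.
int init_mask = IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                IBV_QP_PORT | IBV_QP_ACCESS_FLAGS;

int rtr_mask  = IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER;

int rts_mask  = IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC;
```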
## Data Transfer
### Send Path
```c
ncclResult_t mesh_isend(void *sendComm, void *data, int size,
                        void *mhandle, void **request) {
    struct ibv_sge sge;
    struct ibv_send_wr *bad_wr = NULL;
    struct ibv_send_wr wr = {
        .wr_id      = (uint64_t)req,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    // mhandle was returned by mesh_regMr and carries the ibv_mr (mr)
    sge.addr   = (uint64_t)data;
    sge.length = size;
    sge.lkey   = mr->lkey;

    ibv_post_send(comm->qp, &wr, &bad_wr);
    return ncclSuccess;
}
```
### Receive Path
```c
ncclResult_t mesh_irecv(void *recvComm, int n, void **data,
                        int *sizes, void **mhandles, void **request) {
    struct ibv_sge sge;
    struct ibv_recv_wr *bad_wr = NULL;
    struct ibv_recv_wr wr = {
        .wr_id   = (uint64_t)req,
        .sg_list = &sge,
        .num_sge = 1,
    };
    // mhandles[0] was returned by mesh_regMr and carries the ibv_mr (mr)
    sge.addr   = (uint64_t)data[0];
    sge.length = sizes[0];
    sge.lkey   = mr->lkey;

    ibv_post_recv(comm->qp, &wr, &bad_wr);
    return ncclSuccess;
}
```
### Completion Polling
```c
ncclResult_t mesh_test(void *request, int *done, int *sizes) {
    struct ibv_wc wc;
    int ret = ibv_poll_cq(req->cq, 1, &wc);
    if (ret > 0) {
        if (wc.status == IBV_WC_SUCCESS) {
            *done = 1;
            if (sizes) *sizes = wc.byte_len;
        } else {
            // Handle error (see Debugging below for status codes)
            return ncclSystemError;
        }
    } else {
        *done = 0;  // Not complete yet
    }
    return ncclSuccess;
}
```
## Memory Registration
RDMA requires memory to be registered with the NIC:
```c
ncclResult_t mesh_regMr(void *comm, void *data, size_t size,
                        int type, void **mhandle) {
    int access = IBV_ACCESS_LOCAL_WRITE |
                 IBV_ACCESS_REMOTE_WRITE |
                 IBV_ACCESS_REMOTE_READ;
    mrh->mr = ibv_reg_mr(nic->pd, data, size, access);
    if (!mrh->mr) return ncclSystemError;
    *mhandle = mrh;
    return ncclSuccess;
}
```
**Unified Memory Note:** On Grace Hopper / DGX Spark systems, the GPU and CPU share the same physical memory via NVLink-C2C. RDMA registration therefore works directly on GPU-accessible memory without any staging copies, giving GPUDirect-like zero-copy semantics automatically.
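In the NCCL net-plugin API, the `type` argument distinguishes host and CUDA pointers (`NCCL_PTR_HOST` / `NCCL_PTR_CUDA`). On a unified-memory system both cases can take the same `ibv_reg_mr` path; the branch below is an illustrative sketch, not the plugin's exact code:

```c
// Illustrative sketch: on NVLink-C2C unified memory, NCCL_PTR_CUDA
// buffers are host-addressable, so both pointer types register the
// same way. On discrete-GPU systems this branch would instead need
// GPUDirect RDMA (nvidia-peermem) or dma-buf registration.
switch (type) {
case NCCL_PTR_HOST:
case NCCL_PTR_CUDA:   // unified memory: GPU pointer is directly registrable
    mrh->mr = ibv_reg_mr(nic->pd, data, size, access);
    break;
default:
    return ncclInternalError;
}
```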
## Performance Considerations
### Achieved Performance
- 8+ GB/s effective bandwidth
- ~64% of the 100 Gbps line rate (8 GB/s ≈ 64 Gbps)
- Zero-copy on unified memory (Grace Hopper)
- Sufficient for distributed ML workloads
### Current Bottlenecks
- Single QP: One Queue Pair per connection limits parallelism
- Completion Signaling: Every operation signals completion
- Protocol Overhead: RC transport has per-message overhead
### Future Optimizations
- Multi-QP: Multiple QPs per connection for parallelism
- Selective Signaling: Signal completion only every N operations (sketched below)
- Inline Data: Small messages embedded in WQE
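Selective signaling and inline data both map to flags on the send work request. A minimal sketch, assuming a per-connection send counter and a cached `max_inline_data` value (`SIGNAL_INTERVAL`, `comm->send_count`, and `comm->max_inline_data` are illustrative names, not current plugin fields):

```c
// Illustrative sketch of selective signaling and inline data.
#define SIGNAL_INTERVAL 16

wr.send_flags = 0;
if ((++comm->send_count % SIGNAL_INTERVAL) == 0)
    wr.send_flags |= IBV_SEND_SIGNALED;   // one CQE per N sends

// Small messages can ride inside the WQE itself, skipping the DMA
// fetch, if the QP was created with max_inline_data >= size.
if (size <= comm->max_inline_data)
    wr.send_flags |= IBV_SEND_INLINE;
```

Note that with selective signaling only every Nth send produces a completion, so `mesh_test` would have to retire the preceding unsignaled requests when that CQE arrives, and the send queue must be deep enough to hold N outstanding work requests.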
## File Structure
```
nccl-mesh-plugin/
├── src/
│   └── mesh_plugin.c     # Main implementation (~1400 lines)
├── include/
│   └── mesh_plugin.h     # Data structures and declarations
├── nccl/
│   ├── net.h             # NCCL net plugin interface
│   ├── net_v8.h          # v8 properties structure
│   └── err.h             # NCCL error codes
└── Makefile
```
## Debugging
Enable debug output:
```sh
export NCCL_DEBUG=INFO
export NCCL_MESH_DEBUG=1
```
Common issues:
- "No local NIC found": Subnet mismatch; check the IP configuration
- "Handshake timeout": Firewall blocking TCP; check the handshake ports
- "QP transition failed": Wrong GID index; try a different NCCL_MESH_GID_INDEX
- "WC error status=12": Transport retry exceeded (IBV_WC_RETRY_EXC_ERR); check RDMA connectivity (a snippet for decoding status codes follows below)
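When a completion comes back with an error, libibverbs can translate the numeric status into a readable string, which makes log lines like the one above easier to interpret; a minimal sketch of a logging helper for the `mesh_test` error path (`mesh_log_wc_error` is an illustrative name):

```c
#include <stdio.h>
#include <infiniband/verbs.h>

/* Sketch: decode a failed work completion before returning an error.
 * ibv_wc_status_str() is provided by libibverbs. */
static void mesh_log_wc_error(const struct ibv_wc *wc)
{
    fprintf(stderr,
            "NCCL MESH: WC error status=%d (%s) wr_id=%llu vendor_err=0x%x\n",
            wc->status, ibv_wc_status_str(wc->status),
            (unsigned long long)wc->wr_id, wc->vendor_err);
}
```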
## Conclusion
The NCCL Mesh Plugin demonstrates that with careful engineering, NCCL can be extended to support unconventional network topologies. The key innovations—multi-address handles, subnet-aware NIC selection, and asynchronous handshaking—provide a template for other custom NCCL transports.