NCCL Mesh Plugin Architecture

This document provides a deep dive into the architecture and implementation of the NCCL Mesh Plugin.

Overview

The NCCL Mesh Plugin is a custom network transport that enables NCCL to work with direct-connect RDMA mesh topologies where each node pair is on a different subnet, a configuration that standard NCCL plugins cannot handle.

The Problem

Standard NCCL Networking

NCCL's built-in network plugins assume one of two scenarios:

  1. InfiniBand Fabric: All nodes connected through IB switches, sharing a single subnet
  2. TCP/IP Sockets: Standard IP networking with routing

Our Topology

     Node A (192.168.100.2, 192.168.101.2)
              /                \
    192.168.100.x         192.168.101.x
            /                    \
    Node C                      Node B
(192.168.100.3,            (192.168.101.3,
 192.168.102.3)             192.168.102.2)
            \                    /
             \   192.168.102.x  /
              \                /
               \--------------/

Each link is on a different subnet:

  • A↔B: 192.168.101.0/24
  • A↔C: 192.168.100.0/24
  • B↔C: 192.168.102.0/24

This means:

  • No single IP can reach all peers
  • Standard IB plugin fails (expects single subnet)
  • TCP socket plugin would need IP routing (adds latency)

Solution Architecture

Key Insight

Each node has multiple NICs, each on a different subnet. When connecting to a peer, we must:

  1. Determine which subnet the peer is on
  2. Use the local NIC on that same subnet
  3. Establish RDMA connection over that specific NIC pair

Handle Structure

The NCCL handle is expanded to advertise all local addresses:

struct mesh_handle {
    uint32_t magic;              // Validation
    uint8_t  num_addrs;          // Number of addresses
    uint16_t handshake_port;     // TCP port for QP exchange
    
    struct mesh_addr_entry {
        uint32_t ip;             // IP address (network order)
        uint32_t mask;           // Subnet mask
        uint32_t qp_num;         // Queue Pair number
        uint8_t  nic_idx;        // Index into local NIC array
    } addrs[MESH_MAX_ADDRS];
};
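
The reachability test implied by this handle is a masked comparison of each peer address entry against each local NIC. A minimal sketch of that check (the helper name is illustrative; mesh_connect below performs the same comparison inline using a precomputed subnet field):

// Illustrative local-NIC descriptor; the plugin's full struct has more fields.
struct mesh_nic {
    uint32_t ip_addr;    // local IP address (network order)
    uint32_t netmask;    // subnet mask (network order)
};

// Return the index of the local NIC that shares a subnet with the peer's
// address entry, or -1 if that peer address is unreachable from this node.
static int mesh_find_local_nic(const struct mesh_addr_entry *peer,
                               const struct mesh_nic *nics, int num_nics)
{
    for (int i = 0; i < num_nics; i++) {
        if ((peer->ip & peer->mask) == (nics[i].ip_addr & nics[i].netmask))
            return i;
    }
    return -1;
}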

Connection Flow

Phase 1: Listen

ncclResult_t mesh_listen(int dev, void *handle, void **listenComm) {
    // 1. Create QPs on ALL local NICs
    for (int i = 0; i < num_nics; i++) {
        create_qp_on_nic(&nics[i]);
    }
    
    // 2. Start background handshake thread
    pthread_create(&thread, NULL, handshake_thread_func, lcomm);
    
    // 3. Fill handle with ALL addresses so the peer can pick a reachable one
    handle->num_addrs = num_nics;
    handle->handshake_port = handshake_port;
    for (int i = 0; i < num_nics; i++) {
        handle->addrs[i].ip = nics[i].ip_addr;
        handle->addrs[i].mask = nics[i].netmask;
        handle->addrs[i].qp_num = qps[i]->qp_num;
    }
}

Phase 2: Connect

ncclResult_t mesh_connect(int dev, void *handle, void **sendComm) {
    // 1. Search peer's addresses for one reachable from a local NIC
    for (int i = 0; i < handle->num_addrs && !selected_nic; i++) {
        uint32_t peer_subnet = handle->addrs[i].ip & handle->addrs[i].mask;
        
        // Find local NIC on same subnet
        for (int j = 0; j < num_local_nics; j++) {
            if (local_nics[j].subnet == peer_subnet) {
                selected_nic = &local_nics[j];
                selected_peer_addr = &handle->addrs[i];
                break;
            }
        }
    }
    if (!selected_nic) return ncclSystemError;   // "No local NIC found"
    
    // 2. Create QP on selected NIC
    create_qp_on_nic(selected_nic);
    
    // 3. Exchange QP info via TCP handshake with the peer's listener thread
    send_handshake(peer_ip, peer_port, &local_qp_info, &remote_qp_info);
    
    // 4. Connect QP to peer's QP
    connect_qp(local_qp, remote_qp_info);
}

Phase 3: Accept

ncclResult_t mesh_accept(void *listenComm, void **recvComm) {
    // Get pre-connected QP from handshake thread's queue
    pthread_mutex_lock(&queue_mutex);
    while (queue_empty) {
        pthread_cond_wait(&queue_cond, &queue_mutex);
    }
    entry = dequeue();
    pthread_mutex_unlock(&queue_mutex);
    
    // Return the ready connection
    rcomm->qp = entry->local_qp;
    rcomm->nic = entry->nic;
}

Background Handshake Thread

The handshake thread solves a critical deadlock problem:

Without thread:

Rank 0: connect() → TCP connect to Rank 1 → blocks waiting for accept()
Rank 1: connect() → TCP connect to Rank 0 → blocks waiting for accept()
// DEADLOCK: Neither can call accept() because both stuck in connect()

With thread:

Rank 0: listen() starts thread → thread waits for TCP connections
Rank 1: listen() starts thread → thread waits for TCP connections
Rank 0: connect() → TCP connects to Rank 1's thread → gets response → returns
Rank 1: connect() → TCP connects to Rank 0's thread → gets response → returns
Rank 0: accept() → gets QP from queue (filled by thread) → returns
Rank 1: accept() → gets QP from queue (filled by thread) → returns
// SUCCESS: Thread handles incoming connections asynchronously
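
In code terms, the thread is a small accept loop. The sketch below captures its shape; the helper names, the mesh_qp_info wire struct, and the comm fields are illustrative stand-ins for the real implementation in mesh_plugin.c:

static void *handshake_thread_func(void *arg) {
    struct mesh_listen_comm *lcomm = arg;            // assumed comm type

    while (!lcomm->stop) {
        // 1. Accept an incoming handshake connection from a peer's connect()
        int fd = accept(lcomm->listen_fd, NULL, NULL);
        if (fd < 0) continue;

        // 2. Read the peer's QP number, PSN and GID
        struct mesh_qp_info peer_info;               // assumed wire format
        recv(fd, &peer_info, sizeof(peer_info), MSG_WAITALL);

        // 3. Pick the local NIC on the peer's subnet and bring a QP up
        struct mesh_nic *nic = find_nic_for_peer(&peer_info);    // illustrative
        struct ibv_qp *qp = create_qp_on_nic(nic);
        connect_qp(qp, &peer_info);

        // 4. Reply with our own QP info so the peer can finish its side
        struct mesh_qp_info local_info = make_qp_info(qp, nic);  // illustrative
        send(fd, &local_info, sizeof(local_info), 0);
        close(fd);

        // 5. Hand the ready connection to mesh_accept() via the queue
        pthread_mutex_lock(&queue_mutex);
        enqueue(qp, nic);
        pthread_cond_signal(&queue_cond);
        pthread_mutex_unlock(&queue_mutex);
    }
    return NULL;
}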

RDMA Queue Pair Setup

Each connection requires proper QP state transitions:

RESET → INIT → RTR → RTS

int mesh_connect_qp(struct ibv_qp *qp, struct mesh_nic *nic,
                    struct mesh_handle *remote) {
    // RESET → INIT
    qp_attr.qp_state = IBV_QPS_INIT;
    qp_attr.pkey_index = 0;
    qp_attr.port_num = nic->port_num;
    qp_attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE | 
                              IBV_ACCESS_REMOTE_READ |
                              IBV_ACCESS_LOCAL_WRITE;
    ibv_modify_qp(qp, &qp_attr, ...);
    
    // INIT → RTR (Ready to Receive)
    qp_attr.qp_state = IBV_QPS_RTR;
    qp_attr.path_mtu = IBV_MTU_4096;
    qp_attr.dest_qp_num = remote->qp_num;
    qp_attr.rq_psn = remote->psn;
    qp_attr.ah_attr.dlid = remote->lid;  // 0 for RoCE
    qp_attr.ah_attr.grh.dgid = remote->gid;  // Peer's GID
    ibv_modify_qp(qp, &qp_attr, ...);
    
    // RTR → RTS (Ready to Send)
    qp_attr.qp_state = IBV_QPS_RTS;
    qp_attr.sq_psn = local_psn;
    qp_attr.timeout = 14;
    qp_attr.retry_cnt = 7;
    qp_attr.rnr_retry = 7;
    ibv_modify_qp(qp, &qp_attr, ...);
}
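
For reference, the elided third argument to ibv_modify_qp is the attribute mask, which must name every field being set. The RESET → INIT transition above, spelled out with standard libibverbs mask bits:

struct ibv_qp_attr qp_attr = {
    .qp_state        = IBV_QPS_INIT,
    .pkey_index      = 0,
    .port_num        = nic->port_num,
    .qp_access_flags = IBV_ACCESS_LOCAL_WRITE |
                       IBV_ACCESS_REMOTE_WRITE |
                       IBV_ACCESS_REMOTE_READ,
};
// ibv_modify_qp() returns 0 on success, or an errno value on failure
int rc = ibv_modify_qp(qp, &qp_attr,
                       IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                       IBV_QP_PORT | IBV_QP_ACCESS_FLAGS);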

Data Transfer

Send Path

ncclResult_t mesh_isend(void *sendComm, void *data, int size,
                        void *mhandle, void **request) {
    struct ibv_sge sge;
    struct ibv_send_wr *bad_wr = NULL;
    struct ibv_send_wr wr = {
        .wr_id = (uint64_t)req,
        .sg_list = &sge,
        .num_sge = 1,
        .opcode = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    
    sge.addr = (uint64_t)data;
    sge.length = size;
    sge.lkey = mr->lkey;      // mr was registered in mesh_regMr (via mhandle)
    
    ibv_post_send(comm->qp, &wr, &bad_wr);
}

Receive Path

ncclResult_t mesh_irecv(void *recvComm, int n, void **data,
                        int *sizes, void **mhandles, void **request) {
    struct ibv_sge sge;
    struct ibv_recv_wr *bad_wr = NULL;
    struct ibv_recv_wr wr = {
        .wr_id = (uint64_t)req,
        .sg_list = &sge,
        .num_sge = 1,
    };
    
    sge.addr = (uint64_t)data[0];
    sge.length = sizes[0];
    sge.lkey = mr->lkey;      // mr comes from mhandles[0]
    
    ibv_post_recv(comm->qp, &wr, &bad_wr);
}

Completion Polling

ncclResult_t mesh_test(void *request, int *done, int *sizes) {
    struct ibv_wc wc;
    
    int ret = ibv_poll_cq(req->cq, 1, &wc);
    if (ret > 0) {
        if (wc.status == IBV_WC_SUCCESS) {
            *done = 1;
            if (sizes) *sizes = wc.byte_len;
        } else {
            // e.g. IBV_WC_RETRY_EXC_ERR (status 12): transport retry exceeded
            return ncclSystemError;
        }
    } else {
        *done = 0;  // Not complete yet
    }
}

Memory Registration

RDMA requires memory to be registered with the NIC:

ncclResult_t mesh_regMr(void *comm, void *data, size_t size,
                        int type, void **mhandle) {
    int access = IBV_ACCESS_LOCAL_WRITE | 
                 IBV_ACCESS_REMOTE_WRITE |
                 IBV_ACCESS_REMOTE_READ;
    
    mrh->mr = ibv_reg_mr(nic->pd, data, size, access);
    *mhandle = mrh;
}

Note: The current implementation uses host memory staging. GPU memory is copied to host memory, sent via RDMA, then copied back to GPU memory on the receiver. GPUDirect RDMA would eliminate these copies.
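
A sketch of what that staging looks like on the send side, assuming a host bounce buffer (comm->staging) that was registered with ibv_reg_mr at setup time; the buffer and field names here are illustrative:

// GPU buffer → pre-registered host bounce buffer → RDMA send.
// With GPUDirect RDMA the GPU pointer would be posted directly instead.
cudaError_t err = cudaMemcpy(comm->staging, gpu_data, size,
                             cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
    return ncclInternalError;

sge.addr   = (uint64_t)comm->staging;
sge.length = size;
sge.lkey   = comm->staging_mr->lkey;   // lkey of the registered host buffer
ibv_post_send(comm->qp, &wr, &bad_wr);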

Performance Considerations

Current Bottlenecks

  1. Host Memory Staging: GPU↔Host copies add latency
  2. Single QP: One Queue Pair per connection limits parallelism
  3. Completion Signaling: Every operation signals completion

Achieved Performance

  • 8+ GB/s effective bandwidth (8 GB/s ≈ 64 Gbps)
  • ~64% of the 100 Gbps line rate
  • Sufficient for distributed ML workloads

Future Optimizations

  1. GPUDirect RDMA: Register GPU memory directly
  2. Multi-QP: Multiple QPs per connection
  3. Selective Signaling: Signal every N operations (see the sketch after this list)
  
  4. Inline Data: Small messages in WQE
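
As an illustration of selective signaling, only every Nth work request sets IBV_SEND_SIGNALED, so the NIC writes far fewer completion entries; on a reliable-connection QP the signaled completion also retires the preceding unsignaled sends. A sketch (the counter field and interval are assumptions):

#define MESH_SIGNAL_INTERVAL 32   // illustrative; would be tuned empirically

// Request a CQE only for every Nth send on this QP.
wr.send_flags = 0;
if (++comm->send_count % MESH_SIGNAL_INTERVAL == 0)
    wr.send_flags = IBV_SEND_SIGNALED;
ibv_post_send(comm->qp, &wr, &bad_wr);

Because mesh_test currently matches one completion per request, adopting this would also require tracking which outstanding requests a signaled completion retires, which is why it is listed as future work.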

File Structure

nccl-mesh-plugin/
├── src/
│   └── mesh_plugin.c      # Main implementation (~1400 lines)
├── include/
│   └── mesh_plugin.h      # Data structures and declarations
├── nccl/
│   ├── net.h              # NCCL net plugin interface
│   ├── net_v8.h           # v8 properties structure
│   └── err.h              # NCCL error codes
└── Makefile

Debugging

Enable debug output:

export NCCL_DEBUG=INFO
export NCCL_MESH_DEBUG=1
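
Inside the plugin, NCCL_MESH_DEBUG gates the extra logging. A minimal sketch of such a gate (the macro name and output format are assumptions, not necessarily what mesh_plugin.c uses):

static int mesh_debug_enabled;

// Read NCCL_MESH_DEBUG once at plugin init.
static void mesh_debug_init(void) {
    const char *env = getenv("NCCL_MESH_DEBUG");
    mesh_debug_enabled = (env != NULL && atoi(env) != 0);
}

#define MESH_DEBUG(fmt, ...) \
    do { if (mesh_debug_enabled) \
        fprintf(stderr, "[mesh] " fmt "\n", ##__VA_ARGS__); } while (0)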

Common issues:

  1. "No local NIC found": Subnet mismatch, check IP configuration
  2. "Handshake timeout": Firewall blocking TCP, check ports
  3. "QP transition failed": GID index wrong, try different NCCL_MESH_GID_INDEX
  4. "WC error status=12": Transport retry exceeded, check RDMA connectivity

Conclusion

The NCCL Mesh Plugin demonstrates that with careful engineering, NCCL can be extended to support unconventional network topologies. The key innovations—multi-address handles, subnet-aware NIC selection, and asynchronous handshaking—provide a template for other custom NCCL transports.