# NCCL Mesh Plugin Architecture

This document provides a deep dive into the architecture and implementation of the NCCL Mesh Plugin.

## Overview

The NCCL Mesh Plugin is a custom network transport that enables NCCL to work with direct-connect RDMA mesh topologies where each node pair is on a different subnet. This is a configuration that standard NCCL plugins cannot handle.

## The Problem

### Standard NCCL Networking

NCCL's built-in network plugins assume one of two scenarios:

1. **InfiniBand Fabric**: All nodes connected through IB switches, sharing a single subnet
2. **TCP/IP Sockets**: Standard IP networking with routing

### Our Topology

```
        Node A (192.168.100.2, 192.168.101.2)
             /                      \
     192.168.100.x            192.168.101.x
           /                          \
      Node C                       Node B
 (192.168.100.3,             (192.168.101.3,
  192.168.102.3)              192.168.102.2)
          \                         /
           \     192.168.102.x     /
            \                     /
             \-------------------/
```

Each link is on a **different subnet**:

- A↔B: 192.168.101.0/24
- A↔C: 192.168.100.0/24
- B↔C: 192.168.102.0/24

This means:

- No single IP can reach all peers
- Standard IB plugin fails (expects single subnet)
- TCP socket plugin would need IP routing (adds latency)

## Solution Architecture

### Key Insight

Each node has **multiple NICs**, each on a different subnet. When connecting to a peer, we must:

1. Determine which subnet the peer is on
2. Use the local NIC on that same subnet (see the sketch after this list)
3. Establish an RDMA connection over that specific NIC pair
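
Steps 1 and 2 reduce to a bitwise subnet comparison. Below is a minimal, self-contained sketch of that rule; the struct and helper names are illustrative, not the plugin's actual API:

```c
#include <stdint.h>
#include <stddef.h>

struct nic_info {
    uint32_t ip;    /* IPv4 address (network byte order) */
    uint32_t mask;  /* subnet mask (network byte order)  */
};

/* A local NIC can reach a peer address iff both sit on the same subnet. */
static int same_subnet(const struct nic_info *local,
                       uint32_t peer_ip, uint32_t peer_mask) {
    return (local->ip & local->mask) == (peer_ip & peer_mask);
}

/* Pick the first local NIC that shares a subnet with the peer address. */
static const struct nic_info *select_nic(const struct nic_info *nics, size_t n,
                                         uint32_t peer_ip, uint32_t peer_mask) {
    for (size_t i = 0; i < n; i++)
        if (same_subnet(&nics[i], peer_ip, peer_mask))
            return &nics[i];
    return NULL; /* peer is not a direct mesh neighbor on any local subnet */
}
```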

### Handle Structure

The NCCL handle is expanded to advertise **all** local addresses:

```c
struct mesh_handle {
    uint32_t magic;           // Validation
    uint8_t  num_addrs;       // Number of addresses
    uint16_t handshake_port;  // TCP port for QP exchange

    struct mesh_addr_entry {
        uint32_t ip;          // IP address (network order)
        uint32_t mask;        // Subnet mask
        uint32_t qp_num;      // Queue Pair number
        uint8_t  nic_idx;     // Index into local NIC array
    } addrs[MESH_MAX_ADDRS];
};
```
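
One practical constraint: NCCL treats the handle as an opaque, fixed-size blob (`NCCL_NET_HANDLE_MAXSIZE` in NCCL's `net.h`, 128 bytes in recent NCCL releases), which bounds how large `MESH_MAX_ADDRS` can be. A compile-time check along these lines catches overflow early (a sketch, assuming the struct above):

```c
/* The full multi-address handle must fit in NCCL's opaque handle buffer,
 * or the trailing address entries would be silently truncated. */
_Static_assert(sizeof(struct mesh_handle) <= 128 /* NCCL_NET_HANDLE_MAXSIZE */,
               "mesh_handle exceeds NCCL's opaque handle size");
```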

### Connection Flow

#### Phase 1: Listen

```c
ncclResult_t mesh_listen(int dev, void *handle, void **listenComm) {
    struct mesh_handle *h = (struct mesh_handle *)handle;

    // 1. Create QPs on ALL local NICs
    for (int i = 0; i < num_nics; i++) {
        create_qp_on_nic(&nics[i]);
    }

    // 2. Start the background handshake thread
    pthread_create(&thread, NULL, handshake_thread_func, lcomm);

    // 3. Fill the handle with ALL local addresses
    for (int i = 0; i < num_nics; i++) {
        h->addrs[i].ip     = nics[i].ip_addr;
        h->addrs[i].mask   = nics[i].netmask;
        h->addrs[i].qp_num = qps[i]->qp_num;
    }
    h->num_addrs = num_nics;
    return ncclSuccess;
}
```

#### Phase 2: Connect

```c
ncclResult_t mesh_connect(int dev, void *handle, void **sendComm) {
    struct mesh_handle *h = (struct mesh_handle *)handle;
    struct mesh_nic *selected_nic = NULL;
    struct mesh_addr_entry *selected_peer_addr = NULL;

    // 1. Search the peer's addresses for one we can reach
    for (int i = 0; i < h->num_addrs && !selected_nic; i++) {
        uint32_t peer_subnet = h->addrs[i].ip & h->addrs[i].mask;

        // Find a local NIC on the same subnet
        for (int j = 0; j < num_local_nics; j++) {
            if (local_nics[j].subnet == peer_subnet) {
                selected_nic       = &local_nics[j];
                selected_peer_addr = &h->addrs[i];
                break;
            }
        }
    }

    // 2. Create a QP on the selected NIC
    create_qp_on_nic(selected_nic);

    // 3. Exchange QP info via a TCP handshake
    send_handshake(peer_ip, peer_port, &local_qp_info, &remote_qp_info);

    // 4. Connect our QP to the peer's QP
    connect_qp(local_qp, remote_qp_info);
    return ncclSuccess;
}
```

#### Phase 3: Accept

```c
ncclResult_t mesh_accept(void *listenComm, void **recvComm) {
    // Get a pre-connected QP from the handshake thread's queue
    pthread_mutex_lock(&queue_mutex);
    while (queue_empty) {
        pthread_cond_wait(&queue_cond, &queue_mutex);
    }
    entry = dequeue();
    pthread_mutex_unlock(&queue_mutex);

    // Return the ready connection
    rcomm->qp  = entry->local_qp;
    rcomm->nic = entry->nic;
    return ncclSuccess;
}
```

### Background Handshake Thread

The handshake thread solves a critical deadlock problem.

**Without the thread:**

```
Rank 0: connect() → TCP connect to Rank 1 → blocks waiting for accept()
Rank 1: connect() → TCP connect to Rank 0 → blocks waiting for accept()
// DEADLOCK: Neither rank can call accept() because both are stuck in connect()
```

**With the thread:**

```
Rank 0: listen() starts thread → thread waits for TCP connections
Rank 1: listen() starts thread → thread waits for TCP connections
Rank 0: connect() → TCP connects to Rank 1's thread → gets response → returns
Rank 1: connect() → TCP connects to Rank 0's thread → gets response → returns
Rank 0: accept() → gets QP from queue (filled by thread) → returns
Rank 1: accept() → gets QP from queue (filled by thread) → returns
// SUCCESS: The thread handles incoming connections asynchronously
```
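
For reference, the thread body is essentially an accept loop: take an incoming TCP connection, swap QP parameters, finish the local QP setup, and queue the ready connection for `mesh_accept()`. A condensed sketch follows; the helpers (`read_full`, `write_full`, `setup_connection`, `enqueue`) and `struct qp_info` are illustrative, not the plugin's actual symbols:

```c
static void *handshake_thread_func(void *arg) {
    struct mesh_listen_comm *lcomm = arg;

    for (;;) {
        int fd = accept(lcomm->listen_fd, NULL, NULL);
        if (fd < 0) break;                  /* listener closed: thread exits */

        struct qp_info peer, mine;
        read_full(fd, &peer, sizeof(peer)); /* peer's QP number, PSN, GID */

        /* Pick the NIC/QP matching the subnet the peer connected on,
         * transition it toward RTR/RTS, and prepare our reply. */
        struct queue_entry *entry = setup_connection(lcomm, &peer, &mine);
        write_full(fd, &mine, sizeof(mine));
        close(fd);

        /* Hand the ready connection off to mesh_accept() */
        pthread_mutex_lock(&queue_mutex);
        enqueue(entry);
        pthread_cond_signal(&queue_cond);
        pthread_mutex_unlock(&queue_mutex);
    }
    return NULL;
}
```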

### RDMA Queue Pair Setup

Each connection requires proper QP state transitions:

```
RESET → INIT → RTR → RTS
```

```c
int mesh_connect_qp(struct ibv_qp *qp, struct mesh_nic *nic,
                    struct mesh_handle *remote) {
    // RESET → INIT
    qp_attr.qp_state        = IBV_QPS_INIT;
    qp_attr.pkey_index      = 0;
    qp_attr.port_num        = nic->port_num;
    qp_attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE |
                              IBV_ACCESS_REMOTE_READ |
                              IBV_ACCESS_LOCAL_WRITE;
    ibv_modify_qp(qp, &qp_attr, ...);

    // INIT → RTR (Ready to Receive)
    qp_attr.qp_state          = IBV_QPS_RTR;
    qp_attr.path_mtu          = IBV_MTU_4096;
    qp_attr.dest_qp_num       = remote->qp_num;
    qp_attr.rq_psn            = remote->psn;
    qp_attr.ah_attr.is_global = 1;             // GRH is mandatory for RoCE
    qp_attr.ah_attr.dlid      = remote->lid;   // 0 for RoCE
    qp_attr.ah_attr.grh.dgid  = remote->gid;   // Peer's GID
    ibv_modify_qp(qp, &qp_attr, ...);

    // RTR → RTS (Ready to Send)
    qp_attr.qp_state  = IBV_QPS_RTS;
    qp_attr.sq_psn    = local_psn;
    qp_attr.timeout   = 14;
    qp_attr.retry_cnt = 7;
    qp_attr.rnr_retry = 7;
    ibv_modify_qp(qp, &qp_attr, ...);
    return 0;
}
```
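
The `...` arguments above elide the `attr_mask` each transition requires; `ibv_modify_qp()` rejects the call if the mask does not match the attributes being changed. As a concrete example, the RESET → INIT step needs:

```c
/* attr_mask for RESET → INIT; the later transitions need their own masks
 * (e.g. IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN | ... for RTR). */
ibv_modify_qp(qp, &qp_attr,
              IBV_QP_STATE | IBV_QP_PKEY_INDEX |
              IBV_QP_PORT  | IBV_QP_ACCESS_FLAGS);
```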

### Data Transfer

#### Send Path

```c
ncclResult_t mesh_isend(void *sendComm, void *data, int size,
                        void *mhandle, void **request) {
    struct ibv_sge sge;
    struct ibv_send_wr *bad_wr;
    struct ibv_send_wr wr = {
        .wr_id      = (uint64_t)req,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };

    sge.addr   = (uint64_t)data;
    sge.length = size;
    sge.lkey   = mr->lkey;

    ibv_post_send(comm->qp, &wr, &bad_wr);
    return ncclSuccess;
}
```

#### Receive Path

```c
ncclResult_t mesh_irecv(void *recvComm, int n, void **data,
                        int *sizes, void **mhandles, void **request) {
    struct ibv_sge sge;
    struct ibv_recv_wr *bad_wr;
    struct ibv_recv_wr wr = {
        .wr_id   = (uint64_t)req,
        .sg_list = &sge,
        .num_sge = 1,
    };

    sge.addr   = (uint64_t)data[0];
    sge.length = sizes[0];
    sge.lkey   = mr->lkey;

    ibv_post_recv(comm->qp, &wr, &bad_wr);
    return ncclSuccess;
}
```

#### Completion Polling

```c
ncclResult_t mesh_test(void *request, int *done, int *sizes) {
    struct ibv_wc wc;

    int ret = ibv_poll_cq(req->cq, 1, &wc);
    if (ret > 0) {
        if (wc.status == IBV_WC_SUCCESS) {
            *done = 1;
            if (sizes) *sizes = wc.byte_len;
        } else {
            // Handle error (see "Debugging" for common WC statuses)
            return ncclInternalError;
        }
    } else {
        *done = 0;  // Not complete yet
    }
    return ncclSuccess;
}
```

## Memory Registration

RDMA requires memory to be registered with the NIC:

```c
ncclResult_t mesh_regMr(void *comm, void *data, size_t size,
                        int type, void **mhandle) {
    int access = IBV_ACCESS_LOCAL_WRITE |
                 IBV_ACCESS_REMOTE_WRITE |
                 IBV_ACCESS_REMOTE_READ;

    mrh->mr = ibv_reg_mr(nic->pd, data, size, access);
    *mhandle = mrh;
    return ncclSuccess;
}
```

**Note**: The current implementation uses host memory staging: GPU memory is copied to the host, sent via RDMA, then copied back to the GPU on the receiver. GPUDirect RDMA would eliminate these copies.
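
To make the staging cost concrete, the send side effectively does the following for each transfer. This is a simplified sketch, not the plugin's actual code: `host_buf` is assumed to be a pre-registered host bounce buffer, and error handling is abbreviated:

```c
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

/* Host-staging send (simplified): copy GPU data into a pre-registered host
 * bounce buffer, then post the RDMA send from host memory. With GPUDirect
 * RDMA, ibv_reg_mr() could register the GPU pointer directly and the
 * cudaMemcpy below would disappear. */
static int staged_send(struct ibv_qp *qp, struct ibv_mr *host_mr,
                       void *host_buf, const void *gpu_data, size_t size) {
    struct ibv_send_wr *bad_wr;
    struct ibv_sge sge = {
        .addr   = (uint64_t)host_buf,
        .length = (uint32_t)size,
        .lkey   = host_mr->lkey,   /* MR covering the bounce buffer */
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };

    /* GPU → host staging copy: the cost GPUDirect RDMA would remove */
    if (cudaMemcpy(host_buf, gpu_data, size, cudaMemcpyDeviceToHost) != cudaSuccess)
        return -1;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```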

## Performance Considerations

### Current Bottlenecks

1. **Host Memory Staging**: GPU↔Host copies add latency
2. **Single QP**: One Queue Pair per connection limits parallelism
3. **Completion Signaling**: Every operation requests a signaled completion

### Achieved Performance

- **8+ GB/s** effective bandwidth
- **~64%** of the 100 Gbps line rate (8 GB/s ≈ 64 Gbps)
- Sufficient for distributed ML workloads

### Future Optimizations

1. **GPUDirect RDMA**: Register GPU memory directly
2. **Multi-QP**: Use multiple QPs per connection
3. **Selective Signaling**: Signal only every N operations (sketched below)
4. **Inline Data**: Send small messages inline in the WQE
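
Selective signaling is a small change with outsized effect: unsignaled sends are retired implicitly when a later signaled send completes, so completion-queue processing drops by roughly a factor of N. A sketch of the idea (names are illustrative; the send queue must be sized to hold a full unsignaled batch):

```c
#include <infiniband/verbs.h>

#define SIGNAL_INTERVAL 32  /* one CQE per 32 sends */

/* Request a completion only on every SIGNAL_INTERVAL-th send. */
static int post_send_selective(struct ibv_qp *qp, struct ibv_send_wr *wr,
                               struct ibv_send_wr **bad_wr,
                               unsigned *send_count) {
    wr->send_flags = 0;
    if (++*send_count % SIGNAL_INTERVAL == 0)
        wr->send_flags = IBV_SEND_SIGNALED;
    return ibv_post_send(qp, wr, bad_wr);
}
```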

## File Structure

```
nccl-mesh-plugin/
├── src/
│   └── mesh_plugin.c    # Main implementation (~1400 lines)
├── include/
│   └── mesh_plugin.h    # Data structures and declarations
├── nccl/
│   ├── net.h            # NCCL net plugin interface
│   ├── net_v8.h         # v8 properties structure
│   └── err.h            # NCCL error codes
└── Makefile
```
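
For context on how these files hook into NCCL: at startup, NCCL dlopens the plugin library and looks up an exported `ncclNetPlugin_v8` symbol whose function table wires the `mesh_*` entry points together. A condensed sketch (abbreviated; the full `ncclNet_v8_t` interface in `nccl/net.h` has additional entry points such as `getProperties` and the close functions):

```c
/* Condensed sketch of the plugin's registration table. */
const ncclNet_v8_t ncclNetPlugin_v8 = {
    .name    = "Mesh",
    .init    = mesh_init,
    .devices = mesh_devices,
    .listen  = mesh_listen,
    .connect = mesh_connect,
    .accept  = mesh_accept,
    .regMr   = mesh_regMr,
    .isend   = mesh_isend,
    .irecv   = mesh_irecv,
    .test    = mesh_test,
    /* ... remaining entry points elided ... */
};
```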

## Debugging

Enable debug output:

```bash
export NCCL_DEBUG=INFO
export NCCL_MESH_DEBUG=1
```

Common issues:

1. **"No local NIC found"**: Subnet mismatch; check the IP configuration
2. **"Handshake timeout"**: Firewall blocking TCP; check the ports
3. **"QP transition failed"**: Wrong GID index; try a different `NCCL_MESH_GID_INDEX`
4. **"WC error status=12"**: Transport retry count exceeded (`IBV_WC_RETRY_EXC_ERR`); check RDMA connectivity

## Conclusion

The NCCL Mesh Plugin demonstrates that, with careful engineering, NCCL can be extended to support unconventional network topologies. The key innovations (multi-address handles, subnet-aware NIC selection, and asynchronous handshaking) provide a template for other custom NCCL transports.