# NCCL Mesh Plugin
Custom NCCL network plugin enabling distributed ML over direct-connect RDMA mesh topologies.
## 🎯 What This Does
This plugin enables NCCL (NVIDIA Collective Communications Library) to work with direct-connect mesh topologies where each node pair is on a different subnet. Standard NCCL plugins assume either:
- A switched InfiniBand fabric (all nodes on same subnet)
- TCP/IP networking (slow, high latency)
Neither works for direct-cabled RDMA meshes. This plugin does.
## 🔧 The Problem We Solved

```
              ┌─────────────┐
              │   Spark-A   │
              │  (titanic)  │
              └──────┬──────┘
192.168.101.x        │        192.168.100.x
  (100Gbps)          │          (100Gbps)
              ┌──────┴──────┐
              │             │
        ┌─────┴─────┐ ┌─────┴─────┐
        │  Spark-B  │ │  Spark-C  │
        │ (iceberg) │ │(carpathia)│
        └─────┬─────┘ └─────┬─────┘
              │             │
              └──────┬──────┘
               192.168.102.x
                 (100Gbps)
```
Three DGX Spark workstations connected in a triangle mesh with direct 100Gbps RDMA cables. Each link is on a different subnet - a configuration NVIDIA never intended to support.
## 🚀 Results
| Metric | Value |
|---|---|
| Effective Bandwidth | 8+ GB/s |
| Line Rate Utilization | ~64% |
| Topology | 3-node triangle mesh |
| Link Speed | 100 Gbps per link |
Successfully ran distributed LLM inference (Mistral-7B) across all 3 nodes using NCCL over this custom topology.
## 🏗️ Architecture

### Key Innovations

1. **Multi-Address Handle Exchange**
   - Each node advertises ALL of its subnet IPs in the NCCL handle (see the sketch after this list)
   - The connector searches the handle for a reachable address by subnet matching

2. **Subnet-Aware NIC Selection**
   - `connect()` finds the local NIC on the same subnet as the peer
   - Automatic routing without IP forwarding or bridges

3. **Background Handshake Thread**
   - Eliminates deadlock when both ranks call `connect()` simultaneously
   - TCP-based QP info exchange runs asynchronously

4. **Bidirectional QP Exchange**
   - Each connection creates fresh Queue Pairs on both sides
   - No QP reuse across multiple NCCL channels
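Here is a minimal sketch of what such a multi-address handle could look like. The struct and field names (`mesh_handle`, `mesh_addr`, `MESH_MAX_ADDRS`) are illustrative assumptions rather than the plugin's actual definitions; the only hard requirement from NCCL is that the handle fits in the fixed-size opaque buffer NCCL ships to the peer out-of-band.

```c
/* Illustrative handle layout (names are hypothetical, not the plugin's own).
 * NCCL treats the handle as an opaque, fixed-size blob exchanged out-of-band,
 * so everything the connecting side needs must be packed in here. */
#include <stdint.h>

#define MESH_MAX_ADDRS 4          /* plenty for a 3-node triangle mesh */

struct mesh_addr {
    uint32_t ip;                  /* IPv4 address of one local RDMA NIC */
    uint32_t netmask;             /* netmask of that NIC's subnet */
    uint16_t port;                /* TCP port of the handshake listener */
    uint16_t pad;
};

struct mesh_handle {
    uint32_t num_addrs;                      /* how many NICs this rank advertises */
    struct mesh_addr addrs[MESH_MAX_ADDRS];  /* one entry per local subnet */
};
```

Because every rank advertises every local subnet, the connecting side can pick whichever entry shares a subnet with one of its own NICs; the matching loop is shown under Subnet Matching below.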
### RDMA Implementation

- Raw InfiniBand Verbs API (libibverbs)
- Reliable Connected (RC) Queue Pairs
- RoCE v2 over Ethernet
- Host memory staging (GPU→Host→RDMA→Host→GPU); see the send-side sketch below
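To make the staging path concrete, here is a minimal sketch of the send side under stated assumptions: the pinned host buffer, its registered memory region, and the RC queue pair already exist, and the function name `stage_and_send` is hypothetical rather than a symbol from the plugin.

```c
/* Hypothetical host-staging send: copy GPU data into a pinned host buffer,
 * then post an RDMA send from that buffer. Setup of qp/mr/host_buf is assumed. */
#include <stdint.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

int stage_and_send(struct ibv_qp *qp, struct ibv_mr *mr,
                   void *host_buf, const void *gpu_buf, size_t len)
{
    /* GPU -> Host: stage the payload into the pre-registered, pinned host buffer */
    if (cudaMemcpy(host_buf, gpu_buf, len, cudaMemcpyDeviceToHost) != cudaSuccess)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)host_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = (uintptr_t)host_buf,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,   /* ask for a completion we can poll */
    };
    struct ibv_send_wr *bad_wr = NULL;

    /* Host -> wire: the receiver does the mirror-image copy back into its GPU */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

The receive side mirrors this: poll the completion queue, then copy from the host buffer back into GPU memory, which is exactly the extra hop that GPUDirect RDMA on the roadmap would remove.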
## 📦 Installation

### Prerequisites

```bash
# Ubuntu/Debian
sudo apt-get install libibverbs-dev librdmacm-dev

# Verify RDMA devices
ibv_devices
```
### Build

```bash
git clone https://github.com/yourusername/nccl-mesh-plugin.git
cd nccl-mesh-plugin
make
```
### Use

```bash
export LD_LIBRARY_PATH=$(pwd):$LD_LIBRARY_PATH
export NCCL_NET_PLUGIN=mesh
export NCCL_DEBUG=INFO   # or WARN for less output

# Run your distributed job
python your_distributed_script.py
```
## 🧪 Testing

### Basic All-Reduce Test

```python
import torch
import torch.distributed as dist

# RANK is this node's rank (0, 1, or 2); MASTER_IP is the address of rank 0
dist.init_process_group('nccl', rank=RANK, world_size=3,
                        init_method='tcp://MASTER_IP:29500')

t = torch.ones(1000, device='cuda')
dist.all_reduce(t)
print(f'Result: {t[0]}')  # Should print 3.0

dist.destroy_process_group()
```
### Bandwidth Benchmark

```python
import time

import torch
import torch.distributed as dist

# RANK is this node's rank (0, 1, or 2); MASTER_IP is the address of rank 0
dist.init_process_group('nccl', rank=RANK, world_size=3,
                        init_method='tcp://MASTER_IP:29500')

t = torch.ones(1024 * 1024 * 64, device='cuda')  # 64M float32 values = 256 MB

# Warmup
for _ in range(5):
    dist.all_reduce(t)
torch.cuda.synchronize()

# Benchmark
start = time.time()
for _ in range(20):
    dist.all_reduce(t)
torch.cuda.synchronize()
elapsed = time.time() - start

# 20 iterations x 256 MB of payload, reported in GB/s
print(f'Bandwidth: {(256 * 20 / 1024) / elapsed:.2f} GB/s')
```
## 🔬 How It Works

### Connection Flow

```
Rank 0 (listen)                       Rank 1 (connect)
      │                                     │
      ▼                                     │
  listen()                                  │
  ├─ Create QPs on ALL NICs                 │
  ├─ Start handshake thread                 │
  ├─ Return handle with all IPs             │
      │                                     │
      │◄────────── handle exchange ────────►│
      │                                     │
      │                                     ▼
      │                                 connect()
      │                                 ├─ Find matching subnet
      │                                 ├─ Create QP on that NIC
      │                                 ├─ TCP handshake ──────────────►│
      │                                 │                               │
      │◄─────────────────────────────────────────────────── QP info ───┤
      │                                 │                               │
      ▼                                 ▼                               ▼
  accept()                          Connect QP                [handshake thread]
  ├─ Get QP from queue              to peer's QP                ├─ Accept TCP
  └─ Return recv_comm                   │                       ├─ Create new QP
                                        │                       ├─ Connect QPs
                                        │                       └─ Queue for accept()
                                        │
                                   ┌────┴────┐
                                   │ RDMA OK │
                                   └─────────┘
```
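The `[handshake thread]` branch above can be sketched roughly as follows. The listening TCP socket is assumed to already exist, and the names (`handshake_loop`, `qp_info`, `create_and_connect_qp`, `queue_for_accept`) are hypothetical stand-ins for the plugin's real symbols; error handling is abbreviated.

```c
/* Hypothetical handshake thread: accept TCP connections from connecting peers,
 * swap QP parameters, bring up a fresh QP, and queue it for accept().
 * Launched from listen(), e.g. pthread_create(&tid, NULL, handshake_loop, &fd). */
#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

struct qp_info {              /* parameters exchanged over TCP */
    uint32_t qp_num;          /* remote QP number */
    uint32_t psn;             /* initial packet sequence number */
    uint8_t  gid[16];         /* RoCE v2 GID of the remote port */
};

/* Assumed helpers provided elsewhere in the plugin (hypothetical): */
int  create_and_connect_qp(const struct qp_info *remote, struct qp_info *local_out);
void queue_for_accept(int conn_fd);

static void *handshake_loop(void *arg)
{
    int listen_fd = *(int *)arg;

    for (;;) {
        int fd = accept(listen_fd, NULL, NULL);      /* peer called connect() */
        if (fd < 0)
            continue;

        struct qp_info remote, local;
        if (recv(fd, &remote, sizeof remote, MSG_WAITALL) != sizeof remote) {
            close(fd);
            continue;
        }

        /* Create a fresh RC QP and transition it to RTS against the peer's info */
        if (create_and_connect_qp(&remote, &local) != 0) {
            close(fd);
            continue;
        }

        /* Send our QP parameters back so the peer can finish its side */
        send(fd, &local, sizeof local, 0);

        /* Hand the ready connection to accept() running on the main path */
        queue_for_accept(fd);
    }
    return NULL;
}
```

Because this loop runs on its own thread, a rank's `accept()` path never blocks waiting for the peer's `connect()` to finish its TCP exchange, which is what removes the deadlock when both ranks connect to each other simultaneously.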
### Subnet Matching

```c
// selected_nic starts out NULL; stop as soon as a reachable peer address is found.
// For each peer address in the handle...
for (int i = 0; i < handle->num_addrs && selected_nic == NULL; i++) {
    uint32_t peer_ip = handle->addrs[i].ip;

    // ...look for a local NIC on the same subnet.
    // Example: peer 192.168.100.2 & mask 255.255.255.0 == 192.168.100.0,
    // which matches the local NIC configured on 192.168.100.x.
    for (int j = 0; j < num_nics; j++) {
        if ((peer_ip & nic[j].netmask) == nic[j].subnet) {
            selected_nic = &nic[j];   // found a matching NIC
            break;
        }
    }
}
```
## ⚙️ Configuration

| Environment Variable | Default | Description |
|---|---|---|
| `NCCL_NET_PLUGIN` | - | Set to `mesh` to use this plugin |
| `NCCL_DEBUG` | `WARN` | Set to `INFO` for detailed logs |
| `NCCL_MESH_GID_INDEX` | `3` | RoCE GID index to use (see the sketch below) |
| `NCCL_MESH_DEBUG` | `0` | Enable plugin debug output |
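As an illustration of where `NCCL_MESH_GID_INDEX` matters, the sketch below reads the variable (defaulting to 3, as in the table) and applies it to the address handle used when bringing a QP up for RoCE v2. The helper names are hypothetical, not the plugin's actual code.

```c
/* Hypothetical: read NCCL_MESH_GID_INDEX and use it as the source GID index
 * in the address handle attributes for an RC QP over RoCE v2. */
#include <infiniband/verbs.h>
#include <stdlib.h>
#include <string.h>

static int mesh_gid_index(void)
{
    const char *s = getenv("NCCL_MESH_GID_INDEX");
    return s ? atoi(s) : 3;            /* default from the table above */
}

static void fill_ah_attr(struct ibv_ah_attr *ah, const union ibv_gid *remote_gid,
                         uint8_t port_num)
{
    memset(ah, 0, sizeof *ah);
    ah->is_global      = 1;            /* RoCE always uses the GRH */
    ah->grh.dgid       = *remote_gid;  /* peer's GID, learned in the handshake */
    ah->grh.sgid_index = mesh_gid_index();
    ah->grh.hop_limit  = 64;
    ah->port_num       = port_num;
}
```

In a real RTR transition this attribute is placed in `struct ibv_qp_attr.ah_attr` before calling `ibv_modify_qp`; on RoCE v2 the GID index selects which local IP/GID the NIC stamps into the packet headers.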
## 🚧 Limitations
- Host memory staging: GPU memory goes through host (no GPUDirect RDMA yet)
- Single QP per connection: No multi-rail aggregation
- No relay routing: Non-adjacent nodes can't communicate (fine for fully-connected mesh)
- RoCE v2 only: No InfiniBand support (Ethernet only)
## 🗺️ Roadmap
- GPUDirect RDMA support (bypass host memory)
- Multi-QP per connection for higher bandwidth
- Adaptive routing for partial meshes
- Performance tuning (inline data, signaling)
## 📄 License
MIT License - see LICENSE file.
## 🙏 Acknowledgments
Built to connect three DGX Spark workstations that NVIDIA never intended to be clustered. Sometimes the best solutions come from ignoring "supported configurations."
"The future of distributed AI computing is here." - Mistral-7B, running on this very plugin