NCCL Mesh Plugin

Custom NCCL network plugin enabling distributed ML over direct-connect RDMA mesh topologies.

License: MIT

🎯 What This Does

This plugin enables NCCL (NVIDIA Collective Communications Library) to work with direct-connect mesh topologies where each node pair is on a different subnet. Standard NCCL plugins assume either:

  • A switched InfiniBand fabric (all nodes on same subnet)
  • TCP/IP networking (slow, high latency)

Neither works for direct-cabled RDMA meshes. This plugin does.

🔧 The Problem We Solved

                    ┌─────────────┐
                    │   Spark-A   │
                    │  (titanic)  │
                    └──────┬──────┘
           192.168.101.x   │   192.168.100.x
              (100Gbps)    │      (100Gbps)
                    ┌──────┴──────┐
                    │             │
              ┌─────┴─────┐ ┌─────┴─────┐
              │  Spark-B  │ │  Spark-C  │
              │ (iceberg) │ │(carpathia)│
              └─────┬─────┘ └─────┬─────┘
                    │             │
                    └──────┬──────┘
                    192.168.102.x
                      (100Gbps)

Three NVIDIA DGX Spark workstations connected in a triangle mesh with direct 100Gbps RDMA cables. Each link is on a different subnet - a configuration NVIDIA never intended to support.

🚀 Results

Metric                  Value
Effective Bandwidth     8+ GB/s
Line Rate Utilization   ~64%
Topology                3-node triangle mesh
Link Speed              100 Gbps per link
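
For context: 100 Gbps of line rate is 12.5 GB/s, so the measured 8+ GB/s works out to roughly 64% utilization per link.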

Successfully ran distributed LLM inference (Mistral-7B) across all 3 nodes using NCCL over this custom topology.

Unified Memory Advantage

On Grace-based systems such as the DGX Spark (Grace Blackwell) and Grace Hopper, the GPU and CPU share the same physical memory via NVLink-C2C. This unified memory architecture means:

  • No staging copies - RDMA operates directly on GPU-accessible memory
  • GPUDirect-like performance - Without additional kernel modules or configuration
  • Simplified memory management - Register once, use everywhere

The 8+ GB/s figure is genuine end-to-end bandwidth, not throttled by GPU↔Host staging copies.
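
As a rough illustration of the "register once" point, the sketch below registers a plain host allocation for RDMA; on a unified-memory system that same buffer is directly usable by the GPU. This is a minimal sketch under assumed names (register_unified_buffer is illustrative, not the plugin's actual registration path):

#include <infiniband/verbs.h>
#include <stdlib.h>

/* Illustrative only: on Grace-class unified memory, an ordinary allocation is
 * visible to both CPU and GPU, so one ibv_reg_mr() covers RDMA and compute. */
struct ibv_mr *register_unified_buffer(struct ibv_pd *pd, size_t bytes, void **buf_out)
{
    void *buf = malloc(bytes);            /* GPU-accessible on unified memory */
    if (!buf)
        return NULL;

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, bytes,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) {
        free(buf);
        return NULL;
    }
    *buf_out = buf;                       /* register once, use everywhere   */
    return mr;
}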

🏗️ Architecture

Key Innovations

  1. Multi-Address Handle Exchange

    • Each node advertises ALL its subnet IPs in the NCCL handle
    • Connector searches for reachable addresses by subnet matching
  2. Subnet-Aware NIC Selection

    • connect() finds the local NIC on the same subnet as the peer
    • Automatic routing without IP forwarding or bridges
  3. Background Handshake Thread

    • Eliminates deadlock when both ranks call connect() simultaneously
    • TCP-based QP info exchange runs asynchronously
  4. Bidirectional QP Exchange

    • Each connection creates fresh Queue Pairs on both sides
    • No QP reuse across multiple NCCL channels
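
To make the first two innovations concrete, here is a hedged sketch of what a multi-address handle could carry. The struct and field names are assumptions for illustration, not the plugin's actual definitions; the real handle must also fit within NCCL's fixed handle size (NCCL_NET_HANDLE_MAXSIZE).

#include <stdint.h>

/* Sketch only: names and sizes are illustrative, not the plugin's real layout. */
#define MESH_MAX_ADDRS 4

struct mesh_addr {
    uint32_t ip;         /* IPv4 address of one local RDMA NIC                */
    uint32_t netmask;    /* its netmask, used by the peer for subnet matching */
    uint16_t tcp_port;   /* port of this node's background handshake listener */
};

struct mesh_handle {
    int32_t          num_addrs;              /* how many NICs are advertised  */
    struct mesh_addr addrs[MESH_MAX_ADDRS];  /* one entry per local subnet    */
};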

RDMA Implementation

  • Raw InfiniBand Verbs API (libibverbs)
  • Reliable Connected (RC) Queue Pairs
  • RoCE v2 over Ethernet
  • Zero-copy on unified memory systems
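
For reference, here is a hedged sketch of the verbs objects behind each connection: an RC Queue Pair created against a completion queue. Queue depths and error handling are illustrative placeholders, not the plugin's actual values.

#include <infiniband/verbs.h>

/* Sketch only: the QP still has to be driven through INIT -> RTR -> RTS after
 * the handshake exchanges QP numbers, PSNs, and GIDs. */
static struct ibv_qp *create_rc_qp(struct ibv_pd *pd, struct ibv_cq *cq)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,            /* Reliable Connected, as noted above */
        .cap = {
            .max_send_wr  = 128,
            .max_recv_wr  = 128,
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };
    return ibv_create_qp(pd, &attr);
}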

📦 Installation

Prerequisites

# Ubuntu/Debian
sudo apt-get install libibverbs-dev librdmacm-dev

# Verify RDMA devices
ibv_devices

Build

git clone https://github.com/autoscriptlabs/nccl-mesh-plugin.git
cd nccl-mesh-plugin
make

Use

export LD_LIBRARY_PATH=$(pwd):$LD_LIBRARY_PATH
export NCCL_NET_PLUGIN=mesh
export NCCL_DEBUG=INFO  # or WARN for less output

# Run your distributed job
python your_distributed_script.py

🧪 Testing

Basic All-Reduce Test

import torch
import torch.distributed as dist

# RANK (0-2) and MASTER_IP (rank 0's address) are placeholders for your cluster
dist.init_process_group('nccl', rank=RANK, world_size=3,
    init_method='tcp://MASTER_IP:29500')

t = torch.ones(1000, device='cuda')
dist.all_reduce(t)
print(f'Result: {t[0].item()}')  # Sum of ones across 3 ranks -> 3.0

dist.destroy_process_group()

Bandwidth Benchmark

import torch
import torch.distributed as dist
import time

dist.init_process_group('nccl', rank=RANK, world_size=3,
    init_method='tcp://MASTER_IP:29500')

t = torch.ones(1024*1024*64, device='cuda')  # 256MB

# Warmup
for _ in range(5):
    dist.all_reduce(t)
torch.cuda.synchronize()

# Benchmark
start = time.time()
for _ in range(20):
    dist.all_reduce(t)
torch.cuda.synchronize()
elapsed = time.time() - start

# 20 all-reduces of a 256 MB tensor = 5 GB moved, reported in GB/s
print(f'Bandwidth: {(256*20/1024)/elapsed:.2f} GB/s')

🔬 How It Works

Connection Flow

Rank 0 (listen)                    Rank 1 (connect)
     │                                   │
     ▼                                   │
 listen()                                │
 ├─ Create QPs on ALL NICs               │
 ├─ Start handshake thread               │
 ├─ Return handle with all IPs           │
     │                                   │
     │◄──────── handle exchange ────────►│
     │                                   │
     │                                   ▼
     │                              connect()
     │                              ├─ Find matching subnet
     │                              ├─ Create QP on that NIC
     │                              ├─ TCP handshake ──────────►│
     │                                   │                      │
     │◄────────────────────────────────────────── QP info ─────┤
     │                                   │                      │
     ▼                                   ▼                      ▼
 accept()                           Connect QP            [handshake thread]
 ├─ Get QP from queue               to peer's QP          ├─ Accept TCP
 └─ Return recv_comm                     │                ├─ Create new QP
                                         │                ├─ Connect QPs
                                         │                └─ Queue for accept()
                                         │
                                    ┌────┴────┐
                                    │ RDMA OK │
                                    └─────────┘

Subnet Matching

// For each peer address in the handle, stop at the first local NIC
// that shares a subnet with it
selected_nic = NULL;
for (int i = 0; i < handle->num_addrs && selected_nic == NULL; i++) {
    uint32_t peer_ip = handle->addrs[i].ip;

    // Find a local NIC on the same subnet as this peer address
    for (int j = 0; j < num_nics; j++) {
        if ((peer_ip & nic[j].netmask) == nic[j].subnet) {
            selected_nic = &nic[j];   // Found a matching NIC
            break;
        }
    }
}
// selected_nic == NULL means the peer is not directly reachable
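
Once a NIC is selected and QP numbers, PSNs, and GIDs have been exchanged over the TCP handshake, each side drives its QP toward the ready state. Below is a hedged sketch of the RTR (ready-to-receive) transition; the remote_* values come from the exchange, while the MTU, timer, and port number are illustrative placeholders.

#include <infiniband/verbs.h>

/* Sketch only: a real implementation also handles errors and the later RTS step. */
static int qp_to_rtr(struct ibv_qp *qp, uint32_t remote_qpn, uint32_t remote_psn,
                     union ibv_gid remote_gid, uint8_t gid_index)
{
    struct ibv_qp_attr attr = {
        .qp_state           = IBV_QPS_RTR,
        .path_mtu           = IBV_MTU_1024,
        .dest_qp_num        = remote_qpn,    /* from the TCP handshake        */
        .rq_psn             = remote_psn,
        .max_dest_rd_atomic = 1,
        .min_rnr_timer      = 12,
        .ah_attr = {
            .is_global = 1,                  /* RoCE v2 always carries a GRH  */
            .port_num  = 1,
            .grh = {
                .dgid       = remote_gid,    /* peer's GID, also exchanged    */
                .sgid_index = gid_index,     /* see NCCL_MESH_GID_INDEX       */
                .hop_limit  = 64,
            },
        },
    };
    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                         IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                         IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
}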

⚙️ Configuration

Environment Variable    Default   Description
NCCL_NET_PLUGIN         -         Set to mesh to use this plugin
NCCL_DEBUG              WARN      Set to INFO for detailed logs
NCCL_MESH_GID_INDEX     3         RoCE GID index to use
NCCL_MESH_DEBUG         0         Enable plugin debug output

🚧 Current Limitations

  • Single QP per connection - No multi-rail aggregation yet
  • No relay routing - Non-adjacent nodes can't communicate (fine for fully-connected mesh)
  • RoCE v2 only - Ethernet-based RDMA, no native InfiniBand support

🗺️ Roadmap

  • Multi-QP per connection for higher bandwidth
  • Adaptive routing for partial mesh topologies
  • Performance tuning (inline data, selective signaling)
  • Support for non-unified-memory systems with explicit GPUDirect RDMA

🛠️ Hardware Tested

Component   Specification
Nodes       3x NVIDIA DGX Spark
CPU         NVIDIA Grace (ARM64)
GPU         NVIDIA Blackwell
Memory      Unified (NVLink-C2C)
NICs        ConnectX-7 (100GbE)
Cables      Direct-attach QSFP56

📄 License

MIT License - see LICENSE file.

🙏 Acknowledgments

Built to connect three DGX Spark workstations that NVIDIA never intended to cluster. Sometimes the best solutions come from ignoring "supported configurations."


"The future of distributed AI computing is here." — Mistral-7B, running distributed inference on this very plugin