nccl-mesh-plugin/docs/SETUP.md
autoscriptlabs 031bc48953 Initial release: NCCL Mesh Plugin for direct-connect RDMA topologies
- Enables NCCL over multi-subnet mesh topologies
- 8+ GB/s bandwidth over 100Gbps RDMA
- Successfully tested with distributed LLM inference (Mistral-7B)
- Custom subnet-aware NIC selection
- Background handshake thread for deadlock-free connection setup
2026-01-09 14:09:33 -05:00

# Hardware Setup Guide
This guide covers setting up a direct-connect RDMA mesh topology with multiple nodes.
## Overview
Our reference setup uses three NVIDIA DGX Spark workstations connected in a triangle mesh topology. Each pair of nodes has a dedicated 100 Gbps RDMA link on its own subnet.
## Hardware Requirements
- 3+ nodes with RDMA-capable NICs (ConnectX-6/7 recommended)
- Direct-attach cables (QSFP56 for 100GbE)
- Each node needs N-1 RDMA-capable ports (separate NICs or ports on a multi-port NIC) for N nodes in a fully-connected mesh
## Network Topology
### Triangle Mesh (3 Nodes)
```
              Node A
             /      \
          NIC1      NIC2
           |          |
   192.168.101.x  192.168.100.x
           |          |
          NIC1      NIC1
           |          |
        Node B ---- Node C
              NIC2
          192.168.102.x
```
### IP Address Assignment
| Link | Subnet | Node A | Node B | Node C |
|------|--------|--------|--------|--------|
| A↔B | 192.168.101.0/24 | .2 | .3 | - |
| A↔C | 192.168.100.0/24 | .2 | - | .3 |
| B↔C | 192.168.102.0/24 | - | .2 | .3 |
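
The table relies on two properties: both ends of a link sit in the same /24, and no /24 is shared between links. The sketch below checks a plan for both. `validate_plan` and `subnet24` are hypothetical helper names, and the check assumes /24 subnets as used in this guide.

```bash
# Address plan copied from the table above, one link per line: "link endA endB".
plan='A-B 192.168.101.2 192.168.101.3
A-C 192.168.100.2 192.168.100.3
B-C 192.168.102.2 192.168.102.3'

# Print the first three octets of a dotted-quad address (its /24 prefix).
subnet24() { echo "${1%.*}"; }

# Read "link endA endB" lines on stdin; check that both ends of each link
# share a /24 and that no /24 is used by more than one link.
validate_plan() (
  seen=" "
  status=0
  while read -r link a b; do
    net=$(subnet24 "$a")
    if [ "$net" != "$(subnet24 "$b")" ]; then
      echo "ERROR $link: endpoints are on different subnets"; status=1
    else
      case "$seen" in
        *" $net "*) echo "ERROR $link: $net.0/24 is already used"; status=1 ;;
        *) echo "OK $link: $net.0/24" ;;
      esac
    fi
    seen="$seen$net "
  done
  exit "$status"
)

printf '%s\n' "$plan" | validate_plan
```

When extending the mesh, add one line per new link to `plan` and rerun the check before touching any interface.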
## Network Configuration
### 1. Identify NICs
```bash
# List RDMA devices
ibv_devices
# List network interfaces with RDMA
ls -la /sys/class/infiniband/*/device/net/
```
### 2. Configure IP Addresses
On **Node A** (example):
```bash
# Link to Node B
sudo ip addr add 192.168.101.2/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up
# Link to Node C
sudo ip addr add 192.168.100.2/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```
On **Node B**:
```bash
# Link to Node A
sudo ip addr add 192.168.101.3/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up
# Link to Node C
sudo ip addr add 192.168.102.2/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```
On **Node C**:
```bash
# Link to Node A
sudo ip addr add 192.168.100.3/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up
# Link to Node B
sudo ip addr add 192.168.102.3/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```
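
Because every link has its own /24, the local address that reaches a given peer can be found by matching subnets. The following is an illustration of that idea only, not the plugin's selection logic; `pick_local_addr` is a hypothetical helper and assumes /24 subnets.

```bash
# Print the first local address that shares a /24 with the peer address.
# Usage: pick_local_addr PEER LOCAL_ADDR...
pick_local_addr() (
  peer=$1; shift
  for addr in "$@"; do
    if [ "${addr%.*}" = "${peer%.*}" ]; then
      echo "$addr"
      exit 0
    fi
  done
  exit 1   # no local address on the peer's subnet
)

# Node A (192.168.101.2, 192.168.100.2) reaching Node C (192.168.100.3)
pick_local_addr 192.168.100.3 192.168.101.2 192.168.100.2
```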
### 3. Make Configuration Persistent
Create netplan config (Ubuntu):
```yaml
# /etc/netplan/99-rdma-mesh.yaml
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.101.2/24  # Adjust per node
    enp1s0f1np1:
      addresses:
        - 192.168.100.2/24  # Adjust per node
```
Apply:
```bash
sudo netplan apply
```
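
Since only the addresses differ between nodes, the file can be generated rather than edited by hand. A minimal sketch, assuming the same interface names on every node; `render_netplan` is a hypothetical helper that prints the same structure as the example above.

```bash
# Print a netplan document for "interface=address/prefix" arguments.
render_netplan() (
  echo "network:"
  echo "  version: 2"
  echo "  ethernets:"
  for pair in "$@"; do
    echo "    ${pair%%=*}:"      # interface name (text before "=")
    echo "      addresses:"
    echo "        - ${pair#*=}"  # address/prefix (text after "=")
  done
)

# Node B, using the interface names and addresses from step 2
render_netplan enp1s0f0np0=192.168.101.3/24 enp1s0f1np1=192.168.102.2/24
```

The output can be piped through `sudo tee /etc/netplan/99-rdma-mesh.yaml` on the target node before running `sudo netplan apply`.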
## Verify Connectivity
### 1. Ping Test
From Node A:
```bash
ping 192.168.101.3 # Node B
ping 192.168.100.3 # Node C
```
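
Plain `ping` runs until interrupted, so a bounded loop is handier when checking several peers. A sketch, assuming the iputils `ping` options `-c` (packet count) and `-W` (reply timeout in seconds); `probe` and `check_peers` are hypothetical helper names.

```bash
# Probe one peer: three packets, two-second reply timeout.
probe() { ping -c 3 -W 2 "$1" >/dev/null 2>&1; }

# Probe every peer and report; exits non-zero if any peer is unreachable.
check_peers() (
  status=0
  for peer in "$@"; do
    if probe "$peer"; then
      echo "OK   $peer"
    else
      echo "FAIL $peer"
      status=1
    fi
  done
  exit "$status"
)

# From Node A:
#   check_peers 192.168.101.3 192.168.100.3
```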
### 2. RDMA Test
```bash
# On Node B (server)
ib_send_bw -d rocep1s0f0 -x 3
# On Node A (client)
ib_send_bw -d rocep1s0f0 -x 3 192.168.101.3
```
Expected output: roughly 12 GB/s for 100GbE (line rate is 100 Gbit/s ÷ 8 = 12.5 GB/s)
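
The expected figure follows from dividing the link rate by 8 bits per byte. The sketch below does that arithmetic and expresses a measured result as a share of line rate; `gbps_to_gbytes` and `line_rate_pct` are hypothetical helper names.

```bash
# Convert a link rate in Gbit/s to GB/s (8 bits per byte).
gbps_to_gbytes() { awk -v g="$1" 'BEGIN { printf "%.1f\n", g / 8 }'; }

# Percentage of line rate achieved: measured GB/s against a link rate in Gbit/s.
line_rate_pct() { awk -v m="$1" -v g="$2" 'BEGIN { printf "%.0f\n", 100 * m / (g / 8) }'; }

gbps_to_gbytes 100     # prints 12.5
line_rate_pct 12 100   # prints 96
```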
### 3. Verify GID Index
```bash
# Show GID table
show_gids
# Find RoCE v2 GID (usually index 3)
ibv_devinfo -v | grep -A5 GID
```
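
If `show_gids` is not installed, the same information can be read from sysfs. A sketch, assuming the standard `gid_attrs/types/<index>` and `gids/<index>` files under the port directory; `find_rocev2_gids` is a hypothetical helper name.

```bash
# List "index gid" for every populated RoCE v2 entry under a port directory,
# e.g. /sys/class/infiniband/rocep1s0f0/ports/1.
find_rocev2_gids() (
  port_dir=$1
  for type_file in "$port_dir"/gid_attrs/types/*; do
    idx=${type_file##*/}
    # Unpopulated entries may fail to read; skip them.
    type=$(cat "$type_file" 2>/dev/null) || continue
    gid=$(cat "$port_dir/gids/$idx" 2>/dev/null) || continue
    if [ "$type" = "RoCE v2" ] && [ "$gid" != "0000:0000:0000:0000:0000:0000:0000:0000" ]; then
      echo "$idx $gid"
    fi
  done
)

# On a node:
#   find_rocev2_gids /sys/class/infiniband/rocep1s0f0/ports/1
```

Entries whose GID starts with `0000:0000:0000:0000:0000:ffff:` are IPv4-mapped; the last two groups are the interface's IPv4 address in hex (`c0a8:6502` is 192.168.101.2).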
## RoCE Configuration
### Enable RoCE v2
```bash
# Check current mode
cat /sys/class/infiniband/rocep*/ports/1/gid_attrs/types/*
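# The per-GID type files above are read-only. To change the default RoCE mode
# used by RDMA CM (if needed), use configfs:
sudo modprobe rdma_cm
sudo mkdir -p /sys/kernel/config/rdma_cm/rocep1s0f0
echo "RoCE v2" | sudo tee /sys/kernel/config/rdma_cm/rocep1s0f0/ports/1/default_roce_mode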
```
### Configure ECN (Optional but Recommended)
```bash
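# Enable ECN for TCP (this covers the TCP handshake traffic only; RoCE
# congestion control is configured on the NIC with the vendor's tools)
sudo sysctl -w net.ipv4.tcp_ecn=1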
# Configure PFC (Priority Flow Control) on switch if applicable
```
## Firewall Configuration
Open ports for NCCL communication:
```bash
# TCP ports for handshake (dynamic, 40000-50000 range)
sudo ufw allow 40000:50000/tcp
# Or allow all inbound traffic on the mesh interfaces
sudo ufw allow in on enp1s0f0np0
sudo ufw allow in on enp1s0f1np1
```
## Troubleshooting
### No RDMA Devices Found
```bash
# Load kernel modules
sudo modprobe ib_core
sudo modprobe mlx5_core
sudo modprobe mlx5_ib
# Check dmesg
dmesg | grep -i mlx
```
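
To see which of the three modules is actually missing before loading anything, the module list can be checked directly. A sketch, assuming the `/proc/modules` format (module name in the first column); `missing_modules` is a hypothetical helper name. Drivers built into the kernel do not appear in `/proc/modules`, so a module reported here is not necessarily unavailable.

```bash
# Print each named module that is absent from a /proc/modules-style file.
# Usage: missing_modules FILE MODULE...
missing_modules() (
  modules_file=$1; shift
  for mod in "$@"; do
    awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$modules_file" \
      || echo "$mod"
  done
)

# On a node:
#   missing_modules /proc/modules ib_core mlx5_core mlx5_ib
```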
### Link Not Coming Up
```bash
# Check physical connection
ethtool enp1s0f0np0
# Check for errors
ip -s link show enp1s0f0np0
```
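
The two fields of `ethtool` output that matter most here are `Speed` and `Link detected`. A small filter can pull them out; `link_summary` is a hypothetical helper name and the sketch assumes the usual `Field: value` layout of `ethtool` output.

```bash
# Read `ethtool <iface>` output on stdin and print speed and link state.
link_summary() {
  awk -F': ' '
    /Speed:/         { speed = $2 }
    /Link detected:/ { link = $2 }
    END {
      printf "speed=%s link=%s\n", (speed ? speed : "unknown"), (link ? link : "unknown")
    }
  '
}

# On a node:
#   ethtool enp1s0f0np0 | link_summary
```

A healthy 100GbE link should report `speed=100000Mb/s link=yes`.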
### RDMA Connection Fails
```bash
# Verify GID is populated
cat /sys/class/infiniband/rocep1s0f0/ports/1/gids/3
# Check RDMA link state
rdma link show
```
### Wrong GID Index
Try different GID indices:
```bash
export NCCL_MESH_GID_INDEX=0 # or 1, 2, 3...
```
## Scaling Beyond 3 Nodes
For N nodes in a fully-connected mesh:
- Each node needs N-1 RDMA-capable ports
- Total links: N*(N-1)/2
- Each link is on its own subnet
For 4 nodes:
```
    A
   /|\
  B-+-C
   \|/
    D
```
- 6 links, 6 subnets
- Each node needs 3 RDMA-capable ports
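
The counts above come straight from the two formulas, which makes it easy to see how quickly a full mesh grows. A short sketch; `mesh_ports_per_node` and `mesh_links` are hypothetical helper names.

```bash
# Ports required on each node in a fully-connected mesh of N nodes: N-1.
mesh_ports_per_node() { echo $(( $1 - 1 )); }

# Total links (and subnets) in a fully-connected mesh of N nodes: N*(N-1)/2.
mesh_links() { echo $(( $1 * ($1 - 1) / 2 )); }

for n in 3 4 8; do
  echo "nodes=$n ports_per_node=$(mesh_ports_per_node "$n") links=$(mesh_links "$n")"
done
# nodes=3 ports_per_node=2 links=3
# nodes=4 ports_per_node=3 links=6
# nodes=8 ports_per_node=7 links=28
```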
For larger clusters, consider a **partial mesh** or **fat-tree** topology with relay routing (not yet implemented in this plugin).
## Reference: DGX Spark Mesh
Our tested configuration:
| Hostname | Management IP | Mesh IPs |
|----------|--------------|----------|
| titanic (A) | 10.0.0.170 | 192.168.100.2, 192.168.101.2 |
| iceberg (B) | 10.0.0.171 | 192.168.101.3, 192.168.102.2 |
| carpathia (C) | 10.0.0.172 | 192.168.100.3, 192.168.102.3 |