mirror of
https://github.com/autoscriptlabs/nccl-mesh-plugin.git
synced 2026-01-11 11:34:06 +00:00
- Enables NCCL over multi-subnet mesh topologies - 8+ GB/s bandwidth over 100Gbps RDMA - Successfully tested with distributed LLM inference (Mistral-7B) - Custom subnet-aware NIC selection - Background handshake thread for deadlock-free connection setup
# Hardware Setup Guide

This guide covers setting up a direct-connect RDMA mesh topology with multiple nodes.

## Overview

Our reference setup uses three NVIDIA DGX Spark workstations connected in a triangle mesh topology. Each pair of nodes has a dedicated 100 Gbps RDMA link on its own subnet.

## Hardware Requirements

- 3+ nodes with RDMA-capable NICs (ConnectX-6/7 recommended)
- Direct-attach cables (QSFP56 for 100GbE)
- Each node needs N-1 RDMA ports for N nodes in a fully-connected mesh (separate NICs or ports on a multi-port NIC)

## Network Topology

### Triangle Mesh (3 Nodes)

```
             Node A
           /         \
      NIC1            NIC2
        |               |
  192.168.101.x   192.168.100.x
        |               |
      NIC1            NIC1
        |               |
     Node B -------- Node C
              NIC2
          192.168.102.x
```

### IP Address Assignment

| Link | Subnet | Node A | Node B | Node C |
|------|--------|--------|--------|--------|
| A↔B | 192.168.101.0/24 | .2 | .3 | - |
| A↔C | 192.168.100.0/24 | .2 | - | .3 |
| B↔C | 192.168.102.0/24 | - | .2 | .3 |

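Because every link lives on its own /24, the local address that reaches a given peer can be found by comparing network prefixes. The sketch below illustrates that idea only; `same_subnet24` and `pick_local_addr` are hypothetical helper names, not functions from the plugin, and the code assumes /24 subnets as in the table above.

```bash
# Return 0 if two IPv4 addresses share the same /24 network
# (compares everything before the last dot).
same_subnet24() {
    [ "${1%.*}" = "${2%.*}" ]
}

# Usage: pick_local_addr <peer_ip> <local_ip>...
# Prints the first local address on the same /24 as the peer.
pick_local_addr() {
    peer=$1
    shift
    for addr in "$@"; do
        if same_subnet24 "$addr" "$peer"; then
            echo "$addr"
            return 0
        fi
    done
    return 1
}

# Node A's mesh addresses; the peer is Node C on the A<->C link.
pick_local_addr 192.168.100.3 192.168.101.2 192.168.100.2
```

With the addresses from the table, this prints `192.168.100.2`. The function returns non-zero when no local address shares a subnet with the peer, for example when Node A asks about `192.168.102.3`.
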
## Network Configuration

### 1. Identify NICs

```bash
# List RDMA devices
ibv_devices

# List network interfaces with RDMA
ls -la /sys/class/infiniband/*/device/net/
```

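To see which network interface belongs to which RDMA device at a glance, the same sysfs layout can be walked in a loop. This is a minimal sketch; `rdma_netdev_map` is a hypothetical helper, and the optional argument (a sysfs root) exists only so the function can be tried against a test directory.

```bash
# Print "<rdma_device> <netdev>" pairs found under the given sysfs root.
# Usage: rdma_netdev_map [sysfs_root]
rdma_netdev_map() {
    root=${1:-/sys/class/infiniband}
    for dev_dir in "$root"/*; do
        # Skip devices that expose no network interface.
        [ -d "$dev_dir/device/net" ] || continue
        for net_dir in "$dev_dir"/device/net/*; do
            [ -e "$net_dir" ] || continue
            echo "${dev_dir##*/} ${net_dir##*/}"
        done
    done
}

rdma_netdev_map
```

On a system matching this guide, the output would pair names such as `rocep1s0f0` with `enp1s0f0np0`; the exact names depend on your hardware and udev naming rules.
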
### 2. Configure IP Addresses

On **Node A** (example):

```bash
# Link to Node B
sudo ip addr add 192.168.101.2/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up

# Link to Node C
sudo ip addr add 192.168.100.2/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```

On **Node B**:

```bash
# Link to Node A
sudo ip addr add 192.168.101.3/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up

# Link to Node C
sudo ip addr add 192.168.102.2/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```

On **Node C**:

```bash
# Link to Node A
sudo ip addr add 192.168.100.3/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up

# Link to Node B
sudo ip addr add 192.168.102.3/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```

### 3. Make Configuration Persistent

Create netplan config (Ubuntu):

```yaml
# /etc/netplan/99-rdma-mesh.yaml
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.101.2/24  # Adjust per node
    enp1s0f1np1:
      addresses:
        - 192.168.100.2/24  # Adjust per node
```

Apply:
```bash
sudo netplan apply
```

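Since only the two addresses differ between nodes, the per-node file can be generated rather than edited by hand. A minimal sketch, assuming the interface names used in this guide and /24 subnets; `render_mesh_netplan` is a hypothetical helper name.

```bash
# Emit a netplan config for one node.
# Usage: render_mesh_netplan <addr_on_enp1s0f0np0> <addr_on_enp1s0f1np1>
render_mesh_netplan() {
    cat <<EOF
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses:
        - $1/24
    enp1s0f1np1:
      addresses:
        - $2/24
EOF
}

# Node B: link to Node A on the first port, link to Node C on the second.
render_mesh_netplan 192.168.101.3 192.168.102.2
```

On the target node, the output can be piped to `sudo tee /etc/netplan/99-rdma-mesh.yaml` before running `sudo netplan apply`.
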
## Verify Connectivity

### 1. Ping Test

From Node A:
```bash
ping 192.168.101.3  # Node B
ping 192.168.100.3  # Node C
```

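The same check can be scripted for every node. The sketch below hard-codes the peer addresses from the IP assignment table; `peers_for_node` and `check_mesh_links` are hypothetical helper names, and the node labels A, B, and C are the ones used in this guide.

```bash
# Print the mesh addresses of a node's two peers.
peers_for_node() {
    case "$1" in
        A) echo "192.168.101.3 192.168.100.3" ;;
        B) echo "192.168.101.2 192.168.102.3" ;;
        C) echo "192.168.100.2 192.168.102.2" ;;
        *) echo "unknown node: $1" >&2; return 1 ;;
    esac
}

# Ping each peer three times and report per-link status.
# Returns non-zero if any peer is unreachable.
check_mesh_links() {
    status=0
    for ip in $(peers_for_node "$1"); do
        if ping -c 3 -W 1 "$ip" >/dev/null 2>&1; then
            echo "ok   $ip"
        else
            echo "FAIL $ip"
            status=1
        fi
    done
    return "$status"
}
```

Run `check_mesh_links A` on Node A, `check_mesh_links B` on Node B, and so on. A `FAIL` line points at the specific link to inspect.
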
### 2. RDMA Test

```bash
# On Node B (server); start this side first
ib_send_bw -d rocep1s0f0 -x 3

# On Node A (client)
ib_send_bw -d rocep1s0f0 -x 3 192.168.101.3
```

Expected output: ~12 GB/s for 100GbE (the theoretical line rate is 12.5 GB/s)

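The ~12 GB/s figure is simply the link speed in bits divided by eight, minus protocol overhead. A small sketch of that arithmetic, using hypothetical helper names and `awk` for the floating-point math:

```bash
# Convert a link speed in Gbit/s to GB/s (divide by 8 bits per byte).
gbps_to_gbytes() {
    awk -v gbps="$1" 'BEGIN { printf "%.1f\n", gbps / 8 }'
}

# Percentage of line rate achieved.
# Usage: link_efficiency <measured_GB_per_s> <link_Gbit_per_s>
link_efficiency() {
    awk -v m="$1" -v gbps="$2" 'BEGIN { printf "%.1f\n", m / (gbps / 8) * 100 }'
}

gbps_to_gbytes 100      # prints 12.5
link_efficiency 12 100  # prints 96.0
```

A measured result far below line rate usually means the test ran over the wrong interface or GID index, so it is worth rechecking both before tuning anything else.
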
### 3. Verify GID Index

```bash
# Show GID table
show_gids

# Find RoCE v2 GID (usually index 3)
ibv_devinfo -v | grep -A5 GID
```

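## RoCE Configuration

### Enable RoCE v2

```bash
# Check current mode
cat /sys/class/infiniband/rocep*/ports/1/gid_attrs/types/*

# Enable RoCE v2 (if needed). The gid_attrs/types entries are read-only,
# so the default RoCE mode for RDMA CM is set through configfs instead.
# When selecting a GID by index, pick one whose type is already RoCE v2.
sudo modprobe rdma_cm
sudo mkdir -p /sys/kernel/config/rdma_cm/rocep1s0f0
echo "RoCE v2" | sudo tee /sys/kernel/config/rdma_cm/rocep1s0f0/ports/1/default_roce_mode
```
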
### Configure ECN (Optional but Recommended)

```bash
# Enable ECN for TCP. This sysctl does not affect RoCE traffic itself;
# RoCE congestion control is configured on the NIC and is driver-specific.
sudo sysctl -w net.ipv4.tcp_ecn=1

# Configure PFC (Priority Flow Control) on switch if applicable
```

## Firewall Configuration

Open ports for NCCL communication:

```bash
# TCP ports for handshake (dynamic, 40000-50000 range)
sudo ufw allow 40000:50000/tcp

# Or allow all inbound traffic on the mesh interfaces
sudo ufw allow in on enp1s0f0np0
sudo ufw allow in on enp1s0f1np1
```

## Troubleshooting

### No RDMA Devices Found

```bash
# Load kernel modules
sudo modprobe ib_core
sudo modprobe mlx5_core
sudo modprobe mlx5_ib

# Check dmesg
dmesg | grep -i mlx
```

### Link Not Coming Up

```bash
# Check physical connection
ethtool enp1s0f0np0

# Check for errors
ip -s link show enp1s0f0np0
```

### RDMA Connection Fails

```bash
# Verify GID is populated
cat /sys/class/infiniband/rocep1s0f0/ports/1/gids/3

# Check RDMA link state
rdma link show
```

### Wrong GID Index

Try different GID indices:

```bash
export NCCL_MESH_GID_INDEX=0  # or 1, 2, 3...
```

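Instead of trying indices one by one, the candidates can be read from sysfs. The sketch below lists populated RoCE v2 entries for one device; `list_rocev2_gids` is a hypothetical helper, the device name is the one used in this guide, and the optional third argument (a sysfs root) is there only to allow testing against a fake directory tree.

```bash
# Print "<index> <gid>" for each populated RoCE v2 GID on a device port.
# Usage: list_rocev2_gids <device> [port] [sysfs_root]
list_rocev2_gids() {
    dev=$1
    port=${2:-1}
    root=${3:-/sys/class/infiniband}
    base="$root/$dev/ports/$port"
    for type_file in "$base"/gid_attrs/types/*; do
        [ -e "$type_file" ] || continue
        idx=${type_file##*/}
        # Reading an unpopulated entry fails; skip it.
        gid_type=$(cat "$type_file" 2>/dev/null) || continue
        gid=$(cat "$base/gids/$idx" 2>/dev/null) || continue
        case "$gid_type" in
            *"RoCE v2"*) ;;
            *) continue ;;
        esac
        # An all-zero GID means the slot is unused.
        [ "$gid" = "0000:0000:0000:0000:0000:0000:0000:0000" ] && continue
        echo "$idx $gid"
    done
}

list_rocev2_gids rocep1s0f0
```

IPv4-based GIDs contain `ffff:` followed by the address in hex, so `192.168.101.2` appears as `...:ffff:c0a8:6502`. The index printed next to the mesh address is the value to try for `NCCL_MESH_GID_INDEX`.
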
## Scaling Beyond 3 Nodes

For N nodes in a fully-connected mesh:
- Each node needs N-1 RDMA ports
- Total links: N*(N-1)/2
- Each link is on its own subnet

For 4 nodes:
```
    A
   /|\
  B-+-C
   \|/
    D
```
- 6 links, 6 subnets
- Each node needs 3 RDMA ports

For larger clusters, consider a **partial mesh** or **fat-tree** topology with relay routing (not yet implemented in this plugin).

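The port and link counts grow quickly with cluster size, which is easy to see by evaluating the formulas for a few values of N. A minimal sketch; `mesh_requirements` is a hypothetical helper name.

```bash
# Print per-node port count and total link/subnet count for an N-node full mesh.
mesh_requirements() {
    n=$1
    if [ "$n" -lt 2 ]; then
        echo "need at least 2 nodes" >&2
        return 1
    fi
    ports=$((n - 1))
    links=$((n * (n - 1) / 2))
    echo "nodes=$n ports_per_node=$ports links=$links subnets=$links"
}

mesh_requirements 3
mesh_requirements 4
mesh_requirements 8
```

For 8 nodes this already reports 7 ports per node and 28 links, which is why a full mesh stops being practical beyond a handful of nodes.
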
## Reference: DGX Spark Mesh

Our tested configuration:

| Hostname | Management IP | Mesh IPs |
|----------|--------------|----------|
| titanic (A) | 10.0.0.170 | 192.168.100.2, 192.168.101.2 |
| iceberg (B) | 10.0.0.171 | 192.168.101.3, 192.168.102.2 |
| carpathia (C) | 10.0.0.172 | 192.168.100.3, 192.168.102.3 |