docs: clarify unified memory advantage on Grace Hopper

- Remove incorrect 'host memory staging' limitation
- Add section explaining NVLink-C2C unified memory benefits
- Zero-copy RDMA works automatically on DGX Spark
autoscriptlabs 2026-01-09 14:16:48 -05:00
parent 031bc48953
commit 201ec9e321
2 changed files with 44 additions and 24 deletions


@@ -33,7 +33,7 @@ Neither works for direct-cabled RDMA meshes. This plugin does.
(100Gbps)
```
**Three DGX Spark workstations** connected in a triangle mesh with direct 100Gbps RDMA cables. Each link is on a **different subnet** - a configuration NVIDIA never intended to support.
**Three NVIDIA DGX Spark workstations** connected in a triangle mesh with direct 100Gbps RDMA cables. Each link is on a **different subnet** - a configuration NVIDIA never intended to support.
## 🚀 Results
@@ -46,6 +46,16 @@ Neither works for direct-cabled RDMA meshes. This plugin does.
Successfully ran **distributed LLM inference** (Mistral-7B) across all 3 nodes using NCCL over this custom topology.
## ⚡ Unified Memory Advantage
On **Grace Hopper / DGX Spark** systems, the GPU and CPU share the same physical memory via NVLink-C2C. This unified memory architecture means:
- **No staging copies** - RDMA operates directly on GPU-accessible memory
- **GPUDirect-like performance** - Without additional kernel modules or configuration
- **Simplified memory management** - Register once, use everywhere
The 8+ GB/s figure is genuine end-to-end bandwidth, not bottlenecked by GPU↔Host staging transfers.
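To make "register once, use everywhere" concrete, here is a minimal sketch (illustrative only, not the plugin's actual code) of what registration looks like when the buffer lives in unified memory: a plain system allocation is registered with `ibv_reg_mr`, and on an NVLink-C2C system that same pointer is what the GPU touches, so there is nothing to stage. The function name and sizes are made up, and it assumes a protection domain from `ibv_alloc_pd` already exists.

```c
/* Illustrative sketch, not plugin code: on a unified-memory system a
 * plain allocation is backed by the same physical memory the GPU uses,
 * so it can be registered for RDMA once and used by CPU and GPU alike. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int register_unified_buffer(struct ibv_pd *pd, size_t size)
{
    void *buf = malloc(size);    /* system memory; GPU-visible over NVLink-C2C */
    if (!buf)
        return -1;

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) {
        free(buf);
        return -1;
    }

    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", size, mr->lkey, mr->rkey);
    /* ...hand buf to CUDA kernels and to RDMA sends/receives alike... */

    ibv_dereg_mr(mr);
    free(buf);
    return 0;
}
```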
## 🏗️ Architecture
### Key Innovations
@@ -71,7 +81,7 @@ Successfully ran **distributed LLM inference** (Mistral-7B) across all 3 nodes u
- Raw InfiniBand Verbs API (libibverbs)
- Reliable Connected (RC) Queue Pairs
- RoCE v2 over Ethernet
- Host memory staging (GPU→Host→RDMA→Host→GPU)
- Zero-copy on unified memory systems
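For anyone new to the verbs layer, the sketch below shows the rough shape of the RC queue-pair setup the list above refers to. It is illustrative only, not the plugin's connection code; the queue depths and the shared completion queue are assumptions.

```c
/* Sketch: creating a Reliable Connected (RC) queue pair with libibverbs.
 * The plugin's real setup differs; depths here are placeholders. */
#include <infiniband/verbs.h>

struct ibv_qp *create_rc_qp(struct ibv_context *ctx, struct ibv_pd *pd)
{
    /* One CQ shared by sends and receives, for simplicity. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);
    if (!cq)
        return NULL;

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,        /* Reliable Connected transport */
        .cap = {
            .max_send_wr  = 256,
            .max_recv_wr  = 256,
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };

    /* The QP comes back in RESET; it still has to be driven through
     * INIT -> RTR -> RTS with the peer's GID and QP number, which is
     * where the RoCE v2 addressing (GID index, UDP encapsulation) comes in. */
    return ibv_create_qp(pd, &attr);
}
```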
## 📦 Installation
@@ -88,7 +98,7 @@ ibv_devices
### Build
```bash
git clone https://github.com/yourusername/nccl-mesh-plugin.git
git clone https://github.com/autoscriptlabs/nccl-mesh-plugin.git
cd nccl-mesh-plugin
make
```
@@ -211,19 +221,29 @@ for (int i = 0; i < handle->num_addrs; i++) {
| `NCCL_MESH_GID_INDEX` | `3` | RoCE GID index to use |
| `NCCL_MESH_DEBUG` | `0` | Enable plugin debug output |
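The table above maps straight onto plugin init. A minimal, illustrative way to consume these tunables (not the plugin's actual parsing code) looks like this:

```c
#include <stdlib.h>

/* Illustrative only: reading the plugin's tunables at init time.
 * Defaults mirror the table above. */
struct mesh_config {
    int gid_index;   /* NCCL_MESH_GID_INDEX, default 3 */
    int debug;       /* NCCL_MESH_DEBUG, default 0 */
};

static void load_mesh_config(struct mesh_config *cfg)
{
    const char *v;

    cfg->gid_index = (v = getenv("NCCL_MESH_GID_INDEX")) ? atoi(v) : 3;
    cfg->debug     = (v = getenv("NCCL_MESH_DEBUG"))     ? atoi(v) : 0;
}
```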
## 🚧 Limitations
## 🚧 Current Limitations
- **Host memory staging**: GPU memory goes through host (no GPUDirect RDMA yet)
- **Single QP per connection**: No multi-rail aggregation
- **No relay routing**: Non-adjacent nodes can't communicate (fine for fully-connected mesh)
- **RoCE v2 only**: No InfiniBand support (Ethernet only)
- **Single QP per connection** - No multi-rail aggregation yet
- **No relay routing** - Non-adjacent nodes can't communicate (fine for fully-connected mesh)
- **RoCE v2 only** - Ethernet-based RDMA, no native InfiniBand support
## 🗺️ Roadmap
- [ ] GPUDirect RDMA support (bypass host memory)
- [ ] Multi-QP per connection for higher bandwidth
- [ ] Adaptive routing for partial meshes
- [ ] Performance tuning (inline data, signaling)
- [ ] Adaptive routing for partial mesh topologies
- [ ] Performance tuning (inline data, selective signaling)
- [ ] Support for non-unified-memory systems with explicit GPUDirect RDMA
## 🛠️ Hardware Tested
| Component | Specification |
|-----------|--------------|
| Nodes | 3x NVIDIA DGX Spark |
| CPU | NVIDIA Grace (ARM64) |
| GPU | NVIDIA Blackwell |
| Memory | Unified (NVLink-C2C) |
| NICs | ConnectX-7 (100GbE) |
| Cables | Direct-attach QSFP56 |
## 📚 References
@@ -237,8 +257,8 @@ MIT License - see [LICENSE](LICENSE) file.
## 🙏 Acknowledgments
Built to connect three DGX Spark workstations that NVIDIA never intended to be clustered. Sometimes the best solutions come from ignoring "supported configurations."
Built to connect three DGX Spark workstations that NVIDIA never intended to cluster. Sometimes the best solutions come from ignoring "supported configurations."
---
*"The future of distributed AI computing is here."* - Mistral-7B, running on this very plugin
*"The future of distributed AI computing is here."* — Mistral-7B, running distributed inference on this very plugin


@@ -278,28 +278,28 @@ ncclResult_t mesh_regMr(void *comm, void *data, size_t size,
}
```
**Note**: Current implementation uses host memory staging. GPU memory is copied to host, sent via RDMA, then copied back to GPU on the receiver. GPUDirect RDMA would eliminate these copies.
**Unified Memory Note**: On Grace Hopper / DGX Spark systems, GPU and CPU share the same physical memory via NVLink-C2C. This means RDMA registration works directly on GPU-accessible memory without any staging copies - we get GPUDirect-like semantics automatically.
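As a rough illustration (simplified, not the plugin's verbatim code), a registration path on unified memory can collapse to a single `ibv_reg_mr` call regardless of where the pointer came from. The `struct mesh_comm` shape and the function name below are assumptions made for the sketch.

```c
#include <stddef.h>
#include <infiniband/verbs.h>
#include <nccl.h>   /* ncclResult_t, ncclSuccess, ncclSystemError */

/* Hypothetical connection state; the plugin's real struct differs. */
struct mesh_comm {
    struct ibv_pd *pd;
    /* ...queue pairs, peer addressing, etc... */
};

/* On unified memory the pointer NCCL hands us is backed by the same
 * physical memory whether it was allocated for the GPU or the CPU, so
 * one registration covers both cases: no bounce buffer, no cudaMemcpy. */
static ncclResult_t sketch_regMr(struct mesh_comm *comm, void *data,
                                 size_t size, int type, void **mhandle)
{
    struct ibv_mr *mr = ibv_reg_mr(comm->pd, data, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (mr == NULL)
        return ncclSystemError;

    (void)type;      /* host vs CUDA pointer type makes no difference here */
    *mhandle = mr;   /* handed back to NCCL as the opaque memory handle */
    return ncclSuccess;
}
```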
## Performance Considerations
### Current Bottlenecks
1. **Host Memory Staging**: GPU↔Host copies add latency
2. **Single QP**: One Queue Pair per connection limits parallelism
3. **Completion Signaling**: Every operation signals completion
### Achieved Performance
- **8+ GB/s** effective bandwidth
- **~64%** of 100 Gbps line rate
- Zero-copy on unified memory (Grace Hopper)
- Sufficient for distributed ML workloads
### Current Bottlenecks
1. **Single QP**: One Queue Pair per connection limits parallelism
2. **Completion Signaling**: Every operation signals completion
3. **Protocol Overhead**: RC transport has per-message overhead
### Future Optimizations
1. **GPUDirect RDMA**: Register GPU memory directly
2. **Multi-QP**: Multiple QPs per connection
3. **Selective Signaling**: Signal every N operations
4. **Inline Data**: Small messages in WQE
1. **Multi-QP**: Multiple QPs per connection for parallelism
2. **Selective Signaling**: Signal every N operations
3. **Inline Data**: Small messages embedded in WQE
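To make the selective-signaling and inline-data items above concrete, here is an illustrative send path (not current plugin code) that combines both: small payloads travel inline in the work request, and only every Nth work request asks for a completion entry. The inline threshold and signaling interval are placeholders.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

#define SIGNAL_INTERVAL  32   /* placeholder: one signaled WR per 32 sends */
#define INLINE_THRESHOLD 64   /* placeholder: must not exceed the QP's max_inline_data */

int post_optimized_send(struct ibv_qp *qp, struct ibv_mr *mr,
                        void *buf, uint32_t len, uint64_t seq)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id   = seq,
        .sg_list = &sge,
        .num_sge = 1,
        .opcode  = IBV_WR_SEND,
    };

    /* Inline data: small payloads travel inside the work queue entry,
     * skipping the HCA's DMA read of the source buffer. */
    if (len <= INLINE_THRESHOLD)
        wr.send_flags |= IBV_SEND_INLINE;

    /* Selective signaling: only every Nth send generates a completion;
     * RC ordering means the signaled completion retires the ones before it. */
    if (seq % SIGNAL_INTERVAL == 0)
        wr.send_flags |= IBV_SEND_SIGNALED;

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```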
## File Structure