mirror of https://github.com/autoscriptlabs/nccl-mesh-plugin.git
synced 2026-01-11 11:34:06 +00:00
docs: clarify unified memory advantage on Grace Hopper
- Remove incorrect 'host memory staging' limitation
- Add section explaining NVLink-C2C unified memory benefits
- Zero-copy RDMA works automatically on DGX Spark
This commit is contained in:
parent 031bc48953
commit 201ec9e321

2 changed files with 44 additions and 24 deletions
README.md (46 changed lines)
@@ -33,7 +33,7 @@ Neither works for direct-cabled RDMA meshes. This plugin does.
 (100Gbps)
 ```
 
-**Three DGX Spark workstations** connected in a triangle mesh with direct 100Gbps RDMA cables. Each link is on a **different subnet** - a configuration NVIDIA never intended to support.
+**Three NVIDIA DGX Spark workstations** connected in a triangle mesh with direct 100Gbps RDMA cables. Each link is on a **different subnet** - a configuration NVIDIA never intended to support.
 
 ## 🚀 Results
 
@@ -46,6 +46,16 @@ Neither works for direct-cabled RDMA meshes. This plugin does.
 Successfully ran **distributed LLM inference** (Mistral-7B) across all 3 nodes using NCCL over this custom topology.
 
+## ⚡ Unified Memory Advantage
+
+On **Grace Hopper / DGX Spark** systems, the GPU and CPU share the same physical memory via NVLink-C2C. This unified memory architecture means:
+
+- **No staging copies** - RDMA operates directly on GPU-accessible memory
+- **GPUDirect-like performance** - Without additional kernel modules or configuration
+- **Simplified memory management** - Register once, use everywhere
+
+The 8+ GB/s bandwidth is the real deal, not bottlenecked by GPU↔Host transfers.
+
 ## 🏗️ Architecture
 
 ### Key Innovations
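The Unified Memory Advantage section added above is the core claim of this commit: on an NVLink-C2C system, a single registration covers memory that both the CPU and GPU can touch. The sketch below shows roughly what that looks like at the libibverbs level. It is an illustration under that unified-memory assumption, not code from this repository; the helper name and access flags are invented for the example.

```c
/* Minimal sketch (not this repository's code): registering one buffer for RDMA
 * on a unified-memory system such as Grace Hopper / DGX Spark, where a plain
 * system allocation is already GPU-accessible over NVLink-C2C. */
#include <stdlib.h>
#include <infiniband/verbs.h>

/* Hypothetical helper: allocate a buffer, register it once with the NIC, then
 * use the same pointer from CUDA kernels and RDMA work requests, no staging. */
struct ibv_mr *register_unified_buffer(struct ibv_pd *pd, size_t size, void **buf_out)
{
    void *buf = malloc(size);           /* GPU-accessible on unified-memory systems */
    if (!buf)
        return NULL;

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        free(buf);
        return NULL;
    }

    *buf_out = buf;
    return mr;   /* mr->lkey / mr->rkey go into send/recv work requests as usual */
}
```

On a system without unified memory, the same pattern would need either a separate host staging buffer or explicit GPUDirect RDMA support to register device memory directly, which is what the roadmap item later in this diff refers to.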
@@ -71,7 +81,7 @@ Successfully ran **distributed LLM inference** (Mistral-7B) across all 3 nodes u
 - Raw InfiniBand Verbs API (libibverbs)
 - Reliable Connected (RC) Queue Pairs
 - RoCE v2 over Ethernet
-- Host memory staging (GPU→Host→RDMA→Host→GPU)
+- Zero-copy on unified memory systems
 
 ## 📦 Installation
 
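The technology list in the hunk above mentions the raw verbs API and Reliable Connected queue pairs. For orientation only, here is a small hypothetical sketch of creating an RC QP with libibverbs; the queue sizes are arbitrary and the function is not taken from the plugin's sources.

```c
/* Hypothetical sketch: creating a Reliable Connected (RC) queue pair with raw
 * libibverbs, the transport named in the feature list above. */
#include <infiniband/verbs.h>

struct ibv_qp *create_rc_qp(struct ibv_pd *pd, struct ibv_cq *cq)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,          /* Reliable Connected transport */
        .cap = {
            .max_send_wr  = 256,        /* outstanding send work requests (arbitrary) */
            .max_recv_wr  = 256,
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };
    /* The QP starts in RESET; it still has to be driven through
     * INIT -> RTR -> RTS with ibv_modify_qp() using the peer's QP number. */
    return ibv_create_qp(pd, &attr);
}
```

For RoCE v2, the later ibv_modify_qp() calls address the peer by GID rather than an InfiniBand LID, with the local GID selected by an index such as the plugin's documented NCCL_MESH_GID_INDEX.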
@@ -88,7 +98,7 @@ ibv_devices
 ### Build
 
 ```bash
-git clone https://github.com/yourusername/nccl-mesh-plugin.git
+git clone https://github.com/autoscriptlabs/nccl-mesh-plugin.git
 cd nccl-mesh-plugin
 make
 ```
@@ -211,19 +221,29 @@ for (int i = 0; i < handle->num_addrs; i++) {
 | `NCCL_MESH_GID_INDEX` | `3` | RoCE GID index to use |
 | `NCCL_MESH_DEBUG` | `0` | Enable plugin debug output |
 
-## 🚧 Limitations
+## 🚧 Current Limitations
 
-- **Host memory staging**: GPU memory goes through host (no GPUDirect RDMA yet)
-- **Single QP per connection**: No multi-rail aggregation
-- **No relay routing**: Non-adjacent nodes can't communicate (fine for fully-connected mesh)
-- **RoCE v2 only**: No InfiniBand support (Ethernet only)
+- **Single QP per connection** - No multi-rail aggregation yet
+- **No relay routing** - Non-adjacent nodes can't communicate (fine for fully-connected mesh)
+- **RoCE v2 only** - Ethernet-based RDMA, no native InfiniBand support
 
 ## 🗺️ Roadmap
 
-- [ ] GPUDirect RDMA support (bypass host memory)
 - [ ] Multi-QP per connection for higher bandwidth
-- [ ] Adaptive routing for partial meshes
-- [ ] Performance tuning (inline data, signaling)
+- [ ] Adaptive routing for partial mesh topologies
+- [ ] Performance tuning (inline data, selective signaling)
+- [ ] Support for non-unified-memory systems with explicit GPUDirect RDMA
 
+## 🛠️ Hardware Tested
+
+| Component | Specification |
+|-----------|--------------|
+| Nodes | 3x NVIDIA DGX Spark |
+| CPU | NVIDIA Grace (ARM64) |
+| GPU | NVIDIA Blackwell |
+| Memory | Unified (NVLink-C2C) |
+| NICs | ConnectX-7 (100GbE) |
+| Cables | Direct-attach QSFP56 |
+
 ## 📚 References
 
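The table context at the top of the hunk above documents two configuration variables, `NCCL_MESH_GID_INDEX` (default `3`) and `NCCL_MESH_DEBUG` (default `0`). Purely as an illustration of how such knobs are typically consumed, and not the repository's actual code, a plugin might read them at init time like this:

```c
/* Illustrative sketch (not from this repository): reading the documented
 * environment variables with their documented defaults at plugin init. */
#include <stdio.h>
#include <stdlib.h>

static int env_int(const char *name, int defval)
{
    const char *v = getenv(name);
    return (v && *v) ? atoi(v) : defval;
}

static int mesh_gid_index;   /* RoCE GID index used when connecting QPs */
static int mesh_debug;       /* when non-zero, emit plugin debug output */

static void mesh_read_config(void)
{
    mesh_gid_index = env_int("NCCL_MESH_GID_INDEX", 3);  /* default 3, per README */
    mesh_debug     = env_int("NCCL_MESH_DEBUG", 0);      /* default 0, per README */

    if (mesh_debug)
        fprintf(stderr, "[nccl-mesh] gid_index=%d debug=%d\n",
                mesh_gid_index, mesh_debug);
}
```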
@@ -237,8 +257,8 @@ MIT License - see [LICENSE](LICENSE) file.
 
 ## 🙏 Acknowledgments
 
-Built to connect three DGX Spark workstations that NVIDIA never intended to be clustered. Sometimes the best solutions come from ignoring "supported configurations."
+Built to connect three DGX Spark workstations that NVIDIA never intended to cluster. Sometimes the best solutions come from ignoring "supported configurations."
 
 ---
 
-*"The future of distributed AI computing is here."* - Mistral-7B, running on this very plugin
+*"The future of distributed AI computing is here."* — Mistral-7B, running distributed inference on this very plugin
@@ -278,28 +278,28 @@ ncclResult_t mesh_regMr(void *comm, void *data, size_t size,
 }
 ```
 
-**Note**: Current implementation uses host memory staging. GPU memory is copied to host, sent via RDMA, then copied back to GPU on the receiver. GPUDirect RDMA would eliminate these copies.
+**Unified Memory Note**: On Grace Hopper / DGX Spark systems, GPU and CPU share the same physical memory via NVLink-C2C. This means RDMA registration works directly on GPU-accessible memory without any staging copies - we get GPUDirect-like semantics automatically.
 
 ## Performance Considerations
 
-### Current Bottlenecks
-
-1. **Host Memory Staging**: GPU↔Host copies add latency
-2. **Single QP**: One Queue Pair per connection limits parallelism
-3. **Completion Signaling**: Every operation signals completion
-
 ### Achieved Performance
 
 - **8+ GB/s** effective bandwidth
 - **~64%** of 100 Gbps line rate
+- Zero-copy on unified memory (Grace Hopper)
 - Sufficient for distributed ML workloads
 
+### Current Bottlenecks
+
+1. **Single QP**: One Queue Pair per connection limits parallelism
+2. **Completion Signaling**: Every operation signals completion
+3. **Protocol Overhead**: RC transport has per-message overhead
+
 ### Future Optimizations
 
-1. **GPUDirect RDMA**: Register GPU memory directly
-2. **Multi-QP**: Multiple QPs per connection
-3. **Selective Signaling**: Signal every N operations
-4. **Inline Data**: Small messages in WQE
+1. **Multi-QP**: Multiple QPs per connection for parallelism
+2. **Selective Signaling**: Signal every N operations
+3. **Inline Data**: Small messages embedded in WQE
 
 ## File Structure
 
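The reworked bottleneck and optimization lists in the hunk above name two verbs-level techniques: selective completion signaling and inline data. The sketch below shows both on a send path, assuming a QP created with `sq_sig_all = 0` and a non-zero `max_inline_data`; the signaling interval, inline threshold, and helper name are assumptions for the example, not code from this repository.

```c
/* Hypothetical sketch of the two optimizations named above, using plain
 * libibverbs: selective completion signaling (request a CQE only every Nth
 * send) and inline data for small messages (payload copied into the WQE). */
#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

#define SIGNAL_INTERVAL   32   /* request a completion every 32 sends (arbitrary) */
#define MAX_INLINE_BYTES  64   /* must not exceed the QP's max_inline_data */

int post_send(struct ibv_qp *qp, uint64_t seq, void *buf, uint32_t len,
              uint32_t lkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = lkey };
    struct ibv_send_wr wr = {
        .wr_id   = seq,
        .sg_list = &sge,
        .num_sge = 1,
        .opcode  = IBV_WR_SEND,
    };

    /* Selective signaling: only every Nth work request generates a CQE,
     * cutting completion-processing overhead on the send path. */
    if (seq % SIGNAL_INTERVAL == 0)
        wr.send_flags |= IBV_SEND_SIGNALED;

    /* Inline data: small payloads are copied into the WQE itself, so the
     * buffer can be reused immediately and the NIC skips an extra DMA read. */
    if (len <= MAX_INLINE_BYTES)
        wr.send_flags |= IBV_SEND_INLINE;

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

With selective signaling, at least the last outstanding work request before the send queue fills must still be signaled so completions can be reaped and send-queue slots reused.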