diff --git a/README.md b/README.md
index a769a6d..dc7a65f 100644
--- a/README.md
+++ b/README.md
@@ -33,7 +33,7 @@ Neither works for direct-cabled RDMA meshes. This plugin does.
     (100Gbps)
 ```

-**Three DGX Spark workstations** connected in a triangle mesh with direct 100Gbps RDMA cables. Each link is on a **different subnet** - a configuration NVIDIA never intended to support.
+**Three NVIDIA DGX Spark workstations** connected in a triangle mesh with direct 100Gbps RDMA cables. Each link is on a **different subnet** - a configuration NVIDIA never intended to support.

 ## 🚀 Results

@@ -46,6 +46,16 @@ Neither works for direct-cabled RDMA meshes. This plugin does.

 Successfully ran **distributed LLM inference** (Mistral-7B) across all 3 nodes using NCCL over this custom topology.

+## ⚡ Unified Memory Advantage
+
+On **Grace Hopper / DGX Spark** systems, the GPU and CPU share the same physical memory via NVLink-C2C. This unified memory architecture means:
+
+- **No staging copies** - RDMA operates directly on GPU-accessible memory
+- **GPUDirect-like performance** - Without additional kernel modules or configuration
+- **Simplified memory management** - Register once, use everywhere
+
+The 8+ GB/s bandwidth is the real deal, not bottlenecked by GPU↔Host transfers.
+
 ## 🏗️ Architecture

 ### Key Innovations
@@ -71,7 +81,7 @@ Successfully ran **distributed LLM inference** (Mistral-7B) across all 3 nodes u
 - Raw InfiniBand Verbs API (libibverbs)
 - Reliable Connected (RC) Queue Pairs
 - RoCE v2 over Ethernet
-- Host memory staging (GPU→Host→RDMA→Host→GPU)
+- Zero-copy on unified memory systems

 ## 📦 Installation

@@ -88,7 +98,7 @@ ibv_devices
 ### Build

 ```bash
-git clone https://github.com/yourusername/nccl-mesh-plugin.git
+git clone https://github.com/autoscriptlabs/nccl-mesh-plugin.git
 cd nccl-mesh-plugin
 make
 ```
@@ -211,19 +221,29 @@ for (int i = 0; i < handle->num_addrs; i++) {
 | `NCCL_MESH_GID_INDEX` | `3` | RoCE GID index to use |
 | `NCCL_MESH_DEBUG` | `0` | Enable plugin debug output |

-## 🚧 Limitations
+## 🚧 Current Limitations

-- **Host memory staging**: GPU memory goes through host (no GPUDirect RDMA yet)
-- **Single QP per connection**: No multi-rail aggregation
-- **No relay routing**: Non-adjacent nodes can't communicate (fine for fully-connected mesh)
-- **RoCE v2 only**: No InfiniBand support (Ethernet only)
+- **Single QP per connection** - No multi-rail aggregation yet
+- **No relay routing** - Non-adjacent nodes can't communicate (fine for fully-connected mesh)
+- **RoCE v2 only** - Ethernet-based RDMA, no native InfiniBand support

 ## 🗺️ Roadmap

-- [ ] GPUDirect RDMA support (bypass host memory)
 - [ ] Multi-QP per connection for higher bandwidth
-- [ ] Adaptive routing for partial meshes
-- [ ] Performance tuning (inline data, signaling)
+- [ ] Adaptive routing for partial mesh topologies
+- [ ] Performance tuning (inline data, selective signaling)
+- [ ] Support for non-unified-memory systems with explicit GPUDirect RDMA
+
+## 🛠️ Hardware Tested
+
+| Component | Specification |
+|-----------|--------------|
+| Nodes | 3x NVIDIA DGX Spark |
+| CPU | NVIDIA Grace (ARM64) |
+| GPU | NVIDIA Blackwell |
+| Memory | Unified (NVLink-C2C) |
+| NICs | ConnectX-7 (100GbE) |
+| Cables | Direct-attach QSFP56 |

 ## 📚 References

@@ -237,8 +257,8 @@ MIT License - see [LICENSE](LICENSE) file.

 ## 🙏 Acknowledgments

-Built to connect three DGX Spark workstations that NVIDIA never intended to be clustered. Sometimes the best solutions come from ignoring "supported configurations."
+Built to connect three DGX Spark workstations that NVIDIA never intended to cluster. Sometimes the best solutions come from ignoring "supported configurations."

 ---

-*"The future of distributed AI computing is here."* - Mistral-7B, running on this very plugin
+*"The future of distributed AI computing is here."* — Mistral-7B, running distributed inference on this very plugin
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
index 9400c51..2534d36 100644
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -278,28 +278,28 @@ ncclResult_t mesh_regMr(void *comm, void *data, size_t size,
 }
 ```

-**Note**: Current implementation uses host memory staging. GPU memory is copied to host, sent via RDMA, then copied back to GPU on the receiver. GPUDirect RDMA would eliminate these copies.
+**Unified Memory Note**: On Grace Hopper / DGX Spark systems, GPU and CPU share the same physical memory via NVLink-C2C. This means RDMA registration works directly on GPU-accessible memory without any staging copies - we get GPUDirect-like semantics automatically.

 ## Performance Considerations

-### Current Bottlenecks
-
-1. **Host Memory Staging**: GPU↔Host copies add latency
-2. **Single QP**: One Queue Pair per connection limits parallelism
-3. **Completion Signaling**: Every operation signals completion
-
 ### Achieved Performance

 - **8+ GB/s** effective bandwidth
 - **~64%** of 100 Gbps line rate
+- Zero-copy on unified memory (Grace Hopper)
 - Sufficient for distributed ML workloads

+### Current Bottlenecks
+
+1. **Single QP**: One Queue Pair per connection limits parallelism
+2. **Completion Signaling**: Every operation signals completion
+3. **Protocol Overhead**: RC transport has per-message overhead
+
 ### Future Optimizations

-1. **GPUDirect RDMA**: Register GPU memory directly
-2. **Multi-QP**: Multiple QPs per connection
-3. **Selective Signaling**: Signal every N operations
-4. **Inline Data**: Small messages in WQE
+1. **Multi-QP**: Multiple QPs per connection for parallelism
+2. **Selective Signaling**: Signal every N operations
+3. **Inline Data**: Small messages embedded in WQE

 ## File Structure
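For reference, the sketch below illustrates the zero-copy registration pattern the ARCHITECTURE.md note relies on: on a unified-memory system, the same pointer the GPU works on can be handed straight to `ibv_reg_mr`, with no staging buffer. This is a minimal, self-contained example under stated assumptions, not the plugin's actual `mesh_regMr` code; device selection, error handling, and the QP setup that would follow are simplified, and the buffer here is an ordinary system allocation, which on Grace/NVLink-C2C systems is also GPU-accessible.

```c
/*
 * Minimal sketch of zero-copy RDMA registration on a unified-memory system.
 * Illustrative only (not the plugin's mesh_regMr): the point is that a plain
 * system-memory buffer, which the GPU can also access on Grace/NVLink-C2C,
 * is registered directly with ibv_reg_mr, so no GPU->host staging copy exists.
 *
 * Build (example): gcc mr_example.c -o mr_example -libverbs
 */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    /* Open the first device; the plugin would pick the ConnectX port for a given link. */
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    if (!ctx || !pd) {
        fprintf(stderr, "failed to open device or allocate PD\n");
        return 1;
    }

    /* On a unified-memory system this buffer is directly GPU-accessible;
     * registering it once makes it usable for RDMA without staging copies. */
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    /* In the plugin, lkey/rkey would be exchanged with peers and used in work requests. */
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

On a discrete-GPU system, registering device memory this way would additionally require GPUDirect RDMA support (the nvidia-peermem kernel module), which is why that appears as a separate roadmap item in the README.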