docs: clarify unified memory advantage on Grace Hopper

- Remove incorrect 'host memory staging' limitation
- Add section explaining NVLink-C2C unified memory benefits
- Zero-copy RDMA works automatically on DGX Spark
autoscriptlabs 2026-01-09 14:16:48 -05:00
parent 031bc48953
commit 201ec9e321
2 changed files with 44 additions and 24 deletions


@@ -33,7 +33,7 @@ Neither works for direct-cabled RDMA meshes. This plugin does.
(100Gbps)
```
**Three DGX Spark workstations** connected in a triangle mesh with direct 100Gbps RDMA cables. Each link is on a **different subnet** - a configuration NVIDIA never intended to support.
**Three NVIDIA DGX Spark workstations** connected in a triangle mesh with direct 100Gbps RDMA cables. Each link is on a **different subnet** - a configuration NVIDIA never intended to support.
## 🚀 Results
@@ -46,6 +46,16 @@ Neither works for direct-cabled RDMA meshes. This plugin does.
Successfully ran **distributed LLM inference** (Mistral-7B) across all 3 nodes using NCCL over this custom topology.
## ⚡ Unified Memory Advantage
On **Grace Hopper / DGX Spark** systems, the GPU and CPU share the same physical memory via NVLink-C2C. This unified memory architecture means:
- **No staging copies** - RDMA operates directly on GPU-accessible memory
- **GPUDirect-like performance** - Without additional kernel modules or configuration
- **Simplified memory management** - Register once, use everywhere
The 8+ GB/s figure is genuine end-to-end bandwidth, not bottlenecked by GPU↔Host staging transfers.
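To make "register once, use everywhere" concrete, here is a minimal sketch (illustrative only, not the plugin's actual code) of what registration looks like when the buffer lives in unified memory: a plain system allocation is registered with `ibv_reg_mr`, and on an NVLink-C2C system that same pointer is what the GPU touches, so there is nothing to stage. The function name and sizes are made up, and it assumes a protection domain from `ibv_alloc_pd` already exists.

```c
/* Illustrative sketch, not plugin code: on a unified-memory system a
 * plain allocation is backed by the same physical memory the GPU uses,
 * so it can be registered for RDMA once and used by CPU and GPU alike. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int register_unified_buffer(struct ibv_pd *pd, size_t size)
{
    void *buf = malloc(size);    /* system memory; GPU-visible over NVLink-C2C */
    if (!buf)
        return -1;

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) {
        free(buf);
        return -1;
    }

    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", size, mr->lkey, mr->rkey);
    /* ...hand buf to CUDA kernels and to RDMA sends/receives alike... */

    ibv_dereg_mr(mr);
    free(buf);
    return 0;
}
```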
## 🏗️ Architecture
### Key Innovations
@@ -71,7 +81,7 @@ Successfully ran **distributed LLM inference** (Mistral-7B) across all 3 nodes u
- Raw InfiniBand Verbs API (libibverbs)
- Reliable Connected (RC) Queue Pairs
- RoCE v2 over Ethernet
- Host memory staging (GPU→Host→RDMA→Host→GPU)
- Zero-copy on unified memory systems
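For anyone new to the verbs layer, the sketch below shows the rough shape of the RC queue-pair setup the list above refers to. It is illustrative only, not the plugin's connection code; the queue depths and the shared completion queue are assumptions.

```c
/* Sketch: creating a Reliable Connected (RC) queue pair with libibverbs.
 * The plugin's real setup differs; depths here are placeholders. */
#include <infiniband/verbs.h>

struct ibv_qp *create_rc_qp(struct ibv_context *ctx, struct ibv_pd *pd)
{
    /* One CQ shared by sends and receives, for simplicity. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);
    if (!cq)
        return NULL;

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,        /* Reliable Connected transport */
        .cap = {
            .max_send_wr  = 256,
            .max_recv_wr  = 256,
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };

    /* The QP comes back in RESET; it still has to be driven through
     * INIT -> RTR -> RTS with the peer's GID and QP number, which is
     * where the RoCE v2 addressing (GID index, UDP encapsulation) comes in. */
    return ibv_create_qp(pd, &attr);
}
```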
## 📦 Installation
@@ -88,7 +98,7 @@ ibv_devices
### Build
```bash
git clone https://github.com/yourusername/nccl-mesh-plugin.git
git clone https://github.com/autoscriptlabs/nccl-mesh-plugin.git
cd nccl-mesh-plugin
make
```
@@ -211,19 +221,29 @@ for (int i = 0; i < handle->num_addrs; i++) {
| `NCCL_MESH_GID_INDEX` | `3` | RoCE GID index to use |
| `NCCL_MESH_DEBUG` | `0` | Enable plugin debug output |
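The table above maps straight onto plugin init. A minimal, illustrative way to consume these tunables (not the plugin's actual parsing code) looks like this:

```c
#include <stdlib.h>

/* Illustrative only: reading the plugin's tunables at init time.
 * Defaults mirror the table above. */
struct mesh_config {
    int gid_index;   /* NCCL_MESH_GID_INDEX, default 3 */
    int debug;       /* NCCL_MESH_DEBUG, default 0 */
};

static void load_mesh_config(struct mesh_config *cfg)
{
    const char *v;

    cfg->gid_index = (v = getenv("NCCL_MESH_GID_INDEX")) ? atoi(v) : 3;
    cfg->debug     = (v = getenv("NCCL_MESH_DEBUG"))     ? atoi(v) : 0;
}
```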
## 🚧 Limitations
## 🚧 Current Limitations
- **Host memory staging**: GPU memory goes through host (no GPUDirect RDMA yet)
- **Single QP per connection**: No multi-rail aggregation
- **No relay routing**: Non-adjacent nodes can't communicate (fine for fully-connected mesh)
- **RoCE v2 only**: No InfiniBand support (Ethernet only)
- **Single QP per connection** - No multi-rail aggregation yet
- **No relay routing** - Non-adjacent nodes can't communicate (fine for fully-connected mesh)
- **RoCE v2 only** - Ethernet-based RDMA, no native InfiniBand support
## 🗺️ Roadmap
- [ ] GPUDirect RDMA support (bypass host memory)
- [ ] Multi-QP per connection for higher bandwidth
- [ ] Adaptive routing for partial meshes
- [ ] Performance tuning (inline data, signaling)
- [ ] Adaptive routing for partial mesh topologies
- [ ] Performance tuning (inline data, selective signaling)
- [ ] Support for non-unified-memory systems with explicit GPUDirect RDMA
## 🛠️ Hardware Tested
| Component | Specification |
|-----------|--------------|
| Nodes | 3x NVIDIA DGX Spark |
| CPU | NVIDIA Grace (ARM64) |
| GPU | NVIDIA Blackwell |
| Memory | Unified (NVLink-C2C) |
| NICs | ConnectX-7 (100GbE) |
| Cables | Direct-attach QSFP56 |
## 📚 References
@@ -237,8 +257,8 @@ MIT License - see [LICENSE](LICENSE) file.
## 🙏 Acknowledgments
Built to connect three DGX Spark workstations that NVIDIA never intended to be clustered. Sometimes the best solutions come from ignoring "supported configurations."
Built to connect three DGX Spark workstations that NVIDIA never intended to cluster. Sometimes the best solutions come from ignoring "supported configurations."
---
*"The future of distributed AI computing is here."* - Mistral-7B, running on this very plugin
*"The future of distributed AI computing is here."* — Mistral-7B, running distributed inference on this very plugin


@@ -278,28 +278,28 @@ ncclResult_t mesh_regMr(void *comm, void *data, size_t size,
}
```
**Note**: Current implementation uses host memory staging. GPU memory is copied to host, sent via RDMA, then copied back to GPU on the receiver. GPUDirect RDMA would eliminate these copies.
**Unified Memory Note**: On Grace Hopper / DGX Spark systems, GPU and CPU share the same physical memory via NVLink-C2C. This means RDMA registration works directly on GPU-accessible memory without any staging copies - we get GPUDirect-like semantics automatically.
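As a rough illustration (simplified, not the plugin's verbatim code), a registration path on unified memory can collapse to a single `ibv_reg_mr` call regardless of where the pointer came from. The `struct mesh_comm` shape and the function name below are assumptions made for the sketch.

```c
#include <stddef.h>
#include <infiniband/verbs.h>
#include <nccl.h>   /* ncclResult_t, ncclSuccess, ncclSystemError */

/* Hypothetical connection state; the plugin's real struct differs. */
struct mesh_comm {
    struct ibv_pd *pd;
    /* ...queue pairs, peer addressing, etc... */
};

/* On unified memory the pointer NCCL hands us is backed by the same
 * physical memory whether it was allocated for the GPU or the CPU, so
 * one registration covers both cases: no bounce buffer, no cudaMemcpy. */
static ncclResult_t sketch_regMr(struct mesh_comm *comm, void *data,
                                 size_t size, int type, void **mhandle)
{
    struct ibv_mr *mr = ibv_reg_mr(comm->pd, data, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (mr == NULL)
        return ncclSystemError;

    (void)type;      /* host vs CUDA pointer type makes no difference here */
    *mhandle = mr;   /* handed back to NCCL as the opaque memory handle */
    return ncclSuccess;
}
```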
## Performance Considerations
### Current Bottlenecks
1. **Host Memory Staging**: GPU↔Host copies add latency
2. **Single QP**: One Queue Pair per connection limits parallelism
3. **Completion Signaling**: Every operation signals completion
### Achieved Performance
- **8+ GB/s** effective bandwidth
- **~64%** of 100 Gbps line rate
- Zero-copy on unified memory (Grace Hopper)
- Sufficient for distributed ML workloads
### Current Bottlenecks
1. **Single QP**: One Queue Pair per connection limits parallelism
2. **Completion Signaling**: Every operation signals completion
3. **Protocol Overhead**: RC transport has per-message overhead
### Future Optimizations
1. **GPUDirect RDMA**: Register GPU memory directly
2. **Multi-QP**: Multiple QPs per connection
3. **Selective Signaling**: Signal every N operations
4. **Inline Data**: Small messages in WQE
1. **Multi-QP**: Multiple QPs per connection for parallelism
2. **Selective Signaling**: Signal every N operations
3. **Inline Data**: Small messages embedded in WQE
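To make the selective-signaling and inline-data items above concrete, here is an illustrative send path (not current plugin code) that combines both: small payloads travel inline in the work request, and only every Nth work request asks for a completion entry. The inline threshold and signaling interval are placeholders.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

#define SIGNAL_INTERVAL  32   /* placeholder: one signaled WR per 32 sends */
#define INLINE_THRESHOLD 64   /* placeholder: must not exceed the QP's max_inline_data */

int post_optimized_send(struct ibv_qp *qp, struct ibv_mr *mr,
                        void *buf, uint32_t len, uint64_t seq)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id   = seq,
        .sg_list = &sge,
        .num_sge = 1,
        .opcode  = IBV_WR_SEND,
    };

    /* Inline data: small payloads travel inside the work queue entry,
     * skipping the HCA's DMA read of the source buffer. */
    if (len <= INLINE_THRESHOLD)
        wr.send_flags |= IBV_SEND_INLINE;

    /* Selective signaling: only every Nth send generates a completion;
     * RC ordering means the signaled completion retires the ones before it. */
    if (seq % SIGNAL_INTERVAL == 0)
        wr.send_flags |= IBV_SEND_SIGNALED;

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```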
## File Structure