docs: clarify unified memory advantage on Grace Hopper

- Remove incorrect 'host memory staging' limitation
- Add section explaining NVLink-C2C unified memory benefits
- Zero-copy RDMA works automatically on DGX Spark
autoscriptlabs 2026-01-09 14:16:48 -05:00
parent 031bc48953
commit 201ec9e321
2 changed files with 44 additions and 24 deletions

@@ -278,28 +278,28 @@ ncclResult_t mesh_regMr(void *comm, void *data, size_t size,
}
```
-**Note**: Current implementation uses host memory staging. GPU memory is copied to host, sent via RDMA, then copied back to GPU on the receiver. GPUDirect RDMA would eliminate these copies.
+**Unified Memory Note**: On Grace Hopper / DGX Spark systems, the GPU and CPU share the same physical memory via NVLink-C2C. RDMA registration therefore works directly on GPU-accessible memory without any staging copies, giving GPUDirect-like semantics automatically.
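For illustration only (this is not part of the diff above, and not the plugin's actual `mesh_regMr` implementation): a minimal sketch of the registration path the note describes, assuming a standard ibverbs setup. On a cache-coherent NVLink-C2C system the same pointer the GPU uses can, per the note, be handed straight to `ibv_reg_mr()`; the `register_buffer` helper and `reg_handle` struct below are hypothetical names introduced here.

```c
/* Hypothetical sketch: register a GPU-accessible buffer for RDMA without a
 * host staging copy. Assumes cache-coherent unified memory (NVLink-C2C), so
 * the pointer the GPU writes to is the same one handed to the HCA. */
#include <infiniband/verbs.h>
#include <stddef.h>

struct reg_handle {
    struct ibv_mr *mr;   /* memory region covering the buffer */
    void          *addr; /* base address that was registered */
};

/* Register `data` (which may be GPU-resident on unified memory) with the
 * protection domain `pd`. Returns 0 on success, -1 on failure. */
static int register_buffer(struct ibv_pd *pd, void *data, size_t size,
                           struct reg_handle *out)
{
    out->mr = ibv_reg_mr(pd, data, size,
                         IBV_ACCESS_LOCAL_WRITE |
                         IBV_ACCESS_REMOTE_WRITE |
                         IBV_ACCESS_REMOTE_READ);
    if (out->mr == NULL)
        return -1;  /* on non-coherent systems this path typically needs
                       GPUDirect / peer-memory support instead */
    out->addr = data;
    return 0;
}
```

On systems without coherent unified memory, the same call would generally require GPUDirect support or fall back to the host-staging path that the removed note described.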
## Performance Considerations
-### Current Bottlenecks
-1. **Host Memory Staging**: GPU↔Host copies add latency
-2. **Single QP**: One Queue Pair per connection limits parallelism
-3. **Completion Signaling**: Every operation signals completion
+### Achieved Performance
+- **8+ GB/s** effective bandwidth
+- **~64%** of 100 Gbps line rate
+- Zero-copy on unified memory (Grace Hopper)
+- Sufficient for distributed ML workloads
+### Current Bottlenecks
+1. **Single QP**: One Queue Pair per connection limits parallelism
+2. **Completion Signaling**: Every operation signals completion
+3. **Protocol Overhead**: RC transport has per-message overhead
### Future Optimizations
-1. **GPUDirect RDMA**: Register GPU memory directly
-2. **Multi-QP**: Multiple QPs per connection
-3. **Selective Signaling**: Signal every N operations
-4. **Inline Data**: Small messages in WQE
+1. **Multi-QP**: Multiple QPs per connection for parallelism
+2. **Selective Signaling**: Signal every N operations
+3. **Inline Data**: Small messages embedded in WQE
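Again for illustration only (not part of the diff): a sketch of how the selective-signaling and inline-data items in the new list could look with plain ibverbs, assuming the QP was created with `sq_sig_all = 0` and a large enough `max_inline_data`. The `post_send` helper and the `SIGNAL_EVERY` / `INLINE_THRESHOLD` constants are hypothetical names, not part of the plugin.

```c
/* Illustrative sketch of selective signaling and inline data on the send
 * path. Assumes the QP requests completions per-WR (sq_sig_all = 0) and
 * supports inlining payloads up to INLINE_THRESHOLD bytes. */
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

#define SIGNAL_EVERY     64   /* request a CQE only for every Nth send */
#define INLINE_THRESHOLD 128  /* copy small payloads directly into the WQE */

static int post_send(struct ibv_qp *qp, struct ibv_mr *mr,
                     void *buf, size_t len, unsigned seqno)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = seqno,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = 0,
    }, *bad = NULL;

    if (len <= INLINE_THRESHOLD)
        wr.send_flags |= IBV_SEND_INLINE;    /* payload embedded in the WQE */
    if (seqno % SIGNAL_EVERY == 0)
        wr.send_flags |= IBV_SEND_SIGNALED;  /* generate a completion here */

    return ibv_post_send(qp, &wr, &bad);
}
```

Note that with selective signaling the completion queue still has to be drained periodically, since unsignaled work requests only release their send-queue slots once a later signaled request completes.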
## File Structure