diff --git a/README.md b/README.md
index a769a6d..dc7a65f 100644
--- a/README.md
+++ b/README.md
@@ -33,7 +33,7 @@ Neither works for direct-cabled RDMA meshes. This plugin does.
     (100Gbps)
 ```

-**Three DGX Spark workstations** connected in a triangle mesh with direct 100Gbps RDMA cables. Each link is on a **different subnet** - a configuration NVIDIA never intended to support.
+**Three NVIDIA DGX Spark workstations** connected in a triangle mesh with direct 100Gbps RDMA cables. Each link is on a **different subnet** - a configuration NVIDIA never intended to support.

 ## 🚀 Results

@@ -46,6 +46,16 @@ Neither works for direct-cabled RDMA meshes. This plugin does.

 Successfully ran **distributed LLM inference** (Mistral-7B) across all 3 nodes using NCCL over this custom topology.

+## ⚡ Unified Memory Advantage
+
+On **Grace Hopper / DGX Spark** systems, the GPU and CPU share the same physical memory via NVLink-C2C. This unified memory architecture means:
+
+- **No staging copies** - RDMA operates directly on GPU-accessible memory
+- **GPUDirect-like performance** - Without additional kernel modules or configuration
+- **Simplified memory management** - Register once, use everywhere
+
+The 8+ GB/s bandwidth is the real deal, not bottlenecked by GPU↔Host transfers.
+
 ## 🏗️ Architecture

 ### Key Innovations
@@ -71,7 +81,7 @@ Successfully ran **distributed LLM inference** (Mistral-7B) across all 3 nodes u
 - Raw InfiniBand Verbs API (libibverbs)
 - Reliable Connected (RC) Queue Pairs
 - RoCE v2 over Ethernet
-- Host memory staging (GPU→Host→RDMA→Host→GPU)
+- Zero-copy on unified memory systems

 ## 📦 Installation

@@ -88,7 +98,7 @@ ibv_devices
 ### Build

 ```bash
-git clone https://github.com/yourusername/nccl-mesh-plugin.git
+git clone https://github.com/autoscriptlabs/nccl-mesh-plugin.git
 cd nccl-mesh-plugin
 make
 ```
@@ -211,19 +221,29 @@ for (int i = 0; i < handle->num_addrs; i++) {
 | `NCCL_MESH_GID_INDEX` | `3` | RoCE GID index to use |
 | `NCCL_MESH_DEBUG` | `0` | Enable plugin debug output |

-## 🚧 Limitations
+## 🚧 Current Limitations

-- **Host memory staging**: GPU memory goes through host (no GPUDirect RDMA yet)
-- **Single QP per connection**: No multi-rail aggregation
-- **No relay routing**: Non-adjacent nodes can't communicate (fine for fully-connected mesh)
-- **RoCE v2 only**: No InfiniBand support (Ethernet only)
+- **Single QP per connection** - No multi-rail aggregation yet
+- **No relay routing** - Non-adjacent nodes can't communicate (fine for fully-connected mesh)
+- **RoCE v2 only** - Ethernet-based RDMA, no native InfiniBand support

 ## 🗺️ Roadmap

-- [ ] GPUDirect RDMA support (bypass host memory)
 - [ ] Multi-QP per connection for higher bandwidth
-- [ ] Adaptive routing for partial meshes
-- [ ] Performance tuning (inline data, signaling)
+- [ ] Adaptive routing for partial mesh topologies
+- [ ] Performance tuning (inline data, selective signaling)
+- [ ] Support for non-unified-memory systems with explicit GPUDirect RDMA
+
+## 🛠️ Hardware Tested
+
+| Component | Specification |
+|-----------|--------------|
+| Nodes | 3x NVIDIA DGX Spark |
+| CPU | NVIDIA Grace (ARM64) |
+| GPU | NVIDIA Blackwell |
+| Memory | Unified (NVLink-C2C) |
+| NICs | ConnectX-7 (100GbE) |
+| Cables | Direct-attach QSFP56 |

 ## 📚 References

@@ -237,8 +257,8 @@ MIT License - see [LICENSE](LICENSE) file.

 ## 🙏 Acknowledgments

-Built to connect three DGX Spark workstations that NVIDIA never intended to be clustered. Sometimes the best solutions come from ignoring "supported configurations."
+Built to connect three DGX Spark workstations that NVIDIA never intended to cluster. Sometimes the best solutions come from ignoring "supported configurations."

 ---

-*"The future of distributed AI computing is here."* - Mistral-7B, running on this very plugin
+*"The future of distributed AI computing is here."* — Mistral-7B, running distributed inference on this very plugin
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
index 9400c51..2534d36 100644
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -278,28 +278,28 @@ ncclResult_t mesh_regMr(void *comm, void *data, size_t size,
 }
 ```

-**Note**: Current implementation uses host memory staging. GPU memory is copied to host, sent via RDMA, then copied back to GPU on the receiver. GPUDirect RDMA would eliminate these copies.
+**Unified Memory Note**: On Grace Hopper / DGX Spark systems, GPU and CPU share the same physical memory via NVLink-C2C. This means RDMA registration works directly on GPU-accessible memory without any staging copies - we get GPUDirect-like semantics automatically.

 ## Performance Considerations

-### Current Bottlenecks
-
-1. **Host Memory Staging**: GPU↔Host copies add latency
-2. **Single QP**: One Queue Pair per connection limits parallelism
-3. **Completion Signaling**: Every operation signals completion
-
 ### Achieved Performance

 - **8+ GB/s** effective bandwidth
 - **~64%** of 100 Gbps line rate
+- Zero-copy on unified memory (Grace Hopper)
 - Sufficient for distributed ML workloads

+### Current Bottlenecks
+
+1. **Single QP**: One Queue Pair per connection limits parallelism
+2. **Completion Signaling**: Every operation signals completion
+3. **Protocol Overhead**: RC transport has per-message overhead
+
 ### Future Optimizations

-1. **GPUDirect RDMA**: Register GPU memory directly
-2. **Multi-QP**: Multiple QPs per connection
-3. **Selective Signaling**: Signal every N operations
-4. **Inline Data**: Small messages in WQE
+1. **Multi-QP**: Multiple QPs per connection for parallelism
+2. **Selective Signaling**: Signal every N operations
+3. **Inline Data**: Small messages embedded in WQE

 ## File Structure
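For reference, the sketch below illustrates the zero-copy registration pattern the ARCHITECTURE.md note relies on: on a unified-memory system, the same pointer the GPU works on can be handed straight to `ibv_reg_mr`, with no staging buffer. This is a minimal, self-contained example under stated assumptions, not the plugin's actual `mesh_regMr` code; device selection, error handling, and the QP setup that would follow are simplified, and the buffer here is an ordinary system allocation, which on Grace/NVLink-C2C systems is also GPU-accessible.

```c
/*
 * Minimal sketch of zero-copy RDMA registration on a unified-memory system.
 * Illustrative only (not the plugin's mesh_regMr): the point is that a plain
 * system-memory buffer, which the GPU can also access on Grace/NVLink-C2C,
 * is registered directly with ibv_reg_mr, so no GPU->host staging copy exists.
 *
 * Build (example): gcc mr_example.c -o mr_example -libverbs
 */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    /* Open the first device; the plugin would pick the ConnectX port for a given link. */
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    if (!ctx || !pd) {
        fprintf(stderr, "failed to open device or allocate PD\n");
        return 1;
    }

    /* On a unified-memory system this buffer is directly GPU-accessible;
     * registering it once makes it usable for RDMA without staging copies. */
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    /* In the plugin, lkey/rkey would be exchanged with peers and used in work requests. */
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

On a discrete-GPU system, registering device memory this way would additionally require GPUDirect RDMA support (the nvidia-peermem kernel module), which is why that appears as a separate roadmap item in the README.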