Why does NVIDIA build CPUs? Because the path between CPU and GPU is everything for huge models. Walk through the Grace ARM CPU, NVLink-C2C coherent memory, the GH200 Grace-Hopper and GB200 Grace-Blackwell superchips, and DGX Spark — the workstation form factor of the same idea.
Grace is NVIDIA's bet that the future of AI compute is not "a CPU somewhere with GPUs hanging off PCIe", but a tightly-integrated compute element where CPU and GPU share one coherent memory fabric. This deck walks the architecture and its product line.
By 2021, NVIDIA's frontier-model customers were hitting a wall that no faster GPU could fix: the path between the host CPU and the GPU. The GPU did the matmul, but the CPU still owned the page tables, the file system, the network stack, and the data-prep pipeline — and every byte that crossed between them went through PCIe.
PCIe Gen5 x16 maxes out around 64 GB/s. An H100 with HBM3 reads its weights at ~3.4 TB/s. The host↔device path is 50× narrower than the on-package memory path. Anything that touches the CPU — checkpoint load, tokenizer prep, DataLoader, optimizer state spill — pays this gap.
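To make the gap concrete, here is the arithmetic as a tiny sketch (round numbers from this deck, not measured figures):

```cpp
// Seconds to move `gb` gigabytes over a link sustaining `gbps` GB/s.
double transfer_seconds(double gb, double gbps) { return gb / gbps; }

// Headline numbers from above (per direction, rounded).
const double PCIE5_X16_GBPS = 64.0;   // host <-> device
const double HBM3_GBPS      = 3400.0; // H100 on-package

// The host path is ~50x narrower than the memory path:
double host_vs_hbm_ratio() { return HBM3_GBPS / PCIE5_X16_GBPS; }

// A 100 GB spill pays ~1.6 s per full pass over PCIe, vs ~0.03 s from HBM.
```

Every host-touching stage in the list that follows pays the narrow side of that ratio.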
Even a "GPU workload" needs the host CPU for: data loading and shuffling, page-table management for unified memory, NIC and storage drivers, scheduling, host-side reductions, and Python orchestration. You can't just delete the CPU.
Build a custom ARM CPU tightly coupled to the GPU via a coherent chip-to-chip link. CPU memory becomes a slow tier of GPU memory. Page faults migrate on demand. The CPU↔GPU boundary stops being a copy boundary; it becomes a cache miss.
Apple's unified-memory architecture proved that CPU + GPU sharing one pool of LPDDR is a power and bandwidth win at consumer scale. Grace is the same idea at datacenter scale: one LPDDR5x pool, one coherent address space, one socket.
If the CPU↔GPU link is 50× slower than HBM, then anything that doesn't fit in HBM is paying that 50× penalty. NVLink-C2C makes the link only ~5× slower than HBM, which means CPU memory becomes a usable second tier of VRAM — not a cliff.
Grace is a 72-core Armv9-A server CPU built on TSMC 4N (the same node as Hopper). Cores are ARM Neoverse V2 — the same Cortex-X-derived design used in AWS Graviton4 and several other ARM datacenter parts.
| Product | LPDDR5x capacity | LPDDR5x bandwidth | Typical use |
|---|---|---|---|
| Grace (single die) | 480 GB on-package | ~500 GB/s | Paired with H100/H200 (GH200), each Grace in GB200 |
| Grace CPU Superchip | 2× 480 = 960 GB | 2× ~500 GB/s | Two Grace dies on one module via NVLink-C2C, no GPU; HPC nodes (uncommon) |
NVIDIA chose ARM not for ISA reasons but for customisation. ARM lets them keep their proprietary mesh, drop in a custom system-MMU, and graft on the C2C port without licensing renegotiation. The Neoverse V2 core itself is unmodified; everything around it is custom NVIDIA silicon.
NVLink-C2C ("chip-to-chip") is the link that makes Grace useful as a GPU companion. It connects the Grace die and the paired GPU die over a short on-package trace at 900 GB/s bidirectional, with full cache coherency.
| Fabric | Where | Bandwidth | Coherent? | Notes |
|---|---|---|---|---|
| NVLink-C2C | Grace ↔ GPU on one module | 900 GB/s bidir | Yes | Single-vendor, on-package (NVIDIA ARM CPU + NVIDIA GPU) |
| Intel UPI | Xeon ↔ Xeon socket | ~40 GB/s per link | Yes | CPU↔CPU only; no GPU integration |
| AMD Infinity Fabric | EPYC ↔ EPYC, MI↔MI | ~100 GB/s per link | Yes | Single-vendor; MI300 has on-package CPU+GPU variants |
| CXL 3.0 | Generic accelerator over PCIe | ~128 GB/s per x16 (PCIe 6.0 PHY) | Yes (CXL.cache) | Open standard, but slower than C2C |
| UCIe | Open chiplet standard | varies | Optional | Specifies the PHY; coherency is up to the protocol layer |
The crucial mental shift: C2C is fabric, not network. There's no driver, no DMA descriptor ring, no doorbell. From software's view, Grace memory and GPU memory are one heap. The only thing the runtime decides is placement hints — "prefer GPU" or "prefer CPU" — and the hardware migrates pages accordingly.
GH200 was the first product to ship with Grace + a paired GPU on one module. The pairing is one Grace CPU + one Hopper GPU (an H100 or H200-class die), connected by NVLink-C2C.
The crucial point: the spill from HBM to LPDDR5x happens without software changes. CUDA's unified memory paging treats the LPDDR pool as a slow tier of VRAM. A model that previously needed two H100s (one for weights, one for KV cache + optimizer state) fits one GH200 because the spill tier is now coherent and fast.
If your model fits in HBM with comfortable margin, GH200 is a side-grade vs an SXM H100 server. If your model almost fits but spills 30–100 GB into checkpoint or KV, GH200 turns "I need a second node and tensor parallel" into "one box". That's the operational win NVIDIA is selling.
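That decision can be phrased as a rough fit check. A sketch with illustrative thresholds (96 GB HBM variant, sizes in GB), not a sizing tool:

```cpp
#include <string>

// Rough single-node fit check for a GH200 (96 GB HBM + 480 GB LPDDR5x).
// Sizes in GB; verdict strings are illustrative.
std::string gh200_verdict(double weights, double kv, double opt_state) {
    const double hbm = 96.0, lpddr = 480.0;
    const double total = weights + kv + opt_state;
    if (total <= hbm)         return "fits in HBM: side-grade vs an SXM H100";
    if (total <= hbm + lpddr) return "spills to LPDDR5x at C2C speed: one box";
    return "exceeds one superchip: shard across nodes";
}
```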
GB200 is the Blackwell-era replacement for GH200, but with a key topology change: one Grace + two B200 GPUs on one module. Each B200 has its own dedicated 900 GB/s C2C link to Grace; the two B200s talk to each other over NVLink 5 at 1.8 TB/s.
| Resource | Per GB200 superchip |
|---|---|
| Grace dies | 1 |
| B200 GPU dies | 2 (each is a 2-die package = 4 reticle-limit dies total) |
| HBM3e (GPU side) | 2× 192 GB = 384 GB |
| LPDDR5x (CPU side) | 480 GB |
| Total unified memory | 864 GB |
| FP4 dense compute | ~18 PFLOPS |
| FP8 dense compute | ~9 PFLOPS |
| Grace↔GPU bandwidth | 2× 900 = 1.8 TB/s C2C |
| GPU↔GPU bandwidth (in-package) | 1.8 TB/s NVLink 5 |
| Module weight | ~18 kg |
| Module TDP | ~2700 W |
| Cooling | Liquid only |
GB200 is not really a server CPU; it's the unit cell of the rack. A GB200 NVL72 rack contains 36 of these (2 per compute tray × 18 trays), glued together by NVLink Switch trays into a single 72-GPU coherent NVLink domain. Everything else — cooling design, power delivery, copper backplane — flows from this choice.
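The table's per-superchip numbers scale straightforwardly to the rack. A quick sanity check on the NVL72 totals (36 superchips, figures from the table above):

```cpp
struct RackTotals { int gpus; double hbm_gb; double lpddr_gb; double compute_kw; };

// GB200 NVL72: 36 superchips = 2 per compute tray x 18 trays.
RackTotals nvl72_totals() {
    const int superchips = 36;
    return {
        superchips * 2,        // 72 B200 GPUs in one NVLink domain
        superchips * 384.0,    // ~13.8 TB HBM3e
        superchips * 480.0,    // ~17.3 TB LPDDR5x
        superchips * 2.7,      // ~97 kW for the compute trays alone
    };
}
```

The ~97 kW figure covers only the superchips; switch trays, fans, and power conversion come on top, which is why the rack is liquid-cooled end to end.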
EGM (Extended GPU Memory) is the formal name for the mode where the GPU treats Grace's LPDDR5x as additional VRAM at C2C bandwidth. It's part hardware (the C2C ATS) and part driver: the CUDA runtime exposes Grace memory as one giant pool of GPU-accessible memory, paged on demand.
From a CUDA kernel's viewpoint, EGM memory is indistinguishable from regular VRAM — the same pointers, the same load/store instructions. The difference is purely in where the page resides:
```cuda
// All allocations go through the unified-memory allocator.
// On Grace-Hopper / Grace-Blackwell, the backing store can be HBM or LPDDR5x.
size_t bytes = 200ULL * 1024 * 1024 * 1024;   // 200 GB — bigger than HBM
float* x;
cudaMallocManaged(&x, bytes);

// Tell the driver our preferred placement — "prefer GPU" or "prefer CPU".
// Without a hint, the runtime decides based on first-touch and access counts.
cudaMemAdvise(x, bytes, cudaMemAdviseSetPreferredLocation, 0);  // device 0 = GPU

// Migration is automatic on access:
//   pages touched by the GPU live in HBM
//   pages touched by Grace cores live in LPDDR5x
//   pages cold for > N accesses get evicted to the other tier
my_kernel<<<blocks, threads>>>(x, bytes);  // just works — no copies, no streams

// You can also pre-fetch a hot range to HBM before launch:
cudaMemPrefetchAsync(x, hot_bytes, 0, stream);
```
Loading a 200 GB FP8 checkpoint into a single GH200 happens at C2C speed (~900 GB/s) once it is in Grace memory — vs ~64 GB/s over PCIe Gen5 to a discrete H100. Checkpoint cold-starts drop from minutes to seconds.
For long-context (128k+) inference with many concurrent users, the KV cache often dwarfs the weights. Spilling cold KV pages to LPDDR5x at ~500 GB/s is "good enough" for prefix-cached requests; HBM holds only the active KV.
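The sizing behind that claim, as a hedged sketch. The model shape below is an assumed 70B-class config with grouped-query attention (80 layers, 8 KV heads, head_dim 128, FP8 KV), not any specific product's numbers:

```cpp
// KV-cache bytes per sequence:
// 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem * tokens.
double kv_cache_gb(int layers, int kv_heads, int head_dim,
                   int bytes_per_elem, long tokens) {
    return 2.0 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9;
}

// At 128k context this is ~21 GB per sequence; 16 concurrent users is
// ~344 GB, far beyond any HBM stack but comfortable in 480 GB of LPDDR5x.
```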
FP32 master weights, AdamW moments, and gradient buffers can be 6–8× the size of FP8 weights. Pinning these to Grace LPDDR5x while the forward/backward pass uses HBM doubles the trainable model size per node.
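As a worked example (assumed recipe: FP8 weights in HBM, FP32 master weights plus two FP32 AdamW moments pinned in LPDDR5x; real frameworks and sharding schemes vary):

```cpp
// Optimizer-state footprint pinned to Grace memory, in GB:
// 4 B FP32 master + 8 B for the two AdamW moments, per parameter.
double optimizer_state_gb(double params_billion) {
    return params_billion * (4.0 + 8.0);
}

// A 30B model: 30 GB of FP8 weights in HBM, but 360 GB of optimizer
// state — which fits in 480 GB of LPDDR5x instead of forcing a second node.
```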
RAG indexes (FAISS, NeMo Retriever) are millions of float32 vectors that are read once per query. They fit comfortably in 480 GB of LPDDR5x and stream across C2C as needed — no need for an Optane-style tier.
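The capacity math is simple. For a flat FP32 index (vector count and dimension below are illustrative, not from any benchmark):

```cpp
// Flat FP32 vector index size: vectors * dims * 4 bytes, in GB.
double index_gb(double n_vectors, int dims) {
    return n_vectors * dims * 4.0 / 1e9;
}

// 100M vectors at dim 768: ~307 GB — sits in LPDDR5x with room to spare.
```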
If your access pattern thrashes pages back and forth (e.g. interleaved GPU↔CPU updates on the same buffer), every miss costs a round trip over C2C. Always set cudaMemAdvise hints for your hot data structures, and use Nsight Systems (`nsys profile` with unified-memory page-fault tracing enabled) to confirm pages aren't migrating on every launch.
Deck 09 (Blackwell) and deck 11 (Datacenter Platforms) covered NVL72 in detail; this is the recap from Grace's perspective.
NVL72 is the first NVIDIA product designed as a rack-scale single computer. The unit of purchase, the unit of scheduling in CUDA / NCCL, and the unit of failure domain are all the rack. A 1T-parameter MoE model is meant to be trained or served on one or two NVL72s — not on a sea of HGX trays linked by IB.
Grace runs full ARM64 Linux. By 2026 this is unremarkable for distros and most major workloads — but there are a few corners where you still need to check.
| Layer | ARM64 status on Grace |
|---|---|
| OS | Ubuntu 22.04 / 24.04, RHEL 9, SLES 15 SP5+, Rocky 9 — all official |
| CUDA toolkit | ARM64 builds since CUDA 11.4; current toolchain is fully ARM64 native |
| cuDNN / NCCL / TensorRT | ARM64 builds shipped alongside x86_64 since 2022 |
| PyTorch | aarch64+cuda wheels standard since PyTorch 2.0; nightly builds first-class |
| JAX | ARM64 + CUDA support official since 2023 |
| Triton | ARM64 builds, autotune cache works |
| NGC catalog | All major NVIDIA images ship linux/arm64 manifests |
| vLLM / TensorRT-LLM | ARM64 wheels and containers since 2024 |
The single most common ARM64-on-Grace problem is container image architecture. A lot of community ML images are built for linux/amd64 only, and pulling them onto Grace either fails or silently runs under emulation (which is catastrophically slow).
```bash
# Inspect a manifest list for both architectures:
docker buildx imagetools inspect vllm/vllm-openai:latest
# Look for entries like:
#   Manifests:
#     Name: ...@sha256:...  Platform: linux/amd64
#     Name: ...@sha256:...  Platform: linux/arm64

# Force the arch on pull (catches "wrong arch" early):
docker pull --platform=linux/arm64 vllm/vllm-openai:latest

# Build multi-arch images yourself:
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --push -t myorg/myapp:1.0 .
```
- `nsys` and `nsight-compute` have ARM64 builds.
- `perf` works the same as on x86; PMU events differ (Neoverse V2-specific), so run `perf list` first.
- `docker buildx` with QEMU is fine for building arm64 images from x86 CI runners.

If your stack is "NGC image → PyTorch → HuggingFace model", you will not notice you are on ARM64. If your stack involves compiled-from-source third-party CUDA extensions, you may spend an afternoon producing ARM64 wheels. None of this is hard; it is just one more matrix dimension in CI.
DGX Spark (formerly Project DIGITS) is the consumer-form-factor cousin of GB200. It packages GB10 — a Grace-Blackwell SoC pairing a 20-core ARM CPU (not the full 72-core Grace die) with a small Blackwell GPU — into a desk-sized box with 128 GB of unified LPDDR5x, running the same software stack as the rack.
The architectural payoff: code that runs on Spark runs on a GB200 NVL72 with no rewrites. CUDA features detect by capability, not by SKU; cudaMallocManaged behaves identically (the rack simply has bigger, faster memory tiers); ARM64 wheels are the same artifact. Spark exists so the development cycle does not require the rack.
A 5090 has more bandwidth (~1.8 TB/s GDDR7) and more raw compute, so for fits-in-32-GB workloads it wins on tok/s. Spark wins when you need capacity: 128 GB unified beats 32 GB VRAM + slow PCIe to system RAM. Pick by VRAM headroom, not by FLOPS.
Grace + C2C is a real architectural shift, but it is not a free win on every workload. Here is where the win is real and where it is not.
If your weights + KV cache + optimizer state spill 30–200 GB out of HBM, GH200 / GB200 keep that spill at C2C bandwidth (~900 GB/s) instead of PCIe (~64 GB/s). An order-of-magnitude speed-up on that tier alone.
72 high-IPC Neoverse V2 cores with SVE2 chew through tokenization, jpeg decode, augmentations — without burning host↔device PCIe bandwidth, because the result lands in shared memory the GPU can already read.
A 200 GB checkpoint streams from NVMe via Grace into LPDDR5x at PCIe Gen5 speed, then the GPU pages from LPDDR5x at C2C speed. Cold load time often drops from minutes to under a minute.
KV cache for 128k+ contexts at high concurrency dwarfs weights. Spilling cold KV pages to LPDDR5x lets you serve 4–8× more concurrent long-context users per HBM-byte than a discrete GPU could.
Hundreds of GB of FAISS / NeMo Retriever index sit happily in LPDDR5x. The GPU pulls hot shards over C2C; queries don't touch PCIe.
Grace cores running NumPy / Pandas / DALI compete favourably with x86 once the SVE2 paths are linked. Even when the host stages were not the bottleneck on x86, they become a win here because the output lands in memory the GPU can already read — no PCIe copy.
Anecdotal numbers from NVIDIA's 2024 disclosures and customer benchmarks: a GH200 delivers roughly 2× perf/W versus an x86 + PCIe + H100 PCIe equivalent on transformer-training workloads. The win is not from FLOPS — the GPU dies are the same — but from skipping PCIe round-trips and from LPDDR5x being substantially more efficient than DDR5.
Grace is best understood as a memory-system upgrade for the GPU, not as a CPU upgrade. The ARM cores are good but not magical. The win comes from one coherent address space and 900 GB/s between CPU and GPU. If your workload doesn't cross that boundary much, Grace is a side-grade. If it does, it is a step change.
Pick a system and a workload. The picker tells you whether the system fits, where the bottleneck is, and gives a one-line verdict.
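A toy version of what such a picker computes, using the systems quoted in this deck (capacities in GB, link in GB/s; the verdict strings are illustrative):

```cpp
#include <string>

struct System { std::string name; double hbm_gb; double cpu_gb; double link_gbps; };

// One-line verdict: does the working set fit, and where is the bottleneck?
std::string verdict(const System& s, double working_set_gb) {
    if (working_set_gb <= s.hbm_gb)
        return s.name + ": fits in GPU memory, bottleneck is compute";
    if (working_set_gb <= s.hbm_gb + s.cpu_gb)
        return s.name + ": spills to CPU memory over the " +
               std::to_string((int)s.link_gbps) + " GB/s link";
    return s.name + ": does not fit, shard across nodes";
}
```

For a 200 GB working set, `{"GH200", 96, 480, 900}` reports a spill over a 900 GB/s link, while an H100-plus-host-RAM system reports the same spill over a 64 GB/s link — the whole argument of this deck in one comparison.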