NVIDIA GPU Architectures Series — Presentation 15

Grace — NVIDIA's ARM Datacenter CPU and the GH200 / GB200 Superchips

Why does NVIDIA build CPUs? Because the path between CPU and GPU is everything for huge models. Walk through the Grace ARM CPU, NVLink-C2C coherent memory, the GH200 Grace-Hopper and GB200 Grace-Blackwell superchips, and DGX Spark — the workstation form factor of the same idea.

00

Topics We'll Cover

Grace is NVIDIA's bet that the future of AI compute is not "a CPU somewhere with GPUs hanging off PCIe", but a tightly-integrated compute element where CPU and GPU share one coherent memory fabric. This deck walks the architecture and its product line.

01

Why NVIDIA Built a CPU

By 2021, NVIDIA's frontier-model customers were hitting a wall that no faster GPU could fix: the path between the host CPU and the GPU. The GPU did the matmul, but the CPU still owned the page tables, the file system, the network stack, and the data-prep pipeline — and every byte that crossed between them went through PCIe.

The x86 + PCIe bottleneck

PCIe Gen5 x16 maxes out around 64 GB/s. An H100 with HBM3 reads its weights at ~3.4 TB/s. The host↔device path is 50× narrower than the on-package memory path. Anything that touches the CPU — checkpoint load, tokenizer prep, DataLoader, optimizer state spill — pays this gap.

The CPU still owns the work

Even a "GPU workload" needs the host CPU for: data loading and shuffling, page-table management for unified memory, NIC and storage drivers, scheduling, host-side reductions, and Python orchestration. You can't just delete the CPU.

NVIDIA's bet (2021)

Build a custom ARM CPU tightly coupled to the GPU via a coherent chip-to-chip link. CPU memory becomes a slow tier of GPU memory. Page faults migrate on demand. The CPU↔GPU boundary stops being a copy boundary; it becomes a cache miss.

Inspired by Apple M-series

Apple's unified-memory architecture proved that CPU + GPU sharing one pool of LPDDR is a power and bandwidth win at consumer scale. Grace is the same idea at datacenter scale: one LPDDR5x pool, one coherent address space, one socket.

[Figure: CPU↔GPU path bandwidth, GB/s, log scale]
  PCIe Gen4 x16             ~32 GB/s
  PCIe Gen5 x16             ~64 GB/s
  NVLink-C2C (Grace↔GPU)     900 GB/s
  Grace LPDDR5x             ~480 GB/s
  H100 HBM3                 ~3350 GB/s
  B200 HBM3e                ~8000 GB/s
PCIe is two orders of magnitude slower than HBM — C2C closes most of the gap.

The one-line argument

If the CPU↔GPU link is 50× slower than HBM, then anything that doesn't fit in HBM is paying that 50× penalty. NVLink-C2C makes the link only ~5× slower than HBM, which means CPU memory becomes a usable second tier of VRAM — not a cliff.
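
To make the ratio concrete, here is a back-of-envelope sketch (plain C; the bandwidth figures are the rounded numbers from the chart above) of how long one full pass over a 100 GB spill tier takes on each path:

// Back-of-envelope: seconds per full pass over a 100 GB spill tier.
#include <stdio.h>

int main(void) {
    const double spill_gb = 100.0;   // working set that does not fit in HBM
    const struct { const char *path; double gbps; } tiers[] = {
        { "PCIe Gen5 x16", 64.0 },
        { "NVLink-C2C",   900.0 },
        { "H100 HBM3",   3350.0 },
    };
    for (int i = 0; i < 3; i++)
        printf("%-14s %6.0f GB/s -> %5.2f s per pass\n",
               tiers[i].path, tiers[i].gbps, spill_gb / tiers[i].gbps);
    // Prints ~1.56 s over PCIe, ~0.11 s over C2C, ~0.03 s in HBM:
    // the 50x cliff becomes roughly a 4x step.
    return 0;
}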

02

Grace — The CPU

Grace is a 72-core Armv9.2-A server CPU built on TSMC 4N (the same node as Hopper). Cores are ARM Neoverse V2 — the same Cortex-X-derived design used in AWS Graviton 4 and several other ARM datacenter parts.

Cores & ISA

  • 72× Neoverse V2 (Armv9.2-A)
  • Wide out-of-order core, deep ROB
  • Per-core: 64 KB L1-I, 64 KB L1-D, 1 MB L2
  • SVE2 with 4× 128-bit vector pipes
  • NEON / AdvSIMD compatibility
  • BF16 / FP16 vector dot-product

Caches & mesh

  • ~117 MB shared L3 on a scalable coherent mesh
  • Mesh-based interconnect — not a ring
  • Hardware coherency to the paired GPU
  • System-MMU for two-level translation

Memory

  • On-package LPDDR5x (no DIMMs)
  • Per-Grace capacity: 480 GB
  • Per-Grace bandwidth: ~500 GB/s
  • ECC throughout
  • Lower power per GB than DDR5

I/O & power

  • PCIe Gen5 for NICs / NVMe / accelerators
  • NVLink-C2C as primary GPU path (900 GB/s)
  • TDP ~250–500 W for the full Grace die
  • System Management on a dedicated controller

Grace products

Product             | LPDDR5x capacity   | LPDDR5x bandwidth | Typical use
Grace (single die)  | 480 GB on-package  | ~500 GB/s         | Paired with H100/H200 in GH200; one per GB200 superchip
Grace CPU Superchip | 2× 480 GB = 960 GB | 2× ~500 GB/s      | Two Grace dies on one module via NVLink-C2C, no GPU; HPC nodes (uncommon)
[Figure: Grace die schematic floor-plan: 72× Neoverse V2 cores (Armv9.2-A, SVE2) around ~117 MB of L3 on the scalable coherent mesh, flanked by two LPDDR5x pools (~240 GB at ~240 GB/s each), with PCIe Gen5 and the C2C port at the edge.]
Why Neoverse V2 specifically

NVIDIA chose ARM not for ISA reasons but for customisation. ARM lets them keep their proprietary mesh, drop in a custom system-MMU, and graft on the C2C port without licensing renegotiation. The Neoverse V2 core itself is unmodified; everything around it is custom NVIDIA silicon.

03

NVLink-C2C — The Bridge

NVLink-C2C ("chip-to-chip") is the link that makes Grace useful as a GPU companion. It connects the Grace die and the paired GPU die over a short on-package trace at 900 GB/s bidirectional, with full cache coherency.

[Figure: Grace + GPU on one module, the C2C link: Grace CPU (72× Neoverse V2, 117 MB L3, SVE2 vector) with ~480 GB LPDDR5x at ~480 GB/s, joined by NVLink-C2C (900 GB/s, coherent) to the Hopper / Blackwell die (SMs + Tensor Cores, L2 + TMA, HBM3 / HBM3e at ~3.4–8 TB/s).]

What "coherent" actually means here

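
A minimal sketch of what that buys in code, assuming a GH200 / GB200-class system where concurrentManagedAccess reports 1; the kernel and the handshake protocol are invented for illustration. A Grace core and a running GPU kernel pass a value back and forth through one system-scope atomic, with no cudaMemcpy and no events:

// Hypothetical CPU↔GPU handshake over the coherent fabric (illustrative only).
#include <cuda/atomic>
#include <cuda_runtime.h>
#include <cstdio>
#include <new>

using Flag = cuda::atomic<int, cuda::thread_scope_system>;

__global__ void wait_then_reply(Flag* flag) {
    while (flag->load() != 1) { }   // GPU spins until a Grace core publishes 1...
    flag->store(2);                 // ...then replies in place
}

int main() {
    Flag* flag;
    cudaMallocManaged(&flag, sizeof(Flag));
    new (flag) Flag{0};             // construct the atomic in the shared heap

    wait_then_reply<<<1, 1>>>(flag);
    flag->store(1);                 // CPU store, observed by the running kernel
    while (flag->load() != 2) { }   // CPU load, observing the GPU's store
    cudaDeviceSynchronize();
    printf("round trip complete, no copies involved\n");
    cudaFree(flag);
    return 0;
}
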
How C2C compares to other CPU↔CPU / die-to-die fabrics

Fabric              | Where                         | Bandwidth          | Coherent?       | Notes
NVLink-C2C          | Grace ↔ GPU on one module     | 900 GB/s bidir     | Yes             | Single-vendor: NVIDIA ARM CPU + NVIDIA GPU
Intel UPI           | Xeon ↔ Xeon socket            | ~40 GB/s per link  | Yes             | CPU↔CPU only; no GPU integration
AMD Infinity Fabric | EPYC ↔ EPYC, MI ↔ MI          | ~100 GB/s per link | Yes             | Single-vendor; MI300 has on-package CPU+GPU variants
CXL 2.0 (PCIe Gen5) | Generic accelerator over PCIe | ~64 GB/s per x16   | Yes (CXL.cache) | Open standard, but far slower than C2C
UCIe                | Open chiplet standard         | varies             | Optional        | Specifies the PHY; coherency is up to the protocol layer
It's not a NIC

The crucial mental shift: C2C is fabric, not network. There's no driver, no DMA descriptor ring, no doorbell. From software's view, Grace memory and GPU memory are one heap. The only thing the runtime decides is placement hints — "prefer GPU" or "prefer CPU" — and the hardware migrates pages accordingly.

04

GH200 — The Grace-Hopper Superchip

GH200 was the first product to ship with Grace + a paired GPU on one module. The pairing is one Grace CPU + one Hopper GPU (an H100 or H200-class die), connected by NVLink-C2C.

GH200 480GB

  • Grace 480 GB LPDDR5x
  • 96 GB HBM3 (Hopper H100 die)
  • Total unified: 576 GB
  • Aggregate memory bandwidth: HBM3 ~3.4 TB/s + LPDDR ~480 GB/s
  • Released 2023

GH200 with H200

  • Grace 480 GB LPDDR5x
  • 141 GB HBM3e (this is the "H200" inside)
  • Total unified: 621 GB
  • HBM3e bandwidth ~4.8 TB/s
  • Refreshed 2024 — the volume Helios / Alps part

Form factors

GH200 ships as a superchip module in MGX-based reference servers (air- and liquid-cooled), and in multi-module boards such as the dual-superchip GH200 NVL2.

Why a 100B model fits a single GH200 cleanly

100B-param model, FP8:
  Hot weights + activations → 96–141 GB HBM3/3e (active matmul)
    ↓ via NVLink-C2C, 900 GB/s
  KV cache, optimizer state, checkpoint pages → Grace 480 GB LPDDR5x
    ↓
  Cold checkpoint shards → NVMe over PCIe Gen5

The crucial point: the spill from HBM to LPDDR5x happens without software changes. CUDA's unified memory paging treats the LPDDR pool as a slow tier of VRAM. A model that previously needed two H100s (one for weights, one for KV cache + optimizer state) fits one GH200 because the spill tier is now coherent and fast.

The win is for "almost-fits" models

If your model fits in HBM with comfortable margin, GH200 is a side-grade vs an SXM H100 server. If your model almost fits but spills 30–100 GB into checkpoint or KV, GH200 turns "I need a second node and tensor parallel" into "one box". That's the operational win NVIDIA is selling.

05

GB200 — The Grace-Blackwell Superchip

GB200 is the Blackwell-era replacement for GH200, but with a key topology change: one Grace + two B200 GPUs on one module. Each B200 has its own dedicated 900 GB/s C2C link to Grace; the two B200s talk to each other over NVLink 5 at 1.8 TB/s.

[Figure: GB200 superchip, 1 Grace + 2 B200: Grace (480 GB LPDDR5x, 480 GB/s) connects to each B200 (192 GB HBM3e, ~8 TB/s, FP4 / FP8 tensor cores) over its own 900 GB/s C2C link; the two B200s connect to each other over NVLink 5 at 1.8 TB/s.]

Numbers per superchip

Resource                       | Per GB200 superchip
Grace dies                     | 1
B200 GPU dies                  | 2 (each a 2-die package = 4 reticle-limit dies total)
HBM3e (GPU side)               | 2× 192 GB = 384 GB
LPDDR5x (CPU side)             | 480 GB
Total unified memory           | 864 GB
FP4 dense compute              | ~18 PFLOPS
FP8 dense compute              | ~9 PFLOPS
Grace↔GPU bandwidth            | 2× 900 GB/s = 1.8 TB/s C2C
GPU↔GPU bandwidth (in-package) | 1.8 TB/s NVLink 5
Module weight                  | ~18 kg
Module TDP                     | ~2700 W
Cooling                        | Liquid only

The building block of NVL72

GB200 is not really a server CPU; it's the unit cell of the rack. A GB200 NVL72 rack contains 36 of these (2 per compute tray × 18 trays), glued together by NVLink Switch trays into a single 72-GPU coherent NVLink domain. Everything else — cooling design, power delivery, copper backplane — flows from this choice.
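
A quick sanity check of the rack arithmetic this implies, using only the per-superchip numbers from the table above (the FP4 total is dense throughput under this deck's ~18 PFLOPS per-superchip figure):

// NVL72 totals implied by the GB200 unit cell (numbers from this deck).
#include <cstdio>

int main() {
    const int superchips = 36;         // 2 per compute tray × 18 trays
    const double hbm_gb    = 384.0;    // per superchip, 2× 192 GB HBM3e
    const double lpddr_gb  = 480.0;    // per superchip, Grace LPDDR5x
    const double fp4_pflop = 18.0;     // per superchip, dense
    printf("GPUs %d | HBM3e %.1f TB | LPDDR5x %.1f TB | unified %.1f TB | FP4 %.0f PF\n",
           superchips * 2,
           superchips * hbm_gb   / 1000.0,
           superchips * lpddr_gb / 1000.0,
           superchips * (hbm_gb + lpddr_gb) / 1000.0,
           superchips * fp4_pflop);
    // GPUs 72 | HBM3e 13.8 TB | LPDDR5x 17.3 TB | unified ~31 TB | FP4 ~648 PF
    return 0;
}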

06

EGM — Extended GPU Memory

EGM (Extended GPU Memory) is the formal name for the mode where the GPU treats Grace's LPDDR5x as additional VRAM at C2C bandwidth. It's part hardware (the C2C ATS) and part driver: the CUDA runtime exposes Grace memory as one giant pool of GPU-accessible memory, paged on demand.

What the kernel actually sees

From a CUDA kernel's viewpoint, EGM memory is indistinguishable from regular VRAM — the same pointers, the same load/store instructions. The difference is purely in where the page resides:

  HBM3e (B200)      ~5–8 TB/s · ns latency · ~192 GB
    ↕ C2C path — 900 GB/s (the link itself)
  LPDDR5x (Grace)   ~500 GB/s · ~10× HBM latency · ~480 GB
    ↕ PCIe Gen5
  NVMe              ~14 GB/s · terabytes

How a kernel allocates EGM memory

cudaMallocManaged on GH200 / GB200 — one heap, two physical tiers
#include <cuda_runtime.h>

// All allocations go through the unified-memory allocator. On Grace-Hopper /
// Grace-Blackwell, the backing store can be HBM or LPDDR5x.
__global__ void my_kernel(float* x, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;          // any access pattern behaves the same way
}

void oversubscribe(cudaStream_t stream) {
    size_t bytes = 200ULL * 1024 * 1024 * 1024;   // 200 GB — bigger than HBM
    float* x;
    cudaMallocManaged(&x, bytes);

    // Tell the driver our preferred placement — "prefer GPU" or "prefer CPU".
    // Without a hint, the runtime decides based on first-touch and access counters.
    cudaMemAdvise(x, bytes, cudaMemAdviseSetPreferredLocation, 0);  // device 0 = GPU

    // Migration is automatic on access:
    //   pages touched by the GPU live in HBM
    //   pages touched by Grace cores live in LPDDR5x
    //   cold pages are evicted to the other tier under memory pressure
    my_kernel<<<1024, 256>>>(x, bytes / sizeof(float));  // just works — no copies, no streams

    // You can also pre-fetch a hot range to HBM before launch:
    size_t hot_bytes = 64ULL << 30;                      // e.g. the resident working set
    cudaMemPrefetchAsync(x, hot_bytes, 0, stream);
}

Where EGM earns its keep

Model loading

Loading a 200 GB FP8 checkpoint into a single GH200 happens at C2C speed (~900 GB/s) once it is in Grace memory — vs ~64 GB/s over PCIe Gen5 to a discrete H100. Checkpoint cold-starts drop from minutes to seconds.

KV cache offload at scale

For long-context (128k+) inference with many concurrent users, the KV cache often dwarfs the weights (sizing sketch below). Spilling cold KV pages to LPDDR5x at 480 GB/s is "good enough" for prefix-cached requests; HBM holds only the active KV.

Optimizer state during training

FP32 master weights, AdamW moments, and gradient buffers can be 6–8× the size of FP8 weights. Pinning these to Grace LPDDR5x while the forward/backward pass uses HBM doubles the trainable model size per node.

Vector search indexes

RAG indexes (FAISS, NeMo Retriever) are millions of float32 vectors that are read once per query. They fit comfortably in 480 GB of LPDDR5x and stream across C2C as needed — no need for an Optane-style tier.
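
The sizing sketch behind the KV-cache claim, for a hypothetical GQA model (80 layers, 8 KV heads, head_dim 128, FP8; illustrative numbers, not any specific model's config):

// KV cache bytes per token = 2 (K and V) × layers × kv_heads × head_dim × dtype bytes.
#include <cstdio>

int main() {
    const double layers = 80, kv_heads = 8, head_dim = 128, dtype_bytes = 1;  // FP8
    const double ctx_tokens = 128.0 * 1024, users = 32;

    double per_token_b = 2 * layers * kv_heads * head_dim * dtype_bytes;
    double per_user_gb = per_token_b * ctx_tokens / 1e9;
    printf("%.0f KB/token -> %.1f GB per 128k-token user -> %.0f GB for %.0f users\n",
           per_token_b / 1024, per_user_gb, per_user_gb * users, users);
    // ~160 KB/token -> ~21 GB per user -> ~687 GB for 32 users: far beyond HBM,
    // but the cold majority fits in the 480 GB LPDDR5x tier.
    return 0;
}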

When EGM hurts

If your access pattern thrashes pages back and forth (e.g. interleaved GPU↔CPU updates on the same buffer), every miss costs a round trip over C2C. Always set cudaMemAdvise hints for your hot data structures (a sketch follows), and profile with Nsight Systems to confirm pages aren't migrating on every launch (nvprof predates Hopper and cannot profile these GPUs).
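
A sketch of that hint pattern, under the same setup as the allocation example in section 06; the buffer names and the split between them are hypothetical:

// Pin placement so pages stop ping-ponging between HBM and LPDDR5x.
#include <cuda_runtime.h>

void pin_tiers(float* weights, size_t wbytes, float* stats, size_t sbytes, int gpu) {
    // Weights: read every step by the GPU, never written. Read-mostly data can be
    // duplicated read-only on both sides instead of migrating.
    cudaMemAdvise(weights, wbytes, cudaMemAdviseSetReadMostly, gpu);

    // Stats: updated by Grace cores, read occasionally by the GPU. Keep the pages
    // resident on the CPU side and let the GPU read them over C2C, rather than
    // migrating the same pages back and forth on every launch.
    cudaMemAdvise(stats, sbytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    cudaMemAdvise(stats, sbytes, cudaMemAdviseSetAccessedBy, gpu);
}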

07

NVL72 — 72 GPUs in One NVLink Domain

Deck 09 (Blackwell) and deck 11 (Datacenter Platforms) covered NVL72 in detail; this is the recap from Grace's perspective.

[Figure: GB200 NVL72 rack elevation: PDU and coolant manifold on top, 18 compute trays (2× GB200 each) split around 9 NVLink Switch trays, a copper backplane joining all 72 B200 into one NVLink-5 domain, ~120 kW of liquid cooling.]

The numbers

  • 18 compute trays × 2 GB200 = 36 superchips = 36 Grace + 72 B200
  • 9 NVLink Switch trays — copper backplane, not optical
  • One 72-GPU coherent NVLink-5 domain
  • ~31 TB total unified memory (13.8 TB HBM3e + 17.3 TB LPDDR5x)
  • ~120 kW per rack, liquid cooled

A rack is the new server

NVL72 is the first NVIDIA product designed as a rack-scale single computer. The unit of purchase, the unit of scheduling in CUDA / NCCL, and the unit of failure domain are all the rack. A 1T-parameter MoE model is meant to be trained or served on one or two NVL72s — not on a sea of HGX trays linked by IB.

08

Software on Grace — ARM64 Linux Reality

Grace runs full ARM64 Linux. By 2026 this is unremarkable for distros and most major workloads — but there are a few corners where you still need to check.

What just works

Layer                   | ARM64 status on Grace
OS                      | Ubuntu 22.04 / 24.04, RHEL 9, SLES 15 SP5+, Rocky 9 — all official
CUDA toolkit            | ARM64 builds since CUDA 11.4; current toolchain is fully ARM64 native
cuDNN / NCCL / TensorRT | ARM64 builds shipped alongside x86_64 since 2022
PyTorch                 | aarch64+CUDA wheels standard since PyTorch 2.0; nightly builds first-class
JAX                     | ARM64 + CUDA support official since 2023
Triton                  | ARM64 builds; autotune cache works
NGC catalog             | All major NVIDIA images ship linux/arm64 manifests
vLLM / TensorRT-LLM     | ARM64 wheels and containers since 2024

The container gotcha

The single most common ARM64-on-Grace problem is container image architecture. A lot of community ML images are built for linux/amd64 only, and pulling them onto Grace either fails or silently runs under emulation (which is catastrophically slow).

Verify your images include arm64 before deploying
# Inspect a manifest list for both architectures:
docker buildx imagetools inspect vllm/vllm-openai:latest

# Look for entries like:
#   Manifests:
#     Name:      ...@sha256:...   Platform: linux/amd64
#     Name:      ...@sha256:...   Platform: linux/arm64

# Force the arch on pull (catches "wrong arch" early):
docker pull --platform=linux/arm64 vllm/vllm-openai:latest

# Build multi-arch images yourself:
docker buildx build \
    --platform linux/amd64,linux/arm64 \
    --push -t myorg/myapp:1.0 .

Other corners worth knowing

  • Third-party CUDA extensions compiled from source need ARM64 builds (see the rule below)
  • x86 intrinsics (SSE / AVX code paths) need NEON / SVE2 equivalents or a scalar fallback
  • Anything shipped as an x86-only binary blob will not run natively

A working rule

If your stack is "NGC image → PyTorch → HuggingFace model", you will not notice you are on ARM64. If your stack involves compiled-from-source third-party CUDA extensions, you may spend an afternoon producing ARM64 wheels. None of this is hard; it is just one more matrix dimension in CI.

09

DGX Spark — The Personal GH-Class Workstation

DGX Spark (formerly Project DIGITS) is the consumer-form-factor cousin of GB200. It packages GB10 — a Grace-family ARM CPU (a smaller 20-core design, not the 72-core Grace die) plus a compact Blackwell GPU on a single SoC — into a desk-sized box, with the same software stack as the rack.

GB10 SoC

  • 20-core ARM CPU: Cortex-X925 + Cortex-A725 hybrid (10 + 10)
  • Single Blackwell GPU tile (smaller than B200)
  • Native FP4 / FP8 tensor cores
  • ~1 PFLOP FP4 compute (headline figure, with sparsity)
  • One die, one package

Memory

  • 128 GB unified LPDDR5x — one pool, no separate VRAM
  • ~273 GB/s bandwidth (single-port LPDDR5x)
  • Same coherent address space model as Grace + GPU
  • No HBM — all memory is LPDDR5x

Form factor & price

  • ~Mac-mini-sized desktop unit
  • Air cooled
  • ~$3,000–$5,000 retail
  • 10 GbE on-board, USB-C
  • Runs DGX OS (Ubuntu-derived) by default

Targets

  • Hobbyist / researcher running 70B models locally
  • Prototyping pipelines that later scale to GB200
  • Edge inference for sensitive data
  • Personal copilot / agent dev box

What 128 GB unified gets you

A 70B model at FP8 (~70 GB of weights) plus its KV cache sits entirely in GPU-addressable memory, and quantized models go bigger still: capacity no 32 GB consumer card can match, traded against LPDDR bandwidth.

Same software, different scale

The architectural payoff: code that runs on Spark runs on a GB200 NVL72 with no rewrites. CUDA features detect by capability, not by SKU; cudaMallocManaged behaves identically (the rack just adds an HBM tier above the LPDDR pool); ARM64 wheels are the same artifact. Spark exists so the development cycle does not require the rack.
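
What "detect by capability" looks like in practice. These device attributes are real CUDA runtime API; whether a given box reports 1 for them is exactly the runtime question code should ask, rather than matching on the product name:

// Branch on coherence capabilities, never on the SKU.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int dev = 0, concurrent = 0, pageable = 0, host_pt = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, dev);
    cudaDeviceGetAttribute(&pageable,   cudaDevAttrPageableMemoryAccess, dev);
    cudaDeviceGetAttribute(&host_pt,
                           cudaDevAttrPageableMemoryAccessUsesHostPageTables, dev);
    printf("concurrentManagedAccess=%d pageableMemoryAccess=%d hostPageTables=%d\n",
           concurrent, pageable, host_pt);
    // On a coherent Grace platform these should report 1; on a discrete PCIe
    // GPU some typically report 0, and the code paths degrade gracefully.
    return 0;
}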

Spark vs an RTX 5090

A 5090 has more bandwidth (~1.8 TB/s GDDR7) and more raw compute, so for fits-in-32-GB workloads it wins on tok/s. Spark wins when you need capacity: 128 GB unified beats 32 GB VRAM + slow PCIe to system RAM. Pick by VRAM headroom, not by FLOPS.

10

Performance — When Grace Wins, When It Doesn't

Grace + C2C is a real architectural shift, but it is not a free win on every workload. Here is where the win is real and where it is not.

Where Grace wins

Models that almost fit

If your weights + KV cache + optimizer state spill 30–200 GB out of HBM, GH200 / GB200 keep that spill at C2C bandwidth (~900 GB/s) instead of PCIe (~64 GB/s). 10× speed-up on that tier alone.

Data-prep pipelines

72 high-IPC Neoverse V2 cores with SVE2 chew through tokenization, jpeg decode, augmentations — without burning host↔device PCIe bandwidth, because the result lands in shared memory the GPU can already read.

Checkpoint / model loading

A 200 GB checkpoint streams from NVMe via Grace into LPDDR5x at PCIe Gen5 speed, then the GPU pages from LPDDR5x at C2C speed. Cold load time often drops from minutes to under a minute.

Long-context inference

KV cache for 128k+ contexts at high concurrency dwarfs weights. Spilling cold KV pages to LPDDR5x lets you serve 4–8× more concurrent long-context users per HBM-byte than a discrete GPU could.

Vector search / RAG

Hundreds of GB of FAISS / NeMo Retriever index sit happily in LPDDR5x. The GPU pulls hot shards over C2C; queries don't touch PCIe.

Pythonic preprocessing

Grace cores running NumPy / Pandas / DALI compete favourably with x86 once the SVE2 paths are linked. Even where host-side stages were not the bottleneck on x86, they come out ahead on Grace because the output lands in shared memory the GPU can read without a PCIe copy.

Where Grace doesn't help

If the whole workload stays resident in HBM, Grace is a side-grade: the GPU dies are identical to their SXM siblings, so raw FLOPS don't change, and you are paying for a memory tier you never touch. The ARM cores are good general-purpose cores, not an accelerator in their own right.

Power efficiency

Anecdotal numbers from NVIDIA's 2024 disclosures and customer benchmarks: a GH200 delivers roughly 2× perf/W versus an x86 + PCIe + H100 PCIe equivalent on transformer-training workloads. The win is not from FLOPS — the GPU dies are the same — but from skipping PCIe round-trips and from LPDDR5x being substantially more efficient than DDR5.

Honest summary

Grace is best understood as a memory-system upgrade for the GPU, not as a CPU upgrade. The ARM cores are good but not magical. The win comes from one coherent address space and 900 GB/s between CPU and GPU. If your workload doesn't cross that boundary much, Grace is a side-grade. If it does, it is a step change.

11

Interactive: Superchip Picker

Pick a system and a workload. The picker tells you whether the system fits, where the bottleneck is, and gives a one-line verdict.

[Interactive widget: reports total HBM, total LPDDR5x, FP8 PFLOPS, host↔device path, and CPU cores for the selected system.]