NVIDIA GPU Architectures Series — Presentation 15

Grace — NVIDIA's ARM Datacenter CPU and the GH200 / GB200 Superchips

Why does NVIDIA build CPUs? Because the path between CPU and GPU is everything for huge models. Walk through the Grace ARM CPU, NVLink-C2C coherent memory, the GH200 Grace-Hopper and GB200 Grace-Blackwell superchips, and DGX Spark — the workstation form factor of the same idea.

00

Topics We'll Cover

Grace is NVIDIA's bet that the future of AI compute is not "a CPU somewhere with GPUs hanging off PCIe", but a tightly-integrated compute element where CPU and GPU share one coherent memory fabric. This deck walks the architecture and its product line.

01

Why NVIDIA Built a CPU

By 2021, NVIDIA's frontier-model customers were hitting a wall that no faster GPU could fix: the path between the host CPU and the GPU. The GPU did the matmul, but the CPU still owned the page tables, the file system, the network stack, and the data-prep pipeline — and every byte that crossed between them went through PCIe.

The x86 + PCIe bottleneck

PCIe Gen5 x16 maxes out around 64 GB/s. An H100 with HBM3 reads its weights at ~3.4 TB/s. The host↔device path is 50× narrower than the on-package memory path. Anything that touches the CPU — checkpoint load, tokenizer prep, DataLoader, optimizer state spill — pays this gap.

The CPU still owns the work

Even a "GPU workload" needs the host CPU for: data loading and shuffling, page-table management for unified memory, NIC and storage drivers, scheduling, host-side reductions, and Python orchestration. You can't just delete the CPU.

NVIDIA's bet (2021)

Build a custom ARM CPU tightly coupled to the GPU via a coherent chip-to-chip link. CPU memory becomes a slow tier of GPU memory. Page faults migrate on demand. The CPU↔GPU boundary stops being a copy boundary; it becomes a cache miss.

Inspired by Apple M-series

Apple's unified-memory architecture proved that CPU + GPU sharing one pool of LPDDR is a power and bandwidth win at consumer scale. Grace is the same idea at datacenter scale: one LPDDR5x pool, one coherent address space, one socket.

[Figure: CPU↔GPU path bandwidth, GB/s, log scale]
  PCIe Gen4 x16             ~32 GB/s
  PCIe Gen5 x16             ~64 GB/s
  NVLink-C2C (Grace↔GPU)     900 GB/s
  Grace LPDDR5x             ~480 GB/s
  H100 HBM3                 ~3350 GB/s
  B200 HBM3e                ~8000 GB/s
PCIe is two orders of magnitude slower than HBM — C2C closes most of the gap.

The one-line argument

If the CPU↔GPU link is 50× slower than HBM, then anything that doesn't fit in HBM is paying that 50× penalty. NVLink-C2C makes the link only ~5× slower than HBM, which means CPU memory becomes a usable second tier of VRAM — not a cliff.
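
To make the ratio concrete, here is a back-of-envelope sketch (plain C; the bandwidth figures are the rounded numbers from the chart above) of how long one full pass over a 100 GB spill tier takes on each path:

// Back-of-envelope: seconds per full pass over a 100 GB spill tier.
#include <stdio.h>

int main(void) {
    const double spill_gb = 100.0;   // working set that does not fit in HBM
    const struct { const char *path; double gbps; } tiers[] = {
        { "PCIe Gen5 x16", 64.0 },
        { "NVLink-C2C",   900.0 },
        { "H100 HBM3",   3350.0 },
    };
    for (int i = 0; i < 3; i++)
        printf("%-14s %6.0f GB/s -> %5.2f s per pass\n",
               tiers[i].path, tiers[i].gbps, spill_gb / tiers[i].gbps);
    // Prints ~1.56 s over PCIe, ~0.11 s over C2C, ~0.03 s in HBM:
    // the 50x cliff becomes roughly a 4x step.
    return 0;
}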

02

Grace — The CPU

Grace is a 72-core Armv9.2-A server CPU built on TSMC 4N (the same node as Hopper). Cores are ARM Neoverse V2 — the same Cortex-X-derived design used in AWS Graviton 4 and several other ARM datacenter parts.

Cores & ISA

  • 72× Neoverse V2 (Armv9.2-A)
  • Wide out-of-order core, deep ROB
  • Per-core: 64 KB L1-I, 64 KB L1-D, 1 MB L2
  • SVE2 with 4× 128-bit vector pipes
  • NEON / AdvSIMD compatibility
  • BF16 / FP16 vector dot-product

Caches & mesh

  • ~117 MB shared L3 on a scalable coherent mesh
  • Mesh-based interconnect — not a ring
  • Hardware coherency to the paired GPU
  • System-MMU for two-level translation

Memory

  • On-package LPDDR5x (no DIMMs)
  • Per-Grace capacity: 480 GB
  • Per-Grace bandwidth: ~500 GB/s
  • ECC throughout
  • Lower power per GB than DDR5

I/O & power

  • PCIe Gen5 for NICs / NVMe / accelerators
  • NVLink-C2C as primary GPU path (900 GB/s)
  • TDP ~250–500 W for the full Grace die
  • System Management on a dedicated controller

Grace products

Product             | LPDDR5x capacity   | LPDDR5x bandwidth | Typical use
Grace (single die)  | 480 GB on-package  | ~500 GB/s         | Paired with H100/H200 in GH200; one per GB200 superchip
Grace CPU Superchip | 2× 480 GB = 960 GB | 2× ~500 GB/s      | Two Grace dies on one module via NVLink-C2C, no GPU; HPC nodes (uncommon)
[Figure: Grace die schematic floor-plan: 72× Neoverse V2 cores (Armv9.2-A, SVE2) around ~117 MB of L3 on the scalable coherent mesh, flanked by two LPDDR5x pools (~240 GB at ~240 GB/s each), with PCIe Gen5 and the C2C port at the edge.]
Why Neoverse V2 specifically

NVIDIA chose ARM not for ISA reasons but for customisation. ARM lets them keep their proprietary mesh, drop in a custom system-MMU, and graft on the C2C port without licensing renegotiation. The Neoverse V2 core itself is unmodified; everything around it is custom NVIDIA silicon.

03

NVLink-C2C — The Bridge

NVLink-C2C ("chip-to-chip") is the link that makes Grace useful as a GPU companion. It connects the Grace die and the paired GPU die over a short on-package trace at 900 GB/s bidirectional, with full cache coherency.

[Figure: Grace + GPU on one module, the C2C link: Grace CPU (72× Neoverse V2, 117 MB L3, SVE2 vector) with ~480 GB LPDDR5x at ~480 GB/s, joined by NVLink-C2C (900 GB/s, coherent) to the Hopper / Blackwell die (SMs + Tensor Cores, L2 + TMA, HBM3 / HBM3e at ~3.4–8 TB/s).]

What "coherent" actually means here

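
A minimal sketch of what that buys in code, assuming a GH200 / GB200-class system where concurrentManagedAccess reports 1; the kernel and the handshake protocol are invented for illustration. A Grace core and a running GPU kernel pass a value back and forth through one system-scope atomic, with no cudaMemcpy and no events:

// Hypothetical CPU↔GPU handshake over the coherent fabric (illustrative only).
#include <cuda/atomic>
#include <cuda_runtime.h>
#include <cstdio>
#include <new>

using Flag = cuda::atomic<int, cuda::thread_scope_system>;

__global__ void wait_then_reply(Flag* flag) {
    while (flag->load() != 1) { }   // GPU spins until a Grace core publishes 1...
    flag->store(2);                 // ...then replies in place
}

int main() {
    Flag* flag;
    cudaMallocManaged(&flag, sizeof(Flag));
    new (flag) Flag{0};             // construct the atomic in the shared heap

    wait_then_reply<<<1, 1>>>(flag);
    flag->store(1);                 // CPU store, observed by the running kernel
    while (flag->load() != 2) { }   // CPU load, observing the GPU's store
    cudaDeviceSynchronize();
    printf("round trip complete, no copies involved\n");
    cudaFree(flag);
    return 0;
}
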
How C2C compares to other CPU↔CPU / die-to-die fabrics

Fabric              | Where                         | Bandwidth          | Coherent?       | Notes
NVLink-C2C          | Grace ↔ GPU on one module     | 900 GB/s bidir     | Yes             | Single-vendor: NVIDIA ARM CPU + NVIDIA GPU
Intel UPI           | Xeon ↔ Xeon socket            | ~40 GB/s per link  | Yes             | CPU↔CPU only; no GPU integration
AMD Infinity Fabric | EPYC ↔ EPYC, MI ↔ MI          | ~100 GB/s per link | Yes             | Single-vendor; MI300 has on-package CPU+GPU variants
CXL 2.0 (PCIe Gen5) | Generic accelerator over PCIe | ~64 GB/s per x16   | Yes (CXL.cache) | Open standard, but far slower than C2C
UCIe                | Open chiplet standard         | varies             | Optional        | Specifies the PHY; coherency is up to the protocol layer
It's not a NIC

The crucial mental shift: C2C is fabric, not network. There's no driver, no DMA descriptor ring, no doorbell. From software's view, Grace memory and GPU memory are one heap. The only thing the runtime decides is placement hints — "prefer GPU" or "prefer CPU" — and the hardware migrates pages accordingly.

04

GH200 — The Grace-Hopper Superchip

GH200 was the first product to ship with Grace + a paired GPU on one module. The pairing is one Grace CPU + one Hopper GPU (an H100 or H200-class die), connected by NVLink-C2C.

GH200 480GB

  • Grace 480 GB LPDDR5x
  • 96 GB HBM3 (Hopper H100 die)
  • Total unified: 576 GB
  • Aggregate memory bandwidth: HBM3 ~3.4 TB/s + LPDDR ~480 GB/s
  • Released 2023

GH200 with H200

  • Grace 480 GB LPDDR5x
  • 141 GB HBM3e (this is the "H200" inside)
  • Total unified: 621 GB
  • HBM3e bandwidth ~4.8 TB/s
  • Refreshed 2024 — the volume Helios / Alps part

Form factors

GH200 ships as a superchip module in MGX-based reference servers (air- and liquid-cooled), and in multi-module boards such as the dual-superchip GH200 NVL2.

Why a 100B model fits a single GH200 cleanly

100B-param model, FP8:
  Hot weights + activations → 96–141 GB HBM3/3e (active matmul)
    ↓ via NVLink-C2C, 900 GB/s
  KV cache, optimizer state, checkpoint pages → Grace 480 GB LPDDR5x
    ↓
  Cold checkpoint shards → NVMe over PCIe Gen5

The crucial point: the spill from HBM to LPDDR5x happens without software changes. CUDA's unified memory paging treats the LPDDR pool as a slow tier of VRAM. A model that previously needed two H100s (one for weights, one for KV cache + optimizer state) fits one GH200 because the spill tier is now coherent and fast.

The win is for "almost-fits" models

If your model fits in HBM with comfortable margin, GH200 is a side-grade vs an SXM H100 server. If your model almost fits but spills 30–100 GB into checkpoint or KV, GH200 turns "I need a second node and tensor parallel" into "one box". That's the operational win NVIDIA is selling.

05

GB200 — The Grace-Blackwell Superchip

GB200 is the Blackwell-era replacement for GH200, but with a key topology change: one Grace + two B200 GPUs on one module. Each B200 has its own dedicated 900 GB/s C2C link to Grace; the two B200s talk to each other over NVLink 5 at 1.8 TB/s.

[Figure: GB200 superchip, 1 Grace + 2 B200: Grace (480 GB LPDDR5x, 480 GB/s) connects to each B200 (192 GB HBM3e, ~8 TB/s, FP4 / FP8 tensor cores) over its own 900 GB/s C2C link; the two B200s connect to each other over NVLink 5 at 1.8 TB/s.]

Numbers per superchip

Resource                       | Per GB200 superchip
Grace dies                     | 1
B200 GPU dies                  | 2 (each a 2-die package = 4 reticle-limit dies total)
HBM3e (GPU side)               | 2× 192 GB = 384 GB
LPDDR5x (CPU side)             | 480 GB
Total unified memory           | 864 GB
FP4 dense compute              | ~18 PFLOPS
FP8 dense compute              | ~9 PFLOPS
Grace↔GPU bandwidth            | 2× 900 GB/s = 1.8 TB/s C2C
GPU↔GPU bandwidth (in-package) | 1.8 TB/s NVLink 5
Module weight                  | ~18 kg
Module TDP                     | ~2700 W
Cooling                        | Liquid only

The building block of NVL72

GB200 is not really a server CPU; it's the unit cell of the rack. A GB200 NVL72 rack contains 36 of these (2 per compute tray × 18 trays), glued together by NVLink Switch trays into a single 72-GPU coherent NVLink domain. Everything else — cooling design, power delivery, copper backplane — flows from this choice.
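
A quick sanity check of the rack arithmetic this implies, using only the per-superchip numbers from the table above (the FP4 total is dense throughput under this deck's ~18 PFLOPS per-superchip figure):

// NVL72 totals implied by the GB200 unit cell (numbers from this deck).
#include <cstdio>

int main() {
    const int superchips = 36;         // 2 per compute tray × 18 trays
    const double hbm_gb    = 384.0;    // per superchip, 2× 192 GB HBM3e
    const double lpddr_gb  = 480.0;    // per superchip, Grace LPDDR5x
    const double fp4_pflop = 18.0;     // per superchip, dense
    printf("GPUs %d | HBM3e %.1f TB | LPDDR5x %.1f TB | unified %.1f TB | FP4 %.0f PF\n",
           superchips * 2,
           superchips * hbm_gb   / 1000.0,
           superchips * lpddr_gb / 1000.0,
           superchips * (hbm_gb + lpddr_gb) / 1000.0,
           superchips * fp4_pflop);
    // GPUs 72 | HBM3e 13.8 TB | LPDDR5x 17.3 TB | unified ~31 TB | FP4 ~648 PF
    return 0;
}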

06

EGM — Extended GPU Memory

EGM (Extended GPU Memory) is the formal name for the mode where the GPU treats Grace's LPDDR5x as additional VRAM at C2C bandwidth. It's part hardware (the C2C ATS) and part driver: the CUDA runtime exposes Grace memory as one giant pool of GPU-accessible memory, paged on demand.

What the kernel actually sees

From a CUDA kernel's viewpoint, EGM memory is indistinguishable from regular VRAM — the same pointers, the same load/store instructions. The difference is purely in where the page resides:

  HBM3e (B200)      ~5–8 TB/s · ns latency · ~192 GB
    ↕ C2C path — 900 GB/s (the link itself)
  LPDDR5x (Grace)   ~500 GB/s · ~10× HBM latency · ~480 GB
    ↕ PCIe Gen5
  NVMe              ~14 GB/s · terabytes

How a kernel allocates EGM memory

cudaMallocManaged on GH200 / GB200 — one heap, two physical tiers
#include <cuda_runtime.h>

// All allocations go through the unified-memory allocator. On Grace-Hopper /
// Grace-Blackwell, the backing store can be HBM or LPDDR5x.
__global__ void my_kernel(float* x, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;          // any access pattern behaves the same way
}

void oversubscribe(cudaStream_t stream) {
    size_t bytes = 200ULL * 1024 * 1024 * 1024;   // 200 GB — bigger than HBM
    float* x;
    cudaMallocManaged(&x, bytes);

    // Tell the driver our preferred placement — "prefer GPU" or "prefer CPU".
    // Without a hint, the runtime decides based on first-touch and access counters.
    cudaMemAdvise(x, bytes, cudaMemAdviseSetPreferredLocation, 0);  // device 0 = GPU

    // Migration is automatic on access:
    //   pages touched by the GPU live in HBM
    //   pages touched by Grace cores live in LPDDR5x
    //   cold pages are evicted to the other tier under memory pressure
    my_kernel<<<1024, 256>>>(x, bytes / sizeof(float));  // just works — no copies, no streams

    // You can also pre-fetch a hot range to HBM before launch:
    size_t hot_bytes = 64ULL << 30;                      // e.g. the resident working set
    cudaMemPrefetchAsync(x, hot_bytes, 0, stream);
}

Where EGM earns its keep

Model loading

Loading a 200 GB FP8 checkpoint into a single GH200 happens at C2C speed (~900 GB/s) once it is in Grace memory — vs ~64 GB/s over PCIe Gen5 to a discrete H100. Checkpoint cold-starts drop from minutes to seconds.

KV cache offload at scale

For long-context (128k+) inference with many concurrent users, the KV cache often dwarfs the weights (sizing sketch below). Spilling cold KV pages to LPDDR5x at 480 GB/s is "good enough" for prefix-cached requests; HBM holds only the active KV.

Optimizer state during training

FP32 master weights, AdamW moments, and gradient buffers can be 6–8× the size of FP8 weights. Pinning these to Grace LPDDR5x while the forward/backward pass uses HBM doubles the trainable model size per node.

Vector search indexes

RAG indexes (FAISS, NeMo Retriever) are millions of float32 vectors that are read once per query. They fit comfortably in 480 GB of LPDDR5x and stream across C2C as needed — no need for an Optane-style tier.
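
The sizing sketch behind the KV-cache claim, for a hypothetical GQA model (80 layers, 8 KV heads, head_dim 128, FP8; illustrative numbers, not any specific model's config):

// KV cache bytes per token = 2 (K and V) × layers × kv_heads × head_dim × dtype bytes.
#include <cstdio>

int main() {
    const double layers = 80, kv_heads = 8, head_dim = 128, dtype_bytes = 1;  // FP8
    const double ctx_tokens = 128.0 * 1024, users = 32;

    double per_token_b = 2 * layers * kv_heads * head_dim * dtype_bytes;
    double per_user_gb = per_token_b * ctx_tokens / 1e9;
    printf("%.0f KB/token -> %.1f GB per 128k-token user -> %.0f GB for %.0f users\n",
           per_token_b / 1024, per_user_gb, per_user_gb * users, users);
    // ~160 KB/token -> ~21 GB per user -> ~687 GB for 32 users: far beyond HBM,
    // but the cold majority fits in the 480 GB LPDDR5x tier.
    return 0;
}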

When EGM hurts

If your access pattern thrashes pages back and forth (e.g. interleaved GPU↔CPU updates on the same buffer), every miss costs a round trip over C2C. Always set cudaMemAdvise hints for your hot data structures (a sketch follows), and profile with Nsight Systems to confirm pages aren't migrating on every launch (nvprof predates Hopper and cannot profile these GPUs).
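
A sketch of that hint pattern, under the same setup as the allocation example in section 06; the buffer names and the split between them are hypothetical:

// Pin placement so pages stop ping-ponging between HBM and LPDDR5x.
#include <cuda_runtime.h>

void pin_tiers(float* weights, size_t wbytes, float* stats, size_t sbytes, int gpu) {
    // Weights: read every step by the GPU, never written. Read-mostly data can be
    // duplicated read-only on both sides instead of migrating.
    cudaMemAdvise(weights, wbytes, cudaMemAdviseSetReadMostly, gpu);

    // Stats: updated by Grace cores, read occasionally by the GPU. Keep the pages
    // resident on the CPU side and let the GPU read them over C2C, rather than
    // migrating the same pages back and forth on every launch.
    cudaMemAdvise(stats, sbytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    cudaMemAdvise(stats, sbytes, cudaMemAdviseSetAccessedBy, gpu);
}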

07

NVL72 — 72 GPUs in One NVLink Domain

Deck 09 (Blackwell) and deck 11 (Datacenter Platforms) covered NVL72 in detail; this is the recap from Grace's perspective.

[Figure: GB200 NVL72 rack elevation: PDU and coolant manifold on top, 18 compute trays (2× GB200 each) split around 9 NVLink Switch trays, a copper backplane joining all 72 B200 into one NVLink-5 domain, ~120 kW of liquid cooling.]

The numbers

  • 18 compute trays × 2 GB200 = 36 superchips = 36 Grace + 72 B200
  • 9 NVLink Switch trays — copper backplane, not optical
  • One 72-GPU coherent NVLink-5 domain
  • ~31 TB total unified memory (13.8 TB HBM3e + 17.3 TB LPDDR5x)
  • ~120 kW per rack, liquid cooled

A rack is the new server

NVL72 is the first NVIDIA product designed as a rack-scale single computer. The unit of purchase, the unit of scheduling in CUDA / NCCL, and the unit of failure domain are all the rack. A 1T-parameter MoE model is meant to be trained or served on one or two NVL72s — not on a sea of HGX trays linked by IB.

08

Software on Grace — ARM64 Linux Reality

Grace runs full ARM64 Linux. By 2026 this is unremarkable for distros and most major workloads — but there are a few corners where you still need to check.

What just works

Layer                   | ARM64 status on Grace
OS                      | Ubuntu 22.04 / 24.04, RHEL 9, SLES 15 SP5+, Rocky 9 — all official
CUDA toolkit            | ARM64 builds since CUDA 11.4; current toolchain is fully ARM64 native
cuDNN / NCCL / TensorRT | ARM64 builds shipped alongside x86_64 since 2022
PyTorch                 | aarch64+CUDA wheels standard since PyTorch 2.0; nightly builds first-class
JAX                     | ARM64 + CUDA support official since 2023
Triton                  | ARM64 builds; autotune cache works
NGC catalog             | All major NVIDIA images ship linux/arm64 manifests
vLLM / TensorRT-LLM     | ARM64 wheels and containers since 2024

The container gotcha

The single most common ARM64-on-Grace problem is container image architecture. A lot of community ML images are built for linux/amd64 only, and pulling them onto Grace either fails or silently runs under emulation (which is catastrophically slow).

Verify your images include arm64 before deploying
# Inspect a manifest list for both architectures:
docker buildx imagetools inspect vllm/vllm-openai:latest

# Look for entries like:
#   Manifests:
#     Name:      ...@sha256:...   Platform: linux/amd64
#     Name:      ...@sha256:...   Platform: linux/arm64

# Force the arch on pull (catches "wrong arch" early):
docker pull --platform=linux/arm64 vllm/vllm-openai:latest

# Build multi-arch images yourself:
docker buildx build \
    --platform linux/amd64,linux/arm64 \
    --push -t myorg/myapp:1.0 .

Other corners worth knowing

  • Third-party CUDA extensions compiled from source need ARM64 builds (see the rule below)
  • x86 intrinsics (SSE / AVX code paths) need NEON / SVE2 equivalents or a scalar fallback
  • Anything shipped as an x86-only binary blob will not run natively

A working rule

If your stack is "NGC image → PyTorch → HuggingFace model", you will not notice you are on ARM64. If your stack involves compiled-from-source third-party CUDA extensions, you may spend an afternoon producing ARM64 wheels. None of this is hard; it is just one more matrix dimension in CI.

09

DGX Spark — The Personal GH-Class Workstation

DGX Spark (formerly Project DIGITS) is the consumer-form-factor cousin of GB200. It packages GB10 — a Grace-family ARM CPU (a smaller 20-core design, not the 72-core Grace die) plus a compact Blackwell GPU on a single SoC — into a desk-sized box, with the same software stack as the rack.

GB10 SoC

  • 20-core ARM CPU: Cortex-X925 + Cortex-A725 hybrid (10 + 10)
  • Single Blackwell GPU tile (smaller than B200)
  • Native FP4 / FP8 tensor cores
  • ~1 PFLOP FP4 compute (headline figure, with sparsity)
  • One die, one package

Memory

  • 128 GB unified LPDDR5x — one pool, no separate VRAM
  • ~273 GB/s bandwidth (single-port LPDDR5x)
  • Same coherent address space model as Grace + GPU
  • No HBM — all memory is LPDDR5x

Form factor & price

  • ~Mac-mini-sized desktop unit
  • Air cooled
  • ~$3,000–$5,000 retail
  • 10 GbE on-board, USB-C
  • Runs DGX OS (Ubuntu-derived) by default

Targets

  • Hobbyist / researcher running 70B models locally
  • Prototyping pipelines that later scale to GB200
  • Edge inference for sensitive data
  • Personal copilot / agent dev box

What 128 GB unified gets you

A 70B model at FP8 (~70 GB of weights) plus its KV cache sits entirely in GPU-addressable memory, and quantized models go bigger still: capacity no 32 GB consumer card can match, traded against LPDDR bandwidth.

Same software, different scale

The architectural payoff: code that runs on Spark runs on a GB200 NVL72 with no rewrites. CUDA features detect by capability, not by SKU; cudaMallocManaged behaves identically (the rack just adds an HBM tier above the LPDDR pool); ARM64 wheels are the same artifact. Spark exists so the development cycle does not require the rack.
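
What "detect by capability" looks like in practice. These device attributes are real CUDA runtime API; whether a given box reports 1 for them is exactly the runtime question code should ask, rather than matching on the product name:

// Branch on coherence capabilities, never on the SKU.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int dev = 0, concurrent = 0, pageable = 0, host_pt = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, dev);
    cudaDeviceGetAttribute(&pageable,   cudaDevAttrPageableMemoryAccess, dev);
    cudaDeviceGetAttribute(&host_pt,
                           cudaDevAttrPageableMemoryAccessUsesHostPageTables, dev);
    printf("concurrentManagedAccess=%d pageableMemoryAccess=%d hostPageTables=%d\n",
           concurrent, pageable, host_pt);
    // On a coherent Grace platform these should report 1; on a discrete PCIe
    // GPU some typically report 0, and the code paths degrade gracefully.
    return 0;
}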

Spark vs an RTX 5090

A 5090 has more bandwidth (~1.8 TB/s GDDR7) and more raw compute, so for fits-in-32-GB workloads it wins on tok/s. Spark wins when you need capacity: 128 GB unified beats 32 GB VRAM + slow PCIe to system RAM. Pick by VRAM headroom, not by FLOPS.

10

Performance — When Grace Wins, When It Doesn't

Grace + C2C is a real architectural shift, but it is not a free win on every workload. Here is where the win is real and where it is not.

Where Grace wins

Models that almost fit

If your weights + KV cache + optimizer state spill 30–200 GB out of HBM, GH200 / GB200 keep that spill at C2C bandwidth (~900 GB/s) instead of PCIe (~64 GB/s). 10× speed-up on that tier alone.

Data-prep pipelines

72 high-IPC Neoverse V2 cores with SVE2 chew through tokenization, jpeg decode, augmentations — without burning host↔device PCIe bandwidth, because the result lands in shared memory the GPU can already read.

Checkpoint / model loading

A 200 GB checkpoint streams from NVMe via Grace into LPDDR5x at PCIe Gen5 speed, then the GPU pages from LPDDR5x at C2C speed. Cold load time often drops from minutes to under a minute.

Long-context inference

KV cache for 128k+ contexts at high concurrency dwarfs weights. Spilling cold KV pages to LPDDR5x lets you serve 4–8× more concurrent long-context users per HBM-byte than a discrete GPU could.

Vector search / RAG

Hundreds of GB of FAISS / NeMo Retriever index sit happily in LPDDR5x. The GPU pulls hot shards over C2C; queries don't touch PCIe.

Pythonic preprocessing

Grace cores running NumPy / Pandas / DALI compete favourably with x86 once the SVE2 paths are linked. Even where host-side stages were not the bottleneck on x86, they come out ahead on Grace because the output lands in shared memory the GPU can read without a PCIe copy.

Where Grace doesn't help

If the whole workload stays resident in HBM, Grace is a side-grade: the GPU dies are identical to their SXM siblings, so raw FLOPS don't change, and you are paying for a memory tier you never touch. The ARM cores are good general-purpose cores, not an accelerator in their own right.

Power efficiency

Anecdotal numbers from NVIDIA's 2024 disclosures and customer benchmarks: a GH200 delivers roughly 2× perf/W versus an x86 + PCIe + H100 PCIe equivalent on transformer-training workloads. The win is not from FLOPS — the GPU dies are the same — but from skipping PCIe round-trips and from LPDDR5x being substantially more efficient than DDR5.

Honest summary

Grace is best understood as a memory-system upgrade for the GPU, not as a CPU upgrade. The ARM cores are good but not magical. The win comes from one coherent address space and 900 GB/s between CPU and GPU. If your workload doesn't cross that boundary much, Grace is a side-grade. If it does, it is a step change.

11

Interactive: Superchip Picker

Pick a system and a workload. The picker tells you whether the system fits, where the bottleneck is, and gives a one-line verdict.

[Interactive widget: reports total HBM, total LPDDR5x, FP8 PFLOPS, host↔device path, and CPU cores for the selected system.]