NVIDIA GPU Architectures Series — Presentation 09

Blackwell — Dual-Die, FP4, and the NVL72 Rack-Scale GPU

Two reticle-limited dies bonded into one logical GPU. 5th-generation tensor cores with MX-FP4 and NVFP4 microscaling. NVLink 5 at 1.8 TB/s per GPU and NVL72 racks that act as one giant 13.8 TB GPU. The architecture purpose-built for trillion-parameter MoE models.

00

Topics We'll Cover

Blackwell is the first NVIDIA architecture where the GPU is no longer a single die. This deck walks the package, the maths, the rack, and ends with a planner you can drive yourself.

01

Blackwell in One Page

Announced at GTC March 2024, Blackwell is the first NVIDIA datacenter GPU built from two dies bonded into one logical device. It is the architecture that finally walks past the lithographic reticle limit, and it does so while doubling tensor throughput again with native FP4.

Why dual-die mattered

A single die can't exceed ~858 mm² (the EUV reticle limit). Hopper's GH100 was already hard against it at ~814 mm². To grow further, NVIDIA bonded two reticle-sized dies (~800 mm² each) across NV-HBI, the in-package fabric.

Why FP4 mattered

MX-FP4 (E2M1 + per-32-element scale) doubles inference throughput vs FP8 with minimal accuracy loss on big LLMs. The first wave of MoE models (DeepSeek-V3, Mixtral) shipped MX-FP4 quantised weights for Blackwell day-one.

Why NVLink 5 mattered

1.8 TB/s per GPU on NVLink 5 enables 72-GPU coherent NVLink domains — the NVL72 rack acts as one logical GPU with 13.8 TB of HBM3e and 130 TB/s of fabric. Trillion-parameter MoE training without InfiniBand-shaped collective bottlenecks.

The numbers, one place

Property | Blackwell (B200) | Hopper (H100 SXM) | Multiplier
Die count | 2 (NV-HBI bonded) | 1 | —
Process | TSMC 4NP (custom) | TSMC 4N | refined
Transistors | ~208 billion | ~80 billion | 2.6×
HBM capacity | 192 GB HBM3e | 80 GB HBM3 | 2.4×
HBM bandwidth | 8 TB/s | 3.35 TB/s | 2.4×
FP8 dense | 4.5 PFLOPS | 2.0 PFLOPS | 2.25×
FP4 dense | 9 PFLOPS | — | new
NVLink per GPU | 1.8 TB/s (gen 5) | 0.9 TB/s (gen 4) | 2×
TDP | 1000 W | 700 W | 1.43×
Where it landed

Volume shipments through 2024 and into 2025. The first deployments were hyperscaler training clusters (Meta, Microsoft, OpenAI) on GB200 NVL72 racks, with HGX B200 8-way baseboards arriving for enterprise shortly after. The B100, a 700 W drop-in for the H100 SXM socket, was the upgrade path for existing HGX H100 customers.

02

The Dual-Die Package — NV-HBI & HBM3e

Blackwell is the first GPU where you cannot draw the chip without showing the package. Two reticle-sized dies sit on a CoWoS-L interposer, surrounded by eight HBM3e stacks, bonded edge-to-edge across the NV-HBI link.

Panel 1 — Cross-section

[Cross-section] Blackwell B200 package: two ~800 mm² dies on a CoWoS-L interposer and organic substrate, joined by NV-HBI at 10 TB/s, flanked by eight HBM3e stacks (8 × 24 GB = 192 GB total, 8 TB/s aggregate).

Panel 2 — What NV-HBI buys you

  • 10 TB/s bidirectional die-to-die fabric, ~10× the bandwidth of a single HBM stack.
  • Unified L2 cache: ~64 MB combined effective, coherent across both dies.
  • Single address space: software sees one GPU. Kernels migrate work across the bridge transparently.
  • No NUMA penalty exposed to CUDA: NV-HBI latency is hidden by the L2 and the SM scheduler.
  • Compare AMD MI300X: similar chiplet idea, 8 XCDs over Infinity Fabric AP. Different scale, different topology — MI300X is a hub-and-spoke, Blackwell is two equal partners.

HBM3e details

Eight 24 GB stacks per package (192 GB total), each delivering ~1 TB/s for the 8 TB/s aggregate.

Software transparency

From the CUDA programmer's perspective, B200 is one GPU. cudaGetDeviceCount() returns 1 per package, the SM count is reported as the combined total across both dies (~160 SMs on B200), and NVLink targets the package, not individual dies. The dual-die nature only shows up in profiler traces of NV-HBI traffic and in occasional SM-affinity heuristics inside cuBLAS / Transformer Engine.

03

Blackwell SKUs — B100, B200, GB200

Blackwell launched as a family, not a single product. The matrix below is the one you actually need when speccing a system.

SKU | TDP | HBM | BW | FP4 sparse | Notes
B100 | 700 W | 192 GB HBM3e | 8 TB/s | 14 PFLOPS | Drop-in HGX H100 socket compatibility. Same 700 W envelope as H100 SXM. Throttled FP4/FP8 vs B200.
B200 | 1000 W | 192 GB HBM3e | 8 TB/s | 18 PFLOPS | Full-performance Blackwell. New HGX B200 baseboard (1 kW per socket). Liquid- or air-cooled.
B200 NVL | 1000 W | 192 GB HBM3e | 8 TB/s | 18 PFLOPS | Inference-tuned variant for NVL platforms; same silicon, tuned firmware/clocks for token-generation workloads.
GB200 | ~2700 W (module) | 2× 192 GB + 480 GB LPDDR5x | 2× 8 TB/s + ~0.5 TB/s | 36 PFLOPS | Superchip: 1× Grace ARM (72 cores, 480 GB LPDDR5x) + 2× B200 over NVLink-C2C at 900 GB/s, coherent.
HGX B200 | ~8 kW (board) | 8× 192 GB = 1.5 TB | 64 TB/s aggregate | 144 PFLOPS | 8-GPU baseboard with NVSwitch 4. The conventional drop-in for AI servers replacing HGX H100/H200.
DGX B200 | ~14 kW (system) | 1.5 TB HBM3e + 4 TB LPDDR5 | 64 TB/s | 144 PFLOPS | Full NVIDIA-built system: HGX B200 + 2× Intel Xeon + 8× ConnectX-7 + 2× BlueField-3.

Picking between them

You have HGX H100 sockets

Use B100. 700 W matches your existing power and thermal envelope; you keep the baseboard and NVSwitch, just swap GPUs. ~2× the throughput at the same wattage. The pragmatic upgrade path.

You're building a new training cluster

Go straight to GB200 NVL72. Unified Grace memory + 72-GPU coherent domain + liquid cooling. Designed for ≥ 1T-parameter MoE training. ~120 kW per rack but the perf/$ is unmatched.

Enterprise inference, eight-way

HGX B200. Air- or liquid-cooled; fits in a standard 6U-8U chassis; OEM systems from Supermicro, Dell, HPE. The post-Hopper successor to HGX H100/H200 in the conventional rack.

04

5th-Gen Tensor Cores + MX Formats

Blackwell's 5th-generation tensor cores natively support the OCP Microscaling (MX) formats: small element widths paired with a shared per-block exponent. The result: FP4 throughput at FP8-class accuracy for inference.

(a) MX-FP4 (E2M1)

4-bit element: 1 sign, 2 exponent, 1 mantissa.
Per-32-element scale: 8-bit power-of-two (E8M0).
Effective ~4.25 bits/element after amortising the scale.
~2× throughput over FP8 on the same silicon. The headline format for Blackwell inference.
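The E2M1 grid is small enough to enumerate by hand. Below is a minimal round-trip sketch of MX-FP4 block quantisation; `quantize_mxfp4_block` is an illustrative helper (not a library API), and the scale rule is the simplified choice scale = 2^(floor(log2(amax)) − 2), since E2M1's largest magnitude 6.0 carries exponent 2:

```python
import math

# The 8 non-negative magnitudes representable in E2M1 (1 sign + 2 exponent + 1 mantissa bits)
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mxfp4_block(block):
    """Round one 32-element block to E2M1 with a shared power-of-two (E8M0) scale."""
    assert len(block) == 32
    amax = max(abs(v) for v in block)
    # Shared scale: place amax near E2M1's top code (6.0 = 1.5 * 2^2, hence the -2)
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2) if amax > 0 else 1.0
    codes = []
    for v in block:
        mag = min(abs(v) / scale, 6.0)                  # clamp into representable range
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))  # round to nearest grid point
        codes.append(math.copysign(q, v))
    return scale, codes                                 # one 8-bit scale + 32 4-bit codes

def dequantize(scale, codes):
    return [scale * c for c in codes]
```

Storing one 8-bit scale per 32 four-bit codes is exactly where the ~4.25 bits/element figure comes from.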

(b) MX-FP6 (E3M2 / E2M3)

6-bit elements with the same 8-bit shared scale per 32 elements.
Two layouts: E3M2 (more range) or E2M3 (more precision).
Intermediate accuracy & throughput — useful where FP4 starts losing too much accuracy on activations or sensitive layers.

(c) FP8 E4M3 / E5M2

The Hopper formats, still supported natively. E4M3 for forward/weights, E5M2 for backward/gradients. Per-tensor scaling (legacy) or per-block via MX (new on Blackwell). The compatibility layer.

Throughput numbers (B200, dense, per GPU)

Format | Throughput | vs BF16
BF16 | 2.25 PFLOPS | baseline
FP8 | 4.5 PFLOPS | 2×
MX-FP6 | ~6.75 PFLOPS | 3×
MX-FP4 | 9 PFLOPS | 4×
MX-FP4 + 2:4 sparse | 18 PFLOPS | 8×

Sparse 2:4 doubles all of these. Hardware enforces a structured sparsity pattern: in every group of four contiguous weights, two must be zero. The tensor core skips the zeros; you get the second multiplier for free if you can quantise to that pattern at training time.
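A toy illustration of that constraint, using the common magnitude-based pruning heuristic (`prune_2_4` is illustrative, not a cuSPARSELt API): in every group of four contiguous weights, keep the two largest magnitudes and zero the rest.

```python
def prune_2_4(weights):
    """Zero the two smallest-magnitude values in each contiguous group of 4."""
    assert len(weights) % 4 == 0
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # indices of the two largest magnitudes in this group
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

w = [0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.25, 0.01]
print(prune_2_4(w))  # → [0.9, 0.0, 0.0, -1.2, 0.3, 0.0, -0.25, 0.0]
```

In real training the pattern is learned or fine-tuned in, not applied once post-hoc, so the kept weights can compensate for the pruned ones.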

Why "microscaling" matters

Hopper's FP8 used a per-tensor scale — one scale per (potentially gigabyte-sized) weight matrix. Outliers in any block forced the whole tensor's scale up, killing precision in the well-behaved blocks. MX shrinks the scale's scope to 32 elements, so each microblock gets its own dynamic range. That is the whole reason FP4 works at all on real models.
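The outlier effect is easy to reproduce. A sketch with a simplified uniform 4-bit grid (15 levels, not true E2M1) comparing one scale for the whole tensor against one scale per 32-element block:

```python
def quant_error(values, block_size):
    """Max abs error when rounding to a 15-level uniform grid, one scale per block."""
    err = 0.0
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        scale = max(abs(v) for v in block) / 7.0 or 1.0  # grid codes: -7..7, step `scale`
        err = max(err, max(abs(v - scale * round(v / scale)) for v in block))
    return err

# 1024 well-behaved values plus a single block of outliers at the end
vals = [0.01 * (i % 100 - 50) for i in range(1024)] + [50.0] * 32
per_tensor = quant_error(vals, len(vals))  # outlier inflates everyone's scale
per_block  = quant_error(vals, 32)         # only the outlier's block pays
assert per_block < per_tensor
```

With one global scale the small values round to zero entirely; with per-block scales each microblock keeps its own dynamic range, which is the mechanism the paragraph above describes.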

MX-FP4 vs NVFP4

Blackwell's 5th-gen tensor cores support two FP4 variants. MX-FP4 is the open OCP standard: 32-element blocks with an 8-bit power-of-two (E8M0) per-block scale. NVFP4 is NVIDIA's variant: 16-element blocks with an FP8 (E4M3) per-block scale plus a per-tensor FP32 scale. NVFP4 is the default in TensorRT-LLM and Transformer Engine and is more accurate than MX-FP4 in practice; MX-FP4 wins on portability. Deep dive in deck 38.

05

2nd-Gen Transformer Engine

The Transformer Engine (TE) is NVIDIA's open-source Python library that wraps cuBLAS / cuDNN with automatic mixed-precision dispatch tuned for transformer workloads. Blackwell ships its second generation.

Hopper TE (1st gen)

  • Per-tensor FP8 scaling. One scale per matmul output tensor.
  • Scales updated by an exponential moving average of recent activations.
  • Fine for 7B–70B dense models; loses accuracy on long-tail activations and MoE expert outputs.

Blackwell TE (2nd gen)

  • Per-microblock MX scaling. Each 32-element block of weights or activations gets its own scale.
  • Scales computed at runtime from the block's max-abs — no EMA bookkeeping required.
  • Much higher accuracy at MX-FP4 than Hopper's FP8 had at its launch — because the format itself was designed for the precision/throughput tradeoff.

Software stack

PyTorch — opt-in MX block scaling via TE (shown for FP8; FP4 recipes follow the same pattern)
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format, MXFP8BlockScaling

# Hopper-style: one scale per tensor, updated by delayed (EMA) scaling
fp8_recipe = DelayedScaling(margin=0, fp8_format=Format.HYBRID)

# Blackwell-style: per-microblock MX scaling (block size is fixed at 32 by the MX spec)
mx_recipe = MXFP8BlockScaling(fp8_format=Format.E4M3)

with te.fp8_autocast(enabled=True, fp8_recipe=mx_recipe):
    out = model(input_ids)  # model built from te.Linear / te.TransformerLayer modules

Integrated with PyTorch and JAX, picked up automatically by Megatron-LM, NeMo, and (downstream) by Hugging Face Transformers when the kernel backends are present.

Models in the wild

The first wave of MoE models — Mistral's Mixtral, DeepSeek-V3, and inference-only checkpoints from several frontier labs — ship MX-FP4 quantised weights targeted at Blackwell. On Hopper they fall back to FP8 emulation through MX-aware unpacking, which loses the throughput advantage but preserves the file format.

06

RAS Engine — Reliability, Availability, Serviceability

At rack scale the law of large numbers turns rare hardware faults into routine events. Blackwell adds a dedicated RAS engine — a small in-package microcontroller that watches the GPU's data paths continuously.

What it watches

  • SM ALU pipelines — injected test patterns + golden-result checks during idle slots.
  • L2 cache & HBM — ECC scrub plus secondary correlation across stacks.
  • NV-HBI link — CRC and replay statistics across the die-to-die fabric.
  • NVLink 5 — lane-level error counters, retrain triggers.
  • Voltage / thermal sensors — trends, not just thresholds.

What it does

  • Predictive analysis — an on-die model flags components whose error rates are trending up before they fail.
  • Hot-swap of work — suspected SMs are quiesced and their TMA descriptors / tensor work migrated to healthy partitions without job restart.
  • Detailed diagnostics — reports a structured fault record over DCGM so the orchestrator can drain the node cleanly.
  • Self-test on power-up — full BIST of NV-HBI, HBM, NVLink before the SM cluster is released.

Why this matters at NVL72 scale

The mean time between failures (MTBF) of any one GPU is excellent, but in a 72-GPU coherent NVLink domain the failure rates add, so the domain's MTBF is roughly the per-GPU MTBF divided by 72. If each GPU fails about once a week, an NVL72 rack faces an event roughly every 2–3 hours; without graceful migration, every event ends a multi-day training run.
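The arithmetic behind the 2–3 hour figure, assuming independent failures (so rates add and MTBF divides):

```python
per_gpu_mtbf_h = 7 * 24                  # assume one failure per GPU-week (168 h)
gpus = 72
domain_mtbf_h = per_gpu_mtbf_h / gpus    # rates add across the domain, so MTBF divides
print(f"NVL72 expects a fault every {domain_mtbf_h:.1f} h")  # ~2.3 h
```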

In numbers

On a 16-rack (1152-GPU) Blackwell training pod, RAS-driven hot migration claws back the bulk of what would otherwise be lost wall-clock. NVIDIA's published target is > 90% effective utilisation on multi-week pre-training runs — up from roughly 75% on H100 clusters of similar scale, where every hardware fault forced a checkpoint-and-restart.

07

Decompression Engine

Blackwell ships a fixed-function decompression accelerator alongside the SMs. It speaks LZ4, Snappy, and Deflate (gzip-compatible) at 800 GB/s peak, putting it in the same bandwidth class as the HBM controller.

[Data path] Storage (compressed) → Decompression Engine (800 GB/s) → L2 / HBM (raw bytes) → SMs (compute)

Why it exists

Analytics and search pipelines are bus-bound, not compute-bound: keeping data compressed until it reaches the GPU multiplies effective storage and PCIe bandwidth by the compression ratio, and the fixed-function engine expands it at HBM-class speed without spending SM cycles.

What it isn't

It is not a general-purpose compute accelerator and not a network engine. It will not help you decompress model weights at load time (those load once; the bandwidth-savings would be one-shot). It is aimed squarely at analytics and search workloads where the same compressed data is read many times.

A surprising win

Some early benchmarks on Parquet TPC-H scans showed 3–5× end-to-end speedups over CPU-decompressed pipelines, not because GPU decompression is faster than CPU decompression in isolation, but because it removes the PCIe DMA of raw uncompressed bytes — the bus, not the algorithm, was the bottleneck.
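First-order arithmetic behind that win, with assumed round numbers (PCIe Gen5 x16 at ~64 GB/s, a 3× compressed Parquet column):

```python
pcie_gbs = 64.0    # assumed PCIe Gen5 x16 host-to-device bandwidth, GB/s
ratio = 3.0        # assumed compression ratio of the column data
engine_gbs = 800.0 # decompression engine peak, GB/s

# CPU decompresses first, so raw bytes cross the bus: bus-limited
raw_path = pcie_gbs

# Compressed bytes cross the bus; the GPU engine expands them next to HBM
gpu_path = min(pcie_gbs * ratio, engine_gbs)

print(f"effective raw throughput: {raw_path:.0f} vs {gpu_path:.0f} GB/s "
      f"({gpu_path / raw_path:.0f}x from the bus alone)")
```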

08

NVLink 5 + NVSwitch 4

NVLink 5 doubles the per-GPU bandwidth of NVLink 4 and, in combination with NVSwitch 4, extends the maximum coherent NVLink domain to 72 GPUs.

Property | NVLink 4 (Hopper) | NVLink 5 (Blackwell) | Multiplier
Per-link bandwidth | 50 GB/s/dir | 100 GB/s/dir | 2×
Links per GPU | 18 | 18 | same
Per-GPU aggregate | 0.9 TB/s | 1.8 TB/s | 2×
Switch ports per chip | 64 | 144 | 2.25×
Switch aggregate | 3.2 TB/s | 14.4 TB/s | 4.5×
Max coherent domain | 8 (HGX) / 256 (DGX SuperPOD via NVLink Switch System) | 72 (NVL72) over a single NVSwitch 4 fabric | 9× in one rack

NVL72 fabric — a single GPU, made of 72

Worked example — 405B all-reduce

A Llama-3.1 405B model with FP8 weights is ~405 GB. An all-reduce across the parameter set during a training step touches every byte twice (reduce + broadcast).
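A first-order timing sketch for that example, assuming a bandwidth-optimal ring all-reduce (each GPU moves ≈ 2 × (N−1)/N × S bytes) and ignoring latency terms; the 8-NIC InfiniBand figure is an assumed comparison point, not a measured one:

```python
S = 405e9            # parameter bytes (405B params at FP8, 1 byte each)
N = 72               # GPUs in the NVL72 domain
nvlink5 = 1.8e12     # bytes/s per GPU over NVLink 5
ib_node = 8 * 50e9   # assumed 8x 400 Gb/s NICs per node = 400 GB/s

def ring_allreduce_s(size, bw):
    """Time for a bandwidth-optimal ring all-reduce at per-GPU bandwidth bw."""
    return 2 * (N - 1) / N * size / bw

print(f"NVLink 5 fabric: {ring_allreduce_s(S, nvlink5):.2f} s")  # ~0.44 s
print(f"IB-only fabric:  {ring_allreduce_s(S, ib_node):.2f} s")  # ~2.0 s
```

Sub-second full-parameter syncs are what make the per-layer all-to-alls of MoE training tolerable; the same maths over a NIC-bound fabric is several times slower before latency even enters.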

The real answer to "why NVL72?"

It is not the FLOPS — HGX B200 has plenty of FLOPS. It is the collective-operation latency. MoE training in particular hits all-to-all on every layer; the difference between 100 µs and 5 ms is the difference between a usable training pipeline and one that spends most of its time waiting on the network.

09

GB200 Grace-Blackwell Superchip + NVL72

The GB200 module is an NVIDIA-designed compute brick: 1 Grace CPU + 2 B200 GPUs, bonded with NVLink-C2C. The NVL72 rack is 36 of those bricks (18 compute trays, 2 superchips per tray) plus a switching plane.

[Diagram] GB200 superchip (left): Grace CPU (72 ARM Neoverse V2 cores, 480 GB LPDDR5x at ~0.5 TB/s) joined to two B200s (each 192 GB HBM3e, 8 TB/s, 9 PFLOPS FP4) over NVLink-C2C at 900 GB/s coherent; EGM makes the Grace LPDDR5x addressable from the B200s. NVL72 rack (right): 18 compute trays × 2 GB200 superchips = 36 Grace + 72 B200, plus 9 NVSwitch trays; 130 TB/s fabric, 13.8 TB HBM3e, NVLink 5 at 1.8 TB/s per GPU.

What "Extended GPU Memory" actually is

Each Grace's 480 GB of LPDDR5x is mapped into the GPU's address space over the C2C link. CUDA sees it as memory with worse latency than HBM but the same coherency guarantees. For workloads with large but cold parameter tables (embeddings, retrieval indices, MoE routing tables), this is huge: each B200's addressable memory grows from 192 GB to 672 GB (192 GB HBM + the 480 GB LPDDR5x shared with its sibling GPU), a GB200 module totals 2 × 192 + 480 = 864 GB, and an NVL72 rack (36 Graces) holds 13.8 TB HBM + 17.3 TB LPDDR5x ≈ 31 TB.
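The capacity arithmetic, spelled out:

```python
hbm_per_gpu = 192        # GB HBM3e per B200
lpddr_per_grace = 480    # GB LPDDR5x per Grace

per_gpu_addressable = hbm_per_gpu + lpddr_per_grace  # LPDDR5x shared with sibling GPU
per_module = 2 * hbm_per_gpu + lpddr_per_grace       # 1 Grace + 2 B200
rack_hbm_tb = 72 * hbm_per_gpu / 1000
rack_lpddr_tb = 36 * lpddr_per_grace / 1000
print(f"{per_gpu_addressable} GB/GPU, {per_module} GB/module, "
      f"{rack_hbm_tb + rack_lpddr_tb:.1f} TB/rack")  # 672 GB/GPU, 864 GB/module, 31.1 TB/rack
```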

Designed for what

NVL72 was designed primarily for MoE / 1T-parameter training. The combination of fast all-reduce, large coherent memory, and Grace's role hosting optimiser state is unique to this generation. The first deployments trained models with effective parameter counts past 1.8T at densities that would have been impossible on HGX H100 racks.

10

Confidential Computing — TEE-IO

Hopper introduced GPU-side confidential computing: encrypted DMA, attested boot, an isolated GPU runtime. Blackwell extends that across the PCIe interface with TEE-IO — trusted execution that spans CPU and GPU as one boundary.

The boundary

The CPU's TEE (Intel TDX or AMD SEV-SNP) and the Blackwell GPU's confidential mode form a single attested enclave. The hypervisor and host OS sit outside the boundary — they can schedule, but they cannot read the encrypted memory or the device's state.

DMA buffers between CPU TEE and GPU TEE are encrypted with a key established at attestation time; the PCIe / NVLink fabric carries ciphertext only.

The threat model

Defends against a compromised hypervisor, a malicious cloud operator, and a snooping co-tenant. Does not defend against a physical attacker with arbitrary access to the package — HBM scraping is out of scope, as it always has been.

Critical for cloud LLM inference on sensitive data: medical, financial, defense. The customer's prompt and the model's KV-cache exist only in GPU TEE memory; even the cloud operator can't read them.

What the deployment looks like

The pragmatic value

Most enterprise AI conversations now have a "could we host this on third-party infrastructure?" question, and the honest answer used to be "not really, because the hypervisor sees everything." TEE-IO turns that into "yes, with a defensible attestation chain." It's the feature that makes regulated industries actually buy GPU time.

11

Interactive: NVL72 Throughput Planner

Pick a model, a precision, a batch size, and a domain. The planner sizes weights against the domain's HBM, computes aggregate compute, estimates collective time and a rough prefill TPS. Numbers are first-order; treat the verdict as "is this remotely feasible," not as a benchmark.

Reading the planner

The "fits" badge is the first question — if it's red, no amount of compute helps. The all-reduce figure is the worst-case full parameter sync; for sharded gradient updates it's typically 10–50× smaller. Prefill TPS is a roof, not a benchmark; real numbers depend heavily on attention kernels, KV layout, and (especially for MoE) routing efficiency. Use this to rule out infeasible plans, not to commit a procurement.