NVIDIA GPU Architectures Series — Presentation 09

Blackwell — Dual-Die, FP4, and the NVL72 Rack-Scale GPU

Two reticle-limited dies bonded into one logical GPU. 5th-generation tensor cores with MX-FP4 and NVFP4 microscaling. NVLink 5 at 1.8 TB/s per GPU and NVL72 racks that act as one giant 13.8 TB GPU. The architecture purpose-built for trillion-parameter MoE models.

00

Topics We'll Cover

Blackwell is the first NVIDIA architecture where the GPU is no longer a single die. This deck walks the package, the maths, the rack, and ends with a planner you can drive yourself.

01

Blackwell in One Page

Announced at GTC March 2024, Blackwell is the first NVIDIA datacenter GPU built from two dies bonded into one logical device. It is the architecture that finally walks past the lithographic reticle limit, and it does so while doubling tensor throughput again with native FP4.

Why dual-die mattered

A single die can't exceed ~858 mm² (the EUV reticle limit). Hopper's GH100 was already hard against it at ~814 mm². To grow further, NVIDIA bonded two reticle-sized dies (~800 mm² each) across NV-HBI, the in-package fabric.

Why FP4 mattered

MX-FP4 (E2M1 + per-32-element scale) doubles inference throughput vs FP8 with minimal accuracy loss on big LLMs. The first wave of MoE models (DeepSeek-V3, Mixtral) shipped MX-FP4 quantised weights for Blackwell day-one.

Why NVLink 5 mattered

1.8 TB/s per GPU on NVLink 5 enables 72-GPU coherent NVLink domains — the NVL72 rack acts as one logical GPU with 13.8 TB of HBM3e and 130 TB/s of fabric. Trillion-parameter MoE training without InfiniBand-shaped collective bottlenecks.

The numbers, one place

Property | Blackwell (B200) | Hopper (H100 SXM) | Multiplier
Die count | 2 (NV-HBI bonded) | 1 | —
Process | TSMC 4NP (custom) | TSMC 4N | refined
Transistors | ~208 billion | ~80 billion | 2.6×
HBM capacity | 192 GB HBM3e | 80 GB HBM3 | 2.4×
HBM bandwidth | 8 TB/s | 3.35 TB/s | 2.4×
FP8 dense | 4.5 PFLOPS | 2.0 PFLOPS | 2.25×
FP4 dense | 9 PFLOPS | — | new
NVLink per GPU | 1.8 TB/s (gen 5) | 0.9 TB/s (gen 4) | 2×
TDP | 1000 W | 700 W | 1.43×
Where it landed

Volume shipments through 2024 and into 2025. The first deployments were hyperscaler training clusters (Meta, Microsoft, OpenAI) on GB200 NVL72 racks, with HGX B200 8-way baseboards arriving for enterprise shortly after. The B100, a 700 W drop-in for the H100 SXM socket, was the upgrade path for existing HGX H100 customers.

02

The Dual-Die Package — NV-HBI & HBM3e

Blackwell is the first GPU where you cannot draw the chip without showing the package. Two reticle-sized dies sit on a CoWoS-L interposer, surrounded by eight HBM3e stacks, bonded edge-to-edge across the NV-HBI link.

Panel 1 — Cross-section

[Cross-section] Blackwell B200 package: two ~800 mm² dies on a CoWoS-L interposer and organic substrate, joined by NV-HBI at 10 TB/s, flanked by eight HBM3e stacks (8 × 24 GB = 192 GB total, 8 TB/s aggregate).

Panel 2 — What NV-HBI buys you

  • 10 TB/s bidirectional die-to-die fabric, ~10× the bandwidth of a single HBM stack.
  • Unified L2 cache: ~64 MB combined effective, coherent across both dies.
  • Single address space: software sees one GPU. Kernels migrate work across the bridge transparently.
  • No NUMA penalty exposed to CUDA: NV-HBI latency is hidden by the L2 and the SM scheduler.
  • Compare AMD MI300X: similar chiplet idea, 8 XCDs over Infinity Fabric AP. Different scale, different topology — MI300X is a hub-and-spoke, Blackwell is two equal partners.

HBM3e details

Eight 24 GB stacks per package (192 GB total), each delivering ~1 TB/s for the 8 TB/s aggregate.

Software transparency

From the CUDA programmer's perspective, B200 is one GPU. cudaGetDeviceCount() returns 1 per package, the SM count is reported as the combined total across both dies (~160 SMs on B200), and NVLink targets the package, not individual dies. The dual-die nature only shows up in profiler traces of NV-HBI traffic and in occasional SM-affinity heuristics inside cuBLAS / Transformer Engine.

03

Blackwell SKUs — B100, B200, GB200

Blackwell launched as a family, not a single product. The matrix below is the one you actually need when speccing a system.

SKU | TDP | HBM | BW | FP4 sparse | Notes
B100 | 700 W | 192 GB HBM3e | 8 TB/s | 14 PFLOPS | Drop-in HGX H100 socket compatibility. Same 700 W envelope as H100 SXM. Throttled FP4/FP8 vs B200.
B200 | 1000 W | 192 GB HBM3e | 8 TB/s | 18 PFLOPS | Full-performance Blackwell. New HGX B200 baseboard (1 kW per socket). Liquid- or air-cooled.
B200 NVL | 1000 W | 192 GB HBM3e | 8 TB/s | 18 PFLOPS | Inference-tuned variant for NVL platforms; same silicon, tuned firmware/clocks for token-generation workloads.
GB200 | ~2700 W (module) | 2× 192 GB + 480 GB LPDDR5x | 2× 8 TB/s + ~0.5 TB/s | 36 PFLOPS | Superchip: 1× Grace ARM (72 cores, 480 GB LPDDR5x) + 2× B200 over NVLink-C2C at 900 GB/s, coherent.
HGX B200 | ~8 kW (board) | 8× 192 GB = 1.5 TB | 64 TB/s aggregate | 144 PFLOPS | 8-GPU baseboard with NVSwitch 4. The conventional drop-in for AI servers replacing HGX H100/H200.
DGX B200 | ~14 kW (system) | 1.5 TB HBM3e + 4 TB LPDDR5 | 64 TB/s | 144 PFLOPS | Full NVIDIA-built system: HGX B200 + 2× Intel Xeon + 8× ConnectX-7 + 2× BlueField-3.

Picking between them

You have HGX H100 sockets

Use B100. 700 W matches your existing power and thermal envelope; you keep the baseboard and NVSwitch, just swap GPUs. ~2× the throughput at the same wattage. The pragmatic upgrade path.

You're building a new training cluster

Go straight to GB200 NVL72. Unified Grace memory + 72-GPU coherent domain + liquid cooling. Designed for ≥ 1T-parameter MoE training. ~120 kW per rack but the perf/$ is unmatched.

Enterprise inference, eight-way

HGX B200. Air- or liquid-cooled; fits in a standard 6U-8U chassis; OEM systems from Supermicro, Dell, HPE. The post-Hopper successor to HGX H100/H200 in the conventional rack.

04

5th-Gen Tensor Cores + MX Formats

Blackwell's 5th-generation tensor cores natively support the OCP Microscaling (MX) formats: small element widths paired with a shared per-block exponent. The result: FP4 throughput at FP8-class accuracy for inference.

(a) MX-FP4 (E2M1)

4-bit element: 1 sign, 2 exponent, 1 mantissa.
Per-32-element scale: 8-bit power-of-two (E8M0).
Effective ~4.25 bits/element after amortising the scale.
~2× throughput over FP8 on the same silicon. The headline format for Blackwell inference.
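The E2M1 grid is small enough to enumerate by hand. Below is a minimal round-trip sketch of MX-FP4 block quantisation; `quantize_mxfp4_block` is an illustrative helper (not a library API), and the scale rule is the simplified choice scale = 2^(floor(log2(amax)) − 2), since E2M1's largest magnitude 6.0 carries exponent 2:

```python
import math

# The 8 non-negative magnitudes representable in E2M1 (1 sign + 2 exponent + 1 mantissa bits)
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mxfp4_block(block):
    """Round one 32-element block to E2M1 with a shared power-of-two (E8M0) scale."""
    assert len(block) == 32
    amax = max(abs(v) for v in block)
    # Shared scale: place amax near E2M1's top code (6.0 = 1.5 * 2^2, hence the -2)
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2) if amax > 0 else 1.0
    codes = []
    for v in block:
        mag = min(abs(v) / scale, 6.0)                  # clamp into representable range
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))  # round to nearest grid point
        codes.append(math.copysign(q, v))
    return scale, codes                                 # one 8-bit scale + 32 4-bit codes

def dequantize(scale, codes):
    return [scale * c for c in codes]
```

Storing one 8-bit scale per 32 four-bit codes is exactly where the ~4.25 bits/element figure comes from.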

(b) MX-FP6 (E3M2 / E2M3)

6-bit elements with the same 8-bit shared scale per 32 elements.
Two layouts: E3M2 (more range) or E2M3 (more precision).
Intermediate accuracy & throughput — useful where FP4 starts losing too much accuracy on activations or sensitive layers.

(c) FP8 E4M3 / E5M2

The Hopper formats, still supported natively. E4M3 for forward/weights, E5M2 for backward/gradients. Per-tensor scaling (legacy) or per-block via MX (new on Blackwell). The compatibility layer.

Throughput numbers (B200, dense, per GPU)

Format | Throughput | vs BF16
BF16 | 2.25 PFLOPS | baseline
FP8 | 4.5 PFLOPS | 2×
MX-FP6 | ~6.75 PFLOPS | 3×
MX-FP4 | 9 PFLOPS | 4×
MX-FP4 + 2:4 sparse | 18 PFLOPS | 8×

Sparse 2:4 doubles all of these. Hardware enforces a structured sparsity pattern: in every group of four contiguous weights, two must be zero. The tensor core skips the zeros; you get the second multiplier for free if you can quantise to that pattern at training time.
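A toy illustration of that constraint, using the common magnitude-based pruning heuristic (`prune_2_4` is illustrative, not a cuSPARSELt API): in every group of four contiguous weights, keep the two largest magnitudes and zero the rest.

```python
def prune_2_4(weights):
    """Zero the two smallest-magnitude values in each contiguous group of 4."""
    assert len(weights) % 4 == 0
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # indices of the two largest magnitudes in this group
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

w = [0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.25, 0.01]
print(prune_2_4(w))  # → [0.9, 0.0, 0.0, -1.2, 0.3, 0.0, -0.25, 0.0]
```

In real training the pattern is learned or fine-tuned in, not applied once post-hoc, so the kept weights can compensate for the pruned ones.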

Why "microscaling" matters

Hopper's FP8 used a per-tensor scale — one scale per (potentially gigabyte-sized) weight matrix. Outliers in any block forced the whole tensor's scale up, killing precision in the well-behaved blocks. MX shrinks the scale's scope to 32 elements, so each microblock gets its own dynamic range. That is the whole reason FP4 works at all on real models.
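The outlier effect is easy to reproduce. A sketch with a simplified uniform 4-bit grid (15 levels, not true E2M1) comparing one scale for the whole tensor against one scale per 32-element block:

```python
def quant_error(values, block_size):
    """Max abs error when rounding to a 15-level uniform grid, one scale per block."""
    err = 0.0
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        scale = max(abs(v) for v in block) / 7.0 or 1.0  # grid codes: -7..7, step `scale`
        err = max(err, max(abs(v - scale * round(v / scale)) for v in block))
    return err

# 1024 well-behaved values plus a single block of outliers at the end
vals = [0.01 * (i % 100 - 50) for i in range(1024)] + [50.0] * 32
per_tensor = quant_error(vals, len(vals))  # outlier inflates everyone's scale
per_block  = quant_error(vals, 32)         # only the outlier's block pays
assert per_block < per_tensor
```

With one global scale the small values round to zero entirely; with per-block scales each microblock keeps its own dynamic range, which is the mechanism the paragraph above describes.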

MX-FP4 vs NVFP4

Blackwell's 5th-gen tensor cores support two FP4 variants. MX-FP4 is the open OCP standard: 32-element blocks with an 8-bit power-of-two (E8M0) per-block scale. NVFP4 is NVIDIA's variant: 16-element blocks with an FP8 (E4M3) per-block scale plus a per-tensor FP32 scale. NVFP4 is the default in TensorRT-LLM and Transformer Engine and is more accurate than MX-FP4 in practice; MX-FP4 wins on portability. Deep dive in deck 38.

05

2nd-Gen Transformer Engine

The Transformer Engine (TE) is NVIDIA's open-source Python library that wraps cuBLAS / cuDNN with automatic mixed-precision dispatch tuned for transformer workloads. Blackwell ships its second generation.

Hopper TE (1st gen)

  • Per-tensor FP8 scaling. One scale per matmul output tensor.
  • Scales updated by an exponential moving average of recent activations.
  • Fine for 7B–70B dense models; loses accuracy on long-tail activations and MoE expert outputs.

Blackwell TE (2nd gen)

  • Per-microblock MX scaling. Each 32-element block of weights or activations gets its own scale.
  • Scales computed at runtime from the block's max-abs — no EMA bookkeeping required.
  • Much higher accuracy at MX-FP4 than Hopper's FP8 had at its launch — because the format itself was designed for the precision/throughput tradeoff.

Software stack

PyTorch — opt-in MX block scaling via TE (shown for FP8; FP4 recipes follow the same pattern)
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format, MXFP8BlockScaling

# Hopper-style: one scale per tensor, updated by delayed (EMA) scaling
fp8_recipe = DelayedScaling(margin=0, fp8_format=Format.HYBRID)

# Blackwell-style: per-microblock MX scaling (block size is fixed at 32 by the MX spec)
mx_recipe = MXFP8BlockScaling(fp8_format=Format.E4M3)

with te.fp8_autocast(enabled=True, fp8_recipe=mx_recipe):
    out = model(input_ids)  # model built from te.Linear / te.TransformerLayer modules

Integrated with PyTorch and JAX, picked up automatically by Megatron-LM, NeMo, and (downstream) by Hugging Face Transformers when the kernel backends are present.

Models in the wild

The first wave of MoE models — Mistral's Mixtral, DeepSeek-V3, and inference-only checkpoints from several frontier labs — ship MX-FP4 quantised weights targeted at Blackwell. On Hopper they fall back to FP8 emulation through MX-aware unpacking, which loses the throughput advantage but preserves the file format.

06

RAS Engine — Reliability, Availability, Serviceability

At rack scale the law of large numbers turns rare hardware faults into routine events. Blackwell adds a dedicated RAS engine — a small in-package microcontroller that watches the GPU's data paths continuously.

What it watches

  • SM ALU pipelines — injected test patterns + golden-result checks during idle slots.
  • L2 cache & HBM — ECC scrub plus secondary correlation across stacks.
  • NV-HBI link — CRC and replay statistics across the die-to-die fabric.
  • NVLink 5 — lane-level error counters, retrain triggers.
  • Voltage / thermal sensors — trends, not just thresholds.

What it does

  • Predictive analysis — an on-die model flags components whose error rates are trending up before they fail.
  • Hot-swap of work — suspected SMs are quiesced and their TMA descriptors / tensor work migrated to healthy partitions without job restart.
  • Detailed diagnostics — reports a structured fault record over DCGM so the orchestrator can drain the node cleanly.
  • Self-test on power-up — full BIST of NV-HBI, HBM, NVLink before the SM cluster is released.

Why this matters at NVL72 scale

The mean time between failures (MTBF) of any one GPU is excellent, but in a 72-GPU coherent NVLink domain the failure rates add, so the domain's MTBF is roughly the per-GPU MTBF divided by 72. If each GPU fails about once a week, an NVL72 rack faces an event roughly every 2–3 hours; without graceful migration, every event ends a multi-day training run.
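The arithmetic behind the 2–3 hour figure, assuming independent failures (so rates add and MTBF divides):

```python
per_gpu_mtbf_h = 7 * 24                  # assume one failure per GPU-week (168 h)
gpus = 72
domain_mtbf_h = per_gpu_mtbf_h / gpus    # rates add across the domain, so MTBF divides
print(f"NVL72 expects a fault every {domain_mtbf_h:.1f} h")  # ~2.3 h
```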

In numbers

On a 16-rack (1152-GPU) Blackwell training pod, RAS-driven hot migration claws back the bulk of what would otherwise be lost wall-clock. NVIDIA's published target is > 90% effective utilisation on multi-week pre-training runs — up from roughly 75% on H100 clusters of similar scale, where every hardware fault forced a checkpoint-and-restart.

07

Decompression Engine

Blackwell ships a fixed-function decompression accelerator alongside the SMs. It speaks LZ4, Snappy, and Deflate (gzip-compatible) at 800 GB/s peak, putting it in the same bandwidth class as the HBM controller.

[Data path] Storage (compressed) → Decompression Engine (800 GB/s) → L2 / HBM (raw bytes) → SMs (compute)

Why it exists

Analytics and search pipelines are bus-bound, not compute-bound: keeping data compressed until it reaches the GPU multiplies effective storage and PCIe bandwidth by the compression ratio, and the fixed-function engine expands it at HBM-class speed without spending SM cycles.

What it isn't

It is not a general-purpose compute accelerator and not a network engine. It will not help you decompress model weights at load time (those load once; the bandwidth-savings would be one-shot). It is aimed squarely at analytics and search workloads where the same compressed data is read many times.

A surprising win

Some early benchmarks on Parquet TPC-H scans showed 3–5× end-to-end speedups over CPU-decompressed pipelines, not because GPU decompression is faster than CPU decompression in isolation, but because it removes the PCIe DMA of raw uncompressed bytes — the bus, not the algorithm, was the bottleneck.
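First-order arithmetic behind that win, with assumed round numbers (PCIe Gen5 x16 at ~64 GB/s, a 3× compressed Parquet column):

```python
pcie_gbs = 64.0    # assumed PCIe Gen5 x16 host-to-device bandwidth, GB/s
ratio = 3.0        # assumed compression ratio of the column data
engine_gbs = 800.0 # decompression engine peak, GB/s

# CPU decompresses first, so raw bytes cross the bus: bus-limited
raw_path = pcie_gbs

# Compressed bytes cross the bus; the GPU engine expands them next to HBM
gpu_path = min(pcie_gbs * ratio, engine_gbs)

print(f"effective raw throughput: {raw_path:.0f} vs {gpu_path:.0f} GB/s "
      f"({gpu_path / raw_path:.0f}x from the bus alone)")
```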

08

NVLink 5 + NVSwitch 4

NVLink 5 doubles the per-GPU bandwidth of NVLink 4 and, in combination with NVSwitch 4, extends the maximum coherent NVLink domain to 72 GPUs.

Property | NVLink 4 (Hopper) | NVLink 5 (Blackwell) | Multiplier
Per-link bandwidth | 50 GB/s/dir | 100 GB/s/dir | 2×
Links per GPU | 18 | 18 | same
Per-GPU aggregate | 0.9 TB/s | 1.8 TB/s | 2×
Switch ports per chip | 64 | 144 | 2.25×
Switch aggregate | 3.2 TB/s | 14.4 TB/s | 4.5×
Max coherent domain | 8 (HGX) / 256 (DGX SuperPOD via NVLink Switch System) | 72 (NVL72) over a single NVSwitch 4 fabric | 9× in one rack

NVL72 fabric — a single GPU, made of 72

Worked example — 405B all-reduce

A Llama-3.1 405B model with FP8 weights is ~405 GB. An all-reduce across the parameter set during a training step touches every byte twice (reduce + broadcast).
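A first-order timing sketch for that example, assuming a bandwidth-optimal ring all-reduce (each GPU moves ≈ 2 × (N−1)/N × S bytes) and ignoring latency terms; the 8-NIC InfiniBand figure is an assumed comparison point, not a measured one:

```python
S = 405e9            # parameter bytes (405B params at FP8, 1 byte each)
N = 72               # GPUs in the NVL72 domain
nvlink5 = 1.8e12     # bytes/s per GPU over NVLink 5
ib_node = 8 * 50e9   # assumed 8x 400 Gb/s NICs per node = 400 GB/s

def ring_allreduce_s(size, bw):
    """Time for a bandwidth-optimal ring all-reduce at per-GPU bandwidth bw."""
    return 2 * (N - 1) / N * size / bw

print(f"NVLink 5 fabric: {ring_allreduce_s(S, nvlink5):.2f} s")  # ~0.44 s
print(f"IB-only fabric:  {ring_allreduce_s(S, ib_node):.2f} s")  # ~2.0 s
```

Sub-second full-parameter syncs are what make the per-layer all-to-alls of MoE training tolerable; the same maths over a NIC-bound fabric is several times slower before latency even enters.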

The real answer to "why NVL72?"

It is not the FLOPS — HGX B200 has plenty of FLOPS. It is the collective-operation latency. MoE training in particular hits all-to-all on every layer; the difference between 100 µs and 5 ms is the difference between a usable training pipeline and one that spends most of its time waiting on the network.

09

GB200 Grace-Blackwell Superchip + NVL72

The GB200 module is an NVIDIA-designed compute brick: 1 Grace CPU + 2 B200 GPUs, bonded with NVLink-C2C. The NVL72 rack is 36 of those bricks (18 compute trays, 2 superchips per tray) plus a switching plane.

[Diagram] GB200 superchip (left): Grace CPU (72 ARM Neoverse V2 cores, 480 GB LPDDR5x at ~0.5 TB/s) joined to two B200s (each 192 GB HBM3e, 8 TB/s, 9 PFLOPS FP4) over NVLink-C2C at 900 GB/s coherent; EGM makes the Grace LPDDR5x addressable from the B200s. NVL72 rack (right): 18 compute trays × 2 GB200 superchips = 36 Grace + 72 B200, plus 9 NVSwitch trays; 130 TB/s fabric, 13.8 TB HBM3e, NVLink 5 at 1.8 TB/s per GPU.

What "Extended GPU Memory" actually is

Each Grace's 480 GB of LPDDR5x is mapped into the GPU's address space over the C2C link. CUDA sees it as memory with worse latency than HBM but the same coherency guarantees. For workloads with large but cold parameter tables (embeddings, retrieval indices, MoE routing tables), this is huge: each B200's addressable memory grows from 192 GB to 672 GB (192 GB HBM + the 480 GB LPDDR5x shared with its sibling GPU), a GB200 module totals 2 × 192 + 480 = 864 GB, and an NVL72 rack (36 Graces) holds 13.8 TB HBM + 17.3 TB LPDDR5x ≈ 31 TB.
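The capacity arithmetic, spelled out:

```python
hbm_per_gpu = 192        # GB HBM3e per B200
lpddr_per_grace = 480    # GB LPDDR5x per Grace

per_gpu_addressable = hbm_per_gpu + lpddr_per_grace  # LPDDR5x shared with sibling GPU
per_module = 2 * hbm_per_gpu + lpddr_per_grace       # 1 Grace + 2 B200
rack_hbm_tb = 72 * hbm_per_gpu / 1000
rack_lpddr_tb = 36 * lpddr_per_grace / 1000
print(f"{per_gpu_addressable} GB/GPU, {per_module} GB/module, "
      f"{rack_hbm_tb + rack_lpddr_tb:.1f} TB/rack")  # 672 GB/GPU, 864 GB/module, 31.1 TB/rack
```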

Designed for what

NVL72 was designed primarily for MoE / 1T-parameter training. The combination of fast all-reduce, large coherent memory, and Grace's role hosting optimiser state is unique to this generation. The first deployments trained models with effective parameter counts past 1.8T at densities that would have been impossible on HGX H100 racks.

10

Confidential Computing — TEE-IO

Hopper introduced GPU-side confidential computing: encrypted DMA, attested boot, an isolated GPU runtime. Blackwell extends that across the PCIe interface with TEE-IO — trusted execution that spans CPU and GPU as one boundary.

The boundary

The CPU's TEE (Intel TDX or AMD SEV-SNP) and the Blackwell GPU's confidential mode form a single attested enclave. The hypervisor and host OS sit outside the boundary — they can schedule, but they cannot read the encrypted memory or the device's state.

DMA buffers between CPU TEE and GPU TEE are encrypted with a key established at attestation time; the PCIe / NVLink fabric carries ciphertext only.

The threat model

Defends against a compromised hypervisor, a malicious cloud operator, and a snooping co-tenant. Does not defend against a physical attacker with arbitrary access to the package — HBM scraping is out of scope, as it always has been.

Critical for cloud LLM inference on sensitive data: medical, financial, defense. The customer's prompt and the model's KV-cache exist only in GPU TEE memory; even the cloud operator can't read them.

What the deployment looks like

The pragmatic value

Most enterprise AI conversations now have a "could we host this on third-party infrastructure?" question, and the honest answer used to be "not really, because the hypervisor sees everything." TEE-IO turns that into "yes, with a defensible attestation chain." It's the feature that makes regulated industries actually buy GPU time.

11

Interactive: NVL72 Throughput Planner

Pick a model, a precision, a batch size, and a domain. The planner sizes weights against the domain's HBM, computes aggregate compute, estimates collective time and a rough prefill TPS. Numbers are first-order; treat the verdict as "is this remotely feasible," not as a benchmark.

Reading the planner

The "fits" badge is the first question — if it's red, no amount of compute helps. The all-reduce figure is the worst-case full parameter sync; for sharded gradient updates it's typically 10–50× smaller. Prefill TPS is a roof, not a benchmark; real numbers depend heavily on attention kernels, KV layout, and (especially for MoE) routing efficiency. Use this to rule out infeasible plans, not to commit a procurement.