NVIDIA GPU Architectures Series — Presentation 32

Inside Blackwell — Dual-Die, MX-FP4 / NVFP4, NVLink 5

A low-level deep dive into NVIDIA's 2024 dual-die architecture. Two reticle-sized dies bonded by the 10 TB/s NV-HBI link, 5th-generation tensor cores with MX-FP4 and NVFP4 microscaling at ~9 PFLOPS dense, 192 GB of HBM3e at 8 TB/s, NVLink 5 at 1.8 TB/s per GPU, the GB200 superchip and NVL72 rack, plus the consumer RTX 50 series with GDDR7.

B100 · B200 · GB200 · RTX 5090 · RTX PRO 6000 · TSMC 4NP · NV-HBI · CoWoS-L · HBM3e · GDDR7 · MX-FP4 · NVFP4 · NVLink 5 · RAS
[Cover diagram: Die A ↔ NV-HBI ↔ Die B, 5th-gen tensor cores (MX-FP4 + NVFP4), HBM3e, NVLink 5, NVL72]
00

Topics We'll Cover

  • Blackwell at a Glance
  • The Dual-Die Package & NV-HBI
  • B100 / B200 / B300 / GB200 SKUs
  • The Blackwell SM — 5th-Gen Tensor Cores
  • FP4 Microscaling Formats — MX-FP4 & NVFP4
  • 2nd-Generation Transformer Engine
  • RAS & Decompression Engines
  • Process & Voltage
  • Memory — HBM3e at 8 TB/s, GDDR7 on Consumer
  • NVLink 5 & NVSwitch 4 (NVL72)
  • Consumer Blackwell — RTX 50 Series
  • Interactive: Blackwell SKU Picker

01

Blackwell at a Glance

Announced at GTC March 2024, volume in late 2024. Blackwell is NVIDIA's first multi-die GPU at the high end: two reticle-limited compute dies bonded into one logical GPU. Names get confusing — "Blackwell" is the architecture, "B100/B200/B300" are dual-die datacenter GPUs (each carrying two GB100 compute dies, sm_100), and "GB202/GB203/GB205/GB206/GB207" are single-die consumer/workstation dies (sm_120) used in the RTX 50 series.

SKU | Dies | HBM / GDDR | Bandwidth | FP4 dense (MX-FP4 / NVFP4) | TDP
B100 | 2× compute | 192 GB HBM3e | 8 TB/s | ~7 PFLOPS | 700 W
B200 | 2× compute | 192 GB HBM3e | 8 TB/s | 9 PFLOPS | 1000 W
GB200 (1 Grace + 2 B200) | 1 Grace + 4 compute dies | 384 GB HBM3e + 480 GB LPDDR5x | 16 TB/s + 480 GB/s | ~18 PFLOPS | 2700 W
RTX PRO 6000 Blackwell | 1 (GB202 monolithic) | 96 GB GDDR7 ECC | 1.79 TB/s | ~3 PFLOPS | 600 W
RTX 5090 | 1 (GB202) | 32 GB GDDR7 | 1.79 TB/s | ~3 PFLOPS | 575 W
02

The Dual-Die Package & NV-HBI

Each B100/B200 contains two GB100 compute dies, each ~800 mm² on TSMC 4NP, ~104 B transistors each — ~208 B transistors total. The dies are bonded via NV-HBI (NVIDIA High-Bandwidth Interface): a wide silicon-bridge link delivering 10 TB/s die-to-die, transparent to software (the OS sees one GPU).

Why dual die

TSMC's reticle limit is ~858 mm². GH100 was already at 814 mm². To deliver more compute per package, NVIDIA had to either accept the ceiling or split the design. Two ~800 mm² dies + NV-HBI gives ~1.9× the silicon area at substantially better yield than one impossible 1600 mm² die.
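
For intuition on the yield argument, here is a minimal sketch using a simple Poisson defect-yield model; the defect density below is an assumed, illustrative value, not a TSMC or NVIDIA figure:

    import math

    D = 0.1          # assumed defect density in defects/cm^2 (illustrative only)
    A_mono = 16.0    # hypothetical ~1600 mm^2 monolithic die, in cm^2
    A_half = 8.0     # one ~800 mm^2 GB100-class compute die, in cm^2

    # Poisson yield: probability that a die has zero killer defects
    y_mono = math.exp(-D * A_mono)   # ~0.20
    y_half = math.exp(-D * A_half)   # ~0.45

    # With known-good-die sort, good ~800 mm^2 dies are paired after test, so the
    # usable fraction of wafer silicon tracks ~45% rather than the monolithic ~20%
    # (and the 1600 mm^2 die exceeds the reticle limit, so it could not be built at all).
    print(f"monolithic yield ~{y_mono:.0%}, per-die yield ~{y_half:.0%}")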

Why "one GPU"

The two dies share a unified L2, a single CUDA address space, and one NVLink endpoint. Software sees one B200, not two B100s. Coherency across NV-HBI is hardware-managed; the SM scheduler load-balances across both dies.

Packaging: CoWoS-L with LSI (Local Silicon Interconnect) bridges. The bridges only sit between the two dies and between dies and HBM stacks — the rest of the substrate is cheap organic. CoWoS-L scales to larger areas than CoWoS-S at lower per-unit silicon cost.

03

B100 / B200 / B300 / GB200 SKUs

B100 (700 W)

Drop-in HGX H100 socket compatibility. Two compute dies; 192 GB HBM3e at 8 TB/s. Lower clocks than B200 to fit 700 W envelope. Allows existing HGX H100 buyers to upgrade in place. B300 / Blackwell Ultra ramps in 2025 with 288 GB HBM3e (12-high stacks) for larger models.

B200 (1000 W SXM6)

Full performance variant; same silicon, higher clocks, requires liquid-cooled HGX B200 baseboard. ~9 PFLOPS dense FP4 (MX-FP4 / NVFP4), 4.5 PFLOPS FP8, 2.25 PFLOPS BF16, 40 TFLOPS FP64. 192 GB HBM3e (8 stacks × 24 GB).

GB200 superchip (2700 W)

1× Grace ARM CPU + 2× B200 GPUs on a single board, connected by two NVLink-C2C links at 900 GB/s each. 384 GB HBM3e + 480 GB LPDDR5x = 864 GB unified memory. Liquid cooled. Building block of NVL72.

04

The Blackwell SM — 5th-Gen Tensor Cores

Per-SM compute

  • 4 partitions, 1 warp scheduler each
  • 128 FP32 cores, 64 INT32, 64 FP64 (FP64:FP32 ratio of 1:2, carried over from Hopper)
  • 4 tensor cores (5th gen) per SM
  • 16 LD/ST, 16 SFU
  • TMA unit, RAS unit
  • ~256 KB L1 + shared (configurable)
  • ~256 KB register file
  • Cluster-aware (inherited from Hopper)

SM count

  • ~144 SMs per compute die
  • 2 dies per package → ~288 SMs total
  • B200 enables ~256 (binning for yield)
  • B100 enables fewer at lower clocks
  • GPC structure: 8 GPCs × 9 TPCs per die, 2 SMs per TPC (arithmetic sketched below)
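
The SM totals above are just the GPC arithmetic; a quick check (2 SMs per TPC is NVIDIA's usual ratio and is treated as an assumption for GB100 here):

    gpcs_per_die, tpcs_per_gpc, sms_per_tpc = 8, 9, 2   # 2 SMs/TPC assumed

    sms_per_die     = gpcs_per_die * tpcs_per_gpc * sms_per_tpc   # 144
    sms_per_package = 2 * sms_per_die                             # 288 physical SMs
    sms_b200        = 256                                         # ~256 enabled after binning, per this deck

    print(sms_per_die, sms_per_package, sms_b200)   # 144 288 256
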
05

FP4 Microscaling Formats — MX-FP4 & NVFP4

Hopper FP8 used per-tensor scaling — one scale per matrix. Blackwell's 5th-gen tensor cores push to per-microblock scaling, accelerating two FP4 variants: the open MX-FP4 standard and NVIDIA's higher-accuracy NVFP4.

Format | Element bits | Block size | Per-block scale | Effective bits/element
MX-FP8 (E4M3) | 8 | 32 | E8M0 (8-bit) | 8.25
MX-FP6 (E3M2) | 6 | 32 | E8M0 (8-bit) | 6.25
MX-FP6 (E2M3) | 6 | 32 | E8M0 (8-bit) | 6.25
MX-FP4 (E2M1) | 4 | 32 | E8M0 (8-bit, exponent-only) | 4.25
NVFP4 (E2M1) | 4 | 16 | E4M3 (FP8) + per-tensor FP32 | ~4.5
MX-INT8 | 8 | 32 | E8M0 (8-bit) | 8.25

Throughput on B200: dense FP4 (either format) runs at 2× the FP8 rate, ≈ 9 PFLOPS dense and ~18 PFLOPS sparse per package. MX-FP6 runs at the same rate as FP8; the per-block scale factors are applied inside the tensor core at no throughput cost.

MX-FP4 vs NVFP4

MX-FP4 is the open OCP standard: 32-element blocks, an 8-bit exponent-only (E8M0, power-of-two) scale per block. NVFP4 is NVIDIA's variant: smaller 16-element blocks with an E4M3 FP8 per-block scale plus a per-tensor FP32 scale. The finer block size and richer scale type make NVFP4 noticeably more accurate than MX-FP4 in practice, and it is the default in TensorRT-LLM and Transformer Engine for Blackwell deployments. Tensor cores accelerate both at the same headline rate.
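
A small numerical sketch of the two schemes: how a per-block scale is chosen and what the storage overhead works out to. The scale selection and rounding below are simplified assumptions (NVFP4's per-tensor FP32 scale and the E4M3 rounding of its block scale are omitted), not the exact tensor-core algorithm:

    import numpy as np

    E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # FP4 (E2M1) magnitudes

    def quantize_fp4(block, power_of_two_scale):
        amax = np.abs(block).max()
        if power_of_two_scale:                    # MX-FP4: E8M0 block scale (power of two)
            scale = 2.0 ** np.ceil(np.log2(amax / 6.0))
        else:                                     # NVFP4: FP8 (E4M3) block scale, rounding omitted
            scale = amax / 6.0
        scaled = block / scale
        idx = np.abs(np.abs(scaled)[:, None] - E2M1).argmin(axis=1)   # nearest FP4 magnitude
        return np.sign(scaled) * E2M1[idx] * scale

    rng = np.random.default_rng(0)
    x = rng.normal(size=32).astype(np.float32)

    err_mx = np.abs(quantize_fp4(x, True) - x).mean()               # one 32-element MX-FP4 block
    err_nv = np.mean([np.abs(quantize_fp4(b, False) - b).mean()
                      for b in (x[:16], x[16:])])                   # two 16-element NVFP4 blocks

    print(f"bits/element  MX-FP4: {4 + 8/32:.2f}   NVFP4: {4 + 8/16:.2f}")   # 4.25 vs 4.50
    print(f"mean abs err  MX-FP4: {err_mx:.4f}   NVFP4: {err_nv:.4f}")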

OCP standard

The MX format family is an Open Compute Project standard (OCP MX Specification) co-defined by NVIDIA, AMD, ARM, Intel, Meta, Microsoft, Qualcomm. AMD MI355 and Intel Gaudi 3 also implement subsets — MX-FP4 itself isn't NVIDIA-locked. NVFP4 is NVIDIA-specific.

PTX

Blackwell exposes a new SM-level tensor-core PTX op family tcgen05.mma for these formats — including FP4 paths — replacing/augmenting Hopper's wgmma.

06

2nd-Generation Transformer Engine

The 2nd-gen TE handles per-microblock scaling automatically: it tracks recent activation maxima at microblock granularity and chooses each block's scale for the next forward pass. It ships as an open-source library with PyTorch and JAX bindings, and te.Linear is a drop-in replacement for torch.nn.Linear.
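
The drop-in pattern looks roughly like the sketch below, built on Transformer Engine's PyTorch bindings. te.Linear and fp8_autocast are the real entry points; the DelayedScaling recipe shown is the long-standing Hopper-era per-tensor recipe, and the recipe classes that select MX-FP4/NVFP4 block scaling on Blackwell depend on your TE release, so treat the recipe choice as a placeholder:

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling, Format

    layer = te.Linear(4096, 4096, bias=True).cuda()       # drop-in for torch.nn.Linear
    x = torch.randn(16, 4096, device="cuda")

    # Per-tensor delayed scaling (Hopper-era default). On Blackwell, newer recipe
    # classes select microblock (MX / NVFP4) scaling; their names vary by TE version.
    fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID)

    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        y = layer(x)   # GEMM runs through the quantized Transformer Engine path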

Key insight: at FP4, accuracy is sensitive to which block boundary aligns with which channel. The 2nd-gen TE inserts learnable rotations during fine-tuning to reduce outlier sensitivity, a software trick that relies on hardware support for arbitrary block-aligned tensor shapes, which Blackwell provides. TE supports both MX-FP4 (32-element E8M0 blocks) and NVFP4 (16-element E4M3 blocks + per-tensor FP32); NVFP4 is the default for Blackwell inference paths.

Real-world: Llama-3-405B and DeepSeek-V3 ship with FP4 quantised checkpoints (MX-FP4 and NVFP4) for Blackwell deployments. Accuracy loss versus BF16 reference is typically <1% on standard benchmarks, with NVFP4 generally closer to BF16 than MX-FP4.

07

RAS & Decompression Engines

RAS Engine

Reliability / Availability / Serviceability hardware unit. Continuously monitors the silicon for transient errors using AI-assisted predictive analysis: it flags failing components before they take a node offline, migrates work to healthy parts of the GPU, and reports diagnostics to DCGM. Critical at NVL72 scale, where per-GPU failure rates compound across 72 GPUs and drag down the effective MTBF of the rack.

Decompression Engine

Hardware accelerator for LZ4, Snappy, Deflate (gzip-compatible). 800 GB/s peak. Use cases: data-lake ingestion (Apache Parquet pages decompress inline), vector-search index loading, RAG chunk retrieval at near-bandwidth speeds. Frees CPU and DRAM from the decompression cycle.
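
From Python, the usual way to reach GPU-side decompression on the data-lake path is RAPIDS cuDF, which decodes compressed Parquet pages on the GPU; whether a given codec lands on Blackwell's dedicated engine or on SM kernels is a driver/library detail and not something this sketch assumes. The file path is hypothetical:

    import cudf

    # Snappy- or gzip-compressed Parquet: page decompression and column decoding
    # happen on the GPU instead of burning host CPU cycles.
    df = cudf.read_parquet("events.parquet")   # hypothetical path
    print(df.head())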

Also new: TEE-IO (Trusted Execution Environment IO) — the host CPU's TEE (Intel TDX, AMD SEV) and the Blackwell GPU together form a single confidential boundary; encrypted DMA moves data without trusting the hypervisor.

08

Process & Voltage

B100 / B200 / GB200 compute dies on TSMC 4NP — a refinement of 4N with denser cells and slightly improved performance. Roughly 6% better perf/W than 4N at the same voltage.

Component | Process | Transistors | Area
GB100 compute die (each, ×2) | TSMC 4NP | ~104 B | ~800 mm²
B200 package total | TSMC 4NP + CoWoS-L | ~208 B | ~1600 mm² silicon
GB202 (RTX 5090, RTX PRO 6000) | TSMC 4NP, monolithic | 92.2 B | 750 mm²
GB203 (RTX 5080) | TSMC 4NP, monolithic | 45.6 B | 378 mm²
GB205 (RTX 5070) | TSMC 4NP, monolithic | 31.1 B | 263 mm²

Voltage rails on B200: VDD core (~0.85 V at boost), VDDQ-HBM3e (1.1 V), VDDIO-NVLink (~0.75 V), VDD-NV-HBI (~0.65 V on-package). Liquid cooling required at 1000 W — air cooling is no longer feasible at this density.

09

Memory — HBM3e at 8 TB/s, GDDR7 on Consumer

Datacenter HBM3e

  • 8 stacks of HBM3e (4 per compute die)
  • Each stack: 8-Hi or 12-Hi, 24 GB
  • Total: 192 GB on B200
  • Pin rate: ~8 Gbps effective per pin on 9.6 Gbps-class HBM3e parts (NRZ signalling, not PAM4); worked arithmetic after the GDDR7 list
  • Per-stack BW ~1 TB/s → aggregate 8 TB/s
  • Bus width: 8192 bits total (1024 per stack)
  • ECC SECDED + on-die ECC mandatory

Consumer GDDR7

  • RTX 5090: 32 GB GDDR7 at 28 Gbps PAM3 on 512-bit bus → 1.79 TB/s
  • RTX PRO 6000: 96 GB GDDR7 ECC on 512-bit bus → 1.79 TB/s
  • RTX 5080: 16 GB GDDR7 at 30 Gbps on 256-bit → 960 GB/s
  • PAM3 signalling: 3 levels per symbol, a middle ground between NRZ and PAM4 in the power vs. eye-margin trade-off
  • ~1.7× bandwidth-per-pin vs GDDR6X
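
Both memory systems' headline numbers are just pins × per-pin rate; a quick check of the figures above (the ~7.8 Gbps effective HBM3e pin rate is implied by 8 TB/s over an 8192-bit bus):

    def bw_gbs(bus_bits, gbps_per_pin):
        """Aggregate bandwidth in GB/s from bus width and per-pin data rate."""
        return bus_bits * gbps_per_pin / 8

    print(bw_gbs(8 * 1024, 7.8))   # B200 HBM3e: 8 stacks x 1024 bits -> ~7987 GB/s (~8 TB/s)
    print(bw_gbs(512, 28))         # RTX 5090 GDDR7, 512-bit @ 28 Gbps -> 1792 GB/s
    print(bw_gbs(256, 30))         # RTX 5080 GDDR7, 256-bit @ 30 Gbps -> 960 GB/s
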
10

NVLink 5 & NVSwitch 4 (NVL72)

NVLink 5 doubles per-link bandwidth versus NVLink 4: 50 GB/s/dir per link. B200 has 18 links → 900 GB/s/dir = 1.8 TB/s bidirectional per GPU. Signalling: 200 Gbps PAM4 per lane, 2 lanes per link.

Property | NVLink 4 (Hopper) | NVLink 5 (Blackwell)
Per-lane rate | 100 Gbps PAM4 | 200 Gbps PAM4
Lanes per link | 2 | 2
Per-link BW (one direction) | 25 GB/s | 50 GB/s
Links per GPU | 18 | 18
Aggregate per GPU (bidirectional) | 900 GB/s | 1.8 TB/s
NVSwitch generation | NVSwitch 3 | NVSwitch 4 (with NVLink-Sharp)

NVL72: 72× B200 GPUs (in 36× GB200 superchips, 36 Grace + 72 B200, on 18 compute trays) + 9 NVLink-Switch trays (NVSwitch 4). ~130 TB/s aggregate intra-rack NVLink bandwidth, ~13.8 TB unified HBM3e (72 × 192 GB), addressable as one logical GPU. Liquid cooled, ~120 kW per rack.
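
The per-GPU and rack-level figures are straightforward products; a quick sanity check of the numbers quoted above:

    lanes_per_link, gbps_per_lane, links = 2, 200, 18

    per_link_dir_gbs  = lanes_per_link * gbps_per_lane / 8      # 50 GB/s per direction per link
    per_gpu_total_tbs = links * per_link_dir_gbs * 2 / 1000     # 1.8 TB/s bidirectional per GPU

    gpus = 72
    rack_nvlink_tbs = gpus * per_gpu_total_tbs                  # ~129.6 -> "~130 TB/s"
    rack_hbm_tb     = gpus * 192 / 1000                         # ~13.8 TB of HBM3e

    print(per_link_dir_gbs, per_gpu_total_tbs, rack_nvlink_tbs, rack_hbm_tb)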

NVSwitch 4 adds NVLink-Sharp: in-network reductions for collective operations — the switch fabric itself performs an all-reduce sum, halving bandwidth needed. Critical at 72-GPU scale.

11

Consumer Blackwell — RTX 50 Series

The RTX 50-series consumer cards use monolithic GB202/GB203/GB205/GB206/GB207 dies; there is no dual-die packaging on consumer. Same 5th-gen tensor cores (with the FP4 paths exposed on consumer parts; Ada topped out at FP8), 4th-gen RT cores, GDDR7 memory. Compute capability is sm_120 (datacenter B100/B200/B300 are sm_100), so consumer-class kernels are a separate compile target.
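
Because the two die families are separate compile targets, runtime dispatch usually starts from the compute capability. A minimal check with PyTorch; the (10, 0) and (12, 0) pairs correspond to sm_100 and sm_120:

    import torch

    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)

    if (major, minor) >= (12, 0):
        print(f"{name}: consumer/workstation Blackwell (sm_12x)")
    elif (major, minor) >= (10, 0):
        print(f"{name}: datacenter Blackwell (sm_10x)")
    else:
        print(f"{name}: pre-Blackwell part (sm_{major}{minor})")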

SKU | Die | SMs | Boost | Memory | BW | TDP
RTX 5090 | GB202 | 170 | 2.41 GHz | 32 GB GDDR7 @ 28 Gbps | 1792 GB/s | 575 W
RTX PRO 6000 Blackwell | GB202 | 188 | 2.5 GHz | 96 GB GDDR7 ECC @ 28 Gbps | 1792 GB/s | 600 W
RTX 5080 | GB203 | 84 | 2.62 GHz | 16 GB GDDR7 @ 30 Gbps | 960 GB/s | 360 W
RTX 5070 Ti | GB203 | 70 | 2.45 GHz | 16 GB GDDR7 @ 28 Gbps | 896 GB/s | 300 W
RTX 5070 | GB205 | 48 | 2.51 GHz | 12 GB GDDR7 @ 28 Gbps | 672 GB/s | 250 W
RTX 5060 Ti | GB206 | 36 | 2.57 GHz | 16 GB GDDR7 @ 28 Gbps | 448 GB/s | 180 W

Notable: the RTX 5090 retains the FP4 hardware; consumer Blackwell exposes the new tensor-core formats (MX-FP4 and NVFP4), which did not exist at all on Ada. This brings local LLM inference at FP4 to home setups for the first time.
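
What that means in capacity terms: a rough weights-only estimate using the effective bits/element from section 05 (KV cache, activations, and runtime overhead ignored):

    def weight_gb(params_billion, bits_per_element):
        return params_billion * 1e9 * bits_per_element / 8 / 1e9   # GB of weights

    for params in (8, 32, 70):
        print(f"{params}B params: BF16 ~{weight_gb(params, 16):.0f} GB, "
              f"NVFP4 ~{weight_gb(params, 4.5):.1f} GB")

    # A ~32B-parameter model is ~18 GB of weights in NVFP4 and fits in the 5090's
    # 32 GB, whereas the same model needs ~64 GB at BF16.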

Connector: 12V-2×6 (PCIe 5.x), single 600 W cable on RTX 5090 / RTX PRO 6000.

12

Interactive: Blackwell SKU Picker

(Interactive widget: select a SKU to compare die configuration, FP4 / FP8 / BF16 dense throughput, memory, bandwidth in TB/s, TDP, and NVLink.)
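
A throwaway sketch of what the picker does, using only figures quoted in this deck; None marks values the deck does not give, and the filter criteria are arbitrary examples:

    # Data transcribed from the tables in this deck.
    SKUS = {
        "B100":         dict(dies="2x compute",          fp4_pflops=7,  fp8_pflops=None,
                             memory="192 GB HBM3e",      bw_tbs=8.0,  tdp_w=700,  nvlink="NVLink 5, 1.8 TB/s"),
        "B200":         dict(dies="2x compute",          fp4_pflops=9,  fp8_pflops=4.5,
                             memory="192 GB HBM3e",      bw_tbs=8.0,  tdp_w=1000, nvlink="NVLink 5, 1.8 TB/s"),
        "GB200":        dict(dies="1 Grace + 4 compute", fp4_pflops=18, fp8_pflops=None,
                             memory="384 GB HBM3e + 480 GB LPDDR5x", bw_tbs=16.0, tdp_w=2700, nvlink="NVLink 5 + C2C"),
        "RTX PRO 6000": dict(dies="GB202 monolithic",    fp4_pflops=3,  fp8_pflops=None,
                             memory="96 GB GDDR7 ECC",   bw_tbs=1.79, tdp_w=600,  nvlink="none (PCIe)"),
        "RTX 5090":     dict(dies="GB202 monolithic",    fp4_pflops=3,  fp8_pflops=None,
                             memory="32 GB GDDR7",       bw_tbs=1.79, tdp_w=575,  nvlink="none (PCIe)"),
    }

    def pick(min_memory_gb=0, max_tdp_w=10_000):
        """Return SKUs whose total memory and TDP fit the given constraints."""
        chosen = []
        for sku, s in SKUS.items():
            mem_gb = sum(float(t) for t in s["memory"].split() if t.replace(".", "").isdigit())
            if mem_gb >= min_memory_gb and s["tdp_w"] <= max_tdp_w:
                chosen.append(sku)
        return chosen

    print(pick(min_memory_gb=96, max_tdp_w=1200))   # ['B100', 'B200', 'RTX PRO 6000']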