NVIDIA GPU Architectures Series — Presentation 32

Inside Blackwell — Dual-Die, MX-FP4 / NVFP4, NVLink 5

A low-level deep dive into NVIDIA's 2024 dual-die architecture. Two reticle-sized dies bonded by the 10 TB/s NV-HBI link, 5th-generation tensor cores with MX-FP4 and NVFP4 microscaling at ~9 PFLOPS dense, 192 GB of HBM3e at 8 TB/s, NVLink 5 at 1.8 TB/s per GPU, the GB200 superchip and NVL72 rack, plus the consumer RTX 50 series with GDDR7.

B100 · B200 · GB200 · RTX 5090 · RTX PRO 6000 · TSMC 4NP · NV-HBI · CoWoS-L · HBM3e · GDDR7 · MX-FP4 · NVFP4 · NVLink 5 · RAS
[Cover diagram: Die A ↔ NV-HBI ↔ Die B, 5th-gen tensor cores (MX-FP4 + NVFP4), HBM3e, NVLink 5, NVL72]
00

Topics We'll Cover

  • Blackwell at a Glance
  • The Dual-Die Package & NV-HBI
  • B100 / B200 / B300 / GB200 SKUs
  • The Blackwell SM — 5th-Gen Tensor Cores
  • FP4 Microscaling Formats — MX-FP4 & NVFP4
  • 2nd-Generation Transformer Engine
  • RAS & Decompression Engines
  • Process & Voltage
  • Memory — HBM3e at 8 TB/s, GDDR7 on Consumer
  • NVLink 5 & NVSwitch 4 (NVL72)
  • Consumer Blackwell — RTX 50 Series
  • Interactive: Blackwell SKU Picker

01

Blackwell at a Glance

Announced at GTC March 2024, volume in late 2024. Blackwell is NVIDIA's first multi-die GPU at the high end: two reticle-limited compute dies bonded into one logical GPU. Names get confusing — "Blackwell" is the architecture, "B100/B200/B300" are dual-die datacenter GPUs (each carrying two GB100 compute dies, sm_100), and "GB202/GB203/GB205/GB206/GB207" are single-die consumer/workstation dies (sm_120) used in the RTX 50 series.

SKU | Dies | HBM / GDDR | Bandwidth | FP4 dense (MX-FP4 / NVFP4) | TDP
B100 | 2× compute | 192 GB HBM3e | 8 TB/s | ~7 PFLOPS | 700 W
B200 | 2× compute | 192 GB HBM3e | 8 TB/s | 9 PFLOPS | 1000 W
GB200 (1 Grace + 2 B200) | 1 Grace + 4 compute dies | 384 GB HBM3e + 480 GB LPDDR5x | 16 TB/s + 480 GB/s | ~18 PFLOPS | 2700 W
RTX PRO 6000 Blackwell | 1 (GB202 monolithic) | 96 GB GDDR7 ECC | 1.79 TB/s | ~3 PFLOPS | 600 W
RTX 5090 | 1 (GB202) | 32 GB GDDR7 | 1.79 TB/s | ~3 PFLOPS | 575 W
02

The Dual-Die Package & NV-HBI

Each B100/B200 contains two GB100 compute dies, each ~800 mm² on TSMC 4NP, ~104 B transistors each — ~208 B transistors total. The dies are bonded via NV-HBI (NVIDIA High-Bandwidth Interface): a wide silicon-bridge link delivering 10 TB/s die-to-die, transparent to software (the OS sees one GPU).

Why dual die

TSMC's reticle limit is ~858 mm². GH100 was already at 814 mm². To deliver more compute per package, NVIDIA had to either accept the ceiling or split the design. Two ~800 mm² dies + NV-HBI gives ~1.9× the silicon area at substantially better yield than one impossible 1600 mm² die.
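
For intuition on the yield argument, here is a minimal sketch using a simple Poisson defect-yield model; the defect density below is an assumed, illustrative value, not a TSMC or NVIDIA figure:

    import math

    D = 0.1          # assumed defect density in defects/cm^2 (illustrative only)
    A_mono = 16.0    # hypothetical ~1600 mm^2 monolithic die, in cm^2
    A_half = 8.0     # one ~800 mm^2 GB100-class compute die, in cm^2

    # Poisson yield: probability that a die has zero killer defects
    y_mono = math.exp(-D * A_mono)   # ~0.20
    y_half = math.exp(-D * A_half)   # ~0.45

    # With known-good-die sort, good ~800 mm^2 dies are paired after test, so the
    # usable fraction of wafer silicon tracks ~45% rather than the monolithic ~20%
    # (and the 1600 mm^2 die exceeds the reticle limit, so it could not be built at all).
    print(f"monolithic yield ~{y_mono:.0%}, per-die yield ~{y_half:.0%}")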

Why "one GPU"

The two dies share a unified L2, a single CUDA address space, and one NVLink endpoint. Software sees one B200, not two B100s. Coherency across NV-HBI is hardware-managed; the SM scheduler load-balances across both dies.

Packaging: CoWoS-L with LSI (Local Silicon Interconnect) bridges. The bridges only sit between the two dies and between dies and HBM stacks — the rest of the substrate is cheap organic. CoWoS-L scales to larger areas than CoWoS-S at lower per-unit silicon cost.

03

B100 / B200 / B300 / GB200 SKUs

B100 (700 W)

Drop-in HGX H100 socket compatibility. Two compute dies; 192 GB HBM3e at 8 TB/s. Lower clocks than B200 to fit 700 W envelope. Allows existing HGX H100 buyers to upgrade in place. B300 / Blackwell Ultra ramps in 2025 with 288 GB HBM3e (12-high stacks) for larger models.

B200 (1000 W SXM6)

Full performance variant; same silicon, higher clocks, requires liquid-cooled HGX B200 baseboard. ~9 PFLOPS dense FP4 (MX-FP4 / NVFP4), 4.5 PFLOPS FP8, 2.25 PFLOPS BF16, 40 TFLOPS FP64. 192 GB HBM3e (8 stacks × 24 GB).

GB200 superchip (2700 W)

1× Grace ARM CPU + 2× B200 GPUs on a single board, connected by two NVLink-C2C links at 900 GB/s each. 384 GB HBM3e + 480 GB LPDDR5x = 864 GB unified memory. Liquid cooled. Building block of NVL72.

04

The Blackwell SM — 5th-Gen Tensor Cores

Per-SM compute

  • 4 partitions, 1 warp scheduler each
  • 128 FP32 cores, 64 INT32, 64 FP64 (FP64:FP32 ratio of 1:2, carried over from Hopper)
  • 4 tensor cores (5th gen) per SM
  • 16 LD/ST, 16 SFU
  • TMA unit, RAS unit
  • ~256 KB L1 + shared (configurable)
  • ~256 KB register file
  • Cluster-aware (inherited from Hopper)

SM count

  • ~144 SMs per compute die
  • 2 dies per package → ~288 SMs total
  • B200 enables ~256 (binning for yield)
  • B100 enables fewer at lower clocks
  • GPC structure: 8 GPCs × 9 TPCs per die, 2 SMs per TPC (arithmetic sketched below)
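
The SM totals above are just the GPC arithmetic; a quick check (2 SMs per TPC is NVIDIA's usual ratio and is treated as an assumption for GB100 here):

    gpcs_per_die, tpcs_per_gpc, sms_per_tpc = 8, 9, 2   # 2 SMs/TPC assumed

    sms_per_die     = gpcs_per_die * tpcs_per_gpc * sms_per_tpc   # 144
    sms_per_package = 2 * sms_per_die                             # 288 physical SMs
    sms_b200        = 256                                         # ~256 enabled after binning, per this deck

    print(sms_per_die, sms_per_package, sms_b200)   # 144 288 256
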
05

FP4 Microscaling Formats — MX-FP4 & NVFP4

Hopper FP8 used per-tensor scaling — one scale per matrix. Blackwell's 5th-gen tensor cores push to per-microblock scaling, accelerating two FP4 variants: the open MX-FP4 standard and NVIDIA's higher-accuracy NVFP4.

Format | Element bits | Block size | Per-block scale | Effective bits/element
MX-FP8 (E4M3) | 8 | 32 | E8M0 (8-bit) | 8.25
MX-FP6 (E3M2) | 6 | 32 | E8M0 (8-bit) | 6.25
MX-FP6 (E2M3) | 6 | 32 | E8M0 (8-bit) | 6.25
MX-FP4 (E2M1) | 4 | 32 | E8M0 (8-bit, exponent-only) | 4.25
NVFP4 (E2M1) | 4 | 16 | E4M3 (FP8) + per-tensor FP32 | ~4.5
MX-INT8 | 8 | 32 | E8M0 (8-bit) | 8.25

Throughput on B200: dense FP4 (either format) runs at 2× the FP8 rate, ≈ 9 PFLOPS dense and ~18 PFLOPS sparse per package. MX-FP6 runs at the same rate as FP8; the per-block scale factors are applied inside the tensor core at no throughput cost.

MX-FP4 vs NVFP4

MX-FP4 is the open OCP standard: 32-element blocks, an 8-bit exponent-only (E8M0, power-of-two) scale per block. NVFP4 is NVIDIA's variant: smaller 16-element blocks with an E4M3 FP8 per-block scale plus a per-tensor FP32 scale. The finer block size and richer scale type make NVFP4 noticeably more accurate than MX-FP4 in practice, and it is the default in TensorRT-LLM and Transformer Engine for Blackwell deployments. Tensor cores accelerate both at the same headline rate.
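
A small numerical sketch of the two schemes: how a per-block scale is chosen and what the storage overhead works out to. The scale selection and rounding below are simplified assumptions (NVFP4's per-tensor FP32 scale and the E4M3 rounding of its block scale are omitted), not the exact tensor-core algorithm:

    import numpy as np

    E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # FP4 (E2M1) magnitudes

    def quantize_fp4(block, power_of_two_scale):
        amax = np.abs(block).max()
        if power_of_two_scale:                    # MX-FP4: E8M0 block scale (power of two)
            scale = 2.0 ** np.ceil(np.log2(amax / 6.0))
        else:                                     # NVFP4: FP8 (E4M3) block scale, rounding omitted
            scale = amax / 6.0
        scaled = block / scale
        idx = np.abs(np.abs(scaled)[:, None] - E2M1).argmin(axis=1)   # nearest FP4 magnitude
        return np.sign(scaled) * E2M1[idx] * scale

    rng = np.random.default_rng(0)
    x = rng.normal(size=32).astype(np.float32)

    err_mx = np.abs(quantize_fp4(x, True) - x).mean()               # one 32-element MX-FP4 block
    err_nv = np.mean([np.abs(quantize_fp4(b, False) - b).mean()
                      for b in (x[:16], x[16:])])                   # two 16-element NVFP4 blocks

    print(f"bits/element  MX-FP4: {4 + 8/32:.2f}   NVFP4: {4 + 8/16:.2f}")   # 4.25 vs 4.50
    print(f"mean abs err  MX-FP4: {err_mx:.4f}   NVFP4: {err_nv:.4f}")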

OCP standard

The MX format family is an Open Compute Project standard (OCP MX Specification) co-defined by NVIDIA, AMD, ARM, Intel, Meta, Microsoft, Qualcomm. AMD MI355 and Intel Gaudi 3 also implement subsets — MX-FP4 itself isn't NVIDIA-locked. NVFP4 is NVIDIA-specific.

PTX

Blackwell exposes a new SM-level tensor-core PTX op family tcgen05.mma for these formats — including FP4 paths — replacing/augmenting Hopper's wgmma.

06

2nd-Generation Transformer Engine

The 2nd-gen TE handles per-microblock scaling automatically: it tracks recent activation maxima at microblock granularity and chooses each block's scale for the next forward pass. It ships as an open-source library with PyTorch and JAX bindings, and te.Linear is a drop-in replacement for torch.nn.Linear.
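
The drop-in pattern looks roughly like the sketch below, built on Transformer Engine's PyTorch bindings. te.Linear and fp8_autocast are the real entry points; the DelayedScaling recipe shown is the long-standing Hopper-era per-tensor recipe, and the recipe classes that select MX-FP4/NVFP4 block scaling on Blackwell depend on your TE release, so treat the recipe choice as a placeholder:

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling, Format

    layer = te.Linear(4096, 4096, bias=True).cuda()       # drop-in for torch.nn.Linear
    x = torch.randn(16, 4096, device="cuda")

    # Per-tensor delayed scaling (Hopper-era default). On Blackwell, newer recipe
    # classes select microblock (MX / NVFP4) scaling; their names vary by TE version.
    fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID)

    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        y = layer(x)   # GEMM runs through the quantized Transformer Engine path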

Key insight: at FP4, accuracy is sensitive to which block boundary aligns with which channel. The 2nd-gen TE inserts learnable rotations during fine-tuning to reduce outlier sensitivity, a software trick that relies on hardware support for arbitrary block-aligned tensor shapes, which Blackwell provides. TE supports both MX-FP4 (32-element E8M0 blocks) and NVFP4 (16-element E4M3 blocks + per-tensor FP32); NVFP4 is the default for Blackwell inference paths.

Real-world: Llama-3-405B and DeepSeek-V3 ship with FP4 quantised checkpoints (MX-FP4 and NVFP4) for Blackwell deployments. Accuracy loss versus BF16 reference is typically <1% on standard benchmarks, with NVFP4 generally closer to BF16 than MX-FP4.

07

RAS & Decompression Engines

RAS Engine

Reliability / Availability / Serviceability hardware unit. Continuously monitors the silicon for transient errors using AI-assisted predictive analysis: it flags failing components before they take a node offline, migrates work to healthy parts of the GPU, and reports diagnostics to DCGM. Critical at NVL72 scale, where per-GPU failure rates compound across 72 GPUs and drag down the effective MTBF of the rack.

Decompression Engine

Hardware accelerator for LZ4, Snappy, Deflate (gzip-compatible). 800 GB/s peak. Use cases: data-lake ingestion (Apache Parquet pages decompress inline), vector-search index loading, RAG chunk retrieval at near-bandwidth speeds. Frees CPU and DRAM from the decompression cycle.
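
From Python, the usual way to reach GPU-side decompression on the data-lake path is RAPIDS cuDF, which decodes compressed Parquet pages on the GPU; whether a given codec lands on Blackwell's dedicated engine or on SM kernels is a driver/library detail and not something this sketch assumes. The file path is hypothetical:

    import cudf

    # Snappy- or gzip-compressed Parquet: page decompression and column decoding
    # happen on the GPU instead of burning host CPU cycles.
    df = cudf.read_parquet("events.parquet")   # hypothetical path
    print(df.head())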

Also new: TEE-IO (Trusted Execution Environment IO) — the host CPU's TEE (Intel TDX, AMD SEV) and the Blackwell GPU together form a single confidential boundary; encrypted DMA moves data without trusting the hypervisor.

08

Process & Voltage

B100 / B200 / GB200 compute dies on TSMC 4NP — a refinement of 4N with denser cells and slightly improved performance. Roughly 6% better perf/W than 4N at the same voltage.

Component | Process | Transistors | Area
GB100 compute die (each, ×2) | TSMC 4NP | ~104 B | ~800 mm²
B200 package total | TSMC 4NP + CoWoS-L | ~208 B | ~1600 mm² silicon
GB202 (RTX 5090, RTX PRO 6000) | TSMC 4NP, monolithic | 92.2 B | 750 mm²
GB203 (RTX 5080) | TSMC 4NP, monolithic | 45.6 B | 378 mm²
GB205 (RTX 5070) | TSMC 4NP, monolithic | 31.1 B | 263 mm²

Voltage rails on B200: VDD core (~0.85 V at boost), VDDQ-HBM3e (1.1 V), VDDIO-NVLink (~0.75 V), VDD-NV-HBI (~0.65 V on-package). Liquid cooling required at 1000 W — air cooling is no longer feasible at this density.

09

Memory — HBM3e at 8 TB/s, GDDR7 on Consumer

Datacenter HBM3e

  • 8 stacks of HBM3e (4 per compute die)
  • Each stack: 8-Hi or 12-Hi, 24 GB
  • Total: 192 GB on B200
  • Pin rate: ~8 Gbps effective per pin on 9.6 Gbps-class HBM3e parts (NRZ signalling, not PAM4); worked arithmetic after the GDDR7 list
  • Per-stack BW ~1 TB/s → aggregate 8 TB/s
  • Bus width: 8192 bits total (1024 per stack)
  • ECC SECDED + on-die ECC mandatory

Consumer GDDR7

  • RTX 5090: 32 GB GDDR7 at 28 Gbps PAM3 on 512-bit bus → 1.79 TB/s
  • RTX PRO 6000: 96 GB GDDR7 ECC on 512-bit bus → 1.79 TB/s
  • RTX 5080: 16 GB GDDR7 at 30 Gbps on 256-bit → 960 GB/s
  • PAM3 signalling: 3 levels per symbol, a middle ground between NRZ and PAM4 in the power vs. eye-margin trade-off
  • ~1.7× bandwidth-per-pin vs GDDR6X
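
Both memory systems' headline numbers are just pins × per-pin rate; a quick check of the figures above (the ~7.8 Gbps effective HBM3e pin rate is implied by 8 TB/s over an 8192-bit bus):

    def bw_gbs(bus_bits, gbps_per_pin):
        """Aggregate bandwidth in GB/s from bus width and per-pin data rate."""
        return bus_bits * gbps_per_pin / 8

    print(bw_gbs(8 * 1024, 7.8))   # B200 HBM3e: 8 stacks x 1024 bits -> ~7987 GB/s (~8 TB/s)
    print(bw_gbs(512, 28))         # RTX 5090 GDDR7, 512-bit @ 28 Gbps -> 1792 GB/s
    print(bw_gbs(256, 30))         # RTX 5080 GDDR7, 256-bit @ 30 Gbps -> 960 GB/s
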
10

NVLink 5 & NVSwitch 4 (NVL72)

NVLink 5 doubles per-link bandwidth versus NVLink 4: 50 GB/s/dir per link. B200 has 18 links → 900 GB/s/dir = 1.8 TB/s bidirectional per GPU. Signalling: 200 Gbps PAM4 per lane, 2 lanes per link.

Property | NVLink 4 (Hopper) | NVLink 5 (Blackwell)
Per-lane rate | 100 Gbps PAM4 | 200 Gbps PAM4
Lanes per link | 2 | 2
Per-link BW (one direction) | 25 GB/s | 50 GB/s
Links per GPU | 18 | 18
Aggregate per GPU (bidirectional) | 900 GB/s | 1.8 TB/s
NVSwitch generation | NVSwitch 3 | NVSwitch 4 (with NVLink-Sharp)

NVL72: 72× B200 GPUs (in 36× GB200 superchips, 36 Grace + 72 B200, on 18 compute trays) + 9 NVLink-Switch trays (NVSwitch 4). ~130 TB/s aggregate intra-rack NVLink bandwidth, ~13.8 TB unified HBM3e (72 × 192 GB), addressable as one logical GPU. Liquid cooled, ~120 kW per rack.
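
The per-GPU and rack-level figures are straightforward products; a quick sanity check of the numbers quoted above:

    lanes_per_link, gbps_per_lane, links = 2, 200, 18

    per_link_dir_gbs  = lanes_per_link * gbps_per_lane / 8      # 50 GB/s per direction per link
    per_gpu_total_tbs = links * per_link_dir_gbs * 2 / 1000     # 1.8 TB/s bidirectional per GPU

    gpus = 72
    rack_nvlink_tbs = gpus * per_gpu_total_tbs                  # ~129.6 -> "~130 TB/s"
    rack_hbm_tb     = gpus * 192 / 1000                         # ~13.8 TB of HBM3e

    print(per_link_dir_gbs, per_gpu_total_tbs, rack_nvlink_tbs, rack_hbm_tb)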

NVSwitch 4 adds NVLink-Sharp: in-network reductions for collective operations — the switch fabric itself performs an all-reduce sum, halving bandwidth needed. Critical at 72-GPU scale.

11

Consumer Blackwell — RTX 50 Series

The RTX 50-series consumer cards use monolithic GB202/GB203/GB205/GB206/GB207 dies; there is no dual-die packaging on consumer. Same 5th-gen tensor cores (with the FP4 paths exposed on consumer parts; Ada topped out at FP8), 4th-gen RT cores, GDDR7 memory. Compute capability is sm_120 (datacenter B100/B200/B300 are sm_100), so consumer-class kernels are a separate compile target.
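
Because the two die families are separate compile targets, runtime dispatch usually starts from the compute capability. A minimal check with PyTorch; the (10, 0) and (12, 0) pairs correspond to sm_100 and sm_120:

    import torch

    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)

    if (major, minor) >= (12, 0):
        print(f"{name}: consumer/workstation Blackwell (sm_12x)")
    elif (major, minor) >= (10, 0):
        print(f"{name}: datacenter Blackwell (sm_10x)")
    else:
        print(f"{name}: pre-Blackwell part (sm_{major}{minor})")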

SKU | Die | SMs | Boost | Memory | BW | TDP
RTX 5090 | GB202 | 170 | 2.41 GHz | 32 GB GDDR7 @ 28 Gbps | 1792 GB/s | 575 W
RTX PRO 6000 Blackwell | GB202 | 188 | 2.5 GHz | 96 GB GDDR7 ECC @ 28 Gbps | 1792 GB/s | 600 W
RTX 5080 | GB203 | 84 | 2.62 GHz | 16 GB GDDR7 @ 30 Gbps | 960 GB/s | 360 W
RTX 5070 Ti | GB203 | 70 | 2.45 GHz | 16 GB GDDR7 @ 28 Gbps | 896 GB/s | 300 W
RTX 5070 | GB205 | 48 | 2.51 GHz | 12 GB GDDR7 @ 28 Gbps | 672 GB/s | 250 W
RTX 5060 Ti | GB206 | 36 | 2.57 GHz | 16 GB GDDR7 @ 28 Gbps | 448 GB/s | 180 W

Notable: the RTX 5090 retains the FP4 hardware; consumer Blackwell exposes the new tensor-core formats (MX-FP4 and NVFP4), which did not exist at all on Ada. This brings local LLM inference at FP4 to home setups for the first time.
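
What that means in capacity terms: a rough weights-only estimate using the effective bits/element from section 05 (KV cache, activations, and runtime overhead ignored):

    def weight_gb(params_billion, bits_per_element):
        return params_billion * 1e9 * bits_per_element / 8 / 1e9   # GB of weights

    for params in (8, 32, 70):
        print(f"{params}B params: BF16 ~{weight_gb(params, 16):.0f} GB, "
              f"NVFP4 ~{weight_gb(params, 4.5):.1f} GB")

    # A ~32B-parameter model is ~18 GB of weights in NVFP4 and fits in the 5090's
    # 32 GB, whereas the same model needs ~64 GB at BF16.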

Connector: 12V-2×6 (PCIe 5.x), single 600 W cable on RTX 5090 / RTX PRO 6000.

12

Interactive: Blackwell SKU Picker

(Interactive widget: select a SKU to compare die configuration, FP4 / FP8 / BF16 dense throughput, memory, bandwidth in TB/s, TDP, and NVLink.)
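
A throwaway sketch of what the picker does, using only figures quoted in this deck; None marks values the deck does not give, and the filter criteria are arbitrary examples:

    # Data transcribed from the tables in this deck.
    SKUS = {
        "B100":         dict(dies="2x compute",          fp4_pflops=7,  fp8_pflops=None,
                             memory="192 GB HBM3e",      bw_tbs=8.0,  tdp_w=700,  nvlink="NVLink 5, 1.8 TB/s"),
        "B200":         dict(dies="2x compute",          fp4_pflops=9,  fp8_pflops=4.5,
                             memory="192 GB HBM3e",      bw_tbs=8.0,  tdp_w=1000, nvlink="NVLink 5, 1.8 TB/s"),
        "GB200":        dict(dies="1 Grace + 4 compute", fp4_pflops=18, fp8_pflops=None,
                             memory="384 GB HBM3e + 480 GB LPDDR5x", bw_tbs=16.0, tdp_w=2700, nvlink="NVLink 5 + C2C"),
        "RTX PRO 6000": dict(dies="GB202 monolithic",    fp4_pflops=3,  fp8_pflops=None,
                             memory="96 GB GDDR7 ECC",   bw_tbs=1.79, tdp_w=600,  nvlink="none (PCIe)"),
        "RTX 5090":     dict(dies="GB202 monolithic",    fp4_pflops=3,  fp8_pflops=None,
                             memory="32 GB GDDR7",       bw_tbs=1.79, tdp_w=575,  nvlink="none (PCIe)"),
    }

    def pick(min_memory_gb=0, max_tdp_w=10_000):
        """Return SKUs whose total memory and TDP fit the given constraints."""
        chosen = []
        for sku, s in SKUS.items():
            mem_gb = sum(float(t) for t in s["memory"].split() if t.replace(".", "").isdigit())
            if mem_gb >= min_memory_gb and s["tdp_w"] <= max_tdp_w:
                chosen.append(sku)
        return chosen

    print(pick(min_memory_gb=96, max_tdp_w=1200))   # ['B100', 'B200', 'RTX PRO 6000']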