A low-level deep dive into NVIDIA's 2024 dual-die Blackwell architecture. Two reticle-sized dies bonded by the 10 TB/s NV-HBI link, 5th-generation tensor cores with MX-FP4 and NVFP4 microscaling at ~9 PFLOPS dense, 192 GB of HBM3e at 8 TB/s, NVLink 5 at 1.8 TB/s per GPU, the GB200 superchip and NVL72 rack, plus the consumer RTX 50 series with GDDR7.
Announced at GTC in March 2024, with volume shipments from late 2024. Blackwell is NVIDIA's first multi-die GPU at the high end: two reticle-limited compute dies bonded into one logical GPU. The naming gets confusing — "Blackwell" is the architecture, "B100/B200/B300" are the dual-die datacenter GPUs (each carrying two GB100 compute dies, sm_100), and "GB202/GB203/GB205/GB206/GB207" are the single-die consumer/workstation dies (sm_120) used in the RTX 50 series.
| SKU | Dies | HBM / GDDR | Bandwidth | FP4 dense (MX-FP4 / NVFP4) | TDP |
|---|---|---|---|---|---|
| B100 | 2× compute | 192 GB HBM3e | 8 TB/s | ~7 PFLOPS | 700 W |
| B200 | 2× compute | 192 GB HBM3e | 8 TB/s | 9 PFLOPS | 1000 W |
| GB200 (1 Grace + 2 B200) | 1 Grace + 4 compute dies | 384 GB HBM3e + 480 GB LPDDR5x | 16 TB/s + 480 GB/s | ~18 PFLOPS | 2700 W |
| RTX PRO 6000 Blackwell | 1 (GB202 monolithic) | 96 GB GDDR7 ECC | 1.79 TB/s | ~3 PFLOPS | 600 W |
| RTX 5090 | 1 (GB202) | 32 GB GDDR7 | 1.79 TB/s | ~3 PFLOPS | 575 W |
Each B100/B200 contains two GB100 compute dies, each ~800 mm² on TSMC 4NP, ~104 B transistors each — ~208 B transistors total. The dies are bonded via NV-HBI (NVIDIA High-Bandwidth Interface): a wide silicon-bridge link delivering 10 TB/s die-to-die, transparent to software (the OS sees one GPU).
TSMC's reticle limit is ~858 mm². GH100 was already at 814 mm². To deliver more compute per package, NVIDIA had to either accept the ceiling or split the design. Two ~800 mm² dies + NV-HBI gives ~1.9× the silicon area at substantially better yield than one impossible 1600 mm² die.
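To see why the split wins on yield economics, here is a minimal sketch under a simple Poisson defect model; the defect density is an assumed illustrative figure, not a published TSMC number, and packaging yield is ignored.

```python
import math

def poisson_yield(area_mm2: float, d0_per_cm2: float) -> float:
    """Fraction of good dies under a simple Poisson defect model: Y = exp(-A * D0)."""
    return math.exp(-(area_mm2 / 100.0) * d0_per_cm2)

D0 = 0.1  # assumed defects per cm^2 -- illustrative only

y_small = poisson_yield(800, D0)    # one ~800 mm^2 GB100-class compute die
y_big   = poisson_yield(1600, D0)   # hypothetical 1600 mm^2 monolithic die (above reticle anyway)

# Known-good-die pairing: dies are tested before bonding, so a bad die is discarded
# on its own instead of dragging a whole oversized die down with it.
silicon_per_good_dual = 2 * 800 / y_small   # wafer mm^2 spent per good dual-die package
silicon_per_good_mono = 1600 / y_big        # wafer mm^2 spent per good monolithic die

print(f"800 mm^2 die yield:   {y_small:.1%}")
print(f"1600 mm^2 die yield:  {y_big:.1%}")
print(f"silicon per good package, dual-die:   {silicon_per_good_dual:.0f} mm^2")
print(f"silicon per good package, monolithic: {silicon_per_good_mono:.0f} mm^2")
```

The absolute numbers depend entirely on the assumed D0; the point is the ratio: with known-good-die testing before bonding, wasted silicon scales with the yield of an 800 mm² die rather than the much worse yield of a 1600 mm² one.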
The two dies share a unified L2, a single CUDA address space, and one NVLink endpoint. Software sees a single GPU, not two half-sized ones. Coherency across NV-HBI is hardware-managed; the SM scheduler load-balances work across both dies.
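The "one logical GPU" behaviour is visible directly from the host; a quick check, assuming a PyTorch environment on a B200 node (the device name and capacity shown are indicative, not measured):

```python
import torch

# A dual-die B200 enumerates as a single CUDA device, not two.
print(torch.cuda.device_count())                 # counts packages, not compute dies
props = torch.cuda.get_device_properties(0)
print(props.name)                                # e.g. "NVIDIA B200"
print(round(props.total_memory / 2**30), "GiB")  # the full HBM3e pool of both dies, one allocation space
```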
Packaging: CoWoS-L with LSI (Local Silicon Interconnect) bridges. The bridges only sit between the two dies and between dies and HBM stacks — the rest of the substrate is cheap organic. CoWoS-L scales to larger areas than CoWoS-S at lower per-unit silicon cost.
B100: drop-in HGX H100 socket compatibility. Two compute dies; 192 GB HBM3e at 8 TB/s. Lower clocks than B200 to fit the 700 W envelope, which lets existing HGX H100 buyers upgrade in place. B300 / Blackwell Ultra ramps in 2025 with 288 GB HBM3e (12-high stacks) for larger models.
B200: the full-performance variant. Same silicon, higher clocks; requires the liquid-cooled HGX B200 baseboard. ~9 PFLOPS dense FP4 (MX-FP4 / NVFP4), 4.5 PFLOPS FP8, 2.25 PFLOPS BF16, 40 TFLOPS FP64. 192 GB HBM3e (8 stacks × 24 GB).
GB200: 1× Grace ARM CPU + 2× B200 GPUs on a single board, connected by two NVLink-C2C links at 900 GB/s each. 384 GB HBM3e + 480 GB LPDDR5x = 864 GB of unified memory. Liquid cooled. The building block of NVL72.
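As a rough check on what 864 GB of unified memory means in practice, a back-of-the-envelope sketch; the model size, FP4 packing, and KV-cache figure are illustrative assumptions, not measurements.

```python
# Rough memory budget for one GB200 superchip (all sizes illustrative).
HBM3E_GB   = 384      # 2 x 192 GB across both B200s
LPDDR5X_GB = 480      # Grace-attached, reachable over NVLink-C2C

params      = 405e9   # e.g. a 405B-parameter model
bytes_per_w = 0.5     # FP4 weights: 4 bits/element, ignoring block-scale overhead
weights_gb  = params * bytes_per_w / 1e9

kv_cache_gb = 60      # assumed KV-cache budget for a long-context serving batch

total_gb = weights_gb + kv_cache_gb
print(f"weights: {weights_gb:.0f} GB, total: {total_gb:.0f} GB, "
      f"HBM only: {'fits' if total_gb <= HBM3E_GB else 'spills to LPDDR5x'}")
```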
Hopper FP8 used per-tensor scaling — one scale per matrix. Blackwell's 5th-gen tensor cores push to per-microblock scaling, accelerating two FP4 variants: the open MX-FP4 standard and NVIDIA's higher-accuracy NVFP4.
| Format | Element bits | Block size | Per-block scale | Effective bits/element |
|---|---|---|---|---|
| MX-FP8 (E4M3) | 8 | 32 | E8M0 (8-bit) | 8.25 |
| MX-FP6 (E3M2) | 6 | 32 | E8M0 (8-bit) | 6.25 |
| MX-FP6 (E2M3) | 6 | 32 | E8M0 (8-bit) | 6.25 |
| MX-FP4 (E2M1) | 4 | 32 | E8M0 (8-bit, exponent-only) | 4.25 |
| NVFP4 (E2M1) | 4 | 16 | E4M3 (FP8) + per-tensor FP32 | ~4.5 |
| MX-INT8 | 8 | 32 | E8M0 (8-bit) | 8.25 |
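The effective bits/element column is just the element width plus the per-block scale amortised across the block; the arithmetic:

```python
def effective_bits(element_bits: int, block_size: int, scale_bits: int) -> float:
    """Element width plus the shared per-block scale amortised over the block."""
    return element_bits + scale_bits / block_size

print(effective_bits(4, 32, 8))   # MX-FP4:  4 + 8/32 = 4.25
print(effective_bits(4, 16, 8))   # NVFP4:   4 + 8/16 = 4.5 (plus a negligible per-tensor FP32 scale)
print(effective_bits(8, 32, 8))   # MX-FP8:  8 + 8/32 = 8.25
```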
Throughput on B200: dense FP4 (either format) is 2× FP8, ≈ 9 PFLOPS dense and ~18 PFLOPS sparse per package. MX-FP6 runs at the same rate as FP8 (applying the block scales costs nothing in the tensor-core datapath).
MX-FP4 is the open OCP standard: 32-element blocks, an 8-bit exponent-only (E8M0, power-of-two) scale per block. NVFP4 is NVIDIA's variant: smaller 16-element blocks with an E4M3 FP8 per-block scale plus a per-tensor FP32 scale. The finer block size and richer scale type make NVFP4 noticeably more accurate than MX-FP4 in practice, and it is the default in TensorRT-LLM and Transformer Engine for Blackwell deployments. Tensor cores accelerate both at the same headline rate.
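To make the block-scaling mechanics concrete, here is a simplified numpy sketch (not NVIDIA's kernels): it rounds each value to the nearest FP4 E2M1 magnitude under a shared per-block scale, using a power-of-two scale on 32-element blocks for MX-FP4 and a 16-element block with a plain float scale standing in for NVFP4's E4M3 scale.

```python
import numpy as np

FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # positive E2M1 magnitudes

def quantize_block(block: np.ndarray, power_of_two_scale: bool) -> np.ndarray:
    """Quantize one block to FP4 E2M1 with a shared scale, then dequantize.

    power_of_two_scale=True  ~ MX-FP4 (E8M0 scale, 32-element blocks)
    power_of_two_scale=False ~ NVFP4-style (FP8 E4M3 scale, 16-element blocks),
    approximated here with an unquantized float scale for brevity.
    """
    amax = np.abs(block).max()
    if amax == 0:
        return np.zeros_like(block)
    scale = amax / FP4_E2M1_GRID[-1]            # map the block max onto the largest FP4 value
    if power_of_two_scale:
        scale = 2.0 ** np.ceil(np.log2(scale))  # E8M0: round the scale up to a power of two
    mags = np.abs(block) / scale                # snap each element to the nearest FP4 magnitude
    idx = np.abs(mags[:, None] - FP4_E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(block) * FP4_E2M1_GRID[idx] * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

mx   = np.concatenate([quantize_block(b, True)  for b in x.reshape(-1, 32)])
nvf4 = np.concatenate([quantize_block(b, False) for b in x.reshape(-1, 16)])
print("MX-FP4 RMS error:", np.sqrt(np.mean((x - mx) ** 2)))
print("NVFP4  RMS error:", np.sqrt(np.mean((x - nvf4) ** 2)))
```

On Gaussian-ish data the finer blocks and non-power-of-two scale cut the quantisation error noticeably, which is the practical argument for NVFP4.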
The MX format family is an Open Compute Project standard (OCP MX Specification) co-defined by NVIDIA, AMD, ARM, Intel, Meta, Microsoft, Qualcomm. AMD MI355 and Intel Gaudi 3 also implement subsets — MX-FP4 itself isn't NVIDIA-locked. NVFP4 is NVIDIA-specific.
Blackwell exposes a new SM-level tensor-core PTX op family tcgen05.mma for these formats — including FP4 paths — replacing/augmenting Hopper's wgmma.
The second-generation Transformer Engine (TE) handles per-microblock scaling automatically: it tracks recent activation maxima at microblock granularity and chooses a scale per block for the next forward pass. It is an open-source library with PyTorch and JAX bindings, and te.Linear is a drop-in replacement for torch.nn.Linear.
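A minimal sketch of the drop-in swap, assuming an FP8-capable GPU and using TE's documented FP8 delayed-scaling recipe; on Blackwell the MX-FP4 / NVFP4 block-scaling recipes plug into the same fp8_recipe hook, but their exact class names vary by TE release, so this sticks to the delayed-scaling path.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Drop-in swap: te.Linear in place of torch.nn.Linear.
model = torch.nn.Sequential(
    te.Linear(4096, 11008, bias=False),
    torch.nn.GELU(),
    te.Linear(11008, 4096, bias=False),
).cuda()

# Delayed scaling: TE tracks recent amax history and picks scales for the next pass.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

x = torch.randn(8, 512, 4096, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = model(x)
y.sum().backward()
```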
Key insight: at FP4, accuracy is sensitive to which block boundary aligns with which channel. TE inserts learnable rotations during fine-tuning to reduce outlier sensitivity, a software trick that relies on hardware support for arbitrary block-aligned tensor shapes, which Blackwell provides. TE supports both MX-FP4 (32-element E8M0 blocks) and NVFP4 (16-element E4M3 blocks + per-tensor FP32); NVFP4 is the default for Blackwell inference paths.
Real-world: Llama-3-405B and DeepSeek-V3 ship with FP4 quantised checkpoints (MX-FP4 and NVFP4) for Blackwell deployments. Accuracy loss versus BF16 reference is typically <1% on standard benchmarks, with NVFP4 generally closer to BF16 than MX-FP4.
The RAS (Reliability / Availability / Serviceability) engine: a dedicated hardware unit that continuously monitors the silicon for transient errors using AI-assisted predictive analysis. It flags failing components before they take a node offline, shifts work to healthy parts of the GPU, and reports diagnostics to DCGM. Critical at NVL72 scale, where per-GPU failure rates compound across 72 GPUs.
Decompression engine: a hardware accelerator for LZ4, Snappy, and Deflate (gzip-compatible) at 800 GB/s peak. Use cases: data-lake ingestion (Apache Parquet pages decompress inline), vector-search index loading, and RAG chunk retrieval at near-bandwidth speeds. Frees the CPU and DRAM from the decompression loop.
Also new: TEE-IO (Trusted Execution Environment IO) — the host CPU's TEE (Intel TDX, AMD SEV) and the Blackwell GPU together form a single confidential boundary; encrypted DMA moves data without trusting the hypervisor.
The B100 / B200 / GB200 compute dies are fabricated on TSMC 4NP, a refinement of 4N with denser cells and slightly improved performance: roughly 6% better perf/W than 4N at the same voltage.
| Component | Process | Transistors | Area |
|---|---|---|---|
| GB100 compute die (each, ×2) | TSMC 4NP | ~104 B | ~800 mm² |
| B200 package total | TSMC 4NP + CoWoS-L | ~208 B | ~1600 mm² silicon |
| GB202 (RTX 5090, RTX PRO 6000) | TSMC 4NP monolithic | 92.2 B | 750 mm² |
| GB203 (RTX 5080) | TSMC 4NP monolithic | 45.6 B | 378 mm² |
| GB205 (RTX 5070) | TSMC 4NP monolithic | 31.1 B | 263 mm² |
Voltage rails on B200: VDD core (~0.85 V at boost), VDDQ-HBM3e (1.1 V), VDDIO-NVLink (~0.75 V), VDD-NV-HBI (~0.65 V on-package). Liquid cooling required at 1000 W — air cooling is no longer feasible at this density.
NVLink 5 doubles per-link bandwidth versus NVLink 4: 50 GB/s/dir per link. B200 has 18 links → 900 GB/s/dir = 1.8 TB/s bidirectional per GPU. Signalling: 200 Gbps PAM4 per lane, 2 lanes per link.
| Property | NVLink 4 (Hopper) | NVLink 5 (Blackwell) |
|---|---|---|
| Per-lane rate | 100 Gbps PAM4 | 200 Gbps PAM4 |
| Lanes per link | 2 | 2 |
| Per-link BW (1-dir) | 25 GB/s | 50 GB/s |
| Links per GPU | 18 | 18 |
| Aggregate per GPU (bidirectional) | 900 GB/s | 1.8 TB/s |
| NVSwitch generation | NVSwitch 3 | NVSwitch 4 (with NVLink-Sharp) |
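The per-GPU numbers fall straight out of the signalling figures; a quick sanity check:

```python
# NVLink 5 link math, per the table above.
lane_gbps      = 200      # PAM4 signalling rate per lane, per direction
lanes_per_link = 2
links_per_gpu  = 18

per_link_gbs = lane_gbps * lanes_per_link / 8     # 50 GB/s per direction per link
per_gpu_gbs  = per_link_gbs * links_per_gpu       # 900 GB/s per direction per GPU
bidir_tbs    = 2 * per_gpu_gbs / 1000             # 1.8 TB/s bidirectional per GPU

print(per_link_gbs, per_gpu_gbs, bidir_tbs)
```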
NVL72: 36 GB200 superchips (36 Grace + 72 B200) across 18 compute trays, plus 9 NVLink switch trays (NVSwitch 4). ~130 TB/s aggregate intra-rack NVLink bandwidth and ~13.8 TB of unified HBM3e (72 × 192 GB), addressable as one logical GPU. Liquid cooled, ~120 kW per rack.
NVSwitch 4 extends NVLink SHARP: in-network reductions for collective operations. The switch fabric itself performs the all-reduce sum, roughly halving the data each GPU has to put on the wire. Critical at 72-GPU scale.
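A rough model of that saving, counting only payload bytes and ignoring latency and protocol overhead: a ring all-reduce makes each GPU inject about 2(N-1)/N times the buffer size, while a switch-side reduction needs each GPU to send its buffer once and receive the sum once. The 10 GB gradient bucket below is an illustrative assumption.

```python
def ring_allreduce_bytes_per_gpu(buffer_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU injects into the fabric for a ring all-reduce."""
    return 2 * (n_gpus - 1) / n_gpus * buffer_bytes

def sharp_allreduce_bytes_per_gpu(buffer_bytes: float) -> float:
    """With in-network reduction, each GPU sends its buffer once; the switch returns the sum."""
    return buffer_bytes

grad_bytes = 10e9   # assumed 10 GB gradient bucket, illustrative only
n = 72
print(f"ring:  {ring_allreduce_bytes_per_gpu(grad_bytes, n) / 1e9:.1f} GB injected per GPU")
print(f"SHARP: {sharp_allreduce_bytes_per_gpu(grad_bytes) / 1e9:.1f} GB injected per GPU")
```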
The RTX 50-series consumer cards use monolithic GB202/GB203/GB205/GB206/GB207 dies — no dual-die packaging on consumer. Same 5th-gen tensor cores (FP4 enabled this time, unlike Ada's gating), 4th-gen RT cores, GDDR7 memory. Compute capability is sm_120 (datacenter B100/B200/B300 are sm_100), so consumer-class kernels are a separate compile target.
| SKU | Die | SMs | Boost | Memory | BW | TDP |
|---|---|---|---|---|---|---|
| RTX 5090 | GB202 | 170 | 2.41 GHz | 32 GB GDDR7 28 Gbps | 1792 GB/s | 575 W |
| RTX PRO 6000 Blackwell | GB202 (near-full) | 188 | 2.5 GHz | 96 GB GDDR7 ECC 28 Gbps | 1792 GB/s | 600 W |
| RTX 5080 | GB203 | 84 | 2.62 GHz | 16 GB GDDR7 30 Gbps | 960 GB/s | 360 W |
| RTX 5070 Ti | GB203 | 70 | 2.45 GHz | 16 GB GDDR7 28 Gbps | 896 GB/s | 300 W |
| RTX 5070 | GB205 | 48 | 2.51 GHz | 12 GB GDDR7 28 Gbps | 672 GB/s | 250 W |
| RTX 5060 Ti | GB206 | 36 | 2.57 GHz | 16 GB GDDR7 28 Gbps | 448 GB/s | 180 W |
Notable: the RTX 5090 retains FP4 hardware. Consumer Blackwell exposes the new tensor-core formats (MX-FP4 and NVFP4) rather than gating them, as Ada did with FP8 on consumer parts. This brings local LLM inference at FP4 to home setups for the first time.
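Because sm_100 and sm_120 are separate compile targets, libraries typically branch on compute capability at runtime; a minimal PyTorch-side check:

```python
import torch

# Datacenter Blackwell (B100/B200/B300) reports compute capability 10.x (sm_100);
# consumer and workstation Blackwell (RTX 50 series, RTX PRO 6000) report 12.x (sm_120).
major, minor = torch.cuda.get_device_capability(0)

if major >= 12:
    print("consumer/workstation Blackwell (sm_120): FP4 tensor cores, GDDR7")
elif major == 10:
    print("datacenter Blackwell (sm_100): dual-die package, HBM3e, NVLink 5")
else:
    print(f"pre-Blackwell or other target (sm_{major}{minor})")
```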
Connector: 12V-2×6 (PCIe 5.x), single 600 W cable on RTX 5090 / RTX PRO 6000.