Two reticle-limited dies bonded into one logical GPU. 5th-generation tensor cores with MX-FP4 and NVFP4 microscaling. NVLink 5 at 1.8 TB/s per GPU and NVL72 racks that act as one giant 13.8 TB GPU. The architecture purpose-built for trillion-parameter MoE models.
Blackwell is the first NVIDIA architecture where the GPU is no longer a single die. This deck walks the package, the maths, the rack, and ends with a planner you can drive yourself.
Announced at GTC March 2024, Blackwell is the first NVIDIA datacenter GPU built from two dies bonded into one logical device. It is the architecture that finally steps past the lithographic reticle limit, and it does so while doubling tensor throughput again with native FP4.
A single die can't exceed ~858 mm² (the EUV reticle limit). Hopper's GH100 was already hard against it at ~814 mm². To grow further, NVIDIA bonded two reticle-sized dies (~800 mm² each) across NV-HBI, the in-package fabric.
MX-FP4 (E2M1 + per-32-element scale) doubles inference throughput vs FP8 with minimal accuracy loss on big LLMs. The first wave of MoE models (DeepSeek-V3, Mixtral) shipped MX-FP4 quantised weights for Blackwell day-one.
1.8 TB/s per GPU on NVLink 5 enables 72-GPU coherent NVLink domains — the NVL72 rack acts as one logical GPU with 13.8 TB of HBM3e and 130 TB/s of fabric. Trillion-parameter MoE training without InfiniBand-shaped collective bottlenecks.
| Property | Blackwell (B200) | Hopper (H100 SXM) | Multiplier |
|---|---|---|---|
| Die count | 2 (NV-HBI bonded) | 1 | 2× |
| Process | TSMC 4NP custom | TSMC 4N | refined |
| Transistors | ~208 billion | ~80 billion | 2.6× |
| HBM capacity | 192 GB HBM3e | 80 GB HBM3 | 2.4× |
| HBM bandwidth | 8 TB/s | 3.35 TB/s | 2.4× |
| FP8 dense | 4.5 PFLOPS | 2.0 PFLOPS | 2.25× |
| FP4 dense | 9 PFLOPS | — | new |
| NVLink per GPU | 1.8 TB/s (gen 5) | 0.9 TB/s (gen 4) | 2× |
| TDP | 1000 W | 700 W | 1.43× |
Volume shipments through 2024 and into 2025. The first deployments were hyperscaler training clusters (Meta, Microsoft, OpenAI) on GB200 NVL72 racks, with HGX B200 8-way baseboards arriving for enterprise shortly after. The B100, a 700 W drop-in for the H100 SXM socket, was the upgrade path for existing HGX H100 customers.
Blackwell is the first GPU where you cannot draw the chip without showing the package. Two reticle-sized dies sit on a CoWoS-L interposer, surrounded by eight HBM3e stacks, bonded edge-to-edge across the NV-HBI link.
From the CUDA programmer's perspective, B200 is one GPU. cudaGetDeviceCount() returns 1 per package, the SM count is reported as the union (B200 = ~160 SMs), and NVLink targets the package, not individual dies. The dual-die nature only shows up in profiler traces of NV-HBI traffic and in occasional SM-affinity heuristics inside cuBLAS / Transformer Engine.
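From the Python side this is directly observable. A minimal sketch using standard PyTorch CUDA introspection (it assumes a Blackwell part is actually installed; the exact SM count reported depends on SKU and driver):

```python
import torch

# A dual-die B200 still enumerates as a single CUDA device per package.
print(torch.cuda.device_count())              # one entry per physical package

props = torch.cuda.get_device_properties(0)
print(props.name)                             # e.g. "NVIDIA B200"
print(props.multi_processor_count)            # SM count reported as the union of both dies
print(props.total_memory / 1e9)               # ~192 GB of HBM3e, one flat address space
```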
Blackwell launched as a family, not a single product. The matrix below is the one you actually need when speccing a system.
| SKU | TDP | HBM | BW | FP4 sparse | Notes |
|---|---|---|---|---|---|
| B100 | 700 W | 192 GB HBM3e | 8 TB/s | 14 PFLOPS | Drop-in HGX H100 socket compatibility. Same 700 W envelope as H100 SXM. Throttled FP4/FP8 vs B200. |
| B200 | 1000 W | 192 GB HBM3e | 8 TB/s | 18 PFLOPS | Full performance Blackwell. New HGX B200 baseboard (1 kW per socket). Liquid- or air-cooled. |
| B200 NVL | 1000 W | 192 GB HBM3e | 8 TB/s | 18 PFLOPS | Inference-tuned variant for NVL platforms; same silicon, tuned firmware/clocks for token-generation workloads. |
| GB200 | ~2700 W (module) | 2× 192 GB + 480 GB LPDDR5x | 2× 8 TB/s + ~0.5 TB/s | 36 PFLOPS | Superchip: 1× Grace ARM (72 cores, 480 GB LPDDR5x) + 2× B200 over NVLink-C2C 900 GB/s coherent. |
| HGX B200 | ~8 kW board | 8× 192 GB = 1.5 TB | 64 TB/s aggregate | 144 PFLOPS | 8-GPU baseboard with NVSwitch 4. The conventional drop-in for AI servers replacing HGX H100/H200. |
| DGX B200 | ~14 kW system | 1.5 TB HBM3e + 4 TB LPDDR5 | 64 TB/s | 144 PFLOPS | Full NVIDIA-built system: HGX B200 + 2× Intel Xeon + 8× ConnectX-7 + 2× BlueField-3. |
Use B100. 700 W matches your existing power and thermal envelope; you keep the baseboard and NVSwitch, just swap GPUs. ~2× the throughput at the same wattage. The pragmatic upgrade path.
Go straight to GB200 NVL72. Unified Grace memory + 72-GPU coherent domain + liquid cooling. Designed for ≥ 1T-parameter MoE training. ~120 kW per rack but the perf/$ is unmatched.
HGX B200. Air- or liquid-cooled; fits in a standard 6U-8U chassis; OEM systems from Supermicro, Dell, HPE. The post-Hopper successor to HGX H100/H200 in the conventional rack.
Blackwell's 5th-generation tensor cores natively support the OCP Microscaling (MX) formats: small element widths paired with a shared per-block exponent. The result: FP4 throughput at FP8-class accuracy for inference.
4-bit element: 1 sign, 2 exponent, 1 mantissa.
Per-32-element scale: 8-bit power-of-two (E8M0).
Effective ~4.25 bits/element after amortising the scale.
~2× throughput over FP8 on the same silicon. The headline format for Blackwell inference.
6-bit elements with the same 8-bit shared scale per 32 elements.
Two layouts: E3M2 (more range) or E2M3 (more precision).
Intermediate accuracy & throughput — useful where FP4 starts losing too much accuracy on activations or sensitive layers.
The Hopper formats, still supported natively. E4M3 for forward/weights, E5M2 for backward/gradients. Per-tensor scaling (legacy) or per-block via MX (new on Blackwell). The compatibility layer.
Sparse 2:4 doubles all of these. Hardware enforces a structured sparsity pattern: in every group of four contiguous weights, two must be zero. The tensor core skips the zeros; you get the second multiplier for free if you can quantise to that pattern at training time.
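A toy sketch of what "quantise to that pattern" means: for every contiguous group of four weights, keep the two largest magnitudes and zero the rest. Illustrative only; production flows use NVIDIA's automatic-sparsity tooling rather than a loop like this.

```python
import torch

def prune_2_to_4(weights: torch.Tensor) -> torch.Tensor:
    """Toy 2:4 pruning: in every contiguous group of four, zero the two smallest magnitudes."""
    w = weights.reshape(-1, 4)
    keep = w.abs().topk(2, dim=-1).indices                    # two largest magnitudes per group
    mask = torch.zeros_like(w, dtype=torch.bool).scatter_(-1, keep, True)
    return (w * mask).reshape(weights.shape)

w = torch.randn(8, 16)
print(prune_2_to_4(w))                                        # exactly two non-zeros per group of four
```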
Hopper's FP8 used a per-tensor scale — one scale per (potentially gigabyte-sized) weight matrix. Outliers in any block forced the whole tensor's scale up, killing precision in the well-behaved blocks. MX shrinks the scale's scope to 32 elements, so each microblock gets its own dynamic range. That is the whole reason FP4 works at all on real models.
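A rough numerical illustration of why the scope of the scale matters, sketched with a crude symmetric 4-bit grid and power-of-two scales rather than the real tensor-core datapath:

```python
import torch

def quantise(x: torch.Tensor, block: int) -> torch.Tensor:
    """Quantise to a crude signed grid with max magnitude 6 (like E2M1), using one
    power-of-two scale per `block` elements; block = x.numel() is per-tensor scaling."""
    xb = x.reshape(-1, block)
    scale = 2.0 ** torch.ceil(torch.log2(xb.abs().amax(dim=-1, keepdim=True) / 6.0))
    q = torch.clamp(torch.round(xb / scale), -6, 6)           # stand-in for the FP4 value grid
    return (q * scale).reshape(x.shape)

x = torch.randn(4096)
x[0] = 100.0                                                  # one outlier in the whole tensor
for block in (x.numel(), 32):
    err = (quantise(x, block) - x).abs().mean().item()
    print(f"block={block:>5}  mean abs error={err:.4f}")
# Per-tensor scaling lets the single outlier blow up everyone's quantisation step;
# 32-element blocks confine the damage to the outlier's own block.
```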
Blackwell's 5th-gen tensor cores support two FP4 variants. MX-FP4 is the open OCP standard: 32-element blocks with an 8-bit power-of-two (E8M0) per-block scale. NVFP4 is NVIDIA's variant: 16-element blocks with an FP8 (E4M3) per-block scale plus a per-tensor FP32 scale. NVFP4 is the default in TensorRT-LLM and Transformer Engine and is more accurate than MX-FP4 in practice; MX-FP4 wins on portability. Deep dive in deck 38.
The Transformer Engine (TE) is NVIDIA's open-source Python library that wraps cuBLAS / cuDNN with automatic mixed-precision dispatch tuned for transformer workloads. Blackwell ships its second generation.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling, MXFP8BlockScaling

# Hopper-style recipe: one delayed scale per tensor (E4M3 forward, E5M2 backward)
fp8_recipe = DelayedScaling(margin=0, fp8_format=Format.HYBRID)

# Blackwell-style recipe: MXFP8 microscaling, one shared scale per 32-element block
mx_recipe = MXFP8BlockScaling(fp8_format=Format.E4M3)

# `model` is any stack of TE modules (te.Linear, te.TransformerLayer, ...)
with te.fp8_autocast(enabled=True, fp8_recipe=mx_recipe):
    out = model(input_ids)
Integrated with PyTorch and JAX, picked up automatically by Megatron-LM, NeMo, and (downstream) by Hugging Face Transformers when the kernel backends are present.
The first wave of MoE models — Mistral's Mixtral, DeepSeek-V3, and the inference-only checkpoints from several frontier labs — ship MX-FP4 quantised weights targeted at Blackwell. On Hopper they fall back to FP8 emulation through MX-aware unpacking, which loses the throughput advantage but preserves the file format.
At rack scale the law of large numbers turns rare hardware faults into routine events. Blackwell adds a dedicated RAS engine — a small in-package microcontroller that watches the GPU's data paths continuously.
The mean time between failures (MTBF) of any one GPU is excellent, but in a 72-GPU coherent NVLink domain the failure rates add up and the aggregate MTBF shrinks in proportion. If each GPU faulted once a week, an NVL72 rack would face an event roughly every 2–3 hours (168 hours ÷ 72 ≈ 2.3 hours); without graceful migration, every event ends a multi-day training run.
On a 16-rack (1152-GPU) Blackwell training pod, RAS-driven hot migration claws back the bulk of what would otherwise be lost wall-clock. NVIDIA's published target is > 90% effective utilisation on multi-week pre-training runs — up from roughly 75% on H100 clusters of similar scale, where every hardware fault forced a checkpoint-and-restart.
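The utilisation argument as back-of-envelope arithmetic; every number below is an assumption for illustration, not a measurement:

```python
# Rough model: each fault costs the work since the last checkpoint plus a restart,
# unless RAS-driven hot migration absorbs it at a much smaller per-event cost.
fault_interval_h   = 2.5     # assumed mean time between faults across the whole domain
checkpoint_every_h = 1.0     # assumed checkpoint period (lose half of it on average)
restart_cost_h     = 0.5     # assumed reload-and-requeue time per fault
migration_cost_h   = 0.05    # assumed cost of a hot migration

def utilisation(per_fault_loss_h: float) -> float:
    return fault_interval_h / (fault_interval_h + per_fault_loss_h)

print(f"checkpoint-and-restart: {utilisation(checkpoint_every_h / 2 + restart_cost_h):.0%}")
print(f"hot migration:          {utilisation(migration_cost_h):.0%}")
```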
Blackwell ships a fixed-function decompression accelerator alongside the SMs. It speaks LZ4, Snappy, and Deflate (gzip-compatible) at 800 GB/s peak, putting it in the same bandwidth class as the HBM controller.
It is not a general-purpose compute accelerator and not a network engine. It will not help you decompress model weights at load time (those load once; the bandwidth-savings would be one-shot). It is aimed squarely at analytics and search workloads where the same compressed data is read many times.
Some early benchmarks on Parquet TPC-H scans showed 3–5× end-to-end speedups over CPU-decompressed pipelines, not because GPU decompression is faster than CPU decompression in isolation, but because it removes the PCIe DMA of raw uncompressed bytes — the bus, not the algorithm, was the bottleneck.
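The arithmetic behind that claim, with assumed numbers: PCIe Gen5 x16 at roughly 64 GB/s, a placeholder 3x compression ratio and 100 GB scan, and no overlap of transfer with decompression:

```python
scan_gb     = 100      # uncompressed column data touched by the query (assumption)
ratio       = 3.0      # assumed compression ratio for the Parquet column chunks
pcie_gbps   = 64       # PCIe Gen5 x16, approximate
decomp_gbps = 800      # Blackwell decompression engine peak

# CPU decompresses first, so raw bytes cross the bus:
t_raw_over_pcie = scan_gb / pcie_gbps
# Compressed bytes cross the bus, the GPU engine expands them on arrival:
t_compressed    = (scan_gb / ratio) / pcie_gbps + (scan_gb / ratio) / decomp_gbps

print(f"raw bytes over PCIe:      {t_raw_over_pcie:.2f} s")
print(f"compressed + GPU engine:  {t_compressed:.2f} s")    # ~2.8x less bus-bound time here
```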
NVLink 5 doubles the per-GPU bandwidth of NVLink 4 and, in combination with NVSwitch 4, extends the maximum coherent NVLink domain to 72 GPUs.
| Property | NVLink 4 (Hopper) | NVLink 5 (Blackwell) | Multiplier |
|---|---|---|---|
| Per-link bandwidth (bidirectional) | 50 GB/s | 100 GB/s | 2× |
| Links per GPU | 18 | 18 | same |
| Per-GPU aggregate | 0.9 TB/s | 1.8 TB/s | 2× |
| Switch ports per chip | 64 | 144 | 2.25× |
| Switch aggregate (per chip) | 3.2 TB/s | 7.2 TB/s | 2.25× |
| Max coherent domain | 8 (HGX) / 256 (DGX SuperPOD via NVLink Switch System) | 72 (NVL72) over a single NVSwitch 4 fabric | 9× in one rack |
A Llama-3.1 405B model with FP8 weights is ~405 GB. An all-reduce across the parameter set during a training step touches every byte twice (reduce + broadcast).
It is not the FLOPS — HGX B200 has plenty of FLOPS. It is the collective-operation latency. MoE training in particular hits all-to-all on every layer; the difference between 100 µs and 5 ms is the difference between a usable training pipeline and one that spends most of its time waiting on the network.
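A first-order way to put numbers on that gap, using the standard ring all-reduce cost model. Latency terms, overlap, and protocol overheads are ignored, and the 50 GB/s figure is simply 400 Gb/s InfiniBand for comparison; treat it as a sketch, not a benchmark:

```python
def allreduce_seconds(param_bytes: float, n_gpus: int, bus_gb_per_s: float) -> float:
    """Ring all-reduce moves ~2*(N-1)/N of the payload per GPU over its slowest link."""
    return 2 * (n_gpus - 1) / n_gpus * param_bytes / (bus_gb_per_s * 1e9)

params_405b_fp8 = 405e9                                       # ~405 GB of FP8 weights
print(allreduce_seconds(params_405b_fp8, 72, 1800))           # NVLink 5 domain: ~0.44 s
print(allreduce_seconds(params_405b_fp8, 72, 50))             # 400 Gb/s fabric:  ~16 s
```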
The GB200 module is an NVIDIA-designed compute brick: 1 Grace CPU + 2 B200 GPUs, bonded with NVLink-C2C. The NVL72 rack is 36 of those bricks (18 compute trays, 2 superchips per tray) plus a switching plane.
Each Grace's 480 GB of LPDDR5x is mapped into the GPU's address space over the C2C link. CUDA sees it as memory with worse latency than HBM but the same coherency guarantees. For workloads with large but cold parameter tables (embeddings, retrieval indices, MoE routing tables), this is huge: each B200 can address its own 192 GB of HBM plus the module's shared 480 GB of LPDDR5x, for 672 GB per GPU (864 GB per GB200 module of 1 Grace + 2 B200), and 13.8 TB HBM + 17.3 TB LPDDR5x ≈ 31 TB across an NVL72 rack (36 Graces).
NVL72 was designed primarily for MoE / 1T-parameter training. The combination of fast all-reduce, large coherent memory, and Grace's role hosting optimiser state is unique to this generation. The first deployments trained models with effective parameter counts past 1.8T at densities that would have been impossible on HGX H100 racks.
Hopper introduced GPU-side confidential computing: encrypted DMA, attested boot, an isolated GPU runtime. Blackwell extends that across the PCIe interface with TEE-IO — trusted execution that spans CPU and GPU as one boundary.
The CPU's TEE (Intel TDX or AMD SEV-SNP) and the Blackwell GPU's confidential mode form a single attested enclave. The hypervisor and host OS sit outside the boundary — they can schedule, but they cannot read the encrypted memory or the device's state.
DMA buffers between CPU TEE and GPU TEE are encrypted with a key established at attestation time; the PCIe / NVLink fabric carries ciphertext only.
Defends against a compromised hypervisor, a malicious cloud operator, and a snooping co-tenant. Does not defend against a physical attacker with arbitrary access to the package — HBM scraping is out of scope, as it always has been.
Critical for cloud LLM inference on sensitive data: medical, financial, defense. The customer's prompt and the model's KV-cache exist only in GPU TEE memory; even the cloud operator can't read them.
Most enterprise AI conversations now have a "could we host this on third-party infrastructure?" question, and the honest answer used to be "not really, because the hypervisor sees everything." TEE-IO turns that into "yes, with a defensible attestation chain." It's the feature that makes regulated industries actually buy GPU time.
Pick a model, a precision, a batch size, and a domain. The planner sizes weights against the domain's HBM, computes aggregate compute, estimates collective time and a rough prefill TPS. Numbers are first-order; treat the verdict as "is this remotely feasible," not as a benchmark.
The "fits" badge is the first question — if it's red, no amount of compute helps. The all-reduce figure is the worst-case full parameter sync; for sharded gradient updates it's typically 10–50× smaller. Prefill TPS is a roof, not a benchmark; real numbers depend heavily on attention kernels, KV layout, and (especially for MoE) routing efficiency. Use this to rule out infeasible plans, not to commit a procurement.