From Volta's first 4×4×4 FP16 MMA to Blackwell's MX-FP4 microscaling — every generation has redefined the cost of training and serving large models.
One moving part — the tensor core — explains roughly half the difference between an A100 and a B200 in 2026. Here is what each generation added, what numbers it learned to multiply, why FP8 and FP4 changed the economics, and how to choose a precision in practice.
Dense matrix multiplication is roughly 95% of the compute in transformer training and inference. Attention is matmul. Feed-forward is matmul. The output projection is matmul. If you can make matmul cheaper, you make LLMs cheaper.
A CUDA core does one fused multiply-add per cycle: d = a*b + c. Excellent for general code; terrible density for matmul. A V100 SM's 64 FP32 lanes doing only FMAs deliver 128 FLOPs/cycle.
A tensor core does one fused 4×4×4 matrix-multiply-accumulate per cycle: D = A·B + C. That's 64 multiplies + 64 adds = 128 FLOPs per tensor core per cycle, in a single instruction. Eight tensor cores per SM give 1024 FLOPs/cycle for tensor work — 8× the FP32 path on the same silicon.
If tensor cores were only a one-shot win, we'd have moved on. The story since 2017 is that each generation learns a new number format that further halves training or inference cost: FP16 → INT8 → BF16/TF32 → FP8 → FP4. Each step doubled per-clock throughput on the same area, and each step required a new compiler stack and a new training recipe.
V100 (2017): 125 TFLOPS FP16. B200 (2024): 9 PFLOPS MX-FP4 (dense). That's roughly 70× more dense matmul on a single die in seven years — almost entirely from tensor-core extensions, not from Moore's law.
The first tensor core. Eight per SM, 80 SMs per V100 = 640 tensor cores. They did one thing well: a fused 4×4×4 MMA with FP16 inputs and FP32 accumulate.
D = A · B + C where A is 4×4 FP16, B is 4×4 FP16, C/D are 4×4 FP32. 16 FMAs per cycle per tensor core — on the FP16 inputs, accumulated into FP32 to preserve dynamic range.
V100 SXM2: 125 TFLOPS FP16 via tensor cores vs 15.7 TFLOPS FP32 via CUDA cores — an 8× jump on the same die. Mixed-precision training (FP16 fwd/bwd, FP32 master weights, loss scaling) became the default overnight.
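To make that recipe concrete, here is a minimal single-weight sketch — the names are ours, and real trainers do this per-tensor with dynamic loss scaling and a proper optimizer rather than plain SGD:
#include <cuda_fp16.h>

// Sketch of one mixed-precision update for a single weight (hypothetical names).
void mixed_precision_step(float& master_w, __half grad_fp16,
                          float loss_scale, float lr) {
    // backward ran in FP16 on loss * loss_scale, so tiny grads survived
    // FP16 underflow; unscale in FP32 where underflow isn't a concern
    float grad = __half2float(grad_fp16) / loss_scale;
    master_w -= lr * grad;                    // update the FP32 master weight
    __half next_w = __float2half(master_w);  // re-cast for the next FP16 forward
    (void)next_w;
}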
The WMMA API (nvcuda::wmma) exposes this through warp-cooperative fragments for A, B, and C: each warp loads 16×16 fragments, calls mma_sync, and stores back. The underlying PTX is wmma.mma.sync.aligned.col.row.f32.f16.f16.f32 — fixed shape, fixed alignment, no flexibility yet.
#include <mma.h>
using namespace nvcuda::wmma;
__global__ void mm_v100(const half* A, const half* B, float* C) {
// 16x16x16 tile, FP16 in / FP32 out
fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
fragment<accumulator, 16, 16, 16, float> c_frag;
fill_fragment(c_frag, 0.0f);
load_matrix_sync(a_frag, A, 16);
load_matrix_sync(b_frag, B, 16);
mma_sync(c_frag, a_frag, b_frag, c_frag); // the new instruction
store_matrix_sync(C, c_frag, 16, mem_row_major);
}
No INT8 tensor path, no BF16, no sparsity, no async copies. Inputs strictly FP16. If you needed FP32 throughout, the tensor cores sat idle. Mixed precision was effectively mandatory to get the speedup.
Same MMA shape as Volta, but the input formats fanned out. Turing added INT8 and INT4 tensor paths, with throughput scaling inverse to bit-width.
| Format | Throughput vs FP16 | Use case |
|---|---|---|
| FP16 | 1× (baseline) | Training / mixed-precision inference |
| INT8 | 2× | Quantised inference (CV first, NLP later) |
| INT4 | 4× | Aggressive quant; rare outside research |
| INT1 / binary | 8× | Almost never used in production |
The T4 became the workhorse cloud-inference card of the late 2010s: 130 TOPS INT8 in a 75 W single-slot PCIe envelope, no extra power connector, ~$2k. AWS, GCP, and Azure all standardised on it for serving ResNet, BERT, and early translation models.
By 2018 INT8 quantisation for CV models was well understood: per-channel scales, calibration sets, simple symmetric quant. NLP took longer because attention softmax is range-sensitive — QAT and BF16 came later.
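That CV-era recipe is tiny in code. A sketch of symmetric per-channel quantisation — the helper name is ours, and the "calibration" here is just the channel max:
#include <cmath>
#include <cstdint>

// One scale per output channel, symmetric (no zero-point).
void quantize_channel_int8(const float* w, int n, int8_t* q, float& scale) {
    float amax = 0.f;
    for (int i = 0; i < n; ++i) amax = fmaxf(amax, fabsf(w[i]));
    scale = (amax > 0.f) ? amax / 127.f : 1.f;
    for (int i = 0; i < n; ++i)
        q[i] = (int8_t)lrintf(w[i] / scale);  // round-to-nearest into [-127, 127]
}
// Dequantise on the way out: w[i] ≈ q[i] * scale.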
Turing also introduced RT cores for ray-tracing acceleration. They share the SM but are unrelated to tensor cores — different units, different instructions, different workloads. Don't conflate them.
Turing tensor cores are CC 7.5. They support mma.sync with INT8/INT4 input and INT32 accumulate. Same WMMA fragment API as Volta with new signed char / unsigned char fragment types.
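A minimal INT8 variant of the Volta kernel above — a sketch assuming one 16×16×16 tile, same as before:
#include <mma.h>
using namespace nvcuda::wmma;

// INT8 path (CC 7.2+): same fragment API, signed char in, INT32 accumulate.
// Dequantise afterwards by multiplying the INT32 result by scale_A * scale_B.
__global__ void mm_t4_int8(const signed char* A, const signed char* B, int* C) {
    fragment<matrix_a, 16, 16, 16, signed char, row_major> a_frag;
    fragment<matrix_b, 16, 16, 16, signed char, col_major> b_frag;
    fragment<accumulator, 16, 16, 16, int> c_frag;
    fill_fragment(c_frag, 0);
    load_matrix_sync(a_frag, A, 16);
    load_matrix_sync(b_frag, B, 16);
    mma_sync(c_frag, a_frag, b_frag, c_frag);  // INT8×INT8 → INT32
    store_matrix_sync(C, c_frag, 16, mem_row_major);
}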
Ampere did three big things at once: it grew the native MMA shape, it added TF32 and BF16, and it introduced 2:4 structured sparsity in hardware.
Bigger native shapes mean fewer instructions per macro-tile, less register pressure for the same throughput, and better op/byte ratios at the warp level.
1 sign + 8 exp + 10 mantissa = 19 bits stored in a 32-bit slot. Same exponent range as FP32, but FP16-class mantissa. Designed as a drop-in replacement for FP32 training: no code change, no loss-scaling, just turn it on. ~8× speedup over true FP32.
1 sign + 8 exp + 7 mantissa = 16 bits. Same range as FP32 (so no loss scaling needed), half the bytes. Became the default training format from 2021 onward — "BF16 with FP32 master weights" is still the dominant recipe in 2026.
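Both formats are opt-in at the library level; the TF32 switch in cuBLAS really is one line. A sketch, assuming the handle, device buffers dA/dB/dC, and sizes M/N/K already exist:
#include <cublas_v2.h>

void sgemm_tf32(cublasHandle_t handle, int M, int N, int K,
                const float* dA, const float* dB, float* dC) {
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);  // the opt-in
    const float alpha = 1.0f, beta = 0.0f;
    // Same cublasSgemm call as plain FP32 — inputs/outputs stay float,
    // but the tensor cores now run the math in TF32.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                &alpha, dA, M, dB, K, &beta, dC, M);
}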
Every 4-element block of weights may keep at most 2 non-zeros (so at least 2 zeros), with the positions of the kept values encoded in per-block metadata (2 bits per kept value). The tensor core skips the zeros: 2× throughput on the same physical units. The sparsity has to be baked in at training time (sparse fine-tune); the hardware can't extract it for you.
| Format | Dense TFLOPS | Sparse TFLOPS |
|---|---|---|
| FP16 / BF16 | 312 | 624 |
| TF32 | 156 | 312 |
| FP32 (CUDA cores) | 19.5 | — |
| INT8 | 624 TOPS | 1248 TOPS |
The A100 is six years old and still outsells any non-Hopper card for fine-tuning and inference. Reason: BF16 + 2:4 sparsity + 80 GB HBM2e is enough for almost all 7B–70B work, and used A100s are now affordable.
Hopper is the FP8 generation. Two new formats, a software stack that uses them automatically, and a new instruction (wgmma) that operates on much larger tiles.
1 sign + 4 exp + 3 mantissa. Range ~±448, with 3 mantissa bits of precision. Used for forward activations and weight tensors where range is bounded.
1 sign + 5 exp + 2 mantissa. Range ~±57344 — much wider, but only 2 mantissa bits. Used for backward gradients, which span many orders of magnitude.
Both are 8 bits, so throughput is 2× BF16. The E4M3/E5M2 split mirrors how the network actually uses tensors: forward activations are bounded, gradients are heavy-tailed.
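Both types ship as host/device types in CUDA's cuda_fp8.h (CUDA 11.8+), which makes the tradeoff easy to poke at directly — a tiny sketch, with arbitrary values:
#include <cstdio>
#include <cuda_fp8.h>   // CUDA 11.8+

int main() {
    __nv_fp8_e4m3 act(3.14159f);   // forward-activation-style value
    __nv_fp8_e5m2 grad(1.5e-4f);   // gradient-style tiny value
    printf("%f\n", float(act));    // ≈ 3.25: 3 mantissa bits round coarsely
    printf("%g\n", float(grad));   // ≈ 1.526e-4: range kept, precision lost
    // the same 1.5e-4 in E4M3 would flush to zero (min subnormal ≈ 2e-3)
    return 0;
}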
The Transformer Engine is the software + hardware system that auto-quantises a transformer layer to FP8: per-tensor scaling factors are tracked across iterations (delayed scaling) so that activation ranges stay representable without per-step calibration. It wraps PyTorch nn.Linear as te.Linear; you keep your model code, you get FP8 forward and backward.
wgmma is a new asynchronous instruction that operates on much larger shapes — 64×N×16 (64×N×32 for FP8), with N up to 256 — issued by a warp group (128 threads / 4 warps) instead of a single warp. Combined with the Tensor Memory Accelerator (TMA) for the data movement, this is what unlocks Hopper's headline numbers.
| Format | Dense TFLOPS | Sparse TFLOPS |
|---|---|---|
| FP16 / BF16 | 989 | 1979 |
| TF32 | 495 | 989 |
| FP8 (E4M3 / E5M2) | 1979 | 3958 |
| INT8 | 1979 TOPS | 3958 TOPS |
The shift from "BF16 is the default" to "FP8 is the default for prefill and serving" is the single biggest cost reduction in LLM ops since 2021. A 70B served at FP8 on an H100 fits in 80 GB, decodes ~30% faster than BF16 (bandwidth win), and prefills ~2× faster (compute win).
Ada Lovelace carries the same tensor-core generation label as Hopper (4th-gen) but sits in the Ampere compute-capability epoch (CC 8.9 vs Hopper's 9.0), with very different feature gating between the consumer and datacenter parts.
The L40S has FP8 tensor cores — the same E4M3/E5M2 path as Hopper — at ~733 TFLOPS FP8 dense. It is used heavily for cost-sensitive inference where HBM is overkill.
Consumer Ada (the RTX 4090) has FP8 tensor cores too (~660 TFLOPS dense FP8, 2× BF16) — AD102 silicon is the same family as the L40S. What consumer Ada lacks is the Transformer Engine software stack and full datacenter driver/library support; you can still hit FP8 via CUTLASS / Triton / TensorRT-LLM kernels directly.
Neither Ada part has wgmma — tensor-core work goes through Ampere's per-warp mma.sync instruction, and larger tiles are software-tiled. "4th-gen tensor cores" appears in both Ada and Hopper marketing, but a 4090 and an H100 do not run the same kernels at the same speed for the same instructions. CUTLASS, vLLM, and TensorRT-LLM all carry separate kernel paths for AD102 vs GH100. Ada is best thought of as "Hopper minus async (wgmma/TMA), minus the FP8 software stack".
Blackwell pushes per-element bit-width down again, this time below 8 bits, by introducing microscaling (MX) formats. Each block of 32 elements shares an 8-bit power-of-two scale; the elements themselves are 4 or 6 bits.
| Format | Element bits | Block size | Block scale | Effective bits/elem |
|---|---|---|---|---|
| MX-FP4 (E2M1) | 4 (1+2+1) | 32 | E8M0 (8-bit pow-2) | 4.25 |
| MX-FP6 E2M3 | 6 (1+2+3) | 32 | E8M0 | 6.25 |
| MX-FP6 E3M2 | 6 (1+3+2) | 32 | E8M0 | 6.25 |
| MX-FP8 | 8 (E4M3 or E5M2) | 32 | E8M0 | 8.25 |
The shared block scale lets each 32-element block centre its dynamic range independently of every other block — so a 4-bit element can still represent a tensor whose values span many orders of magnitude across a layer, as long as within any 32-element window the spread is bounded.
Blackwell's 5th-gen tensor cores accelerate both FP4 flavours. MX-FP4 is the open OCP Microscaling format above (32-element block, 8-bit E8M0 power-of-two scale). NVFP4 is NVIDIA's variant: 16-element blocks, E4M3 block scale (with a per-tensor FP32 scale on top), giving finer-grained, higher-precision blocks at the cost of slightly more scale overhead. NVFP4 is what TensorRT-LLM and the Transformer Engine default to in 2026; MX-FP4 is what the rest of the ecosystem (vLLM, llama.cpp, OCP-aligned stacks) tends to target. Same hardware throughput either way.
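To make the block math concrete, here is a sketch of MX-FP4 quantisation for one 32-element block — our own helper, not a library API. The E2M1 grid {0, 0.5, 1, 1.5, 2, 3, 4, 6} and the power-of-two scale rule follow the OCP description above:
#include <cmath>

const float kFp4Grid[] = {0.f, 0.5f, 1.f, 1.5f, 2.f, 3.f, 4.f, 6.f};

// Quantise n (=32) floats to E2M1 with a shared E8M0 scale; writes the
// dequantised values to deq and returns the scale.
float quantize_block_mxfp4(const float* x, float* deq, int n) {
    float amax = 0.f;
    for (int i = 0; i < n; ++i) amax = fmaxf(amax, fabsf(x[i]));
    if (amax == 0.f) { for (int i = 0; i < n; ++i) deq[i] = 0.f; return 1.f; }
    // E8M0 scale: the power of two mapping amax near E2M1's max (6 = 1.5*2^2)
    int e = (int)floorf(log2f(amax)) - 2;
    float scale = ldexpf(1.f, e);            // 2^e, stored as one 8-bit exponent
    for (int i = 0; i < n; ++i) {
        float v = fabsf(x[i]) / scale;       // bring the element into E2M1 range
        int best = 0;                        // nearest point on the E2M1 grid
        for (int g = 1; g < 8; ++g)
            if (fabsf(v - kFp4Grid[g]) < fabsf(v - kFp4Grid[best])) best = g;
        deq[i] = copysignf(kFp4Grid[best] * scale, x[i]);  // dequantised view
    }
    return scale;
}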
Transformer Engine 2.0 handles the MX / NVFP4 block scaling automatically: at quantisation time it picks per-block scales, packs the 16- or 32-element blocks, and emits the right tensor-core instruction. The user keeps writing PyTorch.
| Format | Dense TFLOPS | Sparse TFLOPS |
|---|---|---|
| FP16 / BF16 | 2250 | 4500 |
| FP8 (E4M3 / E5M2) | 4500 | 9000 |
| MX-FP6 | 4500 | 9000 |
| MX-FP4 | 9000 | 18000 |
Volta FP16 (125) → A100 BF16 (312) → H100 FP8 (1979) → B200 MX-FP4 (9000). Each step is roughly 2.5–6×, and roughly half of each step came from a new tensor-core format rather than from more transistors.
All eleven formats you'll meet on a modern NVIDIA GPU. The dynamic range column is approximate — for an IEEE-style float with $e$ exponent bits, $m$ mantissa bits, and bias $2^{e-1}-1$, the exact normal range runs from $2^{1-\text{bias}}$ up to $(2 - 2^{-m}) \cdot 2^{2^e - 2 - \text{bias}}$ (E4M3 bends the rule by spending the top exponent on finite values, hence ±448); for MX formats the per-block E8M0 scale extends each block's effective range further.
| Format | Total bits | S+E+M | Approx dynamic range | Use case |
|---|---|---|---|---|
| FP32 | 32 | 1+8+23 | ~10^−38 … ~10^38 | Master weights, optimiser state, accuracy-critical math |
| TF32 | 32 stored / 19 used | 1+8+10 | same as FP32 | Drop-in FP32 training on Ampere+; ~8× speedup |
| BF16 | 16 | 1+8+7 | same as FP32 | Default training/serving format since 2021 |
| FP16 | 16 | 1+5+10 | ~6×10^−5 … 65504 | Mixed-precision training (with loss scaling), legacy serving |
| FP8 E4M3 | 8 | 1+4+3 | ~±448 | Forward activations & weights on Hopper+ |
| FP8 E5M2 | 8 | 1+5+2 | ~±57344 | Backward gradients on Hopper+ |
| MX-FP6 E2M3 | 6 + 8/32 scale | 1+2+3 (block) | per-block, wide | Blackwell precision-vs-quality middle ground |
| MX-FP6 E3M2 | 6 + 8/32 scale | 1+3+2 (block) | per-block, wider | As above with wider exp; better for grad-like tensors |
| MX-FP4 E2M1 | 4 + 8/32 scale | 1+2+1 (block) | per-block, narrow | Aggressive Blackwell inference; 2× FP8 throughput |
| INT8 | 8 | signed integer | -128 … 127 | CV/NLP quantised serving since Turing |
| INT4 | 4 | signed integer | -8 … 7 | AWQ/GPTQ weight-only quant on Ampere/Ada |
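As a sanity check, plug FP16 into the formula above ($e = 5$, $m = 10$, bias $= 15$):

$(2 - 2^{-10}) \cdot 2^{2^5 - 2 - 15} = 1.9990 \cdot 2^{15} = 65504$

which matches the FP16 row.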
Each generation grew the native hardware shape, the PTX instruction family, and (slowly) the high-level programming model. Bigger shapes = fewer instructions per macro-tile = better op/byte and lower instruction-issue overhead.
| Generation | Hardware shape (M×N×K) | PTX instruction | Programming API |
|---|---|---|---|
| Volta (V100, CC 7.0) | 16×16×16 (built from 4×4×4 unit) | wmma.mma.sync | WMMA C++ (nvcuda::wmma) |
| Turing (T4, CC 7.5) | 16×16×16 + INT8/INT4 paths | mma.sync | WMMA + cuBLASLt |
| Ampere (A100, CC 8.0) | 16×8×16 (FP16/BF16), 16×8×8 (TF32), 16×8×32 (INT8) | mma.sync (new shapes) | WMMA / CUTLASS 2.x / cuBLASLt |
| Hopper (H100, CC 9.0) | 64×N×16 (BF16), 64×N×32 (FP8), N up to 256; warp-group async | wgmma.mma_async | CUTLASS 3.x / cuBLASLt / Triton / Transformer Engine |
| Ada (AD102, CC 8.9) | same shapes as Ampere; FP8 on L40S and consumer parts | mma.sync (no wgmma) | CUTLASS 2.x / cuBLASLt |
| Blackwell (B200, CC 10.0) | 64×N×32 for MX formats; warp-group async via UMMA | tcgen05.mma (UMMA, wgmma's successor) + MX variants | CUTLASS 3.x / Transformer Engine 2.0 / Triton |
Volta: one 16×16×16 MMA issued by one warp (32 threads). Each thread holds tiny fragments. Lots of instructions to fill a 128×128 macro-tile.
Hopper: one wgmma 64×256×16 issued by 4 warps (128 threads), asynchronously; the warp group can issue more wgmmas while the previous one is in flight, with TMA bringing the next operand tile in parallel. The whole pipeline overlaps. In CUTLASS 3.x that pipeline is expressed declaratively — a sketch (abbreviated):
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<
cutlass::gemm::kernel::GemmUniversal<
cutlass::gemm::collective::CollectiveBuilder<
cutlass::arch::Sm90, // Hopper
cutlass::arch::OpClassTensorOp, // tensor cores
cutlass::float_e4m3_t, RowMajor, 16, // A: FP8 E4M3
cutlass::float_e4m3_t, ColumnMajor, 16, // B: FP8 E4M3
float, // accumulate FP32
cute::Shape<cute::_128, cute::_256, cute::_64>,
cute::Shape<cute::_2, cute::_1, cute::_1>,
cutlass::gemm::collective::StageCountAuto,
cutlass::gemm::KernelTmaWarpSpecializedPingpong // uses TMA + wgmma
>::CollectiveOp,
...
>
>;
Introduced on Ampere (CC 8.0), still present on every later generation. Every block of 4 weights along the K dimension may have at most 2 non-zeros; a 2-bit mask tells the tensor core which two to read.
Imagine a 4-element weight vector: [w0, w1, w2, w3]. Dense matmul reads all four, multiplies by all four activations, accumulates four products.
[w0 w1 w2 w3] // dense
× × × ×
[a0 a1 a2 a3]
↓ ↓ ↓ ↓
sum = w0a0 + w1a1 + w2a2 + w3a3
Now suppose w1 and w2 are zero. The compressed weight stores just [w0, w3] plus a 4-bit mask 0b1001 selecting positions 0 and 3. The hardware reads only a0 and a3 — 2 multiplies instead of 4.
[w0  _   _  w3]   mask = 0b1001
 ×           ×
[a0  a1  a2  a3]  // HW skips a1, a2
 ↓           ↓
sum = w0a0 + w3a3 // 2× speed
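In code, the compression step looks like this — a sketch with our own helper name; the real packing and metadata layout come from NVIDIA's ASP / cuSPARSELt tooling:
#include <cmath>
#include <cstdint>

// Keep the 2 largest-magnitude weights of a 4-block and record their
// positions as a 4-bit mask (one bit per original position).
void compress_2of4(const float w[4], float vals[2], uint8_t& mask) {
    int i0 = 0, i1 = 1;                            // indices of the two keepers
    if (fabsf(w[i1]) > fabsf(w[i0])) { i0 = 1; i1 = 0; }
    for (int i = 2; i < 4; ++i) {
        if (fabsf(w[i]) > fabsf(w[i0]))      { i1 = i0; i0 = i; }
        else if (fabsf(w[i]) > fabsf(w[i1])) { i1 = i; }
    }
    if (i0 > i1) { int t = i0; i0 = i1; i1 = t; }  // keep positions ordered
    vals[0] = w[i0]; vals[1] = w[i1];
    mask = (uint8_t)((1u << i0) | (1u << i1));     // e.g. 0b1001 keeps 0 and 3
}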
The hardware cannot extract this pattern from a dense weight matrix. The 2:4 mask must be present at training time — either learned via sparse fine-tuning (NVIDIA's ASP toolkit) or imposed by magnitude pruning + retrain. Otherwise the tensor core falls back to dense.
Real-world wins are around 1.5–1.8× on inference for sparse-trained models — approaching the 2× theoretical only when memory bandwidth is the actual bottleneck. But almost no public LLM ships sparse weights, because (a) the fine-tune cost is real and (b) FP8/MX-FP4 quant gives a similar size win without the retrain. So 2:4 lives mostly in NVIDIA-internal benchmarks and a handful of CV models.
For any GPU-and-format pair, support comes in three levels. "Native" means the kernel calls a tensor-core MMA in this format directly — full speedup. "Emulated" (e.g. FP8 on Ampere) means the data is stored in 8 bits but the math runs at BF16: you save memory and bandwidth but not compute. "Unsupported" means software has no kernel for this format on this architecture and will refuse or fall back. Sparse throughput figures assume a sparse-fine-tuned checkpoint — not just a dense one with 50% zeros.
One cheat-sheet table to close. The rule of thumb across all five generations: store smaller, accumulate wider. Each new format halves storage and doubles tensor-core throughput; the FP32 accumulator stays the same so numerical stability does not degrade.
| Workload | Best precision | Why | Min generation |
|---|---|---|---|
| Pretraining (dense, large) | BF16 master + BF16 tensor cores; FP8 forward on Hopper+ | FP32-range exponent prevents overflow; 2× over FP16. FP8 cuts further on Hopper+. | Ampere |
| Fine-tune (LoRA / SFT) | BF16 base + LoRA adapters; FP8 with TE on Hopper | LoRA lives in a few MB; base model can be quantised; TE handles per-tensor scales. | Ampere |
| Inference — prefill (compute-bound) | FP8 on Hopper+; BF16 on Ampere/Ada-consumer | Prefill is matmul-bound so the 2× FP8 compute uplift is fully realised on Hopper. | Ada (L40S) / Hopper |
| Inference — decode (bandwidth-bound) | FP8 weights + FP8 KV on H100; AWQ-INT4 on Ampere/Ada; MX-FP4 / NVFP4 on Blackwell | Decode is bandwidth-bound; smaller weights = more tok/s. Quality holds at INT4 / FP4 for most models. | Ampere (INT4); Hopper (FP8); Blackwell (FP4) |
| Embedding / encoder | FP16 weights + q8_0 KV (rarely tensor-core-bound) | Short sequences, small batches — you're memory-bound, not matmul-bound. | Turing+ |
If you have Hopper, default to BF16 for training and FP8 for serving; turn on Transformer Engine and FP8 KV-cache. If you have Ampere, default to BF16 for training and AWQ-INT4 for serving; FP8 is software-emulated only. Consumer Ada has FP8 silicon (~660 TFLOPS on a 4090) but no Transformer Engine — reach FP8 via CUTLASS / Triton / TensorRT-LLM. If you have Blackwell, BF16 still wins for pretraining but MX-FP4 / NVFP4 is the new floor for inference. If you have Volta or Turing, you are doing FP16 / INT8 only — treat anything newer in the marketing as unsupported.