NVIDIA GPU Architectures Series — Presentation 03

Tensor Cores — Five Generations of NVIDIA Matmul Hardware

From Volta's first 4×4×4 FP16 MMA to Blackwell's MX-FP4 microscaling — every generation has redefined the cost of training and serving large models. Here's what each generation actually does, what numbers it can multiply, and why FP8 and FP4 changed everything.

Tensor Cores · FP16 · BF16 · TF32 · INT8 · FP8 · FP4 · Sparsity · Microscaling
Volta Turing Ampere Hopper Ada Blackwell
00

Topics We'll Cover

One moving part — the tensor core — explains roughly half the difference between an A100 and a B200 in 2026. Here is what every generation added, what numbers it learned to multiply, and how to choose a precision in practice.

01

Why Tensor Cores Exist

Dense matrix multiplication is roughly 95% of the compute in transformer training and inference. Attention is matmul. Feed-forward is matmul. The output projection is matmul. If you can make matmul cheaper, you make LLMs cheaper.

Scalar FP32 ALU (CUDA core)

One fused multiply-add per cycle: d = a*b + c. Excellent for general code; terrible density for matmul. Volta's 64 FP32 lanes per SM, doing nothing but FMAs, deliver 128 FLOPs/cycle.

Tensor core (V100 onwards)

One fused 4×4×4 matrix-multiply-accumulate per cycle: D = A·B + C. That's 64 multiplies + 64 adds = 128 FLOPs per tensor core per cycle, in a single instruction. Eight tensor cores per SM gives 1024 FLOPs/cycle for tensor work, roughly 8× the FP32 path on the same silicon area.

The pattern: every generation extends the math

If tensor cores were only a one-shot win, we'd have moved on. The story since 2017 is that each generation learns a new number format that further halves training or inference cost: FP16 → INT8 → BF16/TF32 → FP8 → FP4. Each step doubled per-clock throughput on the same area, and each step required a new compiler stack and a new training recipe.

The headline number

V100 (2017): 125 TFLOPS FP16. B200 (2024): 9 PFLOPS MX-FP4 (dense). That's roughly 70× more dense matmul on a single die in seven years — almost entirely from tensor-core extensions, not from Moore's law.

02

Generation 1 — Volta (2017, V100)

The first tensor core. Eight per SM, 80 SMs per V100 = 640 tensor cores. They did one thing well: a fused 4×4×4 MMA with FP16 inputs and FP32 accumulate.

The hardware shape

D = A · B + C where A is 4×4 FP16, B is 4×4 FP16, C/D are 4×4 FP32. That's 64 FMAs per cycle per tensor core on the FP16 inputs, accumulated into FP32 to preserve dynamic range.

Headline throughput

V100 SXM2: 125 TFLOPS FP16 via tensor cores vs 15.7 TFLOPS FP32 via CUDA cores, an 8× jump on the same die. Mixed-precision training (FP16 fwd/bwd, FP32 master weights, loss scaling) became the default overnight.
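
That headline number follows directly from the per-cycle counts above. A back-of-envelope sketch, assuming the SXM2 boost clock of roughly 1.53 GHz:

CUDA/C++: V100 peak FP16 tensor throughput (sketch)
// 80 SMs x 8 tensor cores x 64 FMAs x 2 FLOPs per FMA, at ~1.53 GHz boost clock.
constexpr double sms = 80, tc_per_sm = 8, fma_per_tc = 64, flops_per_fma = 2;
constexpr double clock_ghz = 1.53;   // SXM2 boost clock (approx.)
constexpr double peak_tflops =
    sms * tc_per_sm * fma_per_tc * flops_per_fma * clock_ghz / 1e3;
// peak_tflops ≈ 125.4, matching the quoted 125 TFLOPS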

Programming model

CUDA — first tensor-core matmul (Volta)
#include <cuda_fp16.h>   // half
#include <mma.h>
using namespace nvcuda::wmma;

__global__ void mm_v100(const half* A, const half* B, float* C) {
    // 16x16x16 tile, FP16 in / FP32 out
    fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
    fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
    fragment<accumulator, 16, 16, 16, float>       c_frag;

    fill_fragment(c_frag, 0.0f);
    load_matrix_sync(a_frag, A, 16);
    load_matrix_sync(b_frag, B, 16);
    mma_sync(c_frag, a_frag, b_frag, c_frag);  // the new instruction
    store_matrix_sync(C, c_frag, 16, mem_row_major);
}
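
A hedged usage sketch for the kernel above: one warp owns one 16×16 output tile, so the minimal launch is a single warp. The allocation and fill code here is illustrative.

CUDA: launching the WMMA kernel (sketch)
half  *dA, *dB;
float *dC;
cudaMalloc(&dA, 16 * 16 * sizeof(half));
cudaMalloc(&dB, 16 * 16 * sizeof(half));
cudaMalloc(&dC, 16 * 16 * sizeof(float));
// ... copy FP16 input tiles into dA and dB ...
mm_v100<<<1, 32>>>(dA, dB, dC);   // 32 threads = one warp, cooperating on one MMA tile
cudaDeviceSynchronize();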
What Volta could not do

No INT8 tensor path, no BF16, no sparsity, no async copies. Inputs strictly FP16. If you needed FP32 throughout, the tensor cores sat idle. Mixed precision was effectively mandatory to get the speedup.

03

Generation 2 — Turing (2018, T4 / RTX 20)

Same MMA shape as Volta, but the input formats fanned out. Turing added INT8 and INT4 tensor paths, with throughput scaling inversely with bit-width.

Format | Throughput vs FP16 | Use case
FP16 | 1× (baseline) | Training / mixed-precision inference
INT8 | 2× | Quantised inference (CV first, NLP later)
INT4 | 4× | Aggressive quant; rare outside research
INT1 / binary | experimental | Almost never used in production

The T4 inference workhorse

The T4 became the workhorse cloud-inference card of the late 2010s: 130 TOPS INT8 at a 70 W TDP in a single-slot PCIe card with no extra power connector, for roughly $2k. AWS, GCP, and Azure all standardised on it for serving ResNet, BERT, and early translation models.

Why integer quant first

By 2018 INT8 quantisation for CV models was well understood: per-channel scales, calibration sets, simple symmetric quant. NLP took longer because attention softmax is range-sensitive — QAT and BF16 came later.

RT cores: separate hardware

Turing also introduced RT cores for ray-tracing acceleration. They share the SM but are unrelated to tensor cores — different units, different instructions, different workloads. Don't conflate them.

Compute capability

Turing tensor cores are CC 7.5. They support mma.sync with INT8/INT4 input and INT32 accumulate. Same WMMA fragment API as Volta with new signed char / unsigned char fragment types.
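
The INT8 path looks almost identical to the Volta FP16 kernel. A minimal sketch, with the kernel name and single-tile setup being illustrative:

CUDA: Turing INT8 WMMA (sketch, CC 7.5+)
#include <mma.h>
using namespace nvcuda::wmma;

__global__ void mm_int8_t4(const signed char* A, const signed char* B, int* C) {
    // Same fragment API as Volta, but 8-bit integer inputs and INT32 accumulate.
    fragment<matrix_a, 16, 16, 16, signed char, row_major> a_frag;
    fragment<matrix_b, 16, 16, 16, signed char, col_major> b_frag;
    fragment<accumulator, 16, 16, 16, int>                 c_frag;

    fill_fragment(c_frag, 0);
    load_matrix_sync(a_frag, A, 16);
    load_matrix_sync(b_frag, B, 16);
    mma_sync(c_frag, a_frag, b_frag, c_frag);   // INT8 x INT8 -> INT32
    store_matrix_sync(C, c_frag, 16, mem_row_major);
}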

04

Generation 3 — Ampere (2020, A100, RTX 30)

Ampere did three big things at once: it grew the native MMA shape, it added TF32 and BF16, and it introduced 2:4 structured sparsity in hardware.

Larger native shapes

Bigger native shapes mean fewer instructions per macro-tile, less register pressure for the same throughput, and better op/byte ratios at the warp level.

The two new formats that mattered

TF32 — the lazy-FP32

1 sign + 8 exp + 10 mantissa = 19 bits stored in a 32-bit slot. Same exponent range as FP32, but FP16-class mantissa. Designed as a drop-in replacement for FP32 training: no code change, no loss-scaling, just turn it on. ~8× speedup over true FP32.

BF16 — FP32 range, FP16 cost

1 sign + 8 exp + 7 mantissa = 16 bits. Same range as FP32 (so no loss scaling needed), half the bytes. Became the default training format from 2021 onward; "BF16 with FP32 master weights" is the dominant 2026 recipe.
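
The "just turn it on" claim for TF32 is literal. A minimal sketch of opting in through cuBLAS; the wrapper function is illustrative, inputs and outputs stay FP32 and only the MMA operands are rounded to TF32:

CUDA/C++: enabling TF32 for an FP32 GEMM via cuBLAS (sketch)
#include <cublas_v2.h>

void sgemm_tf32(cublasHandle_t handle, int m, int n, int k,
                const float* A, const float* B, float* C) {
    const float alpha = 1.0f, beta = 0.0f;
    // Opt in to TF32 tensor cores for FP32 GEMMs (Ampere or newer).
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);
    // Column-major SGEMM: C (m x n) = A (m x k) * B (k x n)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, A, m, B, k, &beta, C, m);
}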

2:4 structured sparsity

Every 4-element block of weights keeps at most 2 non-zeros, with 2 bits of index metadata per kept value recording its original position. The tensor core skips the zeros: 2× throughput on the same physical units. The sparsity has to be baked in at training time (sparse fine-tune); the hardware can't extract it for you.

A100 SXM4 headline numbers

Format | Dense TFLOPS | Sparse TFLOPS
FP16 / BF16 | 312 | 624
TF32 | 156 | 312
FP32 (CUDA cores) | 19.5 | —
INT8 | 624 TOPS | 1248 TOPS
Why Ampere is still everywhere

The A100 is six years old and still outsells any non-Hopper card for fine-tuning and inference. Reason: BF16 + 2:4 sparsity + 80 GB HBM2e is enough for almost all 7B–70B work, and used A100s are now affordable.

05

Generation 4 — Hopper (2022, H100)

Hopper is the FP8 generation. Two new formats, a software stack that uses them automatically, and a new instruction (wgmma) that operates on much larger tiles.

FP8 in two flavours

E4M3 — the weights/activation format

1 sign + 4 exp + 3 mantissa. Range ~±448, with only 3 mantissa bits of precision. Used for forward activations and weight tensors where range is bounded.

E5M2 — the gradient format

1 sign + 5 exp + 2 mantissa. Range ~±57344, much wider but with only 2 mantissa bits. Used for backward gradients, which span many orders of magnitude.

Both are 8 bits, so throughput is 2× BF16. The E4M3/E5M2 split mirrors how the network actually uses tensors: forward activations are bounded, gradients are heavy-tailed.
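
The trade-off is easy to see by round-tripping a few values through each flavour. A host-side sketch using the __nv_fp8_e4m3 / __nv_fp8_e5m2 types from <cuda_fp8.h> (CUDA 11.8+); exact rounding and saturation behaviour depends on the conversion mode:

CUDA/C++: FP8 round-trip, E4M3 vs E5M2 (sketch)
#include <cuda_fp8.h>
#include <cstdio>

int main() {
    const float samples[] = {0.1234f, 3.7f, 300.0f, 1000.0f, 50000.0f};
    for (float x : samples) {
        float e4m3 = float(__nv_fp8_e4m3(x));   // finer steps, tops out near ±448
        float e5m2 = float(__nv_fp8_e5m2(x));   // coarser steps, reaches ±57344
        printf("%10.4f -> E4M3 %10.4f   E5M2 %10.4f\n", x, e4m3, e5m2);
    }
    return 0;
}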

Transformer Engine

Software + hardware system that auto-quantises a transformer layer to FP8: per-tensor scaling factors are tracked across iterations (delayed scaling) so that activation ranges stay representable without per-step calibration. Wraps PyTorch nn.Linear as te.Linear; you keep your model code, you get FP8 forward and backward.
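
The key trick is that scales come from history rather than an extra pass over the current tensor. A minimal sketch of the delayed-scaling idea; the names, window length, and plain-C++ container are illustrative, not the Transformer Engine implementation:

C++: delayed scaling for FP8 (concept sketch)
#include <algorithm>
#include <deque>

struct DelayedScale {
    std::deque<float> amax_history;   // |x|_max observed in recent iterations
    size_t window = 16;               // history length (illustrative)
    float  fp8_max = 448.0f;          // E4M3 largest finite value

    // Scale chosen so the largest recent activation lands near the FP8 max;
    // applied this iteration, before the current tensor's amax is known.
    float current_scale() const {
        float amax = 1e-12f;
        for (float a : amax_history) amax = std::max(amax, a);
        return fp8_max / amax;
    }
    void record(float observed_amax) {
        amax_history.push_back(observed_amax);
        if (amax_history.size() > window) amax_history.pop_front();
    }
};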

WGMMA — warp-group MMA

A new asynchronous instruction that operates on much larger shapes — 64 × N × K where N goes up to 256 and the operation is issued by a warp group (128 threads / 4 warps) instead of a single warp. Combined with the Tensor Memory Accelerator (TMA) for the data movement, this is what unlocks Hopper's headline numbers.

H100 SXM5 headline numbers

FormatDense TFLOPSSparse TFLOPS
FP16 / BF169891979
TF32495989
FP8 (E4M3 / E5M2)19793958
INT81979 TOPS3958 TOPS
Why Hopper changed pricing

The shift from "BF16 is the default" to "FP8 is the default for prefill and serving" is the single biggest cost reduction in LLM ops since 2021. A 70B served at FP8 on an H100 fits in 80 GB, decodes ~30% faster than BF16 (bandwidth win), and prefills ~2× faster (compute win).

06

Generation 4 (Ada Variant, AD102)

Ada Lovelace is marketed as the same tensor-core generation as Hopper (4th-gen), but it sits at CC 8.9 (an 8.x like Ampere, not Hopper's 9.0), and the feature gating between the consumer and datacenter parts is very different.

L40S (datacenter Ada)

Has FP8 tensor cores, same E4M3/E5M2 path as Hopper. Throughput ~733 TFLOPS FP8 dense. Used heavily for cost-sensitive inference where HBM is overkill.

RTX 4090 / 4080 (consumer Ada)

FP8 tensor cores are present (~660 TFLOPS dense FP8 on a 4090, 2× BF16) — AD102 silicon is the same family as L40S. What consumer Ada lacks is the Transformer Engine software stack and full datacenter driver/library support; you can hit FP8 via CUTLASS / Triton / TensorRT-LLM kernels directly.

What Ada does not have that Hopper does

No wgmma (warp-group async MMA) and no TMA, so kernels stay on Ampere-style mma.sync; no Transformer Engine support; GDDR6 instead of HBM, so far less memory bandwidth; no NVLink.

The naming gotcha

"4th-gen tensor cores" appears in both Ada and Hopper marketing — but a 4090 and an H100 do not run the same kernels at the same speed for the same instructions. CUTLASS, vLLM, and TensorRT-LLM all carry separate kernel paths for AD102 vs GH100. Ada is best thought of as "Hopper-minus-FP8-minus-async".

07

Generation 5 — Blackwell (2024, B200)

Blackwell pushes per-element bit-width down again, this time below 8 bits, by introducing microscaling (MX) formats. Each block of 32 elements shares an 8-bit power-of-two scale; the elements themselves are 4 or 6 bits.

The MX-FP variants

Format | Element bits | Block size | Block scale | Effective bits/elem
MX-FP4 (E2M1) | 4 (1+2+1) | 32 | E8M0 (8-bit pow-2) | 4.25
MX-FP6 E2M3 | 6 (1+2+3) | 32 | E8M0 | 6.25
MX-FP6 E3M2 | 6 (1+3+2) | 32 | E8M0 | 6.25
MX-FP8 | 8 (E4M3 or E5M2) | 32 | E8M0 | 8.25

The shared block scale lets each 32-element block centre its dynamic range independently of every other block — so a 4-bit element can still represent a tensor whose values span many orders of magnitude across a layer, as long as within any 32-element window the spread is bounded.
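
A minimal sketch of what block quantisation to MX-FP4 involves, in the spirit of the OCP spec: pick a power-of-two scale from the block's largest magnitude, then round each element to the nearest E2M1 grid point. Function names, storage layout, and rounding details here are illustrative:

C++: MX-FP4 (E2M1) block quantisation (concept sketch)
#include <algorithm>
#include <cmath>
#include <cstdint>

// The eight non-negative E2M1 magnitudes; the 4th bit of each code is the sign.
static const float kE2M1Grid[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

struct MxFp4Block {
    int8_t  scale_exp;    // shared power-of-two scale (E8M0), stored as an exponent
    uint8_t codes[32];    // one 4-bit code per element (sign bit + grid index)
};

MxFp4Block quantise_block(const float x[32]) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    // Largest E2M1 magnitude is 6 = 1.5 * 2^2, so shift the block's max
    // into the top of the element range.
    int   scale_exp = (amax > 0.0f) ? (int)std::floor(std::log2(amax)) - 2 : 0;
    float scale     = std::ldexp(1.0f, scale_exp);

    MxFp4Block out{};
    out.scale_exp = (int8_t)scale_exp;
    for (int i = 0; i < 32; ++i) {
        float v = std::fabs(x[i]) / scale;
        int best = 0;   // nearest E2M1 grid point
        for (int k = 1; k < 8; ++k)
            if (std::fabs(v - kE2M1Grid[k]) < std::fabs(v - kE2M1Grid[best])) best = k;
        out.codes[i] = (uint8_t)((x[i] < 0.0f ? 0x8 : 0x0) | best);
    }
    return out;
}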

Two FP4 variants: MX-FP4 and NVFP4

Blackwell's 5th-gen tensor cores accelerate both FP4 flavours. MX-FP4 is the open OCP Microscaling format above (32-element block, 8-bit E8M0 power-of-two scale). NVFP4 is NVIDIA's variant: 16-element blocks, E4M3 block scale (with a per-tensor FP32 scale on top), giving finer-grained, higher-precision blocks at the cost of slightly more scale overhead. NVFP4 is what TensorRT-LLM and the Transformer Engine default to in 2026; MX-FP4 is what the rest of the ecosystem (vLLM, llama.cpp, OCP-aligned stacks) tends to target. Same hardware throughput either way.

2nd-gen Transformer Engine

Handles the MX / NVFP4 block scaling automatically: at quantisation time it picks per-block scales, packs the 16 or 32 elements, and emits the right tensor-core instruction. The user keeps writing PyTorch.

B200 headline numbers

Format | Dense TFLOPS | Sparse TFLOPS
FP16 / BF16 | 2250 | 4500
FP8 (E4M3 / E5M2) | 4500 | 9000
MX-FP6 | 4500 | 9000
MX-FP4 | 9000 | 18000
The line of doublings

Volta FP16 (125) → A100 BF16 (312) → H100 FP8 (1979) → B200 MX-FP4 (9000). Each step is a 2.5–6× jump, and roughly half of each jump came from a new tensor-core format rather than from more transistors.

08

Number Formats Explained

All eleven formats you'll meet on a modern NVIDIA GPU. The dynamic range column is approximate — the exact normal range for an IEEE-style float is 2^(1−bias) up to (2 − 2^(−mantissa)) · 2^bias, with bias = 2^(exp−1) − 1; for MX formats the per-block E8M0 scale extends each block's effective range further.

Format | Total bits | S+E+M | Approx dynamic range | Use case
FP32 | 32 | 1+8+23 | ~10^-38 … 10^38 | Master weights, optimiser state, accuracy-critical math
TF32 | 32 stored / 19 used | 1+8+10 | same as FP32 | Drop-in FP32 training on Ampere+; ~8× speedup
BF16 | 16 | 1+8+7 | same as FP32 | Default training/serving format since 2021
FP16 | 16 | 1+5+10 | ~6×10^-5 … 65504 | Mixed-precision training (with loss scaling), legacy serving
FP8 E4M3 | 8 | 1+4+3 | ~±448 | Forward activations & weights on Hopper+
FP8 E5M2 | 8 | 1+5+2 | ~±57344 | Backward gradients on Hopper+
MX-FP6 E2M3 | 6 + 8/32 scale | 1+2+3 (block) | per-block, wide | Blackwell precision-vs-quality middle ground
MX-FP6 E3M2 | 6 + 8/32 scale | 1+3+2 (block) | per-block, wider | As above with wider exp; better for grad-like tensors
MX-FP4 E2M1 | 4 + 8/32 scale | 1+2+1 (block) | per-block, narrow | Aggressive Blackwell inference; 2× FP8 throughput
INT8 | 8 | signed integer | -128 … 127 | CV/NLP quantised serving since Turing
INT4 | 4 | signed integer | -8 … 7 | AWQ/GPTQ weight-only quant on Ampere/Ada
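
As a sanity check of the range formula above, plugging FP16 (1 sign + 5 exp + 10 mantissa) into it reproduces the FP16 row exactly:

\[
\mathrm{bias} = 2^{5-1}-1 = 15,\qquad
\text{min normal} = 2^{1-15} \approx 6.1\times 10^{-5},\qquad
\text{max} = \left(2 - 2^{-10}\right)\cdot 2^{15} = 65504
\]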

Bit layout, side by side

[Figure: bit layouts side by side. Sign / exponent / mantissa bars for FP32 (32b), TF32 (19b used), BF16, FP16, FP8 E4M3, FP8 E5M2, MX-FP6 E2M3, MX-FP6 E3M2, MX-FP4 E2M1, INT8, INT4; bar widths proportional to bit count, with the MX formats' shared 8-bit E8M0 block scale drawn once per 32 elements.]
09

MMA Shapes & APIs

Each generation grew the native hardware shape, the PTX instruction family, and (slowly) the high-level programming model. Bigger shapes = fewer instructions per macro-tile = better op/byte and lower instruction-issue overhead.

Generation | Hardware shape (M×N×K) | PTX instruction | Programming API
Volta (V100, CC 7.0) | 16×16×16 (built from 4×4×4 unit) | wmma.mma.sync | WMMA C++ (nvcuda::wmma)
Turing (T4, CC 7.5) | 16×16×16 + INT8/INT4 paths | mma.sync | WMMA + cuBLASLt
Ampere (A100, CC 8.0) | 16×8×16 (FP16/BF16), 16×8×8 (TF32), 16×8×32 (INT8) | mma.sync (new shapes) | WMMA / CUTLASS 2.x / cuBLASLt
Hopper (H100, CC 9.0) | 64×N×16 (BF16), 64×N×32 (FP8), N up to 256; warp-group async | wgmma.mma_async | CUTLASS 3.x / cuBLASLt / Triton / Transformer Engine
Ada (AD102, CC 8.9) | same shapes as Ampere; FP8 on L40S and consumer parts | mma.sync (no wgmma) | CUTLASS 2.x / cuBLASLt
Blackwell (B200, CC 10.0) | 64×N×32 for MX formats; warp-group async + UMMA on B200 | wgmma family + new MX variants | CUTLASS 3.x / Transformer Engine 2.0 / Triton

Why bigger shapes matter

Volta — per-warp, per-cycle

One 16×16×16 issued by one warp (32 threads). Each thread holds tiny fragments. Lots of instructions to fill a 128×128 macro-tile.

Hopper — warp-group async

One wgmma 64x256x16 issued by 4 warps (128 threads), asynchronously; the warp group can issue more wgmmas while the previous one is in flight, with TMA bringing the next operand tile in parallel. The whole pipeline overlaps.

CUTLASS 3.x — Hopper FP8 GEMM (sketch)
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<
    cutlass::gemm::kernel::GemmUniversal<
        cutlass::gemm::collective::CollectiveBuilder<
            cutlass::arch::Sm90,                        // Hopper
            cutlass::arch::OpClassTensorOp,             // tensor cores
            cutlass::float_e4m3_t, RowMajor, 16,        // A: FP8 E4M3
            cutlass::float_e4m3_t, ColumnMajor, 16,     // B: FP8 E4M3
            float,                                      // accumulate FP32
            cute::Shape<cute::_128, cute::_256, cute::_64>,
            cute::Shape<cute::_2, cute::_1, cute::_1>,
            cutlass::gemm::collective::StageCountAuto,
            cutlass::gemm::collective::KernelTmaWarpSpecializedPingpong  // uses TMA + wgmma
        >::CollectiveOp,
        ...
    >
>;
10

2:4 Structured Sparsity

Introduced on Ampere (CC 8.0), still present on every later generation. Every block of 4 weights along the K dimension may have at most 2 non-zeros; 2-bit position indices stored alongside the compressed values tell the tensor core which two to read.

The dense vector

Imagine a 4-element weight vector: [w0, w1, w2, w3]. Dense matmul reads all four, multiplies by all four activations, accumulates four products.

[w0  w1  w2  w3]   // dense
 ×   ×   ×   ×
[a0  a1  a2  a3]
 ↓   ↓   ↓   ↓
sum = w0a0 + w1a1 + w2a2 + w3a3

2:4 sparse with mask

Here w1 and w2 are zero. The compressed weight stores just [w0, w3] plus a small index mask (0b1001, selecting positions 0 and 3). The hardware reads only a0 and a3: 2 multiplies instead of 4.

[w0   _   _  w3]   mask = 0b1001
 ×             ×
[a0  a1  a2  a3]    // HW skips a1, a2
 ↓             ↓
sum = w0a0 + w3a3   // 2× speed
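
For concreteness, here is what producing that compressed form looks like offline: a minimal sketch that keeps the two largest-magnitude weights per group and records their positions. Names and layout are illustrative; the real packing is handled by cuSPARSELt / ASP.

C++: compressing one 4-wide group into 2:4 form (sketch)
#include <cmath>
#include <cstdint>
#include <utility>

struct Sparse24Group {
    float   vals[2];   // the two kept weights
    uint8_t idx[2];    // their positions in the original group (0..3)
};

Sparse24Group compress_2_4(const float w[4]) {
    // Keep the two largest-magnitude entries (simple magnitude pruning).
    int a = 0, b = 1;
    if (std::fabs(w[b]) > std::fabs(w[a])) std::swap(a, b);
    for (int i = 2; i < 4; ++i) {
        if (std::fabs(w[i]) > std::fabs(w[a]))      { b = a; a = i; }
        else if (std::fabs(w[i]) > std::fabs(w[b])) { b = i; }
    }
    if (a > b) std::swap(a, b);   // store indices in ascending order
    return Sparse24Group{ { w[a], w[b] }, { (uint8_t)a, (uint8_t)b } };
}
// At run time the tensor core computes w[a]*act[a] + w[b]*act[b] and never
// touches the other two activations: the source of the 2x claim.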

The catch: training-time, not runtime

The hardware cannot extract this pattern from a dense weight matrix. The 2:4 mask must be present at training time — either learned via sparse fine-tuning (NVIDIA's ASP toolkit) or imposed by magnitude pruning + retrain. Otherwise the tensor core falls back to dense.

Why it's rare in production LLMs

Real-world wins are around 1.5–1.8× on inference for sparse-trained models — consistent with the 2× theoretical when memory bandwidth is the actual bottleneck. But almost no public LLM ships sparse weights, because (a) the fine-tune cost is real and (b) FP8/MX-FP4 quant gives a similar size win without the retrain. So 2:4 lives mostly in NVIDIA-internal benchmarks and a handful of CV models.

11

Interactive: Precision/Throughput Calculator

Pick a GPU and a numeric format. The panel tells you whether the format runs natively, the peak tensor-core throughput, and how it compares to the FP16 baseline. Emulated means the dtype is supported in software (so you save memory and bandwidth) but not in hardware (so compute throughput is no faster than the next-larger format).

[Interactive panel: Native? · Peak TFLOPS / TOPS · vs FP16 · Practical model size]
Reading the numbers

"Native" means the kernel calls a tensor-core MMA in this format directly — full speedup. "Emulated" (Ampere FP8) means the data is stored in 8-bit but the math runs at BF16: you save bandwidth but not compute. "No" means software has no kernel for this format on this arch and will refuse or fall back. Sparse rows assume you have a sparse-fine-tuned checkpoint — not just a dense one with 50% zeros.

12

Bottom Line — Pick Your Precision

One cheat-sheet table to close. The rule of thumb across all five generations: store smaller, accumulate wider. Each new format halves storage and doubles tensor-core throughput; the FP32 accumulator stays the same so numerical stability does not degrade.

Workload | Best precision | Why | Min generation
Pretraining (dense, large) | FP32 master weights + BF16 tensor cores; FP8 forward on Hopper+ | FP32-range exponent prevents overflow; 2× over FP16. FP8 cuts further on Hopper+. | Ampere
Fine-tune (LoRA / SFT) | BF16 base + LoRA adapters; FP8 with TE on Hopper | LoRA lives in a few MB; base model can be quantised; TE handles per-tensor scales. | Ampere
Inference — prefill (compute-bound) | FP8 on Hopper+; BF16 on Ampere/Ada-consumer | Prefill is matmul-bound, so the 2× FP8 compute uplift is fully realised on Hopper. | Ada (L40S) / Hopper
Inference — decode (bandwidth-bound) | FP8 weights + FP8 KV on H100; AWQ-INT4 on Ampere/Ada; MX-FP4 / NVFP4 on Blackwell | Decode is bandwidth-bound; smaller weights = more tok/s. Quality holds at INT4 / FP4 for most models. | Ampere (INT4); Hopper (FP8); Blackwell (FP4)
Embedding / encoder | FP16 weights + q8_0 KV (rarely tensor-core-bound) | Short sequences, small batches; you're memory-bound, not matmul-bound. | Turing+
The shortest correct answer

If you have Hopper, default to BF16 for training and FP8 for serving; turn on Transformer Engine and FP8 KV-cache. If you have Ampere, default to BF16 for training and AWQ-INT4 for serving; FP8 is software-emulated only. Consumer Ada has FP8 silicon (~660 TFLOPS on a 4090) but no Transformer Engine — reach FP8 via CUTLASS / Triton / TensorRT-LLM. If you have Blackwell, BF16 still wins for pretraining but MX-FP4 / NVFP4 is the new floor for inference. If you have Volta or Turing, you are doing FP16 / INT8 only — treat anything newer in the marketing as unsupported.