Block-floating-point at 4 bits/element — how the OCP MX spec and NVIDIA's NVFP4 work, where they live in Blackwell's 5th-gen tensor cores, and how they compare to FP8, INT4 and BF16.
LLM training and inference are both dominated by two costs: HBM bandwidth (every step reads all the weights, activations, and — in training — gradients) and tensor-core throughput (every token or training sample does hundreds of GFLOPs of dense matmul). Halving the bytes-per-element roughly doubles tok/s on the inference side and step/s on the training side; halving the bits also doubles peak math on a tensor core that natively supports the format. BFP4 is the format both NVIDIA and the OCP standard reach for to do that — and unlike INT4, it works for forward, backward and weight-update passes alike.
FP16 70 B = 140 GB. NVFP4 70 B = 39 GB. The model now fits in a single 48 GB card with KV-cache headroom.
Blackwell 5th-gen tensor cores deliver 2× FP8 at FP4. B200 reaches ~9 PFLOPS dense FP4 / ~18 PFLOPS sparse.
Naive 4-bit destroys quality. Block-scaling restores it: NVFP4 gets within ~0.1 perplexity of FP8 on Llama-3 70 B.
You don't quantise individual numbers to 4 bits. You quantise blocks of numbers, share one exponent (or scale) across each block, and let the per-element mantissas be 4 bits. This is the block-floating-point (BFP) idea, and it is decades old.
Three kinds of tensor get quantised in an LLM — the first two during inference, all three during training:
- Weights (WQ, WK, WV, WO, Wup, Wdown, etc.): static, set during training, read every token.
- Activations: dynamic, produced fresh at every layer for every token.
- Gradients: training only, flowing backward through the same layers.

All three are locally well-behaved within a small window: weights because training plus weight-decay keeps each row's magnitudes within a narrow range; activations because RMSNorm (or LayerNorm) immediately precedes every linear layer and pins each token's vector to roughly unit RMS; and gradients because they inherit that structure with one extra Jacobian. So a single shared scale per 16- or 32-element block absorbs almost all the dynamic range, and the 4-bit mantissas only have to encode the relative values within the block. The same encoding works whether you're serving 1000 tok/s of inference or running an FP4 training step.
BFP was invented for fixed-point DSP — the trick of letting an array share one exponent so the multiplier can stay integer. Modern AI BFP is the same idea, retro-fitted to tensor cores.
A second 4-bit format, also accelerated on Blackwell, with a smaller block (16) and a finer scale (FP8 E4M3 instead of E8M0), plus a per-tensor FP32 macro-scale. Targets training and inference where MXFP4 quality is borderline.
"Borderline" here means the cases MXFP4 measurably hurts but FP8 doesn't: training steps, long-context reasoning, and quality-critical code and math inference.
For pure weight-only inference of a chat model in the ≤32k regime, MXFP4 is usually fine; NVFP4 is the one to reach for everywhere else.
BFP is best understood as a hierarchical extension of classic floating point. An ordinary FP number is sign + exponent + mantissa, and is fully self-contained: every element pays for its own dynamic range (exponent) and its own precision (mantissa). BFP factors the exponent up one level — a block of elements share one scale, and each element carries only sign + mantissa relative to that shared scale. The encoder finds the block's largest absolute value, picks the smallest scale that lets every element fit on the per-element grid, divides everything by it, and rounds.
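That encode/decode loop can be sketched in a few lines of plain Python (illustrative only: the grid below is the positive FP4 E2M1 lattice, the block values are made up, and rounding is naive nearest-value rather than a hardware rounding mode):

```python
import math

# Positive FP4 E2M1 values: max magnitude is 6.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def round_to_fp4(x):
    """Round a scaled value to the nearest representable FP4 magnitude."""
    sign = -1.0 if x < 0 else 1.0
    return sign * min(FP4_GRID, key=lambda g: abs(g - abs(x)))

def bfp_encode(block):
    """One-level BFP: smallest power-of-two scale that fits the block's
    max-abs on the FP4 grid, then round each element against that scale."""
    max_abs = max(abs(v) for v in block)
    k = math.ceil(math.log2(max_abs / FP4_GRID[-1])) if max_abs else 0
    scale = 2.0 ** k
    return scale, [round_to_fp4(v / scale) for v in block]

def bfp_decode(scale, elements):
    # Decode is purely multiplicative: value = block_scale * element_value.
    return [scale * e for e in elements]

scale, elems = bfp_encode([0.02, -0.11, 0.07, 0.30])  # toy activation block
print(scale, elems)              # 0.0625 [0.5, -2.0, 1.0, 4.0]
print(bfp_decode(scale, elems))  # approximate reconstruction
```

Note how the shared scale (2^-4) carries all the dynamic range and the four elements only encode relative magnitude within the block.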
The depth of the hierarchy varies by format. Each new level handles a different scale of variation, and each is itself a small floating-point number:
Every level is purely multiplicative — no offsets anywhere in the hierarchy. Decode is just value = tensor_scale × block_scale × element_value; no level adds a zero-point or bias on top. This is one of the things that keeps BFP simpler than asymmetric integer quantisation: INT4 / INT8 schemes (AWQ, GPTQ, GGUF q4_K) often need a per-block zero-point — an additive offset, costing extra bits per block — to handle activation distributions that don't sit symmetrically around zero. BFP doesn't need one, because the per-element FP encoding is signed and naturally centred at zero. (The exponent bias inside the FP4 element encoding itself — bias = 1 for E2M1 — is a constant of the format, not a per-block or per-tensor parameter.)
The diagram below makes the hierarchy concrete: each row is one level, increasing in granularity from the top (one number per tensor) to the bottom (one number per element). Decoding any element is just multiplying its way back up the stack.
Both MXFP4 and NVFP4 elements are OCP FP4: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit, and a "no Inf, no NaN, no rounding modes" simplification. The 4 bits encode 16 codes that map to 15 distinct values (±0 collapse to one) — four of those codes (0000, 1000, 0001, 1001) are subnormals.
A normal IEEE-style float carries an implicit leading 1 in the mantissa, so the value is (−1)^s × 2^(exp−bias) × 1.m. A subnormal drops that implicit leading 1 when the exponent field is all zeros and replaces the formula with (−1)^s × 2^(1−bias) × 0.m. For E2M1 (bias 1) this means codes x000 represent ±0 (mantissa 0) and codes x001 represent ±0.5 (mantissa 1). Without subnormals the smallest positive value would jump straight from 0 to 1.0, leaving a "dead zone" around zero where small activations would round to either 0 or 1. Subnormals fill that gap and make the lattice continuous through zero — the linear-spaced ramp on either side of zero you see below.
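The full 16-code table follows directly from those two formulas; a sketch of a decoder (not NVIDIA's implementation):

```python
def decode_fp4_e2m1(code):
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit."""
    s = (code >> 3) & 1
    e = (code >> 1) & 0b11
    m = code & 1
    if e == 0:
        value = 0.5 * m                          # subnormal: (0.m) x 2^(1-1)
    else:
        value = (1 + 0.5 * m) * 2.0 ** (e - 1)   # normal: (1.m) x 2^(e-1)
    return -value if s else value

values = [decode_fp4_e2m1(c) for c in range(16)]
print(values[:8])        # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print(len(set(values)))  # 15 distinct values: +0 and -0 collapse
```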
Read the lattice: the spacing is logarithmic at the top end (steps of 2 between 4 and 6) and linear closest to zero (steps of 0.5 from 0 to 2). That is exactly what activations want — resolution near zero and headroom for outliers. The total dynamic range of FP4 alone is only 6/0.5 = 12×; the block scale is what stretches that to whatever the tensor needs.
Block size: 32 elements. Shared scale: one E8M0 byte (8-bit unsigned exponent, no sign, no mantissa — just 2^(k−127) for k = 0..254, plus NaN). Element: FP4 E2M1.
| Format | Element | Block | Scale | Eff. bits/el | Use |
|---|---|---|---|---|---|
| MXFP8 | FP8 E5M2 or E4M3 | 32 | E8M0 | 8.25 | Drop-in replacement for Hopper FP8 with hardware-scaled blocks |
| MXFP6 | FP6 E3M2 or E2M3 | 32 | E8M0 | 6.25 | Quality of FP8 at ¾ the bytes — rare in production |
| MXFP4 | FP4 E2M1 | 32 | E8M0 | 4.25 | Inference workhorse on Blackwell |
| MXINT8 | INT8 (signed) | 32 | E8M0 | 8.25 | Integer compute path with shared exponent |
E8M0 is just an 8-bit power-of-two. Multiplying by it is a shift. The encoder picks the smallest k such that max(|x|) ≤ max_FP4 × 2^(k−127). No mantissa work, no sign, no rounding for the scale itself — all the rounding lives in the FP4 elements, which is where the hardware quantiser already runs.
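That scale pick is a one-liner; a sketch following the spec's definition (the NaN code and saturation handling are omitted):

```python
import math

FP4_MAX = 6.0  # largest E2M1 magnitude

def e8m0_scale_code(max_abs):
    """Smallest k in 0..254 with max_abs <= FP4_MAX * 2^(k-127)."""
    k = 127 + math.ceil(math.log2(max_abs / FP4_MAX))
    return max(0, min(254, k))

k = e8m0_scale_code(300.0)    # a block whose max-abs is 300
print(k, 2.0 ** (k - 127))    # 133 64.0 -> 6 * 64 = 384 covers 300
```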
NVIDIA's variant fixes the two things that bite MXFP4 hardest: block granularity (32 is too coarse for some attention layers) and scale resolution (a power-of-two scale wastes ~1 bit of mantissa whenever max-abs falls between two powers of two).
Half the MX block. Catches narrow outlier channels (e.g. attention head dims of 64 or 128 split into four or eight 16-element blocks). Cost: more scale bytes per tensor.
Per-block scale is FP8 E4M3 (a real float, not power-of-two). Per-tensor scale is FP32 (set once by the Transformer Engine, like Hopper FP8). Decode: x = tensor_FP32 × block_E4M3 × element_FP4.
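A rough numerical illustration of why the finer scale matters (simplified: E4M3 subnormals and its 448 max are ignored, and both scales are rounded up so the block max still fits on the FP4 grid):

```python
import math

def fit_e8m0(s):
    """Round an ideal scale up to the nearest power of two (the E8M0 grid)."""
    return 2.0 ** math.ceil(math.log2(s))

def fit_e4m3(s):
    """Round an ideal scale up to the E4M3 grid: 3 mantissa bits means
    the significand moves in 1/8 steps (subnormals/max ignored)."""
    e = math.floor(math.log2(s))
    frac = s / 2.0 ** e                       # significand in [1, 2)
    return math.ceil(frac * 8) / 8 * 2.0 ** e

ideal = 300.0 / 6.0  # scale mapping a block max of 300 exactly onto FP4 max
print(fit_e8m0(ideal) / ideal)   # 1.28: E8M0 can overshoot by up to 2x
print(fit_e4m3(ideal) / ideal)   # 1.04: overshoot bounded by one 1/8 step
```

A power-of-two scale can overshoot by up to 2× (one wasted mantissa bit) whenever max-abs lands between two powers of two; the E4M3 scale tracks it to within ~12.5%.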
Both formats are produced by NVIDIA's Model Optimizer (and TensorRT-LLM); the export step converts an FP16/BF16 checkpoint into either MXFP4 or NVFP4 with calibration data. Pick NVFP4 for training and quality-critical inference (long-context reasoning, code, math); pick MXFP4 for general-purpose inference and the smallest possible weight footprint.
MXFP4 is part of the OCP Microscaling specification (v1.0, Sept 2023) — an open, royalty-free standard published by the Open Compute Project and contributed by AMD, Arm, Intel, Meta, Microsoft, NVIDIA and Qualcomm. Any silicon vendor can implement it. It already appears on NVIDIA Blackwell, and OCP MX support is on the public roadmap of competing accelerators (AMD CDNA 4, Intel Gaudi successors, Meta MTIA, AWS Trainium2/3).
NVFP4 is NVIDIA-proprietary. There is no public specification document, no OCP submission, and no independent implementation: the format is documented only in NVIDIA's Transformer Engine and Model Optimizer release notes, and the only hardware that decodes it is NVIDIA Blackwell. Other vendors are free to design similar two-level-scaled FP4 schemes (the underlying idea — block FP plus per-tensor scaling — is decades old and not novel by itself), but the specific NVFP4 encoding, the "NVFP4" name, and the TE v2 scale-management workflow are NVIDIA's. If you want a portable 4-bit format that runs across vendors, MXFP4 is the only choice today.
For quality — yes, consistently. Every published benchmark (NVIDIA Model Optimizer, OCP study, third-party reproductions) shows NVFP4 closer to FP8 than MXFP4 is, by a comfortable margin. The two-level scaling buys back most of what 32-element E8M0 blocks throw away.
For portability — no. MXFP4 is an open, royalty-free OCP standard implemented (or planned) on multiple vendors' silicon; NVFP4 is NVIDIA-only and runs only on Blackwell.
For bytes — MXFP4 wins. 4.25 vs 4.5 effective bits per element — about 5–6% smaller weight files for MXFP4.
| Property | MXFP4 | NVFP4 |
|---|---|---|
| Standard / governance | OCP MX v1.0 (Sept 2023) — open, royalty-free, multi-vendor | NVIDIA-proprietary — no public spec document |
| Implemented on which silicon? | NVIDIA Blackwell today; AMD CDNA 4, Intel Gaudi successors, Meta MTIA, AWS Trainium on roadmap | NVIDIA Blackwell only (B100/B200/GB200/RTX 50/RTX PRO 6000/GB10) |
| Element | FP4 E2M1 | FP4 E2M1 |
| Block size | 32 | 16 |
| Block scale | E8M0 (8 bits, power of two) | FP8 E4M3 (8 bits, real float) |
| Per-tensor scale | None (one level only) | FP32 (managed by TE v2) |
| Effective bits/element | 4.25 | ~4.5 |
| Tensor-core support | Blackwell 5th-gen MMA | Blackwell 5th-gen MMA |
| PFLOPS on B200 (dense / sparse) | ~9 / ~18 PFLOPS | ~9 / ~18 PFLOPS |
| Quality vs FP8 (Llama-3 70B) | ~0.3–0.6 PPL gap | ~0.05–0.15 PPL gap |
| Best fit | Cross-vendor inference, smallest footprint | Training and quality-critical inference on NVIDIA |
The MMA pipeline gains an FP4 / FP6 / NVFP4 datapath. The dot-product unit performs 4-bit × 4-bit → FP32 partial product with the block scales applied at accumulation. Sparsity (2:4) doubles peak again for FP4.
PTX/SASS exposes new MMA shapes via Blackwell's tcgen05.mma family (the 5th-gen tensor-core MMA op, replacing Hopper's wgmma) with .kind::mxf4, .kind::mxf4nvf4 (covering NVFP4) and .kind::mxf8f6f4 (covering MX FP8 and FP6).
Manages dynamic per-tensor scales for NVFP4 the same way TE v1 did for FP8 on Hopper — tracks max-abs across forward/backward passes, picks per-tensor FP32 scales, schedules cast kernels. The block-scale path (E8M0 for MX, E4M3 for NVFP4) is computed inline by the cast kernel.
```python
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling

recipe = DelayedScaling(fp8_format=Format.NVFP4,
                        amax_history_len=16,
                        amax_compute_algo="max")

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = model(input_ids)  # NVFP4 tensor cores light up
```
```shell
trtllm-build --checkpoint_dir llama3-70b-fp16 \
             --quantization mxfp4 \
             --output_dir engines/llama3-70b-mxfp4 \
             --gpus 1 \
             --max_seq_len 8192
```
Peak dense math per GPU, by format. The point is not the absolute numbers (vendor TFLOPs are always optimistic) but the shape of the curve: each generation pushes one new low-precision format into the tensor core, and its peak roughly doubles each time.
Equal-width steps in this chart = equal multiplicative jumps in throughput. Note that the H100 FP16→FP8 jump, the B200 FP16→FP8 jump, and the B200 FP8→FP4 jump are all visibly the same width — that is the "halving bits per element doubles peak math" rule made visible. Sparse (2:4) tensor cores would add a fourth identical step on Blackwell. INT4 / INT8 tensor-core support continued from Turing/Ampere into Hopper but is no longer a quality-leading choice for LLMs — FP4-with-block-scaling is. Bar colours: green = FP16/INT8 tier, orange = FP8 tier, red = FP4 tier.
Hopper has no native FP4 tensor cores and no MX support — if you serve MXFP4 weights on H100, the engine will dequantise to FP8 or FP16 before MMA. You get the bandwidth win, not the compute win. To collect both, you need Blackwell (B100/B200/RTX 50/RTX PRO 6000/GB10).
Representative WikiText-2 perplexity on Llama-3 70 B Instruct (lower is better). Numbers from NVIDIA Model Optimizer benchmarks plus published OpenCompute MX studies; absolute values vary with calibration data and tokeniser, the deltas are the point.
B200 has 8 TB/s HBM3e. A single forward pass of 70 B params at FP16 (140 GB) takes a hard minimum of 17.5 ms. At NVFP4 (39 GB) the floor drops to 4.9 ms. Decode tok/s scales the same way.
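The arithmetic behind those floors, as a quick sketch:

```python
HBM_BW = 8e12   # B200 HBM3e, bytes/s
PARAMS = 70e9   # parameter count

def weight_read_floor_ms(bits_per_param):
    """Hard latency floor for one full pass over the weights."""
    return PARAMS * bits_per_param / 8 / HBM_BW * 1e3

print(weight_read_floor_ms(16))    # 17.5 ms at FP16
print(weight_read_floor_ms(4.5))   # ~4.9 ms at NVFP4 (4.5 eff. bits/param)
```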
NVFP4 frees ~100 GB of HBM on a single B200. At FP16 KV, Llama-3 70 B's GQA layout costs ~0.33 MB per token, so that headroom is roughly 300 K extra tokens of paged KV-cache — the difference between a handful of concurrent long-context users and dozens.
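Back-of-envelope for that headroom, assuming Llama-3 70 B's published attention shape (80 layers, 8 KV heads via GQA, head dim 128) and FP16 KV:

```python
layers, kv_heads, head_dim = 80, 8, 128   # Llama-3 70B attention shape (GQA)
bytes_per_el = 2                          # FP16 KV entries

# K and V, per token, across all layers:
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_el
print(kv_per_token / 1e6)                 # ~0.33 MB/token

freed_hbm = 100e9                         # HBM freed by NVFP4 weights
print(freed_hbm / kv_per_token / 1e3)     # ~305 K tokens of KV headroom
```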
Memory traffic dominates inference energy. 4× smaller weights → ~3.5× lower J/token at fixed batch size on Blackwell, before any compute savings.
9 PFLOPS dense (18 PFLOPS sparse) FP4 on B200 means a 70 B model's prefill is essentially free even at long context. The bottleneck is back to activation bandwidth and KV access, not weight loading.
Activation bandwidth = the HBM bytes per token spent reading and writing the inter-layer activations — the tensors that flow between operators (residual stream, post-RMSNorm input to each linear, attention scores, GLU intermediates), as distinct from the weight bytes (the static parameters) and the KV-cache bytes (the per-token attention state).
For one transformer layer at hidden size H, batch B, sequence S, activation dtype d_act bytes, you write ≈ B × S × H × d_act bytes back to HBM and read them again next layer. With weights at FP16 this is small relative to weight reads; with weights at NVFP4 it's now comparable — weights dropped to ~0.56 bytes/param while activations are usually still BF16/FP16 (2 bytes/element) for accuracy. Above some sequence length the activation-bandwidth term overtakes weight bandwidth, and the classic "decode is weight-bound" intuition breaks. NVIDIA's FP8 / NVFP4 activation casting in Transformer Engine v2 is partly to address this; not all layers benefit yet.
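A rough crossover estimate under loud assumptions: the per-layer parameter count below is an approximation for Llama-3 70 B, and activation traffic is modelled as just the BF16 residual stream (real traffic, with attention scores and GLU intermediates, is several times larger, so the true crossover comes sooner):

```python
H = 8192                    # Llama-3 70B hidden size
per_layer_params = 855e6    # rough: GQA attention (~150M) + GLU MLP (~705M)

w_bytes = per_layer_params * 4.5 / 8   # NVFP4 weights: ~0.56 bytes/param
act_bytes_per_token = H * 2            # BF16 residual: H elements x 2 bytes

# Tokens per forward pass (batch x seq) at which this layer's activation
# traffic matches its weight traffic:
crossover = w_bytes / act_bytes_per_token
print(round(crossover))     # ~29k tokens, e.g. a batch-4 x 8k-token prefill
```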
Equal-width steps = equal multiplicative jumps. The two bracketed intervals at the bottom (FP16→FP8 and FP8→FP4) are the same length — that is the "halving bits/param" rule made visible. FP32 is omitted; it would extend off the right of the chart at 32 bits/param. The four 4-bit formats are clustered between the "4" and "5" gridlines because they all live in the same bit-decade — but the spread (INT4 119 px < MXFP4 129 < NVFP4 150 < GGUF 174) is large enough to read directly. Every fraction of a bit per parameter still matters at 70B–671B scale: ~1.4 GB per 0.16 bits/param at 70 B.
Drive the encoder yourself. Drag the four FP4 mantissas and the block scale; the panel decodes them through the BFP rules and shows the actual real-number values, plus what the same block would look like under MXFP4 vs NVFP4 vs naive INT4 quantisation of the same numbers.
Move element 2 to the largest negative (code 15 → -6) and the block scale to a small power of two. Then switch to MXFP4 and back to NVFP4. Watch the same fixed mantissas change effective values because the scale resolution changes. Switch to naive INT4 and notice how a single per-tensor scale crushes the dynamic range that the block-shared scale was protecting.
This deck zooms into the bit-level mechanics of the FP4 formats that several earlier decks reference at a higher level — Tensor Cores for the MMA pipeline, Blackwell and Inside Blackwell for the SM context, Quantization for Local Hosting for the practical INT4-vs-FP4 trade-off, and Hardware-Aware Quantisation for the broader number-format landscape.
In-text citation: [1] in the NVFP4 “What that buys you” bullet on slide 06 points to reference 1 (the NVFP4 developer-blog post). The numerical comparisons on slide 10 draw on references 1, 2 and 6.