Block-floating-point at 4 bits/element — how the OCP MX spec and NVIDIA's NVFP4 work, where they live in Blackwell's 5th-gen tensor cores, and how they compare to FP8, INT4 and BF16.
LLM training and inference are both dominated by two costs: HBM bandwidth (every step reads all the weights, activations, and — in training — gradients) and tensor-core throughput (every token or training sample does hundreds of GFLOPs of dense matmul). Halving the bytes-per-element roughly doubles tok/s on the inference side and step/s on the training side; halving the bits also doubles peak math on a tensor core that natively supports the format. BFP4 is the format both NVIDIA and the OCP standard reach for to do that — and unlike INT4, it works for forward, backward and weight-update passes alike.
FP16 70 B = 140 GB. NVFP4 70 B = 39 GB. The model now fits in a single 48 GB card with KV-cache headroom.
Blackwell 5th-gen tensor cores deliver 2× FP8 at FP4. B200 reaches ~9 PFLOPS dense FP4 / ~18 PFLOPS sparse.
Naive 4-bit destroys quality. Block-scaling restores it: NVFP4 gets within ~0.1 perplexity of FP8 on Llama-3 70 B.
You don't quantise individual numbers to 4 bits. You quantise blocks of numbers, share one exponent (or scale) across each block, and let the per-element mantissas be 4 bits. This is the block-floating-point (BFP) idea, and it is decades old.
Three kinds of tensor get quantised in an LLM — the first two during inference, all three during training:
- Weights (WQ, WK, WV, WO, Wup, Wdown, etc.): static, set during training, read every token.
- Activations: dynamic, produced fresh at every layer for every token.
- Gradients: training only, flowing backward through the same layers.

All three are locally well-behaved within a small window: weights because training plus weight-decay keeps each row's magnitudes within a narrow range; activations because RMSNorm (or LayerNorm) immediately precedes every linear layer and pins each token's vector to roughly unit RMS; and gradients because they inherit that structure with one extra Jacobian. So a single shared scale per 16- or 32-element block absorbs almost all the dynamic range, and the 4-bit mantissas only have to encode the relative values within the block. The same encoding works whether you're serving 1000 tok/s of inference or running an FP4 training step.
BFP was invented for fixed-point DSP — the trick of letting an array share one exponent so the multiplier can stay integer. Modern AI BFP is the same idea, retro-fitted to tensor cores.
A second 4-bit format, also accelerated on Blackwell, with a smaller block (16) and a finer scale (FP8 E4M3 instead of E8M0), plus a per-tensor FP32 macro-scale. Targets training and inference where MXFP4 quality is borderline.
"Borderline" here means the cases MXFP4 measurably hurts but FP8 doesn't: training steps, long-context reasoning, and quality-critical code and math inference.
For pure weight-only inference of a chat model in the ≤32k regime, MXFP4 is usually fine; NVFP4 is the one to reach for everywhere else.
BFP is best understood as a hierarchical extension of classic floating point. An ordinary FP number is sign + exponent + mantissa, and is fully self-contained: every element pays for its own dynamic range (exponent) and its own precision (mantissa). BFP factors the exponent up one level — a block of elements share one scale, and each element carries only sign + mantissa relative to that shared scale. The encoder finds the block's largest absolute value, picks the smallest scale that lets every element fit on the per-element grid, divides everything by it, and rounds.
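That encode/decode loop can be sketched in a few lines of plain Python (illustrative only: the grid below is the positive FP4 E2M1 lattice, the block values are made up, and rounding is naive nearest-value rather than a hardware rounding mode):

```python
import math

# Positive FP4 E2M1 values: max magnitude is 6.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def round_to_fp4(x):
    """Round a scaled value to the nearest representable FP4 magnitude."""
    sign = -1.0 if x < 0 else 1.0
    return sign * min(FP4_GRID, key=lambda g: abs(g - abs(x)))

def bfp_encode(block):
    """One-level BFP: smallest power-of-two scale that fits the block's
    max-abs on the FP4 grid, then round each element against that scale."""
    max_abs = max(abs(v) for v in block)
    k = math.ceil(math.log2(max_abs / FP4_GRID[-1])) if max_abs else 0
    scale = 2.0 ** k
    return scale, [round_to_fp4(v / scale) for v in block]

def bfp_decode(scale, elements):
    # Decode is purely multiplicative: value = block_scale * element_value.
    return [scale * e for e in elements]

scale, elems = bfp_encode([0.02, -0.11, 0.07, 0.30])  # toy activation block
print(scale, elems)              # 0.0625 [0.5, -2.0, 1.0, 4.0]
print(bfp_decode(scale, elems))  # approximate reconstruction
```

Note how the shared scale (2^-4) carries all the dynamic range and the four elements only encode relative magnitude within the block.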
The depth of the hierarchy varies by format. Each new level handles a different scale of variation, and each is itself a small floating-point number:
Every level is purely multiplicative — no offsets anywhere in the hierarchy. Decode is just value = tensor_scale × block_scale × element_value; no level adds a zero-point or bias on top. This is one of the things that keeps BFP simpler than asymmetric integer quantisation: INT4 / INT8 schemes (AWQ, GPTQ, GGUF q4_K) often need a per-block zero-point — an additive offset, costing extra bits per block — to handle activation distributions that don't sit symmetrically around zero. BFP doesn't need one, because the per-element FP encoding is signed and naturally centred at zero. (The exponent bias inside the FP4 element encoding itself — bias = 1 for E2M1 — is a constant of the format, not a per-block or per-tensor parameter.)
The diagram below makes the hierarchy concrete: each row is one level, increasing in granularity from the top (one number per tensor) to the bottom (one number per element). Decoding any element is just multiplying its way back up the stack.
Both MXFP4 and NVFP4 elements are OCP FP4: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit, and a "no Inf, no NaN, no rounding modes" simplification. The 4 bits encode 16 codes that map to 15 distinct values (±0 collapse to one) — four of those codes (0000, 1000, 0001, 1001) are subnormals.
A normal IEEE-style float carries an implicit leading 1 in the mantissa, so the value is (−1)^s × 2^(exp−bias) × 1.m. A subnormal drops that implicit leading 1 when the exponent field is all zeros and replaces the formula with (−1)^s × 2^(1−bias) × 0.m. For E2M1 (bias 1) this means codes x000 represent ±0 (mantissa 0) and codes x001 represent ±0.5 (mantissa 1). Without subnormals the smallest positive value would jump straight from 0 to 1.0, leaving a "dead zone" around zero where small activations would round to either 0 or 1. Subnormals fill that gap and make the lattice continuous through zero — the linear-spaced ramp on either side of zero you see below.
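The full 16-code table follows directly from those two formulas; a sketch of a decoder (not NVIDIA's implementation):

```python
def decode_fp4_e2m1(code):
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit."""
    s = (code >> 3) & 1
    e = (code >> 1) & 0b11
    m = code & 1
    if e == 0:
        value = 0.5 * m                          # subnormal: (0.m) x 2^(1-1)
    else:
        value = (1 + 0.5 * m) * 2.0 ** (e - 1)   # normal: (1.m) x 2^(e-1)
    return -value if s else value

values = [decode_fp4_e2m1(c) for c in range(16)]
print(values[:8])        # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print(len(set(values)))  # 15 distinct values: +0 and -0 collapse
```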
Read the lattice: the spacing is logarithmic at the top end (steps of 2 between 4 and 6) and linear closest to zero (steps of 0.5 from 0 to 2). That is exactly what activations want — resolution near zero and headroom for outliers. The total dynamic range of FP4 alone is only 6/0.5 = 12×; the block scale is what stretches that to whatever the tensor needs.
Block size: 32 elements. Shared scale: one E8M0 byte (8-bit unsigned exponent, no sign, no mantissa — just 2^(k−127) for k = 0..254, plus NaN). Element: FP4 E2M1.
| Format | Element | Block | Scale | Eff. bits/el | Use |
|---|---|---|---|---|---|
| MXFP8 | FP8 E5M2 or E4M3 | 32 | E8M0 | 8.25 | Drop-in replacement for Hopper FP8 with hardware-scaled blocks |
| MXFP6 | FP6 E3M2 or E2M3 | 32 | E8M0 | 6.25 | Quality of FP8 at ¾ the bytes — rare in production |
| MXFP4 | FP4 E2M1 | 32 | E8M0 | 4.25 | Inference workhorse on Blackwell |
| MXINT8 | INT8 (signed) | 32 | E8M0 | 8.25 | Integer compute path with shared exponent |
E8M0 is just an 8-bit power-of-two. Multiplying by it is a shift. The encoder picks the smallest k such that max(|x|) ≤ max_FP4 × 2^(k−127). No mantissa work, no sign, no rounding for the scale itself — all the rounding lives in the FP4 elements, which is where the hardware quantiser already runs.
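That scale pick is a one-liner; a sketch following the spec's definition (the NaN code and saturation handling are omitted):

```python
import math

FP4_MAX = 6.0  # largest E2M1 magnitude

def e8m0_scale_code(max_abs):
    """Smallest k in 0..254 with max_abs <= FP4_MAX * 2^(k-127)."""
    k = 127 + math.ceil(math.log2(max_abs / FP4_MAX))
    return max(0, min(254, k))

k = e8m0_scale_code(300.0)    # a block whose max-abs is 300
print(k, 2.0 ** (k - 127))    # 133 64.0 -> 6 * 64 = 384 covers 300
```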
NVIDIA's variant fixes the two things that bite MXFP4 hardest: block granularity (32 is too coarse for some attention layers) and scale resolution (a power-of-two scale wastes ~1 bit of mantissa whenever max-abs falls between two powers of two).
Half the MX block. Catches narrow outlier channels (e.g. attention head dims of 64 or 128 split into four or eight 16-element blocks). Cost: more scale bytes per tensor.
Per-block scale is FP8 E4M3 (a real float, not power-of-two). Per-tensor scale is FP32 (set once by the Transformer Engine, like Hopper FP8). Decode: x = tensor_FP32 × block_E4M3 × element_FP4.
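A rough numerical illustration of why the finer scale matters (simplified: E4M3 subnormals and its 448 max are ignored, and both scales are rounded up so the block max still fits on the FP4 grid):

```python
import math

def fit_e8m0(s):
    """Round an ideal scale up to the nearest power of two (the E8M0 grid)."""
    return 2.0 ** math.ceil(math.log2(s))

def fit_e4m3(s):
    """Round an ideal scale up to the E4M3 grid: 3 mantissa bits means
    the significand moves in 1/8 steps (subnormals/max ignored)."""
    e = math.floor(math.log2(s))
    frac = s / 2.0 ** e                       # significand in [1, 2)
    return math.ceil(frac * 8) / 8 * 2.0 ** e

ideal = 300.0 / 6.0  # scale mapping a block max of 300 exactly onto FP4 max
print(fit_e8m0(ideal) / ideal)   # 1.28: E8M0 can overshoot by up to 2x
print(fit_e4m3(ideal) / ideal)   # 1.04: overshoot bounded by one 1/8 step
```

A power-of-two scale can overshoot by up to 2× (one wasted mantissa bit) whenever max-abs lands between two powers of two; the E4M3 scale tracks it to within ~12.5%.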
Both formats are produced by NVIDIA's Model Optimizer (and TensorRT-LLM); the export step converts an FP16/BF16 checkpoint into either MXFP4 or NVFP4 with calibration data. Pick NVFP4 for training and quality-critical inference (long-context reasoning, code, math); pick MXFP4 for general-purpose inference and the smallest possible weight footprint.
MXFP4 is part of the OCP Microscaling specification (v1.0, Sept 2023) — an open, royalty-free standard published by the Open Compute Project and contributed by AMD, Arm, Intel, Meta, Microsoft, NVIDIA and Qualcomm. Any silicon vendor can implement it. It already appears on NVIDIA Blackwell, and OCP MX support is on the public roadmap of competing accelerators (AMD CDNA 4, Intel Gaudi successors, Meta MTIA, AWS Trainium2/3).
NVFP4 is NVIDIA-proprietary. There is no public specification document, no OCP submission, and no independent implementation: the format is documented only in NVIDIA's Transformer Engine and Model Optimizer release notes, and the only hardware that decodes it is NVIDIA Blackwell. Other vendors are free to design similar two-level-scaled FP4 schemes (the underlying idea — block FP plus per-tensor scaling — is decades old and not novel by itself), but the specific NVFP4 encoding, the "NVFP4" name, and the TE v2 scale-management workflow are NVIDIA's. If you want a portable 4-bit format that runs across vendors, MXFP4 is the only choice today.
For quality — yes, consistently. Every published benchmark (NVIDIA Model Optimizer, OCP study, third-party reproductions) shows NVFP4 closer to FP8 than MXFP4 is, by a comfortable margin. The two-level scaling buys back most of what 32-element E8M0 blocks throw away.
For portability — no. MXFP4 is an open, royalty-free OCP standard implemented (or planned) on multiple vendors' silicon; NVFP4 is NVIDIA-only and runs only on Blackwell.
For bytes — MXFP4 wins. 4.25 vs 4.5 effective bits per element — about 5–6% smaller weight files for MXFP4.
| Property | MXFP4 | NVFP4 |
|---|---|---|
| Standard / governance | OCP MX v1.0 (Sept 2023) — open, royalty-free, multi-vendor | NVIDIA-proprietary — no public spec document |
| Implemented on which silicon? | NVIDIA Blackwell today; AMD CDNA 4, Intel Gaudi successors, Meta MTIA, AWS Trainium on roadmap | NVIDIA Blackwell only (B100/B200/GB200/RTX 50/RTX PRO 6000/GB10) |
| Element | FP4 E2M1 | FP4 E2M1 |
| Block size | 32 | 16 |
| Block scale | E8M0 (8 bits, power of two) | FP8 E4M3 (8 bits, real float) |
| Per-tensor scale | None (one level only) | FP32 (managed by TE v2) |
| Effective bits/element | 4.25 | ~4.5 |
| Tensor-core support | Blackwell 5th-gen MMA | Blackwell 5th-gen MMA |
| PFLOPS on B200 (dense / sparse) | ~9 / ~18 PFLOPS | ~9 / ~18 PFLOPS |
| Quality vs FP8 (Llama-3 70B) | ~0.3–0.6 PPL gap | ~0.05–0.15 PPL gap |
| Best fit | Cross-vendor inference, smallest footprint | Training and quality-critical inference on NVIDIA |
The MMA pipeline gains an FP4 / FP6 / NVFP4 datapath. The dot-product unit performs 4-bit × 4-bit → FP32 partial product with the block scales applied at accumulation. Sparsity (2:4) doubles peak again for FP4.
PTX/SASS exposes new MMA shapes via Blackwell's tcgen05.mma family (the 5th-gen tensor-core MMA op, replacing Hopper's wgmma) with .kind::mxf4, .kind::mxf4nvf4 (covering NVFP4) and .kind::mxf8f6f4 (covering MX FP8 and FP6).
Manages dynamic per-tensor scales for NVFP4 the same way TE v1 did for FP8 on Hopper — tracks max-abs across forward/backward passes, picks per-tensor FP32 scales, schedules cast kernels. The block-scale path (E8M0 for MX, E4M3 for NVFP4) is computed inline by the cast kernel.
```python
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling

recipe = DelayedScaling(fp8_format=Format.NVFP4,
                        amax_history_len=16,
                        amax_compute_algo="max")

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = model(input_ids)  # NVFP4 tensor cores light up
```
```shell
trtllm-build --checkpoint_dir llama3-70b-fp16 \
             --quantization mxfp4 \
             --output_dir engines/llama3-70b-mxfp4 \
             --gpus 1 \
             --max_seq_len 8192
```
Peak dense math per GPU, by format. The point is not the absolute numbers (vendor TFLOPs are always optimistic) but the shape of the curve: each generation pushes one new low-precision format into the tensor core, and its peak roughly doubles each time.
Equal-width steps in this chart = equal multiplicative jumps in throughput. Note that the H100 FP16→FP8 jump, the B200 FP16→FP8 jump, and the B200 FP8→FP4 jump are all visibly the same width — that is the "halving bits per element doubles peak math" rule made visible. Sparse (2:4) tensor cores would add a fourth identical step on Blackwell. INT4 / INT8 tensor-core support continued from Turing/Ampere into Hopper but is no longer a quality-leading choice for LLMs — FP4-with-block-scaling is. Bar colours: green = FP16/INT8 tier, orange = FP8 tier, red = FP4 tier.
Hopper has no native FP4 tensor cores and no MX support — if you serve MXFP4 weights on H100, the engine will dequantise to FP8 or FP16 before MMA. You get the bandwidth win, not the compute win. To collect both, you need Blackwell (B100/B200/RTX 50/RTX PRO 6000/GB10).
Representative WikiText-2 perplexity on Llama-3 70 B Instruct (lower is better). Numbers from NVIDIA Model Optimizer benchmarks plus published OpenCompute MX studies; absolute values vary with calibration data and tokeniser, the deltas are the point.
B200 has 8 TB/s HBM3e. A single forward pass of 70 B params at FP16 (140 GB) takes a hard minimum of 17.5 ms. At NVFP4 (39 GB) the floor drops to 4.9 ms. Decode tok/s scales the same way.
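The arithmetic behind those floors, as a quick sketch:

```python
HBM_BW = 8e12   # B200 HBM3e, bytes/s
PARAMS = 70e9   # parameter count

def weight_read_floor_ms(bits_per_param):
    """Hard latency floor for one full pass over the weights."""
    return PARAMS * bits_per_param / 8 / HBM_BW * 1e3

print(weight_read_floor_ms(16))    # 17.5 ms at FP16
print(weight_read_floor_ms(4.5))   # ~4.9 ms at NVFP4 (4.5 eff. bits/param)
```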
NVFP4 frees ~100 GB of HBM on a single B200. At FP16 KV, Llama-3 70 B's GQA layout costs ~0.33 MB per token, so that headroom is roughly 300 K extra tokens of paged KV-cache — the difference between a handful of concurrent long-context users and dozens.
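Back-of-envelope for that headroom, assuming Llama-3 70 B's published attention shape (80 layers, 8 KV heads via GQA, head dim 128) and FP16 KV:

```python
layers, kv_heads, head_dim = 80, 8, 128   # Llama-3 70B attention shape (GQA)
bytes_per_el = 2                          # FP16 KV entries

# K and V, per token, across all layers:
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_el
print(kv_per_token / 1e6)                 # ~0.33 MB/token

freed_hbm = 100e9                         # HBM freed by NVFP4 weights
print(freed_hbm / kv_per_token / 1e3)     # ~305 K tokens of KV headroom
```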
Memory traffic dominates inference energy. 4× smaller weights → ~3.5× lower J/token at fixed batch size on Blackwell, before any compute savings.
9 PFLOPS dense (18 PFLOPS sparse) FP4 on B200 means a 70 B model's prefill is essentially free even at long context. The bottleneck is back to activation bandwidth and KV access, not weight loading.
Activation bandwidth = the HBM bytes per token spent reading and writing the inter-layer activations — the tensors that flow between operators (residual stream, post-RMSNorm input to each linear, attention scores, GLU intermediates), as distinct from the weight bytes (the static parameters) and the KV-cache bytes (the per-token attention state).
For one transformer layer at hidden size H, batch B, sequence S, activation dtype d_act bytes, you write ≈ B × S × H × d_act bytes back to HBM and read them again next layer. With weights at FP16 this is small relative to weight reads; with weights at NVFP4 it's now comparable — weights dropped to ~0.56 bytes/param while activations are usually still BF16/FP16 (2 bytes/element) for accuracy. Above some sequence length the activation-bandwidth term overtakes weight bandwidth, and the classic "decode is weight-bound" intuition breaks. NVIDIA's FP8 / NVFP4 activation casting in Transformer Engine v2 is partly to address this; not all layers benefit yet.
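A rough crossover estimate under loud assumptions: the per-layer parameter count below is an approximation for Llama-3 70 B, and activation traffic is modelled as just the BF16 residual stream (real traffic, with attention scores and GLU intermediates, is several times larger, so the true crossover comes sooner):

```python
H = 8192                    # Llama-3 70B hidden size
per_layer_params = 855e6    # rough: GQA attention (~150M) + GLU MLP (~705M)

w_bytes = per_layer_params * 4.5 / 8   # NVFP4 weights: ~0.56 bytes/param
act_bytes_per_token = H * 2            # BF16 residual: H elements x 2 bytes

# Tokens per forward pass (batch x seq) at which this layer's activation
# traffic matches its weight traffic:
crossover = w_bytes / act_bytes_per_token
print(round(crossover))     # ~29k tokens, e.g. a batch-4 x 8k-token prefill
```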
Equal-width steps = equal multiplicative jumps. The two bracketed intervals at the bottom (FP16→FP8 and FP8→FP4) are the same length — that is the "halving bits/param" rule made visible. FP32 is omitted; it would extend off the right of the chart at 32 bits/param. The four 4-bit formats are clustered between the "4" and "5" gridlines because they all live in the same bit-decade — but the spread (INT4 119 px < MXFP4 129 < NVFP4 150 < GGUF 174) is large enough to read directly. Every fraction of a bit per parameter still matters at 70B–671B scale: ~1.4 GB per 0.16 bits/param at 70 B.
Drive the encoder yourself. Drag the four FP4 mantissas and the block scale; the panel decodes them through the BFP rules and shows the actual real-number values, plus what the same block would look like under MXFP4 vs NVFP4 vs naive INT4 quantisation of the same numbers.
Move element 2 to the largest negative (code 15 → -6) and the block scale to a small power of two. Then switch to MXFP4 and back to NVFP4. Watch the same fixed mantissas change effective values because the scale resolution changes. Switch to naive INT4 and notice how a single per-tensor scale crushes the dynamic range that the block-shared scale was protecting.
This deck zooms into the bit-level mechanics of the FP4 formats that several earlier decks reference at a higher level — Tensor Cores for the MMA pipeline, Blackwell and Inside Blackwell for the SM context, Quantization for Local Hosting for the practical INT4-vs-FP4 trade-off, and Hardware-Aware Quantisation for the broader number-format landscape.
In-text citation: [1] in the NVFP4 “What that buys you” bullet on slide 06 points to reference 1 (the NVFP4 developer-blog post). The numerical comparisons on slide 10 draw on references 1, 2 and 6.