NVIDIA GPU Architectures — Presentation 38

BFP4 — NVFP4 & MXFP4

Block-floating-point at 4 bits/element — how the OCP MX spec and NVIDIA's NVFP4 work, where they live in Blackwell's 5th-gen tensor cores, and how they compare to FP8, INT4 and BF16.

00

Topics

Why 4-bit · History · The BFP idea · The FP4 element · MXFP4 · NVFP4 · Comparison

01

Why a 4-bit block format

LLM training and inference are both dominated by two costs: HBM bandwidth (every step reads all the weights, activations, and — in training — gradients) and tensor-core throughput (every token or training sample does hundreds of GFLOPs of dense matmul). Halving the bytes-per-element roughly doubles tok/s on the inference side and step/s on the training side; halving the bits also doubles peak math on a tensor core that natively supports the format. BFP4 is the format both NVIDIA and the OCP standard reach for to do that — and unlike INT4, it works for forward, backward and weight-update passes alike.

Memory

FP16 70 B = 140 GB. NVFP4 70 B = 39 GB. The model now fits in a single 48 GB card with KV-cache headroom.

Compute

Blackwell 5th-gen tensor cores deliver 2× FP8 at FP4. B200 reaches ~9 PFLOPS dense FP4 / ~18 PFLOPS sparse.

Quality

Naive 4-bit destroys quality. Block-scaling restores it: NVFP4 gets within ~0.1 perplexity of FP8 on Llama-3 70 B.

The trick

You don't quantise individual numbers to 4 bits. You quantise blocks of numbers, share one exponent (or scale) across each block, and let the per-element mantissas be 4 bits. This is the block-floating-point (BFP) idea, and it is decades old.

Three kinds of tensor get quantised in an LLM — weights, activations and gradients; the first two during inference, all three during training.

All three are locally well-behaved within a small window: weights because training plus weight-decay keeps each row's magnitudes within a narrow range; activations because RMSNorm (or LayerNorm) immediately precedes every linear layer and pins each token's vector to roughly unit RMS; and gradients because they inherit that structure with one extra Jacobian. So a single shared scale per 16- or 32-element block absorbs almost all the dynamic range, and the 4-bit mantissas only have to encode the relative values within the block. The same encoding works whether you're serving 1000 tok/s of inference or running an FP4 training step.

02

A short history of block floating point

BFP was invented for fixed-point DSP — the trick of letting an array share one exponent so the multiplier can stay integer. Modern AI BFP is the same idea, retro-fitted to tensor cores.

1965 — IBM 360/91
Hex floating point with shared exponents on stacked operations
Not strictly BFP, but the lineage starts here — engineers learned to amortise an exponent across many mantissas to save on adders.
1980s — First commercial DSPs
TI TMS320, ADI 21xx, Motorola 56k — FFT in BFP
Fixed-point hardware ran the FFT butterfly with a per-stage shared exponent, rescaled between stages. This is BFP under another name and is still how most low-power DSP audio runs today.
2008 — IEEE 754-2008
FP16 standardised; deep learning imminent
Half-precision becomes the default ML training format on GPUs (P100, V100, then everything Volta+).
2020 — Microsoft MSFP
First production BFP for AI inference
Microsoft Research's Microsoft Floating Point (MSFP12 / MSFP11 / MSFP9) shipped on the Brainwave NPUs in Azure. Block-shared 8-bit exponent with 4–7-bit mantissas. Demonstrated <1% accuracy loss vs FP32 on production transformers.
2022 — Hopper FP8
First per-tensor-scaled FP8 on a GPU
H100's 4th-gen tensor cores added FP8 (E4M3 & E5M2) with one FP32 scale per tensor and a Transformer Engine that retunes scales each step. Sets the stage for finer-grained scaling.
Sept 2023 — OCP MX spec v1.0
Microscaling Formats for AI — the open standard
AMD, Arm, Intel, Meta, Microsoft, NVIDIA & Qualcomm publish the Microscaling (MX) specification under the Open Compute Project: six formats — MXFP8 (E5M2/E4M3), MXFP6 (E3M2/E2M3), MXFP4 (E2M1), MXINT8 — all sharing a 32-element block with one E8M0 scale.
March 2024 — Blackwell
5th-gen tensor cores — first GPU with native MXFP4
B100 / B200 / GB200 / RTX 50 / GB10 ship with hardware MMA support for MXFP6 and MXFP4 (and an INT-side MXINT8). The 2nd-gen Transformer Engine drives scale management.
June 2024 — NVFP4 disclosed
NVIDIA's higher-accuracy 4-bit variant

A second 4-bit format, also accelerated on Blackwell, with a smaller block (16) and a finer scale (FP8 E4M3 instead of E8M0), plus a per-tensor FP32 macro-scale. Targets training and inference where MXFP4 quality is borderline.

"Borderline" here means the cases MXFP4 measurably hurts but FP8 doesn't. Specifically:

  • Training — gradients have a much wider dynamic range than activations and are full of outliers, so a power-of-two E8M0 scale per 32 elements rounds away too much signal. Errors then compound over millions of steps. NVFP4's FP8 block scale plus FP32 tensor scale lets the Transformer Engine track amax dynamically (the trick that made FP8 training stable on Hopper).
  • Long-context inference — attention scores at 32k+ tokens span a huge range, and small mass on the tail still matters; 32-element blocks blur it.
  • Reasoning models (o1, R1-style) — errors compound across hundreds of generated tokens of internal chain-of-thought, so a 0.3 PPL gap that is invisible on a chat benchmark becomes a measurable accuracy drop on AIME / GPQA.
  • Models with peaky weight distributions — some MoE experts and recent code models have a few channels that are 5–10× the median; 16-element blocks isolate those, 32-element blocks let them dominate.

For pure weight-only inference of a chat model in the ≤32k regime, MXFP4 is usually fine; NVFP4 is the one to reach for everywhere else.

03

The block-floating-point idea

BFP is best understood as a hierarchical extension of classic floating point. An ordinary FP number is sign + exponent + mantissa, and is fully self-contained: every element pays for its own dynamic range (exponent) and its own precision (mantissa). BFP factors the exponent up one level — a block of elements share one scale, and each element carries only sign + mantissa relative to that shared scale. The encoder finds the block's largest absolute value, picks the smallest scale that lets every element fit on the per-element grid, divides everything by it, and rounds.

The depth of the hierarchy varies by format. Each new level handles a different scale of variation, and each is itself a small floating-point number:

Every level is purely multiplicative — no offsets anywhere in the hierarchy. Decode is just value = tensor_scale × block_scale × element_value; no level adds a zero-point or bias on top. This is one of the things that keeps BFP simpler than asymmetric integer quantisation: INT4 / INT8 schemes (AWQ, GPTQ, GGUF q4_K) often need a per-block zero-point — an additive offset, costing extra bits per block — to handle activation distributions that don't sit symmetrically around zero. BFP doesn't need one, because the per-element FP encoding is signed and naturally centred at zero. (The exponent bias inside the FP4 element encoding itself — bias = 1 for E2M1 — is a constant of the format, not a per-block or per-tensor parameter.)
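
To make the multiplicative-only hierarchy concrete, here is a minimal numpy sketch of a one-level BFP encode/decode round trip (helper names are mine; the block scale is kept in full precision here, and the format-specific E8M0 / E4M3 scale quantisation is covered in sections 05 and 06):

import numpy as np

# Positive half of the FP4 E2M1 grid (see section 04); the encoder rounds
# each scaled element to the nearest representable magnitude.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def bfp_encode(block, grid=E2M1_GRID):
    """One block: pick the smallest scale that puts the max-abs element
    on the grid, then round every element to the scaled grid."""
    scale = np.abs(block).max() / grid.max()          # max_abs / 6 for E2M1
    scaled = block / scale
    idx = np.abs(np.abs(scaled)[:, None] - grid[None, :]).argmin(axis=1)
    return scale, np.sign(scaled) * grid[idx]         # scale + element values

def bfp_decode(scale, elements):
    return scale * elements    # purely multiplicative; no zero-point anywhere

scale, elements = bfp_encode(np.array([0.02, -1.3, 0.4, 6.5]))
print(scale, elements, bfp_decode(scale, elements))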

The diagram below makes the hierarchy concrete: each row is one level, increasing in granularity from the top (one number per tensor) to the bottom (one number per element). Decoding any element is just multiplying its way back up the stack.

Per-tensor FP32 (top) — per-block float (middle) — per-element FP4 (bottom). Tensor scale: FP32, one per tensor — NVFP4 only (MXFP4 has none). Block scales: one per block — FP8 E4M3 (NVFP4) or E8M0 (MXFP4). Elements: one FP4 E2M1 per value, 4 bits each. Decode: value = tensor_scale × block_scale × element. NVFP4 uses all three levels; MXFP4 drops tensor_scale; legacy DSP BFP keeps only block_scale with integer mantissas.

Why this beats per-tensor scaling
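
The effect is easiest to see numerically. A hedged sketch, reusing the bfp_encode / bfp_decode helpers above on synthetic data (an illustration, not a measured benchmark): a single outlier forces a per-tensor scale so coarse that ordinary values lose nearly all resolution, while block-level scales confine the damage to the outlier's own block.

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 1024)    # well-behaved values...
x[0] = 50.0                   # ...plus one extreme outlier

def mean_error(x, block_size):
    """Mean absolute reconstruction error for a given block size."""
    recon = np.empty_like(x)
    for i in range(0, len(x), block_size):
        s, e = bfp_encode(x[i:i + block_size])
        recon[i:i + block_size] = bfp_decode(s, e)
    return np.abs(recon - x).mean()

print(mean_error(x, 1024))  # block = whole tensor: most values snap to 0
print(mean_error(x, 32))    # 32-element blocks: roughly an order of magnitude better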

04

FP4 E2M1 — the element

Both MXFP4 and NVFP4 elements are OCP FP4: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit, and a "no Inf, no NaN, no rounding modes" simplification. The 4 bits encode 16 codes that map to 15 distinct values (±0 collapse to one); the four codes with exponent field 00 (0000, 1000, 0001, 1001) form the subnormal range — x000 are the zeros, x001 the true subnormals ±0.5.

A normal IEEE-style float carries an implicit leading 1 in the mantissa, so the value is (−1)^s × 2^(exp−bias) × 1.m. A subnormal drops that implicit leading 1 when the exponent field is all zeros and replaces the formula with (−1)^s × 2^(1−bias) × 0.m. For E2M1 (bias 1) this means codes x000 represent ±0 (mantissa 0) and codes x001 represent ±0.5 (mantissa 1). Without subnormals the smallest positive value would jump straight from 0 to 1.0, leaving a "dead zone" around zero where small activations would round to either 0 or 1. Subnormals fill that gap and make the lattice continuous through zero — the linear-spaced ramp on either side of zero you see below.

FP4 E2M1 bit layout: s | e e | m — 1 sign + 2 exponent (bias 1) + 1 mantissa. Finite values only — no Inf, no NaN, no rounding modes; range: max ±6, min normal ±1, subnormals ±0.5. Full 16-code lattice: 1111 = −6, 1110 = −4, 1101 = −3, 1100 = −2, 1011 = −1.5, 1010 = −1, 1001 = −0.5, x000 = ±0, 0001 = 0.5, 0010 = 1, 0011 = 1.5, 0100 = 2, 0101 = 3, 0110 = 4, 0111 = 6. Subnormal range (exp = 00): codes x000 = ±0, x001 = ±0.5; sign bit 1 selects the negative half.

Read the lattice: the spacing is logarithmic at the top end (steps of 2 between 4 and 6) and linear closest to zero (steps of 0.5 from 0 to 2). That is exactly what activations want — resolution near zero and headroom for outliers. The total dynamic range of FP4 alone is only 6/0.5 = 12×; the block scale is what stretches that to whatever the tensor needs.
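
The lattice is small enough to verify by hand. A sketch that decodes all 16 codes straight from the bit fields, following the normal/subnormal formulas above:

def decode_e2m1(code):
    """Decode a 4-bit E2M1 code (0..15); bias 1, subnormal when exp field is 00."""
    s, e, m = (code >> 3) & 1, (code >> 1) & 3, code & 1
    if e == 0:
        val = 0.5 * m                        # (-1)^s x 2^(1-1) x 0.m
    else:
        val = 2 ** (e - 1) * (1 + 0.5 * m)   # (-1)^s x 2^(e-1) x 1.m
    return -val if s else val

# prints 0, 0.5, 1, 1.5, 2, 3, 4, 6 and the mirrored negative half
# (code 1000 prints -0.0: the two zero codes collapse to one value)
print([decode_e2m1(c) for c in range(16)])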

05

MXFP4 — the OCP microscaling spec

Block size: 32 elements. Shared scale: one E8M0 byte (8-bit unsigned exponent, no sign, no mantissa — just 2^(k−127) for k = 0..254, plus NaN). Element: FP4 E2M1.

One MX block = 1 byte scale + 32 × 4-bit elements = 17 bytes total (the 32 FP4 elements occupy 16 bytes). Decode: x[i] = scale × element[i], with scale = 2^(k−127), k = 0..254 (k = 255 reserved for NaN). Effective bits/element = 4 + 8/32 = 4.25.

The MX family at a glance

Format | Element | Block | Scale | Eff. bits/el | Use
MXFP8 | FP8 E5M2 or E4M3 | 32 | E8M0 | 8.25 | Drop-in replacement for Hopper FP8 with hardware-scaled blocks
MXFP6 | FP6 E3M2 or E2M3 | 32 | E8M0 | 6.25 | Quality of FP8 at ¾ the bytes — rare in production
MXFP4 | FP4 E2M1 | 32 | E8M0 | 4.25 | Inference workhorse on Blackwell
MXINT8 | INT8 (signed) | 32 | E8M0 | 8.25 | Integer compute path with shared exponent
Why E8M0 (not E4M3) for the scale

E8M0 is just an 8-bit power-of-two. Multiplying by it is a shift. The encoder picks the smallest k such that max(|x|) ≤ max_FP4 × 2^(k−127). No mantissa work, no sign, no rounding for the scale itself — all the rounding lives in the FP4 elements, which is where the hardware quantiser already runs.
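
The scale search itself is a one-liner. A sketch (clamping behaviour at the range edges is my assumption):

import math

def pick_e8m0_scale(max_abs, max_fp4=6.0):
    """Smallest k in 0..254 with max_abs <= max_fp4 * 2^(k-127)."""
    k = 127 + math.ceil(math.log2(max_abs / max_fp4)) if max_abs > 0 else 0
    k = min(max(k, 0), 254)        # k = 255 is reserved for NaN
    return k, 2.0 ** (k - 127)     # the scale is an exact power of two

k, scale = pick_e8m0_scale(17.3)
print(k, scale)   # k = 129, scale = 4.0: 17.3 / 4.0 = 4.325 fits under 6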

06

NVFP4 — NVIDIA's two-level scaled variant

NVIDIA's variant fixes the two things that bite MXFP4 hardest: block granularity (32 is too coarse for some attention layers) and scale resolution (a power-of-two scale wastes ~1 bit of mantissa whenever max-abs falls between two powers of two).

Block size = 16

Half the MX block. Catches narrow outlier channels (e.g. attention head dims of 64 or 128 split into 4 or 8 blocks of 16). Cost: more scale bytes per tensor.

Two-level scaling

Per-block scale is FP8 E4M3 (a real float, not power-of-two). Per-tensor scale is FP32 (set once by the Transformer Engine, like Hopper FP8). Decode: x = tensor_FP32 × block_E4M3 × element_FP4.

NVFP4 = FP32 tensor scale + FP8 per-16 block scale + FP4 elements. One FP32 scale per tensor, one FP8 E4M3 scale per 16-element block; 32 FP4 elements = 2 blocks. Effective bits/element = 4 + 8/16 + 32/N_tensor ≈ 4.5.
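
A sketch of how the two levels can compose under that decode rule — an illustration of the scheme as described, not NVIDIA's actual cast-kernel logic (names are hypothetical; tensor length is assumed to be a multiple of 16):

import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite E4M3 value

def nvfp4_scales(tensor, block=16, fp4_max=6.0):
    """Pick the FP32 macro-scale so every per-block scale lands inside
    E4M3 range, then one real-valued scale per 16-element block."""
    amax = np.abs(tensor).max()
    tensor_scale = amax / (fp4_max * FP8_E4M3_MAX)    # FP32, one per tensor
    block_amax = np.abs(tensor.reshape(-1, block)).max(axis=1)
    block_scales = block_amax / (fp4_max * tensor_scale)
    # real hardware would round block_scales to FP8 E4M3 here
    return tensor_scale, block_scales   # decode: x = tensor_scale * block_scales[b] * element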

What that buys you

Per NVIDIA's published numbers [1], NVFP4 lands within ~0.05–0.15 PPL of FP8 on Llama-3 70 B (versus ~0.3–0.6 for MXFP4) and is stable enough for full FP4 pretraining: the real-valued E4M3 block scale stops wasting mantissa whenever max-abs falls between two powers of two, and the FP32 macro-scale gives the Transformer Engine the same amax machinery that made FP8 training work on Hopper.

Practical note

Both formats are produced by NVIDIA's Model Optimizer (and TensorRT-LLM); the export step converts an FP16/BF16 checkpoint into either MXFP4 or NVFP4 with calibration data. Pick NVFP4 for training and quality-critical inference (long-context reasoning, code, math); pick MXFP4 for general-purpose inference and the smallest possible weight footprint.

Open vs proprietary

MXFP4 is part of the OCP Microscaling specification (v1.0, Sept 2023) — an open, royalty-free standard published by the Open Compute Project and contributed by AMD, Arm, Intel, Meta, Microsoft, NVIDIA and Qualcomm. Any silicon vendor can implement it. It already appears on NVIDIA Blackwell, and OCP MX support is on the public roadmap of competing accelerators (AMD CDNA 4, Intel Gaudi successors, Meta MTIA, AWS Trainium2/3).

NVFP4 is NVIDIA-proprietary. There is no public specification document, no OCP submission, and no independent implementation: the format is documented only in NVIDIA's Transformer Engine and Model Optimizer release notes, and the only hardware that decodes it is NVIDIA Blackwell. Other vendors are free to design similar two-level-scaled FP4 schemes (the underlying idea — block FP plus per-tensor scaling — is decades old and not novel by itself), but the specific NVFP4 encoding, the "NVFP4" name, and the TE v2 scale-management workflow are NVIDIA's. If you want a portable 4-bit format that runs across vendors, MXFP4 is the only choice today.

07

MXFP4 vs NVFP4 side-by-side

Is NVFP4 better than MXFP4?

For quality — yes, consistently. Every published benchmark (NVIDIA Model Optimizer, OCP study, third-party reproductions) shows NVFP4 closer to FP8 than MXFP4 is, by a comfortable margin. The two-level scaling buys back most of what 32-element E8M0 blocks throw away.

For portability — no. MXFP4 is an open, royalty-free OCP standard implemented (or planned) on multiple vendors' silicon; NVFP4 is NVIDIA-only and runs only on Blackwell.

For bytes — MXFP4 wins. 4.25 vs 4.5 effective bits per element — about 5–6% smaller weight files for MXFP4.

Property | MXFP4 | NVFP4
Standard / governance | OCP MX v1.0 (Sept 2023) — open, royalty-free, multi-vendor | NVIDIA-proprietary — no public spec document
Implemented on which silicon? | NVIDIA Blackwell today; AMD CDNA 4, Intel Gaudi successors, Meta MTIA, AWS Trainium on roadmap | NVIDIA Blackwell only (B100/B200/GB200/RTX 50/RTX PRO 6000/GB10)
Element | FP4 E2M1 | FP4 E2M1
Block size | 32 | 16
Block scale | E8M0 (8 bits, power of two) | FP8 E4M3 (8 bits, real float)
Per-tensor scale | None (one level only) | FP32 (managed by TE v2)
Effective bits/element | 4.25 | ~4.5
Tensor-core support | Blackwell 5th-gen MMA | Blackwell 5th-gen MMA
PFLOPS on B200 (dense / sparse) | ~9 / ~18 | ~9 / ~18
Quality vs FP8 (Llama-3 70B) | ~0.3–0.6 PPL gap | ~0.05–0.15 PPL gap
Best fit | Cross-vendor inference, smallest footprint | Training and quality-critical inference on NVIDIA

Bytes for one 70 B-parameter weight tensor

70 B params: FP32 280 GB · FP16/BF16 140 GB · FP8 70 GB · NVFP4 39 GB · MXFP4 37 GB.
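
The whole chart is one multiply. A sketch reproducing the figures to rounding (GB = 10^9 bytes, matching the deck's arithmetic; the 4-bit entries use the effective bits from sections 05–06):

formats = {"FP32": 32, "FP16/BF16": 16, "FP8": 8, "NVFP4": 4.5, "MXFP4": 4.25}
for name, bits in formats.items():
    print(f"{name:10s} {70e9 * bits / 8 / 1e9:6.1f} GB")   # 280, 140, 70, 39.4, 37.2
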
08

Where it lives — Blackwell tensor cores & Transformer Engine v2

5th-gen tensor cores

The MMA pipeline gains an FP4 / FP6 / NVFP4 datapath. The dot-product unit performs 4-bit × 4-bit → FP32 partial product with the block scales applied at accumulation. Sparsity (2:4) doubles peak again for FP4.

PTX/SASS exposes the new MMA shapes via Blackwell's tcgen05.mma family (the 5th-gen tensor-core MMA op, replacing Hopper's wgmma), with kind qualifiers for the block-scaled formats: .kind::mxf4 for MXFP4, .kind::mxf4nvf4 covering NVFP4, and .kind::mxf8f6f4 for the MXFP8 / MXFP6 paths.

Transformer Engine v2

Manages dynamic per-tensor scales for NVFP4 the same way TE v1 did for FP8 on Hopper — tracks max-abs across forward/backward passes, picks per-tensor FP32 scales, schedules cast kernels. The block-scale path (E8M0 for MX, E4M3 for NVFP4) is computed inline by the cast kernel.

Cast → MMA → Accumulate (Blackwell, NVFP4 path): BF16/FP16 RMSNorm output → TE v2 cast kernel (find max-abs per 16 elements, pick FP8 block scale; amax history feeds recipe.fp8_format = NVFP4) → NVFP4 tile (FP32 + 8 × FP8 + 128 × FP4, interleaved scale + mantissa memory layout) → 5th-gen MMA (A × Bᵀ, FP32 accumulate).

API surface

Transformer Engine, Blackwell, NVFP4 inference
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling

recipe = DelayedScaling(fp8_format=Format.NVFP4,  # NVFP4 in place of an FP8 format
                        amax_history_len=16,      # rolling window of per-tensor amax
                        amax_compute_algo="max")  # take the max over the window

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = model(input_ids)              # NVFP4 tensor cores light up
TensorRT-LLM build with MXFP4 weight-only
trtllm-build --checkpoint_dir llama3-70b-fp16 \
             --quantization mxfp4 \
             --output_dir engines/llama3-70b-mxfp4 \
             --gpus 1 \
             --max_seq_len 8192
09

Tensor-core throughput across architectures

Peak dense math per GPU, by format. The point is not the absolute numbers (vendor TFLOPs are always optimistic) but the shape of the curve: each generation pushes one new low-precision format into the tensor core, and its peak roughly doubles each time.

Peak dense TFLOPs per chip — log₂ scale, axis at top

V100 FP16 125 TF · A100 FP16 312 TF · A100 INT8 624 TOPS · H100 FP16 990 TF · H100 FP8 1,979 TF · B200 FP16 2,250 TF · B200 FP8 4,500 TF · B200 MXFP6 4,500 TF · B200 MXFP4 9,000 TF · B200 NVFP4 9,000 TF · GB10 (Spark) FP4 ~1,000 TF.

Equal-width steps in this chart = equal multiplicative jumps in throughput. The H100 FP16→FP8 jump, the B200 FP16→FP8 jump, and the B200 FP8→FP4 jump are all the same size on the log scale — that is the "halving bits per element doubles peak math" rule made visible. Sparse (2:4) tensor cores would add a fourth identical step on Blackwell. INT4 / INT8 tensor-core support continued from Turing/Ampere into Hopper but is no longer a quality-leading choice for LLMs — FP4-with-block-scaling is.

Read carefully

Hopper has no native FP4 tensor cores and no MX support — if you serve MXFP4 weights on H100, the engine will dequantise to FP8 or FP16 before MMA. You get the bandwidth win, not the compute win. To collect both, you need Blackwell (B100/B200/RTX 50/RTX PRO 6000/GB10).

10

Quality — FP, BFP and integer compared

Representative WikiText-2 perplexity on Llama-3 70 B Instruct (lower is better). Numbers from NVIDIA Model Optimizer benchmarks plus published OpenCompute MX studies; absolute values vary with calibration data and tokeniser — the deltas are the point.

Format | Perplexity | Bytes/param
FP16 (reference) | 5.84 | 2.0
FP8 (E4M3) | 5.85 | 1.0
NVFP4 | 5.92 | ~0.56
MXFP4 | 6.18 | ~0.53
MXFP6 (E2M3) | 5.93 | ~0.78
AWQ INT4 (g128) | 6.21 | ~0.55
GPTQ INT4 | 6.34 | ~0.55
GGUF q4_K_M | 6.41 | ~0.6
INT4 PTQ (naive) | 7.85 | 0.5
11

Bytes-per-parameter & bandwidth wins

HBM bandwidth as a clock

B200 has 8 TB/s HBM3e. A single forward pass of 70 B params at FP16 (140 GB) takes a hard minimum of 17.5 ms. At NVFP4 (39 GB) the floor drops to 4.9 ms. Decode tok/s scales the same way.
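
That floor is pure division. A two-line check against the quoted 8 TB/s:

HBM_BYTES_PER_S = 8e12                          # B200 HBM3e, vendor figure
for name, nbytes in (("FP16", 140e9), ("NVFP4", 39e9)):
    print(f"{name}: {1e3 * nbytes / HBM_BYTES_PER_S:.1f} ms per full weight read")
# FP16: 17.5 ms, NVFP4: 4.9 ms; the latency floor before any compute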

Concurrency

NVFP4 frees ~100 GB of HBM on a single B200, which is ~2 M tokens of paged KV-cache at FP16 KV. That is the difference between 4 concurrent users and 64.

Energy per token

Memory traffic dominates inference energy. 4× smaller weights → ~3.5× lower J/token at fixed batch size on Blackwell, before any compute savings.

Compute headroom

9 PFLOPS dense (18 PFLOPS sparse) FP4 on B200 means a 70 B model's prefill is essentially free even at long context. The bottleneck is back to activation bandwidth and KV access, not weight loading.

What "activation bandwidth" means

Activation bandwidth = the HBM bytes per token spent reading and writing the inter-layer activations — the tensors that flow between operators (residual stream, post-RMSNorm input to each linear, attention scores, GLU intermediates), as distinct from the weight bytes (the static parameters) and the KV-cache bytes (the per-token attention state).

For one transformer layer at hidden size H, batch B, sequence S, activation dtype d_act bytes, you write ≈ B × S × H × d_act bytes back to HBM and read them again next layer. With weights at FP16 this is small relative to weight reads; with weights at NVFP4 it's now comparable — weights dropped to ~0.56 bytes/param while activations are usually still BF16/FP16 (2 bytes/element) for accuracy. Above some sequence length the activation-bandwidth term overtakes weight bandwidth, and the classic "decode is weight-bound" intuition breaks. NVIDIA's FP8 / NVFP4 activation casting in Transformer Engine v2 is partly to address this; not all layers benefit yet.
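
A back-of-envelope version of that crossover, with illustrative Llama-70B-ish numbers (H = 8192, 80 layers — assumptions for the sketch, not measurements):

H, LAYERS, D_ACT = 8192, 80, 2                     # hidden size, layers, BF16 bytes
weight_bytes_per_layer = 70e9 / LAYERS * 4.5 / 8   # NVFP4, ~0.49 GB per layer

def activation_bytes_per_layer(batch, seq):
    # one write + one read of the batch x seq x H residual stream, as in the text
    return 2 * batch * seq * H * D_ACT

# tokens in flight (batch x seq) where activation traffic matches weight traffic
print(f"crossover: ~{weight_bytes_per_layer / (2 * H * D_ACT):,.0f} tokens")   # ~15,000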

Effective bits/param across formats — log₂ scale, axis at top

FP16/BF16 16.00 · MXFP8 8.25 · FP8 8.00 · MXFP6 E2M3 6.25 · GGUF q4_K_M ~4.80 · NVFP4 4.50 · MXFP4 4.25 · INT4 (g128) 4.13; the FP16→FP8 and FP8→FP4 halvings span equal intervals on the log₂ axis.

Equal-width steps = equal multiplicative jumps. The two bracketed intervals at the bottom (FP16→FP8 and FP8→FP4) are the same length — that is the "halving bits/param" rule made visible. FP32 is omitted; it would extend off the right of the chart at 32 bits/param. The four 4-bit formats are clustered between the "4" and "5" gridlines because they all live in the same bit-decade — but the spread (INT4 4.13 < MXFP4 4.25 < NVFP4 4.50 < GGUF ~4.80) is large enough to read directly. Every fraction of a bit per parameter still matters at 70 B–671 B scale: ~1.4 GB per 0.16 bits/param at 70 B.

12

Interactive: decode a BFP4 block

Drive the encoder yourself. Drag the four FP4 mantissas and the block scale; the panel decodes them through the BFP rules and shows the actual real-number values, plus what the same block would look like under MXFP4 vs NVFP4 vs naive INT4 quantisation of the same numbers.

Example widget state — scales 4.00 and 1.00; elements 010 = +1.0, 011 = +1.5, 110 = −4.0, 101 = +3.0 — decoded in the sketch below.
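
Decoding that snapshot by hand, reusing decode_e2m1 from section 04 (reading 4.00 as the block scale and 1.00 as the tensor scale is my interpretation of the widget):

codes = [0b0010, 0b0011, 0b1110, 0b0101]     # +1.0, +1.5, -4.0, +3.0
block_scale, tensor_scale = 4.0, 1.0
print([tensor_scale * block_scale * decode_e2m1(c) for c in codes])
# -> [4.0, 6.0, -16.0, 12.0]
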
What you should notice

Move element 2 to the largest negative (code 15 → -6) and the block scale to a small power of two. Then switch to MXFP4 and back to NVFP4. Watch the same fixed mantissas change effective values because the scale resolution changes. Switch to naive INT4 and notice how a single per-tensor scale crushes the dynamic range that the block-shared scale was protecting.

13

Take-aways

  • BFP4 halves bytes and doubles peak tensor-core math versus FP8 by sharing one scale per block and keeping the per-element encoding at 4-bit floating point.
  • MXFP4: the open OCP standard, with 32-element blocks, a power-of-two E8M0 scale, and 4.25 effective bits/param; the portable, smallest-footprint choice.
  • NVFP4: NVIDIA-proprietary, with 16-element blocks, a real-valued FP8 E4M3 block scale plus an FP32 tensor scale, and ~4.5 bits/param; consistently closer to FP8 quality, and viable for training.
  • Both formats need Blackwell's 5th-gen tensor cores to collect the compute win; on earlier GPUs you only get the bandwidth win.

Where this fits in the series

This deck zooms into the bit-level mechanics of the FP4 formats that several earlier decks reference at a higher level — Tensor Cores for the MMA pipeline, Blackwell and Inside Blackwell for the SM context, Quantization for Local Hosting for the practical INT4-vs-FP4 trade-off, and Hardware-Aware Quantisation for the broader number-format landscape.

14

References & further reading

NVIDIA — NVFP4 and Blackwell

  1. NVIDIA Developer Blog — Introducing NVFP4 for Efficient and Accurate Low-Precision Inference. The primary public reference for NVFP4: format definition (16-element block, FP8 micro-scale, FP32 macro-scale), measured perplexity vs MXFP4 and FP8 on Llama-family models, Transformer Engine integration.
    developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference
  2. NVIDIA — Pretraining Large Language Models with NVFP4 (research paper, 2025). Training-time stability results, gradient handling, two-level scale management.
    arxiv.org/abs/2509.25149
  3. NVIDIA — Blackwell Architecture Technical Brief (March 2024). 5th-gen tensor cores, MX-FP4 / MX-FP6 datapaths, 2nd-gen Transformer Engine, NVL72.
    resources.nvidia.com/en-us-blackwell-architecture
  4. NVIDIA Transformer Engine documentation — FP8 / NVFP4 recipe API, DelayedScaling, Format.NVFP4.
    docs.nvidia.com/deeplearning/transformer-engine

Open Compute Project — MX spec

  5. OCP — Microscaling Formats (MX) v1.0 Specification (September 2023). The open multi-vendor spec defining MXFP8, MXFP6, MXFP4, MXINT8 with E8M0 shared scale and 32-element blocks. Contributed by AMD, Arm, Intel, Meta, Microsoft, NVIDIA, Qualcomm.
    opencompute.org — OCP Microscaling Formats (MX) v1.0 Spec
  6. Rouhani et al. — Microscaling Data Formats for Deep Learning (2023). The Microsoft / OCP study showing MX formats' accuracy vs FP8 and BF16 baselines across LLMs, vision and recommendation models.
    arxiv.org/abs/2310.10537

Background — FP8, BFP, integer quantisation

  7. Micikevicius et al. — FP8 Formats for Deep Learning (NVIDIA / Arm / Intel, 2022). Justifies the E4M3 + E5M2 split, defines the per-tensor scale workflow that NVFP4's macro-scale builds on.
    arxiv.org/abs/2209.05433
  8. Microsoft Research — Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with Microsoft Floating Point (NeurIPS 2020). The MSFP precursor to MX — first production deployment of block-shared exponents for transformer inference.
    microsoft.com/en-us/research — MSFP paper
  9. Frantar et al. — GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (ICLR 2023). The INT4 baseline NVFP4 / MXFP4 are usually compared to.
    arxiv.org/abs/2210.17323
  10. Lin et al. — AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (MLSys 2024). The other INT4 baseline.
    arxiv.org/abs/2306.00978

In-text citation: [1] in the NVFP4 “What that buys you” bullet on slide 06 points to reference 1 above (the NVFP4 developer-blog post). The numerical comparisons on slide 10 draw on references 1, 2 and 6.