GGUF, AWQ, GPTQ, FP8, INT4, MX-FP4 — how each format works, what you give up, which framework supports which, and how to pick.
Modern decoding is memory-bandwidth bound: the main cost per token is reading all the weights from HBM into the SMs. Halve the weight bytes, double (roughly) the tokens/second. Quantising is the biggest lever you have on local hardware.
- 70B FP16 = 140 GB. 70B at ~4.5 effective bits ≈ 40 GB — now it fits on a 48 GB card.
- Decode tok/s scales roughly with 1/bytes-per-param, so INT4 is typically > 2× FP16 on the same GPU.
- Smaller weights leave more VRAM for the KV-cache pool — more parallel sessions.
Caveat: quantisation does not reduce compute on GPUs that can't natively multiply in that precision. On Ampere, INT4 weights are still dequantised to FP16/BF16 for the matmul; the win is purely memory bandwidth. On Blackwell's native FP4/INT4 tensor cores, the compute also gets faster.
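A back-of-envelope roofline makes the bandwidth point concrete. This is a rough sketch (it ignores KV-cache reads, kernel efficiency, and batching), but the ratios are the story:

```python
# Decode-speed ceiling: each new token streams every weight byte from VRAM once,
# so tok/s can't exceed memory_bandwidth / weight_bytes.
def decode_ceiling_tok_s(params_b: float, bits_per_param: float,
                         bandwidth_gb_s: float) -> float:
    weight_gb = params_b * bits_per_param / 8   # GB read per generated token
    return bandwidth_gb_s / weight_gb

# Illustrative numbers: an 8B model on a ~1 TB/s consumer card.
for name, bits in [("FP16", 16), ("q8_0", 8.5), ("q4_K_M", 4.8)]:
    print(f"{name:7s} ceiling ≈ {decode_ceiling_tok_s(8, bits, 1000):6.1f} tok/s")
```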
Round-to-nearest (RTN): round every weight to the target grid. Fast, no data needed; quality drops sharply below 8 bits. This is what a plain "convert to q4_0" does (sketched below).
Calibration-based PTQ: run a small calibration set (~512 samples) through the model and pick per-tensor or per-channel scales that minimise activation error. GPTQ, AWQ, SmoothQuant and FP8-dynamic all live here.
Quantisation-aware training (QAT): best quality, but requires re-training. Rarely used for LLMs — calibration-based PTQ is already close enough.
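For intuition, the round-to-nearest approach fits in a few lines. A minimal numpy sketch of symmetric per-block 4-bit quantisation (not the actual llama.cpp or GPTQ kernel):

```python
import numpy as np

def quantize_rtn_int4(w: np.ndarray, block: int = 32):
    """Symmetric round-to-nearest INT4: one FP16 scale per block of 32 weights."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # map the largest |w| onto the INT4 grid [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale.astype(np.float32)

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_rtn_int4(w)
err = np.abs(dequantize(q, s) - w.reshape(-1, 32)).mean()
print(f"mean abs rounding error: {err:.4f}")   # no calibration data anywhere in this process
```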
GGUF is a file format (not a single quantisation) that packs weights, metadata, tokeniser, and quantisation type in one blob. Within GGUF, llama.cpp defines a large family of quant variants.
| Quant | Bits (eff.) | How | Use |
|---|---|---|---|
| q8_0 | 8.5 | INT8, scalar per block of 32 | Near-FP16 quality, 2× smaller |
| q6_K | 6.6 | 6-bit, K-quant with per-block scale & min | Excellent quality / size balance |
| q5_K_M | 5.7 | 5-bit K-quant, mixed groups | Sweet spot for 13–30B models |
| q4_K_M | 4.8 | 4-bit K-quant, mixed | The "default" tag on Ollama |
| q4_0 | 4.5 | 4-bit, simpler blocks | Faster on some CPUs; slightly lower quality |
| q3_K_M | 3.9 | 3-bit K-quant | Tight VRAM; noticeable quality loss |
| q2_K | 2.6 | 2-bit K-quant | Emergency fit only |
| IQ4_NL | ~4.4 | 4-bit non-linear grid, usually with imatrix calibration | Best 4-bit quality in GGUF today |
| IQ2_XXS | ~2.1 | Importance-weighted 2-bit | Runs 70B in 16 GB if you must |
"K-quant" groups 256 weights into a super-block, stores one FP16 scale and one FP16 min for the super-block, and then per-block (32 weights) quantised values with a 6-bit scale. Average effective bit count is above the name: q4_K_M is not 4 bits; it's about 4.8.
With llm-compressor, a SmoothQuant + GPTQ W4A16 recipe looks like this:

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear",
                 scheme="W4A16",        # weights INT4, activations FP16
                 ignore=["lm_head"]),
]

oneshot(model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        dataset="ultrachat_200k", recipe=recipe,
        num_calibration_samples=512, max_seq_length=2048)
```
vLLM, TGI, LMDeploy and SGLang all accept the quantised checkpoint directly; pass --quantization awq or --quantization gptq. Weights are unpacked on the fly into FP16/BF16 for the matmul on Ampere/Ada, or kept in low precision where the tensor cores handle it natively (Hopper/Blackwell).
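Loading such a checkpoint through vLLM's Python API looks like this (the repo name below is a placeholder; substitute any AWQ-quantised checkpoint):

```python
from vllm import LLM, SamplingParams

# quantization is normally auto-detected from the checkpoint config;
# passing it explicitly just makes the intent obvious.
llm = LLM(model="your-org/Llama-3.1-8B-Instruct-AWQ",   # placeholder repo name
          quantization="awq",
          gpu_memory_utilization=0.90)

out = llm.generate(["Explain AWQ in one sentence."],
                   SamplingParams(max_tokens=64, temperature=0.2))
print(out[0].outputs[0].text)
```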
H100 / H200 / B100 / B200 / DGX Spark. Two sub-formats: E4M3 (weights and activations at inference) and E5M2 (more exponent range, mainly used for gradients in training). Native tensor-core support means the compute also gets faster, not just the bandwidth. Quality loss is tiny (typically < 0.3% perplexity increase) with a calibration step.
Serving: --quantization fp8 on vLLM quantises on the fly; for best quality, pre-quantise with llm-compressor or NVIDIA's TensorRT Model Optimizer.
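The pre-quantisation recipe mirrors the INT4 one above; a sketch assuming llm-compressor's FP8_DYNAMIC scheme (FP8 weights plus dynamic per-token activation scales, so no calibration set is passed):

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# FP8_DYNAMIC: FP8 (E4M3) weights, activations quantised dynamically per token
# at runtime, so no calibration dataset appears below.
recipe = QuantizationModifier(targets="Linear",
                              scheme="FP8_DYNAMIC",
                              ignore=["lm_head"])

oneshot(model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        recipe=recipe,
        output_dir="Llama-3.1-8B-Instruct-FP8-dynamic")
```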
Blackwell introduces native 4-bit floating point with per-block microscaling (MX format — Open Compute spec). Real 2×-over-FP8 inference on supported kernels. vLLM 0.7+, TensorRT-LLM 0.12+, SGLang.
Quality with good calibration is comparable to INT4; sometimes better because the float grid handles outliers more gracefully than integer.
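For a sense of the layout (per the OCP MX spec, a block is 32 FP4 E2M1 values sharing one 8-bit E8M0 power-of-two scale), the storage cost works out to:

```python
# MXFP4 block per the OCP Microscaling spec:
#   32 elements × 4-bit E2M1 values + one shared 8-bit E8M0 scale
block_bits = 32 * 4 + 8
print(block_bits / 32)         # 4.25 effective bits per weight
print(8 / (block_bits / 32))   # ≈ 1.88, i.e. close to the "2× over FP8" headline
```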
Often forgotten: you can quantise the KV cache separately from the weights. vLLM supports --kv-cache-dtype fp8. For long contexts this alone can halve cache memory with negligible quality impact, letting you double concurrency or context length.
```bash
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct-FP8 \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --calculate-kv-scales \
    --gpu-memory-utilization 0.92
```
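How much that buys (a back-of-envelope sketch using Llama-3.1-70B's shape: 80 layers, 8 KV heads of dimension 128 under GQA):

```python
# KV-cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element
layers, kv_heads, head_dim = 80, 8, 128    # Llama-3.1-70B

def kv_cache_gb(context_tokens: int, bytes_per_element: int) -> float:
    return 2 * layers * kv_heads * head_dim * bytes_per_element * context_tokens / 1e9

for dtype, nbytes in [("fp16", 2), ("fp8", 1)]:
    print(f"{dtype}: {kv_cache_gb(32_768, nbytes):.1f} GB for one 32k-token sequence")
# fp16 ≈ 10.7 GB, fp8 ≈ 5.4 GB; the freed VRAM goes straight back to the paged KV pool
```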
Representative WikiText-2 perplexity for Llama-3-8B-Instruct (lower is better). Absolute numbers depend on tokeniser/eval setup; the deltas are the point.
Above about 5 effective bits you almost can't tell. Below 3 bits you definitely can. Four formats (FP8, AWQ-INT4, q5_K_M, q4_K_M) cover 95% of what anyone should run locally.
| Format | vLLM | TGI | SGLang | TRT-LLM | Ollama / llama.cpp | LMDeploy |
|---|---|---|---|---|---|---|
| FP16 / BF16 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| FP8 W8A8 | ✓ (H100+) | ✓ | ✓ | ✓ | — | ✓ |
| MX-FP4 | ✓ (B200) | — | ✓ | ✓ | — | — |
| AWQ INT4 | ✓ | ✓ | ✓ | ✓ | — | native |
| GPTQ INT4 | ✓ | ✓ | ✓ | ✓ | — | ✓ |
| GGUF | ✓ (read-only) | — | — | — | native | — |
| KV-cache FP8 | ✓ | — | ✓ | ✓ | via q8_0 KV | ✓ |
If you're on NVIDIA H100+ and willing to do the calibration work, FP8 weights + FP8 KV is almost always the right production choice. For consumer GPUs, AWQ-INT4. For Apple Silicon / CPU, GGUF q5_K_M or q4_K_M. Everything else is tuning.
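That decision, as an illustrative sketch (the hardware buckets and thresholds here are my own shorthand, not output from any tool):

```python
def pick_format(hardware: str, vram_gb: float, model_fp16_gb: float) -> str:
    """Rough encoding of the recommendation above; thresholds are illustrative."""
    if hardware in {"H100", "H200", "B100", "B200"}:
        return "FP8 weights + FP8 KV cache"
    if hardware in {"apple-silicon", "cpu"}:
        # q5_K_M is roughly 0.36x the FP16 size; leave headroom for context
        return "GGUF q5_K_M" if vram_gb >= 0.45 * model_fp16_gb else "GGUF q4_K_M"
    # consumer NVIDIA: 4-bit weight-only, FP8 KV cache if the engine supports it
    return "AWQ-INT4" if vram_gb >= 0.35 * model_fp16_gb else "pick a smaller model"

print(pick_format("RTX 4090", vram_gb=24, model_fp16_gb=16))   # 8B-class model -> AWQ-INT4
```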