Local LLM Hosting Series — Presentation 07

Quantization for Local Hosting

GGUF, AWQ, GPTQ, FP8, INT4, MX-FP4 — how each format works, what you give up, which framework supports which, and how to pick.

GGUF · AWQ · GPTQ · FP8 · INT4 · Perplexity

00

Topics

Why quantise · Number formats · Calibration · Formats · Quality · Pick

01

Why Quantise

Modern decoding is memory-bandwidth bound: the main cost per token is reading all the weights from HBM into the SMs. Halve the weight bytes, double (roughly) the tokens/second. Quantising is the biggest lever you have on local hardware.

Fit

70B FP16 = 140 GB. 70B INT4 = 40 GB — now it fits on a 48 GB card.

Speed

Decode tok/s scales roughly with 1/bytes-per-param. INT4 is > 2× FP16 on the same GPU.

Concurrency

Smaller weights leave more VRAM for the KV pool — more parallel sessions.
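A back-of-envelope sketch of the fit and speed claims above, assuming decode is purely bandwidth-bound; the bandwidth figure is illustrative and real throughput lands lower once KV reads and kernel overheads are counted:

# Rough fit / decode-speed estimate, assuming every token reads all weights once.
# Numbers are illustrative; real throughput is lower (KV reads, kernel overhead).

def weight_gb(params_b: float, bits_per_param: float) -> float:
    """Weight footprint in GB for a model with params_b billion parameters."""
    return params_b * 1e9 * bits_per_param / 8 / 1e9

def decode_tok_s(params_b: float, bits_per_param: float, hbm_gb_s: float) -> float:
    """Upper bound on single-stream decode tokens/s: bandwidth / bytes-per-token."""
    return hbm_gb_s / weight_gb(params_b, bits_per_param)

HBM = 3350  # GB/s — roughly H100 SXM-class HBM bandwidth, purely illustrative
for name, bits in [("FP16", 16), ("FP8", 8), ("INT4 (q4_K_M eff.)", 4.8)]:
    print(f"{name:>20}: {weight_gb(70, bits):6.1f} GB weights, "
          f"<= {decode_tok_s(70, bits, HBM):5.0f} tok/s ceiling")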

What you do not get

Quantisation does not reduce compute on GPUs that can't natively multiply in that precision: INT4 weights are still upcast to FP16/BF16 on Ampere for the matmul, so the win there is purely memory bandwidth. Only on hardware with native low-bit tensor cores (FP8 on Hopper, FP4 on Blackwell) does the compute get faster as well.

02

Number Formats, Visual

Bit layouts — sign / exponent / mantissa:

FP32 (32b): s1 e8 m23
FP16 (16b): s1 e5 m10
BF16 (16b): s1 e8 m7 — FP32 range, FP16 size
FP8 E4M3 (8b): s1 e4 m3 — weights
FP8 E5M2 (8b): s1 e5 m2 — gradients/activations
FP4 E2M1 (4b): s1 e2 m1 — Blackwell tensor cores (MX-FP4)
INT8 (8b): integer, needs scale + zero-point
INT4 (4b): AWQ, GPTQ, GGUF q4
INT2 (2b): experimental — GGUF q2, extreme
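To make the range-vs-precision trade-off concrete, here is what PyTorch reports for each float format (a sketch; the FP8 dtypes need a reasonably recent PyTorch, roughly 2.1+):

import torch

# Range vs precision for the float formats above.
# BF16 keeps FP32's 8-bit exponent (range) but only 7 mantissa bits;
# FP16 keeps more mantissa but overflows past ~65k.
for dt in (torch.float32, torch.float16, torch.bfloat16,
           torch.float8_e4m3fn, torch.float8_e5m2):
    fi = torch.finfo(dt)
    print(f"{str(dt):22} max={fi.max:>12.4g}  "
          f"smallest normal={fi.tiny:.3g}  eps={fi.eps:.3g}")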
03

PTQ vs Calibration-aware vs QAT

PTQ naive

Round every weight to the target grid. Fast, no data needed. Quality drops a lot below 8 bits. This is what pure "convert to q4_0" does.

Calibration-aware

Run a small calibration set (~512 samples) through the model and pick per-tensor or per-channel scales that minimise activation error. GPTQ, AWQ, SmoothQuant, FP8-dynamic all live here.

QAT

Quantisation-aware training. Best quality but requires re-training. Rarely used for LLMs — calibration-aware PTQ is already close enough.
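A toy sketch of why scale granularity matters — this is not the GPTQ or AWQ algorithm, just naive per-tensor rounding versus per-channel scales, scored on a synthetic calibration batch; the real methods additionally use the calibration activations to pick and compensate the scales:

import torch

def fake_quant_int4(w: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Round to a symmetric INT4 grid [-8, 7] with the given scale, then dequantise."""
    return torch.clamp(torch.round(w / scale), -8, 7) * scale

torch.manual_seed(0)
W = torch.randn(1024, 1024) * 0.02      # toy linear-layer weight [out, in]
W[:, :16] *= 25                         # a handful of large-magnitude channels
X = torch.randn(512, 1024)              # toy calibration batch

# Naive PTQ: one scale for the whole tensor — the outlier channels set the grid,
# so every other weight is rounded very coarsely.
W_naive = fake_quant_int4(W, W.abs().max() / 7)

# Finer-grained scales (one per input channel): the first step of what
# calibration-aware methods build on.
per_ch = W.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 7
W_perch = fake_quant_int4(W, per_ch)

ref = X @ W.T
for name, Wq in [("per-tensor (naive)", W_naive), ("per-channel", W_perch)]:
    rel = ((X @ Wq.T - ref).pow(2).mean() / ref.pow(2).mean()).item()
    print(f"{name:>20}: relative output error {rel:.4%}")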

What AWQ does differently from GPTQ

GPTQ quantises each linear layer column by column, using approximate second-order (Hessian) information from the calibration activations to adjust the not-yet-quantised weights and compensate each rounding error. AWQ never updates weights: it uses the calibration activations to identify the small fraction of salient weight channels, scales them up before rounding so they lose less precision, and folds the inverse scale into the preceding op. In practice AWQ is quicker to run and less sensitive to the choice of calibration data.

04

GGUF — The llama.cpp Family

GGUF is a file format (not a single quantisation) that packs weights, metadata, tokeniser, and quantisation type in one blob. Within GGUF, llama.cpp defines a large family of quant variants.

Quant | Bits (eff.) | How | Use
q8_0 | 8.5 | INT8, one scale per block of 32 | Near-FP16 quality, 2× smaller
q6_K | 6.6 | 6-bit K-quant with per-block scale & min | Excellent quality / size balance
q5_K_M | 5.7 | 5-bit K-quant, mixed groups | Sweet spot for 13–30B models
q4_K_M | 4.8 | 4-bit K-quant, mixed | The "default" tag on Ollama
q4_0 | 4.5 | 4-bit, simpler blocks | Faster on some CPUs; slightly lower quality
q3_K_M | 3.9 | 3-bit K-quant | Tight VRAM; noticeable quality loss
q2_K | 2.6 | 2-bit K-quant | Emergency fit only
IQ4_NL | ~4.4 | Non-linear grid + importance matrix | Best 4-bit quality in GGUF today
IQ2_XXS | ~2.1 | Importance-weighted 2-bit | Runs 70B in 16 GB if you must
K-quant intuition

"K-quant" groups 256 weights into a super-block, stores one FP16 scale and one FP16 min for the super-block, and then per-block (32 weights) quantised values with a 6-bit scale. Average effective bit count is above the name: q4_K_M is not 4 bits; it's about 4.8.

05

AWQ & GPTQ — Learned INT4

Quantise Llama-3.1-8B to INT4 (W4A16) with llm-compressor — SmoothQuant + GPTQ recipe
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear",
                 scheme="W4A16",       # weights INT4, activations FP16
                 ignore=["lm_head"]),
]
oneshot(model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        dataset="ultrachat_200k", recipe=recipe,
        num_calibration_samples=512, max_seq_length=2048)

Serving AWQ / GPTQ

vLLM, TGI, LMDeploy, SGLang all accept the quantised checkpoint directly: --quantization awq or --quantization gptq. On Ampere / Ada / Hopper the INT4 weights are unpacked on the fly into FP16/BF16 for the matmul (Marlin-style fused dequant kernels), so the gain is bandwidth; only hardware with native 4-bit tensor cores (Blackwell) speeds up the matmul itself.
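The same thing from vLLM's offline Python API — a sketch, with the checkpoint name a placeholder for whichever AWQ export you produced or downloaded:

from vllm import LLM, SamplingParams

# Load a pre-quantised AWQ checkpoint; vLLM picks the INT4 kernels automatically,
# but you can force the path with quantization="awq".
llm = LLM(model="path/or/hub-id-of-your-awq-int4-checkpoint",   # placeholder
          quantization="awq",
          gpu_memory_utilization=0.90)

out = llm.generate(["Explain KV-cache quantisation in one sentence."],
                   SamplingParams(max_tokens=64, temperature=0.2))
print(out[0].outputs[0].text)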

What to expect

  • ~4× smaller than FP16
  • ~1.7–2× decode speedup vs FP16 on same GPU
  • Perplexity penalty <1% on well-calibrated AWQ-INT4; ~1–3% on GPTQ depending on calibration data
  • First output is often subtly different from FP16 — see deck 09 on determinism
06

FP8 & Beyond — Native GPU Formats

FP8

H100 / H200 / B100 / B200 / Spark. Two sub-formats: E4M3 (weights) and E5M2 (activations/gradients). Native tensor-core support → compute also faster, not just bandwidth. Quality loss is tiny (typically < 0.3% perplexity) with a calibration step.

Serving: --quantization fp8 on vLLM quantises on the fly; or pre-quantise with llm-compressor or NVIDIA's Model Optimizer and load the FP8 checkpoint with --dtype auto.
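The llm-compressor path mirrors the INT4 recipe two slides back, just with an FP8 scheme; a sketch, using the dynamic-activation variant that needs no calibration dataset:

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# FP8 weights, dynamic per-token FP8 activations; lm_head stays high precision.
recipe = QuantizationModifier(targets="Linear",
                              scheme="FP8_DYNAMIC",
                              ignore=["lm_head"])
oneshot(model="meta-llama/Meta-Llama-3.1-8B-Instruct", recipe=recipe)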

FP4 / MX-FP4

Blackwell introduces native 4-bit floating point with per-block microscaling (MX format — Open Compute spec). Real 2×-over-FP8 inference on supported kernels. vLLM 0.7+, TensorRT-LLM 0.12+, SGLang.

Quality with good calibration is comparable to INT4; sometimes better because the float grid handles outliers more gracefully than integer.

KV-cache quantisation

Often forgotten: you can quantise the KV cache separately from the weights. vLLM supports --kv-cache-dtype fp8. For long contexts this alone can halve cache memory with negligible quality impact, letting you double concurrency or context length.
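Back-of-envelope for what that buys, assuming Llama-3.1-70B-style attention geometry (80 layers, 8 KV heads of dimension 128):

# KV-cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/value
def kv_gb(tokens: int, bytes_per_value: int,
          layers: int = 80, kv_heads: int = 8, head_dim: int = 128) -> float:
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9

for name, b in [("FP16 KV", 2), ("FP8 KV", 1)]:
    print(f"{name}: {kv_gb(128_000, b):.1f} GB for one 128k-token context")
# ~42 GB vs ~21 GB — the difference is more concurrent sequences or longer
# context in the same VRAM.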

FP8 weights + FP8 KV cache
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct-FP8 \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --calculate-kv-scales \
    --gpu-memory-utilization 0.92
07

Quality Impact — Real Perplexity Numbers

Representative WikiText-2 perplexity for Llama-3-8B-Instruct (lower is better). Absolute numbers depend on tokeniser/eval setup; the deltas are the point.

Perplexity and weight size — Llama-3-8B

Format | Perplexity | Weights
FP16 (ref) | 6.12 | 16.0 GB
FP8 (W8A8) | 6.14 | 8.0 GB
AWQ-INT4 | 6.18 | 4.5 GB
GPTQ-INT4 | 6.25 | 4.5 GB
GGUF q4_K_M | 6.27 | 4.8 GB
GGUF q3_K_M | 6.38 | 3.9 GB
GGUF q2_K | 6.72 | 2.6 GB
Practical rule

Above 5 bits eff. you almost can't tell. Below 3 bits you definitely can. The four formats FP8, AWQ-INT4, q4_K_M, q5_K_M cover 95% of what anyone should run locally.

08

Interactive: Pick a Format
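The live deck has an interactive picker here; condensed into code, the same decision logic from the cheat-sheet and take-away below looks roughly like this (a toy sketch — hardware labels and thresholds are simplified assumptions):

def pick_format(hardware: str, vram_gb: float, model_gb_fp16: float) -> str:
    """Toy decision helper mirroring the deck's recommendations."""
    if hardware in ("H100", "H200", "B100", "B200", "Spark"):
        return "FP8 weights + FP8 KV cache (vLLM / TensorRT-LLM)"
    if hardware == "consumer-nvidia":
        # AWQ-INT4 if the ~3.5x-smaller weights leave headroom, else shrink further
        return ("AWQ-INT4 (vLLM / TGI / LMDeploy)"
                if model_gb_fp16 / 3.5 <= vram_gb * 0.8
                else "smaller model, or GGUF q3_K_M as a last resort")
    if hardware in ("apple-silicon", "cpu"):
        return "GGUF q5_K_M (or q4_K_M if it doesn't fit)"
    return "FP16/BF16 — quantise only if you need the room"

print(pick_format("consumer-nvidia", vram_gb=24, model_gb_fp16=16))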

09

Framework Support Cheat-sheet

Format | vLLM | TGI | SGLang | TRT-LLM | Ollama / llama.cpp | LMDeploy
FP16 / BF16
FP8 W8A8 — ✓ (H100+)
MX-FP4 — ✓ (B200)
AWQ INT4 — native
GPTQ INT4
GGUF — ✓ (read-only), native
KV-cache FP8 — via q8_0 KV
Take-away

If you're on NVIDIA H100+ and willing to do the calibration work, FP8 weights + FP8 KV is almost always the right production choice. For consumer GPUs, AWQ-INT4. For Apple Silicon / CPU, GGUF q5_K_M or q4_K_M. Everything else is tuning.