GGUF, AWQ, GPTQ, FP8, INT4, MX-FP4 — how each format works, what you give up, which framework supports which, and how to pick.
Modern decoding is memory-bandwidth bound: the main cost per token is reading all the weights from HBM into the SMs. Halve the weight bytes, double (roughly) the tokens/second. Quantising is the biggest lever you have on local hardware.
- 70B FP16 = 140 GB. 70B at ~4.5 effective bits ≈ 40 GB — now it fits on a 48 GB card.
- Decode tok/s scales roughly with 1/bytes-per-param, so INT4 is typically > 2× FP16 on the same GPU.
- Smaller weights leave more VRAM for the KV-cache pool — more parallel sessions.
Caveat: quantisation does not reduce compute on GPUs that can't natively multiply in that precision. On Ampere, INT4 weights are still dequantised to FP16/BF16 for the matmul; the win is purely memory bandwidth. On Blackwell's native FP4/INT4 tensor cores, the compute also gets faster.
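A back-of-envelope roofline makes the bandwidth point concrete. This is a rough sketch (it ignores KV-cache reads, kernel efficiency, and batching), but the ratios are the story:

```python
# Decode-speed ceiling: each new token streams every weight byte from VRAM once,
# so tok/s can't exceed memory_bandwidth / weight_bytes.
def decode_ceiling_tok_s(params_b: float, bits_per_param: float,
                         bandwidth_gb_s: float) -> float:
    weight_gb = params_b * bits_per_param / 8   # GB read per generated token
    return bandwidth_gb_s / weight_gb

# Illustrative numbers: an 8B model on a ~1 TB/s consumer card.
for name, bits in [("FP16", 16), ("q8_0", 8.5), ("q4_K_M", 4.8)]:
    print(f"{name:7s} ceiling ≈ {decode_ceiling_tok_s(8, bits, 1000):6.1f} tok/s")
```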
Round-to-nearest (RTN): round every weight to the target grid. Fast, no data needed; quality drops sharply below 8 bits. This is what a plain "convert to q4_0" does (sketched below).
Calibration-based PTQ: run a small calibration set (~512 samples) through the model and pick per-tensor or per-channel scales that minimise activation error. GPTQ, AWQ, SmoothQuant and FP8-dynamic all live here.
Quantisation-aware training (QAT): best quality, but requires re-training. Rarely used for LLMs — calibration-based PTQ is already close enough.
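For intuition, the round-to-nearest approach fits in a few lines. A minimal numpy sketch of symmetric per-block 4-bit quantisation (not the actual llama.cpp or GPTQ kernel):

```python
import numpy as np

def quantize_rtn_int4(w: np.ndarray, block: int = 32):
    """Symmetric round-to-nearest INT4: one FP16 scale per block of 32 weights."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # map the largest |w| onto the INT4 grid [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale.astype(np.float32)

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_rtn_int4(w)
err = np.abs(dequantize(q, s) - w.reshape(-1, 32)).mean()
print(f"mean abs rounding error: {err:.4f}")   # no calibration data anywhere in this process
```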
GGUF is a file format (not a single quantisation) that packs weights, metadata, tokeniser, and quantisation type in one blob. Within GGUF, llama.cpp defines a large family of quant variants.
| Quant | Bits (eff.) | How | Use |
|---|---|---|---|
| q8_0 | 8.5 | INT8, scalar per block of 32 | Near-FP16 quality, 2× smaller |
| q6_K | 6.6 | 6-bit, K-quant with per-block scale & min | Excellent quality / size balance |
| q5_K_M | 5.7 | 5-bit K-quant, mixed groups | Sweet spot for 13–30B models |
| q4_K_M | 4.8 | 4-bit K-quant, mixed | The "default" tag on Ollama |
| q4_0 | 4.5 | 4-bit, simpler blocks | Faster on some CPUs; slightly lower quality |
| q3_K_M | 3.9 | 3-bit K-quant | Tight VRAM; noticeable quality loss |
| q2_K | 2.6 | 2-bit K-quant | Emergency fit only |
| IQ4_NL | ~4.4 | 4-bit non-linear grid, usually with imatrix calibration | Best 4-bit quality in GGUF today |
| IQ2_XXS | ~2.1 | Importance-weighted 2-bit | Runs 70B in 16 GB if you must |
"K-quant" groups 256 weights into a super-block, stores one FP16 scale and one FP16 min for the super-block, and then per-block (32 weights) quantised values with a 6-bit scale. Average effective bit count is above the name: q4_K_M is not 4 bits; it's about 4.8.
With llm-compressor, a SmoothQuant + GPTQ W4A16 recipe looks like this:

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear",
                 scheme="W4A16",        # weights INT4, activations FP16
                 ignore=["lm_head"]),
]

oneshot(model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        dataset="ultrachat_200k", recipe=recipe,
        num_calibration_samples=512, max_seq_length=2048)
```
vLLM, TGI, LMDeploy and SGLang all accept the quantised checkpoint directly; pass --quantization awq or --quantization gptq. Weights are unpacked on the fly into FP16/BF16 for the matmul on Ampere/Ada, or kept in low precision where the tensor cores handle it natively (Hopper/Blackwell).
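Loading such a checkpoint through vLLM's Python API looks like this (the repo name below is a placeholder; substitute any AWQ-quantised checkpoint):

```python
from vllm import LLM, SamplingParams

# quantization is normally auto-detected from the checkpoint config;
# passing it explicitly just makes the intent obvious.
llm = LLM(model="your-org/Llama-3.1-8B-Instruct-AWQ",   # placeholder repo name
          quantization="awq",
          gpu_memory_utilization=0.90)

out = llm.generate(["Explain AWQ in one sentence."],
                   SamplingParams(max_tokens=64, temperature=0.2))
print(out[0].outputs[0].text)
```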
H100 / H200 / B100 / B200 / DGX Spark. Two sub-formats: E4M3 (weights and activations at inference) and E5M2 (more exponent range, mainly used for gradients in training). Native tensor-core support means the compute also gets faster, not just the bandwidth. Quality loss is tiny (typically < 0.3% perplexity increase) with a calibration step.
Serving: --quantization fp8 on vLLM quantises on the fly; for best quality, pre-quantise with llm-compressor or NVIDIA's TensorRT Model Optimizer.
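The pre-quantisation recipe mirrors the INT4 one above; a sketch assuming llm-compressor's FP8_DYNAMIC scheme (FP8 weights plus dynamic per-token activation scales, so no calibration set is passed):

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# FP8_DYNAMIC: FP8 (E4M3) weights, activations quantised dynamically per token
# at runtime, so no calibration dataset appears below.
recipe = QuantizationModifier(targets="Linear",
                              scheme="FP8_DYNAMIC",
                              ignore=["lm_head"])

oneshot(model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        recipe=recipe,
        output_dir="Llama-3.1-8B-Instruct-FP8-dynamic")
```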
Blackwell introduces native 4-bit floating point with per-block microscaling (MX format — Open Compute spec). Real 2×-over-FP8 inference on supported kernels. vLLM 0.7+, TensorRT-LLM 0.12+, SGLang.
Quality with good calibration is comparable to INT4; sometimes better because the float grid handles outliers more gracefully than integer.
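For a sense of the layout (per the OCP MX spec, a block is 32 FP4 E2M1 values sharing one 8-bit E8M0 power-of-two scale), the storage cost works out to:

```python
# MXFP4 block per the OCP Microscaling spec:
#   32 elements × 4-bit E2M1 values + one shared 8-bit E8M0 scale
block_bits = 32 * 4 + 8
print(block_bits / 32)         # 4.25 effective bits per weight
print(8 / (block_bits / 32))   # ≈ 1.88, i.e. close to the "2× over FP8" headline
```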
Often forgotten: you can quantise the KV cache separately from the weights. vLLM supports --kv-cache-dtype fp8. For long contexts this alone can halve cache memory with negligible quality impact, letting you double concurrency or context length.
```bash
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct-FP8 \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --calculate-kv-scales \
    --gpu-memory-utilization 0.92
```
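How much that buys (a back-of-envelope sketch using Llama-3.1-70B's shape: 80 layers, 8 KV heads of dimension 128 under GQA):

```python
# KV-cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element
layers, kv_heads, head_dim = 80, 8, 128    # Llama-3.1-70B

def kv_cache_gb(context_tokens: int, bytes_per_element: int) -> float:
    return 2 * layers * kv_heads * head_dim * bytes_per_element * context_tokens / 1e9

for dtype, nbytes in [("fp16", 2), ("fp8", 1)]:
    print(f"{dtype}: {kv_cache_gb(32_768, nbytes):.1f} GB for one 32k-token sequence")
# fp16 ≈ 10.7 GB, fp8 ≈ 5.4 GB; the freed VRAM goes straight back to the paged KV pool
```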
Representative WikiText-2 perplexity for Llama-3-8B-Instruct (lower is better). Absolute numbers depend on tokeniser/eval setup; the deltas are the point.
Above about 5 effective bits you almost can't tell. Below 3 bits you definitely can. Four formats (FP8, AWQ-INT4, q5_K_M, q4_K_M) cover 95% of what anyone should run locally.
| Format | vLLM | TGI | SGLang | TRT-LLM | Ollama / llama.cpp | LMDeploy |
|---|---|---|---|---|---|---|
| FP16 / BF16 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| FP8 W8A8 | ✓ (H100+) | ✓ | ✓ | ✓ | — | ✓ |
| MX-FP4 | ✓ (B200) | — | ✓ | ✓ | — | — |
| AWQ INT4 | ✓ | ✓ | ✓ | ✓ | — | native |
| GPTQ INT4 | ✓ | ✓ | ✓ | ✓ | — | ✓ |
| GGUF | ✓ (read-only) | — | — | — | native | — |
| KV-cache FP8 | ✓ | — | ✓ | ✓ | via q8_0 KV | ✓ |
If you're on NVIDIA H100+ and willing to do the calibration work, FP8 weights + FP8 KV is almost always the right production choice. For consumer GPUs, AWQ-INT4. For Apple Silicon / CPU, GGUF q5_K_M or q4_K_M. Everything else is tuning.
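That decision, as an illustrative sketch (the hardware buckets and thresholds here are my own shorthand, not output from any tool):

```python
def pick_format(hardware: str, vram_gb: float, model_fp16_gb: float) -> str:
    """Rough encoding of the recommendation above; thresholds are illustrative."""
    if hardware in {"H100", "H200", "B100", "B200"}:
        return "FP8 weights + FP8 KV cache"
    if hardware in {"apple-silicon", "cpu"}:
        # q5_K_M is roughly 0.36x the FP16 size; leave headroom for context
        return "GGUF q5_K_M" if vram_gb >= 0.45 * model_fp16_gb else "GGUF q4_K_M"
    # consumer NVIDIA: 4-bit weight-only, FP8 KV cache if the engine supports it
    return "AWQ-INT4" if vram_gb >= 0.35 * model_fp16_gb else "pick a smaller model"

print(pick_format("RTX 4090", vram_gb=24, model_fp16_gb=16))   # 8B-class model -> AWQ-INT4
```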