Model Optimisation is the single largest NCP-GENL domain at 17%. The exam expects working knowledge of the full inference stack: how memory is consumed, how throughput is maximised, what each optimisation technique actually does, and which frameworks implement which techniques. This note is correspondingly detailed.
Cross-reference: NVIDIA_GPU_19_TensorRT_LLM for TensorRT-LLM depth; LLM_Hub_Local_LLM_Hosting for vLLM, Ollama, TGI, and llama.cpp comparisons.
Inference optimisation involves three coupled constraints:
- Memory: model weights, KV cache, and activations must all fit in VRAM.
- Throughput: aggregate tokens per second across all concurrent requests.
- Latency: time to first token (TTFT) and time per output token (TPOT) for each individual request.
These constraints trade off: increasing batch size improves throughput but increases latency for early requests in the batch. Reducing model precision frees memory and often improves throughput but may degrade quality. Every optimisation technique in this note occupies a specific position in this triangle.
During autoregressive decoding, attention requires access to the key and value tensors computed for every previous token. Recomputing these from scratch at each step would be prohibitively expensive; instead they are cached — the KV cache.
The memory footprint of the KV cache for a single sequence is:
KV_cache_bytes = 2 × n_layers × n_kv_heads × head_dim × seq_len × bytes_per_element
The factor of 2 is for keys and values. For a model with 32 layers, 32 KV heads, head dimension 128, sequence length 4096, and FP16 (2 bytes per element):
2 × 32 × 32 × 128 × 4096 × 2 = 2,147,483,648 bytes ≈ 2 GB per sequence
This is not a small number. At batch size 8, that is 16 GB just for the KV cache, most of the RTX 4000 Ada’s entire 20 GB before the weights are even loaded. The cost also scales linearly with context: doubling the sequence length doubles the KV cache. Models with grouped-query attention (GQA) or multi-query attention (MQA) reduce n_kv_heads substantially — Llama 3 70B uses 8 KV heads against 64 query heads, cutting KV cache by 8×.
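As a sanity check on this arithmetic, a minimal sketch mirroring the formula above:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size for ONE sequence; the leading 2 is keys + values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

full = kv_cache_bytes(32, 32, 128, 4096)   # the worked example: 2.0 GiB
gqa = kv_cache_bytes(32, 8, 128, 4096)     # same model with 8 KV heads: 0.5 GiB
print(f"{full / 2**30:.1f} GiB dense, {gqa / 2**30:.1f} GiB with 4x GQA")
```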
The KV cache dominates the memory budget at long context and modest model sizes. On Brendan’s hardware (10 GB and 20 GB), the practical context-length ceiling for even a 7B model at FP16 without KV quantisation is in the low thousands of tokens once the model weights occupy their share.
Before PagedAttention (Kwon et al., 2023 — arXiv:2309.06180), KV caches were allocated as contiguous memory blocks reserved at the start of a request. Because sequence length is not known in advance, systems over-provisioned, wasting 20–80% of KV cache memory to internal fragmentation.
PagedAttention borrows the OS virtual-memory paging concept. The KV cache is divided into fixed-size blocks (pages), each holding the keys and values for a fixed number of tokens. Pages are allocated on demand and stored non-contiguously; a block table maps logical page indices to physical memory locations. Consequences:
- Internal fragmentation is bounded by one partially-filled page per sequence, rather than 20–80% of an up-front reservation.
- Memory is committed only as sequences actually grow, so far more concurrent sequences fit in the same VRAM.
- Pages can be shared between sequences (common prompt prefixes, parallel sampling, beam search), with copy-on-write when sequences diverge.
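A minimal sketch of the bookkeeping involved (hypothetical names, not vLLM’s actual internals; vLLM’s default page size is 16 tokens):

```python
PAGE_TOKENS = 16  # tokens per page; vLLM's default block size

class PagedKVAllocator:
    def __init__(self, n_pages):
        self.free = list(range(n_pages))  # pool of physical page ids
        self.block_tables = {}            # seq_id -> [physical page ids]

    def append_token(self, seq_id, n_tokens_so_far):
        # A new physical page is allocated only when the sequence crosses a
        # page boundary, so waste is at most one partial page per sequence.
        table = self.block_tables.setdefault(seq_id, [])
        if n_tokens_so_far % PAGE_TOKENS == 0:
            table.append(self.free.pop())
        return table[-1], n_tokens_so_far % PAGE_TOKENS  # (page, offset)

    def release(self, seq_id):
        # Finished sequences return their pages to the pool immediately.
        self.free.extend(self.block_tables.pop(seq_id))
```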
vLLM’s throughput improvement over FasterTransformer (the prior state of the art) was reported at 2–4× at the same latency level.
Static batching processes a fixed batch of requests in lockstep — the batch is held until every sequence finishes generating, then a new batch starts. This wastes GPU cycles whenever a short sequence finishes early and its slot sits idle waiting for the longest sequence in the batch.
Continuous batching (also called in-flight batching or iteration-level scheduling) releases a slot as soon as a sequence finishes and immediately inserts a new request. From the GPU’s perspective, every forward pass is at (near) peak utilisation. The serving engine manages a dynamic set of active sequences whose membership can change every token step.
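A toy scheduler loop makes the difference from static batching concrete (hypothetical names; `decode_step` stands in for one forward pass of a real engine):

```python
from collections import deque

def serve_loop(pending: deque, max_active: int, decode_step):
    """Toy iteration-level scheduler. decode_step(active) advances every
    active sequence by one token and returns the set that just finished."""
    active = []
    while pending or active:
        # Admission happens every iteration, not every batch: a freed slot
        # is refilled before the very next forward pass.
        while pending and len(active) < max_active:
            active.append(pending.popleft())
        finished = decode_step(active)
        # Finished sequences leave immediately instead of idling until the
        # longest sequence in a static batch completes.
        active = [seq for seq in active if seq not in finished]
```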
TensorRT-LLM implements in-flight batching natively; its C++ runtime manages the batch state, KV cache allocation, and request scheduling at the iteration level. vLLM implements the same concept. For TensorRT-LLM internals, see NVIDIA_GPU_19_TensorRT_LLM.
Standard attention computes the full Q × K^T score matrix in HBM before applying softmax and multiplying by V. For sequence length L and head dimension d, the score matrix is O(L²) in memory. At L = 8192 and 32 heads, the FP16 score matrices alone total 4 GiB per layer, and computing them requires multiple round-trips between HBM and the compute units.
FlashAttention v1 (Dao et al., 2022 — arXiv:2205.14135) reformulated attention as a tiled algorithm that keeps intermediate results in on-chip SRAM. The full L×L matrix is never materialised in HBM. This is IO-optimal for exact attention: HBM read/write volume drops from Θ(L·d + L²) for standard attention to Θ(L²d²/M) for on-chip SRAM size M (a large constant-factor reduction in practice), at the cost of more arithmetic (recomputing softmax normalisers). Measured speedups: up to 3× on GPT-2 training at 1K tokens; training sequence lengths up to 16K became practical.
FlashAttention v2 (Dao, 2023 — arXiv:2307.08691) addressed the 25–40% GPU utilisation ceiling of v1. Three changes: (1) fewer non-matrix operations by restructuring the softmax rescaling; (2) parallelisation across the sequence dimension, even within a single head; (3) better warp-level work distribution within a thread block. Result: ~2× speedup over v1, reaching 50–73% of theoretical peak FLOPs/s on A100.
FlashAttention v3 targets H100-class hardware specifically (using FP8 tensor cores and the async pipeline features of Hopper). On consumer hardware (Ampere/Ada), v2 is the relevant implementation.
FlashAttention is a training and prefill optimisation. It does not change decode memory — the KV cache is still needed during autoregressive generation — but it dramatically accelerates the prefill phase and makes long-context training feasible.
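The core trick, an online softmax over tiles, fits in a few lines of NumPy. This toy single-query version is illustrative only (the real kernels are fused CUDA processing blocks of queries in SRAM):

```python
import numpy as np

def tiled_attention(q, K, V, tile=128):
    """q: (d,) one query; K, V: (L, d). Streams over K/V in tiles, keeping a
    running max and normaliser so the full L-length score vector never exists."""
    d = q.shape[0]
    m = -np.inf          # running max of scores seen so far
    l = 0.0              # running softmax denominator
    acc = np.zeros(d)    # running weighted sum of V rows
    for start in range(0, K.shape[0], tile):
        k_t, v_t = K[start:start + tile], V[start:start + tile]
        s = (k_t @ q) / np.sqrt(d)      # one tile of attention scores
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)       # rescale old accumulators to new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v_t
        m = m_new
    return acc / l
```

The output matches softmax(qKᵀ/√d)·V up to floating-point error, while only ever holding one tile of scores at a time.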
Quantisation reduces the numerical precision of weights and/or activations, shrinking memory footprint and often improving throughput by using lower-precision matrix-multiply units.
| Format | Precision (weights / activations) | Typical use case |
|---|---|---|
| FP16/BF16 | 16 / 16 | Baseline serving |
| INT8 (W8A8) | 8 / 8 | SmoothQuant; good accuracy, ~2× memory saving vs FP16 |
| FP8 (W8A8) | 8 / 8 | H100 FP8 tensor cores; TensorRT-LLM native |
| INT4 AWQ | 4 / 16 | Weight-only; consumer GPU serving |
| INT4 GPTQ | 4 / 16 | Weight-only; post-training |
| NVFP4 (W4A8) | 4 / 8 | TensorRT-LLM on Ada/Hopper |
Weight-only quantisation (AWQ, GPTQ) quantises weights to INT4 but dequantises to FP16 before the matmul. This reduces memory bandwidth pressure (loading fewer bytes per weight) and thus improves decode throughput on bandwidth-limited hardware, but the compute itself still runs in FP16. There is no benefit from narrower compute units.
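A minimal round-to-nearest sketch of group-wise weight-only quantisation (AWQ and GPTQ add activation-aware scaling and error-compensating updates on top of this basic scheme; group size 128 is a common choice):

```python
import numpy as np

def quant_int4_groupwise(w, group=128):
    """Per-group symmetric INT4: 4-bit integers plus one FP16 scale per group."""
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # map into [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequant(q, scale):
    # At inference, weights are dequantised back to FP16 before the matmul:
    # compute stays FP16; only the bytes moved from HBM shrink.
    return (q.astype(np.float16) * scale).reshape(-1)
```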
Weight-and-activation quantisation (INT8 SmoothQuant, FP8) runs the matmul itself at lower precision, exploiting NVIDIA’s INT8/FP8 tensor core throughput. This can improve throughput beyond the memory-bandwidth improvement alone. The challenge is that activations have dynamic range that varies per token, making uniform quantisation lossy; SmoothQuant migrates quantisation difficulty from activations to weights by applying a per-channel scale.
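The migration itself is a per-channel rescaling chosen from calibration data; in the SmoothQuant paper (Xiao et al., 2022), s_j = max|X_j|^α / max|W_j|^(1−α) with α ≈ 0.5. A sketch:

```python
import numpy as np

def smoothquant_scales(X, W, alpha=0.5):
    """Per-channel scales s: replace X @ W with (X / s) @ (s[:, None] * W).

    X: (tokens, in_channels) calibration activations
    W: (in_channels, out_features) weights
    The product is mathematically unchanged, but activation outliers are
    flattened into the weights, which tolerate quantisation better.
    """
    x_max = np.abs(X).max(axis=0)   # per-channel activation range
    w_max = np.abs(W).max(axis=1)   # per-channel weight range
    return x_max ** alpha / w_max ** (1 - alpha)
```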
Calibration: post-training quantisation (PTQ) requires a small calibration dataset to determine scaling factors. Quantisation-aware training (QAT) fine-tunes with simulated quantisation in the forward pass, producing better accuracy at the cost of a training run.
Accuracy/latency trade-off: INT4 weight-only typically incurs 0.5–2% accuracy degradation on standard benchmarks; INT8 weight-and-activation is often near-lossless; FP4/NVFP4 requires careful calibration or QAT. Always evaluate on task-specific metrics, not just perplexity.
KV cache can be quantised independently of weights — typically to INT8 or FP8 — reducing the memory cost of long contexts without affecting the model weights at all. This is particularly valuable on consumer hardware where the KV cache competes with weights for the same HBM. TensorRT-LLM exposes KV cache quantisation as a separate option from weight quantisation.
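As an example of the separation (hedged: option names and supported values vary by version and hardware), vLLM exposes KV cache precision as its own argument, independent of weight dtype:

```python
# kv_cache_dtype is a real vLLM option (recent versions accept "fp8");
# the model name here is a placeholder.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    dtype="float16",          # weight/activation precision
    kv_cache_dtype="fp8",     # KV cache quantised independently
)
```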
Autoregressive decoding is sequential: each token is generated one at a time. Speculative decoding breaks this by using a small, fast draft model to generate k candidate tokens, then verifying all k with the large target model in a single forward pass. Correct draft tokens are accepted; the first incorrect token causes rejection from that point, and the target model’s corrected token is taken instead.
Under the right conditions (draft model well-aligned with target, k in the range 3–7), speculative decoding can reduce TPOT by 2–3× with no change in output distribution — the verification step guarantees the same distribution as the target model alone.
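A sketch of one draft-and-verify round, in the simpler greedy-decoding variant (the sampling variant uses rejection sampling to preserve the target distribution exactly; the model interface here is hypothetical):

```python
def speculative_step(target, draft, tokens, k=4):
    """One draft-and-verify round, greedy variant.

    Hypothetical interface:
      draft.argmax_next(seq) -> most likely next token after seq
      target.argmax_all(seq) -> list p where p[i] is the target's most likely
                                token following seq[:i+1], from ONE forward pass
    """
    # 1. Draft k candidate tokens autoregressively (cheap model, k small passes).
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(draft.argmax_next(proposal))

    # 2. Verify all k candidates with a single target forward pass.
    preds = target.argmax_all(proposal)

    # 3. Accept the longest prefix the target agrees with, then append the
    #    target's own token at the first disagreement (or a bonus token if
    #    everything was accepted), so each round yields at least one token.
    n = len(tokens)
    accepted = 0
    while accepted < k and proposal[n + accepted] == preds[n + accepted - 1]:
        accepted += 1
    return proposal[:n + accepted] + [preds[n + accepted - 1]]
```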
Medusa (Cai et al., 2024) replaces the draft model with extra decoding heads added directly to the target model, jointly trained to predict tokens at positions +1, +2, …+k. No separate draft model is needed; the extra heads add minimal latency to each target forward pass.
EAGLE (Li et al., 2024) uses an auto-regressive draft model trained to predict one step ahead at the feature (hidden state) level rather than the token level, achieving higher acceptance rates than Medusa in practice. EAGLE-3 is the current iteration, supported natively in TensorRT-LLM.
The speedup versus verification cost trade-off depends on the draft acceptance rate and the ratio of draft-to-target model size. A poorly-calibrated draft that generates mostly rejected tokens adds overhead without acceleration.
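This trade-off can be made precise with the standard analysis from Leviathan et al. (2023): if each draft token is accepted independently with probability α, draft length k yields the following expected tokens per target forward pass (verification costs one target pass regardless):

```python
def expected_tokens_per_target_pass(alpha: float, k: int) -> float:
    # Leviathan et al. (2023), assuming i.i.d. per-token acceptance rate alpha:
    # E[tokens] = (1 - alpha**(k+1)) / (1 - alpha), counting the bonus token.
    return (1 - alpha ** (k + 1)) / (1 - alpha)

expected_tokens_per_target_pass(0.8, 4)   # ~3.36 tokens per verification pass
expected_tokens_per_target_pass(0.3, 4)   # ~1.43: a weak draft barely helps
```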
Fine-tuned models are often LoRA adapters on a shared base. Serving multiple adapters simultaneously with one base model copy avoids reloading the base weights for each adapter switch.
S-LoRA (Sheng et al., 2023) and Punica introduced the concept of batching requests across different adapters in a single forward pass. The base model forward pass is shared; adapter deltas are applied per-request using custom CUDA kernels that fuse the LoRA A×B computation into the same pass. The adapters themselves are small (a rank-16 LoRA on a 7B model is on the order of tens of MB) and are stored in GPU or CPU memory and swapped as needed.
vLLM supports multi-LoRA serving natively. This is particularly relevant for platforms that fine-tune per-user or per-task and want to host tens or hundreds of adapters without proportional memory cost.
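A hedged sketch of what this looks like with vLLM’s offline API (model and adapter names/paths are placeholders; argument details can vary across vLLM versions):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One copy of the base model serves many adapters.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True, max_loras=8)

# Each request can name a different adapter; the base forward pass is shared.
outputs = llm.generate(
    ["Translate to SQL: total sales by region"],
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql_adapter"),
)
```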
| Framework | Primary strength | Hardware target | Notes |
|---|---|---|---|
| TensorRT-LLM | NVIDIA-optimised throughput; FP8/FP4; in-flight batching | NVIDIA (A10G, A100, H100, Ada) | Engine builder + runtime; Triton backend for serving. See NVIDIA_GPU_19_TensorRT_LLM |
| vLLM | PagedAttention; OpenAI-compatible API; broad model support | NVIDIA + AMD | Best for high-throughput serving of standard architectures |
| SGLang | Structured generation; RadixAttention for prefix caching; fast TTFT | NVIDIA | Strong for agentic and multi-turn workloads |
| TGI (Text Generation Inference) | Hugging Face ecosystem integration | NVIDIA + AMD + Intel | Simpler deployment; less tuning flexibility |
| llama.cpp | CPU + consumer GPU; GGUF quantisation | CPU, NVIDIA, Apple Metal | Widest hardware support; lower throughput than GPU-native |
| Ollama | Zero-config local serving; wraps llama.cpp | CPU, NVIDIA, Apple | Best for individual developer use; not production-grade |
For a detailed comparison, see LLM_Hub_Local_LLM_Hosting.
When to reach for which: TensorRT-LLM is the right choice when maximising NVIDIA GPU utilisation in a production setting and the target hardware is NVIDIA. vLLM is the right choice for rapid deployment with good performance across a wide model zoo. SGLang is worth considering for structured-output-heavy or agentic workloads. llama.cpp / Ollama are appropriate for local development on consumer hardware.
On the RTX 4000 Ada (20 GB): a 7B model at FP16 occupies ~14 GB of weights, leaving roughly 5 GB of headroom after runtime overhead. At the worked example’s ≈0.5 MiB of KV per token, that is about a 10K-token total KV budget across all concurrent sequences. INT4 weight-only (~4 GB of weights) roughly triples that budget.
On the RTX 3080 (10 GB): 7B at FP16 (~14 GB of weights) does not fit at all. INT4 weight-only (~4 GB) is the practical floor, leaving a similar ~10K-token total KV budget.
KV cache quantisation (INT8) approximately halves the KV cache footprint, meaningfully extending usable context length on both machines.
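Putting the section’s arithmetic together as a rough budget calculator (the 1 GiB runtime overhead and the weight sizes are assumptions; real headroom also depends on activations and fragmentation):

```python
def kv_token_budget(vram_gib, weights_gib, kv_bytes_per_token, overhead_gib=1.0):
    """Total tokens of KV cache (summed over all sequences) that fit."""
    free_bytes = (vram_gib - weights_gib - overhead_gib) * 2**30
    return int(free_bytes // kv_bytes_per_token)

kv_per_tok = 2 * 32 * 32 * 128 * 2        # worked example: 0.5 MiB per token
kv_token_budget(20, 14, kv_per_tok)       # RTX 4000 Ada, 7B FP16 -> ~10K tokens
kv_token_budget(20, 4, kv_per_tok)        # RTX 4000 Ada, 7B INT4 -> ~30K tokens
kv_token_budget(10, 4, kv_per_tok)        # RTX 3080,     7B INT4 -> ~10K tokens
```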