The single largest NCP domain at 17%. KV cache arithmetic, PagedAttention, continuous batching, FlashAttention, quantisation, speculative decoding, multi-LoRA, and the framework landscape.
Twenty-two sections from first principles to exam prep. Work linearly or jump to the section you need.
Model Optimisation is 17% of the NCP-GENL exam — the single largest domain. GPU Acceleration and Optimisation adds a further 14%. Together they account for nearly a third of the exam and both require a working understanding of the inference stack.
Every optimisation technique in this deck maps to a specific vertex of the throughput/latency/memory triangle on the next slide. Knowing which vertex an optimisation addresses is the most efficient way to reason through exam scenarios.
Three coupled constraints. Every inference optimisation technique addresses one or more vertices. Understanding the coupling is more useful than a list of names.
Time To First Token. Prefill phase — processes entire input in parallel. Compute-bound. FlashAttention and fast GPUs reduce TTFT.
Time Per Output Token. Autoregressive decode — one token per forward pass. Memory-bandwidth-bound at batch 1: weights stream from HBM each step.
Larger batch → higher throughput, higher per-request latency. Lower precision → smaller footprint, potential accuracy cost. Continuous batching is the lever that decouples throughput from tail latency most effectively.
The KV cache is the dominant memory consumer at long context. The exam will ask you to compute its size from architecture parameters.
KV_bytes = 2 × n_layers × n_kv_heads × head_dim × seq_len × bytes_per_element
Factor of 2: one tensor for K, one for V.
n_kv_heads may be < n_query_heads when GQA or MQA is used.
Architecture: 32 layers, 32 KV heads, head dimension 128, sequence length 4096 tokens, FP16 (2 bytes per element).
2 × 32 layers × 32 kv_heads × 128 head_dim × 4096 tokens × 2 bytes
= 2,147,483,648 bytes ≈ 2 GB per sequence
At batch size 8: 8 × 2 GB = 16 GB for KV alone.
7B weights at FP16 occupy ≈ 14 GB.
Total: 30 GB — exceeds RTX 4000 Ada 20 GB without optimisation.
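The same arithmetic as a minimal Python sketch (the function name and defaults are illustrative, not from any library):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_element=2):
    """Per-sequence KV cache: one K and one V tensor per layer (the factor of 2)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_element

# Worked example above: 32 layers, 32 KV heads, head_dim 128, FP16, 4096 tokens.
per_seq = kv_cache_bytes(32, 32, 128, 4096, 2)
print(per_seq / 2**30)        # ~2.0 GiB per sequence
print(8 * per_seq / 2**30)    # ~16 GiB for KV alone at batch size 8
```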
- n_kv_heads. Llama 3 70B uses 8 KV heads vs 64 query heads: 8× reduction.
- bytes_per_element. From 2 (FP16) to 1 (INT8 or FP8): 2× reduction.
- seq_len. Linear in KV size but quadratic in attention FLOP cost.

KV cache grows linearly with sequence length. At long context it eclipses the model weight footprint. GQA and INT8 quantisation are the two main mitigations.
At 64k context, FP16 KV with 32 heads needs ≈32 GB per sequence — impossible on any single consumer GPU. Either GQA (8 heads, 8× reduction) or INT8 quantisation (2× reduction) is required for extended context on constrained hardware.
Kwon et al., 2023 — arXiv:2309.06180 (the vLLM paper). Borrowed from OS virtual memory: manage KV cache as pages rather than contiguous allocations.
Without paging, each request pre-allocates a contiguous KV block for its maximum possible output length. Because output length is unknown at request time, systems reserve the worst case.
KV cache split into fixed-size blocks (pages), each holding keys and values for a fixed number of tokens. A per-request page table maps logical indices to physical memory locations.
A 1k-token system prompt used by every request is prefilled once; all concurrent requests reference the same KV blocks. On a busy API server this can eliminate >50% of prefill FLOP costs.
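A toy sketch of the bookkeeping, assuming nothing about vLLM internals: a per-request block table maps logical block indices to physical blocks, and a shared prefix is simply two tables pointing at the same reference-counted physical blocks.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class BlockAllocator:
    """Toy physical-block pool with reference counting for shared prefixes."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        # A new request reuses an existing prefix block instead of re-prefilling it.
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)

allocator = BlockAllocator(num_blocks=1024)

# Request A prefills a 64-token system prompt: 4 physical blocks.
prompt_blocks = [allocator.allocate() for _ in range(64 // BLOCK_SIZE)]

# Request B uses the same system prompt: its block table points at the same
# physical blocks, so the prefix is neither prefilled nor stored a second time.
request_b_table = [allocator.share(b) for b in prompt_blocks]
```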
Static batching idles GPU slots waiting for the longest sequence. Continuous batching reclaims those slots at the iteration level.
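A minimal scheduler sketch (illustrative, not any framework's implementation) of why iteration-level scheduling helps: a finished sequence frees its slot immediately, and a waiting request joins on the very next decode step rather than waiting for the whole batch to drain.

```python
from collections import deque

def continuous_batching_loop(waiting, max_batch=8):
    """Toy iteration-level scheduler over sequences with a token budget each."""
    running = []
    while waiting or running:
        # Admit new requests into any free slots before each iteration.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode iteration: every running sequence emits one token.
        for seq in running:
            seq["generated"] += 1
        # Retire finished sequences; their slots are reused next iteration
        # instead of idling until the longest sequence completes.
        running = [s for s in running if s["generated"] < s["max_tokens"]]

requests = deque({"generated": 0, "max_tokens": n} for n in (5, 40, 12, 7))
continuous_batching_loop(requests, max_batch=2)
```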
Standard attention materialises the full L×L score matrix in HBM. FlashAttention avoids this with a tiled algorithm that keeps intermediate results in on-chip SRAM.
arXiv:2205.14135. IO-optimal tiling: HBM reads/writes drop from O(L²) to O(L). The L×L score matrix is never materialised in HBM; softmax normalisers are recomputed on-chip. Up to 3× speedup on GPT-2 at 1k tokens; 16k training sequences became practical.
arXiv:2307.08691. Reached 50–73% peak FLOPs/s on A100 (v1 achieved ~25%). Fewer non-matmul ops; parallelism across sequence dimension; better warp-level work distribution. Roughly 2× over v1.
Targets Hopper hardware specifically: FP8 tensor cores and async pipeline features (WGMMA/TMA). On consumer Ampere/Ada hardware, v2 is the relevant implementation.
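A NumPy sketch of the online-softmax idea the tiled algorithm builds on: iterate over key/value tiles while keeping only a running max, running normaliser, and running output, so the full L×L score matrix is never formed. Intuition only; it bears no resemblance to the fused CUDA kernels.

```python
import numpy as np

def tiled_attention(q, k, v, block=64):
    """Attention over KV tiles with online softmax rescaling (pure NumPy)."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    out = np.zeros_like(q)
    running_max = np.full(q.shape[0], -np.inf)
    running_sum = np.zeros(q.shape[0])
    for start in range(0, k.shape[0], block):
        k_tile, v_tile = k[start:start + block], v[start:start + block]
        scores = (q @ k_tile.T) * scale                  # (Lq, block) only
        new_max = np.maximum(running_max, scores.max(axis=-1))
        correction = np.exp(running_max - new_max)       # rescale old accumulators
        p = np.exp(scores - new_max[:, None])
        running_sum = running_sum * correction + p.sum(axis=-1)
        out = out * correction[:, None] + p @ v_tile
        running_max = new_max
    return out / running_sum[:, None]

# Matches a naive full-matrix reference on random data.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 64)) for _ in range(3))
scores = (q @ k.T) / np.sqrt(64)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
ref = (weights / weights.sum(axis=-1, keepdims=True)) @ v
assert np.allclose(tiled_attention(q, k, v), ref)
```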
| Aspect | Helps? | Reason |
|---|---|---|
| Prefill speed (TTFT) | Yes | Optimises the L×L compute in the prefill phase directly |
| Long-context training | Yes | 16k+ sequence lengths become practical in HBM |
| KV cache size (memory) | No | KV cache still exists; FlashAttention is a compute/IO optimisation, not memory reduction |
| Decode TPOT | Marginal | Decode is weight-bandwidth-bound; attention is not the dominant per-step cost |
Precision determines memory footprint, throughput, and which hardware provides native acceleration. Know this table for numerical and selection questions.
| Format | Bits | Bytes/weight | Range (approx.) | Native HW | Primary use |
|---|---|---|---|---|---|
| FP32 | 32 | 4 | ±3.4×10³⁸ | All | Training; inference reference only |
| BF16 | 16 | 2 | ±3.4×10³⁸ (8-bit exp) | Ampere+, Hopper | Baseline serving; FP32 range with less mantissa |
| FP16 | 16 | 2 | ±65 504 | All modern NVIDIA | Baseline serving; narrower range than BF16 |
| FP8 E4M3 | 8 | 1 | ±448 | Hopper, Ada | W+A inference; best accuracy at 8-bit |
| FP8 E5M2 | 8 | 1 | ±57 344 | Hopper, Ada | Gradient storage during training |
| INT8 | 8 | 1 | −128 to +127 | Ampere+ tensor cores | SmoothQuant W8A8; solid accuracy on A100 |
| NVFP4 / MX-FP4 | 4 | 0.5 | Limited; micro-block scaling | Blackwell (B100, B200) | Maximum throughput; calibration required |
| INT4 (AWQ/GPTQ) | 4 | 0.5 | −8 to +7 | All (dequant to FP16) | Weight-only; consumer GPU decode |
BF16 has the same 8-bit exponent as FP32 (same dynamic range) but only 7 mantissa bits. FP16 has 10 mantissa bits but a narrower range (±65 504 vs ±3.4×10³⁸). For LLM serving, BF16 is generally preferred because large activations can overflow FP16.
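Back-of-envelope weight footprint per format, assuming a dense model and ignoring embedding precision and quantisation scale overhead (figures are approximate):

```python
BYTES_PER_WEIGHT = {"FP32": 4, "BF16": 2, "FP16": 2, "FP8": 1, "INT8": 1, "FP4/INT4": 0.5}

def weight_footprint_gb(n_params, fmt):
    """Approximate weight memory in decimal GB for a dense model."""
    return n_params * BYTES_PER_WEIGHT[fmt] / 1e9

for fmt in BYTES_PER_WEIGHT:
    print(f"7B at {fmt}: ~{weight_footprint_gb(7e9, fmt):.1f} GB")
# FP16/BF16 ~14 GB, FP8/INT8 ~7 GB, 4-bit formats ~3.5 GB
```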
Two fundamentally different techniques. The distinction is a frequent exam topic because they target different bottlenecks.
Weights stored at INT4; dequantised to FP16 before the matmul. Compute still runs in FP16.
Both weights and activations quantised; matmul runs at INT8 or FP8 on tensor cores.
PTQ with calibration: small representative dataset determines per-channel scales post-training. Fast; accuracy loss is higher at INT4. Quantisation-aware training (QAT): simulates quantisation in the forward pass during fine-tuning. Recovers accuracy lost by PTQ; requires a training run.
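A simple weight-only sketch with symmetric per-group max-abs scales. Real AWQ/GPTQ pipelines choose scales far more carefully (activation-aware or Hessian-based), but the storage and dequantise-before-matmul pattern is the same.

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=128):
    """Symmetric per-group INT4 quantisation: values clipped to [-8, 7]."""
    w = w.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    """Weights are expanded back to FP16 right before the matmul, so the
    compute itself still runs in FP16."""
    return q.astype(np.float16) * scales.astype(np.float16)

w = np.random.randn(4096 * 128).astype(np.float32)
q, s = quantize_int4_groupwise(w)
w_hat = dequantize(q, s).reshape(w.shape)   # ~0.5 bytes/weight stored, FP16 compute
```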
Both produce INT4 weight-only quantised models via post-training quantisation. They optimise for different objectives.
| Aspect | AWQ | GPTQ |
|---|---|---|
| Paper | Lin et al., 2023 — arXiv:2306.00978 | Frantar et al., 2022 — arXiv:2210.17323 |
| Core method | Identifies 1% of "salient" weights by activation magnitude; protects them, quantises the rest to INT4 | Layer-wise Hessian-based reconstruction; minimises second-order error in layer output |
| Calibration speed | Fast (minutes for 7B) | Slower (minutes to hours for 70B+) |
| Calibration data needed | Small representative set | Small representative set |
| Accuracy at INT4 | Slightly better on instruction-tuned models | Competitive; can edge ahead on some base model benchmarks |
| TensorRT-LLM | Native (--qformat awq) | Native (--qformat gptq) |
| vLLM | Supported | Supported |
| Prefer when | Fast iteration; instruction-tuned deployment; Brendan's hardware | Maximising accuracy at fixed bit-width is primary concern |
For most production deployments, the accuracy difference between AWQ and GPTQ is below the noise of task-specific evaluation variance. Use AWQ for faster iteration. Always evaluate on your target task metric, not perplexity alone. Both are supported natively in TensorRT-LLM and vLLM.
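Loading a pre-quantised AWQ checkpoint in vLLM looks roughly like the sketch below; the model name is illustrative, and current vLLM releases can also auto-detect the quantisation format from the checkpoint config.

```python
from vllm import LLM, SamplingParams

# Illustrative checkpoint name; any INT4 AWQ export works the same way.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq", dtype="half")
outputs = llm.generate(["Explain the KV cache in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```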
Unlike weight-only INT4 (which dequantises before compute), these formats run the matmul itself at reduced precision on dedicated tensor cores.
Two encodings: E4M3 (4-bit exponent, 3-bit mantissa; range ±448) and E5M2 (5-bit exponent, 2-bit mantissa; range ±57 344). For inference, E4M3 is the primary format — higher precision, lower range. E5M2 is used for gradient storage. H100 FP8 tensor cores deliver roughly 2× the FLOPs of BF16 tensor cores. TensorRT-LLM FP8 quantisation via --qformat fp8 is the recommended production path on Hopper.
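A rough sketch of per-tensor FP8 calibration: choose a scale so the observed absolute maximum lands at the top of the E4M3 range. NumPy has no FP8 dtype, so this models only the scaling and clipping, not the 3-bit mantissa rounding; production paths (e.g. TensorRT-LLM --qformat fp8) calibrate amax over a dataset and run the matmul on FP8 tensor cores.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable E4M3 magnitude

def fp8_e4m3_calibrate(x):
    """Per-tensor scale so max|x| maps to the E4M3 limit; quantised values are
    x / scale (clipped), dequantised values are q * scale."""
    scale = np.abs(x).max() / FP8_E4M3_MAX
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

activations = np.random.randn(1024, 4096).astype(np.float32) * 30
q, scale = fp8_e4m3_calibrate(activations)
print(scale, np.abs(q).max())   # largest quantised magnitude sits at 448
```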
NVIDIA's NVFP4 and the MX-FP4 micro-scaling format apply a shared scale per small block of weights, providing more dynamic range than uniform INT4 within a 4-bit budget. Blackwell FP4 tensor cores provide the highest throughput in the current generation. TensorRT-LLM exposes this via --qformat nvfp4.
| Format | Hardware target | Weight bits | Activation bits | Key benefit |
|---|---|---|---|---|
| FP8 E4M3 | Hopper H100; Ada | 8 | 8 (FP8) | ~2× throughput vs BF16; near-lossless accuracy with calibration |
| NVFP4 / MX-FP4 | Blackwell B100, B200 | 4 + block scale | 8 (FP8) | Maximum throughput; QAT or careful calibration required |
| INT8 SmoothQuant | Ampere A100; Hopper | 8 | 8 (INT8) | ~2× throughput on Ampere; reliable accuracy |
Weight quantisation and KV cache quantisation are independent choices. Quantising the KV cache reduces long-context memory pressure without affecting model weight accuracy.
KV tensors are computed at inference time from activations. They are not model parameters. Quantising them to INT8 or FP8 post-computation compresses the in-HBM footprint during decoding.
TensorRT-LLM separates this as a distinct flag: --kv_cache_dtype fp8 or int8, independent of weight quantisation format.
RTX 4000 Ada (20 GB): 7B FP16 model leaves ~6 GB for KV. INT8 KV doubles usable context length or batch size within that budget. RTX 3080 (10 GB): 7B INT4 weights occupy ~4 GB, leaving 6 GB; INT8 KV is essentially mandatory for any serious context length.
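Illustrative budget arithmetic only (the runtime overhead figure is an assumption, not a measurement): how many total KV tokens fit after weights, and how INT8 KV roughly doubles that budget.

```python
def kv_budget_tokens(vram_gb, weights_gb, n_layers, n_kv_heads, head_dim,
                     kv_bytes_per_element, overhead_gb=1.0):
    """Total KV tokens (context x batch) that fit after weights and a rough
    runtime allowance. Budget arithmetic, not a benchmark."""
    free_bytes = (vram_gb - weights_gb - overhead_gb) * 1e9
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes_per_element
    return int(free_bytes // bytes_per_token)

# 7B FP16 weights (~14 GB) on a 20 GB card, 32 layers, 32 KV heads, head_dim 128:
print(kv_budget_tokens(20, 14, 32, 32, 128, 2))  # FP16 KV: ~9,500 tokens
print(kv_budget_tokens(20, 14, 32, 32, 128, 1))  # INT8 KV: ~19,000 tokens
```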
Autoregressive decoding is sequential by construction. Speculative decoding breaks the serialisation with a draft-then-verify protocol that preserves the target model output distribution exactly.
Let α = draft token acceptance rate, k = draft tokens per cycle. Expected tokens per cycle ≈ 1 + α⋅k (one target token always accepted, plus accepted drafts). At α = 0.8, k = 5: expected 5 tokens per verification step instead of 1 → up to 2–3× TPOT improvement in practice.
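The expectation above as a short helper, using the slide's approximation (which ignores draft-model cost, the reason low acceptance rates yield no net gain):

```python
def expected_tokens_per_cycle(alpha: float, k: int) -> float:
    """One guaranteed target token per verification step plus alpha * k
    accepted draft tokens. Approximation only; profile acceptance in practice."""
    return 1 + alpha * k

print(expected_tokens_per_cycle(0.8, 5))  # 5.0 tokens per cycle vs 1 for plain decode
print(expected_tokens_per_cycle(0.3, 5))  # 2.5; draft overhead can erase this gain
```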
Extra decoding heads on the target model; head k predicts token at position +k. No separate draft model. Lower acceptance rates than autoregressive drafts but zero separate-model cost. Supported in TensorRT-LLM.
Auto-regressive draft model trained to predict one step ahead at hidden-state level, not token level. Higher acceptance rates than Medusa. EAGLE-3 is the current iteration; natively supported in TensorRT-LLM.
Below α ≈ 0.5, speculative decoding adds overhead without throughput gain. Creative generation tasks with broad target distributions see low acceptance rates. Always profile acceptance rate before deploying a speculative setup in production.
Fine-tuned LoRA adapters share a base model. Batching cross-adapter requests in a single forward pass avoids reloading base weights on each adapter switch.
Unified paging for LoRA adapters with custom BGMV CUDA kernels for batched LoRA computation. Demonstrated hundreds of adapters served from a single GPU. The foundation for vLLM multi-LoRA support.
Alternative implementation using BGMV (batched gather matrix-vector) kernels for variable adapter ranks. Similar throughput to S-LoRA. Both S-LoRA and Punica underpin vLLM's multi-LoRA serving.
A rank-16 LoRA adapter on a 7B model contributes on the order of 20–50 MB. The base model is ~14 GB at FP16. Storing 50 adapters adds 1–2.5 GB. The base model dominates; adapters are inexpensive to keep in VRAM simultaneously.
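A rough parameter count behind that adapter figure, assuming rank-16 adapters on the four attention projections of a 7B-class model (hidden size 4096, 32 layers); real totals depend on which modules are targeted.

```python
def lora_params(d_in, d_out, rank):
    """A LoRA adapter on one linear layer adds two low-rank matrices:
    A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

hidden, layers, rank = 4096, 32, 16
per_layer = 4 * lora_params(hidden, hidden, rank)   # q, k, v, o projections
total_params = layers * per_layer
print(total_params * 2 / 1e6, "MB at FP16")          # ~34 MB per adapter
```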
For a detailed local-hardware comparison, see LLM_Hub_Local_LLM_Hosting.
| Framework | Primary strength | Hardware | Pick when… |
|---|---|---|---|
| TensorRT-LLM | NVIDIA-optimised throughput; FP8/FP4; IFB; speculative decoding; engine compilation | NVIDIA only (A10G, A100, H100, Ada) | Maximum tok/s in production on NVIDIA; NIM deployment |
| vLLM | PagedAttention; OpenAI-compatible API; broad model zoo; multi-LoRA | NVIDIA + AMD (ROCm) | Good throughput without compilation; rapid deployment; team comfort with Python API |
| SGLang | RadixAttention (prefix caching); structured generation; fast TTFT | NVIDIA | Agentic workloads; structured output; multi-turn with long shared prefixes |
| TGI | HuggingFace ecosystem integration; simple deployment | NVIDIA + AMD + Intel | Already using HF Hub; simpler ops; less throughput tuning needed |
| llama.cpp | CPU + consumer GPU; GGUF quantisation; widest hardware support | CPU, NVIDIA, Apple Metal | Non-NVIDIA hardware; extreme quantisation (Q2–Q4); CPU-only fallback |
| Ollama | Zero-config local serving; wraps llama.cpp | CPU, NVIDIA, Apple | Individual developer local use; not production-grade |
Open-source (Apache 2.0 since October 2023) Python/C++ library. For full coverage see NVIDIA_GPU_19_TensorRT_LLM.
--qformat fp8 | awq | gptq | nvfp4 | smoothquant

trtllm-build — kernel fusion, tactic selection; bakes max_batch / max_seq / precision into the engine binary

| Feature | TensorRT-LLM support |
|---|---|
| Quantisation | FP8 E4M3, NVFP4, INT4 AWQ, INT4 GPTQ, INT8 SmoothQuant |
| In-flight batching | Yes — iteration-level in C++ runtime; default when batching_strategy: inflight_fused_batching |
| Paged KV cache | Yes — configurable block size; prefix caching; host memory offload |
| Speculative decoding | Draft model, Medusa heads, EAGLE-3 |
| Parallelism | Tensor (TP), pipeline (PP), expert (EP) for MoE |
| Multi-LoRA | Yes — multiple adapters, runtime switching |
TensorRT-LLM is an ahead-of-time compiler. The engine is frozen to a specific GPU architecture, precision, and shape configuration. Build time: 5–30 minutes for a 70B model. vLLM uses torch.compile JIT. TRT-LLM trades flexibility for maximum throughput; vLLM trades throughput for faster iteration.
Consumer and workstation hardware with different precision capabilities. The cert focuses on A100/H100 distinctions, but understanding the consumer tier clarifies the Ampere/Ada capability split.
The RTX 3080 has higher memory bandwidth (760 vs 432 GB/s) despite having half the VRAM (10 GB vs 20 GB). For bandwidth-bound decode TPOT, the 3080 can outperform the 4000 Ada at equivalent model and precision if the model fits. The 4000 Ada wins on batch capacity and model size headroom (20 GB).
Back-of-envelope estimate for a 7B INT4 model on RTX 4000 Ada at batch 1. Derived from formulas, not measured benchmarks.
--- TTFT (prefill, 512 input tokens) ---
Prefill is compute-bound.
FLOPs ≈ 2 × 7×10^9 params × 512 tokens = 7.2×10^12 FLOPs
RTX 4000 Ada peak BF16 tensor core throughput ≈ 40 TFLOPS
TTFT ≈ 7.2×10^12 / 40×10^12 ≈ ~180 ms (order of magnitude)
--- TPOT (decode, INT4 weights, bandwidth-bound) ---
INT4 7B weights: ~3.5 GB effective bytes loaded per forward pass
RTX 4000 Ada bandwidth: ~432 GB/s
TPOT ≈ 3.5×10^9 / 432×10^9 ≈ ~8 ms/token (~125 tok/s)
--- End-to-end, 256 output tokens ---
E2E ≈ TTFT + 256 × TPOT
≈ 180 ms + 256 × 8 ms ≈ ~2.2 s
TPOT dominates: decode accounts for ~92% of end-to-end latency.
These are order-of-magnitude estimates from published formulas. Real throughput depends on kernel efficiency, memory allocation overhead, CUDA stream scheduling, and quantisation dequantisation cost. Use them to develop intuition about the relative contribution of TTFT vs TPOT, not as deployment-grade benchmarks.
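The same estimate as a throwaway helper; the inputs (TFLOPS, bandwidth, bytes per weight) are the assumptions stated above, not measurements.

```python
def latency_estimate(n_params, input_tokens, output_tokens,
                     flops=40e12, bandwidth=432e9, bytes_per_weight=0.5):
    """Back-of-envelope TTFT / TPOT / end-to-end from the formulas above.
    Order-of-magnitude only: ignores kernel efficiency, dequantisation cost,
    allocation overhead and CUDA stream scheduling."""
    ttft = 2 * n_params * input_tokens / flops        # prefill: compute-bound
    tpot = n_params * bytes_per_weight / bandwidth    # decode: bandwidth-bound
    return ttft, tpot, ttft + output_tokens * tpot

ttft, tpot, e2e = latency_estimate(7e9, 512, 256)
print(f"TTFT ~{ttft * 1e3:.0f} ms, TPOT ~{tpot * 1e3:.1f} ms/token, E2E ~{e2e:.2f} s")
# TTFT ~179 ms, TPOT ~8.1 ms/token, E2E ~2.25 s: decode dominates end-to-end latency
```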
What to reach for first, in order of typical expected impact for a standard LLM serving deployment.
Items 1–3 are essentially universal. FlashAttention is assumed on by default in all modern frameworks and does not appear as a discrete optimisation step. Items 4–6 require profiling for the specific workload.
An architecture-level choice that drastically reduces KV cache size with minimal accuracy impact at scale. Selecting models that implement GQA is therefore a free optimisation at deployment time.
Multi-Head Attention. Each query head has its own K and V head. n_kv_heads = n_q_heads. Llama 2 7B: 32 KV heads.
Grouped-Query Attention. Groups of query heads share one KV pair. Llama 3 8B: 8 KV heads vs 32 Q heads (4× reduction). Llama 3 70B: 8 KV heads vs 64 Q heads (8× reduction).
Multi-Query Attention. All Q heads share a single KV pair. Maximum KV reduction; some accuracy loss at smaller scales. Used by Falcon.
Hypothetical MHA 70B (64 KV heads):
2 × 80 layers × 64 kv_heads × 128 head_dim × 4096 tokens × 2 bytes ≈ 10.7 GB
Actual GQA 70B (8 KV heads):
2 × 80 layers × 8 kv_heads × 128 head_dim × 4096 tokens × 2 bytes ≈ 1.3 GB
Reduction: 8× — from formula alone, no measured benchmark.
Common question patterns for the Model Optimisation (17%) and GPU Acceleration (14%) domains. Source: notes/08_inference_optimisation.md.
Cert-focused tour: this deck maps techniques to exam domains. For depth on implementation, see the portfolio repos below.
| Topic | Resource |
|---|---|
| TensorRT-LLM: engine builder, IFB, paged KV, FP8/FP4, speculative | NVIDIA_GPU_19_TensorRT_LLM |
| vLLM, Ollama, TGI, llama.cpp local comparison | LLM_Hub_Local_LLM_Hosting |
| NVIDIA GPU memory hierarchy and bandwidth | NVIDIA_GPU_04_Memory_Hierarchy |
| Tensor cores: INT8 / FP8 throughput arithmetic | NVIDIA_GPU_03_Tensor_Cores |
| Full NVIDIA GPU architecture series | LLM_Hub_NVIDIA_GPUs |
| NVIDIA stack: Triton, NIM, NeMo (deck 05 in this series) | 05_nvidia_stack_overview |
| PagedAttention / vLLM paper | arXiv:2309.06180 — Kwon et al., 2023 |
| FlashAttention v1 | arXiv:2205.14135 — Dao et al., 2022 |
| FlashAttention v2 | arXiv:2307.08691 — Dao, 2023 |
| AWQ | arXiv:2306.00978 — Lin et al., 2023 |
| GPTQ | arXiv:2210.17323 — Frantar et al., 2022 |
| Medusa speculative decoding | arXiv:2401.10774 — Cai et al., 2024 |
| TensorRT-LLM official docs | developer.nvidia.com/tensorrt-llm |