The single largest NCP domain at 17%. KV cache arithmetic, PagedAttention, continuous batching, FlashAttention, quantisation, speculative decoding, multi-LoRA, and the framework landscape.
Twenty-two sections from first principles to exam prep. Work linearly or jump to the section you need.
Model Optimisation is 17% of the NCP-GENL exam — the single largest domain. GPU Acceleration and Optimisation adds a further 14%. Together they account for nearly a third of the exam and both require a working understanding of the inference stack.
Every optimisation technique in this deck maps to a specific vertex of the throughput/latency/memory triangle on the next slide. Knowing which vertex an optimisation addresses is the most efficient way to reason through exam scenarios.
Three coupled constraints. Every inference optimisation technique addresses one or more vertices. Understanding the coupling is more useful than a list of names.
Time To First Token. Prefill phase — processes entire input in parallel. Compute-bound. FlashAttention and fast GPUs reduce TTFT.
Time Per Output Token. Autoregressive decode — one token per forward pass. Memory-bandwidth-bound at batch 1: weights stream from HBM each step.
Larger batch → higher throughput, higher per-request latency. Lower precision → smaller footprint, potential accuracy cost. Continuous batching is the lever that decouples throughput from tail latency most effectively.
The KV cache is the dominant memory consumer at long context. The exam will ask you to compute its size from architecture parameters.
KV_bytes = 2 × n_layers × n_kv_heads × head_dim × seq_len × bytes_per_element
Factor of 2: one tensor for K, one for V.
n_kv_heads may be < n_query_heads when GQA or MQA is used.
Architecture: 32 layers, 32 KV heads, head dimension 128, sequence length 4096 tokens, FP16 (2 bytes per element).
2 × 32 layers × 32 kv_heads × 128 head_dim × 4096 tokens × 2 bytes
= 2,147,483,648 bytes ≈ 2 GB per sequence
At batch size 8: 8 × 2 GB = 16 GB for KV alone.
7B weights at FP16 occupy ≈ 14 GB.
Total: 30 GB — exceeds RTX 4000 Ada 20 GB without optimisation.
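The same arithmetic as a minimal Python sketch (the function name and defaults are illustrative, not from any library):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_element=2):
    """Per-sequence KV cache: one K and one V tensor per layer (the factor of 2)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_element

# Worked example above: 32 layers, 32 KV heads, head_dim 128, FP16, 4096 tokens.
per_seq = kv_cache_bytes(32, 32, 128, 4096, 2)
print(per_seq / 2**30)        # ~2.0 GiB per sequence
print(8 * per_seq / 2**30)    # ~16 GiB for KV alone at batch size 8
```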
- n_kv_heads. Llama 3 70B uses 8 KV heads vs 64 query heads: 8× reduction.
- bytes_per_element. From 2 (FP16) to 1 (INT8 or FP8): 2× reduction.
- seq_len. Linear in KV size but quadratic in attention FLOP cost.

KV cache grows linearly with sequence length. At long context it eclipses the model weight footprint. GQA and INT8 quantisation are the two main mitigations.
At 64k context, FP16 KV with 32 heads needs ≈32 GB per sequence — impossible on any single consumer GPU. Either GQA (8 heads, 8× reduction) or INT8 quantisation (2× reduction) is required for extended context on constrained hardware.
Kwon et al., 2023 — arXiv:2309.06180 (the vLLM paper). Borrowed from OS virtual memory: manage KV cache as pages rather than contiguous allocations.
Without paging, each request pre-allocates a contiguous KV block for its maximum possible output length. Because output length is unknown at request time, systems reserve the worst case.
KV cache split into fixed-size blocks (pages), each holding keys and values for a fixed number of tokens. A per-request page table maps logical indices to physical memory locations.
A 1k-token system prompt used by every request is prefilled once; all concurrent requests reference the same KV blocks. On a busy API server this can eliminate >50% of prefill FLOP costs.
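A toy sketch of the bookkeeping, assuming nothing about vLLM internals: a per-request block table maps logical block indices to physical blocks, and a shared prefix is simply two tables pointing at the same reference-counted physical blocks.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class BlockAllocator:
    """Toy physical-block pool with reference counting for shared prefixes."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        # A new request reuses an existing prefix block instead of re-prefilling it.
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)

allocator = BlockAllocator(num_blocks=1024)

# Request A prefills a 64-token system prompt: 4 physical blocks.
prompt_blocks = [allocator.allocate() for _ in range(64 // BLOCK_SIZE)]

# Request B uses the same system prompt: its block table points at the same
# physical blocks, so the prefix is neither prefilled nor stored a second time.
request_b_table = [allocator.share(b) for b in prompt_blocks]
```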
Static batching idles GPU slots waiting for the longest sequence. Continuous batching reclaims those slots at the iteration level.
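A minimal scheduler sketch (illustrative, not any framework's implementation) of why iteration-level scheduling helps: a finished sequence frees its slot immediately, and a waiting request joins on the very next decode step rather than waiting for the whole batch to drain.

```python
from collections import deque

def continuous_batching_loop(waiting, max_batch=8):
    """Toy iteration-level scheduler over sequences with a token budget each."""
    running = []
    while waiting or running:
        # Admit new requests into any free slots before each iteration.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode iteration: every running sequence emits one token.
        for seq in running:
            seq["generated"] += 1
        # Retire finished sequences; their slots are reused next iteration
        # instead of idling until the longest sequence completes.
        running = [s for s in running if s["generated"] < s["max_tokens"]]

requests = deque({"generated": 0, "max_tokens": n} for n in (5, 40, 12, 7))
continuous_batching_loop(requests, max_batch=2)
```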
Standard attention materialises the full L×L score matrix in HBM. FlashAttention avoids this with a tiled algorithm that keeps intermediate results in on-chip SRAM.
arXiv:2205.14135. IO-optimal tiling: HBM reads/writes drop from O(L²) to O(L). The L×L score matrix is never materialised in HBM; softmax normalisers are recomputed on-chip. Up to 3× speedup on GPT-2 at 1k tokens; 16k training sequences became practical.
arXiv:2307.08691. Reached 50–73% peak FLOPs/s on A100 (v1 achieved ~25%). Fewer non-matmul ops; parallelism across sequence dimension; better warp-level work distribution. Roughly 2× over v1.
Targets Hopper hardware specifically: FP8 tensor cores and async pipeline features (WGMMA/TMA). On consumer Ampere/Ada hardware, v2 is the relevant implementation.
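A NumPy sketch of the online-softmax idea the tiled algorithm builds on: iterate over key/value tiles while keeping only a running max, running normaliser, and running output, so the full L×L score matrix is never formed. Intuition only; it bears no resemblance to the fused CUDA kernels.

```python
import numpy as np

def tiled_attention(q, k, v, block=64):
    """Attention over KV tiles with online softmax rescaling (pure NumPy)."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    out = np.zeros_like(q)
    running_max = np.full(q.shape[0], -np.inf)
    running_sum = np.zeros(q.shape[0])
    for start in range(0, k.shape[0], block):
        k_tile, v_tile = k[start:start + block], v[start:start + block]
        scores = (q @ k_tile.T) * scale                  # (Lq, block) only
        new_max = np.maximum(running_max, scores.max(axis=-1))
        correction = np.exp(running_max - new_max)       # rescale old accumulators
        p = np.exp(scores - new_max[:, None])
        running_sum = running_sum * correction + p.sum(axis=-1)
        out = out * correction[:, None] + p @ v_tile
        running_max = new_max
    return out / running_sum[:, None]

# Matches a naive full-matrix reference on random data.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 64)) for _ in range(3))
scores = (q @ k.T) / np.sqrt(64)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
ref = (weights / weights.sum(axis=-1, keepdims=True)) @ v
assert np.allclose(tiled_attention(q, k, v), ref)
```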
| Aspect | Helps? | Reason |
|---|---|---|
| Prefill speed (TTFT) | Yes | Optimises the L×L compute in the prefill phase directly |
| Long-context training | Yes | 16k+ sequence lengths become practical in HBM |
| KV cache size (memory) | No | KV cache still exists; FlashAttention is a compute/IO optimisation, not memory reduction |
| Decode TPOT | Marginal | Decode is weight-bandwidth-bound; attention is not the dominant per-step cost |
Precision determines memory footprint, throughput, and which hardware provides native acceleration. Know this table for numerical and selection questions.
| Format | Bits | Bytes/weight | Range (approx.) | Native HW | Primary use |
|---|---|---|---|---|---|
| FP32 | 32 | 4 | ±3.4×10³⁸ | All | Training; inference reference only |
| BF16 | 16 | 2 | ±3.4×10³⁸ (8-bit exp) | Ampere+, Hopper | Baseline serving; FP32 range with less mantissa |
| FP16 | 16 | 2 | ±65 504 | All modern NVIDIA | Baseline serving; narrower range than BF16 |
| FP8 E4M3 | 8 | 1 | ±448 | Hopper, Ada | W+A inference; best accuracy at 8-bit |
| FP8 E5M2 | 8 | 1 | ±57 344 | Hopper, Ada | Gradient storage during training |
| INT8 | 8 | 1 | −128 to +127 | Ampere+ tensor cores | SmoothQuant W8A8; solid accuracy on A100 |
| NVFP4 / MX-FP4 | 4 | 0.5 | Limited; micro-block scaling | Blackwell (B100, B200) | Maximum throughput; calibration required |
| INT4 (AWQ/GPTQ) | 4 | 0.5 | −8 to +7 | All (dequant to FP16) | Weight-only; consumer GPU decode |
BF16 has the same 8-bit exponent as FP32 (same dynamic range) but only 7 mantissa bits. FP16 has 10 mantissa bits but a narrower range (±65 504 vs ±3.4×10³⁸). For LLM serving, BF16 is generally preferred because large activations can overflow FP16.
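Back-of-envelope weight footprint per format, assuming a dense model and ignoring embedding precision and quantisation scale overhead (figures are approximate):

```python
BYTES_PER_WEIGHT = {"FP32": 4, "BF16": 2, "FP16": 2, "FP8": 1, "INT8": 1, "FP4/INT4": 0.5}

def weight_footprint_gb(n_params, fmt):
    """Approximate weight memory in decimal GB for a dense model."""
    return n_params * BYTES_PER_WEIGHT[fmt] / 1e9

for fmt in BYTES_PER_WEIGHT:
    print(f"7B at {fmt}: ~{weight_footprint_gb(7e9, fmt):.1f} GB")
# FP16/BF16 ~14 GB, FP8/INT8 ~7 GB, 4-bit formats ~3.5 GB
```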
Two fundamentally different techniques. The distinction is a frequent exam topic because they target different bottlenecks.
Weights stored at INT4; dequantised to FP16 before the matmul. Compute still runs in FP16.
Both weights and activations quantised; matmul runs at INT8 or FP8 on tensor cores.
PTQ with calibration: small representative dataset determines per-channel scales post-training. Fast; accuracy loss is higher at INT4. Quantisation-aware training (QAT): simulates quantisation in the forward pass during fine-tuning. Recovers accuracy lost by PTQ; requires a training run.
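A simple weight-only sketch with symmetric per-group max-abs scales. Real AWQ/GPTQ pipelines choose scales far more carefully (activation-aware or Hessian-based), but the storage and dequantise-before-matmul pattern is the same.

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=128):
    """Symmetric per-group INT4 quantisation: values clipped to [-8, 7]."""
    w = w.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    """Weights are expanded back to FP16 right before the matmul, so the
    compute itself still runs in FP16."""
    return q.astype(np.float16) * scales.astype(np.float16)

w = np.random.randn(4096 * 128).astype(np.float32)
q, s = quantize_int4_groupwise(w)
w_hat = dequantize(q, s).reshape(w.shape)   # ~0.5 bytes/weight stored, FP16 compute
```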
Both produce INT4 weight-only quantised models via post-training quantisation. They optimise for different objectives.
| Aspect | AWQ | GPTQ |
|---|---|---|
| Paper | Lin et al., 2023 — arXiv:2306.00978 | Frantar et al., 2022 — arXiv:2210.17323 |
| Core method | Identifies 1% of "salient" weights by activation magnitude; protects them, quantises the rest to INT4 | Layer-wise Hessian-based reconstruction; minimises second-order error in layer output |
| Calibration speed | Fast (minutes for 7B) | Slower (minutes to hours for 70B+) |
| Calibration data needed | Small representative set | Small representative set |
| Accuracy at INT4 | Slightly better on instruction-tuned models | Competitive; can edge ahead on some base model benchmarks |
| TensorRT-LLM | Native (--qformat awq) | Native (--qformat gptq) |
| vLLM | Supported | Supported |
| Prefer when | Fast iteration; instruction-tuned deployment; Brendan's hardware | Maximising accuracy at fixed bit-width is primary concern |
For most production deployments, the accuracy difference between AWQ and GPTQ is below the noise of task-specific evaluation variance. Use AWQ for faster iteration. Always evaluate on your target task metric, not perplexity alone. Both are supported natively in TensorRT-LLM and vLLM.
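Loading a pre-quantised AWQ checkpoint in vLLM looks roughly like the sketch below; the model name is illustrative, and current vLLM releases can also auto-detect the quantisation format from the checkpoint config.

```python
from vllm import LLM, SamplingParams

# Illustrative checkpoint name; any INT4 AWQ export works the same way.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq", dtype="half")
outputs = llm.generate(["Explain the KV cache in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```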
Unlike weight-only INT4 (which dequantises before compute), these formats run the matmul itself at reduced precision on dedicated tensor cores.
Two encodings: E4M3 (4-bit exponent, 3-bit mantissa; range ±448) and E5M2 (5-bit exponent, 2-bit mantissa; range ±57 344). For inference, E4M3 is the primary format — higher precision, lower range. E5M2 is used for gradient storage. H100 FP8 tensor cores deliver roughly 2× the FLOPs of BF16 tensor cores. TensorRT-LLM FP8 quantisation via --qformat fp8 is the recommended production path on Hopper.
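A rough sketch of per-tensor FP8 calibration: choose a scale so the observed absolute maximum lands at the top of the E4M3 range. NumPy has no FP8 dtype, so this models only the scaling and clipping, not the 3-bit mantissa rounding; production paths (e.g. TensorRT-LLM --qformat fp8) calibrate amax over a dataset and run the matmul on FP8 tensor cores.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable E4M3 magnitude

def fp8_e4m3_calibrate(x):
    """Per-tensor scale so max|x| maps to the E4M3 limit; quantised values are
    x / scale (clipped), dequantised values are q * scale."""
    scale = np.abs(x).max() / FP8_E4M3_MAX
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

activations = np.random.randn(1024, 4096).astype(np.float32) * 30
q, scale = fp8_e4m3_calibrate(activations)
print(scale, np.abs(q).max())   # largest quantised magnitude sits at 448
```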
NVIDIA's NVFP4 and the MX-FP4 micro-scaling format apply a shared scale per small block of weights, providing more dynamic range than uniform INT4 within a 4-bit budget. Blackwell FP4 tensor cores provide the highest throughput in the current generation. TensorRT-LLM exposes this via --qformat nvfp4.
| Format | Hardware target | Weight bits | Activation bits | Key benefit |
|---|---|---|---|---|
| FP8 E4M3 | Hopper H100; Ada | 8 | 8 (FP8) | ~2× throughput vs BF16; near-lossless accuracy with calibration |
| NVFP4 / MX-FP4 | Blackwell B100, B200 | 4 + block scale | 8 (FP8) | Maximum throughput; QAT or careful calibration required |
| INT8 SmoothQuant | Ampere A100; Hopper | 8 | 8 (INT8) | ~2× throughput on Ampere; reliable accuracy |
Weight quantisation and KV cache quantisation are independent choices. Quantising the KV cache reduces long-context memory pressure without affecting model weight accuracy.
KV tensors are computed at inference time from activations. They are not model parameters. Quantising them to INT8 or FP8 post-computation compresses the in-HBM footprint during decoding.
TensorRT-LLM separates this as a distinct flag: --kv_cache_dtype fp8 or int8, independent of weight quantisation format.
RTX 4000 Ada (20 GB): 7B FP16 model leaves ~6 GB for KV. INT8 KV doubles usable context length or batch size within that budget. RTX 3080 (10 GB): 7B INT4 weights occupy ~4 GB, leaving 6 GB; INT8 KV is essentially mandatory for any serious context length.
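Illustrative budget arithmetic only (the runtime overhead figure is an assumption, not a measurement): how many total KV tokens fit after weights, and how INT8 KV roughly doubles that budget.

```python
def kv_budget_tokens(vram_gb, weights_gb, n_layers, n_kv_heads, head_dim,
                     kv_bytes_per_element, overhead_gb=1.0):
    """Total KV tokens (context x batch) that fit after weights and a rough
    runtime allowance. Budget arithmetic, not a benchmark."""
    free_bytes = (vram_gb - weights_gb - overhead_gb) * 1e9
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes_per_element
    return int(free_bytes // bytes_per_token)

# 7B FP16 weights (~14 GB) on a 20 GB card, 32 layers, 32 KV heads, head_dim 128:
print(kv_budget_tokens(20, 14, 32, 32, 128, 2))  # FP16 KV: ~9,500 tokens
print(kv_budget_tokens(20, 14, 32, 32, 128, 1))  # INT8 KV: ~19,000 tokens
```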
Autoregressive decoding is sequential by construction. Speculative decoding breaks the serialisation with a draft-then-verify protocol that preserves the target model output distribution exactly.
Let α = draft token acceptance rate, k = draft tokens per cycle. Expected tokens per cycle ≈ 1 + α⋅k (one target token always accepted, plus accepted drafts). At α = 0.8, k = 5: expected 5 tokens per verification step instead of 1 → up to 2–3× TPOT improvement in practice.
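The expectation above as a short helper, using the slide's approximation (which ignores draft-model cost, the reason low acceptance rates yield no net gain):

```python
def expected_tokens_per_cycle(alpha: float, k: int) -> float:
    """One guaranteed target token per verification step plus alpha * k
    accepted draft tokens. Approximation only; profile acceptance in practice."""
    return 1 + alpha * k

print(expected_tokens_per_cycle(0.8, 5))  # 5.0 tokens per cycle vs 1 for plain decode
print(expected_tokens_per_cycle(0.3, 5))  # 2.5; draft overhead can erase this gain
```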
Extra decoding heads on the target model; head k predicts token at position +k. No separate draft model. Lower acceptance rates than autoregressive drafts but zero separate-model cost. Supported in TensorRT-LLM.
Auto-regressive draft model trained to predict one step ahead at hidden-state level, not token level. Higher acceptance rates than Medusa. EAGLE-3 is the current iteration; natively supported in TensorRT-LLM.
Below α ≈ 0.5, speculative decoding adds overhead without throughput gain. Creative generation tasks with broad target distributions see low acceptance rates. Always profile acceptance rate before deploying a speculative setup in production.
Fine-tuned LoRA adapters share a base model. Batching cross-adapter requests in a single forward pass avoids reloading base weights on each adapter switch.
Unified paging for LoRA adapters with custom BGMV CUDA kernels for batched LoRA computation. Demonstrated hundreds of adapters served from a single GPU. The foundation for vLLM multi-LoRA support.
Alternative implementation using BGMV (batched gather matrix-vector) kernels for variable adapter ranks. Similar throughput to S-LoRA. Both S-LoRA and Punica underpin vLLM's multi-LoRA serving.
A rank-16 LoRA adapter on a 7B model contributes on the order of 20–50 MB. The base model is ~14 GB at FP16. Storing 50 adapters adds 1–2.5 GB. The base model dominates; adapters are inexpensive to keep in VRAM simultaneously.
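A rough parameter count behind that adapter figure, assuming rank-16 adapters on the four attention projections of a 7B-class model (hidden size 4096, 32 layers); real totals depend on which modules are targeted.

```python
def lora_params(d_in, d_out, rank):
    """A LoRA adapter on one linear layer adds two low-rank matrices:
    A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

hidden, layers, rank = 4096, 32, 16
per_layer = 4 * lora_params(hidden, hidden, rank)   # q, k, v, o projections
total_params = layers * per_layer
print(total_params * 2 / 1e6, "MB at FP16")          # ~34 MB per adapter
```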
For a detailed local-hardware comparison, see LLM_Hub_Local_LLM_Hosting.
| Framework | Primary strength | Hardware | Pick when… |
|---|---|---|---|
| TensorRT-LLM | NVIDIA-optimised throughput; FP8/FP4; IFB; speculative decoding; engine compilation | NVIDIA only (A10G, A100, H100, Ada) | Maximum tok/s in production on NVIDIA; NIM deployment |
| vLLM | PagedAttention; OpenAI-compatible API; broad model zoo; multi-LoRA | NVIDIA + AMD (ROCm) | Good throughput without compilation; rapid deployment; team comfort with Python API |
| SGLang | RadixAttention (prefix caching); structured generation; fast TTFT | NVIDIA | Agentic workloads; structured output; multi-turn with long shared prefixes |
| TGI | HuggingFace ecosystem integration; simple deployment | NVIDIA + AMD + Intel | Already using HF Hub; simpler ops; less throughput tuning needed |
| llama.cpp | CPU + consumer GPU; GGUF quantisation; widest hardware support | CPU, NVIDIA, Apple Metal | Non-NVIDIA hardware; extreme quantisation (Q2–Q4); CPU-only fallback |
| Ollama | Zero-config local serving; wraps llama.cpp | CPU, NVIDIA, Apple | Individual developer local use; not production-grade |
Open-source (Apache 2.0 since October 2023) Python/C++ library. For full coverage see NVIDIA_GPU_19_TensorRT_LLM.
--qformat fp8 | awq | gptq | nvfp4 | smoothquant

trtllm-build — kernel fusion, tactic selection; bakes max_batch / max_seq / precision into the engine binary

| Feature | TensorRT-LLM support |
|---|---|
| Quantisation | FP8 E4M3, NVFP4, INT4 AWQ, INT4 GPTQ, INT8 SmoothQuant |
| In-flight batching | Yes — iteration-level in C++ runtime; default when batching_strategy: inflight_fused_batching |
| Paged KV cache | Yes — configurable block size; prefix caching; host memory offload |
| Speculative decoding | Draft model, Medusa heads, EAGLE-3 |
| Parallelism | Tensor (TP), pipeline (PP), expert (EP) for MoE |
| Multi-LoRA | Yes — multiple adapters, runtime switching |
TensorRT-LLM is an ahead-of-time compiler. The engine is frozen to a specific GPU architecture, precision, and shape configuration. Build time: 5–30 minutes for a 70B model. vLLM uses torch.compile JIT. TRT-LLM trades flexibility for maximum throughput; vLLM trades throughput for faster iteration.
Consumer and workstation hardware with different precision capabilities. The cert focuses on A100/H100 distinctions, but understanding the consumer tier clarifies the Ampere/Ada capability split.
The RTX 3080 has higher memory bandwidth (760 vs 432 GB/s) despite having half the VRAM (10 GB vs 20 GB). For bandwidth-bound decode TPOT, the 3080 can outperform the 4000 Ada at equivalent model and precision if the model fits. The 4000 Ada wins on batch capacity and model size headroom (20 GB).
Back-of-envelope estimate for a 7B INT4 model on RTX 4000 Ada at batch 1. Derived from formulas, not measured benchmarks.
--- TTFT (prefill, 512 input tokens) ---
Prefill is compute-bound.
FLOPs ≈ 2 × 7×10^9 params × 512 tokens = 7.2×10^12 FLOPs
RTX 4000 Ada peak BF16 tensor core throughput ≈ 40 TFLOPS
TTFT ≈ 7.2×10^12 / 40×10^12 ≈ ~180 ms (order of magnitude)
--- TPOT (decode, INT4 weights, bandwidth-bound) ---
INT4 7B weights: ~3.5 GB effective bytes loaded per forward pass
RTX 4000 Ada bandwidth: ~432 GB/s
TPOT ≈ 3.5×10^9 / 432×10^9 ≈ ~8 ms/token (~125 tok/s)
--- End-to-end, 256 output tokens ---
E2E ≈ TTFT + 256 × TPOT
≈ 180 ms + 256 × 8 ms ≈ ~2.2 s
TPOT dominates: decode accounts for ~92% of end-to-end latency.
These are order-of-magnitude estimates from published formulas. Real throughput depends on kernel efficiency, memory allocation overhead, CUDA stream scheduling, and quantisation dequantisation cost. Use them to develop intuition about the relative contribution of TTFT vs TPOT, not as deployment-grade benchmarks.
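The same estimate as a throwaway helper; the inputs (TFLOPS, bandwidth, bytes per weight) are the assumptions stated above, not measurements.

```python
def latency_estimate(n_params, input_tokens, output_tokens,
                     flops=40e12, bandwidth=432e9, bytes_per_weight=0.5):
    """Back-of-envelope TTFT / TPOT / end-to-end from the formulas above.
    Order-of-magnitude only: ignores kernel efficiency, dequantisation cost,
    allocation overhead and CUDA stream scheduling."""
    ttft = 2 * n_params * input_tokens / flops        # prefill: compute-bound
    tpot = n_params * bytes_per_weight / bandwidth    # decode: bandwidth-bound
    return ttft, tpot, ttft + output_tokens * tpot

ttft, tpot, e2e = latency_estimate(7e9, 512, 256)
print(f"TTFT ~{ttft * 1e3:.0f} ms, TPOT ~{tpot * 1e3:.1f} ms/token, E2E ~{e2e:.2f} s")
# TTFT ~179 ms, TPOT ~8.1 ms/token, E2E ~2.25 s: decode dominates end-to-end latency
```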
What to reach for first, in order of typical expected impact for a standard LLM serving deployment.
Items 1–3 are essentially universal. FlashAttention is assumed on by default in all modern frameworks and does not appear as a discrete optimisation step. Items 4–6 require profiling for the specific workload.
An architecture-level choice that drastically reduces KV cache size with minimal accuracy impact at scale. Selecting models that implement GQA is therefore a free optimisation at deployment time.
Multi-Head Attention. Each query head has its own K and V head. n_kv_heads = n_q_heads. Llama 2 7B: 32 KV heads.
Grouped-Query Attention. Groups of query heads share one KV pair. Llama 3 8B: 8 KV heads vs 32 Q heads (4× reduction). Llama 3 70B: 8 KV heads vs 64 Q heads (8× reduction).
Multi-Query Attention. All Q heads share a single KV pair. Maximum KV reduction; some accuracy loss at smaller scales. Used by Falcon.
Hypothetical MHA 70B (64 KV heads):
2 × 80 layers × 64 kv_heads × 128 head_dim × 4096 tokens × 2 bytes ≈ 10.7 GB
Actual GQA 70B (8 KV heads):
2 × 80 layers × 8 kv_heads × 128 head_dim × 4096 tokens × 2 bytes ≈ 1.3 GB
Reduction: 8× — from formula alone, no measured benchmark.
Common question patterns for the Model Optimisation (17%) and GPU Acceleration (14%) domains. Source: notes/08_inference_optimisation.md.
Cert-focused tour: this deck maps techniques to exam domains. For depth on implementation, see the portfolio repos below.
| Topic | Resource |
|---|---|
| TensorRT-LLM: engine builder, IFB, paged KV, FP8/FP4, speculative | NVIDIA_GPU_19_TensorRT_LLM |
| vLLM, Ollama, TGI, llama.cpp local comparison | LLM_Hub_Local_LLM_Hosting |
| NVIDIA GPU memory hierarchy and bandwidth | NVIDIA_GPU_04_Memory_Hierarchy |
| Tensor cores: INT8 / FP8 throughput arithmetic | NVIDIA_GPU_03_Tensor_Cores |
| Full NVIDIA GPU architecture series | LLM_Hub_NVIDIA_GPUs |
| NVIDIA stack: Triton, NIM, NeMo (deck 05 in this series) | 05_nvidia_stack_overview |
| PagedAttention / vLLM paper | arXiv:2309.06180 — Kwon et al., 2023 |
| FlashAttention v1 | arXiv:2205.14135 — Dao et al., 2022 |
| FlashAttention v2 | arXiv:2307.08691 — Dao, 2023 |
| AWQ | arXiv:2306.00978 — Lin et al., 2023 |
| GPTQ | arXiv:2210.17323 — Frantar et al., 2022 |
| Medusa speculative decoding | arXiv:2401.10774 — Cai et al., 2024 |
| TensorRT-LLM official docs | developer.nvidia.com/tensorrt-llm |