NVIDIA GenAI Cert Prep — Deck 04

Inference Optimisation

The single largest NCP domain at 17%. KV cache arithmetic, PagedAttention, continuous batching, FlashAttention, quantisation, speculative decoding, multi-LoRA, and the framework landscape.

[Cover graphic: deployment pipeline — Weights → Quantise → Engine → KV cache → IFB → Serve]
00

Topics Covered

Twenty-two sections from first principles to exam prep. Work linearly or jump to the section you need.

01

Cert Framing

Model Optimisation is 17% of the NCP-GENL exam — the single largest domain. GPU Acceleration and Optimisation adds a further 14%. Together they account for nearly a third of the exam and both require a working understanding of the inference stack.

NCP-GENL domains (selected)

  • Model Optimisation — 17%
  • GPU Acceleration & Optimisation — 14%
  • Model Deployment — 9%
  • Production Monitoring & Reliability — 7%

What the exam expects

  • Numerical reasoning: KV cache formula, bit-width arithmetic
  • Technique selection: which optimisation targets which bottleneck
  • Framework knowledge: TensorRT-LLM, vLLM, SGLang feature sets
  • Hardware awareness: Ampere vs Hopper vs Ada capability differences
Study approach

Every optimisation technique in this deck maps to a specific vertex of the throughput/latency/memory triangle on the next slide. Knowing which vertex an optimisation addresses is the most efficient way to reason through exam scenarios.

02

The Throughput / Latency / Memory Triangle

Three coupled constraints. Every inference optimisation technique addresses one or more vertices. Understanding the coupling is more useful than a list of names.

[Triangle diagram. Vertices: THROUGHPUT (tok/s across all requests; large batches; high arithmetic intensity), MEMORY (HBM: model weights + KV cache), LATENCY (TTFT prefill + TPOT decode). Edge annotations: quantisation shrinks memory and improves throughput but may reduce accuracy; KV management (PagedAttention, GQA, KV quant) trades along the memory edge; batching strategy (IFB) trades throughput against tail latency.]

TTFT

Time To First Token. Prefill phase — processes entire input in parallel. Compute-bound. FlashAttention and fast GPUs reduce TTFT.

TPOT

Time Per Output Token. Autoregressive decode — one token per forward pass. Memory-bandwidth-bound at batch 1: weights stream from HBM each step.

The coupling

Larger batch → higher throughput, higher per-request latency. Lower precision → smaller footprint, potential accuracy cost. Continuous batching is the lever that decouples throughput from tail latency most effectively.
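The coupling can be made concrete with a toy roofline model. A minimal sketch in Python — the GPU numbers are illustrative A100-class spec values, and it ignores KV-cache reads and attention FLOPs: decode step time stays pinned at the weight-streaming time until the batch is large enough to become compute-bound, which is exactly why batching raises throughput at little per-request cost in the bandwidth-bound regime.

WEIGHT_BYTES = 14e9        # 7B params × 2 bytes (FP16)
BANDWIDTH = 2.0e12         # ~2 TB/s HBM (A100-class, illustrative)
PEAK_FLOPS = 312e12        # dense FP16 tensor core peak (illustrative)
FLOPS_PER_TOKEN = 2 * 7e9  # ~2 FLOPs per parameter per token

for batch in (1, 8, 32, 128, 512):
    t_mem = WEIGHT_BYTES / BANDWIDTH                  # weights stream once per step
    t_compute = batch * FLOPS_PER_TOKEN / PEAK_FLOPS  # grows with batch
    t_step = max(t_mem, t_compute)                    # whichever resource saturates
    bound = "memory" if t_mem >= t_compute else "compute"
    print(f"batch={batch:4d}  step={t_step*1e3:6.2f} ms  "
          f"throughput={batch/t_step:8.0f} tok/s  ({bound}-bound)")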

03

KV Cache — Formula and Worked Example

The KV cache is the dominant memory consumer at long context. The exam will ask you to compute its size from architecture parameters.

KV cache footprint formula
KV_bytes = 2 × n_layers × n_kv_heads × head_dim × seq_len × bytes_per_element

Factor of 2: one tensor for K, one for V.
n_kv_heads may be < n_query_heads when GQA or MQA is used.

Worked example: 7B model, 4 096-token context, FP16

Architecture: 32 layers, 32 KV heads, head dimension 128, FP16 (2 bytes per element).

KV cache per sequence at 4k context, FP16
2 × 32 layers × 32 kv_heads × 128 head_dim × 4096 tokens × 2 bytes
= 2,147,483,648 bytes ≈ 2 GB per sequence

At batch size 8:  8 × 2 GB = 16 GB for KV alone.
7B weights at FP16 occupy ≈ 14 GB.
Total: 30 GB — exceeds RTX 4000 Ada 20 GB without optimisation.
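The same arithmetic as a reusable helper — a minimal sketch, with the factor of 2 for K and V as defined above:

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_element=2):
    """KV cache footprint per sequence; the leading 2 covers the K and V tensors."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_element

# Worked example above: 7B-class model, 4k context, FP16.
per_seq = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(per_seq / 2**30)      # 2.0 GiB per sequence
print(8 * per_seq / 2**30)  # 16 GiB at batch size 8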

Primary levers for reducing KV footprint

  • GQA / MQA — architecture choice that reduces n_kv_heads (section 20)
  • KV cache quantisation — INT8/FP8 halves bytes_per_element (section 12)
  • PagedAttention — eliminates fragmentation so allocated cache is fully used (section 05)

04

KV Cache Memory Scaling

KV cache grows linearly with sequence length. At long context it eclipses the model weight footprint. GQA and INT8 quantisation are the two main mitigations.

[Chart: KV cache size (GB) vs sequence length, 1k–64k tokens. Series: KV FP16 with 32 KV heads; KV INT8 with 32 KV heads; KV FP16 with 8 KV heads (GQA). Horizontal reference line at ~14 GB for 7B weights at FP16.]

At 64k context, FP16 KV with 32 heads needs ≈32 GB per sequence — impossible on any single consumer GPU. Either GQA (8 heads, 8× reduction) or INT8 quantisation (2× reduction) is required for extended context on constrained hardware.
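The chart's data points can be regenerated from the section 03 formula. A self-contained sketch, assuming the same 32-layer, head_dim-128 architecture:

# Per-sequence KV cache in GiB: 2 (K and V) × layers × heads × head_dim × seq × bytes
kv_gib = lambda n_kv_heads, seq_len, bytes_per: (
    2 * 32 * n_kv_heads * 128 * seq_len * bytes_per / 2**30)

for seq in (1024, 2048, 4096, 8192, 16384, 32768, 65536):
    print(f"{seq:6d} tok | FP16/32h {kv_gib(32, seq, 2):6.2f} GiB | "
          f"INT8/32h {kv_gib(32, seq, 1):6.2f} GiB | "
          f"GQA-8 FP16 {kv_gib(8, seq, 2):6.2f} GiB")
# 7B FP16 weights ≈ 14 GiB for reference — the chart's horizontal line.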

05

PagedAttention

Kwon et al., 2023 — arXiv:2309.06180 (the vLLM paper). Borrowed from OS virtual memory: manage KV cache as pages rather than contiguous allocations.

Before PagedAttention

Each request pre-allocates a contiguous KV block for its maximum possible output length. Because output length is unknown at request time, systems reserve the worst case.

  • 20–80% of KV allocation wasted (internal fragmentation)
  • Hard batch-size ceiling from worst-case reserves
  • No sharing of identical prompt prefixes across concurrent requests

With PagedAttention

KV cache split into fixed-size blocks (pages), each holding keys and values for a fixed number of tokens. A per-request page table maps logical indices to physical memory locations.

  • Near-zero internal fragmentation (only last block partially filled)
  • Prefix caching: one copy of a shared system prompt, referenced by many requests via copy-on-write
  • 2–4× throughput improvement vs FasterTransformer (from the vLLM paper)
Request arrives with prompt
Block manager allocates pages for prompt KV
Each new token fills next slot in current block
Block full → allocate next free block (anywhere in HBM)
Sequence finishes → all blocks returned to free pool
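A minimal sketch of that lifecycle — hypothetical class and method names, not vLLM's actual implementation: fixed-size pages, a per-request page table, and a shared free pool.

BLOCK_TOKENS = 16  # tokens of K/V per block (page)

class BlockManager:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.page_tables = {}   # request id -> list of physical block ids
        self.fill = {}          # request id -> tokens in the last block

    def alloc_token(self, req):
        table = self.page_tables.setdefault(req, [])
        if not table or self.fill[req] == BLOCK_TOKENS:
            table.append(self.free.pop())   # next free block, anywhere in HBM
            self.fill[req] = 0
        self.fill[req] += 1

    def release(self, req):
        self.free.extend(self.page_tables.pop(req))  # blocks back to the free pool
        self.fill.pop(req)

mgr = BlockManager(num_physical_blocks=1024)
for _ in range(40):                       # 40 tokens -> ceil(40/16) = 3 blocks
    mgr.alloc_token("req-0")
print(len(mgr.page_tables["req-0"]))      # 3 — only the last block is partially filled
mgr.release("req-0")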
Prefix caching payoff

A 1k-token system prompt used by every request is prefilled once; all concurrent requests reference the same KV blocks. On a busy API server this can eliminate >50% of prefill FLOP costs.

06

Continuous / In-Flight Batching

Static batching idles GPU slots waiting for the longest sequence. Continuous batching reclaims those slots at the iteration level.

[Diagram. Static batching: requests A/B/C occupy a batch; finished slots idle until the batch drains when the longest request completes. Continuous batching: slots 0–2 are refilled by new requests as soon as the previous sequence in each slot finishes.]

Static batching (the baseline)

  • GPU utilisation: 20–40% on variable-length chat traffic
  • Tail latency: dominated by the longest sequence in the batch
  • New requests queue until the entire batch drains

Continuous batching (IFB)

  • GPU utilisation: 60–90% on the same traffic pattern
  • New request admitted within one decode step once a slot frees
  • Implemented in TensorRT-LLM C++ runtime and vLLM at the iteration level
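A toy simulation makes the utilisation gap visible. The sketch below — a hypothetical scheduler, not any framework's real loop — counts decode steps for the same variable-length traffic under both policies:

from collections import deque

def static_batching_steps(lengths, num_slots):
    """Each batch drains completely before the next is admitted."""
    steps = 0
    for i in range(0, len(lengths), num_slots):
        steps += max(lengths[i:i + num_slots])  # batch costs its longest member
    return steps

def continuous_batching_steps(lengths, num_slots):
    """One token per occupied slot per step; freed slots refill next step."""
    queue, slots, steps = deque(lengths), {}, 0
    while queue or slots:
        for s in range(num_slots):
            if s not in slots and queue:
                slots[s] = queue.popleft()      # admitted within one decode step
        steps += 1
        slots = {s: n - 1 for s, n in slots.items() if n > 1}  # drop finished
    return steps

lengths = [10, 200, 30, 40, 180, 20, 60, 90]  # output tokens per request
print(static_batching_steps(lengths, 4))      # 380 — two full batch drains
print(continuous_batching_steps(lengths, 4))  # 200 — bounded by the longest request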
07

FlashAttention v1 / v2 / v3

Standard attention materialises the full L×L score matrix in HBM. FlashAttention avoids this with a tiled algorithm that keeps intermediate results in on-chip SRAM.

FlashAttention v1 (2022)

arXiv:2205.14135. IO-optimal tiling: HBM reads/writes drop from O(L²) to O(L). The L×L score matrix is never materialised in HBM; softmax normalisers are recomputed on-chip. Up to 3× speedup on GPT-2 at 1k tokens; 16k training sequences became practical.

FlashAttention v2 (2023)

arXiv:2307.08691. Reached 50–73% peak FLOPs/s on A100 (v1 achieved ~25%). Fewer non-matmul ops; parallelism across sequence dimension; better warp-level work distribution. Roughly 2× over v1.

FlashAttention v3

Targets Hopper hardware specifically: FP8 tensor cores and async pipeline features (WGMMA/TMA). On consumer Ampere/Ada hardware, v2 is the relevant implementation.

What FlashAttention does and does not solve

Aspect | Helps? | Reason
Prefill speed (TTFT) | Yes | Optimises the L×L computation in the prefill phase directly
Long-context training | Yes | 16k+ sequence lengths become practical in HBM
KV cache size (memory) | No | KV cache still exists; FlashAttention is a compute/IO optimisation, not a memory reduction
Decode TPOT | Marginal | Decode is weight-bandwidth-bound; attention is not the dominant per-step cost
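The core of the tiled algorithm is the online softmax: process keys tile by tile, carrying a running max and normaliser so the full L×L score matrix is never formed. A single-query numpy sketch — educational only; real kernels also tile queries and keep tiles in SRAM:

import numpy as np

def attention_online(q, K, V, tile=64):
    d = q.shape[-1]
    m = -np.inf                        # running max of scores (stable softmax)
    l = 0.0                            # running softmax normaliser
    acc = np.zeros(V.shape[-1])
    for i in range(0, K.shape[0], tile):
        s = K[i:i+tile] @ q / np.sqrt(d)   # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)          # rescale previously accumulated partials
        p = np.exp(s - m_new)
        acc = acc * scale + p @ V[i:i+tile]
        l = l * scale + p.sum()
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
K, V, q = rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64)), rng.normal(size=64)
s = K @ q / 8.0                            # reference: materialise all scores
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
print(np.allclose(attention_online(q, K, V), ref))   # True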
08

Quantisation Formats — Reference Table

Precision determines memory footprint, throughput, and which hardware provides native acceleration. Know this table for numerical and selection questions.

Format | Bits | Bytes/weight | Range (approx.) | Native HW | Primary use
FP32 | 32 | 4 | ±3.4×10³⁸ | All | Training; inference reference only
BF16 | 16 | 2 | ±3.4×10³⁸ (8-bit exp) | Ampere+, Hopper | Baseline serving; FP32 range with less mantissa
FP16 | 16 | 2 | ±65 504 | All modern NVIDIA | Baseline serving; narrower range than BF16
FP8 E4M3 | 8 | 1 | ±448 | Hopper, Ada | W+A inference; best accuracy at 8-bit
FP8 E5M2 | 8 | 1 | ±57 344 | Hopper, Ada | Gradient storage during training
INT8 | 8 | 1 | −128 to +127 | Ampere+ tensor cores | SmoothQuant W8A8; solid accuracy on A100
NVFP4 / MX-FP4 | 4 | 0.5 | Limited; micro-block scaling | Blackwell (B100, B200) | Maximum throughput; calibration required
INT4 (AWQ/GPTQ) | 4 | 0.5 | −8 to +7 | All (dequant to FP16) | Weight-only; consumer GPU decode
BF16 vs FP16

BF16 has the same 8-bit exponent as FP32 (same dynamic range) but only 7 mantissa bits. FP16 has 10 mantissa bits but a narrower range (±65 504 vs ±3.4×10³⁸). For LLM serving, BF16 is generally preferred because large activations can overflow FP16.
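A two-line demonstration of the overflow point — this assumes the ml_dtypes package for a numpy-compatible bfloat16 (FP16 is built into numpy):

import numpy as np
import ml_dtypes   # assumption: pip install ml-dtypes

x = np.array([70000.0], dtype=np.float32)   # beyond FP16's ±65 504
print(x.astype(np.float16))                 # [inf] — overflow
print(x.astype(ml_dtypes.bfloat16))         # ~70144 — coarse but in range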

09

Weight-Only vs Weight-and-Activation Quantisation

Two fundamentally different techniques. The distinction is a frequent exam topic because they target different bottlenecks.

Weight-only (AWQ, GPTQ, INT4)

Weights stored at INT4; dequantised to FP16 before the matmul. Compute still runs in FP16.

  • Win: fewer bytes loaded per weight from HBM → improves decode TPOT in bandwidth-bound regime
  • No compute benefit: INT4 tensor cores not used; FP16 matmul runs as before
  • Accuracy: typically 0.5–2% degradation at INT4
  • Best for: consumer GPU decode where HBM bandwidth is the binding constraint

Weight-and-activation (SmoothQuant, FP8)

Both weights and activations quantised; matmul runs at INT8 or FP8 on tensor cores.

  • Win: up to 2× tensor core throughput (INT8 on Ampere, FP8 on Hopper) plus bandwidth reduction
  • Challenge: activations have dynamic, per-token range; harder to quantise uniformly
  • SmoothQuant: migrates quantisation difficulty from activations to weights via per-channel scale factor
  • Best for: data-centre GPUs (A100, H100) with native INT8/FP8 tensor core throughput

Calibration vs QAT

PTQ with calibration: small representative dataset determines per-channel scales post-training. Fast; accuracy loss is higher at INT4. Quantisation-aware training (QAT): simulates quantisation in the forward pass during fine-tuning. Recovers accuracy lost by PTQ; requires a training run.
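A sketch of the storage scheme shared by weight-only INT4 methods — symmetric per-channel quantisation with one FP16 scale per row, without AWQ/GPTQ's calibration logic:

import numpy as np

def quantize_int4(W):
    scales = np.abs(W).max(axis=1, keepdims=True) / 7.0   # map per-row max |w| to 7
    q = np.clip(np.round(W / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize(q, scales):
    return q.astype(np.float16) * scales     # compute still runs in FP16

W = np.random.default_rng(0).normal(size=(4096, 4096)).astype(np.float32)
q, scales = quantize_int4(W)
err = np.abs(W - dequantize(q, scales).astype(np.float32)).mean()
print(err)           # small mean reconstruction error
print(q.nbytes)      # INT8 container here; packing two INT4 per byte halves it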

10

AWQ vs GPTQ

Both produce INT4 weight-only quantised models via post-training quantisation. They optimise for different objectives.

Aspect | AWQ | GPTQ
Paper | Lin et al., 2023 — arXiv:2306.00978 | Frantar et al., 2022 — arXiv:2210.17323
Core method | Identifies 1% of "salient" weights by activation magnitude; protects them, quantises the rest to INT4 | Layer-wise Hessian-based reconstruction; minimises second-order error in layer output
Calibration speed | Fast (minutes for 7B) | Slower (minutes to hours for 70B+)
Calibration data needed | Small representative set | Small representative set
Accuracy at INT4 | Slightly better on instruction-tuned models | Competitive; can edge ahead on some base-model benchmarks
TensorRT-LLM | Native (--qformat awq) | Native (--qformat gptq)
vLLM | Supported | Supported
Prefer when | Fast iteration; instruction-tuned deployment; Brendan's hardware | Maximising accuracy at fixed bit-width is the primary concern
Practical guidance

For most production deployments, the accuracy difference between AWQ and GPTQ is below the noise of task-specific evaluation variance. Use AWQ for faster iteration. Always evaluate on your target task metric, not perplexity alone. Both are supported natively in TensorRT-LLM and vLLM.

11

FP8 and FP4 — Hardware-Native Precision

Unlike weight-only INT4 (which dequantises before compute), these formats run the matmul itself at reduced precision on dedicated tensor cores.

FP8 — Hopper (H100) and Ada Lovelace

Two encodings: E4M3 (4-bit exponent, 3-bit mantissa; range ±448) and E5M2 (5-bit exponent, 2-bit mantissa; range ±57 344). For inference, E4M3 is the primary format — higher precision, lower range. E5M2 is used for gradient storage. H100 FP8 tensor cores deliver roughly 2× the FLOPs of BF16 tensor cores. TensorRT-LLM FP8 quantisation via --qformat fp8 is the recommended production path on Hopper.

FP4 — Blackwell (B100, B200)

NVIDIA's NVFP4 and the MX-FP4 micro-scaling format apply a shared scale per small block of weights, providing more dynamic range than uniform INT4 within a 4-bit budget. Blackwell FP4 tensor cores provide the highest throughput in the current generation. TensorRT-LLM exposes this via --qformat nvfp4.

Format | Hardware target | Weight bits | Activation bits | Key benefit
FP8 E4M3 | Hopper H100; Ada | 8 | 8 (FP8) | ~2× throughput vs BF16; near-lossless accuracy with calibration
NVFP4 / MX-FP4 | Blackwell B100, B200 | 4 + block scale | 8 (FP8) | Maximum throughput; QAT or careful calibration required
INT8 SmoothQuant | Ampere A100; Hopper | 8 | 8 (INT8) | ~2× throughput on Ampere; reliable accuracy
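To see the E4M3/E5M2 trade-off numerically, a short check — assuming the ml_dtypes package (the numpy FP8 dtypes used by JAX; float8_e4m3fn is the finite-only E4M3 variant):

import numpy as np
import ml_dtypes   # assumption: pip install ml-dtypes

x = np.array([3.14159, 448.0], dtype=np.float32)
print(x.astype(ml_dtypes.float8_e4m3fn))   # [3.25, 448] — 3-bit mantissa; 448 is the E4M3 max
y = np.array([3.14159, 57344.0], dtype=np.float32)
print(y.astype(ml_dtypes.float8_e5m2))     # [3.0, 57344] — wider range, coarser mantissa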
12

KV Cache Quantisation

Weight quantisation and KV cache quantisation are independent choices. Quantising the KV cache reduces long-context memory pressure without affecting model weight accuracy.

Why it is separate from weights

KV tensors are computed at inference time from activations. They are not model parameters. Quantising them to INT8 or FP8 post-computation compresses the in-HBM footprint during decoding.

TensorRT-LLM separates this as a distinct flag: --kv_cache_dtype fp8 or int8, independent of weight quantisation format.

Trade-offs

  • Memory win: INT8 KV halves KV cache size; FP8 KV provides the same halving with marginally better accuracy
  • Accuracy cost: small but task-dependent; long-context retrieval tasks are most sensitive
  • Hardware: INT8 KV on Ampere; FP8 KV on Hopper for native dequantise step
  • Combined with INT4 weights: the most memory-efficient configuration; practical on consumer hardware
Impact on Brendan's hardware

RTX 4000 Ada (20 GB): 7B FP16 model leaves ~6 GB for KV. INT8 KV doubles usable context length or batch size within that budget. RTX 3080 (10 GB): 7B INT4 weights occupy ~4 GB, leaving 6 GB; INT8 KV is essentially mandatory for any serious context length.
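The context-length arithmetic behind those numbers, as a sketch built on the section 03 formula:

def max_context(hbm_gb, weight_gb, n_layers, n_kv_heads, head_dim, kv_bytes, batch=1):
    """Tokens of KV cache that fit in the HBM left over after weights."""
    budget = (hbm_gb - weight_gb) * 2**30
    per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * batch
    return int(budget // per_token)

# RTX 4000 Ada: 7B FP16 weights (~14 of 20 GB), 32 layers / 32 KV heads:
print(max_context(20, 14, 32, 32, 128, kv_bytes=2))  # FP16 KV -> ~12k tokens
print(max_context(20, 14, 32, 32, 128, kv_bytes=1))  # INT8 KV -> ~24k — doubled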

13

Speculative Decoding

Autoregressive decoding is sequential by construction. Speculative decoding breaks the serialisation with a draft-then-verify protocol that preserves the target model output distribution exactly.

Draft model generates k candidate tokens (fast; small model or extra heads)
Target model verifies all k tokens in a single parallel forward pass
Accept longest matching prefix; reject at first token where draft diverges
On rejection: take target's corrected token; restart draft from there

Speedup arithmetic

Let α = draft token acceptance rate, k = draft tokens per cycle. Expected tokens per cycle ≈ 1 + α⋅k (one target token always accepted, plus accepted drafts). At α = 0.8, k = 5: expected 5 tokens per verification step instead of 1 → up to 2–3× TPOT improvement in practice.
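A toy end-to-end cycle with greedy verification — hypothetical token-id "models", not a real sampler; production systems use rejection sampling so stochastic decoding also matches the target distribution exactly:

import random
random.seed(0)

def speculative_step(draft_next, target_next, ctx, k=5):
    """One draft/verify cycle; returns the tokens emitted."""
    drafts = []
    for _ in range(k):                       # k cheap sequential draft calls
        drafts.append(draft_next(ctx + drafts))
    # In a real system this is ONE parallel target forward pass over all positions:
    verified = [target_next(ctx + drafts[:i]) for i in range(k + 1)]
    emitted = []
    for i in range(k):
        if drafts[i] == verified[i]:
            emitted.append(drafts[i])        # accept matching draft token
        else:
            emitted.append(verified[i])      # target's correction; stop the run
            return emitted
    emitted.append(verified[k])              # all k accepted: free bonus token
    return emitted

# Toy models: target counts upward; the draft agrees with probability 0.8.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: target(ctx) if random.random() < 0.8 else -1

ctx, cycles = [0], 0
while len(ctx) < 200:
    ctx += speculative_step(draft, target, ctx)
    cycles += 1
# Accepted runs are geometric in the per-token rate, so this lands below the
# optimistic 1 + α·k figure — profile acceptance before relying on the speedup.
print(f"{len(ctx) - 1} tokens in {cycles} verify passes "
      f"(~{(len(ctx) - 1) / cycles:.1f} tokens/pass vs 1 without speculation)")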

Medusa (Cai et al., 2024 — arXiv:2401.10774)

Extra decoding heads on the target model; head k predicts token at position +k. No separate draft model. Lower acceptance rates than autoregressive drafts but zero separate-model cost. Supported in TensorRT-LLM.

EAGLE / EAGLE-3 (Li et al., 2024)

Auto-regressive draft model trained to predict one step ahead at hidden-state level, not token level. Higher acceptance rates than Medusa. EAGLE-3 is the current iteration; natively supported in TensorRT-LLM.

When it fails

Below α ≈ 0.5, speculative decoding adds overhead without throughput gain. Creative generation tasks with broad target distributions see low acceptance rates. Always profile acceptance rate before deploying a speculative setup in production.

14

Multi-LoRA Inference

Fine-tuned LoRA adapters share a base model. Batching cross-adapter requests in a single forward pass avoids reloading base weights on each adapter switch.

Base weights (shared, in HBM)
Shared base forward pass for all concurrent requests
Per-request LoRA delta A×B via fused CUDA kernel
Each request uses its own adapter output

S-LoRA (Sheng et al., 2023)

Unified paging for LoRA adapters with custom batched-gather CUDA kernels that handle heterogeneous adapter ranks. Demonstrated hundreds of adapters served from a single GPU. The foundation for vLLM multi-LoRA support.

Punica

Alternative implementation built around the SGMV (segmented gather matrix-vector) kernel, which batches LoRA computation across adapters of varying rank. Similar throughput to S-LoRA. Both S-LoRA and Punica underpin vLLM's multi-LoRA serving.

Memory arithmetic

A rank-16 LoRA adapter on a 7B model contributes on the order of 20–50 MB. The base model is ~14 GB at FP16. Storing 50 adapters adds 1–2.5 GB. The base model dominates; adapters are inexpensive to keep in VRAM simultaneously.
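The adapter-size arithmetic behind those figures, as a sketch — assumed 7B-class shapes: hidden size 4096, 32 layers, rank-16 adapters on the four attention projections:

def lora_bytes(hidden=4096, n_layers=32, rank=16, n_matrices=4, bytes_per=2):
    # Each adapted matrix adds A (hidden × r) and B (r × hidden): 2·hidden·rank params.
    return n_layers * n_matrices * 2 * hidden * rank * bytes_per

per_adapter = lora_bytes()
print(per_adapter / 2**20)        # ≈ 32 MiB per adapter at FP16
print(50 * per_adapter / 2**30)   # 50 adapters ≈ 1.6 GiB vs ~14 GB base model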

15

Inference Framework Matrix

For a detailed local-hardware comparison, see LLM_Hub_Local_LLM_Hosting.

Framework | Primary strength | Hardware | Pick when…
TensorRT-LLM | NVIDIA-optimised throughput; FP8/FP4; IFB; speculative decoding; engine compilation | NVIDIA only (A10G, A100, H100, Ada) | Maximum tok/s in production on NVIDIA; NIM deployment
vLLM | PagedAttention; OpenAI-compatible API; broad model zoo; multi-LoRA | NVIDIA + AMD (ROCm) | Good throughput without compilation; rapid deployment; team comfort with Python API
SGLang | RadixAttention (prefix caching); structured generation; fast TTFT | NVIDIA | Agentic workloads; structured output; multi-turn with long shared prefixes
TGI | HuggingFace ecosystem integration; simple deployment | NVIDIA + AMD + Intel | Already using HF Hub; simpler ops; less throughput tuning needed
llama.cpp | CPU + consumer GPU; GGUF quantisation; widest hardware support | CPU, NVIDIA, Apple Metal | Non-NVIDIA hardware; extreme quantisation (Q2–Q4); CPU-only fallback
Ollama | Zero-config local serving; wraps llama.cpp | CPU, NVIDIA, Apple | Individual developer local use; not production-grade
16

TensorRT-LLM In Depth

Open-source (Apache 2.0 since October 2023) Python/C++ library. For full coverage see NVIDIA_GPU_19_TensorRT_LLM.

Author model with TRT-LLM Python API (or use pre-built recipes for Llama, Mistral, Mixtral, Qwen, Gemma…)
Quantise: --qformat fp8 | awq | gptq | nvfp4 | smoothquant
trtllm-build — kernel fusion, tactic selection, bakes max_batch / max_seq / precision into binary
engine.bin + config.json — frozen, GPU-architecture-specific
Triton Inference Server (tensorrtllm_backend) or NIM container
Feature | TensorRT-LLM support
Quantisation | FP8 E4M3, NVFP4, INT4 AWQ, INT4 GPTQ, INT8 SmoothQuant
In-flight batching | Yes — iteration-level in the C++ runtime; default with batching_strategy: inflight_fused_batching
Paged KV cache | Yes — configurable block size; prefix caching; host memory offload
Speculative decoding | Draft model, Medusa heads, EAGLE-3
Parallelism | Tensor (TP), pipeline (PP), expert (EP) for MoE
Multi-LoRA | Yes — multiple adapters, runtime switching
Key distinction from vLLM

TensorRT-LLM is an ahead-of-time compiler. The engine is frozen to a specific GPU architecture, precision, and shape configuration. Build time: 5–30 minutes for a 70B model. vLLM uses torch.compile JIT. TRT-LLM trades flexibility for maximum throughput; vLLM trades throughput for faster iteration.

17

Hardware Mapping — RTX 3080 vs RTX 4000 Ada

Consumer and workstation hardware with different precision capabilities. The cert focuses on A100/H100 distinctions, but understanding the consumer tier clarifies the Ampere/Ada capability split.

RTX 3080 (Ampere, 10 GB GDDR6X)

  • INT8 tensor cores: yes
  • FP8 native: no (Hopper/Ada feature)
  • FP4: no
  • MIG: not on consumer Ampere
  • Memory bandwidth: ~760 GB/s
  • 7B INT4: ~4 GB weights, ~6 GB KV budget
  • 13B INT4: ~7 GB weights, ~3 GB KV — tight

RTX 4000 Ada (Ada Lovelace, 20 GB GDDR6)

  • INT8 tensor cores: yes
  • FP8 compute: yes (E4M3/E5M2)
  • FP4: no (Blackwell only)
  • MIG: not on Ada consumer/workstation cards
  • Memory bandwidth: ~432 GB/s
  • 7B BF16: ~14 GB weights, ~6 GB KV budget
  • 13B INT4: ~7 GB weights, ~13 GB KV budget
  • 30B INT4: ~15–16 GB weights, 4–5 GB KV — short context only
Bandwidth note

The RTX 3080 has higher memory bandwidth (760 vs 432 GB/s) despite having half the VRAM. For bandwidth-bound decode TPOT, the 3080 can outperform the 4000 Ada at equivalent model and precision if the model fits. The 4000 Ada wins on batch capacity and model-size headroom (20 GB).

18

Worked Latency Budget

Back-of-envelope estimate for a 7B INT4 model on RTX 4000 Ada at batch 1. Derived from formulas, not measured benchmarks.

Latency estimate — 7B INT4, RTX 4000 Ada, batch 1, single request
--- TTFT (prefill, 512 input tokens) ---
Prefill is compute-bound.
FLOPs ≈ 2 × 7×10^9 params × 512 tokens = 7.2×10^12 FLOPs
RTX 4000 Ada peak BF16 tensor core throughput ≈ 40 TFLOPS
TTFT ≈ 7.2×10^12 / 40×10^12 ≈ ~180 ms (order of magnitude)

--- TPOT (decode, INT4 weights, bandwidth-bound) ---
INT4 7B weights: ~3.5 GB effective bytes loaded per forward pass
RTX 4000 Ada bandwidth: ~432 GB/s
TPOT ≈ 3.5×10^9 / 432×10^9 ≈ ~8 ms/token (~125 tok/s)

--- End-to-end, 256 output tokens ---
E2E ≈ TTFT + 256 × TPOT
    ≈ 180 ms + 256 × 8 ms ≈ ~2.2 s

TPOT dominates: decode accounts for ~92% of end-to-end latency.
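The same budget as a parameterisable sketch, so the estimate can be re-run for other models and GPUs — peak numbers are approximate spec-sheet values, not measurements:

def latency_budget(params, in_tokens, out_tokens, peak_flops, bandwidth, weight_bytes):
    ttft = 2 * params * in_tokens / peak_flops   # prefill: compute-bound
    tpot = weight_bytes / bandwidth              # decode: bandwidth-bound
    return ttft, tpot, ttft + out_tokens * tpot

ttft, tpot, e2e = latency_budget(
    params=7e9, in_tokens=512, out_tokens=256,
    peak_flops=40e12,     # RTX 4000 Ada BF16 tensor cores, approx.
    bandwidth=432e9,      # ~432 GB/s
    weight_bytes=3.5e9,   # 7B at INT4
)
print(f"TTFT ~{ttft*1e3:.0f} ms, TPOT ~{tpot*1e3:.1f} ms, E2E ~{e2e:.2f} s")
# TTFT ~179 ms, TPOT ~8.1 ms, E2E ~2.25 s — decode dominates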
Caveat

These are order-of-magnitude estimates from published formulas. Real throughput depends on kernel efficiency, memory allocation overhead, CUDA stream scheduling, and the dequantisation cost of INT4 weights. Use them to develop intuition about the relative contribution of TTFT vs TPOT, not as deployment-grade benchmarks.

19

Optimisation Priority List

What to reach for first, in order of typical expected impact for a standard LLM serving deployment.

Priority 1
Weight quantisation (INT4 AWQ/GPTQ or INT8 SmoothQuant / FP8) — largest free win
Priority 2
Continuous / in-flight batching — mandatory for variable-length production traffic
Priority 3
PagedAttention / paged KV — enables larger effective batch size without fragmentation
Priority 4
KV cache quantisation (INT8/FP8) — extends context length on constrained HBM
Priority 5
Model selection with GQA/MQA — choose architectures that reduce n_kv_heads
Priority 6
Speculative decoding — significant gain if draft acceptance > ~0.7

Items 1–3 are essentially universal. FlashAttention is assumed on by default in all modern frameworks and does not appear as a discrete optimisation step. Items 4–6 require profiling for the specific workload.

20

GQA / MQA — Reducing KV Head Count

An architecture-level choice that drastically reduces KV cache size with minimal accuracy impact at scale. Selecting models that implement GQA is therefore a free optimisation at deployment time.

MHA

Multi-Head Attention. Each query head has its own K and V head. n_kv_heads = n_q_heads. Llama 2 7B: 32 KV heads.

GQA

Grouped-Query Attention. Groups of query heads share one KV pair. Llama 3 8B: 8 KV heads vs 32 Q heads (4× reduction). Llama 3 70B: 8 KV heads vs 64 Q heads (8× reduction).

MQA

Multi-Query Attention. All Q heads share a single KV pair. Maximum KV reduction; some accuracy loss at smaller scales. Used by Falcon.

KV cache reduction with GQA — Llama 3 70B vs hypothetical MHA 70B
Hypothetical MHA 70B (64 KV heads):
  2 × 80 layers × 64 kv_heads × 128 head_dim × 4096 tokens × 2 bytes ≈ 10.7 GB

Actual GQA 70B (8 KV heads):
  2 × 80 layers × 8 kv_heads × 128 head_dim × 4096 tokens × 2 bytes ≈ 1.3 GB

Reduction: 8× — from the formula alone, no measured benchmark.
21

Exam Angles

Common question patterns for the Model Optimisation (17%) and GPU Acceleration (14%) domains. Source: notes/08_inference_optimisation.md.

Numerical / calculation

  • Given architecture params, compute KV cache at N tokens, batch B
  • Given GQA n_kv_heads, compute reduction vs MHA
  • Given INT8 KV, compute memory saving
  • Given bandwidth and weight bytes at precision X, estimate TPOT

Technique selection

  • "KV cache fragments memory" → PagedAttention
  • "GPU idles waiting for longest request" → continuous batching
  • "Prefill of long prompt is slow" → FlashAttention (not KV quant)
  • "Decode TPOT bounded by HBM bandwidth" → weight-only quantisation
  • "Multiple fine-tuned adapters, one base" → S-LoRA / multi-LoRA

Framework / tooling

  • TRT-LLM ahead-of-time compilation vs vLLM JIT — know the trade-off
  • Which framework is NVIDIA's recommended production path?
  • AWQ vs GPTQ: what each optimises; when each wins
  • FP8 native support: Hopper and Ada yes; Ampere no (software emulation only)

Speculative decoding

  • Output distribution: identical to target model (not approximate)
  • When does it fail? Low draft acceptance rate (< ~0.5)
  • Medusa: extra decoding heads; EAGLE-3: feature-level draft
  • Speedup source: memory-bandwidth-bound regime; multiple tokens per verify step
22

Cross-References

Cert-focused tour: this deck maps techniques to exam domains. For depth on implementation, see the portfolio repos below.

Topic | Resource
TensorRT-LLM: engine builder, IFB, paged KV, FP8/FP4, speculative | NVIDIA_GPU_19_TensorRT_LLM
vLLM, Ollama, TGI, llama.cpp local comparison | LLM_Hub_Local_LLM_Hosting
NVIDIA GPU memory hierarchy and bandwidth | NVIDIA_GPU_04_Memory_Hierarchy
Tensor cores: INT8 / FP8 throughput arithmetic | NVIDIA_GPU_03_Tensor_Cores
Full NVIDIA GPU architecture series | LLM_Hub_NVIDIA_GPUs
NVIDIA stack: Triton, NIM, NeMo (deck 05 in this series) | 05_nvidia_stack_overview
PagedAttention / vLLM paper | arXiv:2309.06180 — Kwon et al., 2023
FlashAttention v1 | arXiv:2205.14135 — Dao et al., 2022
FlashAttention v2 | arXiv:2307.08691 — Dao, 2023
AWQ | arXiv:2306.00978 — Lin et al., 2023
GPTQ | arXiv:2210.17323 — Frantar et al., 2022
Medusa speculative decoding | arXiv:2401.10774 — Cai et al., 2024
TensorRT-LLM official docs | developer.nvidia.com/tensorrt-llm