When you need every drop of performance and you've already paid for an H100, TensorRT-LLM is what NVIDIA actually uses to publish those headline tok/s numbers. Walk through the engine builder, in-flight batching, paged-KV, FP8/FP4 quantization, speculative decoding, and the production serving paths — including how it stacks against vLLM.
A practical walk-through of TensorRT-LLM — the library NVIDIA reaches for whenever they want to publish a tok/s number that wins a slide. What it is, how it builds engines, what those engines actually do, and the honest comparison with vLLM.
Three layers of the stack get conflated all the time. Pulling them apart is the first step to understanding why TensorRT-LLM looks the way it does.
NVIDIA's general-purpose inference compiler, shipping since 2017. Takes ONNX or PyTorch graphs, fuses kernels, picks the fastest tactic per matmul, and emits a serialised engine optimised for one specific (GPU, precision, shape) combination. Not LLM-aware on its own.
A Python library on top of TensorRT plus an LLM-specific runtime: paged-KV, in-flight batching, attention plugins, top-tier model recipes (Llama, Mistral, Mixtral, DeepSeek, Qwen, Phi, GPT-J/NeoX, Gemma). This is where all the LLM-specific magic lives.
Pair it with Triton Inference Server (open-source, multi-model, full HTTP/gRPC) or NVIDIA NIM (a pre-baked container with the engine, Triton, and an OpenAI-compatible API) for the actual HTTP path. TRT-LLM itself is a library.
Think of TensorRT-LLM as the analogue of vLLM, but built around an ahead-of-time compiler rather than a JIT runtime. You spend minutes building a frozen engine; in return the engine knows everything — shapes, precision, GPU, kernel choice — before the first token ever runs.
The end-to-end path from a HuggingFace checkpoint to a serialised TensorRT engine. The headline cost is build time; the headline win is that the engine is fully specialised before serving starts.
# 1. Convert HF checkpoint to TRT-LLM checkpoint format
#    (the unquantized path; skip it when quantizing, since step 2
#    writes a TRT-LLM checkpoint of its own)
python convert_checkpoint.py \
--model_dir ./Meta-Llama-3-70B-Instruct \
--output_dir ./tllm_ckpt/70b-bf16 \
--dtype bfloat16 \
--tp_size 8
# 2. Quantize (FP8 in this example; alternatives: AWQ-INT4, NVFP4, MX-FP4)
python ../quantization/quantize.py \
--model_dir ./Meta-Llama-3-70B-Instruct \
--output_dir ./tllm_ckpt/70b-fp8 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--tp_size 8 \
--calib_size 512
# 3. Build the TensorRT engine
trtllm-build \
--checkpoint_dir ./tllm_ckpt/70b-fp8 \
--output_dir ./engines/70b-fp8-h100-tp8 \
--gemm_plugin fp8 \
--max_batch_size 128 \
--max_input_len 8192 \
--max_output_len 2048 \
--use_paged_context_fmha enable
# 4. per-rank engine files (rank0.engine ... rank7.engine for TP8) + config.json saved → ready to serve
| Dimension | Decision time | Cost of changing |
|---|---|---|
| GPU architecture | at build | full rebuild — an H100 engine doesn't run on B200 |
| Precision (BF16, FP8, FP4, AWQ) | at build | full rebuild — tactics differ per precision |
| max_batch_size | at build | rebuild; KV pool sized for the cap |
| max_input_len / max_output_len | at build | rebuild; positional + KV shapes are fixed |
| Tensor / pipeline / expert parallel sizes | at build | rebuild; partitioning is graph-level |
| Sampler params (temp, top-p, top-k) | at runtime | free |
| Number of concurrent requests (≤ max_batch_size) | at runtime | free |
For a 70B-class model, expect 5–30 minutes per build on the same hardware you'll serve on. The builder runs every candidate kernel against representative inputs and picks the best — this profiling is what makes the engine fast, but it's not free. Cache your engines and version them next to the checkpoint.
The compiler pipeline that turns a Python graph into a frozen, GPU-specific binary. Each step matters; the plugins step is the secret sauce.
A naive Llama decoder layer is dozens of CUDA kernel launches per token: rmsnorm → quant → q_proj → k_proj → v_proj → rope → attention → o_proj → rmsnorm → gate_proj → up_proj → silu → multiply → down_proj. The builder collapses many of these into single kernels, eliminating round-trips through HBM and saving launch overhead.
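To make the fusion concrete, here is a pure-PyTorch sketch of the SwiGLU MLP before and after the kind of merging the builder does (toy shapes; illustrative only, not the actual TensorRT kernels):

```python
import torch
import torch.nn.functional as F

# Naive SwiGLU MLP: five separate kernel launches, every intermediate
# tensor written to and re-read from HBM.
def naive_mlp(x, w_gate, w_up, w_down):
    g = x @ w_gate            # gate_proj
    u = x @ w_up              # up_proj
    a = F.silu(g)             # silu
    a = a * u                 # multiply
    return a @ w_down         # down_proj

# Roughly what the fused version computes: gate and up merged into one
# wide GEMM, with the SiLU-multiply folded into its epilogue, so the
# activation never round-trips through HBM between ops.
def fused_mlp(x, w_gate_up, w_down):
    g, u = (x @ w_gate_up).chunk(2, dim=-1)
    return (F.silu(g) * u) @ w_down

x = torch.randn(1, 512)
w_gate, w_up = torch.randn(512, 1024), torch.randn(512, 1024)
w_down = torch.randn(1024, 512)
assert torch.allclose(naive_mlp(x, w_gate, w_up, w_down),
                      fused_mlp(x, torch.cat([w_gate, w_up], dim=1), w_down),
                      rtol=1e-4, atol=1e-2)
```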
Hand-written fused multi-head attention with paged-KV awareness. Hopper variant uses WGMMA + TMA; Blackwell variant adds NVFP4 / MX-FP4 microblock support. This is where the bulk of the H100/B200 wins come from.
CUTLASS-based, picks per-shape tactics. Supports FP8, FP4, INT4 (AWQ packing), and W8A8 SmoothQuant layouts. Falls back to cuBLAS for shapes the templating doesn't cover.
Expert-parallel scatter/gather with on-chip token sort. Implements grouped GEMM so all experts on a rank issue one fat matmul instead of N skinny ones.
Tiny but ubiquitous — folded into the surrounding GEMMs whenever possible. RoPE is fused into the QKV projection; RMSNorm is fused with the following quantize.
TensorRT-LLM has been Apache 2.0 since October 2023, but the underlying TensorRT compiler is closed, and the engine binary format is opaque. You can read the Python wrapper, but you can't easily inspect what kernel ran for which matmul. Treat the engine as a black box you trust because nvidia-smi says it's hitting 80% SM utilisation.
The single largest throughput improvement TensorRT-LLM brings vs naive serving. Same fundamental idea as vLLM's continuous batching, executed in TRT-LLM's high-performance C++ runtime.
All N requests in the batch must finish before the next batch starts. If request 1 generates 2000 tokens and request 2 generates 50, request 2's slot does no useful work for roughly 1950 steps while the whole batch waits on request 1.
The active request set is recomputed every decode step. Finished requests leave; new requests join mid-step at the next prefill window.
IFB is on by default once you build with --use_paged_context_fmha enable and serve via Triton's tensorrtllm_backend with batching_strategy: inflight_fused_batching. Don't run TensorRT-LLM in production without it — you'd be leaving most of the throughput on the table.
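The scheduling idea in toy form (illustrative only; the production scheduler is TRT-LLM's C++ batch manager):

```python
import random
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    max_new: int
    produced: int = 0
    output: list = field(default_factory=list)

def decode_step(batch):
    # stand-in for one fused decode step over the whole active batch
    for r in batch:
        r.output.append(f"tok{r.produced}")
        r.produced += 1

def serve(incoming, max_batch_size=4):
    active, done = [], []
    while incoming or active:
        # admit new requests mid-flight, up to the engine's batch cap
        while incoming and len(active) < max_batch_size:
            active.append(incoming.pop(0))        # prefill happens here
        decode_step(active)
        # finished requests leave immediately, freeing their slot
        done += [r for r in active if r.produced >= r.max_new]
        active = [r for r in active if r.produced < r.max_new]
    return done

reqs = [Request(i, max_new=random.randint(2, 10)) for i in range(8)]
for r in serve(reqs):
    print(f"request {r.rid}: {len(r.output)} tokens")
```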
Same insight as vLLM's PagedAttention: the KV cache fragments memory at runtime, so manage it like a virtual-memory system instead of a contiguous tensor.
Naive KV pre-allocates max_seq_len × max_batch_size × n_layers × 2 × n_kv_heads × head_dim × sizeof(dtype) contiguously (n_kv_heads, not query heads, on GQA models). On a 70B FP8 model with 8k context and batch 64, that's tens of GB reserved even when most slots are nearly empty.
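Plugging in Llama-3-70B's geometry (80 layers, 8 KV heads under GQA, head_dim 128) makes the claim concrete:

```python
# First-order KV reservation for the naive contiguous layout.
n_layers, n_kv_heads, head_dim = 80, 8, 128   # Llama-3-70B geometry
bytes_per_el = 1                              # FP8 KV cache
max_seq_len, max_batch = 8192, 64

per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_el  # K and V
total = max_seq_len * max_batch * per_token
print(f"{per_token / 1024:.0f} KiB/token, {total / 2**30:.0f} GiB reserved")
# -> 160 KiB/token, 80 GiB reserved, even when most slots hold short sequences
```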
Split the KV cache into fixed-size blocks (typically 64 or 128 tokens). Maintain a per-request page table. Allocate blocks on demand as sequences grow; free them when sequences finish. Any free block can serve any request — no fragmentation.
trtllm-build \
# paged context FMHA: lets prefill read paged KV blocks directly
# (the paged KV cache itself is the default in modern versions)
--use_paged_context_fmha enable \
# block size in tokens — 64 or 128 typical
--tokens_per_block 128 \
# enable prefix sharing in the runtime
--enable_kv_cache_reuse \
# cap on batched tokens per forward pass (sizes activation workspace)
--max_num_tokens 131072
Paged KV is the difference between serving 8 and 80 concurrent users on the same GPU. Without it the KV cache reservation eats all your headroom; with it the cache fills only with what's actively in flight.
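A toy sketch of the mechanism, a shared free list plus per-request page tables (illustrative only, not TRT-LLM's implementation):

```python
class PagedKV:
    def __init__(self, n_blocks, tokens_per_block=128):
        self.tokens_per_block = tokens_per_block
        self.free = list(range(n_blocks))   # any free block can serve any request
        self.pages = {}                     # request id -> [block ids]
        self.lengths = {}                   # request id -> tokens stored

    def append_token(self, rid):
        n = self.lengths.get(rid, 0)
        if n % self.tokens_per_block == 0:  # current block full: allocate on demand
            if not self.free:
                raise MemoryError("pool exhausted; scheduler defers the request")
            self.pages.setdefault(rid, []).append(self.free.pop())
        self.lengths[rid] = n + 1

    def release(self, rid):                 # sequence done: blocks return to the pool
        self.free += self.pages.pop(rid, [])
        self.lengths.pop(rid, None)

kv = PagedKV(n_blocks=1024)
for _ in range(300):
    kv.append_token("req-A")
print(len(kv.pages["req-A"]), "blocks for a 300-token sequence")  # ceil(300/128) = 3
kv.release("req-A")
```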
What precisions you can actually use, per architecture. The matrix is messier than vLLM's because TRT-LLM exposes more knobs; that's also why it can extract more performance.
| Precision | Ampere (8.x) | Ada (8.9) | Hopper (9.0) | Blackwell (10.x/12.x) | Notes |
|---|---|---|---|---|---|
| FP16 / BF16 | ✓ | ✓ | ✓ | ✓ | Always available; the floor. |
| FP8 E4M3 weights+activations | — | ✓ | ✓ | ✓ | Silicon present across Ada (incl. consumer 4090, L40S); 2× over BF16 on prefill. Transformer Engine integration on Hopper for dynamic scaling. |
| FP8 KV cache | — | ✓ | ✓ | ✓ | Halves KV memory; tiny accuracy loss. |
| AWQ INT4 weights + FP16 activations | ✓ | ✓ | ✓ | ✓ | Best on Ampere; ubiquitous on consumer cards. |
| GPTQ INT4 | ✓ | ✓ | ✓ | ✓ | Older format; AWQ usually preferred. |
| SmoothQuant W8A8 | ✓ | ✓ | ✓ | ✓ | INT8 weights and activations; older recipe but cheap. |
| NVFP4 / MX-FP4 weights | — | — | — | ✓ | Microblock FP4; 2× FP8 throughput on Blackwell tensor cores. NVFP4 is the TRT-LLM default (16-elem blocks, E4M3 scale + per-tensor FP32 scale — more accurate); MX-FP4 is the OCP standard (32-elem blocks, E8M0 scale). |
# quantize.py runs the model on calibration data,
# computes per-tensor scaling factors, and writes
# a TRT-LLM checkpoint with FP8 weights + scales baked in.
python ../quantization/quantize.py \
--model_dir ./Meta-Llama-3-70B-Instruct \
--dtype bfloat16 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--output_dir ./tllm_ckpt/70b-fp8 \
--calib_size 512 \
--calib_dataset cnn_dailymail \
--tp_size 8
Blackwell accelerates two FP4 microblock formats. NVFP4 (the TRT-LLM default) uses 16-element blocks with an E4M3 per-block scale plus a per-tensor FP32 scale — more accurate. MX-FP4 is the open OCP standard with 32-element blocks and an E8M0 (8-bit exponent) per-block scale. Both run at 2× the FP8 tensor-core rate on a B200; a fully quantised 70B is then ~35 GB and fits with massive KV headroom on a single B200. This is the path to running 405B-class models on a single 8-GPU HGX board.
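To see what block-scaled FP4 actually does to the weights, here is a plain-PyTorch simulation of the NVFP4 recipe (fp32 stand-ins throughout; real kernels pack two FP4 values per byte and store the block scale in E4M3, so treat this as illustration only):

```python
import torch

FP4_GRID = torch.tensor([0., .5, 1., 1.5, 2., 3., 4., 6.])  # E2M1 magnitudes

def quantize_nvfp4_block(w, block=16):
    w = w.reshape(-1, block)
    scale = w.abs().amax(dim=1, keepdim=True) / 6.0          # per-16-element scale
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    q = w / scale
    # snap each value to the nearest representable FP4 magnitude, keep the sign
    idx = (q.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(-1)
    return q.sign() * FP4_GRID[idx] * scale                  # dequantised view

w = torch.randn(4096, 4096)
w_q = quantize_nvfp4_block(w).reshape_as(w)
print("relative error:", ((w - w_q).norm() / w.norm()).item())
# Storage: 4 bits/weight plus one E4M3 scale byte per 16 weights,
# i.e. 70e9 * 0.5 bytes = 35 GB of weights plus ~12% scale overhead.
```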
Decode is bandwidth-bound: each step reads all weights to produce one token. Speculative decoding reads the weights once but verifies several token candidates at once — turning the same memory traffic into multiple accepted tokens.
(a) Draft-target: a small fast model (e.g. Llama-3-8B) proposes N tokens; the big model (Llama-3-70B) verifies all N in a single forward pass. Accepted tokens come for free; rejected ones cost only the discarded suffix.
(b) Medusa heads: extra LM heads attached to the same base model predict multiple future tokens simultaneously. The base model's last hidden state feeds N heads, each predicting position +1, +2, +3, ...
(c) Lookahead / prompt lookup: search the runtime cache for token sequences the prefix has matched before. If "the quick brown" is followed by "fox" in cache history, propose "fox" without running any model. Combines well with (a) and (b).
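The acceptance logic behind (a), in toy form (greedy variant, pure Python; the real runtime does the verification inside a single batched engine pass):

```python
def speculative_step(draft_next, target_next, prefix, k=5):
    # k cheap draft steps propose a candidate continuation
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # One target forward over prefix+proposal scores every position;
    # target_next is called per position here only for clarity.
    accepted, ctx = [], list(prefix)
    for t in proposal:
        if target_next(ctx) != t:       # first disagreement: discard the suffix
            break
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_next(ctx))   # the target's own next token comes free
    return accepted                     # 1..k+1 tokens per big-model pass

# Hypothetical stand-in "models": the draft mostly agrees with the target.
target = lambda ctx: (sum(ctx) + len(ctx)) % 50
draft = lambda ctx: target(ctx) if len(ctx) % 7 else (target(ctx) + 1) % 50
print(speculative_step(draft, target, [1, 2, 3]))
```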
# Build the draft engine (small model)
trtllm-build \
--checkpoint_dir ./tllm_ckpt/llama3-8b-fp8 \
--output_dir ./engines/draft \
--gemm_plugin fp8 \
--max_batch_size 128
# Build the target engine, telling it to expect speculative decode
trtllm-build \
--checkpoint_dir ./tllm_ckpt/llama3-70b-fp8 \
--output_dir ./engines/target \
--gemm_plugin fp8 \
--speculative_decoding_mode draft_target \
--max_draft_len 5
# Serve via Triton's tensorrtllm_backend with both engines in the
# model repository (the draft/target pairing is configured in the
# backend's model config, not via tritonserver CLI flags)
tritonserver --model-repository=./repo
Best on workloads where most tokens are predictable: code completion, structured JSON output, RAG with pinned templates. Worst on creative generation where every token is uncertain — acceptance rate falls and the verification overhead dominates. Always benchmark on your traffic before turning it on.
TensorRT-LLM partitions the compute graph at build time, so each rank's engine knows exactly what tensors it owns and which collectives it must emit. Three orthogonal axes; combine as needed.
Each transformer layer is split across GPUs (e.g. each rank holds 1/8 of every projection). NCCL all-reduce after attention and MLP merges the partial results.
Different layers on different GPUs; activations flow GPU→GPU as a pipeline. Only hidden-state-sized tensors cross the link — works over PCIe.
For MoE models: each rank owns a subset of experts. After routing, all-to-all sends each token to the rank holding its experts; another all-to-all returns results.
Within an HGX/DGX board (8 GPUs all on one NVSwitch domain), set TP=8. Across boards or nodes, add PP. For MoE, slice the experts across ranks with EP=K (typically EP≤TP). The build emits one engine per rank, with the collectives baked into the graph.
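The arithmetic behind "all-reduce after attention and MLP", simulated with two in-process ranks (toy shapes; on real hardware each half lives on a different GPU and the final add is an NCCL collective):

```python
import torch

x = torch.randn(4, 512)
w = torch.randn(512, 512)
full = x @ w

# Column-parallel: each rank owns half the output columns; results
# concatenate with no reduction needed.
w_c0, w_c1 = w.chunk(2, dim=1)
col = torch.cat([x @ w_c0, x @ w_c1], dim=1)

# Row-parallel: each rank owns half the input dimension; each produces
# a partial sum, and the "+" below is the all-reduce on real hardware.
x0, x1 = x.chunk(2, dim=1)
w_r0, w_r1 = w.chunk(2, dim=0)
row = (x0 @ w_r0) + (x1 @ w_r1)

assert torch.allclose(full, col, atol=1e-3)
assert torch.allclose(full, row, atol=1e-3)
```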
python convert_checkpoint.py \
--model_dir ./Mixtral-8x22B \
--output_dir ./tllm_ckpt/mixtral-tp8-pp2 \
--dtype bfloat16 \
--tp_size 8 \
--pp_size 2 \
--moe_tp_size 1 \
--moe_ep_size 8
trtllm-build \
--checkpoint_dir ./tllm_ckpt/mixtral-tp8-pp2 \
--output_dir ./engines/mixtral \
--gemm_plugin fp8 \
--max_batch_size 128
Because TRT-LLM bakes the partitioning into the engine, every rank's binary is specialised to its slice — the runtime doesn't waste cycles deciding what to run. The downside is that changing TP/PP/EP requires a rebuild. Worth it for production; annoying for experimentation.
TensorRT-LLM is a library, not a server. To put HTTP in front of it, you pick one of three paths.
NVIDIA's open-source server with the tensorrtllm_backend. Full HTTP/gRPC, dynamic batching, multi-model, ensemble pipelines, model versioning, metrics; configured via a config.pbtxt per model. The production standard inside NVIDIA itself.
Pre-packaged microservices. Pull a container, get a TensorRT-LLM engine + Triton + an OpenAI-compatible API (/v1/chat/completions) in one shot. NVIDIA pre-tunes the engine per model and per supported GPU. What big enterprises actually deploy.
Use tensorrt_llm.runtime.ModelRunner directly inside your own Python or C++ service. For tightly-coupled cases (audio + LLM, custom routing, embedded agents) where Triton is overkill or in the wrong place.
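A minimal sketch of that embedded path, patterned on TRT-LLM's examples/run.py (exact signatures shift between releases, so verify against your installed version):

```python
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

tok = AutoTokenizer.from_pretrained("./Meta-Llama-3-70B-Instruct")
runner = ModelRunner.from_dir(engine_dir="./engines/70b-fp8-h100-tp8")

ids = tok("Explain paged KV in one sentence.", return_tensors="pt").input_ids
out = runner.generate(
    [ids[0]],                       # list of input id tensors = the batch
    max_new_tokens=128,
    end_id=tok.eos_token_id,
    pad_id=tok.eos_token_id,
    temperature=0.7,
    top_p=0.9,
)
print(tok.decode(out[0][0], skip_special_tokens=True))  # [request][beam]
```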
For most enterprises, the cost isn't the GPU bill — it's the engineering time spent calibrating quantization, picking max_batch_size, sweeping precisions, and re-running benchmarks for every new model release. NIM ships an engine that NVIDIA has already tuned on that exact GPU SKU, with sensible defaults baked in. You lose flexibility; you gain weeks per model.
docker login nvcr.io
docker pull nvcr.io/nim/meta/llama-3.3-70b-instruct:latest
docker run -d --gpus all \
-p 8000:8000 \
-e NGC_API_KEY=$NGC_API_KEY \
-v ~/.cache/nim:/opt/nim/.cache \
nvcr.io/nim/meta/llama-3.3-70b-instruct:latest
# OpenAI-compatible API ready on :8000/v1/chat/completions
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"meta/llama-3.3-70b-instruct","messages":[{"role":"user","content":"Hi"}]}'
Small platform team, many models, prefers low ops over deep tuning → NIM. Big platform team, fewer models, willing to invest in tuning → Triton + your own engines. Embedded use case → ModelRunner. Don't fight the recommended path for your size.
The honest comparison. Both are first-rate. Different sweet spots; most serious teams end up using both.
| Dimension | TensorRT-LLM | vLLM |
|---|---|---|
| Peak performance (H100/B200) | ~10–30% faster on a tuned engine | Excellent; closer to TRT-LLM each release |
| Peak performance (Ampere) | Smaller margin — both are bandwidth-bound | Very close to TRT-LLM |
| Build time | 5–30 min per (model × precision × shape) | Seconds — load and go |
| Flexibility at runtime | Engine is frozen; rebuild to change shapes | Same model serves any sequence length up to limit |
| New model time-to-support | Weeks — needs a recipe + plugin tuning | Days — HF integration usually drops fast |
| Quantization recipes | FP8, FP4 (NVFP4 default, MX-FP4), AWQ, GPTQ, SmoothQuant — calibrated | FP8, AWQ, GPTQ, BitsAndBytes — less hand-tuning |
| Speculative decoding | Draft, Medusa, EAGLE/Lookahead | Draft, Medusa (Eagle in progress) |
| Multi-GPU | TP, PP, EP — all build-time | TP, PP, EP — all runtime config |
| License | Apache 2.0 since 2023; plugins closed; engine format opaque | Apache 2.0; fully open |
| Server | Pair with Triton or NIM | Built-in OpenAI-compatible server |
| Best fit | Stable workload, fixed model, want every drop | Many models, frequent rotations, want simple ops |
Prototype on vLLM: load model, hit a port, ship something. Iterate freely on prompt, tools, and product.
Scale on TRT-LLM when (a) the model has stabilised, (b) traffic is predictable, and (c) the throughput / TTFT SLOs justify the operational cost of the build pipeline.
If you're rotating models weekly, or your traffic varies wildly in shape, or your team is small — the build-and-tune overhead is more than the 10–30% perf you'd recover. Stay on vLLM and spend the saved engineering on something else.
NVIDIA themselves use vLLM internally for development and TRT-LLM for the published throughput numbers. Use the right tool per phase. The framework war is over — both win for different jobs.
Pick a model size, GPU, precision, and shape envelope. Get back a fit estimate, parallelism recommendation, rough tok/s, and the matching trtllm-build flags.
The numbers above are first-order: weights are model_params × bytes/param, KV is batch × seq × layers × kv-bytes-per-token, decode tok/s is bandwidth-bound and prefill is FLOPS-bound. Real engines pay 10–25% overhead for activations, workspace, and CUDA graphs. Always benchmark before sizing the cluster.
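Those formulas as a hypothetical sizing helper (every name and number here is an assumption for illustration; benchmark before trusting it):

```python
def estimate(params_b, bytes_per_param, n_layers, kv_heads, head_dim,
             kv_bytes, batch, seq, hbm_gb, bw_tbs):
    weights_gb = params_b * bytes_per_param
    kv_gb = batch * seq * n_layers * 2 * kv_heads * head_dim * kv_bytes / 1e9
    fits = (weights_gb + kv_gb) * 1.15 <= hbm_gb      # ~15% overhead, mid-range
    # decode is bandwidth-bound: every step re-reads all the weights
    decode_tps = bw_tbs * 1e3 / weights_gb * batch
    return weights_gb, kv_gb, fits, decode_tps

w, kv, fits, tps = estimate(params_b=70, bytes_per_param=1,   # 70B at FP8
                            n_layers=80, kv_heads=8, head_dim=128, kv_bytes=1,
                            batch=32, seq=8192,
                            hbm_gb=80, bw_tbs=3.35)           # one H100 SXM
print(f"weights {w:.0f} GB, KV {kv:.0f} GB, fits={fits}, ~{tps:.0f} tok/s decode")
# fits=False on a single H100: exactly why the builds above use TP8
```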