When you need every drop of performance and you've already paid for an H100, TensorRT-LLM is what NVIDIA actually uses to publish those headline tok/s numbers. Walk through the engine builder, in-flight batching, paged-KV, FP8/FP4 quantization, speculative decoding, and the production serving paths — including how it stacks against vLLM.
A practical walk-through of TensorRT-LLM — the library NVIDIA reaches for whenever they want to publish a tok/s number that wins a slide. What it is, how it builds engines, what those engines actually do, and the honest comparison with vLLM.
Three layers of the stack get conflated all the time. Pulling them apart is the first step to understanding why TensorRT-LLM looks the way it does.
NVIDIA's general-purpose inference compiler, shipping since 2017. Takes ONNX or PyTorch graphs, fuses kernels, picks the fastest tactic per matmul, and emits a serialised engine optimised for one specific (GPU, precision, shape) combination. Not LLM-aware on its own.
A Python library on top of TensorRT plus an LLM-specific runtime: paged-KV, in-flight batching, attention plugins, top-tier model recipes (Llama, Mistral, Mixtral, DeepSeek, Qwen, Phi, GPT-J/NeoX, Gemma). This is where all the LLM-specific magic lives.
Pair it with Triton Inference Server (open-source, multi-model, full HTTP/gRPC) or NVIDIA NIM (a pre-baked container with the engine, Triton, and an OpenAI-compatible API) for the actual HTTP path. TRT-LLM itself is a library.
Think of TensorRT-LLM as the analogue of vLLM, but built around an ahead-of-time compiler rather than a JIT runtime. You spend minutes building a frozen engine; in return the engine knows everything — shapes, precision, GPU, kernel choice — before the first token ever runs.
The end-to-end path from a HuggingFace checkpoint to a serialised TensorRT engine. The headline cost is build time; the headline win is that the engine is fully specialised before serving starts.
# 1. Convert HF checkpoint to TRT-LLM checkpoint format
#    (the unquantized path; skip it when quantizing, since step 2
#    writes a TRT-LLM checkpoint of its own)
python convert_checkpoint.py \
--model_dir ./Meta-Llama-3-70B-Instruct \
--output_dir ./tllm_ckpt/70b-bf16 \
--dtype bfloat16 \
--tp_size 8
# 2. Quantize (FP8 in this example; alternatives: AWQ-INT4, NVFP4, MX-FP4)
python ../quantization/quantize.py \
--model_dir ./Meta-Llama-3-70B-Instruct \
--output_dir ./tllm_ckpt/70b-fp8 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--tp_size 8 \
--calib_size 512
# 3. Build the TensorRT engine
trtllm-build \
--checkpoint_dir ./tllm_ckpt/70b-fp8 \
--output_dir ./engines/70b-fp8-h100-tp8 \
--gemm_plugin fp8 \
--max_batch_size 128 \
--max_input_len 8192 \
--max_output_len 2048 \
--use_paged_context_fmha enable
# 4. per-rank engine files (rank0.engine ... rank7.engine for TP8) + config.json saved → ready to serve
| Dimension | Decision time | Cost of changing |
|---|---|---|
| GPU architecture | at build | full rebuild — an H100 engine doesn't run on B200 |
| Precision (BF16, FP8, FP4, AWQ) | at build | full rebuild — tactics differ per precision |
| max_batch_size | at build | rebuild; KV pool sized for the cap |
| max_input_len / max_output_len | at build | rebuild; positional + KV shapes are fixed |
| Tensor / pipeline / expert parallel sizes | at build | rebuild; partitioning is graph-level |
| Sampler params (temp, top-p, top-k) | at runtime | free |
| Number of concurrent requests (≤ max_batch_size) | at runtime | free |
For a 70B-class model, expect 5–30 minutes per build on the same hardware you'll serve on. The builder runs every candidate kernel against representative inputs and picks the best — this profiling is what makes the engine fast, but it's not free. Cache your engines and version them next to the checkpoint.
The compiler pipeline that turns a Python graph into a frozen, GPU-specific binary. Each step matters; the plugins step is the secret sauce.
A naive Llama decoder layer is dozens of CUDA kernel launches per token: rmsnorm → quant → q_proj → k_proj → v_proj → rope → attention → o_proj → rmsnorm → gate_proj → up_proj → silu → multiply → down_proj. The builder collapses many of these into single kernels, eliminating round-trips through HBM and saving launch overhead.
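To make the fusion concrete, here is a pure-PyTorch sketch of the SwiGLU MLP before and after the kind of merging the builder does (toy shapes; illustrative only, not the actual TensorRT kernels):

```python
import torch
import torch.nn.functional as F

# Naive SwiGLU MLP: five separate kernel launches, every intermediate
# tensor written to and re-read from HBM.
def naive_mlp(x, w_gate, w_up, w_down):
    g = x @ w_gate            # gate_proj
    u = x @ w_up              # up_proj
    a = F.silu(g)             # silu
    a = a * u                 # multiply
    return a @ w_down         # down_proj

# Roughly what the fused version computes: gate and up merged into one
# wide GEMM, with the SiLU-multiply folded into its epilogue, so the
# activation never round-trips through HBM between ops.
def fused_mlp(x, w_gate_up, w_down):
    g, u = (x @ w_gate_up).chunk(2, dim=-1)
    return (F.silu(g) * u) @ w_down

x = torch.randn(1, 512)
w_gate, w_up = torch.randn(512, 1024), torch.randn(512, 1024)
w_down = torch.randn(1024, 512)
assert torch.allclose(naive_mlp(x, w_gate, w_up, w_down),
                      fused_mlp(x, torch.cat([w_gate, w_up], dim=1), w_down),
                      rtol=1e-4, atol=1e-2)
```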
Hand-written fused multi-head attention with paged-KV awareness. Hopper variant uses WGMMA + TMA; Blackwell variant adds NVFP4 / MX-FP4 microblock support. This is where the bulk of the H100/B200 wins come from.
CUTLASS-based, picks per-shape tactics. Supports FP8, FP4, INT4 (AWQ packing), and W8A8 SmoothQuant layouts. Falls back to cuBLAS for shapes the templating doesn't cover.
Expert-parallel scatter/gather with on-chip token sort. Implements grouped GEMM so all experts on a rank issue one fat matmul instead of N skinny ones.
Tiny but ubiquitous — folded into the surrounding GEMMs whenever possible. RoPE is fused into the QKV projection; RMSNorm is fused with the following quantize.
TensorRT-LLM has been Apache 2.0 since October 2023, but the underlying TensorRT compiler is closed, and the engine binary format is opaque. You can read the Python wrapper, but you can't easily inspect what kernel ran for which matmul. Treat the engine as a black box you trust because nvidia-smi says it's hitting 80% SM utilisation.
The single largest throughput improvement TensorRT-LLM brings vs naive serving. Same fundamental idea as vLLM's continuous batching, executed in TRT-LLM's high-performance C++ runtime.
All N requests in the batch must finish before the next batch starts. If request 1 generates 2000 tokens and request 2 generates 50, request 2's slot does no useful work for roughly 1950 steps while the whole batch waits on request 1.
The active request set is recomputed every decode step. Finished requests leave; new requests join mid-step at the next prefill window.
IFB is on by default once you build with --use_paged_context_fmha enable and serve via Triton's tensorrtllm_backend with batching_strategy: inflight_fused_batching. Don't run TensorRT-LLM in production without it — you'd be leaving most of the throughput on the table.
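The scheduling idea in toy form (illustrative only; the production scheduler is TRT-LLM's C++ batch manager):

```python
import random
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    max_new: int
    produced: int = 0
    output: list = field(default_factory=list)

def decode_step(batch):
    # stand-in for one fused decode step over the whole active batch
    for r in batch:
        r.output.append(f"tok{r.produced}")
        r.produced += 1

def serve(incoming, max_batch_size=4):
    active, done = [], []
    while incoming or active:
        # admit new requests mid-flight, up to the engine's batch cap
        while incoming and len(active) < max_batch_size:
            active.append(incoming.pop(0))        # prefill happens here
        decode_step(active)
        # finished requests leave immediately, freeing their slot
        done += [r for r in active if r.produced >= r.max_new]
        active = [r for r in active if r.produced < r.max_new]
    return done

reqs = [Request(i, max_new=random.randint(2, 10)) for i in range(8)]
for r in serve(reqs):
    print(f"request {r.rid}: {len(r.output)} tokens")
```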
Same insight as vLLM's PagedAttention: the KV cache fragments memory at runtime, so manage it like a virtual-memory system instead of a contiguous tensor.
Naive KV pre-allocates max_seq_len × max_batch_size × n_layers × 2 × n_kv_heads × head_dim × sizeof(dtype) contiguously (n_kv_heads, not query heads, on GQA models). On a 70B FP8 model with 8k context and batch 64, that's tens of GB reserved even when most slots are nearly empty.
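Plugging in Llama-3-70B's geometry (80 layers, 8 KV heads under GQA, head_dim 128) makes the claim concrete:

```python
# First-order KV reservation for the naive contiguous layout.
n_layers, n_kv_heads, head_dim = 80, 8, 128   # Llama-3-70B geometry
bytes_per_el = 1                              # FP8 KV cache
max_seq_len, max_batch = 8192, 64

per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_el  # K and V
total = max_seq_len * max_batch * per_token
print(f"{per_token / 1024:.0f} KiB/token, {total / 2**30:.0f} GiB reserved")
# -> 160 KiB/token, 80 GiB reserved, even when most slots hold short sequences
```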
Split the KV cache into fixed-size blocks (typically 64 or 128 tokens). Maintain a per-request page table. Allocate blocks on demand as sequences grow; free them when sequences finish. Any free block can serve any request — no fragmentation.
trtllm-build \
# paged context FMHA: lets prefill read paged KV blocks directly
# (the paged KV cache itself is the default in modern versions)
--use_paged_context_fmha enable \
# block size in tokens — 64 or 128 typical
--tokens_per_block 128 \
# enable prefix sharing in the runtime
--enable_kv_cache_reuse \
# cap on batched tokens per forward pass (sizes activation workspace)
--max_num_tokens 131072
Paged KV is the difference between serving 8 and 80 concurrent users on the same GPU. Without it the KV cache reservation eats all your headroom; with it the cache fills only with what's actively in flight.
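A toy sketch of the mechanism, a shared free list plus per-request page tables (illustrative only, not TRT-LLM's implementation):

```python
class PagedKV:
    def __init__(self, n_blocks, tokens_per_block=128):
        self.tokens_per_block = tokens_per_block
        self.free = list(range(n_blocks))   # any free block can serve any request
        self.pages = {}                     # request id -> [block ids]
        self.lengths = {}                   # request id -> tokens stored

    def append_token(self, rid):
        n = self.lengths.get(rid, 0)
        if n % self.tokens_per_block == 0:  # current block full: allocate on demand
            if not self.free:
                raise MemoryError("pool exhausted; scheduler defers the request")
            self.pages.setdefault(rid, []).append(self.free.pop())
        self.lengths[rid] = n + 1

    def release(self, rid):                 # sequence done: blocks return to the pool
        self.free += self.pages.pop(rid, [])
        self.lengths.pop(rid, None)

kv = PagedKV(n_blocks=1024)
for _ in range(300):
    kv.append_token("req-A")
print(len(kv.pages["req-A"]), "blocks for a 300-token sequence")  # ceil(300/128) = 3
kv.release("req-A")
```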
What precisions you can actually use, per architecture. The matrix is messier than vLLM's because TRT-LLM exposes more knobs; that's also why it can extract more performance.
| Precision | Ampere (8.x) | Ada (8.9) | Hopper (9.0) | Blackwell (10.x/12.x) | Notes |
|---|---|---|---|---|---|
| FP16 / BF16 | ✓ | ✓ | ✓ | ✓ | Always available; the floor. |
| FP8 E4M3 weights+activations | — | ✓ | ✓ | ✓ | Silicon present across Ada (incl. consumer 4090, L40S); 2× over BF16 on prefill. Transformer Engine integration on Hopper for dynamic scaling. |
| FP8 KV cache | — | ✓ | ✓ | ✓ | Halves KV memory; tiny accuracy loss. |
| AWQ INT4 weights + FP16 activations | ✓ | ✓ | ✓ | ✓ | Best on Ampere; ubiquitous on consumer cards. |
| GPTQ INT4 | ✓ | ✓ | ✓ | ✓ | Older format; AWQ usually preferred. |
| SmoothQuant W8A8 | ✓ | ✓ | ✓ | ✓ | INT8 weights and activations; older recipe but cheap. |
| NVFP4 / MX-FP4 weights | — | — | — | ✓ | Microblock FP4; 2× FP8 throughput on Blackwell tensor cores. NVFP4 is the TRT-LLM default (16-elem blocks, E4M3 scale + per-tensor FP32 scale — more accurate); MX-FP4 is the OCP standard (32-elem blocks, E8M0 scale). |
# quantize.py runs the model on calibration data,
# computes per-tensor scaling factors, and writes
# a TRT-LLM checkpoint with FP8 weights + scales baked in.
python ../quantization/quantize.py \
--model_dir ./Meta-Llama-3-70B-Instruct \
--dtype bfloat16 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--output_dir ./tllm_ckpt/70b-fp8 \
--calib_size 512 \
--calib_dataset cnn_dailymail \
--tp_size 8
Blackwell accelerates two FP4 microblock formats. NVFP4 (the TRT-LLM default) uses 16-element blocks with an E4M3 per-block scale plus a per-tensor FP32 scale — more accurate. MX-FP4 is the open OCP standard with 32-element blocks and an E8M0 (8-bit exponent) per-block scale. Both run at 2× the FP8 tensor-core rate on a B200; a fully quantised 70B is then ~35 GB and fits with massive KV headroom on a single B200. This is the path to running 405B-class models on a single 8-GPU HGX board.
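To see what block-scaled FP4 actually does to the weights, here is a plain-PyTorch simulation of the NVFP4 recipe (fp32 stand-ins throughout; real kernels pack two FP4 values per byte and store the block scale in E4M3, so treat this as illustration only):

```python
import torch

FP4_GRID = torch.tensor([0., .5, 1., 1.5, 2., 3., 4., 6.])  # E2M1 magnitudes

def quantize_nvfp4_block(w, block=16):
    w = w.reshape(-1, block)
    scale = w.abs().amax(dim=1, keepdim=True) / 6.0          # per-16-element scale
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    q = w / scale
    # snap each value to the nearest representable FP4 magnitude, keep the sign
    idx = (q.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(-1)
    return q.sign() * FP4_GRID[idx] * scale                  # dequantised view

w = torch.randn(4096, 4096)
w_q = quantize_nvfp4_block(w).reshape_as(w)
print("relative error:", ((w - w_q).norm() / w.norm()).item())
# Storage: 4 bits/weight plus one E4M3 scale byte per 16 weights,
# i.e. 70e9 * 0.5 bytes = 35 GB of weights plus ~12% scale overhead.
```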
Decode is bandwidth-bound: each step reads all weights to produce one token. Speculative decoding reads the weights once but verifies several token candidates at once — turning the same memory traffic into multiple accepted tokens.
(a) Draft-target: a small fast model (e.g. Llama-3-8B) proposes N tokens; the big model (Llama-3-70B) verifies all N in a single forward pass. Accepted tokens come for free; rejected ones cost only the discarded suffix.
(b) Medusa heads: extra LM heads attached to the same base model predict multiple future tokens simultaneously. The base model's last hidden state feeds N heads, each predicting position +1, +2, +3, ...
(c) Lookahead / prompt lookup: search the runtime cache for token sequences the prefix has matched before. If "the quick brown" is followed by "fox" in cache history, propose "fox" without running any model. Combines well with (a) and (b).
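The acceptance logic behind (a), in toy form (greedy variant, pure Python; the real runtime does the verification inside a single batched engine pass):

```python
def speculative_step(draft_next, target_next, prefix, k=5):
    # k cheap draft steps propose a candidate continuation
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # One target forward over prefix+proposal scores every position;
    # target_next is called per position here only for clarity.
    accepted, ctx = [], list(prefix)
    for t in proposal:
        if target_next(ctx) != t:       # first disagreement: discard the suffix
            break
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_next(ctx))   # the target's own next token comes free
    return accepted                     # 1..k+1 tokens per big-model pass

# Hypothetical stand-in "models": the draft mostly agrees with the target.
target = lambda ctx: (sum(ctx) + len(ctx)) % 50
draft = lambda ctx: target(ctx) if len(ctx) % 7 else (target(ctx) + 1) % 50
print(speculative_step(draft, target, [1, 2, 3]))
```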
# Build the draft engine (small model)
trtllm-build \
--checkpoint_dir ./tllm_ckpt/llama3-8b-fp8 \
--output_dir ./engines/draft \
--gemm_plugin fp8 \
--max_batch_size 128
# Build the target engine, telling it to expect speculative decode
trtllm-build \
--checkpoint_dir ./tllm_ckpt/llama3-70b-fp8 \
--output_dir ./engines/target \
--gemm_plugin fp8 \
--speculative_decoding_mode draft_target \
--max_draft_len 5
# Serve via Triton's tensorrtllm_backend with both engines in the
# model repository (the draft/target pairing is configured in the
# backend's model config, not via tritonserver CLI flags)
tritonserver --model-repository=./repo
Best on workloads where most tokens are predictable: code completion, structured JSON output, RAG with pinned templates. Worst on creative generation where every token is uncertain — acceptance rate falls and the verification overhead dominates. Always benchmark on your traffic before turning it on.
TensorRT-LLM partitions the compute graph at build time, so each rank's engine knows exactly what tensors it owns and which collectives it must emit. Three orthogonal axes; combine as needed.
Each transformer layer is split across GPUs (e.g. each rank holds 1/8 of every projection). NCCL all-reduce after attention and MLP merges the partial results.
Different layers on different GPUs; activations flow GPU→GPU as a pipeline. Only hidden-state-sized tensors cross the link — works over PCIe.
For MoE models: each rank owns a subset of experts. After routing, all-to-all sends each token to the rank holding its experts; another all-to-all returns results.
Within an HGX/DGX board (8 GPUs all on one NVSwitch domain), set TP=8. Across boards or nodes, add PP. For MoE, slice the experts across ranks with EP=K (typically EP≤TP). The build emits one engine per rank, with the collectives baked into the graph.
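The arithmetic behind "all-reduce after attention and MLP", simulated with two in-process ranks (toy shapes; on real hardware each half lives on a different GPU and the final add is an NCCL collective):

```python
import torch

x = torch.randn(4, 512)
w = torch.randn(512, 512)
full = x @ w

# Column-parallel: each rank owns half the output columns; results
# concatenate with no reduction needed.
w_c0, w_c1 = w.chunk(2, dim=1)
col = torch.cat([x @ w_c0, x @ w_c1], dim=1)

# Row-parallel: each rank owns half the input dimension; each produces
# a partial sum, and the "+" below is the all-reduce on real hardware.
x0, x1 = x.chunk(2, dim=1)
w_r0, w_r1 = w.chunk(2, dim=0)
row = (x0 @ w_r0) + (x1 @ w_r1)

assert torch.allclose(full, col, atol=1e-3)
assert torch.allclose(full, row, atol=1e-3)
```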
python convert_checkpoint.py \
--model_dir ./Mixtral-8x22B \
--output_dir ./tllm_ckpt/mixtral-tp8-pp2 \
--dtype bfloat16 \
--tp_size 8 \
--pp_size 2 \
--moe_tp_size 1 \
--moe_ep_size 8
trtllm-build \
--checkpoint_dir ./tllm_ckpt/mixtral-tp8-pp2 \
--output_dir ./engines/mixtral \
--gemm_plugin fp8 \
--max_batch_size 128
Because TRT-LLM bakes the partitioning into the engine, every rank's binary is specialised to its slice — the runtime doesn't waste cycles deciding what to run. The downside is that changing TP/PP/EP requires a rebuild. Worth it for production; annoying for experimentation.
TensorRT-LLM is a library, not a server. To put HTTP in front of it, you pick one of three paths.
NVIDIA's open-source server with the tensorrtllm_backend. Full HTTP/gRPC, dynamic batching, multi-model, ensemble pipelines, model versioning, metrics; configured via a config.pbtxt per model. The production standard inside NVIDIA itself.
Pre-packaged microservices. Pull a container, get a TensorRT-LLM engine + Triton + an OpenAI-compatible API (/v1/chat/completions) in one shot. NVIDIA pre-tunes the engine per model and per supported GPU. What big enterprises actually deploy.
Use tensorrt_llm.runtime.ModelRunner directly inside your own Python or C++ service. For tightly-coupled cases (audio + LLM, custom routing, embedded agents) where Triton is overkill or in the wrong place.
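A minimal sketch of that embedded path, patterned on TRT-LLM's examples/run.py (exact signatures shift between releases, so verify against your installed version):

```python
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

tok = AutoTokenizer.from_pretrained("./Meta-Llama-3-70B-Instruct")
runner = ModelRunner.from_dir(engine_dir="./engines/70b-fp8-h100-tp8")

ids = tok("Explain paged KV in one sentence.", return_tensors="pt").input_ids
out = runner.generate(
    [ids[0]],                       # list of input id tensors = the batch
    max_new_tokens=128,
    end_id=tok.eos_token_id,
    pad_id=tok.eos_token_id,
    temperature=0.7,
    top_p=0.9,
)
print(tok.decode(out[0][0], skip_special_tokens=True))  # [request][beam]
```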
For most enterprises, the cost isn't the GPU bill — it's the engineering time spent calibrating quantization, picking max_batch_size, sweeping precisions, and re-running benchmarks for every new model release. NIM ships an engine that NVIDIA has already tuned on that exact GPU SKU, with sensible defaults baked in. You lose flexibility; you gain weeks per model.
docker login nvcr.io
docker pull nvcr.io/nim/meta/llama-3.3-70b-instruct:latest
docker run -d --gpus all \
-p 8000:8000 \
-e NGC_API_KEY=$NGC_API_KEY \
-v ~/.cache/nim:/opt/nim/.cache \
nvcr.io/nim/meta/llama-3.3-70b-instruct:latest
# OpenAI-compatible API ready on :8000/v1/chat/completions
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"meta/llama-3.3-70b-instruct","messages":[{"role":"user","content":"Hi"}]}'
Small platform team, many models, prefers low ops over deep tuning → NIM. Big platform team, fewer models, willing to invest in tuning → Triton + your own engines. Embedded use case → ModelRunner. Don't fight the recommended path for your size.
The honest comparison. Both are first-rate. Different sweet spots; most serious teams end up using both.
| Dimension | TensorRT-LLM | vLLM |
|---|---|---|
| Peak performance (H100/B200) | ~10–30% faster on a tuned engine | Excellent; closer to TRT-LLM each release |
| Peak performance (Ampere) | Smaller margin — both are bandwidth-bound | Very close to TRT-LLM |
| Build time | 5–30 min per (model × precision × shape) | Seconds — load and go |
| Flexibility at runtime | Engine is frozen; rebuild to change shapes | Same model serves any sequence length up to limit |
| New model time-to-support | Weeks — needs a recipe + plugin tuning | Days — HF integration usually drops fast |
| Quantization recipes | FP8, FP4 (NVFP4 default, MX-FP4), AWQ, GPTQ, SmoothQuant — calibrated | FP8, AWQ, GPTQ, BitsAndBytes — less hand-tuning |
| Speculative decoding | Draft, Medusa, EAGLE/Lookahead | Draft, Medusa (Eagle in progress) |
| Multi-GPU | TP, PP, EP — all build-time | TP, PP, EP — all runtime config |
| License | Apache 2.0 since 2023; plugins closed; engine format opaque | Apache 2.0; fully open |
| Server | Pair with Triton or NIM | Built-in OpenAI-compatible server |
| Best fit | Stable workload, fixed model, want every drop | Many models, frequent rotations, want simple ops |
Prototype on vLLM: load model, hit a port, ship something. Iterate freely on prompt, tools, and product.
Scale on TRT-LLM when (a) the model has stabilised, (b) traffic is predictable, and (c) the throughput / TTFT SLOs justify the operational cost of the build pipeline.
If you're rotating models weekly, or your traffic varies wildly in shape, or your team is small — the build-and-tune overhead is more than the 10–30% perf you'd recover. Stay on vLLM and spend the saved engineering on something else.
NVIDIA themselves use vLLM internally for development and TRT-LLM for the published throughput numbers. Use the right tool per phase. The framework war is over — both win for different jobs.
Pick a model size, GPU, precision, and shape envelope. Get back a fit estimate, parallelism recommendation, rough tok/s, and the matching trtllm-build flags.
The numbers above are first-order: weights are model_params × bytes/param, KV is batch × seq × layers × kv-bytes-per-token, decode tok/s is bandwidth-bound and prefill is FLOPS-bound. Real engines pay 10–25% overhead for activations, workspace, and CUDA graphs. Always benchmark before sizing the cluster.
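Those formulas as a hypothetical sizing helper (every name and number here is an assumption for illustration; benchmark before trusting it):

```python
def estimate(params_b, bytes_per_param, n_layers, kv_heads, head_dim,
             kv_bytes, batch, seq, hbm_gb, bw_tbs):
    weights_gb = params_b * bytes_per_param
    kv_gb = batch * seq * n_layers * 2 * kv_heads * head_dim * kv_bytes / 1e9
    fits = (weights_gb + kv_gb) * 1.15 <= hbm_gb      # ~15% overhead, mid-range
    # decode is bandwidth-bound: every step re-reads all the weights
    decode_tps = bw_tbs * 1e3 / weights_gb * batch
    return weights_gb, kv_gb, fits, decode_tps

w, kv, fits, tps = estimate(params_b=70, bytes_per_param=1,   # 70B at FP8
                            n_layers=80, kv_heads=8, head_dim=128, kv_bytes=1,
                            batch=32, seq=8192,
                            hbm_gb=80, bw_tbs=3.35)           # one H100 SXM
print(f"weights {w:.0f} GB, KV {kv:.0f} GB, fits={fits}, ~{tps:.0f} tok/s decode")
# fits=False on a single H100: exactly why the builds above use TP8
```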