The NVIDIA AI Stack — NVIDIA GenAI Cert Prep

00

Topics in This Deck

A cert-focused tour of the NVIDIA AI software stack. Three NCP domains touch this material; together they're roughly 30% of the Professional exam.

Cert Framing
The Stack at a Glance
CUDA, Driver, cuDNN, cuBLAS
NCCL
CUTLASS
NeMo Framework
NeMo RL (formerly NeMo Aligner)
NeMo Curator, Guardrails, Evaluator, Run
TensorRT vs TensorRT-LLM
Triton Inference Server
Triton + TRT-LLM Relationship
NIM — Inference Microservices
Base Command, Run.ai, DGX Cloud
NVIDIA AI Enterprise
RAPIDS
Nsight Systems vs Compute
MIG, MPS, vGPU
Decision Matrix
NGC Container Catalogue
Likely Exam Angles
Cross-References

01

Cert Framing

GPU Acceleration

14%

CUDA stack, libraries, multi-GPU primitives, profiling tools.

Model Deployment

9%

Triton, NIM, TensorRT-LLM, container packaging, gRPC/HTTP serving.

Production

7%

Monitoring with DCGM, AI Enterprise, Run.ai orchestration.

Combined: 30% of NCP. The single highest-leverage deck for that exam by domain weight.

02

The Stack at a Glance

Read bottom-up at deployment time, top-down at training time. The libraries layer (cuDNN, cuBLAS, NCCL, CUTLASS) is shared by everything above.

03

CUDA, Driver, cuDNN, cuBLAS

CUDA Toolkit

NVIDIA's C/C++ programming model and runtime for GPUs. As of April 2026 the current major is CUDA 13.x. The Toolkit ships nvcc (compiler), runtime libraries, and headers.

Driver compatibility

The driver and toolkit version pair is the most common deployment failure mode. Modern CUDA toolkit versions require a minimum driver branch (e.g. CUDA 13.x against driver R580+ or later, depending on point release). Verify with nvidia-smi against the toolkit's nvcc --version.

cuDNN

NVIDIA's deep learning primitives library — convolutions, activations, normalisation, RNN cells. Used by PyTorch and TensorFlow under the hood.

cuBLAS

BLAS for the GPU. Matrix multiplies and basic linear algebra. Underpins all transformer attention and FFN computations.

04

NCCL — Collective Communications

NVIDIA Collective Communications Library. The mandatory layer for any multi-GPU training or inference.

Collective	What it does	Where used
`all-reduce`	Sum gradients across all GPUs; result on every GPU	Data parallelism gradient sync
`all-gather`	Concatenate tensors from all GPUs onto every GPU	FSDP/ZeRO param all-gather
`reduce-scatter`	Sum and split tensor across GPUs	FSDP gradient reduce-scatter
`broadcast`	Send tensor from one GPU to all others	Init / loading checkpoints

NCCL uses NVLink within a node and InfiniBand (or Ethernet) across nodes. Performance depends on both bandwidth and latency: large all-reduces are bandwidth-bound, small ones are latency-bound. SHARP (Scalable Hierarchical Aggregation Reduction Protocol) on Quantum InfiniBand offloads parts of the reduction to the network switch.

05

CUTLASS

CUDA Templates for Linear Algebra Subroutines. Open-source C++ template library for writing custom GEMM kernels. The building blocks for FlashAttention, custom MoE kernels, and one-off optimisations where cuBLAS doesn't fit.

Why it matters at the cert level: CUTLASS is what lets framework authors hand-tune the matmul + epilogue + activation fusion that cuBLAS won't do for them. FlashAttention v2 and v3 are written using CUTLASS-style abstractions.

When you'd reach for it

You're optimising a kernel that cuBLAS can't fuse with the surrounding ops, and you need explicit control over tensor-core instructions (HMMA on Volta-Ampere, WGMMA on Hopper, the equivalents on Blackwell).

06

NeMo Framework

NVIDIA's end-to-end framework for generative AI training and customisation. Container images on NGC (currently the 26.x line). Built on PyTorch with NVIDIA-specific accelerations — Megatron-Core for parallelism, Apex for fused ops, Transformer Engine for FP8.

What sits inside

Megatron-Core — the tensor/pipeline/sequence parallelism engine. The training scaffolding for trillion-parameter models.
Transformer Engine — FP8 numerics on Hopper, MX-FP4 on Blackwell. The Hopper Transformer Engine paper figures (~2x speedup vs FP16) come from this.
NeMo Aligner — renamed to NeMo RL (see next slide).
NeMo Curator, NeMo Guardrails, NeMo Evaluator, NeMo Run — sub-projects, see slide 08.

NeMo runs on a single GPU up to multi-node clusters with the same code path; multi-node requires NCCL + InfiniBand and is configured via Megatron-Core options.

07

NeMo RL (Formerly NeMo Aligner)

NVIDIA's reinforcement learning and preference-tuning library. Renamed from NeMo Aligner to NeMo RL in the recent NeMo Framework releases (the wider rebrand reflects that the library does more than alignment alone).

What it covers

RLHF — full PPO loop with reward model, reference model, KL penalty.
DPO and variants — reference-based and reference-free.
GRPO — group-relative policy optimisation, used in DeepSeek-R1-style reasoning training.
Constitutional AI / RLAIF — AI labellers replacing humans in the preference pipeline.

Cert recall hook

"Which NVIDIA component does RLHF / DPO?" → NeMo RL (formerly NeMo Aligner). Distractors will use the old name to test whether you've kept up.

08

NeMo Curator, Guardrails, Evaluator, Run

Component	Purpose	When to reach for it
NeMo Curator	GPU-accelerated data preprocessing — deduplication, quality filtering, decontamination	Building a custom pretraining or SFT corpus at scale
NeMo Guardrails	Programmable safety rails between user and LLM — topical, dialogue, fact-checking, jailbreak detection	Production deployment that needs deterministic safety controls beyond model RLHF
NeMo Evaluator	Eval orchestration — benchmark runs, custom metrics, LLM-as-judge	Standardising eval pipelines across models or releases
NeMo Run	Experiment orchestrator — config-driven launch on local, Slurm, or cloud backends	Reproducible multi-config runs

Each of Guardrails, Curator, Evaluator, and Run can be used independently of the wider NeMo Framework — they're separately published packages.

09

TensorRT vs TensorRT-LLM

TensorRT

General-purpose inference compiler. Takes ONNX/PyTorch/TF models, applies layer fusion, kernel auto-tuning, and precision conversion. Used for vision, classical ML, smaller transformers.

TensorRT-LLM

LLM-specific inference engine on top of TensorRT primitives. Adds in-flight (continuous) batching, paged KV-cache, FP8/FP4 quantisation, speculative decoding (draft, Medusa, EAGLE), tensor/pipeline/expert parallelism.

The relationship

TRT-LLM uses TRT under the hood for the kernel-level work but adds a specialised Python and C++ runtime that knows about LLM-specific structures the generic TRT runtime doesn't — KV cache, request scheduling, multi-LoRA. For LLMs, you reach for TRT-LLM.

Depth: NVIDIA_GPU_19_TensorRT_LLM.

10

Triton Inference Server

Open-source model server. The serving runtime, not the optimisation engine. Triton's job: take many model artefacts (TRT engines, ONNX, PyTorch, TF, Python custom backends), expose them via gRPC and HTTP, batch requests, route them to the right backend.

Features that matter for the cert

Multi-backend — one server can host TRT, ONNX-Runtime, PyTorch, custom Python.
Dynamic batching — combines incoming requests up to max_batch_size and a queue delay budget.
Model ensembling — Triton can chain models (e.g. tokeniser → LLM → detokeniser) inside the server.
Metrics — Prometheus endpoint on port 8002; integrates with DCGM for GPU telemetry.
Health endpoints — /v2/health/ready for orchestrator probing.

Hands-on: exercises/04_triton_serving_demo/.

11

Triton + TRT-LLM Relationship

The reusable mental model: Triton is the server, TRT-LLM is one backend, the engine file is the compiled artefact. NIM (next slide) packages all three plus a model into a microservice container.

12

NIM — Inference Microservices

NVIDIA Inference Microservices. Pre-packaged container images on NGC, one per model + precision combination. Each NIM bundles:

A pre-built TensorRT-LLM engine (or other backend)
Triton Inference Server
An OpenAI-compatible REST API plus gRPC
Deployment configuration tuned per GPU family

Run a NIM with docker run nvcr.io/nim/<publisher>/<model>:<tag> and you have a production-ready endpoint. The trade-off vs building your own engine: less control, less tuning headroom, but predictable performance.

Where NIM fits in the stack

NIM is the layer above Triton+TRT-LLM that hides their composition behind a single deployable artefact. Cert framing: NIM is "production deployment without engine-building".

13

Base Command, Run.ai, DGX Cloud

Component	What it does
Base Command	Job orchestration for DGX systems — submit/monitor training jobs, manage datasets, track experiments. The DGX-native scheduler.
Run.ai	NVIDIA-acquired Kubernetes-native GPU orchestration. Fractional GPUs, fair sharing, multi-tenant scheduling, gang scheduling.
DGX Cloud	Hosted access to DGX clusters via NVIDIA's partner clouds (AWS, Azure, GCP, OCI). Reserve capacity, run NeMo workloads, no on-prem hardware.

Brendan's local hardware (RTX 3080, RTX 4000 Ada) gives no direct access to any of these — this slide is exam-knowledge only. The cert may ask: "Which NVIDIA component schedules training jobs on DGX SuperPOD?" → Base Command. "Which integrates with Kubernetes for fractional GPU sharing?" → Run.ai.

14

NVIDIA AI Enterprise

The licensed software bundle. Not a single product — a subscription that includes:

Production-supported builds of NeMo, Triton, TensorRT-LLM, NIM, Riva (speech), Morpheus (cybersecurity), Maxine (audio/video).
Validated container images via NGC.
Enterprise support contracts, hotfixes, security patches.
Integration with VMware, Red Hat OpenShift, major cloud providers.

When it matters

Enterprise customers running NIM in production need AI Enterprise. Open-source equivalents (NeMo, Triton, vLLM) are free to use without subscription — the licence covers the validated/supported builds and SLAs, not the technology itself.

15

RAPIDS

Open-source GPU-accelerated data science suite. Drop-in replacements for pandas (cuDF), scikit-learn (cuML), NetworkX (cuGraph), and geospatial workflows (cuSpatial).

How it fits LLM work

Data preparation — cuDF for fast tabular preprocessing of training datasets.
Vector search — cuVS provides GPU-accelerated approximate-nearest-neighbour for RAG, beating CPU HNSW at scale.
Embedding clustering — cuML for k-means / DBSCAN at scale on GPU.

Lighter cert weight than NeMo/Triton/TRT-LLM, but mentioning RAPIDS in a data-prep answer demonstrates you know the wider stack.

16

Nsight Systems vs Compute

Nsight Systems

System-wide profiling. CPU, GPU, NCCL, CUDA streams, OS scheduling. Timeline view across processes. Used to find where time goes at the application level.

"Why is this training step slow?"

Nsight Compute

Kernel-level profiling. Per-kernel SM utilisation, memory access patterns, occupancy, instruction mix, roofline analysis. Used to optimise individual CUDA kernels.

"Why is this kernel achieving only 40% peak FLOPs?"

The standard workflow: start with Nsight Systems to find the bottleneck step or kernel. Then drop into Nsight Compute for that specific kernel. NVTX annotations in your code mark regions for the timeline.

Other tools: DCGM for production telemetry (Prometheus exporter), nvbandwidth for memory benchmarks, CUPTI as the underlying profiling API.

17

MIG, MPS, vGPU

Mechanism	Isolation	Where supported	Use case
MIG	Hardware-partitioned (separate SMs, L2, memory)	A100, H100, H200, Blackwell datacentre	Strict multi-tenancy, predictable QoS
MPS	Software multiplexing on shared GPU	All NVIDIA GPUs	Higher utilisation, no strict isolation
vGPU	Hypervisor-mediated; licence required	Workstation cards (RTX A/Ada, L40S) and datacentre	VDI, virtualised desktops

For Brendan's hardware, neither MIG (RTX 3080 and RTX 4000 Ada don't support MIG) nor vGPU (no licence) applies. MPS is available but rarely useful for single-user workstations.

Depth: NVIDIA_GPU_18_GPU_Sharing_MIG_MPS_vGPU.

18

Decision Matrix

I want to...	Reach for...
Pretrain or full-FT a large model on multi-GPU	NeMo Framework (Megatron-Core)
Run RLHF, DPO, or Constitutional AI	NeMo RL (formerly NeMo Aligner)
Build a custom LLM dataset at scale	NeMo Curator
Add runtime safety controls to a deployed LLM	NeMo Guardrails
Compile a high-throughput LLM inference engine	TensorRT-LLM
Serve compiled engines with HTTP/gRPC and batching	Triton Inference Server
Deploy a pre-tuned production-ready endpoint	NIM
Schedule jobs on a Kubernetes GPU cluster	Run.ai
Schedule jobs on a DGX SuperPOD	Base Command
Profile a slow training step at system level	Nsight Systems
Profile a slow CUDA kernel	Nsight Compute
Partition a single A100/H100 across tenants	MIG
GPU-accelerate pandas / scikit-learn workflows	RAPIDS (cuDF / cuML)

19

NGC Container Catalogue

NVIDIA GPU Cloud catalogue at catalog.ngc.nvidia.com. The canonical source for NVIDIA-published containers.

What you'll pull

nvcr.io/nvidia/tritonserver:<tag> — Triton releases. Tag pattern YY.MM-py3 (monthly).
nvcr.io/nvidia/tensorrt:<tag> — TensorRT toolkit container.
nvcr.io/nvidia/tensorrt_llm:<tag> or trt_llm_backend — TensorRT-LLM. Driver requirements per release.
nvcr.io/nvidia/nemo:<tag> — NeMo Framework (currently the 26.x line as of April 2026).
nvcr.io/nim/<publisher>/<model>:<tag> — NIM microservices.

Always pin tags. latest changes underneath you and breaks reproducibility.

20

Likely Exam Angles

Stack layering. Given a task description, identify the right NVIDIA layer (driver/library/framework/runtime/microservice).
NeMo RL rename. Recognise NeMo Aligner is now NeMo RL; both names may appear in distractors.
Triton vs TRT-LLM vs engine file. The three-layer mental model. Don't conflate them.
NIM bundling. What's inside a NIM container; licensing implications.
NCCL collectives. Which collective each parallelism strategy uses.
MIG vs MPS vs vGPU. Hardware vs software isolation; supported GPU families.
Nsight Systems vs Compute. System-wide vs kernel-level.
Run.ai vs Base Command. Kubernetes-native vs DGX-native scheduler.

21

Cross-References and Further Reading

Portfolio repos (depth treatment)

LLM_Hub_NVIDIA_GPUs — index of 37 GPU/architecture presentations.
NVIDIA_GPU_19_TensorRT_LLM — engine builder, IFB, paged KV, FP8/FP4.
NVIDIA_GPU_20_NeMo_NIM_AI_Enterprise — the NeMo and NIM stack in detail.
NVIDIA_GPU_17_Profiling_and_Debug — Nsight Systems / Compute / NVTX / DCGM workflow.
NVIDIA_GPU_18_GPU_Sharing_MIG_MPS_vGPU — GPU sharing trade-offs.
LLM_Hub_CUDA — CUDA programming series.
LLM_Hub_Local_LLM_Hosting — vLLM, TGI, SGLang, llama.cpp, Ollama.

Cert-prep repo resources

notes/10_nvidia_software_stack.md — the notes this deck is built from.
cheatsheets/nvidia_stack_one_pager.md — printable one-pager.
exercises/04_triton_serving_demo/ — hands-on Triton.
exercises/05_tensorrt_llm_quantisation/ — TRT-LLM walkthrough.

Official documentation

NeMo Framework: docs.nvidia.com/nemo-framework
Triton Inference Server: github.com/triton-inference-server/server
TensorRT-LLM: github.com/NVIDIA/TensorRT-LLM
NGC: catalog.ngc.nvidia.com