NVIDIA GenAI Cert Prep — Presentation 05

The NVIDIA AI Stack

From driver and CUDA toolkit up to NIM microservices and Base Command. The cert-stake highest deck for the NCP exam — what each NVIDIA component does, where it sits in the stack, and when you reach for it.

NCP GPU Acceleration 14% NCP Deployment 9% NCP Production 7% NeMo TensorRT-LLM Triton NIM
Driver / CUDA Libraries Frameworks Runtime Microservices Orchestration
00

Topics in This Deck

A cert-focused tour of the NVIDIA AI software stack. Three NCP domains touch this material; together they're roughly 30% of the Professional exam.

01

Cert Framing

GPU Acceleration

14%

CUDA stack, libraries, multi-GPU primitives, profiling tools.

Model Deployment

9%

Triton, NIM, TensorRT-LLM, container packaging, gRPC/HTTP serving.

Production

7%

Monitoring with DCGM, AI Enterprise, Run.ai orchestration.

Combined: 30% of NCP. The single highest-leverage deck for that exam by domain weight.

02

The Stack at a Glance

Hardware — GPU + NVLink + InfiniBand Driver / CUDA Toolkit cuDNN · cuBLAS · NCCL · CUTLASS · cuSPARSE NeMo Framework Pretraining, SFT, RL, Curator, Guardrails TensorRT · TensorRT-LLM Engine builder, IFB, paged KV, FP8/FP4 Triton Inference Server NIM — Pre-packaged inference microservices Base Command · Run.ai · DGX Cloud (orchestration)

Read bottom-up at deployment time, top-down at training time. The libraries layer (cuDNN, cuBLAS, NCCL, CUTLASS) is shared by everything above.

03

CUDA, Driver, cuDNN, cuBLAS

CUDA Toolkit

NVIDIA's C/C++ programming model and runtime for GPUs. As of April 2026 the current major is CUDA 13.x. The Toolkit ships nvcc (compiler), runtime libraries, and headers.

Driver compatibility

The driver and toolkit version pair is the most common deployment failure mode. Modern CUDA toolkit versions require a minimum driver branch (e.g. CUDA 13.x against driver R580+ or later, depending on point release). Verify with nvidia-smi against the toolkit's nvcc --version.

cuDNN

NVIDIA's deep learning primitives library — convolutions, activations, normalisation, RNN cells. Used by PyTorch and TensorFlow under the hood.

cuBLAS

BLAS for the GPU. Matrix multiplies and basic linear algebra. Underpins all transformer attention and FFN computations.

04

NCCL — Collective Communications

NVIDIA Collective Communications Library. The mandatory layer for any multi-GPU training or inference.

CollectiveWhat it doesWhere used
all-reduceSum gradients across all GPUs; result on every GPUData parallelism gradient sync
all-gatherConcatenate tensors from all GPUs onto every GPUFSDP/ZeRO param all-gather
reduce-scatterSum and split tensor across GPUsFSDP gradient reduce-scatter
broadcastSend tensor from one GPU to all othersInit / loading checkpoints

NCCL uses NVLink within a node and InfiniBand (or Ethernet) across nodes. Performance depends on both bandwidth and latency: large all-reduces are bandwidth-bound, small ones are latency-bound. SHARP (Scalable Hierarchical Aggregation Reduction Protocol) on Quantum InfiniBand offloads parts of the reduction to the network switch.

05

CUTLASS

CUDA Templates for Linear Algebra Subroutines. Open-source C++ template library for writing custom GEMM kernels. The building blocks for FlashAttention, custom MoE kernels, and one-off optimisations where cuBLAS doesn't fit.

Why it matters at the cert level: CUTLASS is what lets framework authors hand-tune the matmul + epilogue + activation fusion that cuBLAS won't do for them. FlashAttention v2 and v3 are written using CUTLASS-style abstractions.

When you'd reach for it

You're optimising a kernel that cuBLAS can't fuse with the surrounding ops, and you need explicit control over tensor-core instructions (HMMA on Volta-Ampere, WGMMA on Hopper, the equivalents on Blackwell).

06

NeMo Framework

NVIDIA's end-to-end framework for generative AI training and customisation. Container images on NGC (currently the 26.x line). Built on PyTorch with NVIDIA-specific accelerations — Megatron-Core for parallelism, Apex for fused ops, Transformer Engine for FP8.

What sits inside

NeMo runs on a single GPU up to multi-node clusters with the same code path; multi-node requires NCCL + InfiniBand and is configured via Megatron-Core options.

07

NeMo RL (Formerly NeMo Aligner)

NVIDIA's reinforcement learning and preference-tuning library. Renamed from NeMo Aligner to NeMo RL in the recent NeMo Framework releases (the wider rebrand reflects that the library does more than alignment alone).

What it covers

Cert recall hook

"Which NVIDIA component does RLHF / DPO?" → NeMo RL (formerly NeMo Aligner). Distractors will use the old name to test whether you've kept up.

08

NeMo Curator, Guardrails, Evaluator, Run

ComponentPurposeWhen to reach for it
NeMo CuratorGPU-accelerated data preprocessing — deduplication, quality filtering, decontaminationBuilding a custom pretraining or SFT corpus at scale
NeMo GuardrailsProgrammable safety rails between user and LLM — topical, dialogue, fact-checking, jailbreak detectionProduction deployment that needs deterministic safety controls beyond model RLHF
NeMo EvaluatorEval orchestration — benchmark runs, custom metrics, LLM-as-judgeStandardising eval pipelines across models or releases
NeMo RunExperiment orchestrator — config-driven launch on local, Slurm, or cloud backendsReproducible multi-config runs

Each of Guardrails, Curator, Evaluator, and Run can be used independently of the wider NeMo Framework — they're separately published packages.

09

TensorRT vs TensorRT-LLM

TensorRT

General-purpose inference compiler. Takes ONNX/PyTorch/TF models, applies layer fusion, kernel auto-tuning, and precision conversion. Used for vision, classical ML, smaller transformers.

TensorRT-LLM

LLM-specific inference engine on top of TensorRT primitives. Adds in-flight (continuous) batching, paged KV-cache, FP8/FP4 quantisation, speculative decoding (draft, Medusa, EAGLE), tensor/pipeline/expert parallelism.

The relationship

TRT-LLM uses TRT under the hood for the kernel-level work but adds a specialised Python and C++ runtime that knows about LLM-specific structures the generic TRT runtime doesn't — KV cache, request scheduling, multi-LoRA. For LLMs, you reach for TRT-LLM.

Depth: NVIDIA_GPU_19_TensorRT_LLM.

10

Triton Inference Server

Open-source model server. The serving runtime, not the optimisation engine. Triton's job: take many model artefacts (TRT engines, ONNX, PyTorch, TF, Python custom backends), expose them via gRPC and HTTP, batch requests, route them to the right backend.

Features that matter for the cert

Hands-on: exercises/04_triton_serving_demo/.

11

Triton + TRT-LLM Relationship

Triton gRPC / HTTP Dynamic batch Metrics Routing TRT-LLM Backend In-flight batching Paged KV cache Speculative decoding TP / PP / EP TRT-LLM Engine Compiled .engine file FP8 / FP4 / INT4 Tensor cores Optimised kernels

The reusable mental model: Triton is the server, TRT-LLM is one backend, the engine file is the compiled artefact. NIM (next slide) packages all three plus a model into a microservice container.

12

NIM — Inference Microservices

NVIDIA Inference Microservices. Pre-packaged container images on NGC, one per model + precision combination. Each NIM bundles:

Run a NIM with docker run nvcr.io/nim/<publisher>/<model>:<tag> and you have a production-ready endpoint. The trade-off vs building your own engine: less control, less tuning headroom, but predictable performance.

Where NIM fits in the stack

NIM is the layer above Triton+TRT-LLM that hides their composition behind a single deployable artefact. Cert framing: NIM is "production deployment without engine-building".

13

Base Command, Run.ai, DGX Cloud

ComponentWhat it does
Base CommandJob orchestration for DGX systems — submit/monitor training jobs, manage datasets, track experiments. The DGX-native scheduler.
Run.aiNVIDIA-acquired Kubernetes-native GPU orchestration. Fractional GPUs, fair sharing, multi-tenant scheduling, gang scheduling.
DGX CloudHosted access to DGX clusters via NVIDIA's partner clouds (AWS, Azure, GCP, OCI). Reserve capacity, run NeMo workloads, no on-prem hardware.

Brendan's local hardware (RTX 3080, RTX 4000 Ada) gives no direct access to any of these — this slide is exam-knowledge only. The cert may ask: "Which NVIDIA component schedules training jobs on DGX SuperPOD?" → Base Command. "Which integrates with Kubernetes for fractional GPU sharing?" → Run.ai.

14

NVIDIA AI Enterprise

The licensed software bundle. Not a single product — a subscription that includes:

When it matters

Enterprise customers running NIM in production need AI Enterprise. Open-source equivalents (NeMo, Triton, vLLM) are free to use without subscription — the licence covers the validated/supported builds and SLAs, not the technology itself.

15

RAPIDS

Open-source GPU-accelerated data science suite. Drop-in replacements for pandas (cuDF), scikit-learn (cuML), NetworkX (cuGraph), and geospatial workflows (cuSpatial).

How it fits LLM work

Lighter cert weight than NeMo/Triton/TRT-LLM, but mentioning RAPIDS in a data-prep answer demonstrates you know the wider stack.

16

Nsight Systems vs Compute

Nsight Systems

System-wide profiling. CPU, GPU, NCCL, CUDA streams, OS scheduling. Timeline view across processes. Used to find where time goes at the application level.

"Why is this training step slow?"

Nsight Compute

Kernel-level profiling. Per-kernel SM utilisation, memory access patterns, occupancy, instruction mix, roofline analysis. Used to optimise individual CUDA kernels.

"Why is this kernel achieving only 40% peak FLOPs?"

The standard workflow: start with Nsight Systems to find the bottleneck step or kernel. Then drop into Nsight Compute for that specific kernel. NVTX annotations in your code mark regions for the timeline.

Other tools: DCGM for production telemetry (Prometheus exporter), nvbandwidth for memory benchmarks, CUPTI as the underlying profiling API.

17

MIG, MPS, vGPU

MechanismIsolationWhere supportedUse case
MIGHardware-partitioned (separate SMs, L2, memory)A100, H100, H200, Blackwell datacentreStrict multi-tenancy, predictable QoS
MPSSoftware multiplexing on shared GPUAll NVIDIA GPUsHigher utilisation, no strict isolation
vGPUHypervisor-mediated; licence requiredWorkstation cards (RTX A/Ada, L40S) and datacentreVDI, virtualised desktops

For Brendan's hardware, neither MIG (RTX 3080 and RTX 4000 Ada don't support MIG) nor vGPU (no licence) applies. MPS is available but rarely useful for single-user workstations.

Depth: NVIDIA_GPU_18_GPU_Sharing_MIG_MPS_vGPU.

18

Decision Matrix

I want to...Reach for...
Pretrain or full-FT a large model on multi-GPUNeMo Framework (Megatron-Core)
Run RLHF, DPO, or Constitutional AINeMo RL (formerly NeMo Aligner)
Build a custom LLM dataset at scaleNeMo Curator
Add runtime safety controls to a deployed LLMNeMo Guardrails
Compile a high-throughput LLM inference engineTensorRT-LLM
Serve compiled engines with HTTP/gRPC and batchingTriton Inference Server
Deploy a pre-tuned production-ready endpointNIM
Schedule jobs on a Kubernetes GPU clusterRun.ai
Schedule jobs on a DGX SuperPODBase Command
Profile a slow training step at system levelNsight Systems
Profile a slow CUDA kernelNsight Compute
Partition a single A100/H100 across tenantsMIG
GPU-accelerate pandas / scikit-learn workflowsRAPIDS (cuDF / cuML)
19

NGC Container Catalogue

NVIDIA GPU Cloud catalogue at catalog.ngc.nvidia.com. The canonical source for NVIDIA-published containers.

What you'll pull

Always pin tags. latest changes underneath you and breaks reproducibility.

20

Likely Exam Angles

21

Cross-References and Further Reading

Portfolio repos (depth treatment)

Cert-prep repo resources

Official documentation