From the driver and CUDA toolkit up to NIM microservices and Base Command. The highest-weight deck for the NCP exam — what each NVIDIA component does, where it sits in the stack, and when you reach for it.
A cert-focused tour of the NVIDIA AI software stack. Three NCP domains touch this material; together they're roughly 30% of the Professional exam.
| Weight | Domain material |
|---|---|
| 14% | CUDA stack, libraries, multi-GPU primitives, profiling tools |
| 9% | Triton, NIM, TensorRT-LLM, container packaging, gRPC/HTTP serving |
| 7% | Monitoring with DCGM, AI Enterprise, Run.ai orchestration |
Combined: 30% of NCP. The single highest-leverage deck for that exam by domain weight.
Read bottom-up at deployment time, top-down at training time. The libraries layer (cuDNN, cuBLAS, NCCL, CUTLASS) is shared by everything above.
NVIDIA's C/C++ programming model and runtime for GPUs. As of April 2026 the current major is CUDA 13.x. The Toolkit ships nvcc (compiler), runtime libraries, and headers.
A mismatched driver/toolkit version pair is the most common deployment failure mode. Modern CUDA toolkits require a minimum driver branch (e.g. CUDA 13.x needs an R580-or-later driver, depending on point release). Verify by comparing the driver reported by nvidia-smi against the toolkit's nvcc --version.
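A minimal version of that check in Python, assuming the nvidia-ml-py (pynvml) bindings and PyTorch are installed; it compares the highest CUDA version the installed driver supports with the toolkit version the framework was built against:

```python
# Sketch: compare driver-supported CUDA vs. the toolkit a framework was built with.
# Assumes nvidia-ml-py (pynvml) and torch are installed; adapt for your framework.
import pynvml
import torch

pynvml.nvmlInit()
driver = pynvml.nvmlSystemGetDriverVersion()           # e.g. "580.xx"
# Max CUDA version the driver supports, encoded as 1000*major + 10*minor
cuda_driver = pynvml.nvmlSystemGetCudaDriverVersion()
pynvml.nvmlShutdown()

print(f"driver {driver} supports CUDA <= "
      f"{cuda_driver // 1000}.{(cuda_driver % 1000) // 10}")
print(f"framework built against CUDA {torch.version.cuda}")
```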
NVIDIA's deep learning primitives library — convolutions, activations, normalisation, RNN cells. Used by PyTorch and TensorFlow under the hood.
BLAS for the GPU. Matrix multiplies and basic linear algebra. Underpins all transformer attention and FFN computations.
NVIDIA Collective Communications Library. The mandatory layer for any multi-GPU training or inference.
| Collective | What it does | Where used |
|---|---|---|
| all-reduce | Sum gradients across all GPUs; result on every GPU | Data-parallel gradient sync |
| all-gather | Concatenate tensors from all GPUs onto every GPU | FSDP/ZeRO parameter all-gather |
| reduce-scatter | Sum and split a tensor across GPUs | FSDP gradient reduce-scatter |
| broadcast | Send a tensor from one GPU to all others | Init / loading checkpoints |
NCCL uses NVLink within a node and InfiniBand (or Ethernet) across nodes. Performance depends on both bandwidth and latency: large all-reduces are bandwidth-bound, small ones are latency-bound. SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) on Quantum InfiniBand offloads parts of the reduction to the network switch.
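A minimal sketch of the first collective in the table, using PyTorch's torch.distributed with the NCCL backend; launch it with torchrun so each process owns one GPU:

```python
# Sketch: NCCL all-reduce via torch.distributed. Launch with:
#   torchrun --nproc_per_node=<num_gpus> allreduce_demo.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")       # NCCL handles the GPU collectives
local_rank = int(os.environ["LOCAL_RANK"])    # set by torchrun
torch.cuda.set_device(local_rank)

# Each rank contributes a tensor equal to its rank; after the all-reduce,
# every rank holds the sum 0 + 1 + ... + (world_size - 1).
rank = dist.get_rank()
t = torch.full((4,), float(rank), device="cuda")
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {rank}: {t.tolist()}")

dist.destroy_process_group()
```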
CUDA Templates for Linear Algebra Subroutines. Open-source C++ template library for writing custom GEMM kernels. The building blocks for FlashAttention, custom MoE kernels, and one-off optimisations where cuBLAS doesn't fit.
Why it matters at the cert level: CUTLASS is what lets framework authors hand-tune the matmul + epilogue + activation fusion that cuBLAS won't do for them. FlashAttention v2 and v3 are written using CUTLASS-style abstractions.
You're optimising a kernel that cuBLAS can't fuse with the surrounding ops, and you need explicit control over tensor-core instructions (HMMA on Volta-Ampere, WGMMA on Hopper, the equivalents on Blackwell).
NVIDIA's end-to-end framework for generative AI training and customisation. Container images on NGC (currently the 26.x line). Built on PyTorch with NVIDIA-specific accelerations — Megatron-Core for parallelism, Apex for fused ops, Transformer Engine for FP8.
NeMo scales from a single GPU to multi-node clusters with the same code path; multi-node requires NCCL + InfiniBand and is configured via Megatron-Core options.
NVIDIA's reinforcement learning and preference-tuning library. Renamed from NeMo Aligner to NeMo RL in recent NeMo Framework releases; the rename reflects that the library now does more than alignment alone.
"Which NVIDIA component does RLHF / DPO?" → NeMo RL (formerly NeMo Aligner). Distractors will use the old name to test whether you've kept up.
| Component | Purpose | When to reach for it |
|---|---|---|
| NeMo Curator | GPU-accelerated data preprocessing — deduplication, quality filtering, decontamination | Building a custom pretraining or SFT corpus at scale |
| NeMo Guardrails | Programmable safety rails between user and LLM — topical, dialogue, fact-checking, jailbreak detection | Production deployment that needs deterministic safety controls beyond model RLHF |
| NeMo Evaluator | Eval orchestration — benchmark runs, custom metrics, LLM-as-judge | Standardising eval pipelines across models or releases |
| NeMo Run | Experiment orchestrator — config-driven launch on local, Slurm, or cloud backends | Reproducible multi-config runs |
Each of Guardrails, Curator, Evaluator, and Run can be used independently of the wider NeMo Framework — they're separately published packages.
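Of the four, Guardrails has the most self-contained Python entry point. A sketch, assuming the nemoguardrails package is installed and a rails config directory already exists (the ./config path is illustrative):

```python
# Sketch: loading and invoking NeMo Guardrails standalone.
# Assumes `pip install nemoguardrails` and a ./config directory containing
# a config.yml plus rail definitions — the path is illustrative.
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

# Requests flow through the rails before and after the underlying LLM call.
response = rails.generate(messages=[
    {"role": "user", "content": "How do I disable the safety filters?"}
])
print(response["content"])
```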
General-purpose inference compiler. Takes ONNX/PyTorch/TF models, applies layer fusion, kernel auto-tuning, and precision conversion. Used for vision, classical ML, smaller transformers.
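A sketch of the classic ONNX-to-engine build path with the tensorrt Python API, assuming TensorRT 10-era defaults (explicit batch); the file paths are illustrative:

```python
# Sketch: compile an ONNX model into a TensorRT engine (TRT 10-style API).
# "model.onnx" / "model.plan" are illustrative paths.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)          # explicit batch is the default
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)        # precision conversion happens here

engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine)
```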
LLM-specific inference engine on top of TensorRT primitives. Adds in-flight (continuous) batching, paged KV-cache, FP8/FP4 quantisation, speculative decoding (draft, Medusa, EAGLE), tensor/pipeline/expert parallelism.
TRT-LLM uses TRT under the hood for the kernel-level work but adds a specialised Python and C++ runtime that knows about LLM-specific structures the generic TRT runtime doesn't — KV cache, request scheduling, multi-LoRA. For LLMs, you reach for TRT-LLM.
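Recent TRT-LLM releases also expose a high-level Python API alongside the engine-build tooling. A hedged sketch, assuming the tensorrt_llm package with that LLM API available (model name illustrative):

```python
# Sketch: TensorRT-LLM's high-level LLM API (recent releases).
# The model identifier is illustrative; a HF checkpoint or prebuilt engine works.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=64, temperature=0.7)

# In-flight batching and the paged KV-cache are handled by the runtime underneath.
for out in llm.generate(["Explain paged KV-cache in one sentence."], params):
    print(out.outputs[0].text)
```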
Depth: NVIDIA_GPU_19_TensorRT_LLM.
Open-source model server. The serving runtime, not the optimisation engine. Triton's job: take many model artefacts (TRT engines, ONNX, PyTorch, TF, Python custom backends), expose them via gRPC and HTTP, batch requests, route them to the right backend.
- Configure max_batch_size and a queue delay budget (dynamic batching).
- /v2/health/ready for orchestrator probing.
- Hands-on: exercises/04_triton_serving_demo/.
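A client-side sketch using the tritonclient package, probing the same readiness endpoints an orchestrator would; the URL and model name are illustrative:

```python
# Sketch: probe a running Triton server over HTTP with the tritonclient package.
# URL and model name are illustrative.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# The same checks an orchestrator performs via /v2/health/live and /v2/health/ready
print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready: ", client.is_model_ready("my_model"))
```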
The reusable mental model: Triton is the server, TRT-LLM is one backend, the engine file is the compiled artefact. NIM (next slide) packages all three plus a model into a microservice container.
NVIDIA Inference Microservices. Pre-packaged container images on NGC, one per model + precision combination. Each NIM bundles:
- the model weights,
- a pre-built, pre-tuned inference engine (TensorRT-LLM for LLM NIMs),
- the Triton serving runtime,
- a standard API layer (OpenAI-compatible for LLM NIMs).
Run a NIM with docker run nvcr.io/nim/<publisher>/<model>:<tag> and you have a production-ready endpoint. The trade-off vs building your own engine: less control, less tuning headroom, but predictable performance.
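Because LLM NIMs expose an OpenAI-compatible API, the standard openai client works against the local endpoint. A sketch, assuming the container publishes port 8000 (port and model id are illustrative):

```python
# Sketch: call a locally running LLM NIM through its OpenAI-compatible API.
# Port and model id are illustrative; check the NIM's docs for its model id.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

resp = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "One sentence on what a NIM is."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```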
NIM is the layer above Triton+TRT-LLM that hides their composition behind a single deployable artefact. Cert framing: NIM is "production deployment without engine-building".
| Component | What it does |
|---|---|
| Base Command | Job orchestration for DGX systems — submit/monitor training jobs, manage datasets, track experiments. The DGX-native scheduler. |
| Run.ai | NVIDIA-acquired Kubernetes-native GPU orchestration. Fractional GPUs, fair sharing, multi-tenant scheduling, gang scheduling. |
| DGX Cloud | Hosted access to DGX clusters via NVIDIA's partner clouds (AWS, Azure, GCP, OCI). Reserve capacity, run NeMo workloads, no on-prem hardware. |
Brendan's local hardware (RTX 3080, RTX 4000 Ada) gives no direct access to any of these — this slide is exam-knowledge only. The cert may ask: "Which NVIDIA component schedules training jobs on DGX SuperPOD?" → Base Command. "Which integrates with Kubernetes for fractional GPU sharing?" → Run.ai.
The licensed software bundle. Not a single product — a subscription that includes:
- validated, supported builds of the NVIDIA stack (NeMo, Triton, RAPIDS and related components),
- NIM microservices licensed for production use,
- enterprise support, SLAs, and security patching.
Enterprise customers running NIM in production need AI Enterprise. Open-source equivalents (NeMo, Triton, vLLM) are free to use without subscription — the licence covers the validated/supported builds and SLAs, not the technology itself.
Open-source GPU-accelerated data science suite. Drop-in replacements for pandas (cuDF), scikit-learn (cuML), NetworkX (cuGraph), and geospatial workflows (cuSpatial).
Lighter cert weight than NeMo/Triton/TRT-LLM, but mentioning RAPIDS in a data-prep answer demonstrates you know the wider stack.
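A sketch of the drop-in claim with cuDF; the CSV path and column names are illustrative, and the API mirrors pandas:

```python
# Sketch: cuDF mirrors the pandas API but executes on the GPU.
# "events.csv" and the column names are illustrative.
import cudf

df = cudf.read_csv("events.csv")
summary = (
    df[df["latency_ms"] > 100]        # filter runs on the GPU
      .groupby("service")["latency_ms"]
      .mean()
)
print(summary)
```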
System-wide profiling. CPU, GPU, NCCL, CUDA streams, OS scheduling. Timeline view across processes. Used to find where time goes at the application level.
"Why is this training step slow?"
Kernel-level profiling. Per-kernel SM utilisation, memory access patterns, occupancy, instruction mix, roofline analysis. Used to optimise individual CUDA kernels.
"Why is this kernel achieving only 40% peak FLOPs?"
The standard workflow: start with Nsight Systems to find the bottleneck step or kernel. Then drop into Nsight Compute for that specific kernel. NVTX annotations in your code mark regions for the timeline.
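A sketch of those NVTX annotations from PyTorch; torch.cuda.nvtx ships with PyTorch, and each range shows up as a named span on the Nsight Systems timeline:

```python
# Sketch: NVTX ranges around training-step phases, visible in Nsight Systems.
# Profile with something like: nsys profile -o step_trace python train.py
import torch

def training_step(model, batch, optimizer, loss_fn):
    torch.cuda.nvtx.range_push("forward")
    out = model(batch["x"])
    loss = loss_fn(out, batch["y"])
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer")
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()
    return loss
```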
Other tools: DCGM for production telemetry (Prometheus exporter), nvbandwidth for memory benchmarks, CUPTI as the underlying profiling API.
| Mechanism | Isolation | Where supported | Use case |
|---|---|---|---|
| MIG | Hardware-partitioned (separate SMs, L2, memory) | A100, H100, H200, Blackwell datacentre | Strict multi-tenancy, predictable QoS |
| MPS | Software multiplexing on shared GPU | All NVIDIA GPUs | Higher utilisation, no strict isolation |
| vGPU | Hypervisor-mediated; licence required | Workstation cards (RTX A/Ada, L40S) and datacentre | VDI, virtualised desktops |
For Brendan's hardware, neither MIG (RTX 3080 and RTX 4000 Ada don't support MIG) nor vGPU (no licence) applies. MPS is available but rarely useful for single-user workstations.
| I want to... | Reach for... |
|---|---|
| Pretrain or full-FT a large model on multi-GPU | NeMo Framework (Megatron-Core) |
| Run RLHF, DPO, or Constitutional AI | NeMo RL (formerly NeMo Aligner) |
| Build a custom LLM dataset at scale | NeMo Curator |
| Add runtime safety controls to a deployed LLM | NeMo Guardrails |
| Compile a high-throughput LLM inference engine | TensorRT-LLM |
| Serve compiled engines with HTTP/gRPC and batching | Triton Inference Server |
| Deploy a pre-tuned production-ready endpoint | NIM |
| Schedule jobs on a Kubernetes GPU cluster | Run.ai |
| Schedule jobs on a DGX SuperPOD | Base Command |
| Profile a slow training step at system level | Nsight Systems |
| Profile a slow CUDA kernel | Nsight Compute |
| Partition a single A100/H100 across tenants | MIG |
| GPU-accelerate pandas / scikit-learn workflows | RAPIDS (cuDF / cuML) |
NVIDIA GPU Cloud catalogue at catalog.ngc.nvidia.com. The canonical source for NVIDIA-published containers.
- nvcr.io/nvidia/tritonserver:<tag> — Triton releases. Tag pattern YY.MM-py3 (monthly).
- nvcr.io/nvidia/tensorrt:<tag> — TensorRT toolkit container.
- nvcr.io/nvidia/tensorrt_llm:<tag> or trt_llm_backend — TensorRT-LLM. Driver requirements per release.
- nvcr.io/nvidia/nemo:<tag> — NeMo Framework (currently the 26.x line as of April 2026).
- nvcr.io/nim/<publisher>/<model>:<tag> — NIM microservices.

Always pin tags. latest changes underneath you and breaks reproducibility.