This note covers three NCP-GENL domains: GPU Acceleration and Optimisation (14%), Model Deployment (9%), and Production Monitoring and Reliability (7%). It is also relevant to NCA-GENL Software Development (24%). The NVIDIA stack is high-stakes for the cert — questions will probe component boundaries, the relationship between tools, and where each belongs in a production workflow.
Primary portfolio references: NVIDIA_GPU_19_TensorRT_LLM and NVIDIA_GPU_20_NeMo_NIM_AI_Enterprise. The full GPU architecture series lives at LLM_Hub_NVIDIA_GPUs.
The NVIDIA software ecosystem is layered. From silicon to application:
┌────────────────────────────────────────────────────────────────┐
│ Applications: NeMo, NIM, user workloads │
├────────────────────────────────────────────────────────────────┤
│ Serving: Triton Inference Server + TensorRT-LLM backend │
├────────────────────────────────────────────────────────────────┤
│ Optimisation: TensorRT / TensorRT-LLM (engine compilation) │
├────────────────────────────────────────────────────────────────┤
│ ML frameworks: PyTorch, JAX, TensorFlow │
├────────────────────────────────────────────────────────────────┤
│ CUDA libraries: cuDNN, cuBLAS, NCCL, CUTLASS, RAPIDS │
├────────────────────────────────────────────────────────────────┤
│ CUDA Toolkit (13.x as of April 2026) + CUDA Runtime │
├────────────────────────────────────────────────────────────────┤
│ GPU Driver (≥ 580 for CUDA 13.x) │
├────────────────────────────────────────────────────────────────┤
│ GPU Hardware (Ampere / Ada / Hopper / Blackwell) │
└────────────────────────────────────────────────────────────────┘
Each layer is independently versioned. Compatibility matrices matter in practice: a CUDA 13.x toolkit requires driver ≥ 580; cuDNN versions are tied to specific CUDA major versions; TensorRT-LLM container images bundle a specific combination.
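The toolkit-to-driver rule above can be expressed as a small lookup. The single entry below (CUDA 13.x needs driver ≥ 580) comes from this note; for other branches, extend the table from NVIDIA's official compatibility matrix.

```python
# Minimum driver version required per CUDA toolkit major version.
# Only the CUDA 13.x entry is taken from this note; extend as needed.
MIN_DRIVER_FOR_TOOLKIT = {13: 580}

def driver_supports_toolkit(driver_version: int, cuda_major: int) -> bool:
    """Return True if the installed driver can run the given CUDA toolkit."""
    required = MIN_DRIVER_FOR_TOOLKIT.get(cuda_major)
    if required is None:
        raise ValueError(f"no entry for CUDA {cuda_major}.x in the table")
    return driver_version >= required

print(driver_supports_toolkit(580, 13))  # True
print(driver_supports_toolkit(550, 13))  # False
```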
| Library | Purpose |
|---|---|
| cuDNN | Primitives for deep neural networks: convolution, attention, normalisation, activation. Used by PyTorch and TensorFlow transparently |
| cuBLAS | Dense linear algebra (GEMM, GEMV, batch GEMM). The workhorse for matrix multiplications in LLM layers |
| NCCL | Collective communications (all-reduce, all-gather, reduce-scatter, all-to-all) for multi-GPU and multi-node training and inference |
| CUTLASS | Composable CUDA C++ templates for GEMM and convolution; used as the foundation for custom high-performance kernels including those in TensorRT-LLM |
| cuSPARSE | Sparse matrix operations; less central for LLMs but relevant for sparse attention |
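To make NCCL's core collective concrete, here is a pure-Python sketch of what an all-reduce(sum) computes: every rank ends up with the elementwise sum of all ranks' buffers. NCCL implements this with ring/tree algorithms over NVLink and InfiniBand; this sketch models only the result, not the communication pattern.

```python
def all_reduce_sum(rank_buffers):
    """Simulate all-reduce(sum): every rank receives the elementwise sum."""
    total = [sum(vals) for vals in zip(*rank_buffers)]
    return [list(total) for _ in rank_buffers]  # each rank gets a full copy

# Four "GPUs", each holding a gradient shard of length 3
grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300], [1000, 2000, 3000]]
reduced = all_reduce_sum(grads)
print(reduced[0])  # [1111, 2222, 3333] — identical on every rank
```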
For GPU memory hierarchy and how these libraries use it, see NVIDIA_GPU_04_Memory_Hierarchy and NVIDIA_GPU_03_Tensor_Cores.
NeMo Framework is NVIDIA’s scalable, cloud-native library for pretraining and fine-tuning large language models, multimodal models, and speech AI. It is built on PyTorch and leverages Megatron Core (the Megatron-LM training engine) for the parallelism infrastructure — tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism.
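The idea behind tensor parallelism, one of the Megatron Core schemes listed above, can be sketched in plain Python: a weight matrix is split column-wise across ranks, each rank computes its slice of `y = x @ W`, and gathering the partial outputs reconstructs the full result. Shapes and values here are illustrative.

```python
def matmul(x, W):
    """Naive list-of-lists matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*W)] for row in x]

def split_columns(W, parts):
    """Split W column-wise into `parts` shards (output dim must divide evenly)."""
    n = len(W[0]) // parts
    return [[row[i * n:(i + 1) * n] for row in W] for i in range(parts)]

x = [[1.0, 2.0]]                      # one input row, hidden size 2
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]            # 2 x 4 weight matrix
shards = split_columns(W, parts=2)    # two "tensor-parallel ranks"
partials = [matmul(x, Ws) for Ws in shards]
merged = [sum((p[0] for p in partials), [])]   # all-gather along columns
assert merged == matmul(x, W)         # sharded result matches the full matmul
```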
The framework has evolved significantly. As of the 26.x container releases, NeMo references Megatron Core via NeMo Megatron Bridge for LLM/VLM training, while speech AI uses the traditional NeMo collections. The main subcomponents relevant to the cert:
NeMo Curator is the data curation library for preparing pretraining and fine-tuning datasets at scale. Capabilities include: document deduplication (exact and fuzzy, via MinHash), quality filtering (heuristic and classifier-based), language identification, toxic content filtering, and data blending. Designed for GPU-accelerated processing of internet-scale corpora. The data preparation pipeline that feeds NeMo pretraining runs typically starts here.
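A toy illustration of the MinHash idea behind fuzzy deduplication: documents with overlapping shingle sets get similar minimum-hash signatures, and signature agreement estimates Jaccard similarity. Illustrative only; Curator's GPU implementation differs (LSH banding, many more permutations).

```python
import hashlib

def shingles(text, k=3):
    """Break text into overlapping k-word shingles."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in items)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "the quick brown fox jumps over the lazy dog near the old bridge"
sim = estimated_jaccard(minhash_signature(shingles(a)), minhash_signature(shingles(b)))
print(round(sim, 2))  # high for near-duplicates, near 0 for unrelated text
```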
NeMo RL is the alignment training library. It implements RLHF (PPO), DPO, REINFORCE/GRPO, and related algorithms for post-training alignment, and integrates with the Megatron-based training infrastructure for efficient large-scale alignment runs. This is NVIDIA's answer to libraries like TRL (Hugging Face); it targets multi-GPU cluster operation rather than single-GPU fine-tuning.
NeMo Guardrails is a runtime safety and controllability framework for LLM-powered applications. It is not a training-time tool: it operates during inference, between the application and the LLM. Guardrails are defined in Colang (a domain-specific language) and enforce conversational policies: topic restrictions, fact-checking hooks, output formatting rules, jailbreak mitigations. Guardrails can wrap any LLM endpoint.
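A minimal Colang 1.0-style rail, to show the shape of the DSL. The topic, utterances, and flow name are illustrative, not taken from an official configuration:

```colang
define user ask about politics
  "who should I vote for?"
  "what do you think about the election?"

define bot refuse politics
  "I can't help with political topics."

define flow politics
  user ask about politics
  bot refuse politics
```

The flow matches incoming user messages against the example utterances and forces the canned refusal, bypassing the LLM's own answer for that topic.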
NeMo Evaluator is an evaluation service within the NeMo ecosystem for running benchmark suites against trained models. Integrates with NeMo Run for experiment management.
NeMo Run is the experiment management tool: it configures, launches, and tracks training and evaluation runs across local and cluster environments.
For detailed coverage of NeMo and its components, see NVIDIA_GPU_20_NeMo_NIM_AI_Enterprise.
TensorRT is NVIDIA’s general-purpose inference optimisation engine. Given a neural network graph (from ONNX, PyTorch, TensorFlow), it compiles a hardware-specific engine that fuses operations, selects optimal kernel implementations, and applies precision calibration. Supports FP32, FP16, INT8, and FP8 on appropriate hardware. TensorRT is the foundation for optimising non-LLM models (CNNs, embedding models, encoders).
TensorRT-LLM is the LLM-specific extension of TensorRT. It is an open-source Python/C++ library that exposes both a Python model-authoring API and a high-performance C++ runtime. Key features:

- In-flight (continuous) batching: new requests join a running batch as earlier ones finish
- Paged KV cache for memory-efficient long-context serving
- Quantisation support: FP8, INT8 (SmoothQuant), INT4 (AWQ, GPTQ)
- Tensor and pipeline parallelism for multi-GPU inference
- Speculative decoding support
The workflow is: author the model with the TensorRT-LLM Python API (or use pre-built model implementations) → compile an optimised engine → serve via the TRT-LLM C++ runtime or expose via Triton Inference Server.
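Engine compilation fixes limits such as maximum batch size and sequence length up front, and KV-cache memory is the arithmetic behind those choices. A back-of-envelope sizing sketch; the model shape below is an assumed Llama-8B-like configuration, not taken from this note:

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache footprint: 2x for K and V, FP16 (2 bytes/element) by default."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed shape: 32 layers, 8 KV heads (GQA), head_dim 128, FP16
gib = kv_cache_bytes(batch=8, seq_len=4096, layers=32, kv_heads=8, head_dim=128) / 2**30
print(f"{gib:.1f} GiB")  # → 4.0 GiB of KV cache alone for this engine config
```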
For full TensorRT-LLM coverage, see NVIDIA_GPU_19_TensorRT_LLM.
Triton Inference Server is NVIDIA’s open-source model serving system. The key conceptual distinction: Triton is the server; TensorRT-LLM is one of its backends.
Triton’s architecture:

- Model repository: a filesystem layout of models, each with a `config.pbtxt`
- Pluggable backends: TensorRT, TensorRT-LLM, ONNX Runtime, PyTorch (LibTorch), Python
- Dynamic batching: server-side batching of individual requests for throughput
- Concurrent model execution: multiple models, or instances of one model, per GPU
- HTTP/REST and gRPC endpoints, plus Prometheus-format metrics
For LLM serving, the standard NVIDIA-recommended path is: TensorRT-LLM engine → TensorRT-LLM Triton backend → Triton server. NIM packages this stack as a ready-to-deploy container.
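Each model in Triton's repository carries a `config.pbtxt`. A minimal sketch for a TensorRT-LLM-backed model; real configurations generated from NVIDIA's templates carry many more parameters, and the name and values here are illustrative:

```protobuf
name: "my_llm"
backend: "tensorrtllm"
max_batch_size: 64
model_transaction_policy {
  decoupled: true   # stream tokens back to the client as they are generated
}
```

Decoupled mode is what enables token streaming: the backend can send many responses per request rather than one.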
NIM is a set of containerised inference microservices for deploying AI models in production. Each NIM container bundles:

- The model weights (or automated download of them)
- A pre-optimised inference engine for the detected GPU (TensorRT-LLM or another runtime)
- The serving runtime and an OpenAI-compatible API layer
- Health checks and metrics endpoints for production operation
The value proposition is operational: instead of manually compiling TensorRT-LLM engines, configuring Triton, and writing deployment scripts, a developer runs a single docker run invocation (or Helm chart in Kubernetes). The NIM catalog (hosted on NGC) covers LLMs (Llama, Mistral, Nemotron families), embedding models, rerankers, and domain-specific models.
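Because NIM exposes an OpenAI-compatible HTTP API, a deployed model is called like any OpenAI-style endpoint. This sketch only constructs the request; the URL, port, and model id are placeholders for a locally running NIM.

```python
import json
import urllib.request

url = "http://localhost:8000/v1/chat/completions"   # default local NIM endpoint (assumption)
payload = {
    "model": "meta/llama-3.1-8b-instruct",          # illustrative model id
    "messages": [{"role": "user", "content": "Summarise MIG in one sentence."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would return an OpenAI-style JSON response
print(payload["model"])
```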
NIM requires NVIDIA AI Enterprise licensing for production use, though developer-tier access is available with an NVIDIA developer account.
For NIM and its relationship to the broader NVIDIA stack, see NVIDIA_GPU_20_NeMo_NIM_AI_Enterprise.
These are the orchestration and infrastructure layer, relevant to enterprise cluster operation. Brendan has no DGX or cloud GPU access — this is exam-knowledge only.
NVIDIA Base Command is the cluster management and job scheduling platform for DGX systems. It provides a unified control plane for submitting training jobs, managing datasets, tracking experiments, and monitoring cluster health across on-premises DGX infrastructure.
Run.ai (acquired by NVIDIA) is a Kubernetes-native workload scheduler optimised for GPU workloads. It implements fractional GPU sharing, advanced queuing policies, and cluster fairness across teams. Run.ai is included in NVIDIA AI Enterprise as the scheduling layer for multi-tenant GPU clusters. It sits above the raw Kubernetes scheduler and below NeMo or user applications.
DGX Cloud is NVIDIA’s managed cloud service providing on-demand access to DGX-grade GPU clusters across major cloud providers (Microsoft Azure, Google Cloud, Oracle Cloud). It integrates Base Command for job management and provides the full AI Enterprise software stack pre-configured.
AI Enterprise is NVIDIA’s commercial software subscription that bundles:

- NIM microservices and the NGC catalog of production containers
- Enterprise-supported builds of NeMo Framework and Triton Inference Server
- TensorRT / TensorRT-LLM and the CUDA-X libraries
- Run.ai as the multi-tenant GPU scheduling layer
- LTS driver branches, security patches, validated configurations, and support SLAs
The licensing model is subscription-based. Open-source community versions of NeMo and Triton exist and are freely available; AI Enterprise adds long-term support, security patches, validated configurations, and the NIM catalog.
When AI Enterprise matters: regulated industries (healthcare, finance) requiring an LTS driver branch and formal support SLA; organisations deploying NIM at scale; any deployment where NVIDIA’s indemnification and support is contractually required.
RAPIDS is NVIDIA’s open-source GPU-accelerated data science library suite, part of the CUDA-X ecosystem. It is lighter-weight in cert coverage than the inference/training stack but appears in data preparation and MLOps contexts.
| Library | Drop-in replacement for | What it accelerates |
|---|---|---|
| cuDF | pandas | DataFrame operations — groupby, join, sort, I/O |
| cuML | scikit-learn | Classical ML — regression, clustering, dimensionality reduction, UMAP, HDBSCAN |
| cuGraph | NetworkX | Graph analytics — PageRank, connected components, community detection |
| cuVS | FAISS (partially) | Vector search and approximate nearest neighbour |
RAPIDS is relevant for GPU-accelerated ETL and feature engineering pipelines upstream of model training, and for embedding search in RAG systems. NeMo Curator uses RAPIDS internally for deduplication and filtering.
NVIDIA provides two primary profiling tools, often confused:
Nsight Systems is the system-level profiler. It captures CPU thread activity, GPU kernel launches, CUDA API calls, memory transfers, NCCL collectives, and I/O, with their timing relationships on a unified timeline. Use Nsight Systems first: it answers “where is the time going?” and identifies bottlenecks at the system level (e.g., data-loading bottlenecks, CPU-GPU synchronisation stalls, pipeline idle periods).
Nsight Compute is the kernel-level profiler. Given a specific CUDA kernel, it collects roofline model data, memory throughput, warp occupancy, instruction mix, and stall reasons. Use Nsight Compute after Nsight Systems has identified a hot kernel: it answers “why is this kernel slow?” and which architectural ceiling it is hitting.
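The roofline data Nsight Compute reports boils down to arithmetic intensity: FLOPs per byte moved. Comparing it with the hardware's FLOPs-to-bandwidth ratio tells you whether a kernel is compute-bound or memory-bound. The peak numbers below are illustrative, not tied to a specific GPU.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs executed per byte of memory traffic."""
    return flops / bytes_moved

def bound_by(intensity, peak_tflops, peak_bw_tbps):
    """Classify against the roofline: the ridge point is peak FLOPs / peak bandwidth."""
    ridge = peak_tflops / peak_bw_tbps   # FLOP/byte where the roofline bends
    return "compute-bound" if intensity >= ridge else "memory-bound"

# FP16 GEMM: C[m,n] = A[m,k] @ B[k,n] -> 2*m*n*k FLOPs, (m*k + k*n + m*n) * 2 bytes
m = n = k = 4096
ai = arithmetic_intensity(2 * m * n * k, (m * k + k * n + m * n) * 2)
print(round(ai, 1), bound_by(ai, peak_tflops=1000, peak_bw_tbps=3))
# → 1365.3 compute-bound (large square GEMMs sit far right of the ridge)
```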
Basic profiling workflow:

1. Profile the whole application with Nsight Systems (`nsys profile`) to see where time goes.
2. Identify the dominant GPU kernels, or idle gaps, on the timeline.
3. Profile the hot kernels with Nsight Compute (`ncu`) to find which hardware ceiling they hit.
4. Apply a fix, re-profile, and repeat.
For CUDA programming and kernel optimisation, see LLM_Hub_CUDA.
Three mechanisms for sharing a single physical GPU across multiple workloads:
| Mechanism | Full name | What it does | Best for |
|---|---|---|---|
| MIG | Multi-Instance GPU | Hardware-partitions the GPU into isolated slices, each with dedicated memory and compute. Supported on data-centre GPUs (A30, A100, H100, H200, Blackwell); not available on RTX/Ada consumer GPUs | Cloud multi-tenancy; strict isolation between workloads |
| MPS | Multi-Process Service | A CUDA daemon that allows multiple CUDA processes to share a single GPU context; lower overhead than process-switching; no memory isolation | Same-user workloads needing GPU sharing without full isolation |
| vGPU | Virtual GPU | Hypervisor-level GPU sharing for virtual machines; requires a licensed NVIDIA vGPU driver | VDI, virtualised data centres |
For a full comparison and when to use each, see NVIDIA_GPU_18_GPU_Sharing_MIG_MPS_vGPU.
Note: MIG is not available on the RTX 3080 or RTX 4000 Ada. MPS can be used on both.
| I want to… | Reach for… |
|---|---|
| Pretrain an LLM from scratch | NeMo Framework (Megatron Bridge) + NeMo Curator for data |
| Fine-tune with LoRA / SFT | NeMo Framework; or HF TRL + Accelerate for lighter setups |
| Align with RLHF / DPO | NeMo RL |
| Add runtime safety guardrails | NeMo Guardrails |
| Compile an optimised inference engine | TensorRT-LLM |
| Serve models at production throughput | Triton Inference Server + TensorRT-LLM backend |
| Deploy with zero inference configuration | NIM (pull from NGC catalog) |
| Profile the whole system for bottlenecks | Nsight Systems |
| Optimise a specific CUDA kernel | Nsight Compute |
| Accelerate data science / ETL on GPU | RAPIDS (cuDF, cuML, cuGraph) |
| Schedule GPU jobs in a multi-tenant cluster | Run.ai (via AI Enterprise) |