NVIDIA GPU Architectures Series — Presentation 20

NeMo, NIM & AI Enterprise — NVIDIA's Production AI Stack

NVIDIA's full-stack answer to "how do I actually train and serve enterprise AI?": NeMo Framework for pretraining and fine-tuning, NeMo Aligner for RLHF and DPO, NIM microservices for one-line model deployment, Base Command for cluster control, and the AI Enterprise commercial bundle that ties it all together.

NeMoNeMo Megatron NeMo AlignerNIM Base CommandAI Enterprise NGCGuardrails TritonNeMo Curator
Curate Pretrain Fine-tune Align Quantize NIM Serve
00

Topics We'll Cover

The NVIDIA AI Enterprise bundle is six products glued together with a commercial wrapper. We work top-to-bottom: data prep, training, alignment, deployment, governance, scheduling.

01

The AI Enterprise Bundle in One Page

NVIDIA AI Enterprise is a single licensed bundle that wraps six distinct products. Each is independently useful; together they cover the full lifecycle from raw web data to a production endpoint.

NeMo Framework

Training. PyTorch-Lightning + Hydra wrapper around Megatron-LM. Pretrain or fine-tune GPT, Llama, Mistral, Mixtral, Nemotron at any scale from 300 M to 1 T params.

NeMo Aligner

RLHF / DPO / PPO. Post-training alignment toolbox — SFT, reward-model training, PPO, DPO, RLOO, KTO, IPO, SteerLM. Used to align Nemotron-70B.

NeMo Curator

Data prep at scale. Distributed crawl, language ID, exact + fuzzy dedup, quality scoring, PII redaction, classifier filtering, blending. Trillion-token-scale.

NeMo Guardrails

Safety filtering. Open-source library plus a packaged NIM. Colang DSL for input / dialog / execution / output rails. Integrates with any LLM endpoint.

NIM microservices

Packaged inference. Containerised, OpenAI-compatible endpoints with TensorRT-LLM engines, Triton, Prometheus metrics and autoscaling probes built-in.

Base Command

Cluster orchestration. Web UI + CLI for queueing, multi-tenant fair-share, dataset registry, MIG-aware scheduling. Sits above Slurm and Kubernetes.

The commercial wrapper

All six are commercially supported under the NVIDIA AI Enterprise licence (~$4,500 / GPU / yr list, frequently discounted via OEM resellers). The licence buys: validated container builds in NGC, security patches against the driver and CUDA matrix, a Day-0 support SLA, and certified compatibility with the underlying RHEL / Ubuntu / VMware host stacks.

The pitch in one sentence

"You can already pip install nemo and docker pull nim — AI Enterprise is what you buy when your CIO needs a phone number, a CVE feed, and a vendor to point at when something breaks at 2 AM."

Where each piece fits in the lifecycle

Raw Web Curator NeMo Framework Aligner TRT-LLM NIM App
02

NeMo Curator — Data Pipeline at Scale

Pretraining a frontier model means moving on the order of 10–15 trillion tokens — the equivalent of every book ever written, dozens of times over. Curator is the GPU-accelerated pipeline that turns Common Crawl WARC files into clean, deduplicated, blended pretraining shards.

Pipeline stages

1. Web crawl loaders

Common Crawl WARC, ArXiv, Stack Exchange, Wikipedia, GitHub. Streaming readers that turn shards into PyArrow / cuDF dataframes without materialising everything in RAM.

2. Language ID

FastText classifier; runs on CPU but is one of the bottlenecks at scale. Drops everything that isn't on the configured allow-list (typically English + a target list of 30+ languages).

3. Exact dedup

Document-level hash (xxhash on normalised text). Drops verbatim duplicates — the same blog post mirrored across 200 hosts. Cuts ~30% of Common Crawl in one pass.

4. Fuzzy dedup (MinHash / LSH)

MinHash signatures over n-gram shingles bucketed via locality-sensitive hashing. Catches near-duplicates — same article with a different ad header. The single biggest contributor to dataset quality.

5. Quality scoring

Heuristics (mean line length, ratio of alphanumerics, fraction of stop-words) plus a small classifier trained against a curated good/bad split. Filter the bottom 30%.

6. PII redaction

Regex + NER classifier for emails, phone numbers, credit cards, SSNs. Replaced with placeholder tokens. Mandatory for enterprise corpora subject to GDPR / CCPA.

7. Classifier-based filtering

Domain classifiers (toxicity, code-vs-prose, technical-vs-spam). Run as Triton-served NIMs. The compute-heavy step — this is where the GPUs earn their keep.

8. Blending

Weighted sampling across sources (e.g. 50% Common Crawl, 15% code, 10% books, 25% curated). Output written as MMapped IndexedDataset shards directly consumable by Megatron.

Execution model

Curator runs on top of Dask + Ray, with cuDF accelerating the dataframe stages. The MinHash + LSH stage and the classifier-inference stage are the two GPU-bound bottlenecks; everything else is CPU-cheap. On a 64-node DGX cluster, processing one Common Crawl snapshot (~3 PB compressed) takes days, not weeks.

curator_pipeline.py — CC WARC to cleaned shards
import nemo_curator as nc
from nemo_curator import Sequential, Modify, Filter
from nemo_curator.modules import ExactDuplicates, FuzzyDuplicates
from nemo_curator.filters import FastTextLangId, RepeatingTopNGramsFilter

# 1. Load Common Crawl WARC shards into a distributed cuDF dataset
ds = nc.read_warc("s3://commoncrawl/cc-2026-04/segments/*/*.warc.gz")

# 2. Build a stages pipeline (Dask graph; nothing executes yet)
pipeline = Sequential([
    Modify(nc.UnicodeReformatter()),
    Filter(FastTextLangId(model="lid.176.bin", lang="en")),
    Filter(RepeatingTopNGramsFilter(n=10, max_repetition=0.18)),
    ExactDuplicates(id_field="id", text_field="text"),
    FuzzyDuplicates(num_hashes=260, num_buckets=20, jaccard_threshold=0.8),
    Modify(nc.PiiRedactor(supported_entities=["EMAIL", "PHONE", "SSN"])),
    Filter(nc.QualityClassifier(model="nvidia/quality-classifier-deberta")),
])

# 3. Run distributed across 64 GPUs via Dask
client = nc.get_client(scheduler="slurm", n_workers=64)
clean = pipeline(ds)
clean.to_indexed_dataset("/lustre/pretraining/cc_clean_v3", shard_size="4GB")
Why GPU-accelerated

FastText LangID is fast on CPU but the classifier-filter stage (DeBERTa quality / toxicity heads) is unavoidable Transformer inference. Without GPUs that single stage dominates wall-time by 50×. Curator's design exists because Common Crawl cannot be processed on CPU clusters in any reasonable budget.

03

NeMo Framework — Pretraining

NeMo Framework is the training half of the bundle. Internally it is Megatron-LM — the parallelism engine that NVIDIA pioneered — wrapped in PyTorch Lightning for the trainer loop and Hydra for configuration. Megatron and NeMo were two separate projects until 2024, when Megatron-Core was upstreamed and NeMo became the canonical entry point.

What Megatron pioneered

NeMo composes all of these via Hydra YAML, with sensible defaults per model class.

Models supported out-of-the-box

FamilySizesVariantsNotes
GPT300 M – 1 TDecoder-only causalThe reference recipe; everything else is a delta from this.
Llama1 B – 405 BLlama-2, Llama-3.x, code-LlamaRoPE, SwiGLU, GQA preset.
Mistral7 BSliding-window attentionDrop-in via Llama recipe + window flag.
Mixtral (MoE)8×7 B, 8×22 BSparse MoENeeds Expert Parallel; routing-aware loss.
Falcon7 B – 180 BMulti-query attentionTII-trained reference.
Phi1.5 B – 3.8 BMicrosoft small modelsDistillation-friendly; runs in NeMo with the GPT recipe.
Gemma2 B / 7 BGoogle openDifferent RMSNorm constants.
Nemotron340 B / 70 B / 4 BNVIDIA's ownThe reference release that exercises the full Aligner stack.

A pretrain config in 30 lines

conf/llama3_70b_pretrain.yaml
defaults:
  - _self_
  - optim: distributed_fused_adam
  - data: blended_v3

name: llama3_70b
trainer:
  devices: 8
  num_nodes: 64            # 512 H100s
  max_steps: 280000
  precision: bf16-mixed
  gradient_clip_val: 1.0

model:
  num_layers: 80
  hidden_size: 8192
  num_attention_heads: 64
  num_query_groups: 8      # GQA
  ffn_hidden_size: 28672
  position_embedding_type: rope
  activation: fast-swiglu
  normalization: rmsnorm
  seq_length: 8192
  global_batch_size: 2048
  micro_batch_size: 1
  # --- parallelism ---
  tensor_model_parallel_size: 8
  pipeline_model_parallel_size: 4
  context_parallel_size: 1
  sequence_parallel: true
  # --- optim & checkpointing ---
  fp8: true
  fp8_format: hybrid           # E4M3 fwd, E5M2 bwd
  activations_checkpoint_granularity: selective

Launch this as python megatron_gpt_pretraining.py --config-path conf --config-name llama3_70b_pretrain via Slurm on a DGX SuperPOD or via Base Command jobs on managed clusters.

04

Parallelism Strategies in NeMo

Modern frontier training is never a single dimension of parallelism. NeMo composes six axes; the right combination depends on model shape, GPU count, and interconnect.

TP — Tensor Parallel

Split each linear layer's weight matrix column-wise across GPUs; gather via NCCL all-reduce after every layer. NVSwitch-bound: works inside a single NVLink domain but falls off a cliff over PCIe. Typical: TP = 8 inside a DGX node.

PP — Pipeline Parallel

Split layers across GPUs. Only hidden states cross between stages, so the per-step traffic is tiny — PCIe-friendly and cross-node-friendly. Tradeoff: a "bubble" of idle time at start/end of each microbatch. NeMo uses interleaved 1F1B to shrink it.

SP — Sequence Parallel

Inside the dropout / layer-norm regions, split activations along the sequence dimension across the TP group. Doesn't add comms (same all-reduce) but recovers ~30% of activation memory. Always-on for TP > 1.

DP — Data Parallel

Different batches per group of GPUs. ZeRO-style optimiser-state sharding (Megatron's "distributed-optim") shards the AdamW state along DP for free memory savings. Combine: outer-DP, inner-(TP×PP).

EP — Expert Parallel

For MoE models (Mixtral, Nemotron-MoE, DeepSeek-style). Each expert lives on a different GPU; routing is an all-to-all. Communicates only the activated tokens, so cost scales with the number of routed tokens, not the parameter count.

CP — Context Parallel

Split very long sequences along the sequence axis across attention heads, with ring-attention reducing the KV across the CP group. Required for 1 M+ context training. Added to NeMo with the long-context push of 2024–25.

A composed example

A 70 B Llama-3 pretrain on 512 H100s typically runs:

Total GPUs = TP × PP × DP = 8 × 4 × 16 = 512. Add CP for > 32 K context, EP for MoE.

Outer
DP × 16
replicas, sync via all-reduce
Middle
PP × 4
layer stages, IB-friendly
Inner
TP × 8
SP on
NVSwitch-local
Optional
EP for MoE
CP for >1M ctx
Rule of thumb

Push TP as high as your NVLink/NVSwitch domain allows (8 inside a DGX). Use PP next, across nodes via InfiniBand. Fill out remaining GPUs with DP. Add EP only when you actually have experts; add CP only when you actually have long context. Don't overcompose.

05

NeMo Aligner — RLHF, DPO, PPO

Pretraining produces a base model that is fluent but unaligned: it will happily complete a request to write malware. Alignment — the post-training phase — turns a base model into an instruction-following assistant. NeMo Aligner is the toolbox.

The standard pipeline

Base pretrained model
SFT — supervised fine-tune on instruction data
Reward Model training (preference pairs)
Policy Optimisation (PPO / DPO / RLOO / KTO / SteerLM)
Aligned model ready for serving

The methods (and when to pick which)

MethodYearNeedsWhen to use
PPO2017 / RLHF 2022Reward model + reference + policy + value headClassic RLHF. Highest quality, most expensive (4 model copies).
DPORafailov 2023Preference pairs; no reward modelSimpler, cheaper. The default for most fine-tunes today.
RLOOAhmadian 2024Reward model; no value headLighter than PPO, retains on-policy benefits.
KTOEthayarajh 2024Single-rated samples (good / bad), no pairsWhen you don't have curated A-vs-B preference pairs.
IPOAzar 2023Preference pairsDPO with a more conservative regularisation; less prone to over-optimisation.
SteerLMNVIDIA 2023Multi-attribute labelsConditional alignment — tune helpfulness, toxicity, verbosity at inference time.

Distributed architecture

PPO-style training is the hard case: you need four model copies live simultaneously. Aligner places them on separate GPU groups and routes traffic between them:

Rollouts (the actual generation step that PPO depends on) are run via TensorRT-LLM for 5–10× throughput vs naive PyTorch generation. The reward model and reference live on dedicated GPU groups and are queried over NCCL.

A DPO config

conf/dpo_llama3_70b.yaml
name: dpo_llama3_70b
trainer:
  devices: 8
  num_nodes: 8             # 64 H100s
  max_steps: 2000
  precision: bf16-mixed

model:
  pretrained_checkpoint: "/ckpt/llama3_70b_sft.nemo"
  ref_policy_checkpoint: "/ckpt/llama3_70b_sft.nemo"
  tensor_model_parallel_size: 8
  pipeline_model_parallel_size: 2

dpo:
  beta: 0.1                  # KL strength
  loss_type: sigmoid           # or 'ipo' / 'rso'
  label_smoothing: 0.0
  preference_average_log_probs: true

data:
  train_ds:
    file_path: "/data/anthropic_hh_rlhf_pairs.jsonl"
    micro_batch_size: 1
    global_batch_size: 128
    max_seq_length: 4096
  validation_ds:
    file_path: "/data/hh_rlhf_val.jsonl"

optim:
  name: distributed_fused_adam
  lr: 5.0e-7
  weight_decay: 0.1
  sched:
    name: CosineAnnealing
    warmup_steps: 50
Production reality

DPO has eaten most of the alignment-startup market because it's much simpler: no reward model, no rollouts, no value head — just a contrastive loss over preference pairs. PPO still wins on the very largest reasoning-heavy tasks. Aligner supports both behind one config knob.

06

NIM — One-Line Model Deployment

Once you have a trained model, you need an endpoint. NIM (NVIDIA Inference Microservices) is NVIDIA's answer: a single Docker pull, a single port, and you have a production-ready, OpenAI-compatible inference server.

deploy a Llama-3.3-70B endpoint
# That's it. No model build, no engine compile, no Triton config.
$ docker run --gpus all -p 8000:8000 \
    nvcr.io/nim/meta/llama-3.3-70b-instruct:latest

# Talk to it like OpenAI:
$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"meta/llama-3.3-70b-instruct",
         "messages":[{"role":"user","content":"hi"}]}'

What's actually inside the container

TensorRT-LLM engine

Pre-built per GPU class (A100 / H100 / H200 / L40S / B200). On first launch, NIM detects the host GPU and pulls (or selects) the matching engine. Days of build time avoided.

Triton Inference Server

Triton fronts the engine, exposing the OpenAI-compatible HTTP / gRPC API. Handles dynamic batching, KV-cache reuse, request queueing.

Observability hooks

Prometheus /metrics endpoint with TTFT, TBT, queue depth, KV usage. OpenTelemetry traces propagate through to Triton internals.

Liveness / readiness probes

Standard /health/live and /health/ready endpoints — autoscaler-friendly. Readiness only flips true once the engine has finished loading and warmup runs are complete.

Three editions

NIM (on-prem)

Pulled from nvcr.io; runs on your own DGX, RTX server, or workstation. Requires AI Enterprise licence for production use, free for dev/eval.

NIM on cloud

One-click deploy buttons on AWS Marketplace, Azure Marketplace, GCP Marketplace. Same container image; cloud handles billing through your usual provider.

build.nvidia.com

NVIDIA-hosted free tier for prototyping. Same OpenAI-compatible API, served from NVIDIA's own DGX Cloud. Great for sketching demos before sinking GPU spend.

The catalog — ~100 models and growing

The trick

NIM hides the "which TensorRT-LLM engine binary, on which GPU, with which plugin set" matrix entirely. The same docker pull on a 4090 dev workstation, an L40S serving box, or a B200 cluster Just Works — the container picks the right engine at startup. That is the actual product value.

07

NIM Catalog — Beyond LLMs

The Docker-pull pattern works for any TensorRT-optimisable model. NVIDIA has packaged enough of these into NIMs that the catalog now spans most of the production AI surface, not just chat.

Text LLMs

Llama-3.1 / 3.3 (8B / 70B / 405B), Mixtral 8×7B / 8×22B, Nemotron-70B-Instruct (NVIDIA's flagship aligned model), NeMo Megatron 530B reference, Phi-3-Mini, Gemma-7B.

Embeddings + reranking

NV-EmbedQA-Mistral-7B-v2 (top of MTEB), NV-EmbedQA-E5-v5, Snowflake-Arctic-Embed-L, NV-Rerank-QA-Mistral-4B. Drop-in replacements for OpenAI text-embedding-3 in your RAG stack.

ASR (Riva)

Parakeet RNN-T (English), Parakeet-TDT (multi-lingual streaming), Conformer-CTC. All exposed as gRPC / WebSocket streaming endpoints with diarisation hooks.

TTS (Riva)

FastPitch acoustic model + HifiGAN vocoder. Sub-50 ms first-audio-out on H100 for streaming. Voice cloning via fine-tuning. Multi-speaker, prosody-controllable.

Vision-Language

NV-VLM-D (NVIDIA's multimodal), ChatRTX (local vision chat), NeVA (LLaVA-style), Llama-3.2-Vision NIM. Effects: Eye Contact (gaze redirection), Background Blur (segmentation NIM).

RAG building blocks

NV-Reranker (cross-encoder), NV-Embed (bi-encoder), ColPali (document-image embeddings — retrieve PDFs by visual layout, not OCR text). Compose them into a multi-stage retrieval pipeline with one docker compose.

Digital biology — BioNeMo

DiffDock-NIM (protein-ligand docking), ESM-2 NIM (protein language model embeddings), MolMIM (molecular generation). Same Docker-pull pattern; abstracts away build-and-tune for non-LLM scientific models.

Code & agents

Llama-3.1-Nemotron-70B-Reward (judge model), code-Llama variants, Nemotron-4-340B-Reward. Increasingly used as agent-evaluators inside larger pipelines.

Strategic point

The same Docker-pull pattern reused across modalities is what makes NIM a platform, not just a model server. A drug-discovery team and a chatbot team end up running the same kind of Kubernetes pods, with the same Prometheus dashboards, the same Triton-level batching, and the same AI Enterprise support contract — despite using radically different models underneath.

08

NeMo Guardrails — Safety on Rails

An LLM endpoint is dangerous by default: it'll happily reveal system prompts, follow injections, leak training data, or call tools it shouldn't. NeMo Guardrails is the runtime-safety layer that wraps any LLM endpoint — OpenAI, Anthropic, an in-house NIM — with policy.

Four kinds of rail

Input rails

Run before the LLM call. Block off-topic prompts, jailbreak attempts, prompt-injection signatures, PII in the user's input. Can short-circuit and return a canned refusal without ever invoking the LLM.

Dialog rails

Multi-turn flow control. "If the user asks for medical advice, route to a disclaimer flow first." Implemented as Colang flow definitions; compiles to a state machine.

Execution rails

Tool-use restrictions: which functions can the LLM actually call, with which argument schemas, with which RBAC. Regex-validated arguments before fn(...) ever fires.

Output rails

Run after the LLM has produced a draft. Filter unsafe content, hallucinated facts (cross-check against retrieval), or PII leaks. Optionally re-prompt the LLM to fix the violation.

How it actually enforces

policies/banking_assistant.co
# --- canonical user intents ---
define user ask_balance
  "what's my balance"
  "how much money do I have"
  "current account total"

define user ask_for_legal_advice
  "is this contract legal"
  "am I being sued"

define user attempt_jailbreak
  "ignore previous instructions"
  "act as a different model"

# --- bot responses ---
define bot refuse_legal
  "I can't give legal advice. Please consult a qualified solicitor."

define bot refuse_jailbreak
  "I'm here to help with your banking. Let's stick to that."

# --- flows (input rails) ---
define flow handle_legal
  user ask_for_legal_advice
  bot refuse_legal

define flow handle_jailbreak
  user attempt_jailbreak
  bot refuse_jailbreak

# --- flows (execution rails) ---
define flow handle_balance
  user ask_balance
  $auth = execute verify_session_jwt
  if not $auth.valid
    bot "Please re-authenticate."
    stop
  $bal = execute get_balance(account_id=$auth.account_id)
  bot "Your balance is " + $bal

Guardrails ships as both an open-source Python library (pip install nemoguardrails) and as a packaged NIM — the latter is appropriate when you want guardrails enforced for traffic that doesn't go through your application code (e.g. as a sidecar for a third-party endpoint).

Composability

Guardrails is LLM-agnostic. The same Colang policy can be pointed at a NIM-hosted Llama, an OpenAI GPT-4o endpoint, an Anthropic Claude endpoint, or a local Ollama instance. That's deliberate — NVIDIA wants to be the safety layer regardless of whose model you actually use.

09

Base Command — Cluster Orchestration

Slurm is the de-facto HPC scheduler — but it's a 2000s-era tool with a 1990s UX. Base Command is NVIDIA's commercial replacement: a web UI plus CLI that handles queueing, dataset management, MIG-aware scheduling, and integrated dashboards on top of the underlying Slurm or Kubernetes cluster.

What Base Command actually provides

Queue management

Submit, monitor, cancel jobs from a browser. Multi-tenant fair-share with per-team GPU-hour budgets; configurable preemption policy; integrated with LDAP / SAML for auth.

Job templates

Pre-configured presets for NeMo pretrain, NeMo SFT, Aligner DPO, and inference benchmarks. Pick a model size + GPU count, click submit. Sane defaults that stop you from misconfiguring TP / PP at 3 AM.

Checkpoint & dataset registries

Versioned object-store buckets for dataset and checkpoint artefacts. bcprun jobs reference them by name; provenance is tracked. Integrates with Lustre / GPFS / S3-compatible.

MIG-aware scheduling

Jobs declare their GPU-slice requirement (e.g. 1g.10gb); scheduler allocates the right MIG profile, reconfigures GPUs as needed, and tears down between job classes.

Integrated dashboards

Live GPU utilisation, NVLink traffic, IB bandwidth, KV usage, and per-job throughput — all wired up by default. No need to stand up a separate Grafana stack.

Ships with the driver

Bundled with the AI Enterprise driver / CUDA matrix — you don't separately install or upgrade it. Tight version coupling is the point.

Where it sits

User — Web UI / CLI / Python SDK
Base Command
Slurm or Kubernetes (driver included)
DGX / DGX SuperPOD / DGX Cloud

Used at NVIDIA itself

NVIDIA's internal R&D clusters — the ones they used to train Nemotron and pre-release Llama partners — run Base Command. DGX Cloud (NVIDIA-managed bare-metal DGX clusters on OCI / Azure / GCP) ships with Base Command pre-installed: that's how external customers get the same experience.

Alternatives in the wild

Why the Run.ai acquisition matters

NVIDIA's bet: Slurm-shop customers stay on Slurm, but every new AI cluster being built is on Kubernetes. Run.ai is the K8s scheduling layer they bought to make sure the AI Enterprise UX (workspaces, fair-share, MIG, dashboards) works identically on K8s as it does on Slurm. Expect Run.ai and Base Command to converge over the next two years.

10

NGC, AI Enterprise, DGX Cloud — How These Connect

Three NVIDIA brand names get confused all the time. They are not the same thing.

NGC catalog

The registry. A free public catalog of NVIDIA-validated containers, models, and Helm charts. PyTorch, TensorFlow, NeMo, Triton, NIM, RAPIDS, Modulus, BioNeMo — all live in nvcr.io. Pull anonymously for personal / dev; log in with an NGC API key for protected images.

NVIDIA AI Enterprise

The licence. A commercial software bundle: NeMo + NIM + Triton + Base Command + Riva + RAPIDS + Run.ai + vGPU. Per-GPU-per-year subscription (~$4,500 list, OEM-discounted). Buys you support SLAs, validated stacks, security patches, and certified host-OS compatibility.

DGX Cloud

The hosted offering. Managed bare-metal DGX clusters on Oracle Cloud, Azure, and GCP. Comes with AI Enterprise pre-installed and Base Command as the front-end. Pay hourly for short jobs or reserve for long pretrains. NVIDIA handles the hardware; you bring the workload.

How they fit together

NGC catalog — where containers live
AI Enterprise licence — what you pay for support on those containers
DGX Cloud — NVIDIA-hosted infra running both
Base Command + Run.ai — scheduling layer on top

The buying decision

You are…You probably want…Why
An individual / hobbyistNGC pulls onlyFree; the open-source containers do everything you need for dev.
A startup, < 50 GPUs on-premNGC + selective NIM trialsUse the free build.nvidia.com NIM tier; only buy AI Enterprise when production support is mandatory.
An enterprise with on-prem DGXNGC + AI Enterprise + Base CommandSLAs and patch cadence justify the per-GPU licence. Base Command bundled.
An enterprise wanting elastic scaleDGX Cloud (incl. AI Enterprise)You pay per GPU-hour; NVIDIA handles cluster ops. Burst-friendly.
A hyperscaler customerNIM via AWS / Azure / GCP marketplaceOne-click deploy on an existing cloud account; NIM billing rolls into your hyperscaler invoice.
The Run.ai inflection

Since the late-2024 Run.ai acquisition, the K8s scheduling story has consolidated under NVIDIA. Expect the AI Enterprise SKU to increasingly include "Run.ai workspaces" as a first-class feature for Kubernetes shops — mirroring what Base Command provides on Slurm. The two will likely share a UI by 2027.

11

Interactive: NVIDIA AI Stack Builder

Pick your goal, team size, and infrastructure. The builder recommends which NeMo modules and NIM containers to use, whether AI Enterprise is needed, and prints a 3-line stack recipe.

AI Enterprise
Base Command
Licence band
Infra band
A real-world example output

Goal = "Fine-tune Llama-70B", team = 25, infra = on-prem DGX → DGX H100 SuperPOD + NeMo Megatron for SFT → NeMo Aligner for DPO → quantize via TRT-LLM → deploy as NIM behind Triton; AI Enterprise required for support, Base Command bundled.