Tensor parallelism only works because GPUs can talk faster to each other than to host memory. Trace NVLink from Pascal's first 40 GB/s links through the GB200 NVL72's rack-scale 1.8 TB/s domains, plus how NVSwitch turned point-to-point links into a fabric.
Why scale-up exists, how NVIDIA built it generation by generation, and what the rack-scale NVL72 actually is — framed in numbers a tensor-parallel run actually cares about.
Tensor parallelism shards every weight matrix across N GPUs. Each transformer layer ends with an all-reduce over the activations so every shard sees the full output. Pipeline parallelism passes activations between layers; data parallelism syncs gradients. All three regimes ultimately turn into demand for collective bandwidth on the interconnect.
The 2× factor in the heuristic below comes from the ring all-reduce algorithm: it does a reduce-scatter (one full pass around the ring) then an all-gather (a second pass), so each GPU sends and receives 2·(N−1)/N times the size of the tensor being reduced — approaching 2× as N grows (1.75× at N=8).
traffic per layer ≈ hidden_dim × seq_len × bytes × 2 across the ring. The (N−1)/N factor approaches 1 quickly, so the cost is set by the activation tensor size, not by the number of GPUs.
One layer's activation tensor is 8192 × 4096 × 2 = 64 MiB per replica. The ring all-reduce moves ~2× that around: ~128 MiB per layer per step across the ring. Llama-3 70B has 80 layers, so a single forward pass during prefill exchanges ~10 GiB of activation traffic over the interconnect. Multiply by batch and you can saturate any link the host CPU can offer.
(The "~512 MB per layer across all GPUs" framing in the brief counts both directions across all N−1 ring hops at once — the per-GPU half-duplex bill is what matters for sizing the link, and that is the ~128 MiB number above for an 8-way split.)
Empirically, NVLink delivers near-linear TP scaling up to 8 GPUs on a single NVSwitch. PCIe 4 x16 already collapses at TP=2 on a 70B model: the all-reduce stalls long enough that GPUs idle on activations.
Every NVLink generation roughly doubles per-GPU bandwidth. Every NVSwitch generation extends the domain over which that bandwidth is uniform. Together they push the wall at which TP stops scaling further out, generation by generation.
The P100 introduced NVLink: NVIDIA's answer to "PCIe 3.0 is far too slow to gang GPUs". Each P100 has 4 NVLink 1.0 links, each carrying 40 GB/s bidirectional (20 GB/s per direction) — aggregate 160 GB/s/GPU. That is 5× PCIe 3.0 x16 (32 GB/s bidirectional).
DGX-1 packs 8 P100s. With only 4 NVLink ports per GPU, a fully-connected mesh is impossible. NVIDIA chose a hybrid cube-mesh: each GPU has 4 direct neighbours over NVLink, and reaches the other 3 GPUs in two hops. Collectives have to be choreographed to avoid contended links.
4 NVLink 1.0 ports per P100, hybrid cube-mesh in DGX-1 → 160 GB/s aggregate per GPU.
NVLink also appeared on IBM POWER systems — the only time an NVLink endpoint sat on a non-NVIDIA processor before the C2C era. POWER8 "Minsky" nodes carried NVLink 1 between POWER8 cores and P100 GPUs; the later POWER9 AC922 pushed coherent NVLink 2 between POWER9 cores and V100 GPUs.
Software view: NCCL discovered the NVLink topology (the same view `nvidia-smi topo -m` prints) and chose ring orderings that hugged the direct edges. Even at 160 GB/s the wall was visible — an 8-way ring all-reduce on P100 was already a measurable fraction of step time even on small models.
V100 widened NVLink: 6 NVLink 2.0 links per GPU, each 50 GB/s bidirectional (25 GB/s per direction) — aggregate 300 GB/s/GPU, almost double the P100. The lane rate climbed to 25 Gb/s with the same NRZ signalling.
DGX-1V (V100 edition) used a similar hybrid mesh. With 6 ports per GPU, a fully-connected 8-way mesh is still impossible (would need 7 ports), so two-hop paths remained for some GPU pairs. NVIDIA's answer was to build a switch.
DGX-2 (2018) introduced NVSwitch 1: an 18-port crossbar at NVLink 2.0 speeds. A DGX-2 has 12 NVSwitch chips wired so that all 16 V100 GPUs see each other at full link bandwidth simultaneously. Per-GPU effective NVLink bandwidth into the switch fabric: 300 GB/s (each V100 contributes its 6 links at 50 GB/s bidirectional = 300 GB/s full-duplex into the fabric).
NVSwitch turned NVLink from a set of point-to-point cables into a fabric. Any GPU pair gets one switch hop at full bandwidth, regardless of their position. NCCL no longer has to map its collective onto a fragile mesh; it can assume uniform any-to-any bandwidth.
This is also when NVLink stopped being a "two GPUs cooperating" feature and started being the basis of tensor parallelism at scale. Megatron-LM's launch on 16-GPU DGX-2s would not have been possible on a mesh.
A100 again roughly doubled it: 12 third-generation NVLinks per GPU at 50 GB/s bidirectional each (25 GB/s per direction) — aggregate 600 GB/s/GPU. The lane rate doubled to 50 Gb/s, still NRZ (50 GBaud), with much tighter equalisation than NVLink 2.
The HGX A100 baseboard hosts 8 A100s wired through 6 NVSwitch 2 chips. Every A100 reaches every other A100 over the switch fabric at the full 600 GB/s aggregate. Aggregate NVLink bandwidth across the baseboard: 8 × 600 GB/s = 4.8 TB/s (bisection roughly half that).
An A100's HBM2e delivers 1.55–2.0 TB/s depending on SKU. NVLink 3 lands at ~30% of HBM bandwidth, which is good enough that an all-reduce overlapping a forward pass barely shows in profiles. That ~30%-of-HBM ratio is roughly what NVLink has held every generation — below it, TP starts to bite.
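A quick spot-check of that heuristic against public peak figures. The HBM numbers are my assumed datasheet peaks for the usual SKUs, so treat the ratios as rough rather than exact:

```cpp
// Spot-check of the ~30%-of-HBM heuristic across generations.
// NVLink figures are per-GPU bidirectional aggregates; HBM figures are
// assumed datasheet peaks for common SKUs.
#include <cstdio>

int main() {
    struct Gen { const char* gpu; double nvlink_gbs, hbm_gbs; };
    const Gen gens[] = {
        {"V100", 300,   900},    // NVLink 2 vs HBM2
        {"A100", 600,  2000},    // NVLink 3 vs HBM2e (80 GB SKU)
        {"H100", 900,  3350},    // NVLink 4 vs HBM3
        {"B200", 1800, 8000},    // NVLink 5 vs HBM3e
    };
    for (const Gen& g : gens)
        printf("%s: %4.0f / %4.0f GB/s = %4.1f%% of HBM\n",
               g.gpu, g.nvlink_gbs, g.hbm_gbs, 100 * g.nvlink_gbs / g.hbm_gbs);
    return 0;
}
```

The ratio hovers near 30% and drifts down slightly at Blackwell, which is consistent with the "target" framing above.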
H100 brought 18 fourth-generation NVLinks at 50 GB/s bidirectional each — aggregate 900 GB/s/GPU. The per-link rate stayed at 50 GB/s; NVIDIA scaled out by adding more links rather than faster ones. Signalling moved to PAM4 at 100 Gb/s per lane (50 GBaud × 2 bits/symbol).
HGX H100 keeps the 8-GPU baseboard but with 4 NVSwitch 3 chips. Every H100 reaches every other at the full 900 GB/s. Aggregate NVLink bandwidth: 8 × 900 GB/s = 7.2 TB/s on a single HGX baseboard (bisection roughly half that).
The big architectural surprise was NVLink-C2C ("chip-to-chip"): a tightened, lower-overhead variant used to connect a Grace CPU to a Hopper GPU on the GH200 superchip. 900 GB/s coherent, packaged on a single substrate. Cache-coherent in a way the inter-GPU NVLink had never been.
This is the moment NVLink stops being just "GPU-to-GPU" and becomes the in-package coherent fabric of NVIDIA's superchips. The Grace CPU's LPDDR5x acts as a slow second tier of GPU memory, addressable transparently from CUDA.
Hopper's FP8 throughput on tensor cores roughly tripled compute. If NVLink had stayed at 600 GB/s, the all-reduce-to-compute ratio would have collapsed and TP=8 would have stopped scaling. NVLink 4 was sized to keep that ratio steady.
B200 doubles per-link bandwidth: 18 fifth-generation NVLinks at 100 GB/s bidirectional each (50 GB/s per direction) — aggregate 1.8 TB/s/GPU. Signalling: 200 Gb/s per lane (PAM4, ~100 GBaud), the same SerDes generation found in 800G Ethernet.
NVSwitch 4 keeps the radix model and extends NVLink SHARP in-network reductions (first introduced with NVSwitch 3). The switch computes the partial sums of an all-reduce at the fabric level, halving the data each GPU has to receive. This is most of how the NVL72 hits its claimed all-reduce speedups versus an HGX H100 baseline.
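A two-line comparison makes the saving concrete. The SHARP model here is the simplified "send your data once, receive the finished sum once" view, not the exact protocol:

```cpp
// Per-GPU wire bytes for one 64 MiB all-reduce: ring versus in-network
// (SHARP-style) reduction, where the switch sums partials itself.
#include <cstdio>

int main() {
    const double tensor_mib = 64;                     // one layer, from earlier
    const int n = 8;
    double ring  = 2.0 * (n - 1) / n * tensor_mib;    // reduce-scatter + all-gather
    double sharp = tensor_mib;                        // one send, one receive
    printf("ring : %.0f MiB per GPU\n", ring);        // 112 MiB
    printf("sharp: %.0f MiB per GPU\n", sharp);       // 64 MiB, roughly half
    return 0;
}
```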
For the first time, NVLink is designed to span more than one box at full bandwidth. The NVL72 (next slide) is the first product to show what that means in practice.
NVLink 1 160 GB/s, NVLink 2 300 GB/s, NVLink 3 600 GB/s, NVLink 4 900 GB/s, NVLink 5 1,800 GB/s — all per-GPU bidirectional aggregates. Roughly 1.5–2× per generation. That growth has broadly matched compute growth across the same span — a deliberate co-design choice.
NVSwitch is what turns a handful of NVLink ports per GPU into a uniform fabric instead of a sparse mesh. The difference matters most when more than two GPUs are doing a collective at once.
4-GPU example: 6 direct edges. But if GPU 0 and GPU 3 are not directly wired, their traffic shares a link through GPU 1 or GPU 2. With three concurrent collectives, every link is contended.
Every GPU has one or more NVLinks into the switch. Any GPU pair sees one switch hop. Three concurrent flows do not contend as long as each endpoint's link rate isn't exceeded.
An N-way ring all-reduce moves 2(N−1)/N times the data being reduced through each GPU. On a mesh with 2-hop pairs, that traffic shares physical links. NCCL has to pick a ring that hugs direct edges, but in the worst case some links carry double traffic. NVSwitch makes every ring choice equivalent.
This is also why NVLink SHARP is so important on Blackwell — with the switch already in the path, having it perform the partial reduction itself saves one full pass through the fabric.
The GB200 NVL72 is the first product where an NVLink domain spans an entire rack. 72 B200 GPUs in 18 compute trays (4 GPUs + 2 Grace CPUs each), interleaved with 9 NVLink Switch trays, all connected by NVLink 5 in one logical domain. The whole rack behaves to NCCL like a single 72-GPU NVSwitch fabric.
72× B200 → 13.5 TB HBM3e, ~720 PFLOPS FP8 dense.
9 switch trays → 130 TB/s aggregate NVLink bandwidth, every GPU sees every other at full 1.8 TB/s.
Workloads that previously needed InfiniBand between racks now stay inside one NVLink domain.
Tensor parallelism scales until per-step all-reduce eats more time than compute. At 1.8 TB/s per GPU and 72-way fabric, TP=8 to TP=16 fits comfortably inside one NVL72; pipeline parallelism across 8–9 stages also stays in NVLink. No InfiniBand round trips at every layer.
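To see why staying inside the NVLink domain matters, here is a floor estimate for the per-layer all-reduce from the earlier example on different links. The bandwidths are per-direction figures assumed from the numbers in this piece; latency and compute overlap are ignored, so real times will be worse, but the ordering holds:

```cpp
// Lower-bound time for the ~112 MiB per-GPU, per-layer ring all-reduce
// from the earlier example, at various per-direction link bandwidths.
#include <cstdio>

int main() {
    const double per_gpu_bytes = 2.0 * 7 / 8 * 8192 * 4096 * 2;   // ~112 MiB
    struct Link { const char* name; double gbs; };                // GB/s, one direction
    const Link links[] = {
        {"PCIe 4.0 x16",      32},
        {"PCIe 5.0 x16",      64},
        {"NVLink 4 (H100)",  450},
        {"NVLink 5 (B200)",  900},
    };
    for (const Link& l : links)
        printf("%-16s %6.2f ms/layer, %6.1f ms over 80 layers\n",
               l.name, 1e3 * per_gpu_bytes / (l.gbs * 1e9),
               1e3 * 80 * per_gpu_bytes / (l.gbs * 1e9));
    return 0;
}
```

On PCIe 4.0 the all-reduce floor is several milliseconds per layer, hundreds of milliseconds per forward pass; on NVLink 5 it is around a tenth of a millisecond per layer, which is why it can hide behind compute.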
Hopper introduced NVLink-C2C ("chip-to-chip"), a variant of NVLink optimised for coherent chip-to-chip traffic. Unlike GPU-to-GPU NVLink (which is a packet-switched DMA fabric), C2C carries a cache-coherent protocol — in spirit comparable to Intel UPI or CXL.cache, but at 900 GB/s rather than the tens of GB/s those deliver.
The DGX Spark uses NVLink-C2C between an Arm Grace-style CPU and a Blackwell GPU on a single substrate, delivering 128 GB unified memory at LPDDR5x speed (~273 GB/s). The C2C link itself is faster than the LPDDR can sustain — the link is no longer the bottleneck for CPU-side memory.
`cudaMallocManaged(&p, ...)` followed by a plain dereference of `p[i]` becomes a coherent fetch over C2C. On pre-Grace systems you copied with `cudaMemcpy`, and the page wasn't visible to the CPU until you copied it back. NVLink-C2C means the CPU and GPU read the same cache line directly. Pointers cross the boundary. KV-cache offload to LPDDR5x becomes a transparent vLLM feature instead of an explicit pinned-memory dance.
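A minimal sketch of what that looks like in code. This runs on any Unified Memory system; the point is that on a Grace superchip the dereferences ride C2C coherently instead of triggering page-fault migration:

```cpp
// One managed allocation, touched by both processors through one pointer.
// Compile with nvcc as a .cu file.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* p, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= s;    // on Grace parts, may touch LPDDR5x-homed lines over C2C
}

int main() {
    const int n = 1 << 20;
    float* p = nullptr;
    cudaMallocManaged(&p, n * sizeof(float));   // one pointer, CPU and GPU
    for (int i = 0; i < n; ++i) p[i] = 1.0f;    // CPU writes directly
    scale<<<(n + 255) / 256, 256>>>(p, n, 2.0f);
    cudaDeviceSynchronize();
    printf("p[0] = %f\n", p[0]);                // CPU reads the GPU's result, no memcpy
    cudaFree(p);
    return 0;
}
```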
NVLink, PCIe, and InfiniBand do related things at very different scales. The only honest way to compare them is at the per-GPU effective bandwidth level, because that is what bottlenecks an all-reduce.
| Link | Per-link BW (bidir) | Per-GPU BW (bidir) | Reach | Coherent? |
|---|---|---|---|---|
| PCIe 4.0 x16 | ~64 GB/s | 64 GB/s | chassis | no |
| PCIe 5.0 x16 | ~128 GB/s | 128 GB/s | chassis | no (CXL.cache adds it) |
| NVLink bridge (consumer) | 56 GB/s | ~112 GB/s (3090) | 2 cards | no |
| NVLink 3 (A100) | 50 GB/s | 600 GB/s | chassis (HGX) | no |
| NVLink 4 (H100) | 50 GB/s | 900 GB/s | chassis (HGX) | no (C2C is coherent) |
| NVLink 5 (B200) | 100 GB/s | 1,800 GB/s | rack (NVL72) | no (C2C is coherent) |
| NVLink-C2C | 900 GB/s | 900 GB/s | same package | yes |
| InfiniBand HDR (200 Gb/s) | ~50 GB/s | ~50 GB/s/port | building | no |
| InfiniBand NDR (400 Gb/s) | ~100 GB/s | ~100 GB/s/port | building | no |
| InfiniBand XDR (800 Gb/s) | ~200 GB/s | ~200 GB/s/port | building | no |

(All figures are bidirectional so the columns compare like for like; halve them for per-direction numbers.)
GPUDirect RDMA lets an InfiniBand NIC DMA straight into GPU HBM, bypassing the host CPU and PCIe bounce buffers. This is what makes inter-node TP tolerable at all. But per-port InfiniBand bandwidth (~100 GB/s NDR, ~200 GB/s XDR, bidirectional) is still ~9–18× less than NVLink 5's 1.8 TB/s. Cross-node TP slows down by roughly that ratio — which is why TP is kept inside an NVLink domain and PP/DP is what spans nodes.
Run TP inside an NVLink domain (one HGX, or one NVL72). Run PP and DP across the InfiniBand fabric between racks. Plan around the slowest link in your collective chain — if any leg traverses IB, pick a parallelism strategy that keeps activation traffic off it.
Pick a GPU class, a node size, and the inter-GPU link. The planner estimates per-GPU NVLink bandwidth, the size of the coherent NVLink domain, the rough TP scaling factor, and the maximum coherent memory you can address.
The TP scaling factor uses a simple log₂N attenuation (each doubling of the TP degree costs a fixed link-quality factor) with per-link multipliers of 0.95 NVSwitch / 0.85 bridge / 0.70 PCIe 5 / 0.55 PCIe 4 — see the sketch below. Reality varies with model size, NCCL version, kernel overlap and PCIe topology — treat the percentage as a sanity check, not a commitment.
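For concreteness, here is one reading of that heuristic as code. The multipliers come from the text above, but the exact attenuation shape is my assumption, so treat this as a sketch of the planner's model rather than its implementation:

```cpp
// TP scaling estimate: ideal N-way speedup attenuated once per doubling
// of the TP degree by a per-link quality multiplier.
#include <cmath>
#include <cstdio>

double tp_speedup(int n_gpus, double link_mult) {
    double doublings = std::log2((double)n_gpus);
    return n_gpus * std::pow(link_mult, doublings);   // ideal N × attenuation
}

int main() {
    struct Link { const char* name; double mult; };
    const Link links[] = {
        {"NVSwitch",      0.95},
        {"NVLink bridge", 0.85},
        {"PCIe 5",        0.70},
        {"PCIe 4",        0.55},
    };
    const int degrees[] = {2, 4, 8};
    for (const Link& l : links)
        for (int n : degrees)
            printf("%-14s TP=%d -> %.1fx of 1-GPU throughput\n",
                   l.name, n, tp_speedup(n, l.mult));
    return 0;
}
```

Note how the model reproduces the earlier claims: NVSwitch at TP=8 lands near 6.9× (near-linear), while PCIe 4 at TP=2 is already only ~1.1×, i.e. the collapse described above.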