NVIDIA GPU Architectures Series — Presentation 05

NVLink & NVSwitch — How NVIDIA Builds One Big GPU Out of Many

Tensor parallelism only works because GPUs can talk faster to each other than to host memory. Trace NVLink from Pascal's first 40 GB/s links through the GB200 NVL72's rack-scale domain at 1.8 TB/s per GPU, plus how NVSwitch turned point-to-point links into a fabric.

00

Topics We'll Cover

Why scale-up exists, how NVIDIA built it generation by generation, and what the rack-scale NVL72 actually is — framed in numbers a tensor-parallel run actually cares about.

01

Why Scale-Up Matters — The All-Reduce Bill

Tensor parallelism shards every weight matrix across N GPUs. Each transformer layer ends with an all-reduce over the activations so every shard sees the full output. Pipeline parallelism passes activations between layers; data parallelism syncs gradients. All three regimes turn into collective bandwidth.

The 2× factor in the heuristic below comes from the ring all-reduce algorithm: it does a reduce-scatter (one full pass around the ring) then an all-gather (a second pass), so each GPU sends and receives 2·(N−1)/N times the per-GPU shard size — close enough to 2× for N≥4.

Per-step traffic, ring all-reduce

traffic per layer ≈ hidden_dim × seq_len × bytes × 2 across the ring. The (N−1)/N factor approaches 1 quickly, so the cost is set by the activation tensor size, not by the number of GPUs.

Llama-3 70B, hidden=8192, seq=4096, bf16

One layer's activation tensor is 8192 × 4096 × 2 = 64 MiB per replica. The ring all-reduce moves ~2× that around: ~128 MiB per layer per step across the ring. Llama-3 70B has 80 layers, so a single forward pass during prefill exchanges ~10 GiB of activation traffic over the interconnect. Multiply by batch and you can saturate any link the host CPU can offer.

(The "~512 MB per layer across all GPUs" framing in the brief counts both directions across all N−1 ring hops at once. What sizes the link is the per-GPU send bill, and that is the ~128 MiB number above for an 8-way split.)
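The back-of-envelope numbers above are easy to check in a few lines. A minimal sketch using the Llama-3 70B shapes from this slide; the 8-way tensor-parallel split is an assumption:

```python
# Ring all-reduce traffic for one transformer layer (Llama-3 70B shapes, bf16).
hidden, seq, dtype_bytes = 8192, 4096, 2
layers, n_gpus = 80, 8                      # assumed 8-way tensor-parallel split

tensor = hidden * seq * dtype_bytes         # one layer's activation tensor
ring = 2 * (n_gpus - 1) / n_gpus            # reduce-scatter + all-gather passes

print(tensor / 2**20)                       # 64.0  MiB per replica
print(ring * tensor / 2**20)                # 112.0 MiB exact; ~128 MiB with the 2x shorthand
print(layers * 2 * tensor / 2**30)          # 10.0  GiB per forward pass (2x shorthand)
```

The exact 2(N−1)/N factor gives 112 MiB at N=8; the ~128 MiB figure in the prose is the N→∞ shorthand.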

Link | Per-GPU bandwidth | GPU / verdict
PCIe 4 x16 | 32 GB/s | collapses
PCIe 5 x16 | 64 GB/s | marginal
NVLink 1 | 160 GB/s | P100
NVLink 2 | 300 GB/s | V100
NVLink 3 | 600 GB/s | A100
NVLink 4 | 900 GB/s | H100
NVLink 5 | 1800 GB/s | B200

(NVLink figures are NVIDIA's bidirectional totals; PCIe figures are the per-direction rates they are usually quoted at.)

Empirically, NVLink delivers near-linear TP scaling up to 8 GPUs on a single NVSwitch. PCIe 4 x16 already collapses at TP=2 on a 70B model: the all-reduce stalls long enough that GPUs idle on activations.
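To see why PCIe collapses, put rough wall-clock numbers on the ~128 MiB-per-layer figure. This sketch ignores latency, NCCL protocol overhead, and compute overlap, and uses the headline link rates from the ladder above, so treat it as an order-of-magnitude illustration:

```python
# Time to move one layer's ring all-reduce traffic (~128 MiB) at headline link rates.
traffic = 8192 * 4096 * 2 * 2               # bytes, using the 2x ring shorthand

links_gb_s = {"PCIe 4 x16": 32, "PCIe 5 x16": 64, "NVLink 4 (H100)": 900}
for name, gb_s in links_gb_s.items():
    ms = traffic / (gb_s * 1e9) * 1e3
    print(f"{name:>16}: {ms:5.2f} ms/layer")  # PCIe 4: ~4.2 ms; NVLink 4: ~0.15 ms
```

Multiply by 80 layers and PCIe 4 spends a third of a second per forward pass just moving activations, which is exactly the stall the slide describes.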

The point of every slide that follows

Every NVLink generation roughly doubles per-GPU bandwidth. Every NVSwitch generation extends the domain over which that bandwidth is uniform. Together they push the wall at which TP stops scaling further out, generation by generation.

02

NVLink 1 — Pascal (2016)

The P100 introduced NVLink: NVIDIA's answer to "PCIe 3.0 is far too slow to gang GPUs". Each P100 has 4 NVLink 1.0 links, each carrying 40 GB/s bidirectional (20 GB/s per direction), for an aggregate of 160 GB/s per GPU. That is 5× PCIe 3.0 x16 (32 GB/s bidirectional).

Physical layer

Each NVLink 1.0 link runs 8 differential pairs per direction at 20 Gb/s NRZ, giving 20 GB/s each way per link.

Topology — DGX-1 mesh

DGX-1 packs 8 P100s. With only 4 NVLink ports per GPU, a fully-connected mesh is impossible. NVIDIA chose a hybrid cube-mesh: each GPU has 4 direct neighbours over NVLink, and reaches the other 3 GPUs in two hops. Collectives have to be choreographed to avoid contended links.

4 NVLink 1.0 ports per P100, hybrid cube-mesh in DGX-1 → 160 GB/s aggregate per GPU.

Historical aside

NVLink was also the link used in IBM's Power systems, the only time an NVLink endpoint sat on a non-NVIDIA processor outside of the C2C era: Power8 "Minsky" nodes ran NVLink 1 between Power8 cores and P100 GPUs, and the Power9 AC922 pushed coherent NVLink 2 between Power9 cores and V100 GPUs.

Software view: NCCL discovers the NVLink topology (the same view nvidia-smi topo -m prints) and chooses ring orderings that hug the direct edges. Even at 160 GB/s the wall was visible: an 8-way ring all-reduce on P100 was already a measurable fraction of step time on small models.

03

NVLink 2 — Volta (2017) and the First NVSwitch

V100 widened NVLink: 6 NVLink 2.0 links per GPU, each 50 GB/s bidirectional (25 GB/s per direction), for an aggregate of 300 GB/s per GPU, almost double the P100. The lane rate climbed to 25 Gb/s with the same NRZ signalling.

DGX-1V mesh and its limits

DGX-1V (V100 edition) used a similar hybrid mesh. With 6 ports per GPU, a fully-connected 8-way mesh is still impossible (would need 7 ports), so two-hop paths remained for some GPU pairs. NVIDIA's answer was to build a switch.

NVSwitch 1 in DGX-2

DGX-2 (2018) introduced NVSwitch 1: an 18-port crossbar at NVLink 2.0 speeds. A DGX-2 has 12 NVSwitch chips wired so that all 16 V100 GPUs see each other at full link bandwidth simultaneously. Per-GPU effective NVLink bandwidth into the switch fabric: 300 GB/s (each V100 contributes its 6 links at 50 GB/s bidirectional each).

DGX-2: 16× V100, 12× NVSwitch 1, 2.4 TB/s bisection, 300 GB/s per GPU
vs DGX-1V: 8× V100 hybrid mesh, contended two-hop paths, same 300 GB/s per GPU but not uniformly usable
What changed conceptually

NVSwitch turned NVLink from a set of point-to-point cables into a fabric. Any GPU pair gets one switch hop at full bandwidth, regardless of their position. NCCL no longer has to map its collective onto a fragile mesh; it can assume uniform any-to-any bandwidth.

This is also when NVLink stopped being a "two GPUs cooperating" feature and started being the basis of tensor parallelism at scale. Megatron-LM's launch on 16-GPU DGX-2s would not have been possible on a mesh.

04

NVLink 3 — Ampere (2020)

A100 again roughly doubled it: 12 third-generation NVLinks per GPU at 50 GB/s bidirectional each, for an aggregate of 600 GB/s per GPU. The lane rate hit 50 Gb/s, still NRZ, with much tighter equalisation than NVLink 2.

HGX A100 with NVSwitch 2

The HGX A100 baseboard hosts 8 A100s wired through 6 NVSwitch 2 chips. Every A100 reaches every other A100 over the switch fabric at the full 600 GB/s aggregate. Aggregate NVLink bandwidth across the board: ~4.8 TB/s.

Datacenter (A100 SXM4)

  • 12 NVLink 3 ports → 600 GB/s
  • HGX A100 8-GPU board with NVSwitch 2
  • Any-to-any at full bandwidth
  • Foundation of MLPerf scaling 2020–2022

Workstation/consumer (3090, A6000)

  • NVLink bridge: 4 links exposed = 112 GB/s
  • Two-card only — no fabric, no switch
  • 3090 was the last GeForce with any NVLink at all
  • 4090 / 5090: no NVLink

What NVLink 3 unlocks

600 GB/s vs HBM2e

An A100's HBM2e delivers 1.55–2.0 TB/s. NVLink 3 hits ~30% of HBM bandwidth, which is good enough that an all-reduce overlapping a forward pass barely shows in profiles. This 30%-of-HBM heuristic is what NVLink targets every generation — below it, TP starts to bite.
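The heuristic can be sanity-checked against each generation's headline numbers. A sketch; the HBM figures are the top-bin datasheet values (80 GB A100, H100 SXM, B200), which is an assumption worth double-checking against the exact SKU:

```python
# NVLink-to-HBM bandwidth ratio per generation, headline figures in GB/s.
hbm    = {"A100": 2039, "H100": 3350, "B200": 8000}
nvlink = {"A100": 600,  "H100": 900,  "B200": 1800}

for gpu in hbm:
    ratio = nvlink[gpu] / hbm[gpu]
    print(f"{gpu}: {ratio:.0%}")            # stays in the ~20-30% band
```

The ratio drifts down slightly as HBM outpaces the interconnect, but it stays in the band where overlapped all-reduces hide behind compute.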

05

NVLink 4 — Hopper (2022) and NVLink-C2C

H100 brought 18 fourth-generation NVLinks at 50 GB/s bidirectional each, for an aggregate of 900 GB/s per GPU. The per-link rate stayed at 50 GB/s; NVIDIA scaled out by adding more links rather than faster ones. Per-lane signalling did move, though: PAM4 at 100 Gb/s per lane (50 GBaud × 2 bits/symbol), so each link needs half as many lane pairs.

NVSwitch 3 in HGX H100

HGX H100 keeps the 8-GPU baseboard but with 4 NVSwitch 3 chips. Every H100 reaches every other at full 900 GB/s. Aggregate NVLink bandwidth on a single HGX baseboard: ~7.2 TB/s.

NVLink-C2C — the Grace-Hopper variant

The big architectural surprise was NVLink-C2C ("chip-to-chip"): a tightened, lower-overhead variant used to connect a Grace CPU to a Hopper GPU on the GH200 superchip. 900 GB/s coherent, packaged on a single substrate. Cache-coherent in a way the inter-GPU NVLink had never been.

Grace CPU
72 Arm Neoverse
NVLink-C2C
900 GB/s coherent
Hopper H100
96 GB HBM3

This is the moment NVLink stops being just "GPU-to-GPU" and becomes the in-package coherent fabric of NVIDIA's superchips. The Grace CPU's LPDDR5x acts as a slow second tier of GPU memory, addressable transparently from CUDA.

FP8, transformer engine, and NVLink 4

Hopper's FP8 throughput on tensor cores roughly tripled compute. If NVLink had stayed at 600 GB/s, the all-reduce-to-compute ratio would have collapsed and TP=8 would have stopped scaling. NVLink 4 was sized to keep that ratio steady.

06

NVLink 5 — Blackwell (2024)

B200 doubles per-link bandwidth: 18 fifth-generation NVLinks at 100 GB/s bidirectional each, for an aggregate of 1.8 TB/s per GPU. Signalling: 200 Gb/s per lane (PAM4, 100 GBaud), the same SerDes generation found in 800G Ethernet.

NVSwitch 4 + NVLink-Sharp

NVSwitch 4 keeps the crossbar model but adds NVLink-Sharp: in-network reductions. The switch can compute the partial sums of an all-reduce at the fabric level, roughly halving the data each GPU has to receive. This is most of how the NVL72 hits its claimed all-reduce speedups versus an HGX H100 baseline.
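The "halving" claim falls out of the byte counts. A sketch comparing what each GPU receives under a ring versus an idealised in-switch reduction (real NCCL + NVLink-Sharp numbers will differ):

```python
# Bytes received per GPU: ring all-reduce vs. idealised in-network reduction.
S = 64 * 2**20                      # one layer's activation tensor, bytes

for n in (4, 8, 72):
    ring_rx = 2 * (n - 1) / n * S   # reduce-scatter + all-gather passes
    sharp_rx = S                    # switch sums the shards, returns the result once
    print(n, round(ring_rx / sharp_rx, 2))   # 1.5, 1.75, 1.97: approaches 2x
```

The saving approaches 2× as the domain grows, which is exactly why in-network reduction matters most at NVL72 scale.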

The fabric leaves the chassis

For the first time, NVLink is designed to span more than one box at full bandwidth. The NVL72 (next slide) is the first product to show what that means in practice.

B200 SXM6

  • 2-die chiplet, 10 TB/s die-to-die
  • 192 GB HBM3e @ 8 TB/s
  • 18 NVLink 5 ports → 1.8 TB/s
  • FP4 / FP8 / FP16 / BF16 / TF32

NVSwitch 4

  • 14.4 TB/s per switch tray
  • NVLink-Sharp in-network reduction
  • 9 switch trays per NVL72 rack
  • Copper cable backplane links compute trays, no optics inside the rack
Doubling, again

NVLink 1 160 GB/s, NVLink 2 300 GB/s, NVLink 3 600 GB/s, NVLink 4 900 GB/s, NVLink 5 1800 GB/s. Roughly 1.5–2× per generation. That growth has kept pace with compute growth across the same span, a deliberate co-design choice.

07

NVSwitch — Point-to-Point Becomes a Fabric

The NVSwitch is what turns a GPU's fixed handful of NVLink ports from a sparse mesh into a uniform fabric. The difference matters most when more than two GPUs are doing a collective at once.

Mesh (NVLink 1, no switch)

4-GPU example: a full mesh needs 6 direct edges. If the port budget leaves GPU 0 and GPU 3 without a direct wire, their traffic shares a link through GPU 1 or GPU 2. With three concurrent collectives, every link is contended.

G0 G1 G2 G3 solid = direct, dashed = shared diagonal

NVSwitch fabric

Every GPU has one or more NVLinks into the switch. Any GPU pair sees one switch hop. Three concurrent flows do not contend as long as each endpoint's link rate isn't exceeded.

NVSwitch G0 G1 G2 G3 any-to-any, full bandwidth, no shared edges
Why TP>2 needs a switch

An N-way ring all-reduce moves 2(N−1)/N times the per-GPU shard. On a mesh with 2-hop pairs, that traffic shares physical links. NCCL has to pick a ring that hugs direct edges, but in the worst case some links carry double traffic. NVSwitch makes every ring choice equivalent.
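The ring-choice point can be made concrete with a toy 4-GPU mesh that lacks the 0-3 edge. The wiring here is illustrative, not any real DGX topology:

```python
# Toy 4-GPU mesh without a 0-3 edge: count per-link load for two ring orderings.
edges = {frozenset(e) for e in [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]}

def route(a, b):
    # Direct edge if wired, else detour through the lowest-numbered shared neighbour.
    if frozenset((a, b)) in edges:
        return [frozenset((a, b))]
    via = min(v for v in range(4)
              if frozenset((a, v)) in edges and frozenset((v, b)) in edges)
    return [frozenset((a, via)), frozenset((via, b))]

def max_link_load(ring):
    load = {}
    for a, b in zip(ring, ring[1:] + ring[:1]):
        for e in route(a, b):
            load[e] = load.get(e, 0) + 1
    return max(load.values())

print(max_link_load([0, 1, 3, 2]))  # 1: every hop rides its own direct edge
print(max_link_load([0, 3, 1, 2]))  # 2: the 0->3 detour doubles load on one link
```

The good ordering keeps every physical link at unit load; the bad one halves effective bandwidth on the contended link. A switch makes both orderings equivalent, which is the slide's point.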

This is also why NVLink-Sharp (NVSwitch 4) is so important on Blackwell — with the switch already in the path, having it perform the partial reduction itself saves one full pass through the fabric.

08

NVL72 — The Rack as a GPU

The GB200 NVL72 is the first product where an NVLink domain spans an entire rack. 72 B200 GPUs in 18 compute trays (4 GPUs + 2 Grace CPUs each), interleaved with 9 NVLink Switch trays, all connected by NVLink 5 in one logical domain. The whole rack behaves to NCCL like a single 72-GPU NVSwitch fabric.

(Rack diagram: compute trays C1–C18, each 4× B200 + 2× Grace, interleaved with NVSwitch 4 trays S1–S9 carrying NVLink-Sharp. The whole rack is one NVLink 5 domain: 72 B200 + 36 Grace + 9 switch trays, 130 TB/s aggregate, 13.5 TB HBM3e.)

Compute

72× B200 → 13.5 TB HBM3e, ~720 PFLOPS FP8.

NVLink fabric

9 switch trays → 130 TB/s aggregate NVLink bandwidth, every GPU sees every other at full 1.8 TB/s.

What it replaces

Workloads that previously needed InfiniBand between racks now stay inside one NVLink domain.

Why this matters for TP/PP

Tensor parallelism scales until per-step all-reduce eats more time than compute. At 1.8 TB/s per GPU and 72-way fabric, TP=8 to TP=16 fits comfortably inside one NVL72; pipeline parallelism across 8–9 stages also stays in NVLink. No InfiniBand round trips at every layer.
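A quick estimate of per-layer all-reduce time inside the NVL72 shows why wider TP stays cheap: the (N−1)/N factor barely grows. The 900 GB/s figure assumes NVLink 5's per-direction rate is fully usable, which real collectives land below:

```python
# Per-layer ring all-reduce time on NVLink 5, assuming 900 GB/s usable per direction.
S, bw = 8192 * 4096 * 2, 900e9              # activation bytes (bf16), bytes/s

for n in (8, 16, 32, 72):
    t_us = 2 * (n - 1) / n * S / bw * 1e6
    print(f"TP={n}: {t_us:.0f} us/layer")   # nearly flat in n
```

Going from TP=8 to TP=72 grows the per-layer bill by barely 13%, so within one NVLink domain the TP width is set by memory and kernel efficiency, not by the fabric.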

09

NVLink-C2C — Coherent CPU-GPU Memory

Hopper introduced NVLink-C2C, a variant of NVLink optimised for chip-to-chip coherence. Unlike GPU-to-GPU NVLink (which is a packet-switched DMA fabric), C2C carries a cache-coherent protocol — in spirit and bandwidth, comparable to Intel UPI or CXL.cache, but at 900 GB/s.

The same C2C link shows up in three superchips: Grace-Hopper (GH200), Grace-Blackwell (GB200), and DGX Spark.

The DGX Spark uses NVLink-C2C between an Arm Grace-style CPU and a Blackwell GPU on a single substrate, delivering 128 GB unified memory at LPDDR5x speed (~273 GB/s). The C2C link itself is faster than the LPDDR can sustain — the link is no longer the bottleneck for CPU-side memory.

  • CPU code: cudaMallocManaged(&p, ...)
  • Unified Memory page table spans Grace LPDDR5x + Hopper HBM3
  • GPU kernel touches p[i] → coherent fetch over C2C
  • Hot pages migrate to HBM; cold pages stay in LPDDR5x
Why coherence is a big deal

On pre-Grace systems you staged data with cudaMemcpy (or let Unified Memory page-fault over PCIe), and a page was effectively owned by one side at a time. NVLink-C2C means the CPU and GPU read the same cache line directly. Pointers cross the boundary. KV-cache offload to LPDDR5x becomes a transparent vLLM feature instead of an explicit pinned-memory dance.

10

vs PCIe and InfiniBand

NVLink, PCIe, and InfiniBand do related things at very different scales. The only honest way to compare them is at the per-GPU effective bandwidth level, because that is what bottlenecks an all-reduce.

Link | Per-port BW | Per-GPU BW | Reach | Coherent?
PCIe 4 x16 | 32 GB/s | 32 GB/s | chassis | no
PCIe 5 x16 | 64 GB/s | 64 GB/s | chassis | no (CXL.cache adds it)
NVLink bridge (consumer) | 56 GB/s/link | 112 GB/s (3090) | 2 cards | no
NVLink 3 (A100) | 50 GB/s/link | 600 GB/s | chassis (HGX) | no
NVLink 4 (H100) | 50 GB/s/link | 900 GB/s | chassis (HGX) | no (C2C is coherent)
NVLink 5 (B200) | 100 GB/s/link | 1800 GB/s | rack (NVL72) | no (C2C is coherent)
NVLink-C2C | 900 GB/s | 900 GB/s | same package | yes
InfiniBand HDR (200 Gb) | ~25 GB/s | ~25 GB/s/port | building | no
InfiniBand NDR (400 Gb) | ~50 GB/s | ~50 GB/s/port | building | no
InfiniBand XDR (800 Gb) | ~100 GB/s | ~100 GB/s/port | building | no

(NVLink figures are bidirectional totals; PCIe and InfiniBand figures are the per-direction rates each is usually quoted at.)

GPUDirect RDMA

GPUDirect RDMA lets an InfiniBand NIC DMA straight into GPU HBM, bypassing the host CPU and PCIe bounce buffers. This is what makes inter-node TP tolerable. But the per-port InfiniBand bandwidth (50 GB/s NDR, 100 GB/s XDR) is still roughly 18–36× below NVLink 5's 1.8 TB/s. Cross-node TP slows down by roughly that ratio, which is why TP is kept inside an NVLink domain and PP/DP is what spans nodes.

The modern superpod stack

intra-rack: NVLink 5 + NVSwitch 4, 1.8 TB/s per GPU
inter-rack: InfiniBand NDR/XDR or RoCE v2, ~50–100 GB/s per port
storage: NVMe-oF over 100–400 GbE, parallel FS (Lustre/GPFS)
Rule of thumb

Run TP inside an NVLink domain (one HGX, or one NVL72). Run PP and DP across the InfiniBand fabric between racks. Plan around the slowest link in your collective chain — if any leg traverses IB, pick a parallelism strategy that keeps activation traffic off it.
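The rule of thumb reduces to a min() over the legs a collective traverses. A toy planner sketch; the bandwidths are the headline per-GPU and per-port figures used throughout, and the dictionary names are made up for illustration:

```python
# An all-reduce is paced by the slowest leg its traffic crosses.
def collective_floor(legs_gb_s):
    """Effective bandwidth (GB/s) of a collective spanning these legs."""
    return min(legs_gb_s.values())

tp_inside_nvl72 = {"NVLink 5": 1800}
tp_across_racks = {"NVLink 5": 1800, "InfiniBand NDR": 50}

print(collective_floor(tp_inside_nvl72))   # 1800: keep TP here
print(collective_floor(tp_across_racks))   # 50: why TP must not span IB
```

One IB leg drops the floor by 36×, regardless of how fast every other hop is, which is the whole argument for keeping activation traffic inside the NVLink domain.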

11

Interactive: Fabric Planner

Pick a GPU class, a node size, and the inter-GPU link. The planner estimates per-GPU NVLink bandwidth, the size of the coherent NVLink domain, the rough TP scaling factor, and the maximum coherent memory you can address.

Heuristics, not benchmarks

The TP scaling factor uses a simple log-N attenuation (each ring hop costs 1 − f) and per-link multipliers of 0.95 NVSwitch / 0.85 bridge / 0.70 PCIe 5 / 0.55 PCIe 4. Reality varies with model size, NCCL version, kernel overlap and PCIe topology — treat the percentage as a sanity check, not a commitment.