Tensor parallelism only works because GPUs can talk faster to each other than to host memory. Trace NVLink from Pascal's first 40 GB/s links through the GB200 NVL72's rack-scale 1.8 TB/s domains, plus how NVSwitch turned point-to-point links into a fabric.
Why scale-up exists, how NVIDIA built it generation by generation, and what the rack-scale NVL72 actually is — framed in numbers a tensor-parallel run actually cares about.
Tensor parallelism shards every weight matrix across N GPUs. Each transformer layer ends with an all-reduce over the activations so every shard sees the full output. Pipeline parallelism passes activations between layers; data parallelism syncs gradients. All three regimes ultimately turn into demand for collective bandwidth on the interconnect.
The 2× factor in the heuristic below comes from the ring all-reduce algorithm: it does a reduce-scatter (one full pass around the ring) then an all-gather (a second pass), so each GPU sends and receives 2·(N−1)/N times the size of the tensor being reduced — approaching 2× as N grows (1.75× at N=8).
traffic per layer ≈ hidden_dim × seq_len × bytes × 2 across the ring. The (N−1)/N factor approaches 1 quickly, so the cost is set by the activation tensor size, not by the number of GPUs.
One layer's activation tensor is 8192 × 4096 × 2 = 64 MiB per replica. The ring all-reduce moves ~2× that around: ~128 MiB per layer per step across the ring. Llama-3 70B has 80 layers, so a single forward pass during prefill exchanges ~10 GiB of activation traffic over the interconnect. Multiply by batch and you can saturate any link the host CPU can offer.
(The "~512 MB per layer across all GPUs" framing in the brief counts both directions across all N−1 ring hops at once — the per-GPU half-duplex bill is what matters for sizing the link, and that is the ~128 MiB number above for an 8-way split.)
Empirically, NVLink delivers near-linear TP scaling up to 8 GPUs on a single NVSwitch. PCIe 4 x16 already collapses at TP=2 on a 70B model: the all-reduce stalls long enough that GPUs idle on activations.
Every NVLink generation roughly doubles per-GPU bandwidth. Every NVSwitch generation extends the domain over which that bandwidth is uniform. Together they push the wall at which TP stops scaling further out, generation by generation.
The P100 introduced NVLink: NVIDIA's answer to "PCIe 3.0 is far too slow to gang GPUs". Each P100 has 4 NVLink 1.0 links, each carrying 40 GB/s bidirectional (20 GB/s per direction) — aggregate 160 GB/s/GPU. That is 5× PCIe 3.0 x16 (32 GB/s bidirectional).
DGX-1 packs 8 P100s. With only 4 NVLink ports per GPU, a fully-connected mesh is impossible. NVIDIA chose a hybrid cube-mesh: each GPU has 4 direct neighbours over NVLink, and reaches the other 3 GPUs in two hops. Collectives have to be choreographed to avoid contended links.
4 NVLink 1.0 ports per P100, hybrid cube-mesh in DGX-1 → 160 GB/s aggregate per GPU.
NVLink also appeared on IBM POWER systems — the only time an NVLink endpoint sat on a non-NVIDIA processor before the C2C era. POWER8 "Minsky" nodes carried NVLink 1 between POWER8 cores and P100 GPUs; the later POWER9 AC922 pushed coherent NVLink 2 between POWER9 cores and V100 GPUs.
Software view: NCCL discovered the NVLink topology (the same view `nvidia-smi topo -m` prints) and chose ring orderings that hugged the direct edges. Even at 160 GB/s the wall was visible — an 8-way ring all-reduce on P100 was already a measurable fraction of step time even on small models.
V100 widened NVLink: 6 NVLink 2.0 links per GPU, each 50 GB/s bidirectional (25 GB/s per direction) — aggregate 300 GB/s/GPU, almost double the P100. The lane rate climbed to 25 Gb/s with the same NRZ signalling.
DGX-1V (V100 edition) used a similar hybrid mesh. With 6 ports per GPU, a fully-connected 8-way mesh is still impossible (would need 7 ports), so two-hop paths remained for some GPU pairs. NVIDIA's answer was to build a switch.
DGX-2 (2018) introduced NVSwitch 1: an 18-port crossbar at NVLink 2.0 speeds. A DGX-2 has 12 NVSwitch chips wired so that all 16 V100 GPUs see each other at full link bandwidth simultaneously. Per-GPU effective NVLink bandwidth into the switch fabric: 300 GB/s (each V100 contributes its 6 links at 50 GB/s bidirectional = 300 GB/s full-duplex into the fabric).
NVSwitch turned NVLink from a set of point-to-point cables into a fabric. Any GPU pair gets one switch hop at full bandwidth, regardless of their position. NCCL no longer has to map its collective onto a fragile mesh; it can assume uniform any-to-any bandwidth.
This is also when NVLink stopped being a "two GPUs cooperating" feature and started being the basis of tensor parallelism at scale. Megatron-LM's launch on 16-GPU DGX-2s would not have been possible on a mesh.
A100 again roughly doubled it: 12 third-generation NVLinks per GPU at 50 GB/s bidirectional each (25 GB/s per direction) — aggregate 600 GB/s/GPU. The lane rate doubled to 50 Gb/s, still NRZ (50 GBaud), with much tighter equalisation than NVLink 2.
The HGX A100 baseboard hosts 8 A100s wired through 6 NVSwitch 2 chips. Every A100 reaches every other A100 over the switch fabric at the full 600 GB/s aggregate. Aggregate NVLink bandwidth across the baseboard: 8 × 600 GB/s = 4.8 TB/s (bisection roughly half that).
An A100's HBM2e delivers 1.55–2.0 TB/s depending on SKU. NVLink 3 lands at ~30% of HBM bandwidth, which is good enough that an all-reduce overlapping a forward pass barely shows in profiles. That ~30%-of-HBM ratio is roughly what NVLink has held every generation — below it, TP starts to bite.
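A quick spot-check of that heuristic against public peak figures. The HBM numbers are my assumed datasheet peaks for the usual SKUs, so treat the ratios as rough rather than exact:

```cpp
// Spot-check of the ~30%-of-HBM heuristic across generations.
// NVLink figures are per-GPU bidirectional aggregates; HBM figures are
// assumed datasheet peaks for common SKUs.
#include <cstdio>

int main() {
    struct Gen { const char* gpu; double nvlink_gbs, hbm_gbs; };
    const Gen gens[] = {
        {"V100", 300,   900},    // NVLink 2 vs HBM2
        {"A100", 600,  2000},    // NVLink 3 vs HBM2e (80 GB SKU)
        {"H100", 900,  3350},    // NVLink 4 vs HBM3
        {"B200", 1800, 8000},    // NVLink 5 vs HBM3e
    };
    for (const Gen& g : gens)
        printf("%s: %4.0f / %4.0f GB/s = %4.1f%% of HBM\n",
               g.gpu, g.nvlink_gbs, g.hbm_gbs, 100 * g.nvlink_gbs / g.hbm_gbs);
    return 0;
}
```

The ratio hovers near 30% and drifts down slightly at Blackwell, which is consistent with the "target" framing above.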
H100 brought 18 fourth-generation NVLinks at 50 GB/s bidirectional each — aggregate 900 GB/s/GPU. The per-link rate stayed at 50 GB/s; NVIDIA scaled out by adding more links rather than faster ones. Signalling moved to PAM4 at 100 Gb/s per lane (50 GBaud × 2 bits/symbol).
HGX H100 keeps the 8-GPU baseboard but with 4 NVSwitch 3 chips. Every H100 reaches every other at the full 900 GB/s. Aggregate NVLink bandwidth: 8 × 900 GB/s = 7.2 TB/s on a single HGX baseboard (bisection roughly half that).
The big architectural surprise was NVLink-C2C ("chip-to-chip"): a tightened, lower-overhead variant used to connect a Grace CPU to a Hopper GPU on the GH200 superchip. 900 GB/s coherent, packaged on a single substrate. Cache-coherent in a way the inter-GPU NVLink had never been.
This is the moment NVLink stops being just "GPU-to-GPU" and becomes the in-package coherent fabric of NVIDIA's superchips. The Grace CPU's LPDDR5x acts as a slow second tier of GPU memory, addressable transparently from CUDA.
Hopper's FP8 throughput on tensor cores roughly tripled compute. If NVLink had stayed at 600 GB/s, the all-reduce-to-compute ratio would have collapsed and TP=8 would have stopped scaling. NVLink 4 was sized to keep that ratio steady.
B200 doubles per-link bandwidth: 18 fifth-generation NVLinks at 100 GB/s bidirectional each (50 GB/s per direction) — aggregate 1.8 TB/s/GPU. Signalling: 200 Gb/s per lane (PAM4, ~100 GBaud), the same SerDes generation found in 800G Ethernet.
NVSwitch 4 keeps the radix model and extends NVLink SHARP in-network reductions (first introduced with NVSwitch 3). The switch computes the partial sums of an all-reduce at the fabric level, halving the data each GPU has to receive. This is most of how the NVL72 hits its claimed all-reduce speedups versus an HGX H100 baseline.
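A two-line comparison makes the saving concrete. The SHARP model here is the simplified "send your data once, receive the finished sum once" view, not the exact protocol:

```cpp
// Per-GPU wire bytes for one 64 MiB all-reduce: ring versus in-network
// (SHARP-style) reduction, where the switch sums partials itself.
#include <cstdio>

int main() {
    const double tensor_mib = 64;                     // one layer, from earlier
    const int n = 8;
    double ring  = 2.0 * (n - 1) / n * tensor_mib;    // reduce-scatter + all-gather
    double sharp = tensor_mib;                        // one send, one receive
    printf("ring : %.0f MiB per GPU\n", ring);        // 112 MiB
    printf("sharp: %.0f MiB per GPU\n", sharp);       // 64 MiB, roughly half
    return 0;
}
```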
For the first time, NVLink is designed to span more than one box at full bandwidth. The NVL72 (next slide) is the first product to show what that means in practice.
NVLink 1 160 GB/s, NVLink 2 300 GB/s, NVLink 3 600 GB/s, NVLink 4 900 GB/s, NVLink 5 1,800 GB/s — all per-GPU bidirectional aggregates. Roughly 1.5–2× per generation. That growth has broadly matched compute growth across the same span — a deliberate co-design choice.
NVSwitch is what turns a handful of NVLink ports per GPU into a uniform fabric instead of a sparse mesh. The difference matters most when more than two GPUs are doing a collective at once.
4-GPU example: 6 direct edges. But if GPU 0 and GPU 3 are not directly wired, their traffic shares a link through GPU 1 or GPU 2. With three concurrent collectives, every link is contended.
Every GPU has one or more NVLinks into the switch. Any GPU pair sees one switch hop. Three concurrent flows do not contend as long as each endpoint's link rate isn't exceeded.
An N-way ring all-reduce moves 2(N−1)/N times the data being reduced through each GPU. On a mesh with 2-hop pairs, that traffic shares physical links. NCCL has to pick a ring that hugs direct edges, but in the worst case some links carry double traffic. NVSwitch makes every ring choice equivalent.
This is also why NVLink SHARP is so important on Blackwell — with the switch already in the path, having it perform the partial reduction itself saves one full pass through the fabric.
The GB200 NVL72 is the first product where an NVLink domain spans an entire rack. 72 B200 GPUs in 18 compute trays (4 GPUs + 2 Grace CPUs each), interleaved with 9 NVLink Switch trays, all connected by NVLink 5 in one logical domain. The whole rack behaves to NCCL like a single 72-GPU NVSwitch fabric.
72× B200 → 13.5 TB HBM3e, ~720 PFLOPS FP8 dense.
9 switch trays → 130 TB/s aggregate NVLink bandwidth, every GPU sees every other at full 1.8 TB/s.
Workloads that previously needed InfiniBand between racks now stay inside one NVLink domain.
Tensor parallelism scales until per-step all-reduce eats more time than compute. At 1.8 TB/s per GPU and 72-way fabric, TP=8 to TP=16 fits comfortably inside one NVL72; pipeline parallelism across 8–9 stages also stays in NVLink. No InfiniBand round trips at every layer.
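To see why staying inside the NVLink domain matters, here is a floor estimate for the per-layer all-reduce from the earlier example on different links. The bandwidths are per-direction figures assumed from the numbers in this piece; latency and compute overlap are ignored, so real times will be worse, but the ordering holds:

```cpp
// Lower-bound time for the ~112 MiB per-GPU, per-layer ring all-reduce
// from the earlier example, at various per-direction link bandwidths.
#include <cstdio>

int main() {
    const double per_gpu_bytes = 2.0 * 7 / 8 * 8192 * 4096 * 2;   // ~112 MiB
    struct Link { const char* name; double gbs; };                // GB/s, one direction
    const Link links[] = {
        {"PCIe 4.0 x16",      32},
        {"PCIe 5.0 x16",      64},
        {"NVLink 4 (H100)",  450},
        {"NVLink 5 (B200)",  900},
    };
    for (const Link& l : links)
        printf("%-16s %6.2f ms/layer, %6.1f ms over 80 layers\n",
               l.name, 1e3 * per_gpu_bytes / (l.gbs * 1e9),
               1e3 * 80 * per_gpu_bytes / (l.gbs * 1e9));
    return 0;
}
```

On PCIe 4.0 the all-reduce floor is several milliseconds per layer, hundreds of milliseconds per forward pass; on NVLink 5 it is around a tenth of a millisecond per layer, which is why it can hide behind compute.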
Hopper introduced NVLink-C2C ("chip-to-chip"), a variant of NVLink optimised for coherent chip-to-chip traffic. Unlike GPU-to-GPU NVLink (which is a packet-switched DMA fabric), C2C carries a cache-coherent protocol — in spirit comparable to Intel UPI or CXL.cache, but at 900 GB/s rather than the tens of GB/s those deliver.
The DGX Spark uses NVLink-C2C between an Arm Grace-style CPU and a Blackwell GPU on a single substrate, delivering 128 GB unified memory at LPDDR5x speed (~273 GB/s). The C2C link itself is faster than the LPDDR can sustain — the link is no longer the bottleneck for CPU-side memory.
`cudaMallocManaged(&p, ...)` followed by a plain dereference of `p[i]` becomes a coherent fetch over C2C. On pre-Grace systems you copied with `cudaMemcpy`, and the page wasn't visible to the CPU until you copied it back. NVLink-C2C means the CPU and GPU read the same cache line directly. Pointers cross the boundary. KV-cache offload to LPDDR5x becomes a transparent vLLM feature instead of an explicit pinned-memory dance.
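A minimal sketch of what that looks like in code. This runs on any Unified Memory system; the point is that on a Grace superchip the dereferences ride C2C coherently instead of triggering page-fault migration:

```cpp
// One managed allocation, touched by both processors through one pointer.
// Compile with nvcc as a .cu file.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* p, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= s;    // on Grace parts, may touch LPDDR5x-homed lines over C2C
}

int main() {
    const int n = 1 << 20;
    float* p = nullptr;
    cudaMallocManaged(&p, n * sizeof(float));   // one pointer, CPU and GPU
    for (int i = 0; i < n; ++i) p[i] = 1.0f;    // CPU writes directly
    scale<<<(n + 255) / 256, 256>>>(p, n, 2.0f);
    cudaDeviceSynchronize();
    printf("p[0] = %f\n", p[0]);                // CPU reads the GPU's result, no memcpy
    cudaFree(p);
    return 0;
}
```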
NVLink, PCIe, and InfiniBand do related things at very different scales. The only honest way to compare them is at the per-GPU effective bandwidth level, because that is what bottlenecks an all-reduce.
| Link | Per-link BW (bidir) | Per-GPU BW (bidir) | Reach | Coherent? |
|---|---|---|---|---|
| PCIe 4.0 x16 | ~64 GB/s | 64 GB/s | chassis | no |
| PCIe 5.0 x16 | ~128 GB/s | 128 GB/s | chassis | no (CXL.cache adds it) |
| NVLink bridge (consumer) | 56 GB/s | ~112 GB/s (3090) | 2 cards | no |
| NVLink 3 (A100) | 50 GB/s | 600 GB/s | chassis (HGX) | no |
| NVLink 4 (H100) | 50 GB/s | 900 GB/s | chassis (HGX) | no (C2C is coherent) |
| NVLink 5 (B200) | 100 GB/s | 1,800 GB/s | rack (NVL72) | no (C2C is coherent) |
| NVLink-C2C | 900 GB/s | 900 GB/s | same package | yes |
| InfiniBand HDR (200 Gb/s) | ~50 GB/s | ~50 GB/s/port | building | no |
| InfiniBand NDR (400 Gb/s) | ~100 GB/s | ~100 GB/s/port | building | no |
| InfiniBand XDR (800 Gb/s) | ~200 GB/s | ~200 GB/s/port | building | no |

(All figures are bidirectional so the columns compare like for like; halve them for per-direction numbers.)
GPUDirect RDMA lets an InfiniBand NIC DMA straight into GPU HBM, bypassing the host CPU and PCIe bounce buffers. This is what makes inter-node TP tolerable at all. But per-port InfiniBand bandwidth (~100 GB/s NDR, ~200 GB/s XDR, bidirectional) is still ~9–18× less than NVLink 5's 1.8 TB/s. Cross-node TP slows down by roughly that ratio — which is why TP is kept inside an NVLink domain and PP/DP is what spans nodes.
Run TP inside an NVLink domain (one HGX, or one NVL72). Run PP and DP across the InfiniBand fabric between racks. Plan around the slowest link in your collective chain — if any leg traverses IB, pick a parallelism strategy that keeps activation traffic off it.
Pick a GPU class, a node size, and the inter-GPU link. The planner estimates per-GPU NVLink bandwidth, the size of the coherent NVLink domain, the rough TP scaling factor, and the maximum coherent memory you can address.
The TP scaling factor uses a simple log₂N attenuation (each doubling of the TP degree costs a fixed link-quality factor) with per-link multipliers of 0.95 NVSwitch / 0.85 bridge / 0.70 PCIe 5 / 0.55 PCIe 4 — see the sketch below. Reality varies with model size, NCCL version, kernel overlap and PCIe topology — treat the percentage as a sanity check, not a commitment.
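For concreteness, here is one reading of that heuristic as code. The multipliers come from the text above, but the exact attenuation shape is my assumption, so treat this as a sketch of the planner's model rather than its implementation:

```cpp
// TP scaling estimate: ideal N-way speedup attenuated once per doubling
// of the TP degree by a per-link quality multiplier.
#include <cmath>
#include <cstdio>

double tp_speedup(int n_gpus, double link_mult) {
    double doublings = std::log2((double)n_gpus);
    return n_gpus * std::pow(link_mult, doublings);   // ideal N × attenuation
}

int main() {
    struct Link { const char* name; double mult; };
    const Link links[] = {
        {"NVSwitch",      0.95},
        {"NVLink bridge", 0.85},
        {"PCIe 5",        0.70},
        {"PCIe 4",        0.55},
    };
    const int degrees[] = {2, 4, 8};
    for (const Link& l : links)
        for (int n : degrees)
            printf("%-14s TP=%d -> %.1fx of 1-GPU throughput\n",
                   l.name, n, tp_speedup(n, l.mult));
    return 0;
}
```

Note how the model reproduces the earlier claims: NVSwitch at TP=8 lands near 6.9× (near-linear), while PCIe 4 at TP=2 is already only ~1.1×, i.e. the collapse described above.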