Google TPUs Series — Presentation 06

TPU v4 — OCS, SparseCore & Palomar

7 nm. 4096 chips per pod. 3D torus reconfigured by optical circuit switches. An on-die accelerator just for embedding lookups. The chip that trained PaLM, and arguably the first true ML supercomputer.

7 nm · 3D torus · Palomar OCS · SparseCore · CMEM · v4i (inference) · PaLM 540B
7 nm shrink · +CMEM · +SparseCore · +6 ICI links · 3D torus · Palomar OCS · 4096-chip pod
00

Topics We'll Cover

01

Why v3 Hit a Wall

By 2019 v3 was running every interesting workload at Google at scale — and the team had a list of complaints from the inside. The v4 design (deployed 2020, announced May 2021, paper June 2023) is a point-by-point response.

v3 pain points

  • 2D torus diameter. A 32×32 torus has diameter 32 hops. Latency for a pod-wide all-reduce on the 1024-chip pod was already painful.
  • Pod is rigid. 1024 chips, take it or leave it. No sub-pod slices. No way to allocate "5,000 chips" jobs.
  • One bad chip kills the pod. No mechanism to route around a faulty board.
  • Embedding lookups are slow. On the vector unit, scatter-gather over 100M-row embedding tables is bandwidth-bound and latency-bound.
  • HBM-only memory hierarchy. Activations spill direct to HBM; KV-cache-like patterns waste bandwidth.
  • Still on 16 nm. By 2020 a 7 nm shrink is two nodes overdue.

v4 architectural answer

  • 3D torus + OCS → reconfigurable, sliceable, lower diameter, fault-tolerant.
  • SparseCore → dedicated embedding accelerator on-die.
  • CMEM → large on-die SRAM cache between MXU and HBM.
  • 7 nm → 2× more transistors per mm².
  • v4i as a sibling chip → inference-only, drops the most expensive pieces.

The CACM 2023 paper opens with the statement that v4 is "approximately 2.1× the per-chip performance of v3 and 2.7× better perf/W". The pod-level numbers are much larger because the 4096-chip pod is 4× the 1024-chip v3 pod.

02

v4 Per-Chip Spec

Process: TSMC 7 nm
Die size: <400 mm²
TensorCores per chip: 2
MXUs per TensorCore: 4 (8 per chip total)
MXU dimensions: 128×128 bf16
Numeric formats: bf16 (multiply) + FP32 (accumulate); INT8 supported
Per-chip bf16 peak: 275 TFLOPS
Clock: ~1050 MHz
HBM: 32 GiB HBM2 at 1.2 TB/s
CMEM: ~128 MiB on-die (shared between cores)
SparseCore: one per chip; embedding-lookup accelerator
ICI links per chip: 6 (enables 3D torus)
Per-link ICI BW: ~50 GB/s
TDP: ~170 W
Cooling: liquid (cold-plate)
Pod size: 4,096 chips (3D torus; 64 cubes of 4×4×4 chips)
Pod aggregate: ~1.1 ExaFLOPS bf16

Note the per-chip compute uplift vs v3: 275 / 123 = 2.24×. That comes from a 2× MXU count (4 vs 2 per core), a clock bump (1.05 GHz vs ~940 MHz), and pipeline-utilisation improvements. Combined with the 4× pod scale-up, the pod is roughly 9× v3.
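
The quoted peak can be sanity-checked from the MXU count, array size, and clock; a hedged back-of-envelope sketch (the 2-FLOPs-per-MAC convention is the standard one, the exact clock is approximate):

```python
# Back-of-envelope check of the 275 TFLOPS figure: 8 MXUs of 128x128 MACs
# at ~1.05 GHz, counting 2 FLOPs (multiply + accumulate) per MAC per cycle.
mxus_per_chip = 2 * 4            # 2 TensorCores x 4 MXUs each
macs_per_cycle = 128 * 128       # one 128x128 systolic array
clock_hz = 1.05e9                # approximate clock
peak_flops = mxus_per_chip * macs_per_cycle * 2 * clock_hz
print(peak_flops / 1e12)         # ~275 TFLOPS
```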

03

CMEM — The On-Die Cache

v4 (and especially v4i) introduce CMEM: a large on-die SRAM, shared between the two TensorCores, sitting between the per-core VMEM and HBM. It is the first TPU memory level pitched as a cache rather than a pure software-managed scratchpad, though in practice XLA still places data in it explicitly.

The hierarchy on v4

  • Registers / accumulators — per-PE.
  • VMEM — ~32 MiB per TensorCore. Software-managed.
  • CMEM — ~128 MiB shared across the chip. XLA-managed residency.
  • HBM — 32 GiB at 1.2 TB/s.
  • Remote HBM — via ICI from a neighbouring chip (~50 GB/s per link).
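
The hierarchy above implies a roofline for each level; a minimal sketch of the arithmetic (ridge point = FLOPs a kernel must perform per byte fetched from that level to stay compute-bound):

```python
# Roofline ridge points for the v4 hierarchy: peak FLOPS divided by the
# bandwidth of each memory level. Numbers are the approximate ones quoted above.
peak_flops = 275e12
for level, bw in [("HBM", 1.2e12), ("remote HBM via 1 ICI link", 50e9)]:
    ridge = peak_flops / bw      # FLOPs per byte needed to stay compute-bound
    print(f"{level}: {ridge:.0f} FLOPs/byte")
```

The ~229 FLOPs/byte figure for HBM is why staging reused tiles in CMEM matters: most attention and embedding kernels fall well below it.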

What CMEM actually buys

  • KV cache hit rates. Long-context inference reuses the same KVs many times; CMEM keeps them on-die.
  • Activation re-use. For backprop, activations stored in CMEM avoid the HBM round-trip.
  • Fused attention. Q, K, V tiles staged in CMEM let the MXU run flash-attention-style fused kernels without spilling.
  • Effective HBM bandwidth boost. Each CMEM hit saves an HBM read; with good locality, effective bandwidth is ~2× the raw 1.2 TB/s.
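
The "~2×" claim follows from a simple hit-rate model; a toy sketch, where the 50% hit rate is an illustrative assumption rather than a published number:

```python
# Toy effective-bandwidth model: with CMEM hit rate h, only a (1 - h) fraction
# of reads reach HBM, so effective read bandwidth is hbm_bw / (1 - h).
def effective_bw(hbm_bw, hit_rate):
    return hbm_bw / (1.0 - hit_rate)

print(effective_bw(1.2e12, 0.5) / 1e12)   # 2.4 TB/s, i.e. ~2x the raw 1.2 TB/s
```
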

Why this is "v4i's contribution"

The v4i ISCA 2021 paper introduces CMEM ahead of the v4 paper. It's listed as one of "Ten Lessons" from v1–v3. The pattern is interesting: ML chips have moved from scratchpad-only (v1–v3) toward small explicit caches (v4 onward). The TPU is approaching the GPU memory model from below, while GPUs (with HBM-on-package and more programmable cache management) approach from above.

04

SparseCore — Embeddings in Hardware

The other v4 first. Recommendation models (DLRM, ranking models, ad CTR predictors) are not matmul-bound — they spend most of their time looking up entries in giant embedding tables (billions of rows, thousands of features, irregular access patterns) and then summing the results.

What the workload looks like

  1. For each input feature i, look up k rows of an embedding table E_i (each row is a bf16 vector of width 32–512).
  2. Apply a pooling reduction (sum, mean, max) over the k rows.
  3. Concatenate the pooled vectors across features.
  4. Feed the result into a dense MLP (which is matmul, fine for the MXU).

Steps 1–3 are scatter-gather + reduction, with random access patterns and small sub-vector reductions. The MXU is useless for them; the vector unit runs them but is bandwidth-bound. The CACM v4 paper says these workloads spent ~50% of training time in embedding ops on v3.
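
Steps 1–2 for a single table can be sketched in a few lines; names and sizes are illustrative, not a TPU API. This is exactly the gather-plus-small-reduction shape the MXU cannot help with:

```python
import numpy as np

# One embedding table E_i with 1000 rows of width 64 (toy sizes).
rng = np.random.default_rng(0)
table = rng.standard_normal((1000, 64), dtype=np.float32)

def lookup_and_pool(indices):
    gathered = table[indices]        # step 1: gather k rows per example
    return gathered.mean(axis=1)     # step 2: mean-pool over the k rows

ids = np.array([[3, 17, 42], [7, 7, 9]])   # batch of 2 examples, k = 3
pooled = lookup_and_pool(ids)              # step 3 would concat across tables
print(pooled.shape)                        # (2, 64)
```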

SparseCore's hardware

What's on the die

  • Dataflow accelerator with hardware gather, hardware scatter-reduce, and atomic accumulate.
  • Direct path to HBM separate from the MXU's path.
  • Optimised address-translation hardware for the giant table indices (often 64-bit IDs into 100M-row tables).
  • ~5% of die area, ~5% of die power.

What it delivers

  • 5–7× speedup on DLRM-class models vs running embeddings on the vector unit (Google's claim).
  • Frees the MXU and vector unit for the dense MLP, so total throughput improves further.
  • Matters for ad ranking, YouTube recommendations, and Search ranking — all fleet-dominant workloads.

Why this matters in 2026

Modern frontier MoE models (e.g. Mixture-of-Experts at scale) have a similar shape: each token is routed to a small number of experts, requiring scatter-gather over a giant set of expert weights. SparseCore was originally for recommendations but is increasingly load-bearing for MoE inference. Trillium ships with a 3rd-generation SparseCore; Ironwood inherits a refined version. It's permanent.

05

Six ICI Links — The 3D Torus

v2/v3 chips had 4 ICI links (2D torus). v4 has 6, enabling a 3D torus where each chip has neighbours in ±X, ±Y, ±Z.

Why 3D over 2D?

  • Diameter scales as N^(1/3), not N^(1/2). For 4096 chips, a 3D torus has diameter ~24 hops; a 2D torus would be ~64.
  • Bisection bandwidth. Higher in 3D; all-reduce times improve roughly proportionally.
  • Tensor-parallel friendly. A 3D shape lets you natively allocate 3D sharding for tensor / data / pipeline parallelism without packing tricks.
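
The diameter numbers above can be computed directly: in a wrap-around torus the worst-case hop count is the sum of floor(d/2) over the dimensions.

```python
# Torus diameter: each dimension wraps, so the farthest node in a dimension
# of size d is floor(d / 2) hops away; dimensions are traversed independently.
def torus_diameter(dims):
    return sum(d // 2 for d in dims)

print(torus_diameter((16, 16, 16)))  # 24 hops: 4096 chips as a 3D torus
print(torus_diameter((64, 64)))      # 64 hops: the same chip count in 2D
```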

Why not 4D / 5D / hypercube?

  • Cabling cost. Each extra dimension means more ICI links per chip and more cables per board / rack.
  • Diminishing returns. Going from 2D to 3D buys you a lot; 3D to 4D buys much less.
  • Datacenter physics. A datacenter is a 2D plane of racks, with vertical (rack) being short. 3D maps onto it cleanly; 4D doesn't.

Per-link bandwidth is ~50 GB/s in each direction, for an aggregate ICI bandwidth of ~300 GB/s per chip. Six links per chip is the same count as in v5p and Ironwood; the 3D torus has been the p-class topology from v4 onwards.

06

Palomar — The Optical Circuit Switch

The single most distinctive component of v4 is not on the TPU itself — it's in the rack between TPU sub-pods. Palomar is Google's bespoke 3D-MEMS optical circuit switch.

Each Palomar OCS is a 136-port (128 + 8 spares) reconfigurable optical switch using 3D-MEMS mirrors, with sub-millisecond reconfiguration. Forty-eight Palomar OCSes interconnect the 64 cubes of a 4,096-chip TPU v4 supercomputer. OCSes are <5% of system cost and <3% of system power vs an InfiniBand-based equivalent. — Jouppi et al., "TPU v4: An Optically Reconfigurable Supercomputer", ISCA 2023

How an OCS works at the bit level

1. Stay optical

Light from each ICI cable enters the switch as a beam. It is never converted back to electrical inside the switch.

2. MEMS mirror

An array of microelectromechanical mirrors steers each beam to any output port. The mirror is the size of a salt grain and tilts in microseconds.

3. Reconfigure

To repoint a beam, the controller tilts the mirror to a new angle; the same input port now connects a different pair of cubes. Reconfiguration completes in under a millisecond.
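
Logically, the three steps above reduce to maintaining a mutable one-to-one port mapping; a minimal model (class and method names are illustrative, not Google's software):

```python
# A circuit switch, abstractly: a remappable one-to-one input->output mapping.
# There is no packet inspection; light on an input port simply follows
# whatever mapping the mirrors currently encode.
class CircuitSwitch:
    def __init__(self):
        self.route = {}

    def connect(self, in_port, out_port):
        self.route[in_port] = out_port   # "tilt the mirror" for in_port

    def forward(self, in_port):
        return self.route[in_port]

ocs = CircuitSwitch()
ocs.connect("cube0", "cube7")
ocs.connect("cube0", "cube42")           # repoint: same input, new destination
print(ocs.forward("cube0"))              # cube42
```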

What this enables that an electrical switch cannot

Where else Google uses OCS

Google's "Mission Apollo" paper (arXiv:2208.10041, 2022) describes OCS rolled out across the rest of the Jupiter datacenter network. Palomar is thus one piece of an organisation-wide shift toward optical circuit switching as a scale-up fabric, not just a TPU technology; v4 is simply the chip that consumes it most aggressively.

07

A 4096-Chip Pod, Logically

[Diagram: v4 pod hierarchy. 64 cubes of 4×4×4 = 64 chips each (~17.6 PFLOPS bf16 and 2 TiB HBM per cube, internal 3D-torus links), interconnected by 48 Palomar OCSes (136-port 3D-MEMS optical switches, reconfigurable in <1 ms, routing any cube ↔ any cube, twisted-torus option). Pod aggregate: ~1.1 ExaFLOPS bf16, 128 TiB HBM; sub-pod slices reconfigured per job.]
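
The cube and pod aggregates follow directly from the per-chip spec; illustrative arithmetic only:

```python
# Pod-level aggregates implied by the per-chip numbers quoted earlier.
chips = 4096
pod_pflops = chips * 275e12 / 1e15    # ~1126 PFLOPS, i.e. ~1.1 ExaFLOPS bf16
pod_hbm_tib = chips * 32 / 1024       # 128 TiB of aggregate HBM
cube_pflops = 64 * 275e12 / 1e15      # ~17.6 PFLOPS per 4x4x4 cube
print(pod_pflops, pod_hbm_tib, cube_pflops)
```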

Allocation policy

08

v4i — The Inference Sibling

v4i (Jouppi et al., ISCA 2021, "Ten Lessons From Three Generations Shaped Google's TPUv4i") is the inference-tuned variant of v4. Same 7 nm process, similar uncore, but designed for the cost envelope of fleet inference.

Spec | v4 (training) | v4i (inference)
Process | 7 nm | 7 nm
TensorCores | 2 | 1
Per-chip bf16 | 275 TFLOPS | 138 TFLOPS
HBM | 32 GiB at 1.2 TB/s | 8 GiB HBM2 at ~614 GB/s
CMEM | ~128 MiB | ~128 MiB (introduced here first)
SparseCore | yes | yes
ICI links | 6 | 0 (single chip)
Pod | 4,096 chips, 3D torus + OCS | single chip / PCIe
TDP | ~170 W | ~175 W
Cooling | liquid | air
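
The v4i peak follows directly from dropping one of v4's two TensorCores; a trivial check:

```python
# Halving the TensorCore count halves the peak.
v4_peak_tflops = 275
v4i_peak_tflops = v4_peak_tflops / 2   # 137.5, quoted as 138 TFLOPS
print(v4i_peak_tflops)
```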

Design choices

v4i is the chip that powered the bulk of Google's user-facing inference (Search, YouTube, Ads, Assistant) from 2020 onward. By 2023, with Bard's launch, v4i was also the first inference-side LLM-serving TPU at user scale. It is the direct ancestor of v5e and Trillium — the e-class lineage.

09

PaLM — 6,144 Chips, 50 Days

The headline workload that v4 trained: PaLM (Pathways Language Model), 540B parameters, April 2022. PaLM is the first publicly-disclosed example of a frontier-scale LLM trained on TPUs, and the first job that used Google's Pathways single-controller orchestration system to span multiple TPU pods.

Parameters: 540 billion (dense, decoder-only Transformer)
Hardware: two v4 pods of 3,072 chips each, i.e. 6,144 chips, spanned across pods by Pathways
Training time: ~50 days wall-clock
Tokens trained: 780 billion
Hardware utilisation: ~46% Model FLOPs Utilisation (MFU), among the highest reported for an LLM-class run at the time
Orchestration: Pathways single-controller dataflow (Barham, Dean, Ghemawat et al., MLSys 2022)

The Pathways paper is the systems counterpart to the v4 hardware paper. It describes how a single Python program on a single client dispatches work to thousands of TPUs across multiple pods, with asynchronous compilation and dataflow graphs that span the cluster.

A milestone hidden in the citation

Hennessy and Patterson's CACM 2019 Turing lecture predicted that "domain-specific architectures" would deliver 50×+ improvements over general-purpose CPUs; PaLM is the existence proof at frontier-LLM scale. Training 540B parameters on 780B tokens is on the order of 10^24 bf16 FLOPs; that workload on a CPU cluster would be impossible.
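
The order-of-magnitude claim can be reproduced with the standard ~6ND rule of thumb for dense Transformer training FLOPs (a heuristic, not PaLM's own accounting):

```python
# ~6 * N * D training-FLOP estimate for a dense Transformer:
# N parameters, D tokens, ~6 FLOPs per parameter per token (fwd + bwd).
N, D = 540e9, 780e9
train_flops = 6 * N * D
print(f"{train_flops:.2e}")   # ~2.5e24 bf16 FLOPs
```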

10

Resiliency at Scale

USENIX NSDI 2024 has a Google paper, "Resiliency at Scale: Managing Google's TPUv4 ML Supercomputer", that's worth reading on its own as a systems-engineering document.

The problem

The v4 system's answer

The result is "five-nines" effective uptime on multi-week training jobs at 4096-chip scale: a number with no real precedent in scientific computing, where leadership-class supercomputers like Summit and Frontier checkpoint hourly.
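
A hedged sketch of why that machinery is unavoidable at this scale: even with an excellent per-chip MTBF (the figure below is an assumed illustrative number, not from the paper), a 4096-chip system sees failures constantly.

```python
# If chips fail independently, system MTBF is roughly per-chip MTBF / N.
chip_mtbf_hours = 1_000_000              # assumed, for illustration only
chips = 4096
system_mtbf_hours = chip_mtbf_hours / chips
print(system_mtbf_hours)                 # ~244 h: a failure every ~10 days
```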

11

v4 Numbers In One Page

Per chip

  • 275 TFLOPS bf16
  • 32 GiB HBM2
  • 1.2 TB/s HBM BW
  • ~128 MiB CMEM
  • 1 SparseCore
  • 6 ICI links @ 50 GB/s ea
  • ~170 W

Per pod

  • 4096 chips
  • ~1.1 ExaFLOPS bf16
  • 128 TiB aggregate HBM
  • 3D torus + 48 Palomar OCS
  • Reconfigurable slices: 64 to 4096 chips
  • Twisted-torus topology option

Pod environmental

  • Liquid cooled
  • Pod power ~700 kW
  • Pod footprint ~12–16 racks
  • Hot-spare cubes for fault recovery
  • OCS <5% of system cost, <3% of system power

The v4i companion

12

Cheat Sheet

Read next

Deck 07 covers the v5 split into e-class and p-class chips. Deck 10 — ICI & OCS takes Palomar apart in much more detail.