Google TPUs Series — Presentation 06

TPU v4 — OCS, SparseCore & Palomar

7 nm. 4096 chips per pod. 3D torus reconfigured by optical circuit switches. An on-die accelerator just for embedding lookups. The chip that trained PaLM, and arguably the first true ML supercomputer.

7 nm · 3D torus · Palomar OCS · SparseCore · CMEM · v4i (inference) · PaLM 540B
7 nm shrink · +CMEM · +SparseCore · +6 ICI links · 3D torus · Palomar OCS · 4096-chip pod
00

Topics We'll Cover

01

Why v3 Hit a Wall

By 2019 v3 was running every interesting workload at Google at scale — and the team had a list of complaints from the inside. The v4 design (deployed 2020, announced May 2021, paper June 2023) is a point-by-point response.

v3 pain points

  • 2D torus diameter. A 32×32 torus has diameter 32 hops. Latency for a pod-wide all-reduce on the 1024-chip pod was already painful.
  • Pod is rigid. 1024 chips, take it or leave it. No sub-pod slices. No way to allocate "5,000 chips" jobs.
  • One bad chip kills the pod. No mechanism to route around a faulty board.
  • Embedding lookups are slow. On the vector unit, scatter-gather over 100M-row embedding tables is bandwidth-bound and latency-bound.
  • HBM-only memory hierarchy. Activations spill direct to HBM; KV-cache-like patterns waste bandwidth.
  • Still on 16 nm. By 2020 a 7 nm shrink is two nodes overdue.

v4 architectural answer

  • 3D torus + OCS → reconfigurable, sliceable, lower diameter, fault-tolerant.
  • SparseCore → dedicated embedding accelerator on-die.
  • CMEM → large on-die SRAM cache between MXU and HBM.
  • 7 nm → 2× more transistors per mm².
  • v4i as a sibling chip → inference-only, drops the most expensive pieces.

The CACM 2023 paper opens with the statement that v4 is "approximately 2.1× the per-chip performance of v3 and 2.7× better perf/W". The pod-level numbers are much larger because the 4096-chip pod is 4× the 1024-chip v3 pod.

02

v4 Per-Chip Spec

Process: TSMC 7 nm
Die size: <400 mm²
TensorCores per chip: 2
MXUs per TensorCore: 4 (8 per chip total)
MXU dimensions: 128×128 bf16
Numeric formats: bf16 (multiply) + FP32 (accumulate); INT8 supported
Per-chip bf16 peak: 275 TFLOPS
Clock: ~1050 MHz
HBM: 32 GiB HBM2 at 1.2 TB/s
CMEM: ~128 MiB on-die (shared between cores)
SparseCore: one per chip; embedding-lookup accelerator
ICI links per chip: 6 (enables 3D torus)
Per-link ICI BW: ~50 GB/s
TDP: ~170 W
Cooling: liquid (cold-plate)
Pod size: 4,096 chips (3D torus; 64 cubes of 4×4×4 chips)
Pod aggregate: ~1.1 ExaFLOPS bf16

Note the per-chip compute uplift vs v3: 275 / 123 = 2.24×. That comes from a 2× MXU count (4 vs 2 per core), a clock bump (1.05 GHz vs ~940 MHz), and pipeline-utilisation improvements. Combined with the 4× pod scale-up, the pod is roughly 9× v3.
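
The quoted peak can be sanity-checked from the MXU count, array size, and clock; a hedged back-of-envelope sketch (the 2-FLOPs-per-MAC convention is the standard one, the exact clock is approximate):

```python
# Back-of-envelope check of the 275 TFLOPS figure: 8 MXUs of 128x128 MACs
# at ~1.05 GHz, counting 2 FLOPs (multiply + accumulate) per MAC per cycle.
mxus_per_chip = 2 * 4            # 2 TensorCores x 4 MXUs each
macs_per_cycle = 128 * 128       # one 128x128 systolic array
clock_hz = 1.05e9                # approximate clock
peak_flops = mxus_per_chip * macs_per_cycle * 2 * clock_hz
print(peak_flops / 1e12)         # ~275 TFLOPS
```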

03

CMEM — The On-Die Cache

v4 (and especially v4i) introduce CMEM: a large on-die SRAM, shared between the two TensorCores, sitting between the per-core VMEM and HBM. It is the first TPU memory level pitched as a cache rather than a pure software-managed scratchpad, though in practice XLA still places data in it explicitly.

The hierarchy on v4

  • Registers / accumulators — per-PE.
  • VMEM — ~32 MiB per TensorCore. Software-managed.
  • CMEM — ~128 MiB shared across the chip. XLA-managed residency.
  • HBM — 32 GiB at 1.2 TB/s.
  • Remote HBM — via ICI from a neighbouring chip (~50 GB/s per link).
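
The hierarchy above implies a roofline for each level; a minimal sketch of the arithmetic (ridge point = FLOPs a kernel must perform per byte fetched from that level to stay compute-bound):

```python
# Roofline ridge points for the v4 hierarchy: peak FLOPS divided by the
# bandwidth of each memory level. Numbers are the approximate ones quoted above.
peak_flops = 275e12
for level, bw in [("HBM", 1.2e12), ("remote HBM via 1 ICI link", 50e9)]:
    ridge = peak_flops / bw      # FLOPs per byte needed to stay compute-bound
    print(f"{level}: {ridge:.0f} FLOPs/byte")
```

The ~229 FLOPs/byte figure for HBM is why staging reused tiles in CMEM matters: most attention and embedding kernels fall well below it.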

What CMEM actually buys

  • KV cache hit rates. Long-context inference reuses the same KVs many times; CMEM keeps them on-die.
  • Activation re-use. For backprop, activations stored in CMEM avoid the HBM round-trip.
  • Fused attention. Q, K, V tiles staged in CMEM let the MXU run flash-attention-style fused kernels without spilling.
  • Effective HBM bandwidth boost. Each CMEM hit saves an HBM read; with good locality, effective bandwidth is ~2× the raw 1.2 TB/s.
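
The "~2×" claim follows from a simple hit-rate model; a toy sketch, where the 50% hit rate is an illustrative assumption rather than a published number:

```python
# Toy effective-bandwidth model: with CMEM hit rate h, only a (1 - h) fraction
# of reads reach HBM, so effective read bandwidth is hbm_bw / (1 - h).
def effective_bw(hbm_bw, hit_rate):
    return hbm_bw / (1.0 - hit_rate)

print(effective_bw(1.2e12, 0.5) / 1e12)   # 2.4 TB/s, i.e. ~2x the raw 1.2 TB/s
```
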

Why this is "v4i's contribution"

The v4i ISCA 2021 paper introduces CMEM ahead of the v4 paper. It's listed as one of "Ten Lessons" from v1–v3. The pattern is interesting: ML chips have moved from scratchpad-only (v1–v3) toward small explicit caches (v4 onward). The TPU is approaching the GPU memory model from below, while GPUs (with HBM-on-package and more programmable cache management) approach from above.

04

SparseCore — Embeddings in Hardware

The other v4 first. Recommendation models (DLRM, ranking models, ad CTR predictors) are not matmul-bound — they spend most of their time looking up entries in giant embedding tables (billions of rows, thousands of features, irregular access patterns) and then summing the results.

What the workload looks like

  1. For each input feature i, look up k rows of an embedding table E_i (each row is a bf16 vector of width 32–512).
  2. Apply a pooling reduction (sum, mean, max) over the k rows.
  3. Concatenate the pooled vectors across features.
  4. Feed the result into a dense MLP (which is matmul, fine for the MXU).

Steps 1–3 are scatter-gather + reduction, with random access patterns and small sub-vector reductions. The MXU is useless for them; the vector unit runs them but is bandwidth-bound. The CACM v4 paper says these workloads spent ~50% of training time in embedding ops on v3.
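
Steps 1–2 for a single table can be sketched in a few lines; names and sizes are illustrative, not a TPU API. This is exactly the gather-plus-small-reduction shape the MXU cannot help with:

```python
import numpy as np

# One embedding table E_i with 1000 rows of width 64 (toy sizes).
rng = np.random.default_rng(0)
table = rng.standard_normal((1000, 64), dtype=np.float32)

def lookup_and_pool(indices):
    gathered = table[indices]        # step 1: gather k rows per example
    return gathered.mean(axis=1)     # step 2: mean-pool over the k rows

ids = np.array([[3, 17, 42], [7, 7, 9]])   # batch of 2 examples, k = 3
pooled = lookup_and_pool(ids)              # step 3 would concat across tables
print(pooled.shape)                        # (2, 64)
```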

SparseCore's hardware

What's on the die

  • Dataflow accelerator with hardware gather, hardware scatter-reduce, and atomic accumulate.
  • Direct path to HBM separate from the MXU's path.
  • Optimised address-translation hardware for the giant table indices (often 64-bit IDs into 100M-row tables).
  • ~5% of die area, ~5% of die power.

What it delivers

  • 5–7× speedup on DLRM-class models vs running embeddings on the vector unit (Google's claim).
  • Frees the MXU and vector unit for the dense MLP, so total throughput improves further.
  • Matters for ad ranking, YouTube recommendations, and Search ranking — all fleet-dominant workloads.

Why this matters in 2026

Modern frontier MoE models (e.g. Mixture-of-Experts at scale) have a similar shape: each token is routed to a small number of experts, requiring scatter-gather over a giant set of expert weights. SparseCore was originally for recommendations but is increasingly load-bearing for MoE inference. Trillium ships with a 3rd-generation SparseCore; Ironwood inherits a refined version. It's permanent.

05

Six ICI Links — The 3D Torus

v2/v3 chips had 4 ICI links (2D torus). v4 has 6, enabling a 3D torus where each chip has neighbours in ±X, ±Y, ±Z.

Why 3D over 2D?

  • Diameter scales as N^(1/3), not N^(1/2). For 4096 chips, a 3D torus has diameter ~24 hops; a 2D torus would be ~64.
  • Bisection bandwidth. Higher in 3D; all-reduce times improve roughly proportionally.
  • Tensor-parallel friendly. A 3D shape lets you natively allocate 3D sharding for tensor / data / pipeline parallelism without packing tricks.
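
The diameter numbers above can be computed directly: in a wrap-around torus the worst-case hop count is the sum of floor(d/2) over the dimensions.

```python
# Torus diameter: each dimension wraps, so the farthest node in a dimension
# of size d is floor(d / 2) hops away; dimensions are traversed independently.
def torus_diameter(dims):
    return sum(d // 2 for d in dims)

print(torus_diameter((16, 16, 16)))  # 24 hops: 4096 chips as a 3D torus
print(torus_diameter((64, 64)))      # 64 hops: the same chip count in 2D
```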

Why not 4D / 5D / hypercube?

  • Cabling cost. Each extra dimension means more ICI links per chip and more cables per board / rack.
  • Diminishing returns. Going from 2D to 3D buys you a lot; 3D to 4D buys much less.
  • Datacenter physics. A datacenter is a 2D plane of racks, with vertical (rack) being short. 3D maps onto it cleanly; 4D doesn't.

Per-link bandwidth is ~50 GB/s in each direction, for an aggregate ICI bandwidth of ~300 GB/s per chip. Six links per chip is the same count as in v5p and Ironwood; the 3D torus has been the p-class topology from v4 onwards.

06

Palomar — The Optical Circuit Switch

The single most distinctive component of v4 is not on the TPU itself — it's in the rack between TPU sub-pods. Palomar is Google's bespoke 3D-MEMS optical circuit switch.

Each Palomar OCS is a 136-port (128 + 8 spares) reconfigurable optical switch using 3D-MEMS mirrors, with sub-millisecond reconfiguration. Forty-eight Palomar OCSes interconnect the 64 cubes of a 4,096-chip TPU v4 supercomputer. OCSes are <5% of system cost and <3% of system power vs an InfiniBand-based equivalent. — Jouppi et al., "TPU v4: An Optically Reconfigurable Supercomputer", ISCA 2023

How an OCS works at the bit level

1. Stay optical

Light from each ICI cable enters the switch as a beam. It is never converted back to electrical inside the switch.

2. MEMS mirror

An array of microelectromechanical mirrors steers each beam to any output port. The mirror is the size of a salt grain and tilts in microseconds.

3. Reconfigure

To repoint a beam, the controller tilts the mirror to a new angle; the same input port now connects a different pair of cubes. Reconfiguration completes in under a millisecond.
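
Logically, the three steps above reduce to maintaining a mutable one-to-one port mapping; a minimal model (class and method names are illustrative, not Google's software):

```python
# A circuit switch, abstractly: a remappable one-to-one input->output mapping.
# There is no packet inspection; light on an input port simply follows
# whatever mapping the mirrors currently encode.
class CircuitSwitch:
    def __init__(self):
        self.route = {}

    def connect(self, in_port, out_port):
        self.route[in_port] = out_port   # "tilt the mirror" for in_port

    def forward(self, in_port):
        return self.route[in_port]

ocs = CircuitSwitch()
ocs.connect("cube0", "cube7")
ocs.connect("cube0", "cube42")           # repoint: same input, new destination
print(ocs.forward("cube0"))              # cube42
```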

What this enables that an electrical switch cannot

Where else Google uses OCS

Google's "Mission Apollo" paper (arXiv:2208.10041, 2022) describes OCS rolled out across the rest of the Jupiter datacenter network. Palomar is thus one piece of an organisation-wide shift toward optical circuit switching as a scale-up fabric, not just a TPU technology; v4 is simply the chip that consumes it most aggressively.

07

A 4096-Chip Pod, Logically

[Diagram: v4 pod hierarchy. 64 cubes of 4×4×4 = 64 chips each (~17.6 PFLOPS bf16 and 2 TiB HBM per cube, internal 3D-torus links), interconnected by 48 Palomar OCSes (136-port 3D-MEMS optical switches, reconfigurable in <1 ms, routing any cube ↔ any cube, twisted-torus option). Pod aggregate: ~1.1 ExaFLOPS bf16, 128 TiB HBM; sub-pod slices reconfigured per job.]
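
The cube and pod aggregates follow directly from the per-chip spec; illustrative arithmetic only:

```python
# Pod-level aggregates implied by the per-chip numbers quoted earlier.
chips = 4096
pod_pflops = chips * 275e12 / 1e15    # ~1126 PFLOPS, i.e. ~1.1 ExaFLOPS bf16
pod_hbm_tib = chips * 32 / 1024       # 128 TiB of aggregate HBM
cube_pflops = 64 * 275e12 / 1e15      # ~17.6 PFLOPS per 4x4x4 cube
print(pod_pflops, pod_hbm_tib, cube_pflops)
```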

Allocation policy

08

v4i — The Inference Sibling

v4i (Jouppi et al., ISCA 2021, "Ten Lessons From Three Generations Shaped Google's TPUv4i") is the inference-tuned variant of v4. Same 7 nm process, similar uncore, but designed for the cost envelope of fleet inference.

Spec | v4 (training) | v4i (inference)
Process | 7 nm | 7 nm
TensorCores | 2 | 1
Per-chip bf16 | 275 TFLOPS | 138 TFLOPS
HBM | 32 GiB at 1.2 TB/s | 8 GiB HBM2 at ~614 GB/s
CMEM | ~128 MiB | ~128 MiB (introduced here first)
SparseCore | yes | yes
ICI links | 6 | 0 (single chip)
Pod | 4,096 chips, 3D torus + OCS | single chip / PCIe
TDP | ~170 W | ~175 W
Cooling | liquid | air
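
The v4i peak follows directly from dropping one of v4's two TensorCores; a trivial check:

```python
# Halving the TensorCore count halves the peak.
v4_peak_tflops = 275
v4i_peak_tflops = v4_peak_tflops / 2   # 137.5, quoted as 138 TFLOPS
print(v4i_peak_tflops)
```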

Design choices

v4i is the chip that powered the bulk of Google's user-facing inference (Search, YouTube, Ads, Assistant) from 2020 onward. By 2023, with Bard's launch, v4i was also the first inference-side LLM-serving TPU at user scale. It is the direct ancestor of v5e and Trillium — the e-class lineage.

09

PaLM — 6,144 Chips, 50 Days

The headline workload that v4 trained: PaLM (Pathways Language Model), 540B parameters, April 2022. PaLM is the first publicly-disclosed example of a frontier-scale LLM trained on TPUs, and the first job that used Google's Pathways single-controller orchestration system to span multiple TPU pods.

Parameters: 540 billion (dense, decoder-only Transformer)
Hardware: two v4 pods of 3,072 chips each, i.e. 6,144 chips, spanned across pods by Pathways
Training time: ~50 days wall-clock
Tokens trained: 780 billion
Hardware utilisation: ~46% Model FLOPs Utilisation (MFU), among the highest reported for an LLM-class run at the time
Orchestration: Pathways single-controller dataflow (Barham, Dean, Ghemawat et al., MLSys 2022)

The Pathways paper is the systems counterpart to the v4 hardware paper. It describes how a single Python program on a single client dispatches work to thousands of TPUs across multiple pods, with asynchronous compilation and dataflow graphs that span the cluster.

A milestone hidden in the citation

Hennessy and Patterson's CACM 2019 Turing lecture predicted that "domain-specific architectures" would deliver 50×+ improvements over general-purpose CPUs; PaLM is the existence proof at frontier-LLM scale. Training 540B parameters on 780B tokens is on the order of 10^24 bf16 FLOPs; that workload on a CPU cluster would be impossible.
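
The order-of-magnitude claim can be reproduced with the standard ~6ND rule of thumb for dense Transformer training FLOPs (a heuristic, not PaLM's own accounting):

```python
# ~6 * N * D training-FLOP estimate for a dense Transformer:
# N parameters, D tokens, ~6 FLOPs per parameter per token (fwd + bwd).
N, D = 540e9, 780e9
train_flops = 6 * N * D
print(f"{train_flops:.2e}")   # ~2.5e24 bf16 FLOPs
```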

10

Resiliency at Scale

USENIX NSDI 2024 has a Google paper, "Resiliency at Scale: Managing Google's TPUv4 ML Supercomputer", that's worth reading on its own as a systems-engineering document.

The problem

The v4 system's answer

The result is "five-nines" effective uptime on multi-week training jobs at 4096-chip scale: a number with no real precedent in scientific computing, where leadership-class supercomputers like Summit and Frontier checkpoint hourly.
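
A hedged sketch of why that machinery is unavoidable at this scale: even with an excellent per-chip MTBF (the figure below is an assumed illustrative number, not from the paper), a 4096-chip system sees failures constantly.

```python
# If chips fail independently, system MTBF is roughly per-chip MTBF / N.
chip_mtbf_hours = 1_000_000              # assumed, for illustration only
chips = 4096
system_mtbf_hours = chip_mtbf_hours / chips
print(system_mtbf_hours)                 # ~244 h: a failure every ~10 days
```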

11

v4 Numbers In One Page

Per chip

  • 275 TFLOPS bf16
  • 32 GiB HBM2
  • 1.2 TB/s HBM BW
  • ~128 MiB CMEM
  • 1 SparseCore
  • 6 ICI links @ 50 GB/s ea
  • ~170 W

Per pod

  • 4096 chips
  • ~1.1 ExaFLOPS bf16
  • 128 TiB aggregate HBM
  • 3D torus + 48 Palomar OCS
  • Reconfigurable slices: 64 to 4096 chips
  • Twisted-torus topology option

Pod environmental

  • Liquid cooled
  • Pod power ~700 kW
  • Pod footprint ~12–16 racks
  • Hot-spare cubes for fault recovery
  • OCS <5% of system cost, <3% of system power

The v4i companion

12

Cheat Sheet

Read next

Deck 07 covers the v5 split into e-class and p-class chips. Deck 10 — ICI & OCS takes Palomar apart in much more detail.