7 nm. 4096 chips per pod. 3D torus reconfigured by optical circuit switches. An on-die accelerator just for embedding lookups. The chip that trained PaLM, the first true ML supercomputer.
By 2019 v3 was running every interesting workload at Google at scale — and the team had a list of complaints from the inside. The v4 design (deployed 2020, announced May 2021, paper June 2023) is a point-by-point response.
v3 pain points
2D torus diameter. A 32×32 torus has a diameter of 32 hops. Latency for a pod-wide all-reduce on a 1024-chip pod was already painful.
Pod is rigid. 1024 chips, take it or leave it. No sub-pod slices, and no way to hand a job an arbitrary allocation like "5,000 chips".
One bad chip kills the pod. No mechanism to route around a faulty board.
Embedding lookups are slow. On the vector unit, scatter-gather over 100M-row embedding tables is bandwidth-bound and latency-bound.
HBM-only memory hierarchy. Activations spill directly to HBM; KV-cache-like access patterns waste bandwidth.
Still on 16 nm. By 2020 a 7 nm shrink is two nodes overdue.
v4 architectural answer
3D torus + OCS → reconfigurable, sliceable, lower diameter, fault-tolerant.
CMEM → large on-die SRAM cache between MXU and HBM.
7 nm → 2× more transistors per mm².
v4i as a sibling chip → inference-only, drops the most expensive pieces.
The ISCA 2023 paper opens with the statement that v4 is "approximately 2.1× the per-chip performance of v3 and 2.7× better perf/W". The pod-level numbers are much larger because the 4096-chip pod is 4× the size of the 1024-chip v3 pod.
Note the per-chip compute uplift vs v3: 275 / 123 = 2.24×. That comes from a 2× MXU count (4 vs 2 per core), a clock bump (1.05 GHz vs ~940 MHz), and pipeline-utilisation improvements. Combined with the 4× pod scale-up, the pod is roughly 9× v3.
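A quick sanity check of that decomposition, using only the ballpark figures quoted above (the v3 clock is approximate):

```python
# Rough decomposition of the v4-over-v3 per-chip speedup.
v3_tflops, v4_tflops = 123, 275
mxu_ratio   = 4 / 2            # MXUs per TensorCore: v4 has 4, v3 has 2
clock_ratio = 1.05 / 0.94      # GHz: v4 ~1.05, v3 ~0.94

print(v4_tflops / v3_tflops)       # ~2.24x measured peak-FLOPS ratio
print(mxu_ratio * clock_ratio)     # ~2.23x explained by MXU count and clock alone
print(4 * v4_tflops / v3_tflops)   # ~8.9x pod-level uplift once the pod is 4x larger
```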
03
CMEM — The On-Die Cache
v4 (and especially v4i) introduces CMEM: a large on-die SRAM, shared by the TensorCores, sitting between the per-core VMEM and HBM. It is the first TPU memory level that behaves like a cache rather than a pure software-managed scratchpad, though XLA still manages residency explicitly.
The hierarchy on v4
Registers / accumulators — per-PE.
VMEM — ~32 MiB per TensorCore. Software-managed.
CMEM — ~128 MiB shared across the chip. XLA-managed residency.
HBM — 32 GiB at 1.2 TB/s.
Remote HBM — via ICI from a neighbouring chip (~50 GB/s per link).
What CMEM actually buys
KV cache hit rates. Long-context inference reuses the same KVs many times; CMEM keeps them on-die.
Activation re-use. For backprop, activations stored in CMEM avoid the HBM round-trip.
Fused attention. Q, K, V tiles staged in CMEM let the MXU run flash-attention-style fused kernels without spilling.
Effective HBM bandwidth boost. Each CMEM hit saves an HBM read; with good locality, effective bandwidth is ~2× the raw 1.2 TB/s.
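A back-of-envelope model of that last claim: if a fraction of operand traffic hits in CMEM, only the misses touch HBM, so the bandwidth the compute units observe scales with the inverse of the miss rate. The hit rate below is an illustrative assumption, not a published figure.

```python
# Effective-bandwidth model for a cache in front of HBM (illustrative only).
hbm_bw_tbps = 1.2      # raw HBM bandwidth, TB/s
cmem_hit_rate = 0.5    # assumed fraction of operand traffic served from CMEM

# Only misses touch HBM, so observed bandwidth scales as 1 / (1 - hit_rate)
# until CMEM's own bandwidth becomes the limit.
effective_bw = hbm_bw_tbps / (1 - cmem_hit_rate)
print(effective_bw)    # 2.4 TB/s -> the "~2x the raw 1.2 TB/s" figure above
```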
Why this is "v4i's contribution"
The v4i ISCA 2021 paper introduces CMEM ahead of the v4 paper, as part of its "Ten Lessons" from v1–v3. The pattern is interesting: ML chips have moved from scratchpad-only (v1–v3) toward explicit on-die caches (v4 onward). The TPU is approaching the GPU memory model from below, while GPUs (with HBM on package and more programmable cache management) approach it from above.
04
SparseCore — Embeddings in Hardware
The other v4 first. Recommendation models (DLRM, ranking models, ad CTR predictors) are not matmul-bound — they spend most of their time looking up entries in giant embedding tables (billions of rows, thousands of features, irregular access patterns) and then summing the results.
What the workload looks like
For each input feature i, look up k rows of an embedding table Ei (each row is a bf16 vector of size 32–512).
Apply a pooling reduction (sum, mean, max) over the k rows.
Concatenate the pooled vectors across features.
Feed the result into a dense MLP (which is matmul, fine for the MXU).
Steps 1–3 are scatter-gather plus reduction, with random access patterns and small sub-vector reductions. The MXU is useless for them; the vector unit can run them but is bandwidth-bound. The v4 paper says these workloads spent ~50% of training time in embedding ops on v3. A minimal sketch of the access pattern follows.
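Here is a toy JAX sketch of steps 1–4. The table sizes, feature count and pooling choice are made-up toy values; real tables have 10⁸+ rows and are sharded across chips.

```python
import jax
import jax.numpy as jnp

# Toy stand-in: gather rows from per-feature embedding tables, pool, concatenate,
# then feed a dense MLP. Shapes are illustrative only.
NUM_FEATURES, ROWS, DIM, K, BATCH = 8, 10_000, 64, 4, 32

keys = jax.random.split(jax.random.PRNGKey(0), NUM_FEATURES + 2)
tables = [jax.random.normal(k, (ROWS, DIM), dtype=jnp.bfloat16) for k in keys[:NUM_FEATURES]]
ids = jax.random.randint(keys[-2], (BATCH, NUM_FEATURES, K), 0, ROWS)      # step 1: indices
w_mlp = jax.random.normal(keys[-1], (NUM_FEATURES * DIM, 128), dtype=jnp.bfloat16)

@jax.jit
def forward(ids):
    pooled = []
    for f, table in enumerate(tables):
        rows = table[ids[:, f, :]]        # step 1: gather k rows per example (irregular access)
        pooled.append(rows.sum(axis=1))   # step 2: pooling reduction (sum)
    x = jnp.concatenate(pooled, axis=-1)  # step 3: concatenate across features
    return jax.nn.relu(x @ w_mlp)         # step 4: dense MLP (this part is MXU-friendly)

print(forward(ids).shape)                 # (32, 128)
```

Steps 1–3 are exactly the gather/pool pattern SparseCore accelerates; only the final matmul is a good fit for the MXU.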
SparseCore's hardware
What's on the die
Dataflow accelerator with hardware gather, hardware scatter-reduce, and atomic accumulate.
Direct path to HBM separate from the MXU's path.
Optimised address-translation hardware for the giant table indices (often 64-bit IDs into 100M-row tables).
~5% of die area, ~5% of die power.
What it delivers
5–7× speedup on DLRM-class models vs running embeddings on the vector unit (Google's claim).
Frees the MXU and vector unit for the dense MLP, so total throughput improves further.
Matters for ad ranking, YouTube recommendations, and Search ranking — all fleet-dominant workloads.
Why this matters in 2026
Modern frontier Mixture-of-Experts (MoE) models have a similar shape: each token is routed to a small number of experts, which requires scatter-gather over a giant set of expert weights. SparseCore was originally built for recommendations but is increasingly load-bearing for MoE inference. Trillium ships with a 3rd-generation SparseCore; Ironwood inherits a refined version. It's a permanent part of the architecture.
05
Six ICI Links — The 3D Torus
v2/v3 chips had 4 ICI links (2D torus). v4 has 6, enabling a 3D torus where each chip has neighbours in ±X, ±Y, ±Z.
Why 3D over 2D?
Diameter scales as N^(1/3) rather than N^(1/2). For 4096 chips, the 3D torus diameter is ~24 hops; a 2D torus would be ~64 (see the worked numbers after this list).
Bisection bandwidth. Higher in 3D; all-reduce times improve roughly proportionally.
Tensor-parallel friendly. A 3D shape lets you natively allocate 3D sharding for tensor / data / pipeline parallelism without packing tricks.
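The diameter figures come straight from the torus formula: an n-dimensional torus with k chips per side has a hop diameter of n·⌊k/2⌋. A quick check for the shapes mentioned above:

```python
# Hop diameter of a torus: sum of half the length of each dimension.
def torus_diameter(dims):
    return sum(k // 2 for k in dims)

print(torus_diameter((16, 16, 16)))   # 3D torus, 4096 chips -> 24 hops
print(torus_diameter((64, 64)))       # 2D torus, 4096 chips -> 64 hops
print(torus_diameter((32, 32)))       # v3's 1024-chip 2D torus -> 32 hops
```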
Why not 4D / 5D / hypercube?
Cabling cost. Each extra dimension means more ICI links per chip and more cables per board / rack.
Diminishing returns. Going from 2D to 3D buys you a lot; 3D to 4D buys much less.
Datacenter physics. A datacenter floor is a 2D plane of racks, with the vertical (within-rack) dimension short. A 3D torus maps onto it cleanly; 4D doesn't.
Per-link bandwidth is ~50 GB/s in each direction; aggregate ICI bandwidth per chip is ~300 GB/s. Six ICI links per chip is the same count as in v5p and Ironwood; the 3D torus has been the p-class topology from v4 onward.
06
Palomar — The Optical Circuit Switch
The single most distinctive component of v4 is not on the TPU itself; it sits in separate racks between the 64-chip cubes. Palomar is Google's bespoke 3D-MEMS optical circuit switch.
Each Palomar OCS is a 136-port (128 + 8 spares) reconfigurable optical switch using 3D-MEMS mirrors, with sub-millisecond reconfiguration. Forty-eight Palomar OCSes interconnect the 64 cubes of a 4,096-chip TPU v4 supercomputer. OCSes are <5% of system cost and <3% of system power vs an InfiniBand-based equivalent.
— Jouppi et al., "TPU v4: An Optically Reconfigurable Supercomputer", ISCA 2023
How an OCS works at the bit level
1. Stay optical
Light from each ICI cable enters the switch as a beam. It is never converted back to electrical inside the switch.
2. MEMS mirror
An array of microelectromechanical mirrors steers each beam to any output port. The mirror is the size of a salt grain and tilts in microseconds.
3. Reconfigure
To repoint a beam, the controller tilts the mirror to a new angle; the same fibre now connects a different pair of cubes. Reconfiguration takes under a millisecond.
What this enables that an electrical switch cannot
Topology per job. A 2,048-chip slice can be configured as 4×4×128 or 8×16×16 by pointing different mirrors (all the valid shapes are enumerated in the sketch after this list). Job 1 finishes → reconfigure in milliseconds → job 2 gets a different shape.
Healing around faults. If a TPU board fails, the OCS routes around its cube or swaps in a spare; the job loses at most one 64-chip cube rather than the whole pod.
Twisted torus. v4 uses a "twisted-3D torus" topology that is impossible with fixed cabling but trivially produced by an OCS — cuts diameter further with no extra hardware.
No serialiser-deserialiser cost. Optical-electrical-optical (OEO) conversion is one of the biggest energy costs in datacenter networking. The OCS keeps the path photonic end to end: minimal power, minimal heat.
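Because slices are assembled from 4×4×4 cubes, every dimension of a slice shape is a multiple of 4. A small sketch enumerating the valid 3D shapes for the 2,048-chip example above — this is just the arithmetic constraint, not Google's actual scheduler policy:

```python
# Enumerate 3D torus shapes for a 2048-chip slice built from 4x4x4 cubes:
# each dimension must be a multiple of 4 and the product must equal 2048.
TARGET = 2048
shapes = sorted({
    tuple(sorted((x, y, TARGET // (x * y))))
    for x in range(4, TARGET + 1, 4)
    for y in range(4, TARGET + 1, 4)
    if TARGET % (x * y) == 0 and (TARGET // (x * y)) % 4 == 0
})
print(shapes)
# [(4, 4, 128), (4, 8, 64), (4, 16, 32), (8, 8, 32), (8, 16, 16)]
```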
Where else Google uses OCS
Google's "Mission Apollo" paper (arXiv:2208.10041, 2022) describes OCS rolled out across the rest of the Jupiter datacenter network. So the Palomar work is a single piece of an organisation-wide reorientation toward optical circuit switching as the right scale-up fabric, not just a TPU thing. v4 is the chip that consumes the technology most aggressively.
07
A 4096-Chip Pod, Logically
Allocation policy
Smallest job: a single cube (64 chips, 4×4×4 internal torus, no OCS).
Common slices: 4×4×8 = 128, 8×8×8 = 512, 8×8×16 = 1024, all the way up to the full pod.
The OCS is configured at job start; the chosen cubes appear to software as a single 3D torus, with chip coordinates the user picks.
Pod-spanning topology shapes can be twisted (mirror-symmetric) for lower diameter, or non-twisted for simpler debugging.
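From the software side, a slice really does look like an ordinary 3D device grid. A hedged JAX sketch of what that buys you — the axis names and the 8×8×16 shape are illustrative, and this only runs on an actual TPU slice of that size:

```python
import jax
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# On a 1024-chip v4 slice the runtime exposes 1024 devices; arrange them as the
# physical 8x8x16 torus and name the axes after the parallelism mapped onto them.
devices = mesh_utils.create_device_mesh((8, 8, 16))        # requires a real 8x8x16 slice
mesh = Mesh(devices, axis_names=("data", "fsdp", "model"))

# Shard a weight matrix: rows split across the "fsdp" axis, columns across "model",
# replicated along "data". The compiler lowers the resharding onto ICI hops.
sharding = NamedSharding(mesh, P("fsdp", "model"))
w = jax.device_put(jax.numpy.zeros((8192, 8192)), sharding)
print(w.sharding)
```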
08
v4i — The Inference Sibling
v4i (Jouppi et al., ISCA 2021, "Ten Lessons From Three Generations Shaped Google's TPUv4i") is the inference-tuned variant of v4. Same 7 nm process, similar uncore, but designed for the cost envelope of fleet inference.
Spec             v4 (training)                  v4i (inference)
Process          7 nm                           7 nm
TensorCores      2                              1
Per-chip bf16    275 TFLOPS                     138 TFLOPS
HBM              32 GiB at 1.2 TB/s             8 GiB HBM2 at ~614 GB/s
CMEM             ~128 MiB                       ~128 MiB (introduced here first)
SparseCore       Yes                            Yes
ICI links        6                              0 (single chip)
Pod              4096 chips, 3D torus + OCS     Single chip / PCIe
TDP              ~170 W                         ~175 W
Cooling          Liquid                         Air
Design choices
One TensorCore is enough. Inference batches are smaller; one core fully utilised is better than two cores at half utilisation.
HBM2 single stack saves cost. 8 GiB is enough for most fleet inference models in 2020.
No ICI — a v4i is a single-chip card. Multi-chip inference uses host-mediated coordination, not the pod fabric.
Air-cooled — same form factor as v1; drops into existing servers.
v4i is the chip that powered the bulk of Google's user-facing inference (Search, YouTube, Ads, Assistant) from 2020 onward. By 2023, with Bard's launch, v4i was also the first inference-side LLM-serving TPU at user scale. It is the direct ancestor of v5e and Trillium — the e-class lineage.
09
PaLM — 6,144 Chips, 50 Days
The headline workload that v4 trained: PaLM (Pathways Language Model), 540B parameters, April 2022. PaLM was among the first publicly disclosed frontier-scale LLMs trained on TPUs, and the first job to use Google's Pathways single-controller orchestration system to span multiple TPU pods.
Parameters
540 billion (dense, decoder-only Transformer)
Hardware
2 TPU v4 pods of 3,072 chips each = 6,144 chips total, orchestrated across pods by Pathways
Training time
~50 days wall-clock
Tokens trained
780 billion
Hardware utilisation
~46% Model FLOPs Utilisation (MFU), among the highest reported for an LLM-scale run at the time
Orchestration
Pathways single-controller dataflow (Barham, Dean, Ghemawat et al., MLSys 2022)
The Pathways paper is the systems counterpart to the v4 hardware paper. It describes how a single Python program on a single client dispatches work to thousands of TPUs across multiple pods, with asynchronous compilation and dataflow graphs that span the cluster.
A milestone hidden in the citation
Hennessy and Patterson's CACM 2019 Turing lecture predicted that "domain-specific architectures" would deliver 50×+ improvements over general-purpose CPUs; PaLM is the existence proof at frontier-LLM scale. Training 540B parameters on 780B tokens across 6,144 chips for ~50 days is on the order of 10^24 bf16 FLOPs (the arithmetic is sketched below); that workload on a CPU cluster would be impossible.
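Both routes to that estimate land in the same place, using the figures from the table above (the 6·N·D rule and the MFU-based estimate are the standard approximations):

```python
# Two independent estimates of PaLM's training compute.
params, tokens = 540e9, 780e9
flops_model = 6 * params * tokens                      # 6*N*D rule: ~2.5e24 FLOPs

chips, peak_tflops, mfu, days = 6144, 275e12, 0.46, 50
flops_hw = chips * peak_tflops * mfu * days * 86400    # ~3.4e24 FLOPs delivered

print(f"{flops_model:.1e}  {flops_hw:.1e}")            # both on the order of 1e24
```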
10
Resiliency at Scale
USENIX NSDI 2024 has a Google paper, "Resiliency at Scale: Managing Google's TPUv4 ML Supercomputer", that's worth reading on its own as a systems-engineering document.
The problem
4096-chip pods. Mean Time Between Failures of any individual chip is high — but with thousands of them, some chip fails every few days.
An LLM training job runs for weeks. A 24-hour failure rate is 1–5% — expect O(10) chip failures per training run.
Naive checkpoint-and-restart loses hours of progress per failure. Unacceptable at this scale.
The v4 system's answer
OCS rerouting. When a chip dies, the OCS swaps in a hot-spare cube within milliseconds. The training job pauses for one heartbeat and resumes.
Frequent async checkpointing. Pathways' checkpointing layer can write state to disaggregated storage every few minutes without stalling the training step (a minimal restart loop is sketched after this list).
In-flight gradient skipping. If a step's all-reduce is corrupted by a transient ICI error, the affected gradient is dropped and the step retried, rather than aborting the job.
Hardware ECC + parity on every memory level (CMEM, VMEM, HBM) and on the ICI links, with corrected-error counters reported up to the orchestration layer.
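None of Google's internal checkpointing machinery is public in code form, but the control flow is simple to sketch. A minimal, hypothetical restart loop in plain Python/JAX — the save/restore helpers, the checkpoint path and the failure model are stand-ins, not Pathways APIs:

```python
import pickle, pathlib
import jax, jax.numpy as jnp

CKPT = pathlib.Path("/tmp/ckpt.pkl")      # stand-in for disaggregated storage
CKPT_EVERY = 100                          # steps between checkpoints

def save(step, params):                   # hypothetical helper, not a Pathways API
    CKPT.write_bytes(pickle.dumps((step, jax.device_get(params))))

def restore(init_params):
    if CKPT.exists():
        step, params = pickle.loads(CKPT.read_bytes())
        return step, jax.device_put(params)
    return 0, init_params

@jax.jit
def train_step(params):                   # placeholder update; a real step does fwd/bwd + all-reduce
    return jax.tree_util.tree_map(lambda p: p - 1e-3, params)

step, params = restore({"w": jnp.zeros((1024, 1024))})
while step < 1000:
    try:
        params = train_step(params)
        step += 1
        if step % CKPT_EVERY == 0:
            save(step, params)             # in the real system this write is asynchronous
    except RuntimeError:
        # A hardware interruption surfaces as a failed step: the OCS has already
        # swapped in a spare cube, so just resume from the last checkpoint.
        step, params = restore(params)
```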
The result is "five-nines" effective uptime on multi-week training jobs at 4096-chip scale — a number that has no real precedent in scientific computing, where chip-scale supercomputers like Summit / Frontier checkpoint hourly.
11
v4 Numbers In One Page
Per chip
275 TFLOPS bf16
32 GiB HBM2
1.2 TB/s HBM BW
~128 MiB CMEM
1 SparseCore
6 ICI links @ 50 GB/s ea
~170 W
Per pod
4096 chips
~1.1 ExaFLOPS bf16
128 TiB aggregate HBM
3D torus + 48 Palomar OCS
Reconfigurable slices: 64 to 4096 chips
Twisted-torus topology option
Pod environmental
Liquid cooled
Pod power ~700 kW
Pod footprint ~64 TPU racks (one 64-chip cube per rack), plus OCS racks
Hot-spare cubes for fault recovery
OCS <5% of system cost, <3% of system power
The v4i companion
Same 7 nm process; one TensorCore; 138 TFLOPS bf16; 8 GiB HBM2; 175 W; air-cooled; PCIe.
Drops into existing inference servers; single-chip; no ICI.
Powers Google's bulk inference fleet from 2020 onward, including the first generation of Bard.
12
Cheat Sheet
v4 (deployed 2020, announced 2021, paper ISCA 2023): 7 nm, 2 TensorCores with 4 MXUs each, 275 TFLOPS bf16, 32 GiB HBM, ~170 W, liquid-cooled.
CMEM — ~128 MiB on-die SRAM cache between VMEM and HBM. First TPU memory level with cache-like behaviour.
SparseCore — embedding-lookup accelerator. 5–7× speedup on DLRM-class workloads. Now in every TPU since.
6 ICI links per chip — enables 3D torus.
Palomar OCS — 136-port 3D-MEMS optical circuit switch. 48 OCSes per pod. <5% of system cost, <3% of power. Reconfigurable in <1 ms.
4096-chip pod: 64 cubes of 64 chips each, connected via OCS. Sub-pod slices reconfigurable per job. Twisted-torus topology.