How a TPU pod is wired up. Custom electrical SerDes, 2D vs 3D torus, the Palomar 3D-MEMS optical switch, twisted-torus topologies, healing around faulty cubes, and the multipod fabric over Jupiter.
Inter-Chip Interconnect is Google's custom chip-to-chip link. It is not InfiniBand, not Ethernet, not NVLink. It exists for one reason: to make the TPU pod look like a single coherent compute substrate to the compiler, with the lowest possible latency and the highest bandwidth that fits on the package.
NVLink and InfiniBand both came with cost, latency, or proprietary ecosystem disadvantages in the 2016 TPU v2 design window. Google had a network-silicon team that had built the Jupiter switch ASICs; doing custom SerDes for chip-to-chip was a smaller leap than it sounds. Eight years later, the proprietary fabric is one of the TPU programme's main competitive moats.
Why does e-class stay 2D? Cost. 4 ICI ports per chip vs 6 saves silicon and packaging. For 256-chip pods (the e-class size), 2D is fine; for 9,216-chip pods (p-class), 2D would be unworkable.
Three terms appear constantly in TPU documentation, and they are not synonyms: cube, slice, and pod.
Cube: the smallest physical unit of a 3D torus, typically 4×4×4 = 64 chips on v4/v5p. Cabled internally with fixed ICI; cabled externally to the OCS for everything bigger.
Slice: a logical, schedulable subset of a pod that looks to the user like one 3D torus of dimensions they pick. The OCS makes any cube-multiple slice possible.
Pod: the full ICI-coherent compute domain. v4: 4,096 chips. v5p: 8,960 chips. Ironwood: 9,216 chips. The Trillium pod is 256 chips (e-class shape).
Beyond a pod is multipod — multiple pods connected by Jupiter DCN, used by Multislice. From the compiler's view, multipod is a coarser-grained mesh sitting on top of the pods. ICI gives you sub-microsecond latency inside a pod; Jupiter gives you microseconds between them.
"TPU pod" sometimes refers to a literal physical structure (the 256-chip v3 rack-of-racks) and sometimes to the logical scheduling unit (the 8,960-chip v5p ICI domain). Modern Cloud TPU documentation uses pod to mean the logical ICI domain; the physical rack-level structure is implicit.
Palomar is Google's bespoke 3D-MEMS optical circuit switch, introduced for v4 and disclosed in the ISCA 2023 paper.
Reconfiguration time is sub-millisecond. The mirrors tilt mechanically, but only through a few hundred microns of travel; settle time is dominated by piezoelectric damping.
Each Palomar OCS has 136 ports: 128 active plus 8 spares. The spares are used to route around damaged ports or to round-robin against mirror wear.
OCSes are <5% of system cost and <3% of system power vs an electrical-switch equivalent for the v4 pod.
For a packet switch, latency goes up as you push more bandwidth through (queuing, contention). For a circuit switch, you set up the path once and stream data through it forever. ML workloads have predictable communication patterns — you all-reduce the same gradients every step. That's exactly the workload a circuit switch is good at.
Datacenter networking had been firmly packet-switched since the 1990s because workloads were unpredictable (web, search, analytics). ML training in 2020 broke that assumption: the same all-reduce, every step, forever. That's the workload OCS is designed for. Google's Mission Apollo paper (2022) generalises the idea to the rest of the datacenter network.
Because OCS reconfigures per job, v4-and-later pods can present different topology shapes to different jobs without recabling. The most useful new shape is the twisted torus.
In a standard k×k torus, the diameter is 2×floor(k/2) = k for even k. In a twisted torus where the wrap-around is offset by some constant m, the diameter can drop to roughly k/√2 when m is well-chosen. For the 9,216-chip Ironwood pod, that's the difference between a 32-hop worst case and a 22-hop worst case for all-to-all traffic, significant for tail-latency-sensitive collectives.
You cannot wire a twisted torus in fixed cabling because the offset depends on slice shape. Twisted topology is a free upgrade you get from having an OCS.
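The diameter effect is easy to check directly. A minimal sketch in Python, using a 2D torus for brevity and modelling the twist as an offset m applied to one wrap-around dimension; the construction is illustrative, not the exact Ironwood wiring:

```python
from collections import deque

def torus_neighbors(x, y, k, twist=0):
    """Neighbors of (x, y) in a k x k torus.
    With twist=m, the y wrap-around lands m columns over:
    (x, k-1) connects to ((x+m) % k, 0)."""
    nbrs = [((x + 1) % k, y), ((x - 1) % k, y)]
    nbrs.append((x, y + 1) if y < k - 1 else ((x + twist) % k, 0))
    nbrs.append((x, y - 1) if y > 0 else ((x - twist) % k, k - 1))
    return nbrs

def diameter(k, twist=0):
    """Max shortest-path length over all node pairs (BFS from every node)."""
    nodes = [(x, y) for x in range(k) for y in range(k)]
    best = 0
    for src in nodes:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in torus_neighbors(*u, k, twist):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

k = 8
print("standard torus diameter:", diameter(k))             # k for even k
print("twisted torus diameter: ", diameter(k, twist=k // 2))  # roughly k/sqrt(2)
```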
The most important communication pattern in TPU pod-scale training. Used after every gradient step in data-parallel training, within every attention and MLP block in tensor-parallel layers, and after every expert pass in MoE.
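At the JAX level the pattern is a single collective call; the ring/torus algorithm underneath is chosen by the runtime. A minimal sketch of a data-parallel gradient all-reduce, runnable on any multi-device backend (the axis name `'data'` and the toy gradient shape are arbitrary):

```python
from functools import partial
import jax
import jax.numpy as jnp

n = jax.device_count()

@partial(jax.pmap, axis_name='data')
def allreduce_grads(local_grad):
    # psum is the all-reduce; on TPU it runs over ICI rather than the vector units.
    return jax.lax.psum(local_grad, axis_name='data')

# One per-replica "gradient" per device, leading dim = device count.
local = jnp.arange(n * 4, dtype=jnp.float32).reshape(n, 4)
summed = allreduce_grads(local)
print(summed[0])  # every replica now holds the same summed gradient
```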
Total time is approximately (N-1)/N × (gradient size / per-link bandwidth) per axis. For a 9,216-chip pod arranged as 16×24×24 with 600 GB/s per link per direction, an all-reduce on a 1 GB gradient takes well under 10 ms wall-clock.
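A back-of-the-envelope calculator using the estimate above; the 16×24×24 shape and 600 GB/s figure are the ones quoted in this section, and summing the per-axis terms is an assumption about how "per axis" composes:

```python
def allreduce_time_s(grad_bytes, axis_sizes, link_bw_bytes_per_s):
    """Estimate: sum over torus axes of (N-1)/N * (size / per-link bandwidth)."""
    return sum((n - 1) / n * grad_bytes / link_bw_bytes_per_s for n in axis_sizes)

# 1 GB gradient on a 16x24x24 pod at 600 GB/s per link per direction.
t = allreduce_time_s(1e9, (16, 24, 24), 600e9)
print(f"{t * 1e3:.2f} ms")  # roughly 4.8 ms, well under 10 ms
```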
The TPU's ICI link controllers contain dedicated reduce/gather hardware, so the chip doesn't have to use its vector unit to do the additions during an all-reduce. This is one of the underrated parts of the architecture: collective operations don't steal compute time from the matmul. NVIDIA achieves the equivalent in software with NCCL, which runs on the SMs and competes with compute kernels.
At 9,216-chip scale, hardware will fail mid-job. The OCS makes that survivable.
For a 50-day PaLM-class run on 6,144 chips, you expect somewhere between 5 and 20 chip failures. Each unrecoverable failure on a non-OCS pod would mean: stop, roll back to the last checkpoint, debug, replace, resume; conservatively an hour. With OCS hot-spare rerouting, each failure costs <1 second. Over a 50-day run, that's the difference between losing one day to faults and losing none.
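The arithmetic, spelled out (the failure counts and per-failure costs are the ranges quoted above, not new data):

```python
failures_low, failures_high = 5, 20   # expected chip failures over 50 days
cost_no_ocs_h = 1.0                   # ~1 hour per failure: stop, roll back, debug, replace, resume
cost_ocs_h = 1.0 / 3600               # <1 second per failure with hot-spare rerouting

print(f"no OCS: {failures_low * cost_no_ocs_h:.0f}-{failures_high * cost_no_ocs_h:.0f} hours lost")
print(f"OCS:    {failures_high * cost_ocs_h * 3600:.0f} seconds lost, worst case")
```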
"Resiliency at Scale: Managing Google's TPUv4 ML Supercomputer" (USENIX NSDI 2024) is the systems-engineering counterpart to ISCA 2023. It documents the failure modes seen in production v4 pods and the orchestration strategies (checkpointing cadence, gradient-skip thresholds, hot-spare reservation policy) that keep multi-week jobs running. Worth reading even outside a TPU context.
Beyond the pod is Google's Jupiter datacenter network — the same network fabric that connects every other Google service. Modern Jupiter is itself OCS-based after Mission Apollo (2022).
The relationship is layered: ICI inside a pod, OCS inside the pod for topology reconfiguration, Jupiter for inter-pod, packet-switching only at the very top.
| Generation | Per-chip ICI (bidirectional) | ICI ports | Topology | Pod size (chips) |
|---|---|---|---|---|
| v1 | — | 0 (PCIe only) | — | 1 chip |
| v2 | ~62 GB/s aggregate | 4 | 2D torus | 256 |
| v3 | ~82 GB/s aggregate (estimated) | 4 | 2D torus | 1024 |
| v4 | ~300 GB/s | 6 | 3D torus + OCS | 4096 |
| v5e | 400 GB/s | 4 | 2D torus | 256 |
| v5p | 1.2 TB/s | 6 | 3D torus + OCS | 8960 |
| Trillium (v6e) | 800 GB/s | 4 | 2D torus | 256 |
| Ironwood (v7) | 1.2 TB/s | 6 | 3D torus + OCS | 9216 |
The Palomar OCS is the most visible Google use of optical circuit switching, but not the only one. "Mission Apollo: Landing Optical Circuit Switching at Datacenter Scale" (Google, arXiv:2208.10041, 2022) describes how the same technology was rolled out across the rest of the Jupiter network.
The TPU's Palomar is, in effect, the fine-grained, fast-reconfiguration sibling of Apollo — reconfigure per-job vs Apollo's per-day. They share supply chain (Google's optical-component partner network), share physics (3D-MEMS), share controller code, and share organisational ownership.
Most hyperscalers buy network silicon from Broadcom or Marvell; the network is procurement, not architecture. Google designs its own OCS, its own network ASICs, and its own custom SerDes for both ICI and Jupiter. The TPU isn't just a chip; it's a chip plus a network. Reproducing the TPU advantage requires reproducing both halves — which is why no other hyperscaler has matched it.
Deck 11 — Software Stack shows how XLA and JAX use this pod fabric — mesh, sharding, GSPMD, Pathways. Deck 12 — TPU vs GPU contrasts the 3D-torus + OCS approach with NVIDIA's NVLink + InfiniBand fat-tree.