Google TPUs Series — Presentation 10

ICI, OCS & the 3D Torus

How a TPU pod is wired up. Custom electrical SerDes, 2D vs 3D torus, the Palomar 3D-MEMS optical switch, twisted-torus topologies, healing around faulty cubes, and the multipod fabric over Jupiter.

Tags: ICI · SerDes · 2D torus · 3D torus · Palomar OCS · 3D MEMS · Jupiter DCN

Scale hierarchy: SerDes lane → ICI link → cube (4×4×4 = 64 chips) → Palomar OCS → pod (3D torus) → Jupiter DCN → multipod
00

Topics We'll Cover

  • ICI, Google's custom chip-to-chip interconnect
  • 2D vs 3D torus, and which chip class gets which
  • Cubes, slices, and pods, the vocabulary
  • Palomar, the 3D-MEMS optical circuit switch
  • Why optical circuit switching rather than electrical packet switching
  • Twisted tori and job-time topology
  • All-reduce on a 3D torus
  • Healing around faulty cubes
  • Jupiter DCN and multipod
  • ICI bandwidth across TPU generations
  • Mission Apollo, OCS outside the TPU

01

ICI — What It Actually Is

Inter-Chip Interconnect is Google's custom chip-to-chip link. It is not InfiniBand, not Ethernet, not NVLink. It exists for one reason: to make the TPU pod look like a single coherent compute substrate to the compiler, with the lowest possible latency and the highest bandwidth that fits on the package.

Physical

  • Custom Google electrical SerDes lanes, several per ICI port.
  • Per-chip aggregate ICI bandwidth grew from ~62 GB/s (v2) to 1.2 TB/s bidirectional (Ironwood).
  • Ports per chip: 4 on the 2D-torus chips (v2/v3 and the e-class line), 6 on the 3D-torus p-class chips (v4 onwards).
  • Cabled (between cubes / racks) and on-package (within a board) variants.

Logical

  • Each chip has direct ICI peers (its torus neighbours).
  • Hardware support for collective primitives: all-reduce, all-gather, reduce-scatter, all-to-all.
  • Sub-microsecond latency between adjacent chips; hop-by-hop routing for non-neighbours.
  • Software view (via XLA): a chip can "address" any HBM in its slice; the compiler turns that into ICI transfers.
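
A minimal JAX sketch of that software view, assuming a toy 2×2 device mesh (illustrative, not a real slice shape): the user shards an array over named mesh axes and writes ordinary array code, and XLA turns the cross-chip reads into ICI transfers and collectives.

```python
# Hypothetical 2x2 slice; on a real slice the mesh shape follows the ICI topology.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(mesh_utils.create_device_mesh((2, 2)), ("x", "y"))

# Each chip holds one tile of `a` in its own HBM.
a = jax.device_put(jnp.ones((8192, 8192)), NamedSharding(mesh, P("x", "y")))

# Summing along the sharded "y" axis needs values that live in other chips'
# HBM; the compiler emits the reduction over ICI -- no explicit send/recv.
row_sums = jnp.sum(a, axis=1)
```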

Why custom?

NVLink and InfiniBand both came with cost, latency, or proprietary ecosystem disadvantages in the 2016 TPU v2 design window. Google had a network-silicon team that had built the Jupiter switch ASICs; doing custom SerDes for chip-to-chip was a smaller leap than it sounds. Eight years later, the proprietary fabric is one of the TPU programme's main competitive moats.

02

2D vs 3D Torus — The Topology Choice

  • 2D torus (e-class) — 4 ICI links per chip; used by v2 / v3 / v5e / Trillium; pods up to 256 chips.
  • 3D torus (p-class) — 6 ICI links per chip (3 axes × 2 directions); used by v4 / v5p / Ironwood; pods up to 9,216 chips, plus a Palomar OCS layer for inter-cube reconfigurability.

Diameter and bisection
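
These are the standard k-ary n-torus relations (textbook results, not numbers from the deck): with N = k^n chips, worst-case hop count and bisection width scale very differently in 2D and 3D.

```latex
% Standard k-ary n-torus with N = k^n chips (k even):
%   diameter  = worst-case hop count between two chips
%   bisection = number of links cut when the machine is split in half
\[
  \mathrm{diameter}(k,n) = n \left\lfloor \tfrac{k}{2} \right\rfloor ,
  \qquad
  \mathrm{bisection}(k,n) = 2\,k^{\,n-1}\ \text{links.}
\]
% Example at 4,096 chips: a 64x64 2D torus has diameter 64 and bisection 128 links;
% a 16x16x16 3D torus has diameter 24 and bisection 512 links.
```

At a few thousand chips the third axis buys roughly a 3× shorter worst-case path and 4× more bisection links, which is what makes the larger p-class pods viable.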

Why does e-class stay 2D? Cost. 4 ICI ports per chip vs 6 saves silicon and packaging. For 256-chip pods (the e-class size), 2D is fine; for 9,216-chip pods (p-class), 2D would be unworkable.

03

Cubes, Slices, Pods — The Vocabulary

Three terms that appear constantly in TPU documentation. They are not synonyms.

Cube

The smallest physical unit of a 3D torus. Typically 4×4×4 = 64 chips on v4/v5p. Cabled internally with fixed ICI; cabled externally to the OCS for everything bigger.

Slice

A logical, schedulable subset of a pod. Looks to the user like one 3D torus of dimensions they pick. The OCS makes any cube-multiple slice possible.

Pod

The full ICI-coherent compute domain. v4: 4,096 chips. v5p: 8,960 chips. Ironwood: 9,216 chips. A Trillium pod is 256 chips (the e-class shape).

Multipod

Beyond a pod is multipod — multiple pods connected by Jupiter DCN, used by Multislice. From the compiler's view, multipod is a coarser-grained mesh sitting on top of the pods. ICI gives you sub-microsecond latency inside a pod; Jupiter gives you microseconds between them.

A common misreading

"TPU pod" sometimes refers to a literal physical structure (the 256-chip v3 rack-of-racks) and sometimes to the logical scheduling unit (the 8,960-chip v5p ICI domain). Modern Cloud TPU documentation uses pod to mean the logical ICI domain; the physical rack-level structure is implicit.

04

Palomar — Inside the Optical Circuit Switch

Palomar is Google's bespoke 3D-MEMS optical circuit switch, introduced for v4 and disclosed in the ISCA 2023 paper.

Each Palomar OCS is a 136-port (128 + 8 spares) reconfigurable optical switch using 3D-MEMS mirrors with sub-millisecond reconfiguration. Forty-eight Palomar OCSes interconnect the 64 cubes of a 4,096-chip TPU v4 supercomputer. — Jouppi et al., ISCA 2023 (arXiv:2304.01433)

The optical path

  1. An ICI cable carries data as optical signals over fibre from a TPU board's network ASIC to a Palomar OCS port.
  2. The light enters the OCS, passes through micro-lenses, and lands on a 3D-MEMS mirror: a microscopic tiltable mirror in a 12×12 array (one per port).
  3. The mirror's tilt angle is set by a controller, sending the beam onto a second mirror plane.
  4. The second mirror reflects the beam toward the chosen output port.
  5. The light exits via fibre to the destination TPU board.
  6. The data is never deserialised inside the switch — it stays optical end-to-end.

Reconfiguration time

Sub-millisecond. The mirrors tilt mechanically but only a few hundred microns; settle time is dominated by piezoelectric damping.

Port count

136 = 128 active + 8 spares. The spares replace damaged ports and can be rotated into service to spread mirror wear.

Power & cost

OCSes are <5% of system cost and <3% of system power vs an electrical-switch equivalent for the v4 pod.

05

Why Optical, Not Electrical?

The OEO problem

  • An electrical switch must convert Optical → Electrical → Optical at every port.
  • Each conversion costs energy (pJ/bit), latency (nanoseconds), and silicon area for the SerDes.
  • At pod scale (thousands of links), this dominates network power.

OCS skips it

  • Light enters, mirror tilts, light exits. No conversion.
  • No retiming, no buffering, no protocol overhead inside the switch.
  • Power and latency in the switch are independent of bandwidth — doubling bandwidth doesn't double switch power.

The deeper reason: bandwidth grows; switching latency must shrink

For a packet switch, latency goes up as you push more bandwidth through (queuing, contention). For a circuit switch, you set up the path once and stream data through it forever. ML workloads have predictable communication patterns — you all-reduce the same gradients every step. That's exactly the workload a circuit switch is good at.

Why this hadn't been done before

Datacenter networking had been firmly packet-switched since the 1990s because workloads were unpredictable (web, search, analytics). ML training in 2020 broke that assumption: the same all-reduce, every step, forever. That's the workload OCS is designed for. Google's Mission Apollo paper (2022) generalises the idea to the rest of the datacenter network.

06

Twisted Tori & Job-Time Topology

Because OCS reconfigures per job, v4-and-later pods can present different topology shapes to different jobs without recabling. The most useful new shape is the twisted torus.

  • Standard 4×4 torus: the wrap-around connects row N back to row N (column 0 ↔ column N−1).
  • Twisted 4×4 torus: the wrap-around shifts by one row, which cuts the diameter.

Why twisting helps

In a standard k×k torus, the diameter is 2×floor(k/2) = k (for even k). In a twisted torus where the wrap-around is offset by a well-chosen constant m, the diameter drops by a factor of roughly √2, to about k/√2. For the 9,216-chip Ironwood pod, that's the difference between a worst-case 32-hop path and a worst-case 22-hop path — significant for tail-latency-sensitive collectives.

You cannot wire a twisted torus in fixed cabling because the offset depends on slice shape. Twisted topology is a free upgrade you get from having an OCS.
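
A small sketch of that diameter claim, assuming a simple 2D model (plain breadth-first search, not Google's routing): it computes the diameter of a k×k torus whose X wrap-around is offset by a twist of m rows, so the plain and twisted worst cases can be compared directly.

```python
# Diameter of a k x k torus whose wrap-around in x is offset by `twist` rows.
# The graph is vertex-transitive, so BFS from one node gives the diameter.
from collections import deque

def neighbours(x, y, k, twist):
    yield x, (y + 1) % k                               # +y ring
    yield x, (y - 1) % k                               # -y ring
    nx = x + 1                                         # +x: wrapping shifts y by twist
    yield nx % k, (y + twist) % k if nx == k else y
    nx = x - 1                                         # -x: wrapping undoes the shift
    yield nx % k, (y - twist) % k if nx < 0 else y

def diameter(k, twist=0):
    dist = {(0, 0): 0}
    frontier = deque([(0, 0)])
    while frontier:
        x, y = frontier.popleft()
        for nxt in neighbours(x, y, k, twist):
            if nxt not in dist:
                dist[nxt] = dist[(x, y)] + 1
                frontier.append(nxt)
    return max(dist.values())

print(diameter(8))            # plain 8x8 torus: diameter 8
print(diameter(8, twist=4))   # twisted wrap-around: noticeably smaller
```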

07

All-Reduce on a 3D Torus

The most important communication pattern in TPU pod-scale training. It runs after every gradient step in data-parallel training, after the attention and MLP blocks in tensor-parallel layers, and around every expert pass in MoE.

The textbook 3D-torus algorithm

  1. Reduce-scatter along X. Each row of chips reduces and scatters its values along the X axis. Each chip ends up with one X-stride of the partial sum.
  2. Reduce-scatter along Y. Same, perpendicular axis.
  3. Reduce-scatter along Z. Same, third axis.
  4. All-gather along Z, Y, X, in reverse, to broadcast the final result.

Per axis of length k, a ring reduce-scatter or all-gather takes roughly (k−1)/k × (bytes moved on that axis) / (per-link bandwidth), and the first axis, which moves the full gradient, dominates. For a 9,216-chip pod arranged as 16×24×24 with 600 GB/s per link per direction, an all-reduce on a 1 GB gradient takes well under 10 ms wall-clock.
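
A back-of-the-envelope estimator for that figure, a sketch of the textbook formula above rather than a model of the real scheduler (which overlaps phases and axes):

```python
def allreduce_time_s(grad_bytes, dims, link_bw_bytes_per_s):
    """Reduce-scatter along each torus axis, then all-gather back in reverse."""
    t, size = 0.0, float(grad_bytes)
    for k in dims:                                     # reduce-scatter X, Y, Z
        t += (k - 1) / k * size / link_bw_bytes_per_s
        size /= k                                      # each chip keeps a 1/k shard
    for k in reversed(dims):                           # all-gather Z, Y, X
        size *= k                                      # shard grows back by k
        t += (k - 1) / k * size / link_bw_bytes_per_s
    return t

# 1 GB gradient on a 16 x 24 x 24 slice, 600 GB/s per link per direction
print(f"{allreduce_time_s(1e9, (16, 24, 24), 600e9) * 1e3:.1f} ms")   # ~3.3 ms
```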

Hardware all-reduce

The TPU's ICI link controllers contain dedicated reduce/gather hardware, so the chip doesn't have to use its vector unit to do the additions during an all-reduce. This is one of the underrated parts of the architecture: collective operations don't steal compute time from the matmul. NVIDIA achieves the same result with NCCL, but NCCL's kernels run on the SMs and compete with the compute kernels.
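
From user code the collective is a single call either way; a minimal pmap-style sketch (the offload to ICI hardware is transparent, so nothing in the program changes whether the adds happen in the link controllers or on the vector unit):

```python
from functools import partial
import jax
import jax.numpy as jnp

@partial(jax.pmap, axis_name="chips")          # one program instance per TPU chip
def sync_grads(local_grad):
    # psum is lowered by XLA to the torus all-reduce described above.
    return jax.lax.psum(local_grad, axis_name="chips")

grads = jnp.ones((jax.local_device_count(), 1024))   # one shard per local chip
summed = sync_grads(grads)                            # every chip receives the full sum
```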

08

Healing Around Faulty Cubes

At 9,216-chip scale, hardware will fail mid-job. The OCS makes that survivable.

  1. A cube's monitoring detects an unrecoverable error on one of its 64 chips.
  2. The OCS controller is notified; a hot-spare cube (kept reserved for this purpose) is brought online.
  3. The mirrors that route to the failed cube are tilted to point at the spare cube instead.
  4. The training job pauses for a few hundred milliseconds (single ICI heartbeat), then resumes.
  5. Loss of progress: a few seconds at most. Compare to a checkpoint-and-restart cost of minutes.

What this is worth

For a 50-day PaLM-class run on 6,144 chips, you expect somewhere between 5 and 20 chip failures. Each unrecoverable failure on a non-OCS pod would mean: stop, checkpoint, debug, replace, resume — conservatively an hour. With OCS hot-spare rerouting, each failure costs <1 second. Over a 50-day run, that's the difference between losing one day to faults and losing none.
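
The arithmetic behind that comparison, using the numbers above (both per-failure costs are the deck's stated assumptions, not measurements):

```python
failures = 20                               # pessimistic end of the 5-20 range
hours_lost_without_ocs = failures * 1.0     # ~1 hour per checkpoint-debug-restart cycle
hours_lost_with_ocs = failures * 1 / 3600   # <1 s per hot-spare reroute
print(f"{hours_lost_without_ocs:.0f} h vs {hours_lost_with_ocs:.3f} h")   # 20 h vs 0.006 h
```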

The NSDI 2024 paper

"Resiliency at Scale: Managing Google's TPUv4 ML Supercomputer" (USENIX NSDI 2024) is the systems-engineering counterpart to ISCA 2023. It documents the failure modes seen in production v4 pods and the orchestration strategies (checkpointing cadence, gradient-skip thresholds, hot-spare reservation policy) that keep multi-week jobs running. Worth reading even outside a TPU context.

09

Jupiter DCN & Multipod

Beyond the pod is Google's Jupiter datacenter network — the same network fabric that connects every other Google service. Modern Jupiter is itself OCS-based after Mission Apollo (2022).

What Jupiter is

  • Hierarchical fat-tree of OCS switches and electrical packet switches.
  • Bisection bandwidth at the datacenter level: 13 Pb/s for a Trillium multipod region.
  • Connects every TPU pod, every CPU rack, every storage node in a region.
  • Latency between pods: tens of microseconds typical.

What multipod uses it for

  • Cross-pod gradient all-reduce in data-parallel training spanning multiple pods.
  • Pipeline-parallel sharding across pods.
  • Pathways control-plane traffic (compilation cache, scheduling, monitoring).
  • Disaggregated checkpoint storage and dataset streaming.

Multipod composition limits

The relationship is layered: ICI inside a pod, OCS inside the pod for topology reconfiguration, Jupiter for inter-pod, packet-switching only at the very top.

10

ICI Bandwidth Across TPU Generations

| Generation | Per-chip ICI (bidir) | Ports | Topology | Pod size (chips) |
| --- | --- | --- | --- | --- |
| v1 | 0 (PCIe only) | — | — | 1 |
| v2 | ~62 GB/s aggregate | 4 | 2D torus | 256 |
| v3 | ~82 GB/s aggregate (estimated) | 4 | 2D torus | 1,024 |
| v4 | ~300 GB/s | 6 | 3D torus + OCS | 4,096 |
| v5e | 400 GB/s | 4 | 2D torus | 256 |
| v5p | 1.2 TB/s | 6 | 3D torus + OCS | 8,960 |
| Trillium (v6e) | 800 GB/s | 4 | 2D torus | 256 |
| Ironwood (v7) | 1.2 TB/s | 6 | 3D torus + OCS | 9,216 |

Reading the trends

Per-chip ICI bandwidth has grown roughly 20× from v2 to Ironwood (~62 GB/s → 1.2 TB/s). The 3D-torus + OCS generations (v4, v5p, Ironwood) carry 6 ports and multi-thousand-chip pods, while the e-class generations (v5e, Trillium) hold at 4 ports and 256-chip 2D-torus pods. v5p and Ironwood share the same 1.2 TB/s figure; Ironwood's gain is in pod scale rather than per-chip bandwidth.

11

Mission Apollo — OCS Outside the TPU

The Palomar OCS is the most visible Google use of optical circuit switching, but not the only one. "Mission Apollo: A Modern Datacenter Network with OCS" (Google, arXiv:2208.10041, 2022) describes how the same technology was rolled out across the rest of the Jupiter network.

What Apollo did

The TPU's Palomar is, in effect, the fine-grained, fast-reconfiguration sibling of Apollo — reconfigure per-job vs Apollo's per-day. They share supply chain (Google's optical-component partner network), share physics (3D-MEMS), share controller code, and share organisational ownership.

Why this matters strategically

Most hyperscalers buy network silicon from Broadcom or Marvell; the network is procurement, not architecture. Google designs its own OCS, its own network ASICs, and its own custom SerDes for both ICI and Jupiter. The TPU isn't just a chip; it's a chip plus a network. Reproducing the TPU advantage requires reproducing both halves — which is why no other hyperscaler has matched it.

12

Cheat Sheet

  • ICI: Google's custom chip-to-chip SerDes link; 4 ports per e-class chip, 6 per p-class chip; ~62 GB/s (v2) → 1.2 TB/s (Ironwood) per chip.
  • Cube: 4×4×4 = 64 chips, the unit the OCS switches between.
  • Slice: any cube-multiple 3D torus, configured per job via the OCS.
  • Pod: the full ICI-coherent domain (v4: 4,096; v5p: 8,960; Ironwood: 9,216 chips).
  • Palomar OCS: 136-port (128 + 8 spare) 3D-MEMS optical circuit switch; sub-millisecond reconfiguration; <5% of system cost and <3% of system power.
  • Twisted torus: an OCS-configured wrap-around offset that cuts worst-case hops by roughly √2.
  • Healing: faulty cubes are swapped for hot spares by re-tilting mirrors, at sub-second cost per failure.
  • Multipod: pods stitched together over Jupiter DCN, with microsecond-scale latency between pods.

Read next

Deck 11 — Software Stack shows how XLA and JAX use this pod fabric — mesh, sharding, GSPMD, Pathways. Deck 12 — TPU vs GPU contrasts the 3D-torus + OCS approach with NVIDIA's NVLink + InfiniBand fat-tree.