Google TPUs Series — Presentation 10

ICI, OCS & the 3D Torus

How a TPU pod is wired up. Custom electrical SerDes, 2D vs 3D torus, the Palomar 3D-MEMS optical switch, twisted-torus topologies, healing around faulty cubes, and the multipod fabric over Jupiter.

Tags: ICI · SerDes · 2D torus · 3D torus · Palomar OCS · 3D MEMS · Jupiter DCN

Scale hierarchy: SerDes lane → ICI link → cube (4×4×4 = 64 chips) → Palomar OCS → pod (3D torus) → Jupiter DCN → multipod
00

Topics We'll Cover

  • ICI, Google's custom chip-to-chip interconnect
  • 2D vs 3D torus, and which chip class gets which
  • Cubes, slices, and pods, the vocabulary
  • Palomar, the 3D-MEMS optical circuit switch
  • Why optical circuit switching rather than electrical packet switching
  • Twisted tori and job-time topology
  • All-reduce on a 3D torus
  • Healing around faulty cubes
  • Jupiter DCN and multipod
  • ICI bandwidth across TPU generations
  • Mission Apollo, OCS outside the TPU

01

ICI — What It Actually Is

Inter-Chip Interconnect is Google's custom chip-to-chip link. It is not InfiniBand, not Ethernet, not NVLink. It exists for one reason: to make the TPU pod look like a single coherent compute substrate to the compiler, with the lowest possible latency and the highest bandwidth that fits on the package.

Physical

  • Custom Google electrical SerDes lanes, several per ICI port.
  • Per-chip aggregate ICI bandwidth grew from ~62 GB/s (v2) to 1.2 TB/s bidirectional (Ironwood).
  • Ports per chip: 4 on the 2D-torus chips (v2/v3 and the e-class line), 6 on the 3D-torus p-class chips (v4 onwards).
  • Cabled (between cubes / racks) and on-package (within a board) variants.

Logical

  • Each chip has direct ICI peers (its torus neighbours).
  • Hardware support for collective primitives: all-reduce, all-gather, reduce-scatter, all-to-all.
  • Sub-microsecond latency between adjacent chips; hop-by-hop routing for non-neighbours.
  • Software view (via XLA): a chip can "address" any HBM in its slice; the compiler turns that into ICI transfers.
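
A minimal JAX sketch of that software view, assuming a toy 2×2 device mesh (illustrative, not a real slice shape): the user shards an array over named mesh axes and writes ordinary array code, and XLA turns the cross-chip reads into ICI transfers and collectives.

```python
# Hypothetical 2x2 slice; on a real slice the mesh shape follows the ICI topology.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(mesh_utils.create_device_mesh((2, 2)), ("x", "y"))

# Each chip holds one tile of `a` in its own HBM.
a = jax.device_put(jnp.ones((8192, 8192)), NamedSharding(mesh, P("x", "y")))

# Summing along the sharded "y" axis needs values that live in other chips'
# HBM; the compiler emits the reduction over ICI -- no explicit send/recv.
row_sums = jnp.sum(a, axis=1)
```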

Why custom?

NVLink and InfiniBand both came with cost, latency, or proprietary ecosystem disadvantages in the 2016 TPU v2 design window. Google had a network-silicon team that had built the Jupiter switch ASICs; doing custom SerDes for chip-to-chip was a smaller leap than it sounds. Eight years later, the proprietary fabric is one of the TPU programme's main competitive moats.

02

2D vs 3D Torus — The Topology Choice

  • 2D torus (e-class) — 4 ICI links per chip; used by v2 / v3 / v5e / Trillium; pods up to 256 chips.
  • 3D torus (p-class) — 6 ICI links per chip (3 axes × 2 directions); used by v4 / v5p / Ironwood; pods up to 9,216 chips, plus a Palomar OCS layer for inter-cube reconfigurability.

Diameter and bisection
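
These are the standard k-ary n-torus relations (textbook results, not numbers from the deck): with N = k^n chips, worst-case hop count and bisection width scale very differently in 2D and 3D.

```latex
% Standard k-ary n-torus with N = k^n chips (k even):
%   diameter  = worst-case hop count between two chips
%   bisection = number of links cut when the machine is split in half
\[
  \mathrm{diameter}(k,n) = n \left\lfloor \tfrac{k}{2} \right\rfloor ,
  \qquad
  \mathrm{bisection}(k,n) = 2\,k^{\,n-1}\ \text{links.}
\]
% Example at 4,096 chips: a 64x64 2D torus has diameter 64 and bisection 128 links;
% a 16x16x16 3D torus has diameter 24 and bisection 512 links.
```

At a few thousand chips the third axis buys roughly a 3× shorter worst-case path and 4× more bisection links, which is what makes the larger p-class pods viable.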

Why does e-class stay 2D? Cost. 4 ICI ports per chip vs 6 saves silicon and packaging. For 256-chip pods (the e-class size), 2D is fine; for 9,216-chip pods (p-class), 2D would be unworkable.

03

Cubes, Slices, Pods — The Vocabulary

Three terms that appear constantly in TPU documentation. They are not synonyms.

Cube

The smallest physical unit of a 3D torus. Typically 4×4×4 = 64 chips on v4/v5p. Cabled internally with fixed ICI; cabled externally to the OCS for everything bigger.

Slice

A logical, schedulable subset of a pod. Looks to the user like one 3D torus of dimensions they pick. The OCS makes any cube-multiple slice possible.

Pod

The full ICI-coherent compute domain. v4: 4,096 chips. v5p: 8,960 chips. Ironwood: 9,216 chips. A Trillium pod is 256 chips (the e-class shape).

Multipod

Beyond a pod is multipod — multiple pods connected by Jupiter DCN, used by Multislice. From the compiler's view, multipod is a coarser-grained mesh sitting on top of the pods. ICI gives you sub-microsecond latency inside a pod; Jupiter gives you microseconds between them.

A common misreading

"TPU pod" sometimes refers to a literal physical structure (the 256-chip v3 rack-of-racks) and sometimes to the logical scheduling unit (the 8,960-chip v5p ICI domain). Modern Cloud TPU documentation uses pod to mean the logical ICI domain; the physical rack-level structure is implicit.

04

Palomar — Inside the Optical Circuit Switch

Palomar is Google's bespoke 3D-MEMS optical circuit switch, introduced for v4 and disclosed in the ISCA 2023 paper.

Each Palomar OCS is a 136-port (128 + 8 spares) reconfigurable optical switch using 3D-MEMS mirrors with sub-millisecond reconfiguration. Forty-eight Palomar OCSes interconnect the 64 cubes of a 4,096-chip TPU v4 supercomputer. — Jouppi et al., ISCA 2023 (arXiv:2304.01433)

The optical path

  1. An ICI cable carries data as optical signals over fibre from a TPU board's network ASIC to a Palomar OCS port.
  2. The light enters the OCS, passes through micro-lenses, and lands on a 3D-MEMS mirror: a microscopic tiltable mirror in a 12×12 array (one per port).
  3. The mirror's tilt angle is set by a controller, sending the beam onto a second mirror plane.
  4. The second mirror reflects the beam toward the chosen output port.
  5. The light exits via fibre to the destination TPU board.
  6. The data is never deserialised inside the switch — it stays optical end-to-end.

Reconfiguration time

Sub-millisecond. The mirrors tilt mechanically but only a few hundred microns; settle time is dominated by piezoelectric damping.

Port count

136 = 128 active + 8 spares. The spares replace damaged ports and can be rotated into service to spread mirror wear.

Power & cost

OCSes are <5% of system cost and <3% of system power vs an electrical-switch equivalent for the v4 pod.

05

Why Optical, Not Electrical?

The OEO problem

  • An electrical switch must convert Optical → Electrical → Optical at every port.
  • Each conversion costs energy (pJ/bit), latency (nanoseconds), and silicon area for the SerDes.
  • At pod scale (thousands of links), this dominates network power.

OCS skips it

  • Light enters, mirror tilts, light exits. No conversion.
  • No retiming, no buffering, no protocol overhead inside the switch.
  • Power and latency in the switch are independent of bandwidth — doubling bandwidth doesn't double switch power.

The deeper reason: bandwidth grows; switching latency must shrink

For a packet switch, latency goes up as you push more bandwidth through (queuing, contention). For a circuit switch, you set up the path once and stream data through it forever. ML workloads have predictable communication patterns — you all-reduce the same gradients every step. That's exactly the workload a circuit switch is good at.

Why this hadn't been done before

Datacenter networking had been firmly packet-switched since the 1990s because workloads were unpredictable (web, search, analytics). ML training in 2020 broke that assumption: the same all-reduce, every step, forever. That's the workload OCS is designed for. Google's Mission Apollo paper (2022) generalises the idea to the rest of the datacenter network.

06

Twisted Tori & Job-Time Topology

Because OCS reconfigures per job, v4-and-later pods can present different topology shapes to different jobs without recabling. The most useful new shape is the twisted torus.

  • Standard 4×4 torus: the wrap-around connects row N back to row N (column 0 ↔ column N−1).
  • Twisted 4×4 torus: the wrap-around shifts by one row, which cuts the diameter.

Why twisting helps

In a standard k×k torus, the diameter is 2×floor(k/2) = k (for even k). In a twisted torus where the wrap-around is offset by a well-chosen constant m, the diameter drops by a factor of roughly √2, to about k/√2. For the 9,216-chip Ironwood pod, that's the difference between a worst-case 32-hop path and a worst-case 22-hop path — significant for tail-latency-sensitive collectives.

You cannot wire a twisted torus in fixed cabling because the offset depends on slice shape. Twisted topology is a free upgrade you get from having an OCS.
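
A small sketch of that diameter claim, assuming a simple 2D model (plain breadth-first search, not Google's routing): it computes the diameter of a k×k torus whose X wrap-around is offset by a twist of m rows, so the plain and twisted worst cases can be compared directly.

```python
# Diameter of a k x k torus whose wrap-around in x is offset by `twist` rows.
# The graph is vertex-transitive, so BFS from one node gives the diameter.
from collections import deque

def neighbours(x, y, k, twist):
    yield x, (y + 1) % k                               # +y ring
    yield x, (y - 1) % k                               # -y ring
    nx = x + 1                                         # +x: wrapping shifts y by twist
    yield nx % k, (y + twist) % k if nx == k else y
    nx = x - 1                                         # -x: wrapping undoes the shift
    yield nx % k, (y - twist) % k if nx < 0 else y

def diameter(k, twist=0):
    dist = {(0, 0): 0}
    frontier = deque([(0, 0)])
    while frontier:
        x, y = frontier.popleft()
        for nxt in neighbours(x, y, k, twist):
            if nxt not in dist:
                dist[nxt] = dist[(x, y)] + 1
                frontier.append(nxt)
    return max(dist.values())

print(diameter(8))            # plain 8x8 torus: diameter 8
print(diameter(8, twist=4))   # twisted wrap-around: noticeably smaller
```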

07

All-Reduce on a 3D Torus

The most important communication pattern in TPU pod-scale training. It runs after every gradient step in data-parallel training, after the attention and MLP blocks in tensor-parallel layers, and around every expert pass in MoE.

The textbook 3D-torus algorithm

  1. Reduce-scatter along X. Each row of chips reduces and scatters its values along the X axis. Each chip ends up with one X-stride of the partial sum.
  2. Reduce-scatter along Y. Same, perpendicular axis.
  3. Reduce-scatter along Z. Same, third axis.
  4. All-gather along Z, Y, X, in reverse, to broadcast the final result.

Per axis of length k, a ring reduce-scatter or all-gather takes roughly (k−1)/k × (bytes moved on that axis) / (per-link bandwidth), and the first axis, which moves the full gradient, dominates. For a 9,216-chip pod arranged as 16×24×24 with 600 GB/s per link per direction, an all-reduce on a 1 GB gradient takes well under 10 ms wall-clock.
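
A back-of-the-envelope estimator for that figure, a sketch of the textbook formula above rather than a model of the real scheduler (which overlaps phases and axes):

```python
def allreduce_time_s(grad_bytes, dims, link_bw_bytes_per_s):
    """Reduce-scatter along each torus axis, then all-gather back in reverse."""
    t, size = 0.0, float(grad_bytes)
    for k in dims:                                     # reduce-scatter X, Y, Z
        t += (k - 1) / k * size / link_bw_bytes_per_s
        size /= k                                      # each chip keeps a 1/k shard
    for k in reversed(dims):                           # all-gather Z, Y, X
        size *= k                                      # shard grows back by k
        t += (k - 1) / k * size / link_bw_bytes_per_s
    return t

# 1 GB gradient on a 16 x 24 x 24 slice, 600 GB/s per link per direction
print(f"{allreduce_time_s(1e9, (16, 24, 24), 600e9) * 1e3:.1f} ms")   # ~3.3 ms
```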

Hardware all-reduce

The TPU's ICI link controllers contain dedicated reduce/gather hardware, so the chip doesn't have to use its vector unit to do the additions during an all-reduce. This is one of the underrated parts of the architecture: collective operations don't steal compute time from the matmul. NVIDIA achieves the same result with NCCL, but NCCL's kernels run on the SMs and compete with the compute kernels.
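
From user code the collective is a single call either way; a minimal pmap-style sketch (the offload to ICI hardware is transparent, so nothing in the program changes whether the adds happen in the link controllers or on the vector unit):

```python
from functools import partial
import jax
import jax.numpy as jnp

@partial(jax.pmap, axis_name="chips")          # one program instance per TPU chip
def sync_grads(local_grad):
    # psum is lowered by XLA to the torus all-reduce described above.
    return jax.lax.psum(local_grad, axis_name="chips")

grads = jnp.ones((jax.local_device_count(), 1024))   # one shard per local chip
summed = sync_grads(grads)                            # every chip receives the full sum
```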

08

Healing Around Faulty Cubes

At 9,216-chip scale, hardware will fail mid-job. The OCS makes that survivable.

  1. A cube's monitoring detects an unrecoverable error on one of its 64 chips.
  2. The OCS controller is notified; a hot-spare cube (kept reserved for this purpose) is brought online.
  3. The mirrors that route to the failed cube are tilted to point at the spare cube instead.
  4. The training job pauses for a few hundred milliseconds (single ICI heartbeat), then resumes.
  5. Loss of progress: a few seconds at most. Compare to a checkpoint-and-restart cost of minutes.

What this is worth

For a 50-day PaLM-class run on 6,144 chips, you expect somewhere between 5 and 20 chip failures. Each unrecoverable failure on a non-OCS pod would mean: stop, checkpoint, debug, replace, resume — conservatively an hour. With OCS hot-spare rerouting, each failure costs <1 second. Over a 50-day run, that's the difference between losing one day to faults and losing none.
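
The arithmetic behind that comparison, using the numbers above (both per-failure costs are the deck's stated assumptions, not measurements):

```python
failures = 20                               # pessimistic end of the 5-20 range
hours_lost_without_ocs = failures * 1.0     # ~1 hour per checkpoint-debug-restart cycle
hours_lost_with_ocs = failures * 1 / 3600   # <1 s per hot-spare reroute
print(f"{hours_lost_without_ocs:.0f} h vs {hours_lost_with_ocs:.3f} h")   # 20 h vs 0.006 h
```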

The NSDI 2024 paper

"Resiliency at Scale: Managing Google's TPUv4 ML Supercomputer" (USENIX NSDI 2024) is the systems-engineering counterpart to ISCA 2023. It documents the failure modes seen in production v4 pods and the orchestration strategies (checkpointing cadence, gradient-skip thresholds, hot-spare reservation policy) that keep multi-week jobs running. Worth reading even outside a TPU context.

09

Jupiter DCN & Multipod

Beyond the pod is Google's Jupiter datacenter network — the same network fabric that connects every other Google service. Modern Jupiter is itself OCS-based after Mission Apollo (2022).

What Jupiter is

  • Hierarchical fat-tree of OCS switches and electrical packet switches.
  • Bisection bandwidth at the datacenter level: 13 Pb/s for a Trillium multipod region.
  • Connects every TPU pod, every CPU rack, every storage node in a region.
  • Latency between pods: tens of microseconds typical.

What multipod uses it for

  • Cross-pod gradient all-reduce in data-parallel training spanning multiple pods.
  • Pipeline-parallel sharding across pods.
  • Pathways control-plane traffic (compilation cache, scheduling, monitoring).
  • Disaggregated checkpoint storage and dataset streaming.

Multipod composition limits

The relationship is layered: ICI inside a pod, OCS inside the pod for topology reconfiguration, Jupiter for inter-pod, packet-switching only at the very top.

10

ICI Bandwidth Across TPU Generations

| Generation | Per-chip ICI (bidir) | Ports | Topology | Pod size (chips) |
| --- | --- | --- | --- | --- |
| v1 | 0 (PCIe only) | — | — | 1 |
| v2 | ~62 GB/s aggregate | 4 | 2D torus | 256 |
| v3 | ~82 GB/s aggregate (estimated) | 4 | 2D torus | 1,024 |
| v4 | ~300 GB/s | 6 | 3D torus + OCS | 4,096 |
| v5e | 400 GB/s | 4 | 2D torus | 256 |
| v5p | 1.2 TB/s | 6 | 3D torus + OCS | 8,960 |
| Trillium (v6e) | 800 GB/s | 4 | 2D torus | 256 |
| Ironwood (v7) | 1.2 TB/s | 6 | 3D torus + OCS | 9,216 |

Reading the trends

Per-chip ICI bandwidth has grown roughly 20× from v2 to Ironwood (~62 GB/s → 1.2 TB/s). The 3D-torus + OCS generations (v4, v5p, Ironwood) carry 6 ports and multi-thousand-chip pods, while the e-class generations (v5e, Trillium) hold at 4 ports and 256-chip 2D-torus pods. v5p and Ironwood share the same 1.2 TB/s figure; Ironwood's gain is in pod scale rather than per-chip bandwidth.

11

Mission Apollo — OCS Outside the TPU

The Palomar OCS is the most visible Google use of optical circuit switching, but not the only one. "Mission Apollo: A Modern Datacenter Network with OCS" (Google, arXiv:2208.10041, 2022) describes how the same technology was rolled out across the rest of the Jupiter network.

What Apollo did

The TPU's Palomar is, in effect, the fine-grained, fast-reconfiguration sibling of Apollo — reconfigure per-job vs Apollo's per-day. They share supply chain (Google's optical-component partner network), share physics (3D-MEMS), share controller code, and share organisational ownership.

Why this matters strategically

Most hyperscalers buy network silicon from Broadcom or Marvell; the network is procurement, not architecture. Google designs its own OCS, its own network ASICs, and its own custom SerDes for both ICI and Jupiter. The TPU isn't just a chip; it's a chip plus a network. Reproducing the TPU advantage requires reproducing both halves — which is why no other hyperscaler has matched it.

12

Cheat Sheet

  • ICI: Google's custom chip-to-chip SerDes link; 4 ports per e-class chip, 6 per p-class chip; ~62 GB/s (v2) → 1.2 TB/s (Ironwood) per chip.
  • Cube: 4×4×4 = 64 chips, the unit the OCS switches between.
  • Slice: any cube-multiple 3D torus, configured per job via the OCS.
  • Pod: the full ICI-coherent domain (v4: 4,096; v5p: 8,960; Ironwood: 9,216 chips).
  • Palomar OCS: 136-port (128 + 8 spare) 3D-MEMS optical circuit switch; sub-millisecond reconfiguration; <5% of system cost and <3% of system power.
  • Twisted torus: an OCS-configured wrap-around offset that cuts worst-case hops by roughly √2.
  • Healing: faulty cubes are swapped for hot spares by re-tilting mirrors, at sub-second cost per failure.
  • Multipod: pods stitched together over Jupiter DCN, with microsecond-scale latency between pods.

Read next

Deck 11 — Software Stack shows how XLA and JAX use this pod fabric — mesh, sharding, GSPMD, Pathways. Deck 12 — TPU vs GPU contrasts the 3D-torus + OCS approach with NVIDIA's NVLink + InfiniBand fat-tree.