Presentations in This Series
- History & People — How the TPU Programme Began → The 2013 voice-search projection, Norm Jouppi's career arc from Stanford MIPS to Google, David Patterson, the 22-day silicon-to-datacenter dash, AlphaGo, the ISCA 2017 / CACM 2020 / ISCA 2023 papers.
- Ten Years of TPUs — v1 to Ironwood → Timeline of every TPU generation, what each one unlocked, the e-class / p-class fork, peak FLOPS / HBM / pod scaling, models trained on each. Interactive generation explorer.
- Systolic Arrays — The Matmul Engine Inside → Kung & Leiserson 1978, "Why Systolic Architectures?" 1982, Warp / iWarp, weight-stationary vs output-stationary vs row-stationary dataflows, how a 256×256 array maps a matmul. Interactive matmul wavefront.
- Inside TPU v1 — The 2015 Inference Chip → 28 nm, 256×256 INT8 systolic, 24 MiB unified buffer, 8 GiB DDR3, the brilliantly minimal CISC ISA, the 92 TOPS roofline and what Google learned about memory bandwidth.
- TPU v2 & v3 — The Training Era Begins → bfloat16 (Google Brain's invention), HBM arrives, two TensorCores per chip, the 2D torus pod, why v3 needed liquid cooling.
- TPU v4 — OCS, SparseCore & Palomar → 7 nm, 3D torus, the Palomar 3D-MEMS optical circuit switch, SparseCore embedding accelerator, CMEM, the v4i inference sibling, PaLM at 6,144 chips.
- TPU v5e & v5p — The Two-Track Fork → The efficiency / performance product fork, v5p as the Gemini training flagship, 95 GB HBM, the 8,960-chip pod, multislice over Jupiter DCN.
- Trillium & Ironwood — v6e and v7 → Trillium's 4.7× jump and 3rd-gen SparseCore, Ironwood's 4.6 PFLOPS FP8 / 192 GB HBM3e per chip, the 9,216-chip "age of inference" pod.
- Memory Hierarchy & Numerics → VMEM, CMEM, HBM2 to HBM3e, the bf16 invention story, INT8 + FP8, accumulator widths, why each precision change doubled effective FLOPS.
- ICI, OCS & the 3D Torus → Custom-SerDes Inter-Chip Interconnect, 2D vs 3D torus, the Palomar 136-port optical switch, twisted-torus topologies, healing around faulty cubes, multipod over Jupiter DCN.
- The TPU Software Stack — XLA, JAX, Pallas → XLA / HLO / StableHLO, JAX transformations and sharding, GSPMD & Shardy partitioning, Pallas / Mosaic kernels, PyTorch-XLA & TorchTPU, MaxText, Pathways, Multislice.
- TPU vs GPU — Two Architectural Philosophies → Static-compiler vs dynamic-warp scheduling, scratchpad vs cache, AOT graph vs JIT kernel, 3D torus vs fat-tree InfiniBand, where each wins, the cloud-only constraint.
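As a taste of the systolic-array talk, here is a minimal cycle-level sketch of the weight-stationary dataflow it covers — a deliberate simplification (pure Python, no pipelined I/O or accumulator widths), not how any real MXU is implemented:

```python
def systolic_matmul(A, W):
    """Cycle-level sketch of a weight-stationary systolic array.

    PE(k, n) holds weight W[k][n]. Activations enter row k from the left,
    skewed by k cycles; partial sums flow down each column, so row m of
    C = A @ W emerges from the bottom edge, one column per cycle.
    """
    M, K, N = len(A), len(W), len(W[0])
    a_reg = [[0.0] * N for _ in range(K)]   # activation registers (flow right)
    p_reg = [[0.0] * N for _ in range(K)]   # partial-sum registers (flow down)
    C = [[0.0] * N for _ in range(M)]
    for t in range(M + K + N - 2):          # total wavefront latency
        # Sweep back-to-front so each PE reads last cycle's neighbor values.
        for k in reversed(range(K)):
            for n in reversed(range(N)):
                m = t - k                   # activation row reaching array row k now
                a_in = (A[m][k] if n == 0 and 0 <= m < M
                        else a_reg[k][n - 1] if n > 0 else 0.0)
                p_in = p_reg[k - 1][n] if k > 0 else 0.0
                a_reg[k][n] = a_in                      # pass activation right
                p_reg[k][n] = p_in + a_in * W[k][n]     # accumulate, pass down
                if k == K - 1 and 0 <= t - (K - 1) - n < M:
                    C[t - (K - 1) - n][n] = p_reg[k][n]  # psum exits the array

    return C

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# → [[19.0, 22.0], [43.0, 50.0]]
```

A TPU v1 MXU is this picture with K = N = 256 and INT8 operands: the weights sit still, and the input skew is why the interactive "matmul wavefront" in the talk looks like a diagonal front sweeping across the grid.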
Why a TPU sub-hub?
The TPU has been the silent partner of LLM history — AlphaGo ran on v1, every public Gemini model trained on v4 or v5p, and Ironwood pods serve more inference traffic than any rack of GPUs at Google. Yet most LLM engineers know the chip only as a black box behind jax.jit. This series opens the box: the hardware, the interconnect, the compiler, and the ten years of design choices that produced something genuinely different from a GPU.
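Cracking that black box open takes three lines: every jax.jit call traces the function once and hands the whole graph to XLA ahead of time, and you can print the StableHLO the compiler actually sees. A minimal sketch (the layer and shapes here are made up for illustration; requires jax installed, and runs on CPU if no TPU is attached):

```python
import jax
import jax.numpy as jnp

# A toy layer: on TPU, XLA maps the jnp.dot onto the MXU systolic array.
def layer(x, w):
    return jax.nn.relu(jnp.dot(x, w))

x = jnp.ones((8, 128))
w = jnp.ones((128, 256))

fast = jax.jit(layer)        # trace once, compile the whole graph via XLA
print(fast(x, w).shape)      # → (8, 256)

# Peek inside the black box: the StableHLO handed to the compiler.
print(jax.jit(layer).lower(x, w).as_text()[:400])
```

The whole-graph hand-off is the point of the TPU-vs-GPU talk above: the compiler sees every op at once, which is what makes static scheduling onto a scratchpad-based chip possible.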