Presentations in This Series
- History & People — How the TPU Programme Began → The 2013 voice-search projection, Norm Jouppi's career arc from Stanford MIPS to Google, David Patterson, the 22-day silicon-to-datacenter dash, AlphaGo, the ISCA 2017 / CACM 2020 / ISCA 2023 papers.
- Ten Years of TPUs — v1 to Ironwood → Timeline of every TPU generation, what each one unlocked, the e-class / p-class fork, peak FLOPS / HBM / pod scaling, models trained on each. Interactive generation explorer.
- Systolic Arrays — The Matmul Engine Inside → Kung & Leiserson 1978, "Why Systolic Architectures?" 1982, Warp / iWarp, weight-stationary vs output-stationary vs row-stationary dataflows, how a 256×256 array maps a matmul. Interactive matmul wavefront.
- Inside TPU v1 — The 2015 Inference Chip → 28 nm, 256×256 INT8 systolic, 24 MiB unified buffer, 8 GiB DDR3, the brilliantly minimal CISC ISA, the 92 TOPS roofline and what Google learned about memory bandwidth.
- TPU v2 & v3 — The Training Era Begins → bfloat16 (Google Brain's invention), HBM arrives, two TensorCores per chip, the 2D torus pod, why v3 needed liquid cooling.
- TPU v4 — OCS, SparseCore & Palomar → 7 nm, 3D torus, the Palomar 3D-MEMS optical circuit switch, SparseCore embedding accelerator, CMEM, the v4i inference sibling, PaLM at 6,144 chips.
- TPU v5e & v5p — The Two-Track Fork → The efficiency / performance product fork, v5p as the Gemini training flagship, 95 GB HBM, the 8,960-chip pod, multislice over Jupiter DCN.
- Trillium & Ironwood — v6e and v7 → Trillium's 4.7× jump and 3rd-gen SparseCore, Ironwood's 4.6 PFLOPS FP8 / 192 GB HBM3e per chip, the 9,216-chip "age of inference" pod.
- Memory Hierarchy & Numerics → VMEM, CMEM, HBM2 to HBM3e, the bf16 invention story, INT8 + FP8, accumulator widths, why each precision change doubled effective FLOPS.
- ICI, OCS & the 3D Torus → Custom-SerDes Inter-Chip Interconnect, 2D vs 3D torus, the Palomar 136-port optical switch, twisted-torus topologies, healing around faulty cubes, multipod over Jupiter DCN.
- The TPU Software Stack — XLA, JAX, Pallas → XLA / HLO / StableHLO, JAX transformations and sharding, GSPMD & Shardy partitioning, Pallas / Mosaic kernels, PyTorch-XLA & TorchTPU, MaxText, Pathways, Multislice.
- TPU vs GPU — Two Architectural Philosophies → Static-compiler vs dynamic-warp scheduling, scratchpad vs cache, AOT graph vs JIT kernel, 3D torus vs fat-tree InfiniBand, where each wins, the cloud-only constraint.
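As a taste of the systolic-array talk, here is a minimal cycle-level sketch of the weight-stationary dataflow it covers — a deliberate simplification (pure Python, no pipelined I/O or accumulator widths), not how any real MXU is implemented:

```python
def systolic_matmul(A, W):
    """Cycle-level sketch of a weight-stationary systolic array.

    PE(k, n) holds weight W[k][n]. Activations enter row k from the left,
    skewed by k cycles; partial sums flow down each column, so row m of
    C = A @ W emerges from the bottom edge, one column per cycle.
    """
    M, K, N = len(A), len(W), len(W[0])
    a_reg = [[0.0] * N for _ in range(K)]   # activation registers (flow right)
    p_reg = [[0.0] * N for _ in range(K)]   # partial-sum registers (flow down)
    C = [[0.0] * N for _ in range(M)]
    for t in range(M + K + N - 2):          # total wavefront latency
        # Sweep back-to-front so each PE reads last cycle's neighbor values.
        for k in reversed(range(K)):
            for n in reversed(range(N)):
                m = t - k                   # activation row reaching array row k now
                a_in = (A[m][k] if n == 0 and 0 <= m < M
                        else a_reg[k][n - 1] if n > 0 else 0.0)
                p_in = p_reg[k - 1][n] if k > 0 else 0.0
                a_reg[k][n] = a_in                      # pass activation right
                p_reg[k][n] = p_in + a_in * W[k][n]     # accumulate, pass down
                if k == K - 1 and 0 <= t - (K - 1) - n < M:
                    C[t - (K - 1) - n][n] = p_reg[k][n]  # psum exits the array

    return C

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# → [[19.0, 22.0], [43.0, 50.0]]
```

A TPU v1 MXU is this picture with K = N = 256 and INT8 operands: the weights sit still, and the input skew is why the interactive "matmul wavefront" in the talk looks like a diagonal front sweeping across the grid.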
Why a TPU sub-hub?
The TPU has been the silent partner of LLM history — AlphaGo ran on v1, every public Gemini model trained on v4 or v5p, and Ironwood pods serve more inference traffic than any rack of GPUs at Google. Yet most LLM engineers know the chip only as a black box behind jax.jit. This series opens the box: the hardware, the interconnect, the compiler, and the ten years of design choices that produced something genuinely different from a GPU.
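Cracking that black box open takes three lines: every jax.jit call traces the function once and hands the whole graph to XLA ahead of time, and you can print the StableHLO the compiler actually sees. A minimal sketch (the layer and shapes here are made up for illustration; requires jax installed, and runs on CPU if no TPU is attached):

```python
import jax
import jax.numpy as jnp

# A toy layer: on TPU, XLA maps the jnp.dot onto the MXU systolic array.
def layer(x, w):
    return jax.nn.relu(jnp.dot(x, w))

x = jnp.ones((8, 128))
w = jnp.ones((128, 256))

fast = jax.jit(layer)        # trace once, compile the whole graph via XLA
print(fast(x, w).shape)      # → (8, 256)

# Peek inside the black box: the StableHLO handed to the compiler.
print(jax.jit(layer).lower(x, w).as_text()[:400])
```

The whole-graph hand-off is the point of the TPU-vs-GPU talk above: the compiler sees every op at once, which is what makes static scheduling onto a scratchpad-based chip possible.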