LLM Hub — Google TPUs

Google TPU Architectures

A twelve-deck tour of Google's Tensor Processing Unit programme, from the 2013 voice-search napkin maths to the 9,216-chip Ironwood pod. History, silicon, interconnect, software, and the architectural philosophy that makes a TPU a TPU.

v1 · v2 · v3 · v4 · v4i · v5e · v5p · Trillium · Ironwood · Systolic · XLA · JAX

Presentations in This Series

  1. History & People — How the TPU Programme Began →
    The 2013 voice-search projection, Norm Jouppi's career arc from Stanford MIPS to Google, David Patterson, the 22-day silicon-to-datacenter dash, AlphaGo, the ISCA 2017 / CACM 2020 / ISCA 2023 papers.
    live
  2. Ten Years of TPUs — v1 to Ironwood →
    Timeline of every TPU generation, what each one unlocked, the e-class / p-class fork, peak FLOPS / HBM / pod scaling, models trained on each. Interactive generation explorer.
    live
  3. Systolic Arrays — The Matmul Engine Inside →
    Kung & Leiserson 1978, "Why Systolic Architectures?" 1982, Warp / iWarp, weight-stationary vs output-stationary vs row-stationary dataflows, how a 256×256 array maps a matmul (a minimal simulation sketch follows this list). Interactive matmul wavefront.
    live
  4. Inside TPU v1 — The 2015 Inference Chip →
    28 nm, 256×256 INT8 systolic array, 24 MiB unified buffer, 8 GiB DDR3, the brilliantly minimal CISC ISA, the 92 TOPS roofline (worked numbers follow this list) and what Google learned about memory bandwidth.
    live
  5. TPU v2 & v3 — The Training Era Begins →
    bfloat16 (Google Brain's invention), HBM arrives, two TensorCores per chip, the 2D torus pod, why v3 needed liquid cooling.
    live
  6. TPU v4 — OCS, SparseCore & Palomar →
    7 nm, 3D torus, the Palomar 3D-MEMS optical circuit switch, SparseCore embedding accelerator, CMEM, the v4i inference sibling, PaLM at 6,144 chips.
    live
  7. TPU v5e & v5p — The Two-Track Fork →
    The efficiency / performance product fork, v5p as the Gemini training flagship, 95 GB HBM, the 8,960-chip pod, multislice over Jupiter DCN.
    live
  8. Trillium & Ironwood — v6e and v7 →
    Trillium's 4.7× jump and 3rd-gen SparseCore, Ironwood's 4.6 PFLOPS FP8 / 192 GB HBM3e per chip, the 9,216-chip "age of inference" pod.
    live
  9. Memory Hierarchy & Numerics →
    VMEM, CMEM, HBM2 to HBM3e, the bf16 invention story, INT8 + FP8, accumulator widths, why each precision change doubled effective FLOPS.
    live
  10. ICI, OCS & the 3D Torus →
    Custom-SerDes Inter-Chip Interconnect, 2D vs 3D torus, the Palomar 136-port optical switch, twisted-torus topologies, healing around faulty cubes, multipod over Jupiter DCN.
    live
  11. The TPU Software Stack — XLA, JAX, Pallas →
    XLA / HLO / StableHLO, JAX transformations and sharding, GSPMD & Shardy partitioning, Pallas / Mosaic kernels, PyTorch-XLA & TorchTPU, MaxText, Pathways, Multislice.
    live
  12. TPU vs GPU — Two Architectural Philosophies →
    Static-compiler vs dynamic-warp scheduling, scratchpad vs cache, AOT graph vs JIT kernel, 3D torus vs fat-tree InfiniBand, where each wins, the cloud-only constraint.
    live
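
Deck 3's core idea fits in a few dozen lines. The sketch below is plain NumPy, nothing TPU-specific, and every name and shape is illustrative: it simulates a weight-stationary systolic array cycle by cycle, with weights pinned in the PE grid, activations streaming in from the left one skewed diagonal per cycle, and partial sums flowing downward until finished dot products drip out of the bottom edge.

    import numpy as np

    def systolic_matmul(A, W):
        """Cycle-by-cycle sketch of a weight-stationary systolic array.

        W (K x N) is pre-loaded, one weight per processing element (PE).
        A (M x K) streams in from the left, one skewed diagonal per cycle.
        Partial sums flow downward; C = A @ W emerges from the bottom row.
        """
        M, K = A.shape
        _, N = W.shape
        a_reg = np.zeros((K, N))   # activation register inside each PE
        p_reg = np.zeros((K, N))   # partial-sum register inside each PE
        C = np.zeros((M, N))
        for t in range(M + K + N - 2):          # pipeline fill + steady state + drain
            a_new, p_new = np.zeros_like(a_reg), np.zeros_like(p_reg)
            for k in range(K):                  # PE rows hold weight rows
                for j in range(N):              # PE columns hold weight columns
                    if j == 0:                  # inject A, skewed by row index
                        a_in = A[t - k, k] if 0 <= t - k < M else 0.0
                    else:                       # otherwise take the left neighbour
                        a_in = a_reg[k, j - 1]
                    p_in = p_reg[k - 1, j] if k > 0 else 0.0
                    a_new[k, j] = a_in                   # activation moves right
                    p_new[k, j] = p_in + a_in * W[k, j]  # MAC, sum moves down
            a_reg, p_reg = a_new, p_new
            for j in range(N):                  # harvest the bottom edge
                m = t - (K - 1) - j
                if 0 <= m < M:
                    C[m, j] = p_reg[K - 1, j]
        return C

    A, W = np.random.randn(4, 3), np.random.randn(3, 5)
    assert np.allclose(systolic_matmul(A, W), A @ W)

A real MXU is the same grid in silicon, at 256×256 in v1 and 128×128 from v2 onwards: no instruction fetch and no cache lookup per multiply-accumulate, just registers handing operands to their neighbours, which is where the TPU's FLOPS-per-watt story starts.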
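Deck 4's 92 TOPS headline is likewise just arithmetic. The numbers below use only figures from the ISCA 2017 paper (a 256×256 INT8 MAC array clocked at 700 MHz); the memory-bandwidth argument to the roofline helper is deliberately left as a parameter to fill in from the paper's DDR3 figure.

    # Peak throughput of TPU v1: each MAC does one multiply and one add per cycle.
    MACS        = 256 * 256            # 65,536 INT8 multiply-accumulate units
    OPS_PER_MAC = 2
    CLOCK_HZ    = 700e6                # 700 MHz, per the ISCA 2017 paper
    PEAK_OPS    = MACS * OPS_PER_MAC * CLOCK_HZ
    print(f"{PEAK_OPS / 1e12:.1f} TOPS")   # ~91.8, the quoted "92 TOPS"

    def attainable_ops(ops_per_byte, mem_bw_bytes_per_s, peak=PEAK_OPS):
        """Classic roofline: throughput is capped by compute or by memory."""
        return min(peak, ops_per_byte * mem_bw_bytes_per_s)

With v1's weights fetched over slow DDR3, most of the 2015 production models sat well to the left of the ridge point, which is the memory-bandwidth lesson the deck unpacks and part of why v2 moved to HBM.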

Why a TPU sub-hub? The TPU has been the silent partner of LLM history — AlphaGo ran on v1, every public Gemini model trained on v4 or v5p, and Ironwood pods serve more inference traffic than any rack of GPUs at Google. Yet most LLM engineers know the chip only as a black box behind jax.jit. This series opens the box: the hardware, the interconnect, the compiler, and the ten years of design choices that produced something genuinely different from a GPU.
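
If jax.jit is the only place you meet the TPU, the box opens more easily than it looks. Here is a minimal sketch (any recent JAX version, runs on CPU; the function and shapes are made up for illustration) of the two artefacts the series keeps returning to: the traced jaxpr and the lowered StableHLO that XLA's TPU backend actually compiles.

    import jax
    import jax.numpy as jnp

    def ffn(x, w):
        # One matmul: the op the MXU's systolic array exists to serve.
        return jnp.dot(x, w)

    x = jnp.ones((128, 512), dtype=jnp.bfloat16)
    w = jnp.ones((512, 1024), dtype=jnp.bfloat16)

    print(jax.make_jaxpr(ffn)(x, w))             # the traced program
    print(jax.jit(ffn).lower(x, w).as_text())    # the StableHLO handed to XLA

Everything downstream of that text (MXU tiling, VMEM allocation, ICI collectives) is what decks 9 through 11 cover.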