Google TPUs Series — Presentation 12

TPU vs GPU — Two Architectural Philosophies

Static-compiler scheduling vs dynamic warp scheduling. Scratchpad vs cache. AOT graph vs JIT kernel. 3D torus vs fat-tree InfiniBand. Cloud-only vs everywhere. The same workload, two answers.

TPU: XLA AOT, VMEM scratchpad, 3D torus + OCS. GPU: CUDA JIT, L1/L2 cache, NVLink + InfiniBand.
Same workload, two architectures, two software stacks, two pod fabrics, different wins and losses.
00

Topics We'll Cover

01

Two Different Starting Premises

TPU — "we know the workload"

  • Designed in 2013–14 around a known-and-stable workload: dense matmul, low-precision integers, pre-Transformer.
  • Compiler-first; chip is dumb, the toolchain is smart.
  • Statically scheduled, software-managed memory.
  • One vendor, one cloud, one customer (Google) for the first three generations.
  • Single workload class; specialise relentlessly.

GPU — "we don't know the workload"

  • Originated as 1990s graphics processors; ML usage came as a happy accident in 2009–12.
  • Hardware-first; many cores, fast warp switching, robust to whatever code you throw at it.
  • Dynamically scheduled, hardware-managed cache.
  • Many vendors (NVIDIA, AMD, Intel), many platforms (gaming, HPC, ML, crypto, video).
  • Many workload classes; generalise across all of them.

Both architectures reached "AI accelerator" status, but from opposite directions. The TPU started as an ASIC and grew toward generality (vector unit in v2, embedding accelerator in v4, FP8 in v7). The GPU started as a generalist and grew toward specialisation (tensor cores in V100, FP8 in H100, FP4 in B200). By 2026 they converge structurally; the philosophical difference persists in what "performance" means.

A useful question to ask

"What does the chip do when it's idle?" GPU: nothing — the warp scheduler is waiting for ready warps, the SMs sleep, caches retain whatever was there. TPU: literally the program told it to stall — every cycle is on the schedule, so an idle cycle is a compiler decision. This dichotomy explains a lot of downstream behaviour.

02

Compute Granularity

A few huge MXUs

  • Per chip: 2 TensorCores × 4 MXUs × 128×128 = 8 systolic arrays of 16,384 MACs each (v4/v5p/Ironwood class).
  • Total: ~131,000 MACs per chip, all running in lockstep.
  • One thread of control per TensorCore, dispatched by XLA-emitted VLIW.
  • Few but huge units — the philosophy is "make the array as big as physically possible and feed it perfectly".

Many small SMs

  • Per H100: 132 SMs, each with 4 tensor cores, plus FP/INT pipelines.
  • Per B200 (dual-die): 208 SMs total.
  • Tens of thousands of warps in flight across all SMs.
  • Each tensor core is a much smaller MMA block than an MXU (exposed on Hopper through warpgroup-level wgmma instructions).
  • Many small units — the philosophy is "have so many independent execution contexts that something is always ready".

The implication for utilisation

A TPU MXU achieves its peak only on contractions deep enough to amortise pipeline fill (slide 10 of deck 03). A GPU SM achieves its peak as soon as enough warps are resident; the warp scheduler can usually find one. TPUs are easier to underutilise than GPUs for short or irregular kernels. Conversely, when the workload is long dense matmul, the TPU's lockstep array is more efficient than 132 small ones with their associated control overhead.
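
To make the fill-cost intuition concrete, here is a toy utilisation model in Python. It is a sketch only: the 2×128-cycle fill/drain cost and the one-reduction-step-per-cycle feed rate are assumptions for illustration, not measured MXU figures.

```python
# Toy model: fraction of peak a 128x128 systolic array reaches on a
# (128, k) x (k, 128) contraction, assuming a fixed fill/drain overhead.
ARRAY_DIM = 128
FILL_DRAIN = 2 * ARRAY_DIM      # assumed pipeline fill + drain cycles

def mxu_utilisation(k: int) -> float:
    useful_cycles = k           # one reduction step streamed per cycle
    return useful_cycles / (useful_cycles + FILL_DRAIN)

for k in (128, 512, 4096, 16384):
    print(f"k={k:>6}: {mxu_utilisation(k):6.1%} of peak")
# k=128 -> ~33%, k=16384 -> ~98%: short contractions leave the array starving.
```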

03

Scheduling: Static Compiler vs Dynamic Warp

Static, compiler-issued (TPU)

  • XLA emits a VLIW-style schedule for each TensorCore.
  • Every cycle: scalar unit reads instructions, dispatches to MXU, vector unit, ICI engine.
  • No branch predictor, no speculation, no out-of-order.
  • Latency hidden by software-pipelined DMA: the load of the next tile is issued while the current tile is being computed (see the sketch after this list).
  • Performance is deterministic: the same input takes the same number of cycles every time.
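
A minimal NumPy sketch of that software-pipelined schedule is below. The plain array copies stand in for asynchronous VMEM DMAs, and the 128-wide tile is an arbitrary choice; the point is the shape of the schedule (prefetch tile i+1, consume tile i, swap buffers), not the mechanism.

```python
import numpy as np

TILE = 128  # illustrative tile width, not a hardware constant

def pipelined_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Tiled matmul written in the prefetch/compute/swap shape XLA schedules."""
    n_tiles = a.shape[1] // TILE
    acc = np.zeros((a.shape[0], b.shape[1]), dtype=np.float32)
    # Prologue: "DMA" the first tile pair into the working buffers.
    a_buf, b_buf = a[:, :TILE].copy(), b[:TILE, :].copy()
    for i in range(n_tiles):
        if i + 1 < n_tiles:
            nxt = slice((i + 1) * TILE, (i + 2) * TILE)
            # Issue the prefetch of tile i+1 (an async DMA on real hardware)...
            a_next, b_next = a[:, nxt].copy(), b[nxt, :].copy()
        # ...while tile i is consumed by the matrix unit.
        acc += a_buf @ b_buf
        if i + 1 < n_tiles:
            a_buf, b_buf = a_next, b_next  # swap the double buffers
    return acc

a = np.random.rand(128, 512).astype(np.float32)
b = np.random.rand(512, 128).astype(np.float32)
assert np.allclose(pipelined_matmul(a, b), a @ b, atol=1e-3)
```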

Dynamic, hardware-issued (GPU)

  • Each SM has 4 warp schedulers; each picks one of dozens of resident warps every cycle.
  • Warps are fired and replaced as their data arrives.
  • Latency hidden by warp swapping: while warp A waits for memory, warp B runs.
  • The hardware figures out parallelism at runtime.
  • Performance is variable: scheduling decisions depend on cache state and contention.

Why this matters

04

Memory: Scratchpad vs Cache

Software-managed scratchpads

  • VMEM (~32–64 MiB per TensorCore): software-managed, no tags, no replacement policy.
  • CMEM (~128 MiB shared, since v4i): cache-like but XLA-controlled.
  • Working set is statically planned by the compiler for each kernel.
  • Latency is predictable; misses don't exist (if data is needed, the compiler has already scheduled a load for it).
  • Density wins: SRAM cells without tag overhead pack tighter.

Conventional cache hierarchy

  • L1 / shared memory (192 KB per SM on H100): partly programmable, partly cache.
  • L2 (60 MB on H100, larger on B200): hardware-managed.
  • Working set discovered at runtime; cache fills on access.
  • Latency variable: hits are fast, misses are slow.
  • Generality wins: no compiler effort, any workload benefits from caching.

The Hopper Tensor Memory Accelerator (TMA) is a step toward the TPU's model: explicit asynchronous DMA into shared memory, scheduled by the kernel. NVIDIA is moving toward more software-managed memory; Google has been moving toward more cache-like behaviour. The lines are converging.
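
Because the working set is planned statically, tile shapes become a budgeting exercise. A back-of-envelope check, with a 64 MiB VMEM figure and a tile shape that are assumptions for illustration:

```python
# Does a double-buffered bf16 tile set fit the scratchpad budget?
VMEM_BYTES = 64 * 2**20   # assumed per-TensorCore VMEM budget
BF16, FP32 = 2, 4         # bytes per element

def tile_bytes(m: int, k: int, n: int) -> int:
    # One (m,k) LHS tile + one (k,n) RHS tile in bf16, plus an fp32 accumulator.
    return (m * k + k * n) * BF16 + m * n * FP32

need = 2 * tile_bytes(m=256, k=8192, n=256)   # x2 for double buffering
print(f"{need / 2**20:.1f} MiB needed of {VMEM_BYTES / 2**20:.0f} MiB VMEM")
# If the tiles over-budget, there is no cache to fall back on -- the compiler
# must pick smaller tiles and replan.
```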

05

Compilation: AOT Graph vs JIT Kernel

Whole-graph AOT (TPU)

  • The user's Python code is traced into one big HLO graph.
  • XLA optimises the whole graph: cross-kernel fusion, layout assignment, sharding inference.
  • The graph is compiled ahead of time (before the first execution).
  • Result: one big binary with thousands of fused kernels, called by Pathways.
  • Recompile triggered by shape or sharding changes (see the sketch after this list).
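
A minimal JAX sketch of that flow (the function and shapes are invented for illustration): the Python body runs once per distinct input shape to build the graph, XLA compiles it before the first execution, and a new shape forces a retrace and recompile.

```python
import jax
import jax.numpy as jnp

@jax.jit
def block(x, w):
    print("tracing for shape", x.shape)   # runs at trace time, not per call
    return jax.nn.relu(x @ w)

w = jnp.ones((128, 128), dtype=jnp.bfloat16)
block(jnp.ones((8, 128), jnp.bfloat16), w)    # traces, then compiles ahead of execution
block(jnp.ones((8, 128), jnp.bfloat16), w)    # same shapes: cached binary, no retrace
block(jnp.ones((16, 128), jnp.bfloat16), w)   # new shape: retrace + recompile
```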

Per-kernel JIT (GPU)

  • Each ML op (matmul, layernorm, softmax) is a hand-tuned CUDA kernel from cuBLAS / cuDNN / FlashAttention.
  • Kernels are pre-compiled to PTX, JIT'd to SASS by the driver at first launch.
  • Framework (PyTorch, JAX-on-GPU) dispatches kernels one at a time.
  • Optimisation across kernels is the framework's responsibility, not the compiler's.
  • Triton / TorchInductor / CUTLASS provide more whole-graph compilation paths but the default is still per-kernel.

The fusion advantage of TPU

XLA can fuse a Transformer block's matmul → bias_add → layernorm → softmax into a few kernels with intermediate values living in VMEM. PyTorch eager mode runs each op as a separate kernel with intermediate values round-tripping through HBM — far more memory traffic. Hence torch.compile: PyTorch is moving toward graph-level compilation specifically to close this gap.
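
A small JAX illustration of the same idea (the op sequence is a made-up stand-in for a Transformer sub-block, and the HLO-inspection calls may differ slightly across JAX versions):

```python
import jax
import jax.numpy as jnp

def block(x, w, b):
    h = x @ w + b                                              # matmul + bias_add
    h = (h - h.mean(-1, keepdims=True)) / (h.std(-1, keepdims=True) + 1e-6)
    return jax.nn.softmax(h, axis=-1)

x = jnp.ones((32, 512), jnp.bfloat16)
w = jnp.ones((512, 512), jnp.bfloat16)
b = jnp.zeros((512,), jnp.bfloat16)

compiled = jax.jit(block).lower(x, w, b).compile()
# The post-optimisation HLO shows the elementwise ops folded into a handful of
# fusions around the dot, rather than one kernel (and one HBM round trip) per op.
print(compiled.as_text()[:2000])
```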

The pragmatic choice in 2026

For a JAX user on TPU: graph compilation is automatic and transparent. For a PyTorch user on GPU: torch.compile is increasingly required for performance, but adds its own debugging complications. The TPU has had an integrated whole-graph compiler since 2017; the GPU world is still catching up.

06

Pod Fabric: 3D Torus + OCS vs NVLink + IB

TPU pod (Ironwood)

  • 9,216 chips in one ICI-coherent domain.
  • Custom electrical SerDes, hardware all-reduce in the link controller.
  • 3D torus, with Palomar OCS for inter-cube reconfigurability.
  • Per-chip ICI: 1.2 TB/s bidir.
  • Pod aggregate ICI: ~10+ PB/s bisection.
  • Sub-microsecond chip-to-chip latency for any neighbour.

GPU pod (NVL72 + IB)

  • 72 GPUs in one NVLink domain (NVL72 rack).
  • NVLink-Switch fabric, NVLink 5 at 1.8 TB/s per GPU.
  • Beyond the rack: ConnectX-8 InfiniBand at ~800 Gbps per HCA.
  • Bisection at the InfiniBand layer is fat-tree; bandwidth depends on topology.
  • NVLink hop latency < 1 μs; IB hop latency 1–3 μs.
  • For frontier training: typically 8×NVL72 = 576 GPUs, then InfiniBand.

The scale-up gap

NVL72 is NVIDIA's biggest single-system scale-up: 72 GPUs sharing NVLink. Ironwood at 9,216 chips is two orders of magnitude larger. For tensor-parallel attention with very long sequences, where you want all-reduce latency to be sub-millisecond, the TPU pod has no peer. NVIDIA's Rubin generation will push NVL up further (rumoured 144 or higher) but won't close the gap entirely — OCS-based reconfigurability is fundamentally cheaper to scale than NVLink-Switch silicon.
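
For a rough sense of why one big scale-up domain matters, here is the standard first-order ring all-reduce cost model. The buffer size and hop latencies are assumptions for illustration; the per-chip bandwidths are the figures quoted above, and real fabrics use more sophisticated collectives than a plain ring.

```python
def allreduce_seconds(buf_bytes: float, n: int, link_bw: float, hop_latency: float) -> float:
    """Ring all-reduce: ~2*(N-1)/N of the buffer moved, 2*(N-1) latency hops."""
    bandwidth_term = 2 * (n - 1) / n * buf_bytes / link_bw
    latency_term = 2 * (n - 1) * hop_latency
    return bandwidth_term + latency_term

buf = 4e9  # a 4 GB gradient bucket, purely illustrative

print("ICI, 64-chip group   :", allreduce_seconds(buf, 64, 1.2e12, 0.5e-6))
print("NVLink, NVL72 rack   :", allreduce_seconds(buf, 72, 1.8e12, 1.0e-6))
print("InfiniBand, 512 GPUs :", allreduce_seconds(buf, 512, 0.1e12, 2.0e-6))
# The latency term is what bites for small, frequent all-reduces (tensor-parallel
# attention): fewer, faster hops inside one scale-up domain keep them sub-millisecond.
```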

07

Numerics & Precision

Format | TPU support | NVIDIA support
FP32 | Vector unit; FP32 accumulator | Yes (FMA pipeline + FP32 tensor core path)
TF32 | No (Google chose bf16 instead) | Yes, since Ampere
FP16 | Limited; not preferred for training | Yes, since Volta (the original tensor core format)
bfloat16 | Native since v2 (2017) | Native since Ampere (2020) — following Google's lead
INT8 | v1 native; v5e+ tensor cores | Yes, since Turing (2018)
INT4 | No first-class support | Yes, since Turing
FP8 (E4M3 / E5M2) | Native on v5p (some paths) and Ironwood | Native since Hopper (2022) — ahead of TPU
FP4 / MXFP4 / NVFP4 | Not yet shipped | Native on Blackwell (2024)
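
To show these formats in use, a minimal JAX illustration of the common recipe — low-precision inputs with an fp32 accumulator — which XLA maps to the MXU or the tensor cores depending on the backend:

```python
import jax
import jax.numpy as jnp

a = jnp.ones((128, 256), dtype=jnp.bfloat16)
b = jnp.ones((256, 128), dtype=jnp.bfloat16)

# bf16 inputs; ask the matmul unit for fp32 accumulation/output.
out = jax.lax.dot(a, b, preferred_element_type=jnp.float32)
print(out.dtype)   # float32
```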

The pattern

08

Where TPU Wins

09

Where GPU Wins

10

The Cloud-Only Constraint

The single biggest TPU disadvantage is not technical — it's commercial. You can only get TPUs through Google Cloud.

What this prevents

  • On-prem deployment by enterprises with strict data-residency requirements.
  • Air-gapped national-security workloads.
  • "Run it in our datacenter for compliance reasons."
  • Multi-cloud architectures that require the same chip everywhere.
  • Long-lived hardware ownership (you can reserve, but not buy outright).

What GPUs offer instead

  • Buy from any OEM, ship to any datacenter, run any workload.
  • Full on-prem operation, including air-gapped.
  • Workstation form factors (RTX, RTX Pro) for individual researchers.
  • DGX Spark / DGX Station for small-team / desk-side use.
  • Resale value: a 3-year-old H100 still has a market.

Why Google won't sell TPUs

Three plausible reasons: (1) strategic — the TPU is a Google Cloud differentiator, selling them externally would dilute that; (2) operational — the TPU's pod-scale infrastructure (cooling, OCS, Pathways orchestration) requires Google datacenter expertise; (3) economic — selling chips at scale is a much lower-margin business than renting compute. Google has historically chosen rent over sell at every fork.

A 2026 wrinkle

Through 2025 there were rumours that Google was considering selling TPUs to non-cloud customers, particularly hyperscalers (Meta, Apple). As of May 2026 nothing has shipped, but the strategic discussion is live. Even if it happens, on-prem TPU is a long way away — the operational complexity of a v5p / Ironwood pod is not retail-ready.

11

Convergence in 2026

Every architectural difference between TPUs and modern GPUs is narrower than it was five years ago.

Dimension | 2018 | 2026
Compute primitive | TPU MXU vs FP32 SIMD | TPU MXU vs GPU tensor cores — both are dedicated matmul accelerators
Memory model | Scratchpad vs cache | TPU adds CMEM cache; GPU adds TMA scratchpad-like behaviour
Compilation | XLA whole-graph vs CUDA per-kernel | XLA still wins on graphs; torch.compile closing the gap
Numerics | bf16-only on TPU vs FP16 on GPU | Both: bf16, INT8, FP8 (TPU just added; GPU also has FP4)
Pod fabric | 2D torus (256 chips) vs 8-GPU server | 3D torus + OCS (9,216 chips) vs NVL72 + InfiniBand
Numerical determinism | TPU yes, GPU no | TPU yes (mostly); GPU close (deterministic modes in cuBLAS / cuDNN)

What still divides them

For a frontier-training organisation in 2026 with the option to rent either, the choice often comes down to who's already running which stack. Google trains on TPUs because it built them; xAI trains on H100/H200 because Elon Musk bought a hundred thousand. Anthropic trains on TPUs (via Google Cloud); OpenAI on NVIDIA. Each has good engineering reasons. Both stacks work.

12

Cheat Sheet

End of the series

This is deck 12 of 12. You've now walked the TPU programme end to end — the 2013 napkin maths, the silicon, the systolic core, every generation through Ironwood, the optical interconnect, the software stack, and the architectural philosophy. For deeper context on the GPU side, see the companion NVIDIA GPU Architectures series.