Google TPUs Series — Presentation 12
TPU vs GPU — Two Architectural Philosophies
Static-compiler scheduling vs dynamic warp scheduling. Scratchpad vs cache. AOT graph vs JIT kernel. 3D torus vs fat-tree InfiniBand. Cloud-only vs everywhere. The same workload, two answers.
XLA AOT · VMEM scratchpad · 3D torus + OCS
CUDA JIT · L1/L2 cache · NVLink + InfiniBand
same workload → two architectures → two software stacks → two pod fabrics → different wins / losses
01
Two Different Starting Premises
TPU — "we know the workload"
- Designed in 2013–14 around a known-and-stable workload: dense matmul, low-precision integers, pre-Transformer.
- Compiler-first; chip is dumb, the toolchain is smart.
- Statically scheduled, software-managed memory.
- One vendor, one cloud, one customer (Google) for the first three generations.
- Single workload class; specialise relentlessly.
GPU — "we don't know the workload"
- Originated as 1990s graphics processors; ML usage came as a happy accident in 2009–12.
- Hardware-first; many cores, fast context switching, robust to whatever code you throw at it.
- Dynamically scheduled, hardware-managed cache.
- Many vendors (NVIDIA, AMD, Intel), many platforms (gaming, HPC, ML, crypto, video).
- Many workload classes; generalise across all of them.
Both architectures reached "AI accelerator" status, but from opposite directions. The TPU started as an ASIC and grew toward generality (vector unit in v2, embedding accelerator in v4, FP8 in v7). The GPU started as a generalist and grew toward specialisation (tensor cores in V100, FP8 in H100, FP4 in B200). By 2026 they converge structurally; the philosophical difference persists in what "performance" means.
A useful question to ask
"What does the chip do when it's idle?" GPU: nothing — the warp scheduler is waiting for ready warps, the SMs sleep, caches retain whatever was there. TPU: it stalls only because the program told it to — every cycle is on the schedule, so an idle cycle is a compiler decision. This dichotomy explains a lot of downstream behaviour.
02
Compute Granularity
A few huge MXUs
- Per chip: 2 TensorCores × 4 MXUs × 128×128 = 8 systolic arrays of 16,384 MACs each (v4/v5p/Ironwood class).
- Total: ~131,000 MACs per chip, all running in lockstep.
- One thread of control per TensorCore, dispatched by XLA-emitted VLIW.
- Few but huge units — the philosophy is "make the array as big as physically possible and feed it perfectly".
Many small SMs
- Per H100: 132 SMs, each with 4 tensor cores, plus FP/INT pipelines.
- Per B200 (dual-die): 208 SMs total.
- Tens of thousands of warps in flight across all SMs.
- Each tensor core executes smaller MMA tiles (e.g. Hopper's wgmma works on 64×N×16 fragments).
- Many small units — the philosophy is "have so many independent execution contexts that something is always ready".
The implication for utilisation
A TPU MXU achieves its peak only on contractions deep enough to amortise pipeline fill (slide 10 of deck 03). A GPU SM achieves its peak as soon as enough warps are resident; the warp scheduler can usually find one. TPUs are easier to underutilise than GPUs for short or irregular kernels. Conversely, when the workload is long dense matmul, the TPU's lockstep array is more efficient than 132 small ones with their associated control overhead.
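The pipeline-fill point can be put in numbers with a toy model. The fill/drain cost of roughly 2 × 128 cycles is my illustrative assumption, not a published figure:

```python
# Toy utilisation model for a 128x128 systolic array: a contraction of
# depth K streams at full rate for ~K cycles, but pays a fixed
# fill/drain overhead (assumed here to be ~2 * array dimension).
def mxu_utilisation(k: int, dim: int = 128) -> float:
    fill_drain = 2 * dim          # cycles spent filling/draining the array
    return k / (k + fill_drain)   # fraction of cycles at peak throughput

for k in (128, 512, 4096, 65536):
    print(f"K={k:6d}  utilisation ≈ {mxu_utilisation(k):.1%}")
```

Short contractions sit below 50% utilisation; only deep ones approach peak — which is exactly why the GPU's many small, independently scheduled units are harder to starve on short kernels.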
03
Scheduling: Static Compiler vs Dynamic Warp
Static, compiler-issued (TPU)
- XLA emits a VLIW-style schedule for each TensorCore.
- Every cycle: scalar unit reads instructions, dispatches to MXU, vector unit, ICI engine.
- No branch predictor, no speculation, no out-of-order.
- Latency hidden by software-pipelined DMA: the load of tile i+1 is issued while tile i is being computed, so memory and compute overlap on every iteration.
- Performance is deterministic: the same input takes the same number of cycles every time.
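The software-pipelined DMA pattern above can be sketched in plain Python. `dma_load` and `compute` are illustrative stand-ins, not real TPU APIs:

```python
# Illustrative stand-ins for the hardware operations (not real TPU APIs):
def dma_load(tile):
    return tile            # stands in for an async HBM -> VMEM copy

def compute(tile):
    return tile * 2        # stands in for the MXU contraction on one tile

def pipelined_matmul(tiles):
    """Double-buffered loop: load tile i+1 while computing tile i."""
    results = []
    buffers = [None, None]                 # two VMEM-resident tile slots
    buffers[0] = dma_load(tiles[0])        # prologue: fetch the first tile
    for i in range(len(tiles)):
        if i + 1 < len(tiles):             # overlap the next load with compute
            buffers[(i + 1) % 2] = dma_load(tiles[i + 1])
        results.append(compute(buffers[i % 2]))
    return results

print(pipelined_matmul([1, 2, 3]))  # → [2, 4, 6]
```

On real hardware the two "buffers" are VMEM regions and the overlap is scheduled by XLA at compile time, not discovered at runtime.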
Dynamic, hardware-issued (GPU)
- Each SM has 4 warp schedulers; each picks one of dozens of resident warps every cycle.
- Warps are swapped in as their operands arrive and parked when they stall.
- Latency hidden by warp swapping: while warp A waits for memory, warp B runs.
- The hardware figures out parallelism at runtime.
- Performance is variable: scheduling decisions depend on cache state and contention.
Why this matters
- Determinism: the TPU's static schedule means the same training step takes exactly the same number of cycles every run. The bit-exactness story is much cleaner. Reproducibility for distributed training is structurally easier.
- Compile cost: changing tensor shapes triggers an XLA recompile; first-call cost can be minutes. GPU kernels are pre-compiled or JIT'd by the driver but kernel launch is microseconds.
- Profiling: a TPU has nowhere to hide. If a workload underperforms, the compiler's schedule shows exactly why. GPU profiling has more sources of variability.
04
Memory: Scratchpad vs Cache
Software-managed scratchpads
- VMEM (~32–64 MiB per TensorCore): software-managed, no tags, no replacement policy.
- CMEM (~128 MiB shared, since v4i): cache-like but XLA-controlled.
- Working set is statically planned by the compiler for each kernel.
- Latency is predictable; misses don't exist (if data isn't in VMEM, the compiler failed to schedule a load — a bug, not a miss).
- Density wins: SRAM cells without tag overhead pack tighter.
Conventional cache hierarchy
- L1 / shared memory (192 KB per SM on H100): partly programmable, partly cache.
- L2 (60 MB on H100, larger on B200): hardware-managed.
- Working set discovered at runtime; cache fills on access.
- Latency variable: hits are fast, misses are slow.
- Generality wins: no compiler effort, any workload benefits from caching.
The Hopper Tensor Memory Accelerator (TMA) is a step toward the TPU's model: explicit asynchronous DMA into shared memory, scheduled by the kernel. NVIDIA is moving toward more software-managed memory; Google has been moving toward more cache-like behaviour. The lines are converging.
05
Compilation: AOT Graph vs JIT Kernel
Whole-graph AOT (TPU)
- The user's Python code is traced into one big HLO graph.
- XLA optimises the whole graph: cross-kernel fusion, layout assignment, sharding inference.
- The graph is compiled ahead of time (before the first execution).
- Result: one big binary with thousands of fused kernels, called by Pathways.
- Recompile triggered by shape or sharding changes.
Per-kernel JIT (GPU)
- Each ML op (matmul, layernorm, softmax) is a hand-tuned CUDA kernel from cuBLAS / cuDNN / FlashAttention.
- Kernels are pre-compiled to PTX, JIT'd to SASS by the driver at first launch.
- Framework (PyTorch, JAX-on-GPU) dispatches kernels one at a time.
- Optimisation across kernels is the framework's responsibility, not the compiler's.
- Triton / TorchInductor / CUTLASS provide more whole-graph compilation paths but the default is still per-kernel.
The fusion advantage of TPU
XLA can fuse a Transformer block's matmul → bias_add → layernorm → softmax into a few kernels with intermediate values living in VMEM. PyTorch eager mode runs each op as a separate kernel with intermediate values round-tripping through HBM — several times more memory traffic, roughly in proportion to the number of unfused ops. Hence torch.compile: PyTorch is moving toward graph-level compilation specifically to close this gap.
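A back-of-envelope traffic count makes the fusion win concrete. The shapes below are illustrative (8192 tokens × 4096 features, bf16), not taken from any specific model:

```python
# HBM traffic for a 4-op chain (matmul -> bias_add -> layernorm -> softmax),
# eager (every intermediate round-trips through HBM) vs fused
# (intermediates stay in on-chip memory). Illustrative shapes.
tokens, d, bytes_per = 8192, 4096, 2            # bf16 = 2 bytes/element
tensor_bytes = tokens * d * bytes_per           # one activation tensor

ops = 4
eager_traffic = ops * 2 * tensor_bytes          # each op: 1 read + 1 write to HBM
fused_traffic = 2 * tensor_bytes                # one input read, one output write

print(f"eager ≈ {eager_traffic / 2**20:.0f} MiB, "
      f"fused ≈ {fused_traffic / 2**20:.0f} MiB, "
      f"ratio = {eager_traffic // fused_traffic}x")
```

The ratio scales with the number of ops fused, which is why deep chains of cheap elementwise ops benefit the most.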
The pragmatic choice in 2026
For a JAX user on TPU: graph compilation is automatic and transparent. For a PyTorch user on GPU: torch.compile is increasingly required for performance, but adds its own debugging complications. The TPU has had an integrated whole-graph compiler since 2017; the GPU world is still catching up.
06
Pod Fabric: 3D Torus + OCS vs NVLink + IB
TPU pod (Ironwood)
- 9,216 chips in one ICI-coherent domain.
- Custom electrical SerDes, hardware all-reduce in the link controller.
- 3D torus, with Palomar OCS for inter-cube reconfigurability.
- Per-chip ICI: 1.2 TB/s bidir.
- Pod aggregate ICI bandwidth: ~11 PB/s (9,216 × 1.2 TB/s); bisection bandwidth is lower and depends on where the torus is cut.
- Sub-microsecond chip-to-chip latency for any neighbour.
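The bandwidth figures above can be sanity-checked. The aggregate number falls straight out of the per-chip figure; the bisection estimate below rests on my own assumptions (6 links per chip, a 24 × 24 × 16 torus), which are illustrative rather than published:

```python
# Aggregate ICI bandwidth: every chip contributes its full 1.2 TB/s.
chips, per_chip_tb_s = 9216, 1.2
aggregate_pb_s = chips * per_chip_tb_s / 1000
print(f"aggregate ICI ≈ {aggregate_pb_s:.1f} PB/s")

# Bisection counts only the links crossing a half/half cut of the torus.
# ASSUMPTIONS (not published figures): 6 links/chip, dims 24 x 24 x 16.
x, y, z = 24, 24, 16                 # 24 * 24 * 16 = 9216 chips
per_link_tb_s = per_chip_tb_s / 6
bisection_links = 2 * y * z          # cut perpendicular to x; wraparound doubles it
print(f"bisection ≈ {bisection_links * per_link_tb_s:.0f} TB/s (illustrative)")
```

The two numbers differ by two orders of magnitude — which is why "aggregate" and "bisection" bandwidth should never be quoted interchangeably.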
GPU pod (NVL72 + IB)
- 72 GPUs in one NVLink domain (NVL72 rack).
- NVLink-Switch fabric, NVLink 5 at 1.8 TB/s per GPU.
- Beyond the rack: ConnectX-8 InfiniBand at ~800 Gbps per HCA.
- Bisection at the InfiniBand layer is fat-tree; bandwidth depends on topology.
- NVLink hop latency < 1 μs; IB hop latency 1–3 μs.
- For frontier training: typically 8×NVL72 = 576 GPUs, then InfiniBand.
The scale-up gap
NVL72 is NVIDIA's biggest single-system scale-up: 72 GPUs sharing NVLink. Ironwood at 9,216 chips is two orders of magnitude larger. For tensor-parallel attention with very long sequences, where you want all-reduce latency to be sub-millisecond, the TPU pod has no peer. NVIDIA's Rubin generation will push NVL up further (rumoured 144 or higher) but won't close the gap entirely — OCS-based reconfigurability is fundamentally cheaper to scale than NVLink-Switch silicon.
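The latency argument can be made concrete with hop counts alone. The 1 µs hop latency and the 24 × 24 × 16 torus dimensions below are my illustrative assumptions:

```python
# Why a flat software ring doesn't scale to 9,216 chips, but a torus does.
# Counting only hop-latency terms (bandwidth terms omitted for clarity).
hop = 1e-6  # seconds per neighbour hop — illustrative assumption

def ring_steps(n):
    # Ring all-reduce: (n-1) reduce-scatter steps + (n-1) all-gather steps.
    return 2 * (n - 1)

def torus_steps(dims):
    # Dimension-ordered all-reduce: run a ring along each axis in turn.
    return sum(2 * (d - 1) for d in dims)

print(f"flat ring over 9216 chips: ~{ring_steps(9216) * hop * 1e3:.1f} ms of hop latency")
print(f"24x24x16 torus:            ~{torus_steps((24, 24, 16)) * hop * 1e6:.0f} us of hop latency")
```

A flat ring would be tens of milliseconds of latency alone; reducing along each torus axis in turn keeps the step count proportional to the sum of the dimensions, comfortably sub-millisecond.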
07
Numerics & Precision
| Format | TPU support | NVIDIA support |
| FP32 | vector unit; FP32 accumulator | Yes (FMA pipeline + FP32 tensor core path) |
| TF32 | No (Google chose bf16 instead) | Yes since Ampere |
| FP16 | Limited; not preferred for training | Yes since Volta (the original tensor core format) |
| bfloat16 | Native since v2 (2017) | Native since Ampere (2020) — following Google's lead |
| INT8 | v1 native; v5e+ tensor cores | Yes since Turing (2018) |
| INT4 | No first-class support | Yes since Turing |
| FP8 (E4M3 / E5M2) | Native on v5p (some path) and Ironwood | Native since Hopper (2022) — ahead of TPU |
| FP4 / MXFP4 / NVFP4 | Not yet shipped | Native on Blackwell (2024) |
The pattern
- Google led on bfloat16 — the only major numeric format the AI industry adopted from the TPU.
- NVIDIA led on FP8 and FP4 — same OCP standards, but TPU follows by ~1 generation.
- Both architectures use FP32 accumulators below the headline numeric, for the same numerical-stability reasons.
- The "halve precision, double FLOPS" lever has been pulled three times so far. Future formats below FP4 (1.58-bit ternary, binary?) are research-grade for now.
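Why both architectures keep FP32 accumulators can be demonstrated in a few lines. The `to_bf16` helper below is my own emulation (truncation rather than the round-to-nearest real hardware uses, but close enough to show the effect):

```python
import struct

def to_bf16(x: float) -> float:
    # Emulate bfloat16 by keeping only the top 16 bits of an FP32 value.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))
    return y

# Accumulate 10,000 copies of 0.0001 two ways: once with a bf16
# accumulator, once with an FP32 accumulator (bf16 inputs in both cases).
acc_bf16, acc_fp32 = 0.0, 0.0
for _ in range(10000):
    acc_bf16 = to_bf16(acc_bf16 + to_bf16(0.0001))  # bf16 accumulator
    acc_fp32 = acc_fp32 + to_bf16(0.0001)           # FP32 accumulator
print(acc_bf16, acc_fp32)
```

The bf16 accumulator stalls once the running sum is large enough that each 0.0001 increment falls below one bf16 ulp and truncates away; the FP32 accumulator lands near the true value of 1.0. This is exactly the failure mode MXU and tensor-core accumulators are designed around.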
08
Where TPU Wins
- Dense matmul-heavy workloads at long contractions. The systolic array's structural fit is most apparent on transformer FFN layers and big embedding multiplications.
- Pod-scale all-reduce. ICI's hardware all-reduce on a 3D torus is the most efficient single-job-spanning-thousands-of-chips collective in production.
- Deterministic training. Static scheduling means identical results across runs — valuable for reproducibility, debugging, and bit-exactness in distributed pipelines.
- Fleet inference of dense models. e-class TPUs (v5e, Trillium) hit very competitive perf/$ for serving 7B–70B-parameter models — especially when SparseCore handles MoE routing.
- Long-context inference. Ironwood's 192 GiB HBM3e at 7.4 TB/s holds enormous KV caches without all-reduce roundtrips.
- Liquid cooling at unprecedented scale. Google's been doing it since 2018; the operational expertise compounds.
- Cost-effective ML training for a single tenant. If you can fit your job on one pod, the per-FLOP cost is competitive or lower than GPU equivalents.
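The long-context claim above is easy to size. The model configuration below is a made-up 70B-class dense model with grouped-query attention, not any specific production model:

```python
# KV-cache sizing: 2 tensors (K and V) per layer, per token.
# ASSUMED config: 80 layers, 8 KV heads, head_dim 128, bf16, 1M-token context.
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2
seq_len, batch = 1_000_000, 1

kv_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len * batch
print(f"KV cache ≈ {kv_bytes / 2**30:.0f} GiB")  # vs 192 GiB HBM per Ironwood chip
```

At ~305 GiB for a single million-token sequence, even Ironwood's 192 GiB per chip forces the cache across two chips — which is where the pod's sub-microsecond neighbour latency earns its keep.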
09
Where GPU Wins
- Sparse, irregular, dynamic-shape workloads. Anything where the kernel boundaries shift at runtime (graph neural nets with variable structure, RL with rollout-dependent batch sizes) — the GPU's dynamic warp scheduling handles it.
- Single-host development. A 4090 under your desk lets you prototype; there is no desk-side TPU (NVIDIA even ships DGX Spark for exactly this niche).
- Software ecosystem. CUDA has 18 years of accumulated tooling, libraries, examples, talks, books. The TPU stack has roughly a decade of public history and far less adoption outside Google.
- Custom kernels. Triton on GPU is more mature than Pallas on TPU. FlashAttention has shipped on GPU first every time. Cutting-edge research kernels appear on GPU first.
- Heterogeneous workloads. The same H100 runs your inference, your training, your video transcoding, your scientific HPC code, your particle simulation. A TPU runs ML.
- On-prem availability. You can buy a server with H100s and put it in your datacenter. You cannot buy a server with TPUs.
- Multi-vendor optionality. AMD MI300X, Intel Gaudi 3, and others share the GPU programming model. TPU has one vendor, one cloud.
10
The Cloud-Only Constraint
The single biggest TPU disadvantage is not technical — it's commercial. You can only get TPUs through Google Cloud.
What this prevents
- On-prem deployment by enterprises with strict data-residency requirements.
- Air-gapped national-security workloads.
- "Run it in our datacenter for compliance reasons."
- Multi-cloud architectures that require the same chip everywhere.
- Long-lived hardware ownership (you can reserve, but not buy outright).
What GPUs offer instead
- Buy from any OEM, ship to any datacenter, run any workload.
- Full on-prem operation, including air-gapped.
- Workstation form factors (RTX, RTX Pro) for individual researchers.
- DGX Spark / DGX Station for small-team / desk-side use.
- Resale value: a 3-year-old H100 still has a market.
Why Google won't sell TPUs
Three plausible reasons: (1) strategic — the TPU is a Google Cloud differentiator, selling them externally would dilute that; (2) operational — the TPU's pod-scale infrastructure (cooling, OCS, Pathways orchestration) requires Google datacenter expertise; (3) economic — selling chips at scale is a much lower-margin business than renting compute. Google has historically chosen rent over sell at every fork.
A 2026 wrinkle
Through 2025 there were rumours that Google was considering selling TPUs to non-cloud customers, particularly hyperscalers (Meta, Apple). As of May 2026 nothing has shipped, but the strategic discussion is live. Even if it happens, on-prem TPU is a long way away — the operational complexity of a v5p / Ironwood pod is not retail-ready.
11
Convergence in 2026
Every architectural difference between TPUs and modern GPUs is narrower than it was five years ago.
| Dimension | 2018 | 2026 |
| Compute primitive | TPU MXU vs FP32 SIMD | TPU MXU vs GPU tensor cores — both are dedicated matmul accelerators |
| Memory model | Scratchpad vs cache | TPU adds CMEM cache; GPU adds TMA scratchpad-like behaviour |
| Compilation | XLA whole-graph vs CUDA per-kernel | XLA still wins on graphs; torch.compile closing the gap |
| Numerics | bf16-only on TPU vs FP16 on GPU | Both: bf16, INT8, FP8 (TPU just added; GPU also has FP4) |
| Pod fabric | 2D torus 256 vs 8-GPU server | 3D torus + OCS 9,216 vs NVL72 + InfiniBand |
| Numerical determinism | TPU yes, GPU no | TPU yes (mostly), GPU close (deterministic mode in cuBLAS / cuDNN) |
What still divides them
- Pod scale. 9,216-chip ICI domain vs 72-GPU NVLink domain. Two orders of magnitude.
- Software ecosystem. CUDA's installed base is enormous and still growing.
- Vendor strategy. Google: one cloud, integrated stack. NVIDIA: many platforms, broad toolchain.
- Distribution. TPU only on GCP; GPUs everywhere.
For a frontier-training organisation in 2026 with the option to rent either, the choice often comes down to who's already running which stack. Google trains on TPUs because it built them; xAI trains on H100/H200 because Elon Musk bought a hundred thousand. Anthropic trains on TPUs (via Google Cloud); OpenAI on NVIDIA. Each has good engineering reasons. Both stacks work.
12
Cheat Sheet
- Compute granularity: TPU = a few huge MXUs (8 systolic arrays of 16,384 MACs). GPU = many small SMs (132 on H100, 208 on B200).
- Scheduling: TPU = static, compiler-issued VLIW. GPU = dynamic, hardware warp scheduler. TPU is deterministic; GPU is more flexible.
- Memory: TPU = software-managed VMEM + cache-like CMEM. GPU = L1/L2 cache + shared memory + register file. The two architectures are converging.
- Compilation: TPU = whole-graph AOT via XLA, with cross-kernel fusion. GPU = per-kernel JIT via cuBLAS/cuDNN/Triton, with torch.compile closing the gap.
- Pod fabric: TPU = 9,216-chip 3D-torus + OCS, custom ICI. GPU = 72-GPU NVL72 + InfiniBand fat-tree. Two orders of magnitude on scale-up size.
- Numerics: Google led on bf16 (TPU 2017, NVIDIA 2020). NVIDIA led on FP8 (Hopper 2022) and FP4 (Blackwell 2024). TPU follows by ~1 generation.
- TPU wins: dense matmul, pod-scale all-reduce, deterministic training, fleet inference of dense models, long-context KV at 192 GiB.
- GPU wins: sparse / irregular workloads, single-host development, software ecosystem, custom kernels, on-prem availability, multi-vendor.
- Cloud-only constraint: TPU only via Google Cloud. The single biggest non-technical disadvantage.
- 2026 convergence: the two architectures are structurally closer than ever; the philosophical difference (chip-is-dumb-compiler-is-smart vs chip-is-smart-compiler-is-flexible) persists.
End of the series
This is deck 12 of 12. You've now walked the TPU programme end to end — the 2013 napkin maths, the silicon, the systolic core, every generation through Ironwood, the optical interconnect, the software stack, and the architectural philosophy. For deeper context on the GPU side, see the companion NVIDIA GPU Architectures series.