Google TPUs Series — Presentation 09

Memory Hierarchy & Numerics

Registers → VMEM → CMEM → HBM → remote HBM via ICI. INT8 → bf16 → FP8. The two axes that determine every TPU's effective performance.

[Title-slide diagram: PE registers → accumulator SRAM → VMEM → CMEM → HBM (HBM2 / 2e / 3 / 3e) → remote HBM over ICI, annotated with the INT8 / bf16 / FP8 format progression]
00

Topics We'll Cover

01

The Six-Tier TPU Memory Hierarchy

From smallest and fastest to largest and slowest. Numbers are roughly Ironwood-class.

| Tier | What it is | Latency |
|---|---|---|
| PE regs | Per-PE registers (KiBs) | ~1 cycle |
| Accum | Accumulator SRAM (~MB) | 1–2 cycles |
| VMEM | VMEM scratchpad (16–64 MiB per TensorCore) | ~5 cycles |
| CMEM | CMEM cache (128 MiB+, shared) | ~20 cycles |
| HBM | HBM3e — 192 GiB at 7.37 TB/s | ~250 ns |
| Remote | Remote HBM via ICI (1.2 TB/s bidirectional) | ~2 μs + transfer time |

The two boundaries that matter most

Everything else — PE registers, accumulator, CMEM — is plumbing. The two cliffs, VMEM ↔ HBM within a chip and local HBM ↔ remote HBM over ICI, are where compiler intelligence matters.

02

VMEM — Software-Managed Vector Memory

VMEM is the per-TensorCore software-managed scratchpad. Direct descendant of v1's Unified Buffer; not a cache.

What lives in VMEM

  • Activation tiles being fed into the MXU.
  • Output tiles drained from the MXU before writeback.
  • Working sets for vector-unit operations (softmax, layer-norm, residual adds).
  • Small operands the compiler can keep resident across multiple operations.

What's not in VMEM

  • Weights — weights live in HBM and stream through the weight FIFO.
  • Anything the compiler can't fit. If a tile is too big, the compiler spills to HBM (or now CMEM).
  • No tags, no replacement policy. Every byte's residency is statically planned.

Capacities by generation

VMEM size has not grown as fast as HBM — SRAM density doesn't scale like DRAM, and area is precious. The compiler is expected to manage residency with finer tile sizes; CMEM picks up the rest.

The Pallas API view

If you write low-level TPU kernels in Pallas, VMEM is the main thing you allocate explicitly. BlockSpec(block_shape=..., index_map=...) tells the compiler exactly which HBM region to bring into VMEM for each pipeline step; pltpu.emit_pipeline handles the double-buffering. This is where the TPU's "software does the scheduling" philosophy is most visible at the application layer.
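A minimal sketch of that view, assuming a recent JAX with jax.experimental.pallas (the kernel, tile sizes, and shapes here are illustrative, not from this deck): each BlockSpec names the VMEM tile of an HBM-resident array for one grid step, and pallas_call pipelines the HBM ↔ VMEM copies around the kernel body.

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_kernel(x_ref, y_ref, o_ref):
    # x_ref, y_ref, o_ref are the VMEM-resident tiles for the current grid step.
    o_ref[...] = x_ref[...] + y_ref[...]

def add_tiled(x, y, block_rows=512):
    # Each grid step stages one (block_rows, cols) slice from HBM into VMEM.
    spec = pl.BlockSpec(block_shape=(block_rows, x.shape[1]),
                        index_map=lambda i: (i, 0))
    return pl.pallas_call(
        add_kernel,
        grid=(x.shape[0] // block_rows,),
        in_specs=[spec, spec],
        out_specs=spec,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    )(x, y)

x = jnp.ones((4096, 256), dtype=jnp.float32)
print(add_tiled(x, x).shape)
```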

03

CMEM — The First TPU Cache

CMEM was introduced with v4i (ISCA 2021) and has persisted through every later TPU. It is shared between the two TensorCores on a chip, sits between VMEM and HBM, and is the first TPU memory level with explicit cache-like semantics.

What CMEM is

  • Roughly 128 MiB on-die SRAM (v4 / v4i specifics; later chips similar order).
  • Shared by both TensorCores on the chip.
  • XLA controls residency; behaves cache-like in that lookups can hit or miss.
  • HBM bandwidth amplifier — effective memory bandwidth roughly doubles for workloads with good locality.

What CMEM is not

  • Not a fully transparent cache (no automatic LRU; XLA decides what stays).
  • Not coherent with anything off-chip.
  • Not a uniform replacement for HBM — capacity is small.

What it buys at the workload level

  • KV cache hits on long-context inference.
  • Activation re-use for backprop in training.
  • Fused-attention staging for FlashAttention-style kernels.
  • Embedding-row caches for repeated lookups in the same sequence.

Why CMEM and not "just bigger VMEM"? VMEM is per-TensorCore; CMEM is shared. Shared SRAM is more area-efficient when both cores reuse the same data (typical for attention's QKV tiles). Splitting the levels gives the compiler two knobs — tightly-pipelined VMEM tiles vs whole-step CMEM working sets.

The architectural trend

From v1's pure scratchpad model to v4's scratchpad-plus-cache, the TPU's memory hierarchy has gradually become more cache-like. NVIDIA went the other way at the same time — H100's TMA introduced more explicit DMA control, more scratchpad-like behaviour. The two architectures are converging on the same shape from opposite directions.

04

HBM Generations Across TPU Generations

| TPU | HBM standard | Per-chip capacity | Per-chip bandwidth | Stacks |
|---|---|---|---|---|
| v1 (2015) | none (DDR3) | 8 GiB DDR3 | ~34 GB/s | 2× DDR3 channels |
| v2 (2017) | HBM | 16 GiB | ~600 GB/s | 2 stacks |
| v3 (2018) | HBM2 | 32 GiB | ~900 GB/s | 2 stacks |
| v4 (2020) | HBM2 | 32 GiB | 1.2 TB/s | 2 stacks |
| v4i (2020) | HBM2 | 8 GiB | ~614 GB/s | 1 stack |
| v5e (2023) | HBM2e (likely) | 16 GiB | 819 GB/s | 1–2 stacks |
| v5p (2023) | HBM3 (likely) | 95 GiB | 2.76 TB/s | 6 stacks |
| Trillium (2024) | HBM3 (likely) | 32 GiB | 1.64 TB/s | 2 stacks |
| Ironwood (2025) | HBM3e | 192 GiB | 7.37 TB/s | 6–8 stacks |

The bandwidth curve in one ratio

From v1 (34 GB/s) to Ironwood (7.37 TB/s) is ~217× bandwidth growth in a decade — vastly faster than Moore's law on transistor count. Most of the per-chip performance growth in TPUs has come from feeding the chip faster, not from making the chip do more arithmetic per cycle.

The HBM stack mechanics

05

Remote HBM and the ICI Bandwidth Cliff

Below HBM in the hierarchy is "the HBM on a different chip in the same pod, reached via ICI". Critical for any matmul that exceeds one-chip capacity.

| Path | Bandwidth | Latency |
|---|---|---|
| Local HBM (Ironwood) | 7.37 TB/s | ~250 ns |
| Remote HBM via ICI (Ironwood) | 1.2 TB/s bidirectional | ~2 μs + serialisation |
| Remote HBM via Jupiter DCN (Trillium multi-pod) | 13 Pb/s pod bisection; ~Tbps per link | ~10–100 μs |

The compiler's job

Keep each TensorCore's operands in local HBM wherever possible, and schedule the unavoidable cross-chip collectives so they overlap with compute. This is the same job the GPU compiler stack does, with different cliffs (NVLink for nearby, InfiniBand for far). The cliffs are sharper on TPU because of the larger pod size and the fact that ICI is custom silicon — sub-microsecond at small messages, but with a hard tier change to Jupiter at the pod boundary.

A pod is one machine

Within a TPU pod, software treats the pod as a single coherent memory space — you can jax.device_put(x, sharding=NamedSharding(mesh, ...)) and any chip can access any other chip's HBM. The compiler will turn that into ICI all-gather / all-reduce / reduce-scatter calls automatically. You don't write the cross-chip transfers explicitly — but you do feel them in your tile-size choices.
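A minimal sketch of that programming model, under an assumed 1-D mesh over whatever chips are visible (shapes and the axis name are illustrative): the matmul is written as if on one machine, and XLA derives the ICI collectives from the shardings.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), axis_names=("chips",))   # 1-D mesh over the slice

x = jnp.ones((8192, 4096), dtype=jnp.bfloat16)   # activations
w = jnp.ones((4096, 4096), dtype=jnp.bfloat16)   # weights

# Rows of x are split across each chip's HBM; w is replicated on every chip.
x = jax.device_put(x, NamedSharding(mesh, P("chips", None)))
w = jax.device_put(w, NamedSharding(mesh, P()))

# Single-machine syntax; the compiler inserts any all-gather / reduce-scatter
# over ICI that this sharding combination requires.
y = jax.jit(lambda a, b: a @ b)(x, w)
print(y.sharding)
```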

06

INT8 — Where v1 Started

v1's only numeric format: 8-bit signed integers. Multipliers produced 16-bit products; accumulators were 32-bit. Inference only — no gradients, no FP arithmetic anywhere on chip.
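A minimal sketch of that integer pipeline (symmetric per-tensor scales and random data, purely illustrative rather than v1's actual quantisation recipe): int8 × int8 products accumulate in int32, and the wide result is re-quantised back to int8.

```python
import jax
import jax.numpy as jnp

def quantize_int8(x):
    # Symmetric per-tensor scale: map the max magnitude onto ±127.
    scale = jnp.max(jnp.abs(x)) / 127.0
    q = jnp.clip(jnp.round(x / scale), -127, 127).astype(jnp.int8)
    return q, scale

a, a_scale = quantize_int8(jax.random.normal(jax.random.key(0), (64, 256)))
b, b_scale = quantize_int8(jax.random.normal(jax.random.key(1), (256, 64)))

# 256 products of magnitude up to 127 * 127 ≈ 2^14 each: int32 cannot overflow here.
acc = jnp.matmul(a.astype(jnp.int32), b.astype(jnp.int32))

# Re-quantise the wide accumulator down to int8 for the next layer.
out, out_scale = quantize_int8(acc.astype(jnp.float32) * a_scale * b_scale)
```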

Why INT8 worked for inference in 2015

Why INT8 didn't work for training

v1's INT8-only choice is the cleanest possible "inference accelerator" decision. v2 is the moment Google admits that training also needs to live on this chip, which forces FP arithmetic.

07

bfloat16 — The Format That Won

FP32 (IEEE 754 single)

S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM

1 sign + 8 exp + 23 mantissa. Range ~10^±38. The standard since 1985.

FP16 (IEEE half)

S EEEEE MMMMMMMMMM

1+5+10. Range ~10^±5. Underflows on small gradients.

bfloat16 (Google Brain)

S EEEEEEEE MMMMMMM

1+8+7. Same range as FP32 (~10^±38). Conversion to/from FP32 is just truncation.
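A minimal sketch of why the conversion is cheap (pure bit manipulation; values are illustrative, and hardware casts typically round-to-nearest rather than truncate): drop the low 16 bits of the FP32 pattern and what remains is a valid bfloat16 with the same sign and exponent.

```python
import jax.numpy as jnp
from jax import lax

x = jnp.array([3.1415927, 1.0e-30, 1234.5678], dtype=jnp.float32)

bits = lax.bitcast_convert_type(x, jnp.uint32)        # raw FP32 bit patterns
truncated = bits & jnp.uint32(0xFFFF0000)             # keep sign + 8 exp + top 7 mantissa bits
bf16_by_truncation = lax.bitcast_convert_type(truncated, jnp.float32)

bf16_by_cast = x.astype(jnp.bfloat16).astype(jnp.float32)   # library cast, for comparison
print(x, bf16_by_truncation, bf16_by_cast)
```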

Why bfloat16 is the format of the modern AI era

bf16 first appeared in TPU v2 in 2017. NVIDIA added it to Ampere tensor cores in 2020. Intel shipped AVX-512 BF16 instructions in Cooper Lake (also 2020). ARM's BFloat16 extension landed in ARMv8.6. By 2026 every major AI accelerator and CPU supports it natively. It is the most successful Google-originated numeric format in computing history.

08

FP8 — OCP E4M3 / E5M2

FP8 is the third generation of "halve precision, double FLOPS" on TPUs. Native on Ironwood; supported but less prominent on v5p. The format is the OCP (Open Compute Project) standard, jointly defined by NVIDIA, Intel, ARM, AMD, and Google.

E4M3 — activations / weights

S EEEE MMM
  • 1 sign + 4 exp + 3 mantissa.
  • Range ~10^±1.5 (after the special-value asymmetry).
  • Better precision; used for forward-pass tensors.

E5M2 — gradients

S EEEEE MM
  • 1 sign + 5 exp + 2 mantissa.
  • Range ~10^±5.
  • Wider range; used for gradients in mixed-precision training (a rarer need on TPU, since training stays in bf16).

Per-block scaling

FP8's narrow range alone is too tight for most tensors. The standard practice is block scaling: a block of values (typically 32 or 128 elements) shares an FP32 scale factor — each block is divided by its scale before the cast to FP8 and multiplied back by it on use, with the scale recomputed per block. This is the same pattern NVIDIA uses with MXFP8 / NVFP8, and the same family Blackwell uses for its FP4 (MXFP4) format.
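A minimal sketch of per-block scaling, assuming 32-element blocks and a scale chosen so each block's maximum maps to E4M3's largest finite value (~448); the block size and epsilon clamp are illustrative choices, not a spec.

```python
import jax.numpy as jnp

E4M3_MAX = 448.0   # largest finite E4M3 value

def quantize_e4m3_blocks(x, block=32):
    xb = x.reshape(-1, block)
    # One FP32 scale per block; the epsilon avoids dividing by zero on all-zero blocks.
    scale = jnp.maximum(jnp.max(jnp.abs(xb), axis=1, keepdims=True), 1e-12) / E4M3_MAX
    q = (xb / scale).astype(jnp.float8_e4m3fn)   # narrow values, one scale per block
    return q, scale.astype(jnp.float32)

def dequantize(q, scale):
    return (q.astype(jnp.float32) * scale).reshape(-1)

x = jnp.linspace(-3.0, 3.0, 128, dtype=jnp.float32)
q, s = quantize_e4m3_blocks(x)
print(jnp.max(jnp.abs(dequantize(q, s) - x)))    # worst-case round-trip error
```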

Why TPUs went for FP8 specifically (not int8 doubled)

09

Accumulator Widths & Numerical Stability

An MXU multiplies two narrow-format operands and accumulates into a wider format. Accumulator width is the chip's secret weapon for numerical stability.

| Generation | Multiply | Accumulate | Output |
|---|---|---|---|
| v1 | INT8 × INT8 | INT32 | INT8 (re-quantised) |
| v2 / v3 / v4 / v5 | bf16 × bf16 | FP32 | bf16 (or FP32 if requested) |
| v5p / v5e (INT8 path) | INT8 × INT8 | INT32 | bf16 / INT8 |
| Ironwood (FP8 path) | FP8 × FP8 (with per-block scale) | FP32 | bf16 / FP8 |

Why FP32 accumulate matters

A 256-deep matmul with bf16 inputs stacks up 256 partial-product additions. Accumulate in bf16 (7-bit stored mantissa) and every addition rounds the running sum to roughly three decimal digits; as the sum grows, individual partial products fall below its rounding threshold and vanish, so the error compounds with contraction depth. FP32 (23-bit stored mantissa) gives ~16 extra bits of headroom — enough that the same matmul produces a result indistinguishable from FP32-throughout.

This is why "bf16 matmul" on a TPU is in fact "bf16 multiply, FP32 accumulate". The advertised throughput is the multiply rate; the accumulator silicon is more expensive and is sized to keep the array stable.
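A minimal sketch of that argument, simulating a 256-deep contraction with the running sum kept in bf16 versus FP32 (data and depth are illustrative). In JAX the wide accumulator is requested explicitly with preferred_element_type on jnp.dot / jnp.matmul.

```python
import jax
import jax.numpy as jnp

a = jax.random.normal(jax.random.key(0), (256,)).astype(jnp.bfloat16)
b = jax.random.normal(jax.random.key(1), (256,)).astype(jnp.bfloat16)

# bf16 × bf16 products are exact in FP32 (at most 16 significand bits fit in 24).
products = a.astype(jnp.float32) * b.astype(jnp.float32)

acc_fp32 = jnp.sum(products)                 # FP32 running sum

def step(carry, p):
    # Simulated narrow accumulator: the running sum is rounded to bf16 every addition.
    return (carry + p).astype(jnp.bfloat16), None

acc_bf16, _ = jax.lax.scan(step, jnp.zeros((), dtype=jnp.bfloat16),
                           products.astype(jnp.bfloat16))

print(acc_fp32, acc_bf16.astype(jnp.float32))

# The production form: bf16 inputs, FP32 accumulation and output.
# y = jnp.dot(a, b, preferred_element_type=jnp.float32)
```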

A trade no one talks about

FP8 with FP32 accumulate is the same engineering choice. NVIDIA does this; AMD does this; Google does this. The reason bf16 / FP8 / INT8 advertised TFLOPS are 2× / 4× / 4× the FP32 TFLOPS is partly because the accumulator stays at FP32 width — you save area on the multiplier, not the adder. This is also why advertised FP4 numbers are even more advertising-flavoured than usual: the accumulator hasn't shrunk.

10

The "Halve Precision, Double FLOPS" Pattern

Every TPU generation has used this lever at least once. The pattern:

FP32 (v1-era CPU) → bf16 (v2, 2×) → INT8 (v5e, 2×) → FP8 (Ironwood, 2×) → FP4? (future, 2×) — each step doubles peak ops/cycle on the same MAC area.

This is the post-Dennard-scaling answer. You can't double FLOPS by doubling transistors any more — transistors are too expensive and dies are reticle-limited. You double FLOPS by halving the format. Each step costs you accuracy; the price has been worth paying so far.

11

Why TPUs Have No FP4 (Yet)

NVIDIA's Blackwell (B200) added FP4 (MXFP4) in 2024, claiming 2× FP8 throughput on inference. Ironwood (2025) does not ship FP4. Why?

The skeptical view

  • FP4 is so narrow (3 bits of magnitude) that almost any model needs aggressive per-block scaling and PTQ-style calibration.
  • Quality regressions on long-context generation have been observed.
  • Ironwood's HBM3e capacity (192 GiB) is large enough that most inference workloads don't need FP4 just to fit a model in a single chip's HBM.
  • FP8 + bigger HBM may be a better point on the quality / throughput curve than FP4 + tight HBM.

The likely-future view

  • NVIDIA's MXFP4 is the OCP-blessed format; if it becomes the inference standard, TPUs will follow.
  • Future Trillium-successor or Ironwood-successor will plausibly ship FP4 native; the design lever is just too obvious to skip permanently.
  • NVFP4 (NVIDIA's variant with 16-element blocks and a two-level FP8/FP32 scale) gives most of FP4's bandwidth advantage with fewer accuracy regressions; OCP MXFP4 (32-element blocks, E8M0 scale) is the more conservative version.

The TPU programme has historically been selective on numerics — bf16, its own invention, ran years ahead of the rest of the industry, but FP8 and FP4 have each lagged NVIDIA by a generation. That's a deliberate Jouppi-era trade: ship known-good silicon, let the format mature in the wild, then add it.

12

Cheat Sheet

Read next

Deck 10 — ICI & OCS covers the cross-chip side of the hierarchy — how remote HBM actually gets to your local matmul.