From smallest and fastest to largest and slowest. Numbers are roughly Ironwood-class.
| Tier | What it is | Latency |
|---|---|---|
| PE regs | Per-PE registers (KiBs) | ~1 cycle |
| Accum | Accumulator SRAM (~MB) | 1–2 cycles |
| VMEM | VMEM scratchpad (16–64 MiB per TC) | ~5 cycles |
| CMEM | CMEM cache (128 MiB+ shared) | ~20 cycles |
| HBM | HBM3e — 192 GiB at 7.37 TB/s | ~250 ns |
| Remote | Remote HBM via ICI (1.2 TB/s bidir) | ~2 μs + tx |
The two boundaries that matter most
VMEM ↔ HBM (a ~1000× bandwidth cliff). The compiler's tile-size choice is dominated by keeping the working set in VMEM — an HBM round-trip costs you orders of magnitude more.
HBM ↔ remote-HBM via ICI (~6× bandwidth cliff and ~10,000× latency cliff). This is the boundary between single-chip-tractable problems and pod-scale-tractable problems.
Everything else — PE registers, accumulator, CMEM — is plumbing. The two cliffs are where compiler intelligence matters.
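A back-of-envelope way to see the VMEM ↔ HBM cliff is to count HBM traffic as a function of tile size. The sketch below uses a deliberately simplified tiling model (each output tile streams its slices of A and B from HBM exactly once); the function name and problem sizes are illustrative, not any compiler's actual cost model.

```python
# Back-of-envelope HBM traffic for a tiled matmul C = A @ B in bf16 (2 bytes).
# Simplified model: each output tile streams its full K-slices of A and B from
# HBM once, plus one write of the output tile.

def hbm_traffic_gib(M, N, K, tile_m, tile_n, dtype_bytes=2):
    num_tiles = (M // tile_m) * (N // tile_n)
    per_tile = (tile_m * K + K * tile_n + tile_m * tile_n) * dtype_bytes
    return num_tiles * per_tile / 2**30

M = N = K = 16_384
print(hbm_traffic_gib(M, N, K, tile_m=128, tile_n=128))    # ~128 GiB moved
print(hbm_traffic_gib(M, N, K, tile_m=1024, tile_n=1024))  # ~16 GiB: ~8x less
```

Growing the tile from 128×128 to 1024×1024 cuts HBM traffic by roughly 8×, which is exactly why the compiler pushes tile sizes as far as the VMEM budget allows.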
02
VMEM — Software-Managed Vector Memory
VMEM is the per-TensorCore software-managed scratchpad: the direct descendant of v1's Unified Buffer, and not a cache.
What lives in VMEM
Activation tiles being fed into the MXU.
Output tiles drained from the MXU before writeback.
Working sets for vector-unit operations (softmax, layer-norm, residual adds).
Small operands the compiler can keep resident across multiple operations.
What's not in VMEM
Weights — weights live in HBM and stream through the weight FIFO.
Anything the compiler can't fit. If a tile is too big, the compiler spills to HBM (or now CMEM).
No tags, no replacement policy. Every byte's residency is statically planned.
Capacities by generation
v2: 32 MiB per TensorCore.
v3: 32 MiB per TensorCore.
v4: ~32 MiB per TensorCore.
v5p: ~64 MiB per TensorCore (estimated; not officially confirmed).
Trillium / Ironwood: similar order of magnitude, with bigger CMEM in front.
VMEM size has not grown as fast as HBM — SRAM density doesn't scale like DRAM, and area is precious. The compiler is expected to manage residency with finer tile sizes; CMEM picks up the rest.
The Pallas API view
If you write low-level TPU kernels in Pallas, VMEM is the main thing you allocate explicitly. BlockSpec(block_shape=..., index_map=...) tells the compiler exactly which HBM region to bring into VMEM for each pipeline step; pltpu.emit_pipeline handles the double-buffering. This is where the TPU's "software does the scheduling" philosophy is most visible at the application layer.
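A minimal sketch of that flow, assuming a recent JAX release (block sizes, shapes, and dtypes are illustrative; pl.pallas_call provides the double-buffered pipeline here, with pltpu.emit_pipeline as the lower-level knob for hand-rolled pipelines):

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def matmul_kernel(a_ref, b_ref, o_ref):
    # a_ref, b_ref, o_ref are VMEM tiles; pallas_call has already staged them in.
    o_ref[...] = jnp.dot(
        a_ref[...], b_ref[...], preferred_element_type=jnp.float32
    ).astype(o_ref.dtype)

@jax.jit
def matmul(a, b):
    # a: (M, K), b: (K, N). K is kept whole; M and N are tiled in VMEM-sized blocks.
    bm, bn = 256, 256
    M, K = a.shape
    _, N = b.shape
    return pl.pallas_call(
        matmul_kernel,
        grid=(M // bm, N // bn),
        in_specs=[
            pl.BlockSpec(block_shape=(bm, K), index_map=lambda i, j: (i, 0)),  # A row-block
            pl.BlockSpec(block_shape=(K, bn), index_map=lambda i, j: (0, j)),  # B column-block
        ],
        out_specs=pl.BlockSpec(block_shape=(bm, bn), index_map=lambda i, j: (i, j)),
        out_shape=jax.ShapeDtypeStruct((M, N), a.dtype),
    )(a, b)
```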
03
CMEM — The First TPU Cache
CMEM is introduced with v4i (ISCA 2021) and persists through every later TPU. It is shared between the two TensorCores on a chip, sits between VMEM and HBM, and is the first TPU memory level with explicit cache-like semantics.
What CMEM is
Roughly 128 MiB on-die SRAM (v4 / v4i specifics; later chips similar order).
Shared by both TensorCores on the chip.
XLA controls residency; behaves cache-like in that lookups can hit or miss.
HBM bandwidth amplifier — effective memory bandwidth roughly doubles for workloads with good locality.
What CMEM is not
Not a fully transparent cache (no automatic LRU; XLA decides what stays).
Not coherent with anything off-chip.
Not a uniform replacement for HBM — capacity is small.
What it buys at the workload level
KV cache hits on long-context inference.
Activation re-use for backprop in training.
Fused-attention staging for FlashAttention-style kernels.
Embedding-row caches for repeated lookups in the same sequence.
Why CMEM and not "just bigger VMEM"? VMEM is per-TensorCore; CMEM is shared. Shared SRAM is more area-efficient when both cores reuse the same data (typical for attention's QKV tiles). Splitting the levels gives the compiler two knobs — tightly-pipelined VMEM tiles vs whole-step CMEM working sets.
The architectural trend
From v1's pure scratchpad model to v4's scratchpad-plus-cache, the TPU's memory hierarchy has gradually become more cache-like. NVIDIA went the other way at the same time — H100's TMA introduced more explicit DMA control, more scratchpad-like behaviour. The two architectures are converging on the same shape from opposite directions.
04
HBM Generations Across TPU Generations
| TPU | HBM standard | Per-chip capacity | Per-chip bandwidth | Stacks |
|---|---|---|---|---|
| v1 (2015) | none (DDR3) | 8 GiB DDR3 | ~34 GB/s | 2× DDR3 channels |
| v2 (2017) | HBM | 16 GiB | ~600 GB/s | 2 stacks |
| v3 (2018) | HBM2 | 32 GiB | ~900 GB/s | 2 stacks |
| v4 (2020) | HBM2 | 32 GiB | 1.2 TB/s | 2 stacks |
| v4i (2020) | HBM2 | 8 GiB | ~614 GB/s | 1 stack |
| v5e (2023) | HBM2e (likely) | 16 GiB | 819 GB/s | 1–2 stacks |
| v5p (2023) | HBM3 (likely) | 95 GiB | 2.76 TB/s | 6 stacks |
| Trillium (2024) | HBM3 (likely) | 32 GiB | 1.64 TB/s | 2 stacks |
| Ironwood (2025) | HBM3e | 192 GiB | 7.37 TB/s | 6–8 stacks |
The bandwidth curve in one ratio
From v1 (34 GB/s) to Ironwood (7.37 TB/s) is ~217× bandwidth growth in a decade — vastly faster than Moore's law on transistor count. Most of the per-chip performance growth in TPUs has come from feeding the chip faster, not from making the chip do more arithmetic per cycle.
The HBM stack mechanics
Each HBM stack is several DRAM dies stacked vertically (4-Hi, 8-Hi, now 12-Hi for HBM3e), connected by through-silicon vias (TSVs) to a logic base die.
The logic base die handles the wide PHY and ECC.
The stack sits next to the TPU die on the package, connected via a silicon interposer (CoWoS or equivalent) over thousands of micro-bumps.
Below HBM in the hierarchy is "the HBM on a different chip in the same pod, reached via ICI". Critical for any matmul that exceeds one-chip capacity.
| Path | Bandwidth | Latency |
|---|---|---|
| Local HBM (Ironwood) | 7.37 TB/s | ~250 ns |
| Remote HBM via ICI (Ironwood) | 1.2 TB/s bidir | ~2 μs + serialisation |
| Remote HBM via Jupiter DCN (Trillium multipod) | 13 Pb/s pod bisection → per-link ~Tbps | ~10–100 μs |
The compiler's job
Tile a big matmul so each tile's working set fits in VMEM and its streaming traffic stays within local HBM bandwidth.
Schedule cross-chip all-reduces so they overlap with compute on the next tile.
For multipod jobs: schedule cross-pod traffic as bulk async transfers, not on the critical path of a step.
This is the same job the GPU compiler stack does, with different cliffs (NVLink for nearby, InfiniBand for far). The cliffs are sharper on TPU because of the larger pod size and the fact that ICI is custom-silicon — sub-microsecond at small messages, but with a hard tier change to Jupiter at the pod boundary.
A pod is one machine
Within a TPU pod, software treats the pod as one large, sharded memory space: you can jax.device_put(x, NamedSharding(mesh, ...)) and any chip can reach any other chip's HBM. The compiler turns that into ICI all-gather / all-reduce / reduce-scatter calls automatically. You don't write the cross-chip transfers explicitly, but you do feel them in your tile-size choices.
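A hedged sketch of that flow in JAX; the mesh axis name, shapes, and sharding layout are illustrative, and on a real pod slice you would normally build a mesh that mirrors the ICI torus:

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One mesh axis over every attached chip. Axis name "x" is illustrative.
mesh = Mesh(np.array(jax.devices()), axis_names=("x",))

# Shard the contraction dimension of both operands across the chips' HBM.
x = jax.device_put(jnp.ones((4_096, 8_192), jnp.bfloat16),
                   NamedSharding(mesh, P(None, "x")))
w = jax.device_put(jnp.ones((8_192, 4_096), jnp.bfloat16),
                   NamedSharding(mesh, P("x", None)))

# A plain matmul: each chip computes a partial product from its local HBM slice,
# and XLA inserts the ICI all-reduce that combines the partials. No cross-chip
# transfer is written by hand.
y = jax.jit(lambda x, w: x @ w)(x, w)
```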
06
INT8 — Where v1 Started
v1's only numeric format: 8-bit signed integers. Multipliers produced 16-bit products; accumulators were 32-bit. Inference only: no gradients, no FP arithmetic anywhere on chip.
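The multiply-narrow / accumulate-wide contract is still visible from JAX today through preferred_element_type. A small illustrative sketch (shapes and the re-quantisation scheme are assumptions, not v1's actual calibration flow):

```python
import numpy as np
import jax.numpy as jnp

a = jnp.asarray(np.random.randint(-128, 128, (256, 256), dtype=np.int8))
b = jnp.asarray(np.random.randint(-128, 128, (256, 256), dtype=np.int8))

# INT8 x INT8 multiplies, INT32 accumulation: the v1 arithmetic contract.
acc = jnp.matmul(a, b, preferred_element_type=jnp.int32)

# Re-quantise the INT32 accumulator back to INT8 for the next layer
# (a simple max-abs scale stands in for v1's calibration-time scale).
scale = jnp.max(jnp.abs(acc)) / 127.0
out = jnp.clip(jnp.round(acc / scale), -128, 127).astype(jnp.int8)
```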
Why INT8 worked for inference in 2015
The dominant inference models (CNNs for images, narrow MLPs for ranking) tolerate post-training quantisation to INT8 with negligible accuracy loss.
INT8 multiplies are 16× cheaper in silicon area than FP32, and ~4× cheaper than FP16.
INT8 storage is 4× cheaper in HBM bandwidth than FP32.
Why INT8 didn't work for training
Gradients have a dynamic range of ~10^5 to 10^10; INT8 covers maybe 10^2.
Weight updates (learning_rate × gradient) are typically ~10^-5. Quantising those updates to INT8 erases most of them, as the sketch after this list shows.
Accumulating gradients across batches blows out the integer range immediately.
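A minimal numeric sketch of the update problem, assuming per-tensor symmetric quantisation with a scale of 1/127 (the 2015-era scheme):

```python
import numpy as np

# A representative weight and a representative SGD update, in FP32.
w = np.float32(0.73)
update = np.float32(1e-3) * np.float32(2e-2)   # learning_rate * gradient ~ 2e-5

# Per-tensor symmetric INT8 quantisation with an assumed scale of 1/127,
# i.e. a representable range of roughly [-1, +1] in steps of ~0.008.
scale = np.float32(1.0 / 127.0)

def quantise_int8(v):
    return np.clip(np.round(v / scale), -128, 127).astype(np.int8)

print(quantise_int8(w))                               # 93: the weight survives
print(quantise_int8(w + update) - quantise_int8(w))   # 0: the update vanishes
```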
v1's INT8-only choice is the cleanest possible "inference accelerator" decision. v2 is the moment Google admits that training also needs to live on this chip, which forces FP arithmetic.
07
bfloat16 — The Format That Won
FP32 (IEEE 754 single)
SEEEEEEEEMMMMMMMMMMMMMMMMMMMMMMM
1 sign + 8 exp + 23 mantissa. Range ~10^±38. The standard since 1985.
FP16 (IEEE half)
SEEEEEMMMMMMMMMM
1+5+10. Range ~10^±5. Underflows on small gradients.
bfloat16 (Google Brain)
SEEEEEEEEMMMMMMM
1+8+7. Same range as FP32 (~10^±38). Conversion to/from FP32 is just truncation.
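The truncation claim is easy to check at the bit level. A small sketch (plain truncation shown for clarity; real hardware converts with round-to-nearest-even, which happens to give the same result for this value):

```python
import numpy as np
import jax.numpy as jnp

x = np.array(3.14159265, dtype=np.float32)

# bf16 is the top 16 bits of the FP32 encoding: same sign and exponent bits,
# mantissa cut from 23 bits to 7.
bits = x.view(np.uint32)
bf16_bits = np.uint16(bits >> 16)

print(hex(bf16_bits))                        # 0x4049
print(jnp.asarray(x).astype(jnp.bfloat16))   # 3.140625 (JAX's native cast)
```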
Why bfloat16 is the format of the modern AI era
Range, not precision, is the binding constraint. Neural-net training is numerically robust to small rounding errors, but extremely sensitive to over/underflow.
Drop-in compatibility. A bf16 value is the top half of an FP32 value — you can convert by truncation, no scaling needed.
Hardware multipliers shrink. A bf16 multiplier is ~25% the area of an FP32 multiplier; an FP16 multiplier is ~30% (extra mantissa bits).
Memory and bandwidth halve compared to FP32.
bf16 first appeared in TPU v2 in 2017. NVIDIA added it to Ampere tensor cores in 2020. Intel shipped AVX-512 BF16 instructions in Cooper Lake (also 2020). ARM's BFloat16 extension landed in ARMv8.6. By 2026 every major AI accelerator and CPU supports it natively. It is the most successful Google-originated numeric format in computing history.
08
FP8 — OCP E4M3 / E5M2
FP8 is the third generation of "halve precision, double FLOPS" on TPUs. Native on Ironwood; supported but less prominent on v5p. The format is the OCP (Open Compute Project) standard, jointly defined by NVIDIA, Intel, ARM, AMD, and Google.
E4M3 — activations / weights
SEEEEMMM
1 sign + 4 exp + 3 mantissa.
Range ~10^±1.5 (after the special-value asymmetry).
Better precision; used for forward-pass tensors.
E5M2 — gradients
SEEEEEMM
1 sign + 5 exp + 2 mantissa.
Range ~10^±5.
Wider range; intended for gradients in mixed-precision training (a rarer use on TPU, since training stays in bf16).
Per-block scaling
FP8's narrow range alone is too tight for most tensors. The standard practice is block scaling: a block of values (typically 32 or 128 elements) shares an FP32 scale factor. Values are divided by the scale before being stored as FP8 and multiplied back by it on use; the scale is recomputed per block. This is the same pattern NVIDIA uses with MXFP8 / NVFP8, and the same family Blackwell uses for its FP4 (MXFP4) format.
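A minimal sketch of block scaling under the assumptions above: 32-element blocks, one FP32 scale per block, and the float8_e4m3fn dtype that ships with JAX. Function names are illustrative.

```python
import jax.numpy as jnp

E4M3_MAX = 448.0   # largest finite float8_e4m3fn value
BLOCK = 32         # elements sharing one scale

def quantise_fp8_blockwise(x):
    """Quantise a 1-D tensor to FP8 with one FP32 scale per 32-element block."""
    blocks = x.astype(jnp.float32).reshape(-1, BLOCK)
    scale = jnp.max(jnp.abs(blocks), axis=1, keepdims=True) / E4M3_MAX
    scale = jnp.maximum(scale, 1e-12)          # avoid divide-by-zero on all-zero blocks
    return (blocks / scale).astype(jnp.float8_e4m3fn), scale

def dequantise(q, scale):
    return (q.astype(jnp.float32) * scale).reshape(-1)

x = jnp.linspace(-3.0, 3.0, 1024)
q, s = quantise_fp8_blockwise(x)
print(jnp.max(jnp.abs(dequantise(q, s) - x)))  # worst-case round-trip error
```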
Why TPUs went for FP8 specifically (not int8 doubled)
FP8 with block scaling matches per-tensor INT8 + scale on accuracy, with floating-point semantics that compose better through nonlinear ops.
The MXU silicon for FP8 multiplies is similar in area to INT8 multiplies, so peak FLOPS doubles vs bf16 in the same area.
Software ecosystem: NVIDIA's adoption of the same OCP formats means quantised models port across hardware vendors.
09
Accumulator Widths & Numerical Stability
An MXU multiplies two narrow-format operands and accumulates into a wider format. Accumulator width is the chip's secret weapon for numerical stability.
| Generation | Multiply | Accumulate | Output |
|---|---|---|---|
| v1 | INT8 × INT8 | INT32 | INT8 (re-quantised) |
| v2 / v3 / v4 / v5 | bf16 × bf16 | FP32 | bf16 (or FP32 if requested) |
| v5p / v5e (INT8 path) | INT8 × INT8 | INT32 | bf16 / INT8 |
| Ironwood (FP8 path) | FP8 × FP8 (with per-block scale) | FP32 | bf16 / FP8 |
Why FP32 accumulate matters
A 256-deep matmul with bf16 inputs has 256 partial-product additions stacking up. If you accumulate in bf16 (7-bit stored mantissa), every addition rounds the running sum to ~8 significant bits; once the sum grows much larger than the individual products, their contributions are partly or wholly swallowed, and the error compounds with contraction depth. FP32 (23-bit mantissa) gives you ~16 extra bits of headroom, enough that the same matmul produces a result indistinguishable from FP32-throughout.
This is why "bf16 matmul" on a TPU is in fact "bf16 multiply, FP32 accumulate". The advertised throughput is the multiply rate; the accumulator silicon is more expensive and is sized to keep the array stable.
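A small software illustration of why accumulator width matters: round a 4096-term running sum back to bf16 after every addition (what a bf16-wide accumulator would do) and compare it with the same sum held in FP32. This sketches the numerical effect only; it is not a description of the MXU datapath.

```python
import jax
import jax.numpy as jnp

# 4096 bf16 partial products, as a K=4096 contraction would generate for one
# output element.
products = jax.random.normal(jax.random.PRNGKey(0), (4096,), jnp.float32)
products = products.astype(jnp.bfloat16)

ref = jnp.sum(products.astype(jnp.float32))   # FP32 accumulator

def bf16_running_sum(ps):
    # Round the running sum back to bf16 after every addition.
    def step(acc, p):
        return (acc + p).astype(jnp.bfloat16), None
    acc, _ = jax.lax.scan(step, jnp.zeros((), jnp.bfloat16), ps)
    return acc

print(ref)                          # FP32-accumulated reference
print(bf16_running_sum(products))   # drifts noticeably from the reference
```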
A trade no one talks about
FP8 with FP32 accumulate is the same engineering choice. NVIDIA does this; AMD does this; Google does this. The reason bf16 / FP8 / INT8 advertised TFLOPS are 2× / 4× / 4× the FP32 TFLOPS is partly because the accumulator stays at FP32 width: you save area on the multiplier, not the adder. This is also why advertised FP4 numbers are even more advertising-flavoured than usual: the accumulator hasn't shrunk.
10
The "Halve Precision, Double FLOPS" Pattern
Every TPU generation has used this lever at least once. The pattern:
FP32 → bf16 (2017, v2): 2× on multiplier area, 2× on memory and bandwidth.
bf16 → INT8 (v1 had it; v5e revives it): 2× on multiplier area, 2× on memory and bandwidth.
bf16 → FP8 (Ironwood, 2025): 2× on multiplier area, 2× on memory and bandwidth.
This is the post-Dennard-scaling answer. You can't double FLOPS by doubling transistors any more — transistors are too expensive and dies are reticle-limited. You double FLOPS by halving the format. Each step costs you accuracy; the price has been worth paying so far.
11
Why TPUs Have No FP4 (Yet)
NVIDIA's Blackwell (B200) added FP4 (MXFP4) in 2024, claiming 2× FP8 throughput on inference. Ironwood (2025) does not ship FP4. Why?
The skeptical view
FP4 is so narrow (3 bits of magnitude) that almost any model needs aggressive per-block scaling and PTQ-style calibration.
Quality regressions on long-context generation have been observed.
Ironwood's HBM3e capacity (192 GiB) is large enough that most inference workloads don't need FP4 to fit in chip-memory.
FP8 + bigger HBM may be a better point on the quality / throughput curve than FP4 + tight HBM.
The likely-future view
NVIDIA's MXFP4 is the OCP-blessed format; if it becomes the inference standard, TPUs will follow.
Future Trillium-successor or Ironwood-successor will plausibly ship FP4 native; the design lever is just too obvious to skip permanently.
NVFP4 (NVIDIA's variant with 16-element blocks and a two-level FP8/FP32 scale) gives most of FP4's bandwidth advantage with fewer accuracy regressions; OCP MXFP4 (32-element blocks, E8M0 scale) is the more conservative version.
The TPU programme has historically been conservative on numerics: bf16 shipped on TPUs years before GPUs picked it up, but FP8 and FP4 have each lagged NVIDIA by a generation. That's a deliberate Jouppi-era trade: ship known-good silicon, let the format mature in the wild, then add it.
12
Cheat Sheet
Six-tier hierarchy: PE registers → accumulator SRAM → VMEM (per-TC scratchpad) → CMEM (shared on-die cache) → HBM → remote HBM via ICI. The two important boundaries are VMEM↔HBM and HBM↔remote.
VMEM: ~32–64 MiB per TensorCore, software-managed, no tags. Direct descendant of v1's Unified Buffer.