Google TPUs Series — Presentation 03

Inside the Systolic Array — The Matmul Engine

From Kung & Leiserson's 1978 paper to the 256×256 MAC grid at the heart of every TPU. Why this 47-year-old idea is the reason Google's chips work.

Timeline: Mead & Conway VLSI → Kung-Leiserson 1978 → "Why Systolic?" 1982 → CMU Warp / iWarp → 1990s decline → TPU v1 revival 2015
00

Topics We'll Cover

01

What's a Systolic Array?

A systolic array is a 2D grid of identical, simple processing elements (PEs) that pump data through themselves like blood through a heart — data moves, computation happens at each beat, and the result emerges from the boundary. The name is Kung's; the analogy is medical.

A systolic system consists of a set of interconnected cells, each capable of performing some simple operation. Information in a systolic system flows between cells in a pipelined fashion, and communication with the outside world occurs only at the boundary cells. — H.T. Kung, "Why Systolic Architectures?", IEEE Computer, January 1982

Three defining properties

  • A regular grid of simple, identical cells — no instruction fetch, no control divergence.
  • Data flows rhythmically between neighbouring cells on every clock, computing as it moves.
  • Communication with the outside world happens only at the boundary cells.

The compute-density argument

If you remove caches, branch predictors, instruction decoders, and replace all that with a tile of multiply-accumulate units that fire every cycle, you get an enormous arithmetic-intensity advantage per unit of silicon. TPU v1 fits 65,536 MAC units in a 28 nm die smaller than a contemporary mid-range CPU.

02

Kung & Leiserson, 1978

The paper that names the field is "Systolic Arrays (for VLSI)" by H.T. Kung and Charles Leiserson at Carnegie Mellon, circulated as CMU technical report CMU-CS-79-103 (1979) but written and presented in 1978 at the Sparse Matrix Symposium. A condensed version appears as Section 8.3 of Mead & Conway's Introduction to VLSI Systems (Addison-Wesley, 1980), the textbook that taught the world how to design chips.

Why 1978 was the right year

VLSI had just made it cheap to replicate one simple cell thousands of times on a die, and Mead & Conway's methodology was teaching a generation of designers how to exploit exactly that kind of regular, replicated structure.

The Kung & Leiserson paper, in one diagram

Their Figure 1 shows a 1D linear array of MAC units computing y = Ax for a banded matrix A. Each PE holds one element of A. The vector x streams in from the left, partial sums of y stream out to the right, and on every clock each PE computes partial += a * x and forwards x. The 2D extension to dense matmul is one paragraph and one figure.
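The per-PE rule can be made concrete with a cycle-accurate toy simulation. This is an illustrative sketch, not code from the paper: it uses the counter-flowing 1D arrangement (x and partial sums travel in opposite directions, with operands injected every other cycle so each partial sum meets exactly the x values it needs), and a constant-diagonal (Toeplitz) band so that each PE holds a single fixed coefficient:

```python
# Toy cycle-accurate 1D systolic array: K PEs, each holding one stationary
# coefficient w[k]. x values flow right-to-left, partial sums flow
# left-to-right; injecting operands every other cycle (the classic
# half-utilisation counter-flow scheme) makes the right pairs meet,
# computing the correlation y[n] = sum_k w[k] * x[n+k] (zero-padded).
def systolic_1d(w, x):
    K, N = len(w), len(x)
    x_pipe = [0.0] * K          # x value sitting at each PE this cycle
    y_pipe = [0.0] * K          # partial sum sitting at each PE this cycle
    out = []
    for t in range(2 * (N - 1) + 2 * K - 1):
        # shift: x moves left (towards PE 0), y moves right (towards PE K-1);
        # fresh x enters at the right on even cycles, bubbles in between
        x_pipe = x_pipe[1:] + [x[t // 2] if t % 2 == 0 and t // 2 < N else 0.0]
        y_pipe = [0.0] + y_pipe[:-1]       # fresh partial sums enter as zero
        # every PE fires its multiply-accumulate on the same clock edge
        for k in range(K):
            y_pipe[k] += w[k] * x_pipe[k]
        # a finished sum leaves the right edge every other cycle
        if t >= 2 * K - 2 and (t - 2 * K + 2) % 2 == 0:
            out.append(y_pipe[-1])
    return out[:N]
```

Note how the only global signal is the clock — each PE sees nothing but its two neighbours, which is precisely what made the scheme attractive for VLSI.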

The paper closes with what would, decades later, become the TPU's design philosophy: "these algorithms can be implemented as special-purpose hardware that achieves much higher performance than general-purpose computers".

03

"Why Systolic Architectures?" — Kung 1982

Four years later, Kung published a single-author review in IEEE Computer that became the field's most-cited paper. Its three contributions:

1. The pump analogy

Naming it. "Systole" is the heart-pump phase that pushes blood through the body. The architecture's defining feature is that data moves on every clock and computes along the way.

2. The I/O-bound argument

If you have N PEs you need an I/O bandwidth proportional to N to keep them busy — unless you re-use each operand many times. Systolic arrays are the canonical answer: each input value visits O(N) PEs.

3. The algorithm taxonomy

Tables of which kernels map to which structure. 1D linear arrays for FIR filters and 1D convolution; 2D meshes for matmul and 2D convolution; hex arrays for transitive closure; trees for sorting. Each row of the table is a future ASIC.

The argument in one inequality

Kung's quantitative core: a systolic array of side N processing M elements gives you O(NM) compute work for O(M) I/O — arithmetic intensity of N. For a 256×256 array doing a matmul, that's 256 ops per byte of input bandwidth. The DRAM you have can keep up. This is the same arithmetic-intensity argument that justifies tensor cores today — it just took 35 years for a workload to need it badly enough to fund the silicon.
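The numbers above can be checked in a few lines. A sketch with TPU v1-like parameters (the formulas are the document's arithmetic-intensity argument; the function names are ours):

```python
# Back-of-envelope for Kung's argument: an n-wide weight-stationary array
# consumes one activation byte per row per cycle but fires n*n MACs per
# cycle, so required feed bandwidth grows with n, not n*n.
def feed_bandwidth_gbps(n, clock_ghz):
    # bytes/cycle (one per row) * cycles/ns = GB/s of activation feed
    return n * clock_ghz

def peak_tops(n, clock_ghz, ops_per_mac=2):
    # n*n PEs, each doing one multiply + one accumulate per cycle
    return n * n * clock_ghz * ops_per_mac / 1000.0

# TPU v1-like numbers: a 256-wide array at 0.7 GHz
print(peak_tops(256, 0.7))           # ~91.75 TOPS, the "92 TOPS" headline
print(feed_bandwidth_gbps(256, 0.7)) # ~179 GB/s keeps 65,536 MACs busy
```

The point of the comparison: ~92 TOPS of compute needs only ~179 GB/s of operand feed, because every byte is re-used 256 times inside the array.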

A direct line to the TPU

The 2017 ISCA TPU paper cites Kung 1982 explicitly, and Jouppi has said in talks that the v1 design started from a literal copy of the Kung-Leiserson 2D matrix-multiply array, scaled to 256×256 to fit a 28 nm die budget. The chip is, structurally, a 1978 idea on a 2014 process.

04

CMU Warp & Intel iWarp

The first generation of systolic-array hardware, built by Kung's group at CMU, ran from 1985 to the early 1990s. Three machines:

Machine | Year | Cells | Performance | Industrial partner
WW-Warp (Wire-Wrap) | 1985–86 | 2-cell prototype, then 10 | ~10 MFLOPS/cell | GE, Honeywell
PC-Warp (Printed Circuit) | 1987–88 | 10 cells | ~10 MFLOPS/cell | GE
iWarp | 1990–93 | up to ~64 cells per array | 20 MFLOPS/cell @ 20 MHz | Intel

Warp's three architectural lessons

The iWarp project shipped about 39 systems before Intel exited in 1993. Commercially small, intellectually large — the entire engineering DNA of TPU v1 traces through Warp.

Patterson on Warp → TPU

David Patterson has said in talks that the TPU programme treated the Warp/iWarp work as the canonical reference for "what a systolic-array supercomputer actually looks like at engineering scale" — including all the unglamorous parts (compiler, I/O, memory hierarchy, system integration) that the original 1978 paper did not need to address.

05

The 1990s Decline

By 1996, systolic arrays were considered an architectural dead end. Three reasons converged:

The DRAM bandwidth wall

An N×N array of MACs needs O(N) bytes per cycle to keep busy. By the early 90s memory bandwidth was lagging compute exponentially — the gap eventually called the "memory wall" in Wulf-McKee 1995. Systolic arrays starved.

Microprocessor exponential

Pentium-class CPUs riding Dennard scaling were getting a free 1.5×/year on single-thread performance. A custom systolic ASIC that took two years to design was beaten by the next general-purpose Pentium before it shipped.

Killer application missing

Outside specialised signal processing, no civilian workload needed 256×256 dense matmul. Without an application, no one funded the silicon. MasPar, Thinking Machines, Cray's MPP unit all failed between 1993 and 1996.

The hibernation, 1996–2014

For nearly two decades systolic arrays survived only in narrow niches: Fourier-transform accelerators, beamforming ASICs in radar, video-compression chips, and a few academic exercises. They were taught in computer-architecture textbooks as a historical milestone, not as a current technique.

And then deep learning happened. By 2014 the dominant compute pattern in industrial ML was large dense matrix multiplications on small batches of low-precision integers — exactly the workload Kung had been trying to find since 1978. The TPU is what happens when a 36-year-old idea finally meets the workload it was waiting for.

06

The Three Dataflows

For the matmul C = A × B, the question is which of the three tensors — weights, activations, or partial outputs — stays pinned in the PEs while the others move. This is the field's main taxonomy.

Weight-Stationary (WS)

Pre-load the weight matrix A into the array; weights stay put. Activations B stream in skewed; partial sums propagate down (or right). Final outputs emerge at one boundary.

Used by: TPU MXU, NVDLA, most modern ML accelerators.

Wins: weights re-used across many activations — ideal when weights are large and shared.

Output-Stationary (OS)

Pin a partial accumulator in each PE. Both A and B stream through; each PE keeps adding to its own output. When done, drain accumulators.

Used by: early DSP arrays, some FPGA matmul cores.

Wins: no accumulator data movement; minimum write bandwidth at the boundary.
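The OS schedule has a tidy closed form: with row i of A skewed by i cycles and column j of B skewed by j, the pair meeting at PE (i, j) on cycle t is (A[i][k], B[k][j]) with k = t − i − j. A toy sketch of that schedule (ours, not any specific chip's implementation):

```python
# Output-stationary toy schedule: A streams right, B streams down, both
# skewed; each PE multiplies whatever pair passes through it and adds the
# product to its own resident accumulator, which ends up holding C[i][j].
def output_stationary_matmul(A, B):
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]      # one private accumulator per PE
    for t in range(M + N + K - 2):         # run until the last PE finishes
        for i in range(M):
            for j in range(N):
                k = t - i - j              # operand pair at PE (i, j) now
                if 0 <= k < K:             # outside this window: fill/drain
                    C[i][j] += A[i][k] * B[k][j]
    return C
```

Nothing ever moves the accumulators — the drain at the end is the only output traffic, which is exactly the "minimum write bandwidth" win.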

Row-Stationary (RS)

MIT Eyeriss (ISCA 2016, Chen / Emer / Sze). Each PE holds a row of filter weights; input rows move diagonally and partial sums move vertically between PEs. Hand-tuned to maximise data re-use across all three operand types at once.

Used by: Eyeriss; not adopted by industry chips at scale.

Wins: 1.4–2.5× energy improvement on AlexNet vs WS/OS in the original paper.

Why the choice matters

The dataflow choice is the single biggest architectural decision in any systolic accelerator. It determines bandwidth requirements at every level of the memory hierarchy, the layout of operands in SRAM, the optimal tile size, the latency to first output, and the pipeline-fill cost. Most public TPU descriptions get this right (weight-stationary); some popular write-ups call it output-stationary, which is incorrect — the TPU's MXU loads weights in advance and streams activations.

07

The TPU's Choice — Weight-Stationary

The TPU MXU pre-loads weights into the array before each tile of work, then streams activations. Why?

Workload-driven argument

  • In LLM inference the weight tensor is much larger than any single activation tile and is re-used across the whole batch and the whole sequence.
  • Loading a 128×128 weight tile once and reusing it for 128 activation rows = 128× reuse on the most expensive operand.
  • Activations are smaller, change frequently, and stream cleanly from VMEM.

Engineering argument

  • Weights live in HBM. HBM bandwidth is precious. Minimising weight bandwidth is the dominant optimisation.
  • Weight-stationary needs only one weight tile transfer per N activation tiles.
  • Output-stationary would stream both operand matrices through the array every cycle — roughly double the operand SRAM bandwidth for no compute gain.

The mechanics, in one sentence

For each tile: (1) DMA a 128×128 weight tile from HBM into the MXU through the weight FIFO, (2) feed activations from VMEM into the array's left edge skewed by row, (3) partial sums propagate top-to-bottom and accumulate in the bottom row, (4) drain results into accumulators or back to VMEM.
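Steps (2)–(4) can be simulated with explicit per-cycle register shifts. This is a toy sketch of the weight-stationary dataflow (our code, not MXU RTL; sizes shrunk for readability), computing Y = X · W for an n×n resident weight tile:

```python
# Weight-stationary toy simulation: W[i][j] is pre-loaded into PE (i, j).
# Activation row v enters the left edge skewed (element i at cycle v + i)
# and shifts right; partial sums shift down, each picking up one MAC per
# PE, and drain from the bottom row.
def ws_systolic_matmul(X, W):
    m, n = len(X), len(W)
    x_reg = [[0.0] * n for _ in range(n)]   # activation register per PE
    p_reg = [[0.0] * n for _ in range(n)]   # partial sum leaving each PE
    Y = [[0.0] * n for _ in range(m)]
    for t in range(m + 2 * n - 2):
        # 1. activations shift one PE right; skewed entry at the left edge
        for i in range(n):
            for j in range(n - 1, 0, -1):
                x_reg[i][j] = x_reg[i][j - 1]
            v = t - i                       # activation row entering row i now
            x_reg[i][0] = X[v][i] if 0 <= v < m else 0.0
        # 2. partial sums shift one PE down, accumulating their PE's MAC
        for j in range(n):
            for i in range(n - 1, 0, -1):
                p_reg[i][j] = p_reg[i - 1][j] + W[i][j] * x_reg[i][j]
            p_reg[0][j] = W[0][j] * x_reg[0][j]
        # 3. finished sums drain from the bottom row, one per column
        for j in range(n):
            v = t - j - n + 1
            if 0 <= v < m:
                Y[v][j] = p_reg[n - 1][j]
    return Y
```

Note that the "schedule" is nothing but the two skews (v + i on entry, v + j + n − 1 on drain) — everything in between is the wiring.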

Why "MXU" is not a synonym for "systolic array"

Internally the TPU MXU is a systolic array of MAC PEs. But there are several MXUs per chip (one per TensorCore in v2, two in v3, four in v4 and later), each fed by a vector unit and a scalar unit, with explicit accumulators and a weight FIFO. The systolic core is the compute; the rest of the MXU is the plumbing that keeps it fed. Most of the engineering effort in a TPU goes into the plumbing.

08

How a 256×256 Array Maps a Matmul

For TPU v1's single 256×256 array, computing C[M×N] = A[M×K] × B[K×N] with M = N = K = 256:

  1. Cycle 0: The weight matrix A (or for inference convention, the weight tile of B) is already loaded into the 65,536 PEs — one weight per PE.
  2. Cycles 0–255: Activations enter from the left, skewed: row 0 enters at cycle 0, row 1 at cycle 1, …, row 255 at cycle 255 — so once the skew is established, one new activation element enters each row per cycle.
  3. Cycles 0–510: A diagonal wavefront of valid compute sweeps the array. At cycle t the PEs on the anti-diagonal i+j = t are doing useful multiply-accumulate work; all others are still in fill or already drained.
  4. Cycles 255–510: Partial sums propagate downwards through the array, accumulating one term at a time. The output emerges from the bottom row, one column per cycle.
  5. Total latency: M + N + K − 2 = 766 cycles to fully process one 256×256×256 matmul on a 256×256 array. For this single tile, fill and drain account for roughly two-thirds of those cycles — the overhead only shrinks to a tiny fraction when long contractions or back-to-back tiles keep the array full.

The wavefront pattern

The visualisation on the next slide shows this concretely. The PEs that are firing at any one cycle form a moving anti-diagonal — a wavefront sweeping from the top-left to the bottom-right.

Why this is hard to write yourself

If you tried to schedule this on a CPU you'd have a giant nested loop, with explicit shift registers, careful boundary handling, and probably an off-by-one error somewhere on the skew. On a systolic array the structure of the chip is the schedule — the only thing the compiler emits is "feed these tiles, in this order, then read these accumulators". This is the property that makes XLA tractable; lowering an HLO matmul to TPU code is largely a tile-size choice, not a scheduling problem.

09

Interactive: Wavefront Visualiser

An 8×8 systolic array (smaller than the TPU's 128×128 MXU for visibility). Press Step to advance one cycle at a time and watch the diagonal wavefront move. The number in each cell is its current accumulator value as integer multiplies fire.

[Interactive widget: 8×8 weight-stationary array. Activations stream in from the left, skewed; partial sums propagate down and drain from the bottom row. Counters show cycle, PEs firing, PEs done, and utilisation.]
Yellow PE values are accumulating; green PEs have drained. The wavefront is the diagonal band of yellow. Notice that even on this small array, peak utilisation (at the centre of the wavefront) is reached only briefly — averaged over a single small matmul, utilisation sits around 50% until fill and drain are amortised across more work. This is why TPUs love big batches.
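The shape of those counters can be reproduced analytically: PE (i, j) fires for activation row v at cycle v + i + j, so the firing set each cycle is a band of anti-diagonals. A simple model of that schedule (ours — exact percentages depend on what the widget counts as "done"):

```python
# Count firing PEs per cycle for an n x n weight-stationary array streaming
# m activation rows: PE (i, j) fires for row v exactly at cycle v + i + j.
def utilisation(n, m):
    total_cycles = m + 2 * n - 2
    firing = []
    for t in range(total_cycles):
        count = sum(1 for i in range(n) for j in range(n)
                    for v in range(m) if i + j == t - v)
        firing.append(count)
    avg = sum(firing) / (n * n * total_cycles)   # fraction of peak, averaged
    return firing, avg

_, avg_short = utilisation(8, 8)    # one small matmul: fill/drain dominate
_, avg_long = utilisation(8, 64)    # stream 8x more rows: overhead amortised
print(f"{avg_short:.0%} vs {avg_long:.0%}")   # 36% vs 82% under this model
```

Streaming more activation rows through the same resident weights is the whole trick: the fill and drain diagonals are paid once, not per row.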

10

Throughput & Utilisation

The TPU v1's 256×256 array running at 700 MHz, with each PE doing one INT8 multiply and one INT32 accumulate per cycle:

PEs: 65,536 · Clock: 700 MHz · Ops/PE/cycle: 2 · Peak: ~92 TOPS

65,536 × 700 MHz × 2 ops/MAC = 91.75 TOPS, which matches the "92 TOPS" headline figure.

Pipeline-fill cost

For a single matmul of K contract dimension on a 256×256 array, useful cycles = K; fill+drain = ~512 cycles. So utilisation is approximately K / (K + 512). At K = 1024 this is 67%; at K = 4096 it's 89%; at K = 16384 it's 97%. This is the arithmetic-intensity argument that makes TPUs prefer large batches and long contraction dimensions.
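The utilisation curve above is one line of arithmetic; a quick check of the quoted figures (function name ours):

```python
# Utilisation of a 256-wide weight-stationary array as a function of the
# contraction length K: useful cycles = K, fill + drain ~ 2 * 256 = 512.
def mxu_utilisation(k, fill_drain=512):
    return k / (k + fill_drain)

for k in (1024, 4096, 16384):
    print(k, f"{mxu_utilisation(k):.0%}")   # 67%, 89%, 97%
```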

Why this scales

If you double the array (e.g. v2/v3 added a second TensorCore with its own MXUs), peak doubles. If you double the clock, peak doubles. If you double the precision (e.g. v7's FP8 vs bf16), the same MAC unit can do twice as many ops. Every TPU generation has used at least two of these levers.

The compiler's job

XLA's most important pass on a TPU target is tile-size selection: pick the largest tile that fits in VMEM and divides evenly into the activation and weight shapes, so that pipeline fill becomes a small percentage of total work. This is the main determinant of measured TFLOPS — you can be at 95% of peak on long contractions and 30% on short ones, all on the same chip.
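A cartoon of that pass — all sizes and candidate tiles here are hypothetical, and XLA's real pass also weighs padding, fusion, and layout:

```python
# Toy tile-size selection: pick the largest square tile that divides the
# matmul shapes evenly and whose working set (one weight tile + one
# activation tile + one output tile) fits a VMEM budget. Illustrative only.
def pick_tile(m, n, k, vmem_bytes, elem_bytes=2):
    best = None
    for t in (32, 64, 128, 256, 512):      # hypothetical candidate tiles
        if m % t or n % t or k % t:        # must divide all three dims
            continue
        working_set = 3 * t * t * elem_bytes   # weight + act + out tiles
        if working_set <= vmem_bytes:
            best = t                       # keep the largest that fits
    return best

# e.g. a 1024^3 bf16 matmul: a 16 MiB budget takes the 512 tile,
# a 256 KiB budget forces the tile down to 128.
```

The bigger the tile, the longer the effective contraction per weight load — which is exactly how the compiler pushes utilisation up the K / (K + 512) curve.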

11

Comparison with Other Accelerators

The TPU is the most prominent systolic-array accelerator but not the only one in the modern landscape. The other major designs make different choices — and most are not systolic.

Architecture | Compute primitive | Scheduling | Memory model
Google TPU | Systolic 2D MAC array (128×128 MXU on v2+, 256×256 on v1) | Static, compiler-issued (XLA HLO) | Software-managed VMEM/CMEM scratchpad + HBM
NVIDIA Tensor Cores | Per-SM MMA blocks (e.g. wgmma m64n256k16 on Hopper); internals undisclosed | Dynamic warp scheduler chooses per cycle | L1/L2 cache + register file + shared memory + HBM
Cerebras WSE-2/3 | Wafer-scale 2D mesh of independent PEs with SwarmX fabric | Spatial dataflow; static placement | 40 GiB on-wafer SRAM, no off-chip DRAM
Groq LPU | Deterministic VLIW streaming pipeline (TSP) | Static, deterministic, no caches at all | On-chip SRAM only; predictable cycle latency
SambaNova RDU | Reconfigurable Dataflow Architecture: pattern compute units + memory units | Compiler-mapped spatial dataflow | Tiered SRAM/HBM, mesh interconnect
Intel Gaudi 3 | MME (matrix multiply engine), systolic-style | Static + dynamic mix | HBM + SRAM scratchpad
Honest claim about NVIDIA tensor cores

NVIDIA does not publish tensor-core internals, and microbenchmarks have argued both ways. The safe statement: at the programmer-visible level they are MMA blocks (matrix-multiply-accumulate primitives, e.g. wmma.mma.sync.m16n16k16 on Volta+, wgmma on Hopper). They might be implemented as small systolic arrays, FMA-tree reducers, or a hybrid. The behavioural difference from a TPU MXU is large regardless: an SM has four small MMA blocks, while a TPU TensorCore has a handful of very large MXUs. The scheduling philosophy (warp-dynamic vs compiler-static) is the much bigger architectural divergence.

The lineage of "alumni" chips

Several modern non-TPU accelerators are explicit philosophical descendants of the TPU. Groq was founded by ex-TPU engineer Jonathan Ross; its deterministic philosophy is the TPU's static scheduling taken to its logical extreme. Tenstorrent's Tensix cores, Cerebras's PE meshes, even AWS Trainium — all of these companies cite the TPU papers in their whitepapers. The TPU is the chip whose ideas are now everywhere.

12

Cheat Sheet

Read next

Deck 04 — Inside TPU v1 takes apart the chip from this story slide by slide. Deck 09 — Memory & Numerics covers VMEM/CMEM and accumulator widths. Deck 11 — Software Stack covers XLA's role as the systolic-schedule compiler.