From Kung & Leiserson's 1978 paper to the 256×256 MAC grid at the heart of TPU v1, and the MXUs in every TPU since. Why this 47-year-old idea is the reason Google's chips work.
A systolic array is a regular grid of identical, simple processing elements (PEs) that pump data through themselves like blood through a heart: data moves, computation happens at each beat, and the result emerges from the boundary. The name is Kung's; the analogy is medical.
Remove the caches, branch predictors, and instruction decoders, replace them with a tile of multiply-accumulate units that fire every cycle, and you get an enormous amount of arithmetic per unit of silicon. TPU v1 packs 65,536 MAC units into a 28 nm die smaller than that of a contemporary mid-range CPU.
The paper that names the field is "Systolic Arrays (for VLSI)" by H.T. Kung and Charles Leiserson at Carnegie Mellon, circulated as CMU technical report CMU-CS-79-103 (1979) but written and presented in 1978 at the Sparse Matrix Symposium. A condensed version appears as Section 8.3 of Mead & Conway's Introduction to VLSI Systems (Addison-Wesley, 1980), the textbook that taught the world how to design chips.
Their Figure 1 shows a 1D linear array of MAC units computing y = Ax for a banded matrix A. Each PE holds one element of A. The vector x streams in from the left, partial sums of y stream out to the right, and on every clock each PE computes partial += a * x and forwards x. The 2D extension to dense matmul is one paragraph and one figure.
The paper closes with what would, decades later, become the TPU's design philosophy: "these algorithms can be implemented as special-purpose hardware that achieves much higher performance than general-purpose computers".
Four years later, Kung published a single-author review in IEEE Computer, "Why Systolic Architectures?" (1982), which became the field's most-cited paper. Its three contributions:
Naming it. "Systole" is the heart-pump phase that pushes blood through the body. The architecture's defining feature is that data moves on every clock and computes along the way.
Quantifying the I/O bottleneck. If you have N PEs you need I/O bandwidth proportional to N to keep them busy, unless you re-use each operand many times. Systolic arrays are the canonical answer: each input value visits O(N) PEs.
Tables of which kernels map to which structure. 1D linear arrays for FIR filters and 1D convolution; 2D meshes for matmul and 2D convolution; hex arrays for transitive closure; trees for sorting. Each row of the table is a future ASIC.
Kung's quantitative core: a systolic array of side N processing M elements gives you O(NM) compute work for O(M) I/O — arithmetic intensity of N. For a 256×256 array doing a matmul, that's 256 ops per byte of input bandwidth. The DRAM you have can keep up. This is the same arithmetic-intensity argument that justifies tensor cores today — it just took 35 years for a workload to need it badly enough to fund the silicon.
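A quick back-of-the-envelope makes the ratio concrete (illustrative numbers only: 1-byte INT8 operands and the TPU v1 clock are assumed for scale):

```python
# Arithmetic intensity of an N x N weight-stationary systolic tile:
# each streamed input element is re-used by N PEs as it crosses the array.
N = 256                          # array side
bytes_per_element = 1            # INT8 operand
macs_per_streamed_byte = N / bytes_per_element
print(macs_per_streamed_byte)    # 256 MACs per byte of streamed input

# Feed rate needed to keep the array busy: one fresh activation per row per cycle.
clock_hz = 700e6                 # TPU v1 clock, for scale
feed_gb_per_s = N * bytes_per_element * clock_hz / 1e9
print(feed_gb_per_s)             # ~179 GB/s from on-chip buffers; DRAM only supplies each operand once
```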
The 2017 ISCA TPU paper cites Kung 1982 explicitly, and Jouppi has said in talks that the v1 design started from a literal copy of the Kung-Leiserson 2D matrix-multiply array, scaled to 256×256 to fit a 28 nm die budget. The chip is, structurally, a 1978 idea on a 2014 process.
The first generation of systolic-array hardware, built by Kung's group at CMU, ran from 1985 to the early 1990s. Three machines:
| Machine | Year | Cells | Performance | Industrial partner |
|---|---|---|---|---|
| WW-Warp (Wire-Wrap) | 1985–86 | 2-cell prototype, then 10 | ~10 MFLOPS/cell | GE, Honeywell |
| PC-Warp (Printed Circuit) | 1987–88 | 10 cells | ~10 MFLOPS/cell | GE |
| iWarp | 1990–93 | up to ~64 cells per array | 20 MFLOPS/cell @ 20 MHz | Intel |
The iWarp project shipped about 39 systems before Intel exited in 1993. Commercially small, intellectually large — the entire engineering DNA of TPU v1 traces through Warp.
David Patterson has said in talks that the TPU programme treated the Warp/iWarp work as the canonical reference for "what a systolic-array supercomputer actually looks like at engineering scale" — including all the unglamorous parts (compiler, I/O, memory hierarchy, system integration) that the original 1978 paper did not need to address.
By 1996, systolic arrays were considered an architectural dead end. Three reasons converged:
An N×N array of MACs needs O(N) bytes per cycle to keep busy. By the early 90s memory bandwidth was falling exponentially behind compute; Wulf and McKee would name the gap the "memory wall" in 1995. Systolic arrays starved.
Pentium-class CPUs riding Dennard scaling were getting a free 1.5×/year on single-thread performance. A custom systolic ASIC that took two years to design was beaten by the next general-purpose Pentium before it shipped.
Outside specialised signal processing, no civilian workload needed 256×256 dense matmul. Without an application, no one funded the silicon. MasPar, Thinking Machines, Cray's MPP unit all failed between 1993 and 1996.
For nearly two decades systolic arrays survived only in narrow niches: Fourier-transform accelerators, beamforming ASICs in radar, video-compression chips, and a few academic exercises. They were taught in computer-architecture textbooks as a historical milestone, not as a current technique.
And then deep learning happened. By 2014 the dominant compute pattern in industrial ML was large dense matrix multiplications on small batches of low-precision integers — exactly the workload Kung had been trying to find since 1978. The TPU is what happens when a 36-year-old idea finally meets the workload it was waiting for.
For the matmul C = A × B, you have to choose what stays put in each PE while everything else streams through: the weights (weight-stationary), the accumulators (output-stationary), or rows of all three matrices (row-stationary). This is the field's main taxonomy.
Weight-stationary (WS). Pre-load the weight matrix A into the array; weights stay put. Activations B stream in skewed; partial sums propagate down (or right). Final outputs emerge at one boundary.
Used by: TPU MXU, NVDLA, most modern ML accelerators.
Wins: weights re-used across many activations — ideal when weights are large and shared.
Output-stationary (OS). Pin a partial accumulator in each PE. Both A and B stream through; each PE keeps adding to its own output. When done, drain the accumulators.
Used by: early DSP arrays, some FPGA matmul cores.
Wins: no accumulator data movement; minimum write bandwidth at the boundary.
Row-stationary (RS). MIT Eyeriss (ISCA 2016, Chen / Emer / Sze). Rows of A, B, and C stay local to a PE; partial sums travel diagonally. Hand-tuned to maximise data re-use across all three matrices.
Used by: Eyeriss; not adopted by industry chips at scale.
Wins: 1.4–2.5× energy improvement on AlexNet vs WS/OS in the original paper.
The dataflow choice is the single biggest architectural decision in any systolic accelerator. It determines bandwidth requirements at every level of the memory hierarchy, the layout of operands in SRAM, the optimal tile size, the latency to first output, and the pipeline-fill cost. Most public TPU descriptions get this right (weight-stationary); some popular write-ups call it output-stationary, which is incorrect — the TPU's MXU loads weights in advance and streams activations.
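A rough way to see the first two dataflows is as different orderings of the same triple loop. This is a software analogy, not how the hardware is wired, and the function names are illustrative; row-stationary does not reduce to a simple loop permutation, so it is omitted here.

```python
import numpy as np

def matmul_weight_stationary(A, B):
    """WS analogy: pin one weight B[k, n], stream every activation past it."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for k in range(K):
        for n in range(N):            # (k, n) selects the resident weight
            for m in range(M):        # activations stream past that weight
                C[m, n] += A[m, k] * B[k, n]
    return C

def matmul_output_stationary(A, B):
    """OS analogy: pin one accumulator C[m, n], stream A and B through it."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for m in range(M):
        for n in range(N):            # (m, n) selects the resident accumulator
            acc = 0.0
            for k in range(K):        # both operands stream through
                acc += A[m, k] * B[k, n]
            C[m, n] = acc
    return C

A, B = np.arange(6.0).reshape(2, 3), np.arange(12.0).reshape(3, 4)
assert np.allclose(matmul_weight_stationary(A, B), A @ B)
assert np.allclose(matmul_output_stationary(A, B), A @ B)
```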
The TPU MXU pre-loads weights into the array before each tile of work, then streams activations. Why?
For each tile: (1) DMA a 128×128 weight tile from HBM into the MXU through the weight FIFO, (2) feed activations from VMEM into the array's left edge skewed by row, (3) partial sums propagate top-to-bottom and accumulate in the bottom row, (4) drain results into accumulators or back to VMEM.
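A minimal cycle-level sketch of that weight-stationary schedule, in NumPy. This is a simulation of the dataflow rather than TPU code; the skewed edge feed, the per-cycle register update, and the drain convention are simplifying assumptions that follow the generic scheme above.

```python
import numpy as np

def systolic_matmul_ws(A, W):
    """Cycle-level sketch of a weight-stationary systolic matmul.

    A: activations, shape (M, K). W: weights, shape (K, N), pre-loaded one
    per PE. Activations enter the left edge skewed by row index, partial
    sums flow top-to-bottom, and results drain from the bottom row.
    """
    M, K = A.shape
    K2, N = W.shape
    assert K == K2

    act = np.zeros((K, N))    # east-bound activation register in each PE
    psum = np.zeros((K, N))   # south-bound partial-sum register in each PE
    C = np.zeros((M, N))

    for t in range(M + N + K - 2):            # fill + stream + drain
        # Activation arriving at each PE this cycle: the skewed left-edge
        # input for column 0, or the west neighbour's register otherwise.
        edge = np.array([A[t - k, k] if 0 <= t - k < M else 0.0
                         for k in range(K)])
        act_in = np.concatenate([edge[:, None], act[:, :-1]], axis=1)

        # One MAC per PE per cycle: north neighbour's partial sum (zero at
        # the top row) plus this PE's resident weight times the activation.
        psum_in = np.concatenate([np.zeros((1, N)), psum[:-1, :]], axis=0)
        psum_out = psum_in + W * act_in

        # C[m, n] exits the bottom of column n at cycle m + n + K - 1.
        for n in range(N):
            m = t - n - (K - 1)
            if 0 <= m < M:
                C[m, n] = psum_out[K - 1, n]

        act, psum = act_in, psum_out          # the clock edge

    return C

# Tiny check against NumPy's own matmul.
rng = np.random.default_rng(0)
A = rng.integers(-3, 4, size=(4, 8)).astype(float)
W = rng.integers(-3, 4, size=(8, 5)).astype(float)
assert np.allclose(systolic_matmul_ws(A, W), A @ W)
```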
Internally the TPU MXU is a systolic array of MAC PEs. But there are several MXUs per chip (one per TensorCore in v2, two in v3, four in v4 and later), each fed by a vector unit and a scalar unit, with explicit accumulators and a weight FIFO. The systolic core is the compute; the rest of the MXU is the plumbing that keeps it fed. Most of the engineering effort in a TPU goes into the plumbing.
For TPU v1's single 256×256 array, computing C[M×N] = A[M×K] × B[K×N] with M = N = K = 256:
A (or, in the inference convention, the weight tile of B) is already loaded into the 65,536 PEs, one weight per PE. The visualisation on the next slide shows this concretely. The PEs that are firing at any one cycle form a moving anti-diagonal: a wavefront sweeping from the top-left to the bottom-right.
If you tried to schedule this on a CPU you'd have a giant nested loop, with explicit shift registers, careful boundary handling, and probably an off-by-one error somewhere on the skew. On a systolic array the structure of the chip is the schedule — the only thing the compiler emits is "feed these tiles, in this order, then read these accumulators". This is the property that makes XLA tractable; lowering an HLO matmul to TPU code is largely a tile-size choice, not a scheduling problem.
An 8×8 systolic array (smaller than the TPU's 128×128 MXU for visibility). Press Step to advance one cycle at a time and watch the diagonal wavefront move. The number in each cell is its current accumulator value as integer multiplies fire.
Yellow PE values are accumulating; green PEs have drained. The wavefront is the diagonal band of yellow. Notice that even on this small array, peak utilisation (the centre of the wavefront) reaches 100% only briefly; averaged over a single matmul, utilisation is around 50% before fill and drain are amortised. This is why TPUs love big batches.
The TPU v1's 256×256 array running at 700 MHz, with each PE doing one INT8 multiply and one INT32 accumulate per cycle:
65,536 × 700 MHz × 2 ops/MAC = 91.75 TOPS, which matches the "92 TOPS" headline figure.
For a single matmul with contraction dimension K on a 256×256 array, useful cycles ≈ K and fill + drain ≈ 512 cycles, so utilisation is approximately K / (K + 512). At K = 1024 this is 67%; at K = 4096 it's 89%; at K = 16384 it's 97%. This is the utilisation argument that makes TPUs prefer large batches and long contraction dimensions.
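Both the headline number and the utilisation curve fall out of a few lines of arithmetic (a simplified model assuming one INT8 MAC per PE per cycle and a fixed fill-plus-drain cost of roughly two array lengths):

```python
# Peak throughput and utilisation model for TPU v1's 256x256 MXU.
# Simplified: one INT8 multiply-accumulate per PE per cycle, and a fixed
# fill-plus-drain overhead of ~2 * 256 cycles per matmul.
pes = 256 * 256
clock_hz = 700e6
ops_per_mac = 2                              # multiply + accumulate
peak_tops = pes * clock_hz * ops_per_mac / 1e12
print(f"peak: {peak_tops:.2f} TOPS")         # ~91.75, the "92 TOPS" headline

def utilisation(k, fill_drain=512):
    """Fraction of peak reached by one matmul with contraction length k."""
    return k / (k + fill_drain)

for k in (1024, 4096, 16384):
    print(k, f"{utilisation(k):.0%}")        # 67%, 89%, 97%
```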
If you double the array (e.g. v2/v3 added a second TensorCore with its own MXUs), peak doubles. If you double the clock, peak doubles. If you halve the precision (e.g. v7's FP8 vs bf16), the same MAC unit can do twice as many ops. Every TPU generation has used at least two of these levers.
XLA's most important pass on a TPU target is tile-size selection: pick the largest tile that fits in VMEM and divides evenly into the activation and weight shapes, so that pipeline fill becomes a small percentage of total work. This is the main determinant of measured TFLOPS — you can be at 95% of peak on long contractions and 30% on short ones, all on the same chip.
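A toy version of that lowering, with NumPy's matmul standing in for the MXU primitive. The 128 tile size and the function names are illustrative; real XLA tiling also weighs VMEM capacity and operand layout, which this sketch ignores.

```python
import numpy as np

# Toy tiling: decompose C = A @ B into TILE x TILE tiles and hand each pair
# to an "mxu" primitive (np.matmul here). Illustrative only; the exact-tiling
# assert sidesteps the padding and layout questions a real compiler handles.
TILE = 128

def tiled_matmul(A, B, mxu=np.matmul):
    M, K = A.shape
    _, N = B.shape
    assert M % TILE == 0 and N % TILE == 0 and K % TILE == 0
    C = np.zeros((M, N))
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            for k in range(0, K, TILE):          # accumulate over the contraction
                C[i:i+TILE, j:j+TILE] += mxu(A[i:i+TILE, k:k+TILE],
                                             B[k:k+TILE, j:j+TILE])
    return C

A = np.ones((256, 384))
B = np.ones((384, 256))
assert np.allclose(tiled_matmul(A, B), A @ B)
```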
The TPU is the most prominent systolic-array accelerator but not the only one in the modern landscape. The other major designs make different choices — and most are not systolic.
| Architecture | Compute primitive | Scheduling | Memory model |
|---|---|---|---|
| Google TPU | Systolic 2D MAC array (128×128 MXU on v2+, 256×256 on v1) | Static, compiler-issued (XLA HLO) | Software-managed VMEM/CMEM scratchpad + HBM |
| NVIDIA Tensor Cores | Per-SM MMA blocks (e.g. wgmma m64n256k16 on Hopper). Internals undisclosed. | Dynamic warp scheduler chooses per cycle | L1/L2 cache + register file + shared memory + HBM |
| Cerebras WSE-2/3 | Wafer-scale 2D mesh of independent PEs with SwarmX fabric | Spatial dataflow; static placement | 40 GiB on-wafer SRAM, no off-chip DRAM |
| Groq LPU | Deterministic VLIW streaming pipeline (TSP) | Static, deterministic, no caches at all | On-chip SRAM only; predictable cycle latency |
| SambaNova RDU | Reconfigurable Dataflow Architecture: pattern compute units + memory units | Compiler-mapped spatial dataflow | Tiered SRAM/HBM, mesh interconnect |
| Intel Gaudi 3 | MME (matrix multiply engine) — systolic-style | Static + dynamic mix | HBM + SRAM scratchpad |
NVIDIA does not publish tensor-core internals. Microbenchmarks have argued both ways. The safe statement: at the programmer-visible level they are MMA blocks (matrix-multiply-accumulate primitives, e.g. wmma.mma.sync.m16n16k16 on Volta+, wgmma on Hopper). They might be implemented as small systolic arrays, FMA-tree reducers, or a hybrid. The behavioural difference from a TPU MXU is large regardless: an SM has four of these blocks, while a TPU TensorCore has between one and four far larger ones. The scheduling philosophy (warp-dynamic vs compiler-static) is the much bigger architectural divergence.
Several modern non-TPU accelerators are explicit philosophical descendants of the TPU. Groq was founded by ex-TPU engineer Jonathan Ross; its deterministic philosophy is TPU's static scheduling taken to its logical extreme. Tenstorrent's Tensix cores, Cerebras's PE meshes, even AWS Trainium: all three companies cite the TPU papers in their whitepapers. The TPU is the chip whose ideas are now everywhere.
Deck 04 — Inside TPU v1 takes apart the chip from this story slide by slide. Deck 09 — Memory & Numerics covers VMEM/CMEM and accumulator widths. Deck 11 — Software Stack covers XLA's role as the systolic-schedule compiler.