Linear Algebra for AI — Presentation 03

Matrix Multiplication Deep-Dive

Four equivalent ways to compute the same product, batched matmul, the GEMM that dominates LLM compute, and the arithmetic-intensity argument that explains every choice in modern AI hardware.

matmul · block · batched · GEMM · arithmetic intensity · FLOPs
inner products (i,j) · column combinations · row combinations · sum of outer products · block matmul
00

Topics We'll Cover

The definition; four equivalent views (inner products, column combinations, row combinations, sum of outer products); block matmul; batched matmul; properties (associativity, distributivity, transpose, non-commutativity); an interactive 4×4 step-through; arithmetic intensity and the roofline; cheat sheet.

01

The Definition

For $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$, the product $C = AB \in \mathbb{R}^{m \times n}$ has entries

$$C_{ij} = \sum_{\ell=1}^{k} A_{i\ell}\, B_{\ell j}.$$

This is the textbook formula. Useful for proofs, useless as a mental model. The same operation has at least four equivalent geometric or structural interpretations — which one you pick changes how you think about every layer in a transformer.

Why so many views?

Each view bundles the same $mnk$ multiplications and $mn(k-1)$ additions into a different shape. The textbook formula computes one entry of $C$ at a time. Other views compute one column at a time, one row at a time, or assemble $C$ as a sum of $k$ rank-1 matrices. They are mathematically identical and computationally identical (same FLOP count); only the dependency structure differs — which matters for memory access, parallelism, and how you understand a transformer.

A non-trivial counting fact

Naive matmul costs $\Theta(mnk)$ multiply-adds. Strassen (1969) gave a divide-and-conquer recursion at $\Theta(n^{2.81})$ for $n\times n$. The current asymptotic record is around $\Theta(n^{2.371552})$ (Williams et al. 2024, refining work over decades). None of these are used in practice for ML — the cubic algorithm with cache-friendly tiling is faster on real hardware below $n \approx 1000$, and stays competitive much further. Modern hardware is fast enough that algorithmic constants matter more than asymptotic exponents.
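
For concreteness, here is a minimal Strassen sketch, assuming square inputs whose side is a power of two; the cutoff below which it falls back to the cubic kernel is an arbitrary illustrative choice:

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's recursion: 7 half-size products instead of 8.

    Assumes A and B are n x n with n a power of two; below `cutoff`
    it falls back to the ordinary cubic matmul, which wins there.
    """
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])

rng = np.random.default_rng(0)
A, B = rng.normal(size=(256, 256)), rng.normal(size=(256, 256))
assert np.allclose(strassen(A, B), A @ B)
```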

02

View 1 — Inner Products

The textbook formula, in words: $C_{ij}$ is the dot product of row $i$ of $A$ with column $j$ of $B$.

$$C_{ij} = \mathbf{a}_i^\top \mathbf{b}_j.$$

This is the row-times-column view. To compute one entry, you fly across one row and down one column, multiplying and summing.

When this view is right

The cost: one inner product is $k$ multiply-adds; $mn$ entries give $mnk$ MACs total.

03

View 2 — Linear Combinations of Columns

Each column of $C$ is a linear combination of the columns of $A$, with weights given by a column of $B$:

$$C_{:,j} = A\, B_{:,j} = \sum_\ell B_{\ell j}\, A_{:,\ell}.$$

You can read $AB$ as "apply $A$ to each column of $B$, separately, and stack the results". This is the picture you get if you think of $B$ as a batch of inputs and $A$ as a layer.

When this view is right

Mental model: $A$ is the rule; $B$ is a list of inputs; $C$ is the list of outputs.

04

View 3 — Linear Combinations of Rows

By transposition, each row of $C$ is a linear combination of the rows of $B$, with weights from a row of $A$:

$$C_{i,:} = A_{i,:} \, B = \sum_\ell A_{i\ell}\, B_{\ell,:}.$$

When this view is right

Views 2 and 3 are the same operation written from different sides of the equation. Knowing both makes you fluent in shape juggling.

05

View 4 — Sum of Outer Products

The most useful view for theory:

$$AB = \sum_{\ell=1}^{k} A_{:,\ell}\, B_{\ell,:}.$$

$AB$ is a sum of $k$ rank-1 matrices, one per "contracted" dimension — each is the outer product of a column of $A$ with a row of $B$.

Why this view is the most theoretically powerful

The four views, side by side

(1) $C_{ij}$ = dot of row of $A$ with col of $B$. One entry at a time.
(2) $C_{:,j}$ = $A$ times col of $B$. One column at a time.
(3) $C_{i,:}$ = row of $A$ times $B$. One row at a time.
(4) $C$ = $\sum$ of $k$ outer products. One contraction step at a time.
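
All four views are easy to check numerically. A minimal NumPy sketch (shapes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 3, 4, 5
A, B = rng.normal(size=(m, k)), rng.normal(size=(k, n))

# View 1: one entry at a time (dot of row i of A with column j of B).
C1 = np.array([[A[i, :] @ B[:, j] for j in range(n)] for i in range(m)])

# View 2: one column at a time (each column of C is A applied to a column of B).
C2 = np.stack([A @ B[:, j] for j in range(n)], axis=1)

# View 3: one row at a time (each row of C is a row of A applied to B).
C3 = np.stack([A[i, :] @ B for i in range(m)], axis=0)

# View 4: sum of k rank-1 outer products, one per contracted index.
C4 = sum(np.outer(A[:, ell], B[ell, :]) for ell in range(k))

assert all(np.allclose(C, A @ B) for C in (C1, C2, C3, C4))
```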

06

Block Matmul

If you partition $A$ and $B$ into blocks, matmul respects that partition exactly — provided the inner block sizes match. Schematically:

$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{pmatrix}.$$

The block formula is the recursive structure that turns matmul into a cache-friendly algorithm. You don't compute one entry at a time; you compute one tile at a time, where a tile fits in fast memory (registers, L1, shared memory, VMEM, …).

Why this is the core trick of every fast matmul implementation
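
A minimal NumPy sketch of the loop structure: each output tile is accumulated from small $A$- and $B$-tiles, so every loaded tile is reused many times while it sits in fast memory. The tile size here is an illustrative stand-in for whatever fits in registers or shared memory; real kernels do this far closer to the metal.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """C = A @ B computed one (tile x tile) output block at a time.

    Each output tile accumulates a sum of small A-tile @ B-tile products,
    which is exactly the block formula above.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n), dtype=np.result_type(A, B))
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for l0 in range(0, k, tile):
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, l0:l0 + tile] @ B[l0:l0 + tile, j0:j0 + tile]
                )
    return C

rng = np.random.default_rng(0)
A, B = rng.normal(size=(96, 80)), rng.normal(size=(80, 112))
assert np.allclose(tiled_matmul(A, B), A @ B)
```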

07

Batched Matmul (BMM)

Almost every matmul in an LLM is batched: a stack of matrices multiplied slice by slice along a shared batch dimension. In NumPy / PyTorch / JAX:

$$C[b, i, j] = \sum_\ell A[b, i, \ell]\, B[b, \ell, j]$$

i.e. a separate $m \times n$ matmul per batch index $b$. The batch dimension is "trivial" — the matmuls are independent — but it dominates how you actually schedule the FLOPs across an accelerator.

Where BMM lives in a transformer

Attention is the canonical example: the score matmul $QK^\top$ and the weighted sum $\mathrm{softmax}(QK^\top/\sqrt{d})\,V$ are both BMMs over the combined (batch × heads) dimension, one independent matmul per head per example. The MLP and projection matmuls are ordinary GEMMs applied token-wise.

Why einsum is the right notation

Once you have batches, heads, sequence, head-dim and feature dimension, you have five indices to keep track of. Writing BMMs in einsum is the only sustainable way: name every index, sum over the contracted ones, leave the rest. Deck 12 develops einsum and tensor algebra properly.
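
A sketch of the attention-shaped BMMs in einsum; the index letters (b = batch, h = heads, q/k = query/key positions, d = head dim) are one possible naming, not a fixed convention:

```python
import numpy as np

b, h, L, d = 2, 4, 16, 8            # batch, heads, sequence length, head dim
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(b, h, L, d)) for _ in range(3))

# Scores: contract the head dimension d, keep batch, head and both positions.
scores = np.einsum("bhqd,bhkd->bhqk", Q, K) / np.sqrt(d)

# Softmax over keys, then the weighted sum of values: contract the key position.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = np.einsum("bhqk,bhkd->bhqd", weights, V)

# The same two contractions as explicit BMMs over the leading (b, h) axes:
assert np.allclose(scores, Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d))
assert np.allclose(out, weights @ V)
```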

08

Properties: Associativity, Distributivity, Transpose

Associativity

$(AB)C = A(BC)$. Always true. But the cost can differ wildly.

$A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$, $C \in \mathbb{R}^{n \times p}$. Cost of $(AB)C$ is $mkn + mnp$; of $A(BC)$ is $knp + mkp$. For $m = 1$, $k = n = p = 1000$: $(AB)C$ costs ~2 million ops; $A(BC)$ costs ~$10^9$ + ~$10^6$ ≈ 1 billion ops, a 500× gap from a pair of parentheses.
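
A quick check with those shapes; the timing loop is only meant to make the asymmetry visible, not to benchmark anything:

```python
import time
import numpy as np

m, k, n, p = 1, 1000, 1000, 1000    # A: 1x1000, B: 1000x1000, C: 1000x1000
rng = np.random.default_rng(0)
A, B, C = rng.normal(size=(m, k)), rng.normal(size=(k, n)), rng.normal(size=(n, p))

print("multiply-adds (AB)C:", m * k * n + m * n * p)   # ~2 million
print("multiply-adds A(BC):", k * n * p + m * k * p)   # ~1 billion

for name, f in [("(AB)C", lambda: (A @ B) @ C), ("A(BC)", lambda: A @ (B @ C))]:
    t0 = time.perf_counter()
    f()
    print(name, f"{time.perf_counter() - t0:.4f} s")
```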

Distributivity

$A(B + C) = AB + AC$ and $(A + B)C = AC + BC$. Both always true. Underlies: residual streams (a sum of contributions) read out by a single matrix; ensemble averages; gradient accumulation.

Transpose

$(AB)^\top = B^\top A^\top$. Order reverses. This is the engine of backprop. If forward is $\mathbf{y} = W\mathbf{x}$, the backward (vector-Jacobian product) is $\nabla_{\mathbf{x}} = W^\top \nabla_\mathbf{y}$. Deck 09.
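
A small numerical sanity check of both facts; the finite-difference loss is purely illustrative, not how backprop is implemented:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(4, 5))
assert np.allclose((A @ B).T, B.T @ A.T)      # order reverses under transpose

# VJP for y = W x: if the upstream gradient is g = dL/dy, then dL/dx = W^T g.
W, x, g = rng.normal(size=(5, 3)), rng.normal(size=3), rng.normal(size=5)
loss = lambda x: g @ (W @ x)                  # scalar loss whose dL/dy is g
eps = 1e-6
fd = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
               for e in np.eye(3)])
assert np.allclose(W.T @ g, fd, atol=1e-5)
```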

Non-commutativity

In general $AB \neq BA$. They might not even have the same shape. Two cases where they commute: (a) both are diagonal, (b) both are powers of the same matrix. Otherwise treat $AB \neq BA$ as the rule.

Associativity matters for transformers

Computing $(QK^\top)V$ is $\Theta(L^2 d + L^2 d) = \Theta(L^2 d)$. Computing $Q(K^\top V)$ is $\Theta(d^2 L + L d^2) = \Theta(L d^2)$. For long sequences ($L \gg d$) the second is much faster — this is the linear attention trick, which holds when softmax is replaced by a kernel feature map (Performer, Linear Transformer, RWKV-Linear). Standard softmax attention forbids the swap because the softmax is non-linear.
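
A sketch of the reordering with the softmax removed, as the swap requires. The feature map $\phi(x) = \mathrm{elu}(x) + 1$ is one common choice, used here purely for illustration; a full linear-attention layer also has to reorder the normalisation:

```python
import numpy as np

L, d = 2048, 64
rng = np.random.default_rng(0)
phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1, illustrative kernel map
Q, K, V = (phi(rng.normal(size=(L, d))) for _ in range(3))

out_quadratic = (Q @ K.T) @ V    # builds an L x L matrix: Theta(L^2 d)
out_linear = Q @ (K.T @ V)       # builds a d x d matrix first: Theta(L d^2)

assert np.allclose(out_quadratic, out_linear)
```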

09

Interactive: Step Through a 4×4 Matmul

The widget computes $C = AB$ for $A, B \in \mathbb{R}^{4 \times 4}$, animating one of the four views. Pick a view and step through. Highlighted cells are what's being read; the green cell is what's being written.


10

Arithmetic Intensity & Why GEMM Wins

Let's count for an $n \times n$ matmul: $2n^3$ FLOPs ($n^3$ multiply-adds), while reading $A$ and $B$ and writing $C$ once each touches $3n^2$ values, i.e. $12n^2$ bytes in FP32. Arithmetic intensity: $2n^3 / 12n^2 = n/6$ FLOPs per byte.

For $n = 1024$ that's ~170 FLOPs per byte loaded. By contrast a vector add ($\mathbf{y} = \mathbf{x} + \mathbf{z}$) does one FLOP per ~12 bytes — intensity 0.08, hopelessly memory-bound. Matmul is the rare workload where compute can keep up with memory bandwidth, provided you re-use loaded operands instead of streaming through them once.
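
The same count in a few lines, using the FP32 byte accounting behind the numbers above:

```python
def matmul_intensity(n, bytes_per_elem=4):
    flops = 2 * n ** 3                         # n^3 multiply-adds
    bytes_moved = 3 * n ** 2 * bytes_per_elem  # read A and B, write C, once each
    return flops / bytes_moved                 # n / 6 for FP32

def vector_add_intensity(n, bytes_per_elem=4):
    return n / (3 * n * bytes_per_elem)        # 1 FLOP per 12 bytes in FP32

print(matmul_intensity(1024))        # ~170.7 FLOPs per byte
print(vector_add_intensity(1024))    # ~0.083 FLOPs per byte
```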

The roofline argument, briefly

An accelerator has peak compute $C^*$ FLOPs/sec and peak memory bandwidth $B^*$ bytes/sec. Above arithmetic intensity $C^* / B^*$ (the "ridge point"), you're compute-bound; below, memory-bound. For an H100: ~990 TFLOPS at FP16 and ~3.35 TB/s HBM → ridge ~300 FLOP/byte. An FP16 matmul with $n \ge 1800$ is comfortably compute-bound; a transformer layer-norm or activation function isn't. This is why everything in modern AI hardware is shaped to maximise matmul time and minimise everything else.
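
And the ridge-point arithmetic with the hardware figures quoted above, treated as approximate datasheet numbers:

```python
peak_flops = 990e12    # FP16 tensor-core peak quoted above, FLOPs/s
peak_bw = 3.35e12      # HBM bandwidth quoted above, bytes/s

ridge = peak_flops / peak_bw          # ~295 FLOPs per byte
print(f"ridge point: {ridge:.0f} FLOPs/byte")

# An n x n matmul has intensity 2n / (3 * bytes_per_element), so the smallest
# square matmul that clears the ridge is n = 1.5 * bytes_per_element * ridge.
for bytes_per_elem in (2, 4):         # FP16 vs FP32 storage
    n_min = 1.5 * bytes_per_elem * ridge
    print(f"{bytes_per_elem} bytes/elem: compute-bound for n >= {n_min:.0f}")
```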

The 4 lines that explain the AI hardware industry

(1) Matmul has favourable arithmetic intensity. (2) Matmul gets even better when you tile and re-use sub-tiles. (3) An accelerator that maximises tiled-matmul throughput per Watt wins. (4) The TPU and the GPU's tensor cores are both optimisations of (3).

11

Cheat Sheet

(1) Definition: $C_{ij} = \sum_\ell A_{i\ell} B_{\ell j}$, costing $mnk$ multiply-adds.
(2) Four views: entry = row of $A$ dotted with column of $B$; column of $C$ = $A$ times a column of $B$; row of $C$ = a row of $A$ times $B$; $C$ = sum of $k$ rank-1 outer products.
(3) Block matmul: the same formula holds tile by tile; this is the basis of every cache-friendly implementation.
(4) Batched matmul: one independent matmul per batch index; write it in einsum.
(5) Properties: $(AB)C = A(BC)$ but the costs differ; $A(B+C) = AB + AC$; $(AB)^\top = B^\top A^\top$; $AB \neq BA$ in general.
(6) Arithmetic intensity of an $n \times n$ matmul is about $n/6$ FLOPs per byte in FP32; above the hardware ridge point, matmul is compute-bound.

Read next

Deck 04 — Inner Products, Norms & Geometry takes the row-times-column dot-product picture from View 1 and develops it into a full geometric language: angles, lengths, orthogonality, the geometry behind cosine similarity and attention scores.