Four equivalent ways to compute the same product; batched matmul, the GEMM that dominates LLM compute; and the arithmetic-intensity argument that explains every choice in modern AI hardware.
For $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$, the product $C = AB \in \mathbb{R}^{m \times n}$ has entries
$$C_{ij} = \sum_{\ell=1}^{k} A_{i\ell}\, B_{\ell j}.$$
This is the textbook formula. Useful for proofs, useless as a mental model. The same operation has at least four equivalent geometric or structural interpretations — which one you pick changes how you think about every layer in a transformer.
Each view bundles the same $mnk$ multiplications and $mn(k-1)$ additions into a different shape. The textbook formula computes one entry of $C$ at a time. Other views compute one column at a time, one row at a time, or assemble $C$ as a sum of $k$ rank-1 matrices. They are mathematically identical and computationally identical (same FLOP count); only the dependency structure differs — which matters for memory access, parallelism, and how you understand a transformer.
Naive matmul costs $\Theta(mnk)$ multiply-adds. Strassen (1969) gave a divide-and-conquer recursion at $\Theta(n^{2.81})$ for $n\times n$. The current asymptotic record is around $\Theta(n^{2.371552})$ (Williams et al. 2024, refining work over decades). None of these are used in practice for ML — the cubic algorithm with cache-friendly tiling is faster on real hardware below $n \approx 1000$, and stays competitive much further. Modern hardware is fast enough that algorithmic constants matter more than asymptotic exponents.
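For reference, the idea behind Strassen's recursion, sketched in NumPy: seven half-size products instead of eight. This sketch assumes square matrices with power-of-two sides, and `leaf` is an arbitrary cutoff below which we fall back to the cubic `@` (in practice that cutoff is exactly why Strassen rarely pays off):

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Strassen's algorithm for square, power-of-two-sized matrices."""
    n = A.shape[0]
    if n <= leaf:
        return A @ B                      # cubic base case
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven recursive products (instead of the naive eight):
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])
```

The extra additions and the worse memory-access pattern are what the "algorithmic constants matter" remark is about.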
The textbook formula, in words: $C_{ij}$ is the dot product of row $i$ of $A$ with column $j$ of $B$.
$$C_{ij} = \mathbf{a}_i^\top \mathbf{b}_j.$$
This is the row-times-column view. To compute one entry, you fly across one row and down one column, multiplying and summing.
The cost: one inner product is $k$ multiply-adds; $mn$ entries gives $mnk$ MACs total.
Each column of $C$ is a linear combination of the columns of $A$, with weights given by a column of $B$:
$$C_{:,j} = A\, B_{:,j} = \sum_\ell B_{\ell j}\, A_{:,\ell}.$$
You can read $AB$ as "apply $A$ to each column of $B$, separately, and stack the results". This is the picture you get if you think of $B$ as a batch of inputs and $A$ as a layer.
Mental model: $A$ is the rule; $B$ is a list of inputs; $C$ is the list of outputs.
By transposition, each row of $C$ is a linear combination of the rows of $B$, with weights from a row of $A$:
$$C_{i,:} = A_{i,:} \, B = \sum_\ell A_{i\ell}\, B_{\ell,:}.$$
Views 2 and 3 are the same operation written from different sides of the equation. Knowing both makes you fluent in shape juggling.
The most useful view for theory:
$$AB = \sum_{\ell=1}^{k} A_{:,\ell}\, B_{\ell,:}.$$
$AB$ is a sum of $k$ rank-1 matrices, one per "contracted" dimension — each is the outer product of a column of $A$ with a row of $B$.
(1) $C_{ij}$ = dot of row of $A$ with col of $B$. One entry at a time.
(2) $C_{:,j}$ = $A$ times col of $B$. One column at a time.
(3) $C_{i,:}$ = row of $A$ times $B$. One row at a time.
(4) $C$ = $\sum$ of $k$ outer products. One contraction step at a time.
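The four views can be checked side by side. A NumPy sketch, loop-based for clarity rather than speed:

```python
import numpy as np

def matmul_views(A, B):
    """Compute C = A @ B four ways; all four return the same matrix."""
    m, k = A.shape
    _, n = B.shape
    # View 1: one entry at a time (row of A dot column of B).
    C1 = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            C1[i, j] = A[i, :] @ B[:, j]
    # View 2: one column at a time (A applied to each column of B).
    C2 = np.stack([A @ B[:, j] for j in range(n)], axis=1)
    # View 3: one row at a time (each row of A applied to B).
    C3 = np.stack([A[i, :] @ B for i in range(m)], axis=0)
    # View 4: sum of k rank-1 outer products.
    C4 = sum(np.outer(A[:, l], B[l, :]) for l in range(k))
    return C1, C2, C3, C4
```

Same FLOPs in every branch; only the loop order (the dependency structure) differs.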
If you partition $A$ and $B$ into blocks, matmul respects that partition exactly — provided the inner block sizes match. Schematically:
$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{pmatrix}.$$
The block formula is the recursive structure that turns matmul into a cache-friendly algorithm. You don't compute one entry at a time; you compute one tile at a time, where a tile fits in fast memory (registers, L1, shared memory, VMEM, …).
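A minimal sketch of the tiled idea in NumPy. Here `T` is an arbitrary tile size; real kernels choose it so three $T \times T$ tiles fit in fast memory:

```python
import numpy as np

def tiled_matmul(A, B, T=32):
    """Block matmul: accumulate C[i-tile, j-tile] += A-tile @ B-tile."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for i0 in range(0, m, T):
        for j0 in range(0, n, T):
            for l0 in range(0, k, T):
                # Each inner product of tiles is itself a small matmul.
                C[i0:i0+T, j0:j0+T] += A[i0:i0+T, l0:l0+T] @ B[l0:l0+T, j0:j0+T]
    return C
```

The payoff: each loaded tile participates in $T$ multiply-adds per element before being evicted, instead of one.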
Almost every matmul in an LLM is batched: a stack of matrices multiplied pairwise, one independent matmul per batch element. In NumPy / PyTorch / JAX:
$$C[b, i, j] = \sum_\ell A[b, i, \ell]\, B[b, \ell, j]$$
i.e. a separate $m \times n$ matmul per batch index $b$. The batch dimension is "trivial" — the matmuls are independent — but it dominates how you actually schedule the FLOPs across an accelerator.
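A quick NumPy check that the batched formula above is what `@` (i.e. `np.matmul`) computes on 3-D arrays; the shapes are arbitrary small examples:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4, 5))   # batch of 8 matrices, each 4 x 5
B = rng.standard_normal((8, 5, 3))   # batch of 8 matrices, each 5 x 3

C = A @ B                            # matmul over the last two dims, batched over b
C_loop = np.stack([A[b] @ B[b] for b in range(8)])   # one matmul per batch index
C_ein = np.einsum("bil,blj->bij", A, B)              # same thing, indices named
assert np.allclose(C, C_loop) and np.allclose(C, C_ein)
```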
Attention scores in code: `torch.einsum("bhld,bhmd->bhlm", Q, K)`. A linear layer over a batch of sequences: `torch.einsum("oi,bli->blo", W, x)`. Once you have batch, heads, sequence, head-dim and feature dimensions, you have five indices to keep track of. Writing BMMs in einsum is the only sustainable way: name every index, sum over the contracted ones, leave the rest. Deck 12 develops einsum and tensor algebra properly.
$(AB)C = A(BC)$. Always true. But the cost can differ wildly.
$A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$, $C \in \mathbb{R}^{n \times p}$. Cost of $(AB)C$ is $mkn + mnp$; of $A(BC)$ is $knp + mkp$. Take an outer product times a vector: $m = n = 1000$, $k = p = 1$. Then $(AB)C$ costs $10^6 + 10^6 \approx 2$ million ops, while $A(BC)$ costs $1000 + 1000 = 2000$ ops — a thousandfold gap from moving one pair of parentheses.
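The two costs are easy to tabulate. A small helper (the function name is ours):

```python
def chain_cost(m, k, n, p):
    """Multiply-add counts for the two parenthesisations of
    A (m x k) times B (k x n) times C (n x p)."""
    left = m * k * n + m * n * p    # (AB)C: form AB first, then multiply by C
    right = k * n * p + m * k * p   # A(BC): form BC first, then multiply by A
    return left, right
```

For the outer-product-times-vector case, `chain_cost(1000, 1, 1000, 1)` returns `(2000000, 2000)`: same result, a thousandfold cost difference.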
$A(B + C) = AB + AC$ and $(A + B)C = AC + BC$. Both always true. Underlies: residual streams (a sum of contributions) read out by a single matrix; ensemble averages; gradient accumulation.
$(AB)^\top = B^\top A^\top$. Order reverses. This is the engine of backprop. If forward is $\mathbf{y} = W\mathbf{x}$, the backward (vector-Jacobian product) is $\nabla_{\mathbf{x}} = W^\top \nabla_\mathbf{y}$. Deck 09.
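Both identities in a few lines of NumPy; the shapes and the upstream gradient `g_y` are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Transpose reverses order:
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
assert np.allclose((A @ B).T, B.T @ A.T)

# Forward y = W x; backward (vector-Jacobian product) grad_x = W^T grad_y.
W = rng.standard_normal((3, 5))
x = rng.standard_normal(5)
y = W @ x
g_y = rng.standard_normal(3)   # some upstream gradient dL/dy
g_x = W.T @ g_y                # gradient flows back through the transpose
```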
In general $AB \neq BA$. They might not even have the same shape. Two special cases where they do commute: (a) both diagonal of the same size, (b) both powers of the same square matrix. Otherwise treat $AB \neq BA$ as the rule.
Computing $(QK^\top)V$ is $\Theta(L^2 d + L^2 d) = \Theta(L^2 d)$. Computing $Q(K^\top V)$ is $\Theta(d^2 L + L d^2) = \Theta(L d^2)$. For long sequences ($L \gg d$) the second is much faster — this is the linear attention trick, which holds when softmax is replaced by a kernel feature map (Performer, Linear Transformer, RWKV-Linear). Standard softmax attention forbids the swap because the softmax is non-linear.
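The reassociation can be checked directly. Note this sketch deliberately omits the softmax — its absence is exactly the condition under which the swap is legal:

```python
import numpy as np

L, d = 2048, 64                      # long sequence, small head dimension
rng = np.random.default_rng(0)
Q = rng.standard_normal((L, d))
K = rng.standard_normal((L, d))
V = rng.standard_normal((L, d))

# Same result by associativity, very different cost:
out_quadratic = (Q @ K.T) @ V        # Theta(L^2 d): materialises an L x L matrix
out_linear = Q @ (K.T @ V)           # Theta(L d^2): materialises a d x d matrix
assert np.allclose(out_quadratic, out_linear)
```

With a softmax between `Q @ K.T` and `V`, the parentheses are pinned and the $L \times L$ matrix is unavoidable.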
The widget computes $C = AB$ for $A, B \in \mathbb{R}^{4 \times 4}$, animating one of the four views. Pick a view from the dropdown and step through: highlighted cells are what the algorithm is reading; the green cell is what it's writing.
Let's count for an $n \times n$ FP32 matmul: $2n^3$ FLOPs touch $3n^2$ numbers ($A$, $B$ and $C$), i.e. $12n^2$ bytes, for an arithmetic intensity of $2n^3 / 12n^2 = n/6$ FLOPs per byte.
For $n = 1024$ that's ~170 FLOPs per byte loaded. By contrast a vector add ($\mathbf{y} = \mathbf{x} + \mathbf{z}$) does one FLOP per ~12 bytes — intensity 0.08, hopelessly memory-bound. Matmul is the rare workload where compute can keep up with memory bandwidth, provided you re-use loaded operands instead of streaming through them once.
An accelerator has peak compute $C^*$ FLOPs/sec and peak memory bandwidth $B^*$ bytes/sec. Above arithmetic intensity $C^* / B^*$ (the "ridge point"), you're compute-bound; below, memory-bound. For an H100: ~990 TFLOPS at FP16 and ~3.35 TB/s HBM → ridge ~300 FLOP/byte. An FP16 matmul with $n \ge 1800$ is comfortably compute-bound; a transformer layer-norm or activation function isn't. This is why everything in modern AI hardware is shaped to maximise matmul time and minimise everything else.
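The back-of-envelope arithmetic as a sketch; the 2 bytes/element assumes FP16 and counts each of $A$, $B$, $C$ touched once (real kernels differ in detail, but the shape of the argument survives):

```python
def ridge_point(peak_flops, peak_bw_bytes):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_flops / peak_bw_bytes

def matmul_intensity(n, bytes_per_elem=2):
    """2n^3 FLOPs over 3n^2 elements for an n x n matmul -> n/6 FLOP/byte at FP16."""
    return 2 * n**3 / (3 * n**2 * bytes_per_elem)
```

With the H100 numbers from the text, `ridge_point(990e12, 3.35e12)` is about 296 FLOP/byte, and `matmul_intensity(1800)` is 600 — comfortably above the ridge, which is the claim in the paragraph above.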
(1) Matmul has favourable arithmetic intensity. (2) Matmul gets even better when you tile and re-use sub-tiles. (3) An accelerator that maximises tiled-matmul throughput per Watt wins. (4) The TPU and the GPU's tensor cores are both optimisations of (3).
Deck 04 — Inner Products, Norms & Geometry takes the row-times-column dot-product picture from View 1 and develops it into a full geometric language: angles, lengths, orthogonality, the geometry behind cosine similarity and attention scores.