Four equivalent ways to compute the same product; batched matmul, the GEMM that dominates LLM compute; and the arithmetic-intensity argument that explains every choice in modern AI hardware.
For $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$, the product $C = AB \in \mathbb{R}^{m \times n}$ has entries
$$C_{ij} = \sum_{\ell=1}^{k} A_{i\ell}\, B_{\ell j}.$$
This is the textbook formula. Useful for proofs, useless as a mental model. The same operation has at least four equivalent geometric or structural interpretations — which one you pick changes how you think about every layer in a transformer.
Each view bundles the same $mnk$ multiplications and $mn(k-1)$ additions into a different shape. The textbook formula computes one entry of $C$ at a time. Other views compute one column at a time, one row at a time, or assemble $C$ as a sum of $k$ rank-1 matrices. They are mathematically identical and computationally identical (same FLOP count); only the dependency structure differs — which matters for memory access, parallelism, and how you understand a transformer.
Naive matmul costs $\Theta(mnk)$ multiply-adds. Strassen (1969) gave a divide-and-conquer recursion at $\Theta(n^{2.81})$ for $n\times n$. The current asymptotic record is around $\Theta(n^{2.371552})$ (Williams et al. 2024, refining work over decades). None of these are used in practice for ML — the cubic algorithm with cache-friendly tiling is faster on real hardware below $n \approx 1000$, and stays competitive much further. Modern hardware is fast enough that algorithmic constants matter more than asymptotic exponents.
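For reference, the idea behind Strassen's recursion, sketched in NumPy: seven half-size products instead of eight. This sketch assumes square matrices with power-of-two sides, and `leaf` is an arbitrary cutoff below which we fall back to the cubic `@` (in practice that cutoff is exactly why Strassen rarely pays off):

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Strassen's algorithm for square, power-of-two-sized matrices."""
    n = A.shape[0]
    if n <= leaf:
        return A @ B                      # cubic base case
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven recursive products (instead of the naive eight):
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])
```

The extra additions and the worse memory-access pattern are what the "algorithmic constants matter" remark is about.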
The textbook formula, in words: $C_{ij}$ is the dot product of row $i$ of $A$ with column $j$ of $B$.
$$C_{ij} = \mathbf{a}_i^\top \mathbf{b}_j.$$
This is the row-times-column view. To compute one entry, you fly across one row and down one column, multiplying and summing.
The cost: one inner product is $k$ multiply-adds; $mn$ entries gives $mnk$ MACs total.
Each column of $C$ is a linear combination of the columns of $A$, with weights given by a column of $B$:
$$C_{:,j} = A\, B_{:,j} = \sum_\ell B_{\ell j}\, A_{:,\ell}.$$
You can read $AB$ as "apply $A$ to each column of $B$, separately, and stack the results". This is the picture you get if you think of $B$ as a batch of inputs and $A$ as a layer.
Mental model: $A$ is the rule; $B$ is a list of inputs; $C$ is the list of outputs.
By transposition, each row of $C$ is a linear combination of the rows of $B$, with weights from a row of $A$:
$$C_{i,:} = A_{i,:} \, B = \sum_\ell A_{i\ell}\, B_{\ell,:}.$$
Views 2 and 3 are the same operation written from different sides of the equation. Knowing both makes you fluent in shape juggling.
The most useful view for theory:
$$AB = \sum_{\ell=1}^{k} A_{:,\ell}\, B_{\ell,:}.$$
$AB$ is a sum of $k$ rank-1 matrices, one per "contracted" dimension — each is the outer product of a column of $A$ with a row of $B$.
(1) $C_{ij}$ = dot of row of $A$ with col of $B$. One entry at a time.
(2) $C_{:,j}$ = $A$ times col of $B$. One column at a time.
(3) $C_{i,:}$ = row of $A$ times $B$. One row at a time.
(4) $C$ = $\sum$ of $k$ outer products. One contraction step at a time.
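The four views can be checked side by side. A NumPy sketch, loop-based for clarity rather than speed:

```python
import numpy as np

def matmul_views(A, B):
    """Compute C = A @ B four ways; all four return the same matrix."""
    m, k = A.shape
    _, n = B.shape
    # View 1: one entry at a time (row of A dot column of B).
    C1 = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            C1[i, j] = A[i, :] @ B[:, j]
    # View 2: one column at a time (A applied to each column of B).
    C2 = np.stack([A @ B[:, j] for j in range(n)], axis=1)
    # View 3: one row at a time (each row of A applied to B).
    C3 = np.stack([A[i, :] @ B for i in range(m)], axis=0)
    # View 4: sum of k rank-1 outer products.
    C4 = sum(np.outer(A[:, l], B[l, :]) for l in range(k))
    return C1, C2, C3, C4
```

Same FLOPs in every branch; only the loop order (the dependency structure) differs.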
If you partition $A$ and $B$ into blocks, matmul respects that partition exactly — provided the inner block sizes match. Schematically:
$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{pmatrix}.$$
The block formula is the recursive structure that turns matmul into a cache-friendly algorithm. You don't compute one entry at a time; you compute one tile at a time, where a tile fits in fast memory (registers, L1, shared memory, VMEM, …).
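A minimal sketch of the tiled idea in NumPy. Here `T` is an arbitrary tile size; real kernels choose it so three $T \times T$ tiles fit in fast memory:

```python
import numpy as np

def tiled_matmul(A, B, T=32):
    """Block matmul: accumulate C[i-tile, j-tile] += A-tile @ B-tile."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for i0 in range(0, m, T):
        for j0 in range(0, n, T):
            for l0 in range(0, k, T):
                # Each inner product of tiles is itself a small matmul.
                C[i0:i0+T, j0:j0+T] += A[i0:i0+T, l0:l0+T] @ B[l0:l0+T, j0:j0+T]
    return C
```

The payoff: each loaded tile participates in $T$ multiply-adds per element before being evicted, instead of one.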
Almost every matmul in an LLM is batched: a stack of matrices multiplied pairwise, one independent matmul per batch element. In NumPy / PyTorch / JAX:
$$C[b, i, j] = \sum_\ell A[b, i, \ell]\, B[b, \ell, j]$$
i.e. a separate $m \times n$ matmul per batch index $b$. The batch dimension is "trivial" — the matmuls are independent — but it dominates how you actually schedule the FLOPs across an accelerator.
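A quick NumPy check that the batched formula above is what `@` (i.e. `np.matmul`) computes on 3-D arrays; the shapes are arbitrary small examples:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4, 5))   # batch of 8 matrices, each 4 x 5
B = rng.standard_normal((8, 5, 3))   # batch of 8 matrices, each 5 x 3

C = A @ B                            # matmul over the last two dims, batched over b
C_loop = np.stack([A[b] @ B[b] for b in range(8)])   # one matmul per batch index
C_ein = np.einsum("bil,blj->bij", A, B)              # same thing, indices named
assert np.allclose(C, C_loop) and np.allclose(C, C_ein)
```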
Attention scores in code: `torch.einsum("bhld,bhmd->bhlm", Q, K)`. A linear layer over a batch of sequences: `torch.einsum("oi,bli->blo", W, x)`. Once you have batch, heads, sequence, head-dim and feature dimensions, you have five indices to keep track of. Writing BMMs in einsum is the only sustainable way: name every index, sum over the contracted ones, leave the rest. Deck 12 develops einsum and tensor algebra properly.
$(AB)C = A(BC)$. Always true. But the cost can differ wildly.
$A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$, $C \in \mathbb{R}^{n \times p}$. Cost of $(AB)C$ is $mkn + mnp$; of $A(BC)$ is $knp + mkp$. Take an outer product times a vector: $m = n = 1000$, $k = p = 1$. Then $(AB)C$ costs $10^6 + 10^6 \approx 2$ million ops, while $A(BC)$ costs $1000 + 1000 = 2000$ ops — a thousandfold gap from moving one pair of parentheses.
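The two costs are easy to tabulate. A small helper (the function name is ours):

```python
def chain_cost(m, k, n, p):
    """Multiply-add counts for the two parenthesisations of
    A (m x k) times B (k x n) times C (n x p)."""
    left = m * k * n + m * n * p    # (AB)C: form AB first, then multiply by C
    right = k * n * p + m * k * p   # A(BC): form BC first, then multiply by A
    return left, right
```

For the outer-product-times-vector case, `chain_cost(1000, 1, 1000, 1)` returns `(2000000, 2000)`: same result, a thousandfold cost difference.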
$A(B + C) = AB + AC$ and $(A + B)C = AC + BC$. Both always true. Underlies: residual streams (a sum of contributions) read out by a single matrix; ensemble averages; gradient accumulation.
$(AB)^\top = B^\top A^\top$. Order reverses. This is the engine of backprop. If forward is $\mathbf{y} = W\mathbf{x}$, the backward (vector-Jacobian product) is $\nabla_{\mathbf{x}} = W^\top \nabla_\mathbf{y}$. Deck 09.
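Both identities in a few lines of NumPy; the shapes and the upstream gradient `g_y` are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Transpose reverses order:
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
assert np.allclose((A @ B).T, B.T @ A.T)

# Forward y = W x; backward (vector-Jacobian product) grad_x = W^T grad_y.
W = rng.standard_normal((3, 5))
x = rng.standard_normal(5)
y = W @ x
g_y = rng.standard_normal(3)   # some upstream gradient dL/dy
g_x = W.T @ g_y                # gradient flows back through the transpose
```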
In general $AB \neq BA$. They might not even have the same shape. Two special cases where they do commute: (a) both diagonal of the same size, (b) both powers of the same square matrix. Otherwise treat $AB \neq BA$ as the rule.
Computing $(QK^\top)V$ is $\Theta(L^2 d + L^2 d) = \Theta(L^2 d)$. Computing $Q(K^\top V)$ is $\Theta(d^2 L + L d^2) = \Theta(L d^2)$. For long sequences ($L \gg d$) the second is much faster — this is the linear attention trick, which holds when softmax is replaced by a kernel feature map (Performer, Linear Transformer, RWKV-Linear). Standard softmax attention forbids the swap because the softmax is non-linear.
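The reassociation can be checked directly. Note this sketch deliberately omits the softmax — its absence is exactly the condition under which the swap is legal:

```python
import numpy as np

L, d = 2048, 64                      # long sequence, small head dimension
rng = np.random.default_rng(0)
Q = rng.standard_normal((L, d))
K = rng.standard_normal((L, d))
V = rng.standard_normal((L, d))

# Same result by associativity, very different cost:
out_quadratic = (Q @ K.T) @ V        # Theta(L^2 d): materialises an L x L matrix
out_linear = Q @ (K.T @ V)           # Theta(L d^2): materialises a d x d matrix
assert np.allclose(out_quadratic, out_linear)
```

With a softmax between `Q @ K.T` and `V`, the parentheses are pinned and the $L \times L$ matrix is unavoidable.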
The widget computes $C = AB$ for $A, B \in \mathbb{R}^{4 \times 4}$, animating one of the four views. Pick a view from the dropdown and step through: highlighted cells are what the algorithm is reading; the green cell is what it's writing.
Let's count for an $n \times n$ FP32 matmul: $2n^3$ FLOPs touch $3n^2$ numbers ($A$, $B$ and $C$), i.e. $12n^2$ bytes, for an arithmetic intensity of $2n^3 / 12n^2 = n/6$ FLOPs per byte.
For $n = 1024$ that's ~170 FLOPs per byte loaded. By contrast a vector add ($\mathbf{y} = \mathbf{x} + \mathbf{z}$) does one FLOP per ~12 bytes — intensity 0.08, hopelessly memory-bound. Matmul is the rare workload where compute can keep up with memory bandwidth, provided you re-use loaded operands instead of streaming through them once.
An accelerator has peak compute $C^*$ FLOPs/sec and peak memory bandwidth $B^*$ bytes/sec. Above arithmetic intensity $C^* / B^*$ (the "ridge point"), you're compute-bound; below, memory-bound. For an H100: ~990 TFLOPS at FP16 and ~3.35 TB/s HBM → ridge ~300 FLOP/byte. An FP16 matmul with $n \ge 1800$ is comfortably compute-bound; a transformer layer-norm or activation function isn't. This is why everything in modern AI hardware is shaped to maximise matmul time and minimise everything else.
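The back-of-envelope arithmetic as a sketch; the 2 bytes/element assumes FP16 and counts each of $A$, $B$, $C$ touched once (real kernels differ in detail, but the shape of the argument survives):

```python
def ridge_point(peak_flops, peak_bw_bytes):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_flops / peak_bw_bytes

def matmul_intensity(n, bytes_per_elem=2):
    """2n^3 FLOPs over 3n^2 elements for an n x n matmul -> n/6 FLOP/byte at FP16."""
    return 2 * n**3 / (3 * n**2 * bytes_per_elem)
```

With the H100 numbers from the text, `ridge_point(990e12, 3.35e12)` is about 296 FLOP/byte, and `matmul_intensity(1800)` is 600 — comfortably above the ridge, which is the claim in the paragraph above.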
(1) Matmul has favourable arithmetic intensity. (2) Matmul gets even better when you tile and re-use sub-tiles. (3) An accelerator that maximises tiled-matmul throughput per Watt wins. (4) The TPU and the GPU's tensor cores are both optimisations of (3).
Deck 04 — Inner Products, Norms & Geometry takes the row-times-column dot-product picture from View 1 and develops it into a full geometric language: angles, lengths, orthogonality, the geometry behind cosine similarity and attention scores.