Linear Algebra for AI — Presentation 10

Attention as Linear Algebra

Q, K, V are learned projections. The attention matrix is row-stochastic. The output is a row-by-row weighted average of value vectors. Read this way, attention loses its mystery and becomes a piece of crystallised linear algebra.

00

Topics We'll Cover

Q, K, V as learned projections · scaled dot-product scores · rowwise softmax and row-stochastic matrices · multi-head attention · the causal mask · GQA

[Pipeline: X (B, L, d) → Q, K, V projections → QK⊤/√d_h → rowwise softmax → A·V → W_O]

01

Scaled Dot-Product Attention in One Equation

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_h}}\right) V.$$

Three matrices in, one matrix out. Shapes (single head, single batch element): $Q, K \in \mathbb{R}^{L \times d_h}$, $V \in \mathbb{R}^{L \times d_v}$ (usually $d_v = d_h$); the score matrix $QK^\top$ is $L \times L$, and the output is $L \times d_v$, one row per query token.

The whole equation is two matmuls and one rowwise softmax. We will look at each piece, in linear-algebra terms, in the next eight slides.

Two ways to read this equation

(1) Per query. For each query row $\mathbf{q}_i$, attention computes scores $\mathbf{q}_i^\top \mathbf{k}_j / \sqrt{d_h}$ for every key, softmaxes them across $j$, then weighted-averages the value rows. (2) Per key. Each key $\mathbf{k}_j$ contributes to every query in proportion to its score. Either reading is correct; both will appear in the slides ahead.
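A minimal NumPy sketch of the whole equation: two matmuls and a rowwise softmax. The function name and shapes are illustrative, not taken from any particular library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (L, d_h); V: (L, d_v). Returns an (L, d_v) output, one row per query."""
    d_h = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_h)              # (L, L) score matrix
    S = S - S.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    A = np.exp(S)
    A /= A.sum(axis=-1, keepdims=True)      # rowwise softmax -> row-stochastic A
    return A @ V                            # each output row: weighted average of V's rows

L, d_h = 5, 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((L, d_h))
K = rng.standard_normal((L, d_h))
V = rng.standard_normal((L, d_h))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 4)
```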

02

Q, K, V as Learned Projections

For an input residual stream $X \in \mathbb{R}^{L \times d_{model}}$ (one token per row, $d_{model} = $ hidden width), each head computes

$$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V,$$

with $W_Q, W_K, W_V \in \mathbb{R}^{d_{model} \times d_h}$. Each is a linear down-projection from the residual stream into a head-specific subspace of dimension $d_h$ (deck 05).

What each projection does

$W_Q$ — the question

Each row of $W_Q^\top$ is a "thing to look for" detector. $\mathbf{q}_i = W_Q^\top \mathbf{x}_i$ asks: "what about my context do I need to find?"

$W_K$ — the address

Each row of $W_K^\top$ produces a key tag. $\mathbf{k}_j$ says "this is what I, position $j$, advertise." The dot product $\mathbf{q}_i^\top \mathbf{k}_j$ measures advertise-vs-need similarity.

$W_V$ — the payload

$\mathbf{v}_j$ is the actual content that position $j$ contributes if attended to. The output is a weighted average of $V$'s rows.

Why three separate projections

Could $K$ just be $X$? It could, but it would tie the "address" of a token to its raw value. Separating $W_K$ and $W_V$ lets a token advertise something different from what it contributes. The Q/K split lets a query also be different from its own key — useful when you want a token to attend to itself or to other tokens in a non-symmetric way.

In mechanistic interpretability work (Elhage et al., Anthropic 2021), $W_Q W_K^\top$ and $W_V W_O$ are studied as the "QK circuit" and the "OV circuit" respectively — the learnable directions of attention behaviour live in these products, not in the individual matrices.
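A quick numerical check of that last point: the scores feeding the softmax depend on $W_Q$ and $W_K$ only through the product $W_Q W_K^\top$. All names and sizes below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
L, d_model, d_h = 6, 16, 4
X   = rng.standard_normal((L, d_model))
W_Q = rng.standard_normal((d_model, d_h))
W_K = rng.standard_normal((d_model, d_h))

S_via_heads   = (X @ W_Q) @ (X @ W_K).T      # q_i . k_j, computed head by head
S_via_circuit = X @ (W_Q @ W_K.T) @ X.T      # the same scores through the QK circuit
print(np.allclose(S_via_heads, S_via_circuit))  # True
```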

03

$QK^\top$ — the Score Matrix

The matrix $S = QK^\top \in \mathbb{R}^{L \times L}$ has entry $(i, j)$ equal to $\mathbf{q}_i^\top \mathbf{k}_j$ — a single scalar, the affinity between query $i$ and key $j$.

Read it as view 1 of matmul (deck 03): every entry is one dot product of a row of $Q$ with a row of $K$ (after transposing). Or read it as view 4: $S = \sum_d Q_{:,d}\, K_{:,d}^\top$ — a sum over head-dimensions of rank-1 outer products. Either reading works.

Properties of $S$

$S$ is $L \times L$, but its rank is at most $d_h$: it factors as an $L \times d_h$ matrix times a $d_h \times L$ matrix. It is not symmetric in general ($\mathbf{q}_i^\top \mathbf{k}_j \neq \mathbf{q}_j^\top \mathbf{k}_i$, because queries and keys come from different projections), and its entries are unbounded raw scores, not yet probabilities.

Why the rank constraint matters

A rank-$d_h$ score matrix can express only $d_h$ independent "lines of attention pattern". For tasks needing finer granularity, the model uses multiple heads to combine different rank-$d_h$ patterns. With $H = 32$ heads and $d_h = 128$ each, the effective combined rank is $\sum_h d_h = 4096$ — full $d_{model}$. The number of heads is not a hyperparameter chosen at random; it is the ratio of $d_{model}$ to a per-head budget $d_h$ kept small enough that the scaled scores stay out of the softmax's saturation regime.
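A small check, with arbitrary sizes, that the $L \times L$ score matrix really is rank-limited by the head dimension:

```python
import numpy as np

rng = np.random.default_rng(2)
L, d_h = 64, 8
Q = rng.standard_normal((L, d_h))
K = rng.standard_normal((L, d_h))
S = Q @ K.T                      # (64, 64) score matrix

print(np.linalg.matrix_rank(S))  # 8: bounded by the head dimension, not by L
```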

04

Softmax Rowwise — the Stochastic Matrix

Apply softmax to each row of $S/\sqrt{d_h}$:

$$A_{ij} = \frac{\exp(S_{ij}/\sqrt{d_h})}{\sum_{k} \exp(S_{ik}/\sqrt{d_h})}.$$

The result $A \in \mathbb{R}^{L \times L}$ is a row-stochastic matrix: every entry is in $[0, 1]$ and every row sums to 1.

Properties of a row-stochastic matrix

(1) Each row of $A$ is a probability distribution over the $L$ keys. (2) The all-ones vector is a right eigenvector with eigenvalue 1, and a product of row-stochastic matrices is again row-stochastic. (3) Each row of $AV$ is a convex combination of the rows of $V$, so the output never leaves the convex hull of the value vectors.

That last property is the load-bearing one. $A V$ doesn't invent new information — it averages existing value vectors. The model's job in choosing $A$ is to decide which average to take per query.

Why softmax and not something simpler

You could use ReLU or Sigmoid to make scores non-negative. But softmax has three properties that matter: (1) row-sums to 1, so the output stays in the convex hull of values regardless of score magnitudes; (2) exponential, so a small lead in score becomes a near-monopoly on probability mass — the "winner take most" property that gives sparse attention to the most relevant key; (3) the gradient-of-softmax-cross-entropy collapses to $\mathbf{p} - \mathbf{y}^\star$ (deck 09), which makes training pleasant. Linear attention (deck 03) replaces softmax with a kernel feature-map and gives up (1) and (2) for asymptotic FLOPs savings.
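A sketch of the rowwise softmax with an optional temperature (the same knob as the playground on slide 10), showing the row-sum and winner-take-most properties; the helper name is made up.

```python
import numpy as np

def softmax_rows(S, temperature=1.0):
    Z = S / temperature
    Z = Z - Z.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

S = np.array([[2.0, 1.0, 0.5],
              [0.1, 0.2, 0.15]])
A = softmax_rows(S)
print(A.sum(axis=-1))           # [1. 1.]: every row sums to 1
print(softmax_rows(S, 0.1)[0])  # ~[1, 0, 0]: a small lead in score becomes a near-monopoly
```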

05

$A V$ — Output as Weighted Sum of Values

The final attention output is the matrix product $O = AV \in \mathbb{R}^{L \times d_v}$. Reading this with view 3 of matmul (rows-as-combinations):

$$O_{i,:} = \sum_{j} A_{ij}\, V_{j,:}.$$

Each output token $i$ is a convex combination of the input value tokens, weighted by the attention probabilities. This is the cleanest possible reading: attention computes a token-specific weighted average of all token values.
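Row $i$ of $O$ really is $\sum_j A_{ij} V_{j,:}$; a short check with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(7)
L, d_v = 5, 4
A = rng.random((L, L)); A /= A.sum(axis=-1, keepdims=True)   # any row-stochastic pattern
V = rng.standard_normal((L, d_v))

O = A @ V
i = 2
row_i_by_hand = sum(A[i, j] * V[j] for j in range(L))         # convex combination of V's rows
print(np.allclose(O[i], row_i_by_hand))                        # True
```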

The OV circuit

Going back to the residual stream, the attention head writes

$$\Delta X = (A V) W_O = A X (W_V W_O).$$

The product $W_V W_O \in \mathbb{R}^{d_{model} \times d_{model}}$ is the OV circuit — the per-head linear map from "what's in the residual stream" to "what gets written back". Combined with the QK circuit $W_Q W_K^\top$ (which determines where attention flows), these two products are the right-sized abstractions for understanding what an attention head does mechanistically.

"Attention is information routing"

The QK circuit says which token to read from. The OV circuit says what content to copy. Together they implement a token-conditional routing of information across positions, with no new information created — only re-mixing of value-projections of existing residual-stream contents. This is the attention head's atomic operation.
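A numerical check of the identity $(AV)W_O = AX(W_V W_O)$ for a single head, using illustrative shapes and an arbitrary row-stochastic $A$:

```python
import numpy as np

rng = np.random.default_rng(3)
L, d_model, d_h = 6, 16, 4
X   = rng.standard_normal((L, d_model))
W_V = rng.standard_normal((d_model, d_h))
W_O = rng.standard_normal((d_h, d_model))
A   = rng.random((L, L)); A /= A.sum(axis=-1, keepdims=True)  # any row-stochastic pattern

lhs = (A @ (X @ W_V)) @ W_O       # the computation as usually written
rhs = A @ X @ (W_V @ W_O)         # the OV-circuit reading
print(np.allclose(lhs, rhs))      # True
```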

06

Multi-Head Attention as Block-Diagonal Projection

Multi-head attention runs $H$ heads in parallel, each with its own $W_Q^{(h)}, W_K^{(h)}, W_V^{(h)}$ of shape $\mathbb{R}^{d_{model} \times d_h}$ where $d_h = d_{model}/H$.

Stacking heads as block-columns:

$$W_Q = \big[\, W_Q^{(1)} \mid W_Q^{(2)} \mid \cdots \mid W_Q^{(H)} \,\big] \in \mathbb{R}^{d_{model} \times d_{model}}.$$

And similarly for $W_K, W_V$. Each head's projection is a down-projection to $d_h$; the concatenation recovers the full $d_{model}$. Per-head attention happens independently.
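A sketch, with toy sizes, checking that one stacked projection followed by a reshape gives the same per-head queries as $H$ separate projections:

```python
import numpy as np

rng = np.random.default_rng(4)
L, d_model, H = 6, 16, 4
d_h = d_model // H
X = rng.standard_normal((L, d_model))

# H separate per-head projection matrices ...
W_Q_heads = [rng.standard_normal((d_model, d_h)) for _ in range(H)]
# ... and their block-column concatenation
W_Q = np.concatenate(W_Q_heads, axis=1)                    # (d_model, d_model)

Q_stacked = (X @ W_Q).reshape(L, H, d_h)                   # one big matmul, then split into heads
Q_perhead = np.stack([X @ W for W in W_Q_heads], axis=1)   # H small matmuls
print(np.allclose(Q_stacked, Q_perhead))                   # True
```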

Why $H$ heads with $d_h$ each, instead of one head with $d_{model}$

A single full-width head would still produce one $L \times L$ attention pattern per layer, no matter how large $d_h$ is. $H$ smaller heads produce $H$ independent patterns (different places to look at once) for the same parameter count and the same FLOPs; the rank argument from slide 03 applies per head.

Standard head counts

Across standard model families, $d_h$ stays roughly constant at 64-128 while $H$ and $d_{model}$ grow with model size, a different scaling regime than either of the other two (see the sketch below).
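A tiny script making the per-head budget $d_h = d_{model}/H$ explicit for a few widely cited configurations; the specific numbers are quoted from memory, so treat them as illustrative rather than authoritative.

```python
# (d_model, H) for a few widely cited configurations (quoted from memory, illustrative only)
configs = {
    "GPT-2 small": (768, 12),
    "LLaMA-7B":    (4096, 32),
    "GPT-3 175B":  (12288, 96),
}
for name, (d_model, H) in configs.items():
    print(f"{name:12s}  d_model = {d_model:5d}   H = {H:3d}   d_h = {d_model // H}")
# d_h comes out at 64 or 128 every time: the per-head width barely moves as models scale
```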

07

$W_O$ — the Concat-and-Mix Step

Each head produces an output $O^{(h)} \in \mathbb{R}^{L \times d_h}$. The $H$ heads are concatenated along the feature dim,

$$O = \big[\, O^{(1)} \mid O^{(2)} \mid \cdots \mid O^{(H)} \,\big] \in \mathbb{R}^{L \times d_{model}},$$

then projected back to the residual stream by $W_O \in \mathbb{R}^{d_{model} \times d_{model}}$:

$$\mathrm{MHA}(X) = O W_O.$$

Block decomposition of $W_O$

Partition $W_O$ into $H$ horizontal blocks $W_O^{(1)}, \ldots, W_O^{(H)}$ each of shape $\mathbb{R}^{d_h \times d_{model}}$. Then

$$O W_O = \sum_{h=1}^H O^{(h)} W_O^{(h)}.$$

The full output is a sum over heads of "this head's output, projected up to $d_{model}$". Each head writes a contribution to the residual stream independently; the residual stream is the bus that adds them up.
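A check, with toy sizes, that concat-then-project equals the sum of per-head up-projections:

```python
import numpy as np

rng = np.random.default_rng(5)
L, d_model, H = 6, 16, 4
d_h = d_model // H

O_heads   = [rng.standard_normal((L, d_h)) for _ in range(H)]       # per-head outputs O^(h)
WO_blocks = [rng.standard_normal((d_h, d_model)) for _ in range(H)]  # horizontal blocks of W_O

O   = np.concatenate(O_heads, axis=1)        # (L, d_model): concat along the feature dim
W_O = np.concatenate(WO_blocks, axis=0)      # (d_model, d_model): blocks stacked vertically

concat_then_project = O @ W_O
sum_of_head_writes  = sum(Oh @ WOh for Oh, WOh in zip(O_heads, WO_blocks))
print(np.allclose(concat_then_project, sum_of_head_writes))  # True
```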

Why this decomposition matters

Reading attention this way makes a fact pop out: each head writes into the residual stream along its own column-space of $W_O^{(h)}$. Two heads with disjoint column spaces don't interfere; two heads with overlapping spaces compete. Mechanistic interpretability (Anthropic's "Mathematical Framework for Transformer Circuits") uses exactly this decomposition to understand what each head contributes.

08

Causal Mask, Padding Mask, Sliding Window

Attention as defined would let token $i$ peek at every other token, including future ones. For autoregressive language modelling we forbid future-peeking by adding a mask to the score matrix before softmax:

$$S' = S + M,$$

where $M_{ij}$ is $0$ for allowed $(i, j)$ pairs and $-\infty$ for forbidden ones. After exp-and-normalise, forbidden positions get probability 0.

Causal mask

$M_{ij} = 0$ if $j \le i$, else $-\infty$. Lower-triangular allowed region. Used in every decoder LLM.

Padding mask

$M_{ij} = -\infty$ if $j$ is a pad token. Lets you batch sequences of different lengths.

Sliding-window

$M_{ij} = -\infty$ if $|i - j| > w$. Used by Mistral-7B (w = 4096), Longformer, etc. Per-token compute drops from $O(L)$ to $O(w)$.
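A sketch of building and applying these masks in NumPy; helper names are made up and sizes are illustrative.

```python
import numpy as np

def causal_mask(L):
    """0 where j <= i, -inf where j > i (lower-triangular allowed region)."""
    return np.where(np.tril(np.ones((L, L), dtype=bool)), 0.0, -np.inf)

def sliding_window_mask(L, w):
    """-inf wherever |i - j| > w; combine with causal_mask for a causal window."""
    i, j = np.indices((L, L))
    return np.where(np.abs(i - j) <= w, 0.0, -np.inf)

L = 5
S = np.random.default_rng(6).standard_normal((L, L))   # stand-in for QK^T / sqrt(d_h)
S_masked = S + causal_mask(L)                          # forbidden entries become -inf
A = np.exp(S_masked - S_masked.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)
print(np.triu(A, k=1).max())   # 0.0: no probability mass above the diagonal
```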

The causal mask doesn't change the FLOPs

You still compute the full $L \times L$ score matrix, then mask. Naive attention is $O(L^2)$ in compute and memory regardless. FlashAttention (deck 12) tiles the computation so the full $L \times L$ score matrix is never materialised (and fully-masked tiles can be skipped), which is where the actual savings come from.

Subtler structured masks

09

GQA & MQA — Sharing K and V

Standard MHA has $H$ separate $W_K, W_V$ projections, so the KV cache has $H \cdot d_h \cdot L = d_{model} \cdot L$ values per layer (per K and per V). For long-context inference this dominates memory.

The fix: share K and V across heads

MHA — full

$H$ Q-heads, $H$ K-heads, $H$ V-heads. KV cache = $2 H d_h L = 2 d_{model} L$ floats.

GQA — grouped

$H$ Q-heads, $G < H$ K-heads, $G$ V-heads. Each K, V pair is shared by a group of $H/G$ Q-heads. KV cache = $2 G d_h L = 2(d_{model}/H)\cdot G \cdot L$.

Used by Llama-2 70B, Llama-3, Mixtral, GPT-4-class.

MQA — multi-query

$G = 1$ — one K and one V shared by all heads. KV cache = $2 d_h L$, cut by a factor of $H$. Used by PaLM, Falcon. Some quality drop when training from scratch; almost none if introduced after the fact.

The numbers for Llama-3 70B
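A back-of-the-envelope sketch of the KV-cache sizes for the three variants. The Llama-3-70B-like values below ($d_{model} = 8192$, $H = 64$, $d_h = 128$, $G = 8$) are quoted from memory, so treat them as assumptions rather than official figures.

```python
# KV-cache entries per layer, per sequence: 2 * (#KV heads) * d_h * L   (the 2 counts K and V)
d_model, H, d_h, G = 8192, 64, 128, 8   # Llama-3-70B-like values (assumed, quoted from memory)
L = 8192                                # context length, illustrative

for name, kv_heads in [("MHA", H), (f"GQA (G={G})", G), ("MQA", 1)]:
    entries = 2 * kv_heads * d_h * L
    mib = entries * 2 / 2**20           # fp16/bf16: 2 bytes per entry
    print(f"{name:10s}  KV heads = {kv_heads:3d}   cache/layer = {mib:7.1f} MiB")
# MHA: 256 MiB/layer; GQA: 32 MiB (8x smaller); MQA: 4 MiB (64x smaller)
```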

GQA is a low-rank trick (deck 07) applied to a different structure: instead of compressing $W_K$ to low rank, we constrain different heads to share the same $W_K$. Same Eckart-Young flavour; different parameterisation.

10

Interactive: Attention Playground

Drag the temperature slider to see how softmax behaves as scores get larger or smaller; switch between unmasked, causal, and sliding-window. The right matrix is the resulting attention pattern $A$ (rows = queries, columns = keys, brightness = probability).

[Interactive widget: temperature slider; panels show the score matrix $S = QK^\top/\sqrt{d_h}$, the applied mask, and the attention pattern $A = \mathrm{softmax}(S)$.]

The attention matrix on the right is row-stochastic: each row sums to 1. Crank temperature high to see attention spread (uniform-like rows); crank it low to see attention sharpen onto a single key per query.

11

Cheat Sheet

$Q = XW_Q$, $K = XW_K$, $V = XW_V$ with $W_Q, W_K, W_V \in \mathbb{R}^{d_{model} \times d_h}$. Scores $S = QK^\top/\sqrt{d_h}$ have rank at most $d_h$. $A = \mathrm{softmax}_{\mathrm{row}}(S + M)$ is row-stochastic; $O = AV$ makes each output row a convex combination of value rows. Multi-head: $\mathrm{MHA}(X) = \sum_h A^{(h)} X\, W_V^{(h)} W_O^{(h)}$; the QK circuit $W_Q W_K^\top$ decides where attention flows, the OV circuit $W_V W_O$ decides what gets written back. Causal mask: $-\infty$ above the diagonal. GQA/MQA: share K, V projections across Q-heads to shrink the KV cache.

Read next

Deck 11 — Transformer Block Anatomy walks a tensor through the full block: pre-norm, MHA with all four projections, residual, FFN with up-then-down, residual. Every shape, every matmul, in one place.