Q, K, V are learned projections. The attention matrix is row-stochastic. The output is a row-by-row weighted average of value vectors. Read this way, attention loses its mystery and becomes a piece of crystallised linear algebra.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_h}}\right) V.$$
Three matrices in, one matrix out. Shapes (single head, single batch element): $Q, K, V \in \mathbb{R}^{L \times d_h}$ and output $\in \mathbb{R}^{L \times d_h}$, with $L$ the sequence length and $d_h$ the head width.
The whole equation is two matmuls and one rowwise softmax. We will look at each piece, in linear-algebra terms, in the next eight slides.
(1) Per query. For each query row $\mathbf{q}_i$, attention computes scores $\mathbf{q}_i^\top \mathbf{k}_j / \sqrt{d_h}$ for every key, softmaxes them across $j$, then weighted-averages the value rows. (2) Per key. Each key $\mathbf{k}_j$ contributes to every query in proportion to its score. Either reading is correct; both will appear in the slides ahead.
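Both readings are visible in a minimal NumPy sketch of the equation (variable names are illustrative; the rowwise max-subtraction is a standard numerical-stability trick, not part of the formula):

```python
import numpy as np

def attention(Q, K, V):
    """Single-head attention: two matmuls and one rowwise softmax.
    Q, K, V: (L, d_h) arrays; returns an (L, d_h) array."""
    d_h = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_h)                 # (L, L) scores: q_i . k_j / sqrt(d_h)
    S = S - S.max(axis=-1, keepdims=True)      # subtract rowwise max to stabilise exp
    A = np.exp(S)
    A = A / A.sum(axis=-1, keepdims=True)      # rowwise softmax: A is row-stochastic
    return A @ V                               # each output row: weighted avg of V's rows
```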
For an input residual stream $X \in \mathbb{R}^{L \times d_{model}}$ (one token per row, $d_{model} = $ hidden width), each head computes
$$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V,$$
with $W_Q, W_K, W_V \in \mathbb{R}^{d_{model} \times d_h}$. Each is a linear down-projection from the residual stream into a head-specific subspace of dimension $d_h$ (deck 05).
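Extending the sketch above with the three projections (toy sizes, chosen arbitrarily):

```python
rng = np.random.default_rng(0)
L, d_model, d_h = 10, 64, 16
X = rng.standard_normal((L, d_model))          # residual stream: one token per row
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_h)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # three down-projections into head space
out = attention(Q, K, V)                       # (L, d_h)
```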
Each row of $W_Q^\top$ is a "thing to look for" detector. $\mathbf{q}_i = W_Q^\top \mathbf{x}_i$ asks: "what about my context do I need to find?"
Each row of $W_K^\top$ produces a key tag. $\mathbf{k}_j$ says "this is what I, position $j$, advertise." The dot product $\mathbf{q}_i^\top \mathbf{k}_j$ measures advertise-vs-need similarity.
$\mathbf{v}_j$ is the actual content position $j$ contributes if attended to. The output is a weighted average of $V$'s rows.
Could $K$ just be $X$? It could, but it would tie the "address" of a token to its raw value. Separating $W_K$ and $W_V$ lets a token advertise something different from what it contributes. The Q/K split lets a query also be different from its own key — useful when you want a token to attend to itself or to other tokens in a non-symmetric way.
Empirically: in mechanistic interpretability work (Elhage et al., Anthropic 2021), $W_Q W_K^\top$ and $W_V W_O$ are studied as the "QK circuit" and "OV circuit" respectively: the actual learnable directions of attention behaviour live in these products, not in the individual matrices.
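A quick numerical check of that claim, reusing the toy arrays above: the scores depend on $W_Q$ and $W_K$ only through their product.

```python
# Scores depend on W_Q and W_K only through the fused QK circuit:
QK_circuit = W_Q @ W_K.T                       # (d_model, d_model), rank <= d_h
S_separate = (X @ W_Q) @ (X @ W_K).T           # compute Q and K, then scores
S_fused    = X @ QK_circuit @ X.T              # same scores through one matrix
assert np.allclose(S_separate, S_fused)
```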
The matrix $S = QK^\top \in \mathbb{R}^{L \times L}$ has entry $(i, j)$ equal to $\mathbf{q}_i^\top \mathbf{k}_j$ — a single scalar, the affinity between query $i$ and key $j$.
Read it as view 1 of matmul (deck 03): every entry is one dot product of a row of $Q$ with a row of $K$ (after transposing). Or read it as view 4: $S = \sum_d Q_{:,d}\, K_{:,d}^\top$ — a sum over head-dimensions of rank-1 outer products. Either reading works.
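Both views, checked numerically on the toy arrays above:

```python
S = Q @ K.T                                    # view 1: entry (i, j) = q_i . k_j
S_rank1 = sum(np.outer(Q[:, d], K[:, d]) for d in range(d_h))
assert np.allclose(S, S_rank1)                 # view 4: sum of d_h rank-1 outer products
```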
A rank-$d_h$ score matrix can express only $d_h$ independent "lines of attention pattern". For tasks needing finer granularity, the model runs multiple heads and combines their different rank-$d_h$ patterns. With $H = 32$ heads of $d_h = 128$ each, the heads jointly cover $\sum_h d_h = 4096$ dimensions: the full $d_{model}$. The number of heads is not a hyperparameter chosen at random: $H = d_{model}/d_h$, where $d_h$ is a per-head budget that stays roughly fixed across model scales (more on this below).
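The rank cap in one line, with fresh toy sizes chosen so that $L \gg d_h$:

```python
Lb, db = 64, 8                                 # sequence length 64, head width 8
Sb = rng.standard_normal((Lb, db)) @ rng.standard_normal((Lb, db)).T
print(np.linalg.matrix_rank(Sb))               # 8: a 64 x 64 score matrix, rank capped at d_h
```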
Apply softmax to each row of $S/\sqrt{d_h}$:
$$A_{ij} = \frac{\exp(S_{ij}/\sqrt{d_h})}{\sum_{j'} \exp(S_{ij'}/\sqrt{d_h})}.$$
The result $A \in \mathbb{R}^{L \times L}$ is a row-stochastic matrix: every entry is in $[0, 1]$ and every row sums to 1.
That last property is the load-bearing one. $A V$ doesn't invent new information — it averages existing value vectors. The model's job in choosing $A$ is to decide which average to take per query.
You could use ReLU or sigmoid to make scores non-negative. But softmax has three properties that matter: (1) each row sums to 1, so the output stays in the convex hull of the value rows regardless of score magnitudes; (2) it is exponential, so a small lead in score becomes a near-monopoly on probability mass, the "winner take most" behaviour that concentrates each query's attention on its most relevant keys; (3) the gradient of softmax followed by cross-entropy collapses to $\mathbf{p} - \mathbf{y}^\star$ (deck 09), which makes training pleasant. Linear attention (deck 03) replaces softmax with a kernel feature map and gives up (1) and (2) for asymptotic FLOPs savings.
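Property (2) in a two-line demo (the numbers are illustrative):

```python
def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.0])))      # ~[0.665, 0.245, 0.090]: modest lead
print(softmax(np.array([20.0, 10.0, 0.0])))    # ~[1.000, 0.000, 0.000]: winner takes most
```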
The final attention output is the matrix product $O = AV \in \mathbb{R}^{L \times d_h}$. Reading this with view 3 of matmul (rows-as-combinations):
$$O_{i,:} = \sum_{j} A_{ij}\, V_{j,:}.$$
Each output token $i$ is a convex combination of the input value tokens, weighted by the attention probabilities. This is the cleanest possible reading: attention computes a token-specific weighted average of all token values.
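Convexity is checkable: every output coordinate must sit inside the per-dimension range of $V$'s rows. Reusing the toy arrays above:

```python
O = attention(Q, K, V)                         # from the earlier sketch
assert np.all(O.min(axis=0) >= V.min(axis=0))  # each output coordinate stays within
assert np.all(O.max(axis=0) <= V.max(axis=0))  # the range spanned by V's rows
```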
Going back to the residual stream, the attention head writes
$$\Delta X = (A V) W_O = A X (W_V W_O).$$
The product $W_V W_O \in \mathbb{R}^{d_{model} \times d_{model}}$ is the OV circuit — the per-head linear map from "what's in the residual stream" to "what gets written back". Combined with the QK circuit $W_Q W_K^\top$ (which determines where attention flows), these two products are the right-sized abstractions for understanding what an attention head does mechanistically.
The QK circuit says which token to read from. The OV circuit says what content to copy. Together they implement a token-conditional routing of information across positions, with no new information created — only re-mixing of value-projections of existing residual-stream contents. This is the attention head's atomic operation.
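The OV-circuit identity, verified with the toy arrays above ($W_O^{(h)}$ here is a made-up per-head block, and $A$ any row-stochastic pattern):

```python
W_O_h = rng.standard_normal((d_h, d_model))    # this head's block of W_O
A = rng.random((L, L))
A = A / A.sum(axis=-1, keepdims=True)          # any row-stochastic attention pattern
dX_two_step = (A @ (X @ W_V)) @ W_O_h          # project to values, attend, project up
dX_circuit  = A @ (X @ (W_V @ W_O_h))          # same write through the fused OV circuit
assert np.allclose(dX_two_step, dX_circuit)
```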
Multi-head attention runs $H$ heads in parallel, each with its own $W_Q^{(h)}, W_K^{(h)}, W_V^{(h)}$ of shape $\mathbb{R}^{d_{model} \times d_h}$ where $d_h = d_{model}/H$.
Stacking heads as block-columns:
$$W_Q = \big[\, W_Q^{(1)} \mid W_Q^{(2)} \mid \cdots \mid W_Q^{(H)} \,\big] \in \mathbb{R}^{d_{model} \times d_{model}}.$$
And similarly for $W_K, W_V$. Each head's projection is a down-projection to $d_h$; the concatenation recovers the full $d_{model}$. Per-head attention happens independently.
Note $d_h$ stays roughly constant at 64-128 across model sizes — a different scaling regime than $H$ or $d_{model}$.
Each head produces an output $O^{(h)} \in \mathbb{R}^{L \times d_h}$. The $H$ heads are concatenated along the feature dim,
$$O = \big[\, O^{(1)} \mid O^{(2)} \mid \cdots \mid O^{(H)} \,\big] \in \mathbb{R}^{L \times d_{model}},$$
then projected back to the residual stream by $W_O \in \mathbb{R}^{d_{model} \times d_{model}}$:
$$\mathrm{MHA}(X) = O W_O.$$
Partition $W_O$ into $H$ horizontal blocks $W_O^{(1)}, \ldots, W_O^{(H)}$ each of shape $\mathbb{R}^{d_h \times d_{model}}$. Then
$$O W_O = \sum_{h=1}^H O^{(h)} W_O^{(h)}.$$
The full output is a sum over heads of "this head's output, projected up to $d_{model}$". Each head writes a contribution to the residual stream independently; the residual stream is the bus that adds them up.
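Concat-then-project equals sum-of-head-writes; a sketch with toy sizes matching the arrays above:

```python
H = 4
d_h = d_model // H                             # 16, matching the toy sizes above
heads = [rng.standard_normal((L, d_h)) for _ in range(H)]   # per-head outputs O^(h)
W_O_full = rng.standard_normal((d_model, d_model))

concat_then_project = np.concatenate(heads, axis=1) @ W_O_full
sum_of_head_writes = sum(
    O_h @ W_O_full[h * d_h:(h + 1) * d_h, :]   # h-th horizontal block of W_O
    for h, O_h in enumerate(heads)
)
assert np.allclose(concat_then_project, sum_of_head_writes)
```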
Reading attention this way makes a fact pop out: each head writes into the residual stream along the row space of its block $W_O^{(h)}$. Two heads with disjoint write subspaces don't interfere; two heads with overlapping subspaces compete. Mechanistic interpretability (Anthropic's "A Mathematical Framework for Transformer Circuits") uses exactly this decomposition to understand what each head contributes.
Attention as defined would let token $i$ peek at every other token, including future ones. For autoregressive language modelling we forbid future-peeking by adding a mask to the score matrix before softmax:
$$S' = S + M,$$
where $M_{ij}$ is $0$ for allowed $(i, j)$ pairs and $-\infty$ for forbidden ones. After exp-and-normalise, forbidden positions get probability 0.
Causal mask: $M_{ij} = 0$ if $j \le i$, else $-\infty$. Lower-triangular allowed region. Used in every decoder LLM.
Padding mask: $M_{ij} = -\infty$ if $j$ is a pad token. Lets you batch sequences of different lengths.
Sliding-window mask: $M_{ij} = -\infty$ if $|i - j| > w$. Used by Mistral-7B ($w = 4096$), Longformer, etc. Per-token compute drops from $O(L)$ to $O(w)$.
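The three masks as additive $0/-\infty$ matrices, sketched in NumPy (toy length; the padding pattern is invented for illustration):

```python
n, w = 6, 2
i, j = np.indices((n, n))                      # i = query index, j = key index
pad = np.array([0, 0, 0, 0, 1, 1], dtype=bool) # hypothetical: last two positions are padding

causal_mask  = np.where(j <= i, 0.0, -np.inf)              # allow j <= i
padding_mask = np.where(pad[None, :], -np.inf, 0.0)        # block every pad key column
window_mask  = np.where(np.abs(i - j) <= w, 0.0, -np.inf)  # allow |i - j| <= w
```

Adding any of these to $S/\sqrt{d_h}$ before the softmax zeroes the forbidden entries of $A$, exactly as described above.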
You still compute the full $L \times L$ score matrix, then mask. Naive attention is $O(L^2)$ in compute and memory regardless. FlashAttention (deck 12) tiles the computation so the full $L \times L$ score matrix is never materialised, which is where the actual savings come from.
Standard MHA has $H$ separate $W_K, W_V$ projections, so the KV cache has $H \cdot d_h \cdot L = d_{model} \cdot L$ values per layer (per K and per V). For long-context inference this dominates memory.
MHA: $H$ Q-heads, $H$ K-heads, $H$ V-heads. KV cache = $2 H d_h L = 2 d_{model} L$ floats per layer.
GQA: $H$ Q-heads, $G < H$ K-heads, $G$ V-heads; each group of $H/G$ Q-heads shares one K/V pair. KV cache = $2 G d_h L = 2(d_{model}/H)\cdot G \cdot L$. Used by Llama-2 70B, Llama-3, Mixtral, GPT-4-class models.
MQA: $G = 1$; a single K and V shared by all heads. KV cache = $2 d_h L$, cut by a factor of $H$. Used by PaLM and Falcon. Some quality drop when training from scratch; much less when an existing MHA model is converted and briefly uptrained.
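The cache arithmetic above, as a tiny calculator (head count and context length are illustrative, not tied to any specific model):

```python
def kv_cache_floats(G, H, d_model, L_ctx):
    """Per-layer K + V cache entries with G key/value head groups."""
    d_h = d_model // H
    return 2 * G * d_h * L_ctx

H, d_model, L_ctx = 32, 4096, 8192
print(kv_cache_floats(H, H, d_model, L_ctx))   # MHA, G = H:  67,108,864
print(kv_cache_floats(8, H, d_model, L_ctx))   # GQA, G = 8:  16,777,216
print(kv_cache_floats(1, H, d_model, L_ctx))   # MQA, G = 1:   2,097,152
```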
GQA is a low-rank trick (deck 07) applied to a different structure: instead of compressing $W_K$ to low rank, we constrain different heads to share the same $W_K$. Same Eckart-Young flavour; different parameterisation.
Drag the temperature slider to see how softmax behaves as scores get larger or smaller; switch between unmasked, causal, and sliding-window. The matrix on the right is the resulting attention pattern $A$ (rows = queries, columns = keys, brightness = probability).
The attention matrix on the right is row-stochastic: each row sums to 1. Crank temperature high to see attention spread (uniform-like rows); crank it low to see attention sharpen onto a single key per query.
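What the slider presumably does under the hood: divide the scores by a temperature $T$ before the softmax. A sketch with made-up scores:

```python
def softmax_T(s, T):
    e = np.exp((s - s.max()) / T)
    return e / e.sum()

s = np.array([3.0, 1.0, 0.5, 0.0])
print(softmax_T(s, 10.0))   # high temperature: near-uniform row
print(softmax_T(s, 0.1))    # low temperature: mass collapses onto the largest score
```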
Deck 11 — Transformer Block Anatomy walks a tensor through the full block: pre-norm, MHA with all four projections, residual, FFN with up-then-down, residual. Every shape, every matmul, in one place.