Q, K, V are learned projections. The attention matrix is row-stochastic. The output is a row-by-row weighted average of value vectors. Read this way, attention loses its mystery and becomes a piece of crystallised linear algebra.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_h}}\right) V.$$
Three matrices in, one matrix out. Shapes (single head, single batch element): $Q, K, V \in \mathbb{R}^{L \times d_h}$ and output $\in \mathbb{R}^{L \times d_h}$, with $L$ the sequence length and $d_h$ the head width.
The whole equation is two matmuls and one rowwise softmax. We will look at each piece, in linear-algebra terms, in the next eight slides.
(1) Per query. For each query row $\mathbf{q}_i$, attention computes scores $\mathbf{q}_i^\top \mathbf{k}_j / \sqrt{d_h}$ for every key, softmaxes them across $j$, then weighted-averages the value rows. (2) Per key. Each key $\mathbf{k}_j$ contributes to every query in proportion to its score. Either reading is correct; both will appear in the slides ahead.
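Both readings are visible in a minimal NumPy sketch of the equation (variable names are illustrative; the rowwise max-subtraction is a standard numerical-stability trick, not part of the formula):

```python
import numpy as np

def attention(Q, K, V):
    """Single-head attention: two matmuls and one rowwise softmax.
    Q, K, V: (L, d_h) arrays; returns an (L, d_h) array."""
    d_h = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_h)                 # (L, L) scores: q_i . k_j / sqrt(d_h)
    S = S - S.max(axis=-1, keepdims=True)      # subtract rowwise max to stabilise exp
    A = np.exp(S)
    A = A / A.sum(axis=-1, keepdims=True)      # rowwise softmax: A is row-stochastic
    return A @ V                               # each output row: weighted avg of V's rows
```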
For an input residual stream $X \in \mathbb{R}^{L \times d_{model}}$ (one token per row, $d_{model} = $ hidden width), each head computes
$$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V,$$
with $W_Q, W_K, W_V \in \mathbb{R}^{d_{model} \times d_h}$. Each is a linear down-projection from the residual stream into a head-specific subspace of dimension $d_h$ (deck 05).
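Extending the sketch above with the three projections (toy sizes, chosen arbitrarily):

```python
rng = np.random.default_rng(0)
L, d_model, d_h = 10, 64, 16
X = rng.standard_normal((L, d_model))          # residual stream: one token per row
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_h)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # three down-projections into head space
out = attention(Q, K, V)                       # (L, d_h)
```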
Each row of $W_Q^\top$ is a "thing to look for" detector. $\mathbf{q}_i = W_Q^\top \mathbf{x}_i$ asks: "what about my context do I need to find?"
Each row of $W_K^\top$ produces a key tag. $\mathbf{k}_j$ says "this is what I, position $j$, advertise." The dot product $\mathbf{q}_i^\top \mathbf{k}_j$ measures advertise-vs-need similarity.
$\mathbf{v}_j$ is the actual content position $j$ contributes if attended to. The output is a weighted average of $V$'s rows.
Could $K$ just be $X$? It could, but it would tie the "address" of a token to its raw value. Separating $W_K$ and $W_V$ lets a token advertise something different from what it contributes. The Q/K split lets a query also be different from its own key — useful when you want a token to attend to itself or to other tokens in a non-symmetric way.
Empirically: in mechanistic interpretability work (Elhage et al., Anthropic 2021), $W_Q W_K^\top$ and $W_V W_O$ are studied as the "QK circuit" and "OV circuit" respectively: the actual learnable directions of attention behaviour live in these products, not in the individual matrices.
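A quick numerical check of that claim, reusing the toy arrays above: the scores depend on $W_Q$ and $W_K$ only through their product.

```python
# Scores depend on W_Q and W_K only through the fused QK circuit:
QK_circuit = W_Q @ W_K.T                       # (d_model, d_model), rank <= d_h
S_separate = (X @ W_Q) @ (X @ W_K).T           # compute Q and K, then scores
S_fused    = X @ QK_circuit @ X.T              # same scores through one matrix
assert np.allclose(S_separate, S_fused)
```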
The matrix $S = QK^\top \in \mathbb{R}^{L \times L}$ has entry $(i, j)$ equal to $\mathbf{q}_i^\top \mathbf{k}_j$ — a single scalar, the affinity between query $i$ and key $j$.
Read it as view 1 of matmul (deck 03): every entry is one dot product of a row of $Q$ with a row of $K$ (after transposing). Or read it as view 4: $S = \sum_d Q_{:,d}\, K_{:,d}^\top$ — a sum over head-dimensions of rank-1 outer products. Either reading works.
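Both views, checked numerically on the toy arrays above:

```python
S = Q @ K.T                                    # view 1: entry (i, j) = q_i . k_j
S_rank1 = sum(np.outer(Q[:, d], K[:, d]) for d in range(d_h))
assert np.allclose(S, S_rank1)                 # view 4: sum of d_h rank-1 outer products
```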
A rank-$d_h$ score matrix can express only $d_h$ independent "lines of attention pattern". For tasks needing finer granularity, the model runs multiple heads and combines their different rank-$d_h$ patterns. With $H = 32$ heads of $d_h = 128$ each, the heads jointly cover $\sum_h d_h = 4096$ dimensions: the full $d_{model}$. The number of heads is not a hyperparameter chosen at random: $H = d_{model}/d_h$, where $d_h$ is a per-head budget that stays roughly fixed across model scales (more on this below).
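The rank cap in one line, with fresh toy sizes chosen so that $L \gg d_h$:

```python
Lb, db = 64, 8                                 # sequence length 64, head width 8
Sb = rng.standard_normal((Lb, db)) @ rng.standard_normal((Lb, db)).T
print(np.linalg.matrix_rank(Sb))               # 8: a 64 x 64 score matrix, rank capped at d_h
```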
Apply softmax to each row of $S/\sqrt{d_h}$:
$$A_{ij} = \frac{\exp(S_{ij}/\sqrt{d_h})}{\sum_{j'} \exp(S_{ij'}/\sqrt{d_h})}.$$
The result $A \in \mathbb{R}^{L \times L}$ is a row-stochastic matrix: every entry is in $[0, 1]$ and every row sums to 1.
That last property is the load-bearing one. $A V$ doesn't invent new information — it averages existing value vectors. The model's job in choosing $A$ is to decide which average to take per query.
You could use ReLU or sigmoid to make scores non-negative. But softmax has three properties that matter: (1) each row sums to 1, so the output stays in the convex hull of the value rows regardless of score magnitudes; (2) it is exponential, so a small lead in score becomes a near-monopoly on probability mass, the "winner take most" behaviour that concentrates each query's attention on its most relevant keys; (3) the gradient of softmax followed by cross-entropy collapses to $\mathbf{p} - \mathbf{y}^\star$ (deck 09), which makes training pleasant. Linear attention (deck 03) replaces softmax with a kernel feature map and gives up (1) and (2) for asymptotic FLOPs savings.
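Property (2) in a two-line demo (the numbers are illustrative):

```python
def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.0])))      # ~[0.665, 0.245, 0.090]: modest lead
print(softmax(np.array([20.0, 10.0, 0.0])))    # ~[1.000, 0.000, 0.000]: winner takes most
```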
The final attention output is the matrix product $O = AV \in \mathbb{R}^{L \times d_h}$. Reading this with view 3 of matmul (rows-as-combinations):
$$O_{i,:} = \sum_{j} A_{ij}\, V_{j,:}.$$
Each output token $i$ is a convex combination of the input value tokens, weighted by the attention probabilities. This is the cleanest possible reading: attention computes a token-specific weighted average of all token values.
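Convexity is checkable: every output coordinate must sit inside the per-dimension range of $V$'s rows. Reusing the toy arrays above:

```python
O = attention(Q, K, V)                         # from the earlier sketch
assert np.all(O.min(axis=0) >= V.min(axis=0))  # each output coordinate stays within
assert np.all(O.max(axis=0) <= V.max(axis=0))  # the range spanned by V's rows
```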
Going back to the residual stream, the attention head writes
$$\Delta X = (A V) W_O = A X (W_V W_O).$$
The product $W_V W_O \in \mathbb{R}^{d_{model} \times d_{model}}$ is the OV circuit — the per-head linear map from "what's in the residual stream" to "what gets written back". Combined with the QK circuit $W_Q W_K^\top$ (which determines where attention flows), these two products are the right-sized abstractions for understanding what an attention head does mechanistically.
The QK circuit says which token to read from. The OV circuit says what content to copy. Together they implement a token-conditional routing of information across positions, with no new information created — only re-mixing of value-projections of existing residual-stream contents. This is the attention head's atomic operation.
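The OV-circuit identity, verified with the toy arrays above ($W_O^{(h)}$ here is a made-up per-head block, and $A$ any row-stochastic pattern):

```python
W_O_h = rng.standard_normal((d_h, d_model))    # this head's block of W_O
A = rng.random((L, L))
A = A / A.sum(axis=-1, keepdims=True)          # any row-stochastic attention pattern
dX_two_step = (A @ (X @ W_V)) @ W_O_h          # project to values, attend, project up
dX_circuit  = A @ (X @ (W_V @ W_O_h))          # same write through the fused OV circuit
assert np.allclose(dX_two_step, dX_circuit)
```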
Multi-head attention runs $H$ heads in parallel, each with its own $W_Q^{(h)}, W_K^{(h)}, W_V^{(h)}$ of shape $\mathbb{R}^{d_{model} \times d_h}$ where $d_h = d_{model}/H$.
Stacking heads as block-columns:
$$W_Q = \big[\, W_Q^{(1)} \mid W_Q^{(2)} \mid \cdots \mid W_Q^{(H)} \,\big] \in \mathbb{R}^{d_{model} \times d_{model}}.$$
And similarly for $W_K, W_V$. Each head's projection is a down-projection to $d_h$; the concatenation recovers the full $d_{model}$. Per-head attention happens independently.
Note $d_h$ stays roughly constant at 64-128 across model sizes — a different scaling regime than $H$ or $d_{model}$.
Each head produces an output $O^{(h)} \in \mathbb{R}^{L \times d_h}$. The $H$ heads are concatenated along the feature dim,
$$O = \big[\, O^{(1)} \mid O^{(2)} \mid \cdots \mid O^{(H)} \,\big] \in \mathbb{R}^{L \times d_{model}},$$
then projected back to the residual stream by $W_O \in \mathbb{R}^{d_{model} \times d_{model}}$:
$$\mathrm{MHA}(X) = O W_O.$$
Partition $W_O$ into $H$ horizontal blocks $W_O^{(1)}, \ldots, W_O^{(H)}$ each of shape $\mathbb{R}^{d_h \times d_{model}}$. Then
$$O W_O = \sum_{h=1}^H O^{(h)} W_O^{(h)}.$$
The full output is a sum over heads of "this head's output, projected up to $d_{model}$". Each head writes a contribution to the residual stream independently; the residual stream is the bus that adds them up.
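Concat-then-project equals sum-of-head-writes; a sketch with toy sizes matching the arrays above:

```python
H = 4
d_h = d_model // H                             # 16, matching the toy sizes above
heads = [rng.standard_normal((L, d_h)) for _ in range(H)]   # per-head outputs O^(h)
W_O_full = rng.standard_normal((d_model, d_model))

concat_then_project = np.concatenate(heads, axis=1) @ W_O_full
sum_of_head_writes = sum(
    O_h @ W_O_full[h * d_h:(h + 1) * d_h, :]   # h-th horizontal block of W_O
    for h, O_h in enumerate(heads)
)
assert np.allclose(concat_then_project, sum_of_head_writes)
```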
Reading attention this way makes a fact pop out: each head writes into the residual stream along the row space of its block $W_O^{(h)}$. Two heads with disjoint write subspaces don't interfere; two heads with overlapping subspaces compete. Mechanistic interpretability (Anthropic's "A Mathematical Framework for Transformer Circuits") uses exactly this decomposition to understand what each head contributes.
Attention as defined would let token $i$ peek at every other token, including future ones. For autoregressive language modelling we forbid future-peeking by adding a mask to the score matrix before softmax:
$$S' = S + M,$$
where $M_{ij}$ is $0$ for allowed $(i, j)$ pairs and $-\infty$ for forbidden ones. After exp-and-normalise, forbidden positions get probability 0.
Causal mask: $M_{ij} = 0$ if $j \le i$, else $-\infty$. Lower-triangular allowed region. Used in every decoder LLM.
Padding mask: $M_{ij} = -\infty$ if $j$ is a pad token. Lets you batch sequences of different lengths.
Sliding-window mask: $M_{ij} = -\infty$ if $|i - j| > w$. Used by Mistral-7B ($w = 4096$), Longformer, etc. Per-token compute drops from $O(L)$ to $O(w)$.
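The three masks as additive $0/-\infty$ matrices, sketched in NumPy (toy length; the padding pattern is invented for illustration):

```python
n, w = 6, 2
i, j = np.indices((n, n))                      # i = query index, j = key index
pad = np.array([0, 0, 0, 0, 1, 1], dtype=bool) # hypothetical: last two positions are padding

causal_mask  = np.where(j <= i, 0.0, -np.inf)              # allow j <= i
padding_mask = np.where(pad[None, :], -np.inf, 0.0)        # block every pad key column
window_mask  = np.where(np.abs(i - j) <= w, 0.0, -np.inf)  # allow |i - j| <= w
```

Adding any of these to $S/\sqrt{d_h}$ before the softmax zeroes the forbidden entries of $A$, exactly as described above.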
You still compute the full $L \times L$ score matrix, then mask. Naive attention is $O(L^2)$ in compute and memory regardless. FlashAttention (deck 12) tiles the computation so the full $L \times L$ score matrix is never materialised, which is where the actual savings come from.
Standard MHA has $H$ separate $W_K, W_V$ projections, so the KV cache has $H \cdot d_h \cdot L = d_{model} \cdot L$ values per layer (per K and per V). For long-context inference this dominates memory.
MHA: $H$ Q-heads, $H$ K-heads, $H$ V-heads. KV cache = $2 H d_h L = 2 d_{model} L$ floats per layer.
GQA: $H$ Q-heads, $G < H$ K-heads, $G$ V-heads; each group of $H/G$ Q-heads shares one K/V pair. KV cache = $2 G d_h L = 2(d_{model}/H)\cdot G \cdot L$. Used by Llama-2 70B, Llama-3, Mixtral, GPT-4-class models.
MQA: $G = 1$; a single K and V shared by all heads. KV cache = $2 d_h L$, cut by a factor of $H$. Used by PaLM and Falcon. Some quality drop when training from scratch; much less when an existing MHA model is converted and briefly uptrained.
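The cache arithmetic above, as a tiny calculator (head count and context length are illustrative, not tied to any specific model):

```python
def kv_cache_floats(G, H, d_model, L_ctx):
    """Per-layer K + V cache entries with G key/value head groups."""
    d_h = d_model // H
    return 2 * G * d_h * L_ctx

H, d_model, L_ctx = 32, 4096, 8192
print(kv_cache_floats(H, H, d_model, L_ctx))   # MHA, G = H:  67,108,864
print(kv_cache_floats(8, H, d_model, L_ctx))   # GQA, G = 8:  16,777,216
print(kv_cache_floats(1, H, d_model, L_ctx))   # MQA, G = 1:   2,097,152
```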
GQA is a low-rank trick (deck 07) applied to a different structure: instead of compressing $W_K$ to low rank, we constrain different heads to share the same $W_K$. Same Eckart-Young flavour; different parameterisation.
Drag the temperature slider to see how softmax behaves as scores get larger or smaller; switch between unmasked, causal, and sliding-window. The matrix on the right is the resulting attention pattern $A$ (rows = queries, columns = keys, brightness = probability).
The attention matrix on the right is row-stochastic: each row sums to 1. Crank temperature high to see attention spread (uniform-like rows); crank it low to see attention sharpen onto a single key per query.
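What the slider presumably does under the hood: divide the scores by a temperature $T$ before the softmax. A sketch with made-up scores:

```python
def softmax_T(s, T):
    e = np.exp((s - s.max()) / T)
    return e / e.sum()

s = np.array([3.0, 1.0, 0.5, 0.0])
print(softmax_T(s, 10.0))   # high temperature: near-uniform row
print(softmax_T(s, 0.1))    # low temperature: mass collapses onto the largest score
```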
Deck 11 — Transformer Block Anatomy walks a tensor through the full block: pre-norm, MHA with all four projections, residual, FFN with up-then-down, residual. Every shape, every matmul, in one place.