The orthogonal projection is the workhorse of geometry. Up-projection lifts to a richer space; down-projection compresses. The transformer FFN is an up-projection, a point-wise non-linearity, then a down-projection — and the 4× expansion ratio is no accident.
An orthogonal projection sends a vector $\mathbf{v}$ to its closest point in a target subspace $U$. By "closest" we mean smallest L2 distance — equivalently, the residual $\mathbf{v} - P\mathbf{v}$ is perpendicular to $U$.
This is the most useful single operation in numerical linear algebra. Least-squares regression projects the target onto the column span of the data matrix. PCA projects data onto the top eigenvectors of its covariance. Every "best low-dimensional approximation" you've ever met is a projection.
$P\mathbf{v}$ is the unique element of $U$ such that $\mathbf{v} - P\mathbf{v} \perp U$. Drop a perpendicular from the tip of $\mathbf{v}$ onto $U$; the foot is $P\mathbf{v}$.
$$P\mathbf{v} = \arg\min_{\mathbf{u} \in U} \|\mathbf{v} - \mathbf{u}\|_2^2.$$
Gradient = 0 forces $\mathbf{v} - \mathbf{u} \perp U$. The same point.
The simplest case: $U$ is the line spanned by a single non-zero vector $\mathbf{a}$. Then
$$\mathrm{proj}_{\mathbf{a}}(\mathbf{v}) = \frac{\langle \mathbf{a}, \mathbf{v} \rangle}{\|\mathbf{a}\|^2}\, \mathbf{a} = \frac{\mathbf{a}\mathbf{a}^\top}{\mathbf{a}^\top \mathbf{a}}\, \mathbf{v}.$$
The projection matrix is the rank-1 outer product $\mathbf{a}\mathbf{a}^\top / (\mathbf{a}^\top\mathbf{a})$. If $\mathbf{a}$ is a unit vector this simplifies to $\mathbf{a}\mathbf{a}^\top$.
Every $\mathbf{v}$ decomposes uniquely as
$$\mathbf{v} = \underbrace{\mathrm{proj}_{\mathbf{a}}(\mathbf{v})}_{\text{parallel to } \mathbf{a}} + \underbrace{(\mathbf{v} - \mathrm{proj}_{\mathbf{a}}(\mathbf{v}))}_{\text{perpendicular to } \mathbf{a}}.$$
Pythagoras applies: $\|\mathbf{v}\|^2 = \|\mathrm{proj}_{\mathbf{a}}(\mathbf{v})\|^2 + \|\mathbf{v} - \mathrm{proj}_{\mathbf{a}}(\mathbf{v})\|^2$.
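A quick NumPy check of the formula and the decomposition (random vectors, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(5)                 # line direction
v = rng.standard_normal(5)                 # vector to project

proj = (a @ v) / (a @ a) * a               # <a, v> / ||a||^2 * a

residual = v - proj
print(np.isclose(residual @ a, 0.0))                          # residual is perpendicular to a
print(np.isclose(v @ v, proj @ proj + residual @ residual))   # Pythagoras
```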
An attention head's value-vector readout is, in this language, a projection. Each token's write into the residual stream lies in the low-dimensional subspace spanned by the head's output directions; the component of the residual stream in the orthogonal complement of that subspace is untouched by the write. This decomposition is part of why mechanistic interpretability can read transformers as a sum of "writes" along low-dimensional directions.
Now $U$ is an $r$-dimensional subspace of $\mathbb{R}^n$. Stack a basis of $U$ as the columns of $A \in \mathbb{R}^{n \times r}$. Then the projection of $\mathbf{v}$ onto $U$ is
$$P_U \mathbf{v} = A(A^\top A)^{-1} A^\top \mathbf{v}, \qquad P_U = A(A^\top A)^{-1} A^\top.$$
If the columns of $A$ are orthonormal — we'll write $Q$ in that case — this collapses to the much friendlier
$$P_U = Q Q^\top.$$
Read $Q^\top \mathbf{v}$ as "extract coordinates of $\mathbf{v}$ in the orthonormal basis of $U$" — a vector of $r$ inner products. Read $Q$ then as "rebuild a vector in $\mathbb{R}^n$ from those coordinates". The composition is "first compress, then re-expand using only $U$" — everything not in $U$ is discarded by the first step and never recovered.
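A minimal NumPy sketch of the general formula, assuming the columns of $A$ are linearly independent (toy sizes, random basis):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 6, 2
A = rng.standard_normal((n, r))                    # basis of U as columns

P = A @ np.linalg.inv(A.T @ A) @ A.T               # P_U = A (A^T A)^{-1} A^T

Q, _ = np.linalg.qr(A)                             # orthonormal basis of the same span
print(np.allclose(P, Q @ Q.T))                     # same projector, friendlier form
print(np.allclose(P @ P, P))                       # projecting twice = projecting once
```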
Solving $A\mathbf{x} \approx \mathbf{b}$ when no exact solution exists: project $\mathbf{b}$ onto $\mathrm{Range}(A)$, then back out $\mathbf{x}$. Result: $\mathbf{x}^\star = (A^\top A)^{-1} A^\top \mathbf{b}$. The classic least-squares formula is a projection followed by an inverse change of coordinates.
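The same thing numerically: the normal-equations formula against NumPy's own least-squares solver (random data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))                   # tall system: no exact solution in general
b = rng.standard_normal(20)

x_star = np.linalg.inv(A.T @ A) @ A.T @ b          # projection + change of coordinates
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)      # library least-squares

print(np.allclose(x_star, x_ref))                  # True
print(np.allclose(A.T @ (b - A @ x_star), 0.0))    # residual is perpendicular to Range(A)
```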
An orthogonal projection matrix $P$ is exactly a matrix that is idempotent ($P^2 = P$) and symmetric ($P^\top = P$).
Geometrically: projecting twice is the same as projecting once (idempotent), and the projection map is its own adjoint (symmetric).
From $P^2 = P$, the eigenvalues of $P$ satisfy $\lambda^2 = \lambda$, so each $\lambda$ is 0 or 1. Eigenvectors with eigenvalue 1 span $U$ (the projection's image); eigenvectors with eigenvalue 0 span $U^\perp$ (the kernel).
Many ML maps are not exact projections but are close enough to share the geometry: a low-rank weight matrix sends inputs roughly into its column span and discards the rest. LayerNorm projects onto the unit sphere within the affine hyperplane $\sum x_i = 0$ (deck 11 makes this precise). Attention's score matrix isn't a projection — but its row-stochastic structure plays an analogous "average within a subset" role (deck 10).
A down-projection is a linear map $W : \mathbb{R}^{d_h} \to \mathbb{R}^{d}$ where $d < d_h$. The output dimension is smaller than the input. It is, by rank-nullity, guaranteed to throw away at least $d_h - d$ dimensions of input information.
Down-projections show up everywhere in ML:
After the up-projection, the FFN's output projection $W_2 \in \mathbb{R}^{d_{model} \times d_{ff}}$ collapses the $4d_{model}$ hidden activations back to $d_{model}$ to write into the residual stream.
Multi-head attention concatenates $H$ heads, giving width $H d_h = d_{model}$, then $W_O \in \mathbb{R}^{d_{model} \times d_{model}}$ remixes back to $d_{model}$. (Same width here, but it's a "remix" projection — deck 11.)
Per head: $W_Q, W_K, W_V \in \mathbb{R}^{d_h \times d_{model}}$ with $d_h = d_{model}/H$. Each head sees a $d_h$-dim slice of the residual stream — a down-projection.
LoRA-style fine-tuning: $\Delta W = BA$ with $A : \mathbb{R}^{d_{in}} \to \mathbb{R}^r$ and $B : \mathbb{R}^r \to \mathbb{R}^{d_{out}}$, $r \ll d_{in}, d_{out}$. The fine-tuning update lives on a low-dimensional subspace.
Which part of the input survives a down-projection? By the four-fundamental-subspaces theorem (deck 02): exactly the row space of $W$ — an $r$-dimensional subspace of input space, where $r = \mathrm{rank}\,W \le d$. Two inputs that differ only in the null space produce identical outputs.
The implication for transformers: the $W_2$ down-projection is a commitment about which $d_{model}$ directions of hidden activation get a vote in the residual stream. The trained weights pick these directions to maximise predictive value.
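A toy check of the null-space claim; the sizes are illustrative, and the null-space direction comes from the SVD (deck 07):

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d = 8, 3
W = rng.standard_normal((d, d_h))                  # down-projection: R^8 -> R^3

_, _, Vt = np.linalg.svd(W)                        # rows 3..7 of Vt span the null space (rank 3)
null_dir = Vt[-1]
print(np.allclose(W @ null_dir, 0.0))              # W discards this direction entirely

x = rng.standard_normal(d_h)
print(np.allclose(W @ x, W @ (x + 5.0 * null_dir)))   # identical outputs
```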
An up-projection is a linear map $W : \mathbb{R}^{d} \to \mathbb{R}^{d_h}$ where $d_h > d$. The output dimension is larger.
An up-projection cannot increase information — the rank is bounded by $\min(d_h, d) = d$, so $d_h - d$ dimensions of the output are linearly determined by the others. So why bother?
If you stack two linear maps you get one linear map — the up and down collapse to a single $d \times d$ map (deck 02). What makes the FFN useful is the activation in the middle:
$$\mathrm{FFN}(\mathbf{x}) = W_{\text{down}}\, \sigma(W_{\text{up}} \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2.$$
The non-linear $\sigma$ acts component-wise in the lifted space. Different lift directions get different non-linear treatments — a hidden coordinate that ends up positive passes through; a negative one is suppressed (or smoothly clipped, in GELU/SwiGLU). This is what the lift was for: more independent non-linear "channels" to switch on or off.
A $d_h$-dimensional hidden layer with ReLU-style activations can implement up to $2^{d_h}$ different "on/off patterns" of which neurons fire. With $d_h = 4d$, that's $2^{4d}$ possible activation patterns — vastly more than $2^d$. The up-projection's job is to give the non-linearity a wide enough keyboard to play on.
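A toy illustration of both points: stacking the two linear maps collapses to rank at most $d$, while the ReLU in between creates many distinct firing patterns (sizes and weights are random, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_h = 2, 8                                      # lift a 2-dim input to 8 hidden channels
W_up   = rng.standard_normal((d_h, d))
W_down = rng.standard_normal((d, d_h))

# Without the non-linearity, up-then-down is a single d x d linear map.
print(np.linalg.matrix_rank(W_down @ W_up))        # 2: the composition can never exceed rank d

# With ReLU in between, count the distinct on/off firing patterns over random inputs.
X = rng.standard_normal((10_000, d))
patterns = {tuple((W_up @ x > 0).astype(int)) for x in X}
print(len(patterns))                               # typically 16 here: more distinct patterns than 2**d = 4
```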
The "feed-forward" sub-layer of a transformer block, in matrix form:
$$\mathrm{FFN}(\mathbf{x}) = W_{\text{down}}\,\sigma(W_{\text{up}} \mathbf{x}).$$
Shapes for a typical $d_{model} = 4096$ block (Llama-style): $W_{\text{up}} \in \mathbb{R}^{d_{ff} \times 4096}$ lifts each token to $d_{ff}$ hidden coordinates and $W_{\text{down}} \in \mathbb{R}^{4096 \times d_{ff}}$ collapses them back; at the classic ratio 4, $d_{ff} = 16384$, while Llama-2's SwiGLU variant trims it to 11008 (see below).
$\mathrm{FFN}(\mathbf{x}) = \sum_{i=1}^{d_h} \sigma(\mathbf{w}_i^\top \mathbf{x})\, \mathbf{c}_i$ where $\mathbf{w}_i$ is row $i$ of $W_{\text{up}}$ and $\mathbf{c}_i$ is column $i$ of $W_{\text{down}}$. Read this carefully: each hidden neuron $i$ pairs a "feature detector" $\mathbf{w}_i$ (row of up-projection) with a "feature contribution" $\mathbf{c}_i$ (column of down-projection). When the detector fires (its score is positive enough that $\sigma$ doesn't squash it), the contribution direction $\mathbf{c}_i$ is added to the output, scaled by the activation magnitude.
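A quick equivalence check of the two readings, matrix form versus the neuron-by-neuron sum (random weights, with ReLU standing in for $\sigma$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_h = 16, 64
W_up   = rng.standard_normal((d_h, d))
W_down = rng.standard_normal((d, d_h))
relu = lambda z: np.maximum(z, 0.0)

x = rng.standard_normal(d)
matrix_form = W_down @ relu(W_up @ x)

# Detector row w_i paired with contribution column c_i, summed over hidden neurons.
neuron_sum = sum(relu(W_up[i] @ x) * W_down[:, i] for i in range(d_h))

print(np.allclose(matrix_form, neuron_sum))        # True
```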
Geva et al. (EMNLP 2021) and Anthropic's circuits work argue that the rows of $W_{\text{up}}$ act as "key" vectors detecting input patterns, and the columns of $W_{\text{down}}$ act as "value" vectors writing answers into the residual stream — an attention-shaped structure inside the FFN. This is exactly what the equation above says. The up-down view of the FFN is the cleanest way to see it.
The original transformer (Vaswani et al. 2017) chose $d_{ff} = 4 d_{model}$. The ratio, or the parameter budget it implies, has stuck across most subsequent architectures — PaLM, GPT-3/4, Llama-2 dense, Mistral, Falcon. Where does it come from?
The FFN is the per-block "memory" / pattern-bank. A wider hidden layer makes more independent feature directions available. Empirical scaling work (Kaplan 2020, Hoffmann 2022) shows roughly flat performance for ratios $\in [2, 8]$ — 4 is a sweet spot.
At ratio 4, FFN params $= 8d^2$ and attention params $= 4d^2$ (without bias / norm). The FFN is twice the size of attention — roughly the empirical ratio that gives best perplexity/FLOP. Smaller ratios hand FLOPs back to attention; bigger ratios starve attention.
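A back-of-the-envelope check of those counts (biases and norms ignored, as above):

```python
d = 4096
ffn_params  = 2 * d * (4 * d)        # W_up (4d x d) + W_down (d x 4d) = 8 d^2
attn_params = 4 * d * d              # W_Q, W_K, W_V, W_O, each d x d  = 4 d^2
print(ffn_params / attn_params)      # 2.0: the FFN is twice the size of attention
```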
$4d \times d$ and $d \times 4d$ matmuls have arithmetic intensity proportional to $\min(4d, d)$. For typical $d \in [1024, 16384]$ they sit comfortably above the tensor-core ridge point, so the FFN is compute-bound; going wider buys no further hardware efficiency, and going much narrower risks dropping back into the memory-bandwidth-bound regime.
Some architectures push back: SwiGLU adds a third matrix (the gate), so to keep total FFN params constant Llama-2 uses a hidden width of $\frac{8}{3} d_{model}$ — a ratio of about 2.67. The 4:1 figure is a convention with sound but soft justification, not a constant of nature.
Mixture-of-experts FFNs (Mixtral, DeepSeek-V3) replace the single 4×-wide FFN with $E$ experts of which $k$ activate per token. Each expert can be much smaller (e.g. ratio 2) but the aggregate parameter count is huge. The 4× rule lives only at the dense-FFN limit. Deck 12 covers MoE briefly.
Llama, Mistral and most modern dense LLMs replace the GELU FFN with SwiGLU (Shazeer 2020), which adds a multiplicative gating mechanism:
$$\mathrm{SwiGLU}(\mathbf{x}) = W_{\text{down}} \big[ \mathrm{Swish}(W_{\text{gate}} \mathbf{x}) \odot (W_{\text{up}} \mathbf{x}) \big]$$
where $\odot$ is the element-wise product and $\mathrm{Swish}(z) = z \cdot \sigma(z)$ (i.e. SiLU). Three matrices instead of two. To keep parameter count constant compared to GELU FFN at ratio 4, the inner width is cut to $\frac{8}{3} d$ instead of $4d$.
$W_{\text{up}} \mathbf{x}$ is the "value" path — the candidate contribution per hidden coordinate.
$W_{\text{gate}} \mathbf{x}$ runs through Swish and gates each coordinate of the value: how much of this hidden coordinate makes it through.
Without gating, every hidden coordinate gets the same activation function applied. With gating, each coordinate's gain is data-dependent: it can learn to attenuate when the input doesn't look like its target pattern. The gate gives the FFN more expressive power per parameter.
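A minimal NumPy sketch of the SwiGLU FFN above; the sizes, initialisation scales, and the `swiglu_ffn` name are illustrative, not any particular model's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
d_h = int(8 / 3 * d)                       # inner width ~ (8/3) d, matching the GELU-FFN budget

W_gate = rng.standard_normal((d_h, d)) / np.sqrt(d)
W_up   = rng.standard_normal((d_h, d)) / np.sqrt(d)
W_down = rng.standard_normal((d, d_h)) / np.sqrt(d_h)

def swish(z):                              # Swish / SiLU: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x):
    value = W_up @ x                       # candidate contribution per hidden coordinate
    gate  = swish(W_gate @ x)              # data-dependent gain per coordinate
    return W_down @ (gate * value)         # gate element-wise, then project back down

x = rng.standard_normal(d)
print(swiglu_ffn(x).shape)                 # (64,)
```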
Noam Shazeer's short paper compared a handful of GLU variants empirically; SwiGLU and GeGLU consistently topped the leaderboard. Swap in SwiGLU at the same parameter budget and perplexity falls by a few percent. The choice is an empirical "free lunch" that most modern dense LLMs have adopted — the only cost is a third matrix multiplication per FFN.
Two widgets in one panel.
(A) Projection visualiser: drag $\mathbf{v}$ and the line direction $\mathbf{a}$. The pink arrow is $\mathrm{proj}_{\mathbf{a}}(\mathbf{v})$; the dashed grey is the perpendicular residual.
(B) SwiGLU toy: a width-2 input lifted to width-$d_h$, gated, projected back to width-1. Adjust the hidden width and the gate cutoff to see the effective function the FFN computes — a piecewise-linear approximation to whatever target you draw.
Green dashed: target. Violet: the FFN output as a function of $x \in [-2, 2]$, using random weights. Hit "Random weights" a few times to see how the same architecture spans a class of piecewise-linear functions; the wider the hidden, the richer the class. SwiGLU's gate gives more flexibility per neuron than ReLU/GELU.
Deck 06 — Eigenvalues & Eigenvectors introduces the spectral theory we'll need for SVD (deck 07), where we'll see that the "best" basis for any matrix — the one in which it looks like a clean projection-with-scaling — comes from its singular vectors.