Linear Algebra for AI — Presentation 05

Projections — Up & Down

The orthogonal projection is the workhorse of geometry. Up-projection lifts to a richer space; down-projection compresses. The transformer FFN is exactly an up-then-down projection — and the 4× expansion ratio is no accident.

Keywords: orthogonal projection, projection matrix, up-projection, down-projection, FFN 4×, SwiGLU.
Pipeline: vector → project onto subspace → lift to bigger space → non-linear gate → project back down.
00

Topics We'll Cover

What's a projection? The projection-onto-a-vector formula; projection onto a subspace; projection matrices and their two properties; down-projections; up-projections; the transformer FFN as up-then-down; why 4×; SwiGLU and the modern FFN; an interactive projection / SwiGLU toy; cheat sheet.

01

What's a Projection?

An orthogonal projection sends a vector $\mathbf{v}$ to its closest point in a target subspace $U$. By "closest" we mean smallest L2 distance — equivalently, the residual $\mathbf{v} - P\mathbf{v}$ is perpendicular to $U$.

This is the most useful single operation in numerical linear algebra. Least-squares regression projects the target onto the column span of the data matrix. PCA projects data onto the top eigenvectors of its covariance. Every "best low-dimensional approximation" you've ever met is a projection.

Two equivalent characterisations

Geometric

$P\mathbf{v}$ is the unique element of $U$ such that $\mathbf{v} - P\mathbf{v} \perp U$. Drop a perpendicular from the tip of $\mathbf{v}$ onto $U$; the foot is $P\mathbf{v}$.

Optimisation

$$P\mathbf{v} = \arg\min_{\mathbf{u} \in U} \|\mathbf{v} - \mathbf{u}\|_2^2.$$

Gradient = 0 forces $\mathbf{v} - \mathbf{u} \perp U$. The same point.

02

The Projection-onto-a-Vector Formula

The simplest case: $U$ is the line spanned by a single non-zero vector $\mathbf{a}$. Then

$$\mathrm{proj}_{\mathbf{a}}(\mathbf{v}) = \frac{\langle \mathbf{a}, \mathbf{v} \rangle}{\|\mathbf{a}\|^2}\, \mathbf{a} = \frac{\mathbf{a}\mathbf{a}^\top}{\mathbf{a}^\top \mathbf{a}}\, \mathbf{v}.$$

The projection matrix is the rank-1 outer product $\mathbf{a}\mathbf{a}^\top / (\mathbf{a}^\top\mathbf{a})$. If $\mathbf{a}$ is a unit vector this simplifies to $\mathbf{a}\mathbf{a}^\top$.

Component decomposition

Every $\mathbf{v}$ decomposes uniquely as

$$\mathbf{v} = \underbrace{\mathrm{proj}_{\mathbf{a}}(\mathbf{v})}_{\text{parallel to } \mathbf{a}} + \underbrace{(\mathbf{v} - \mathrm{proj}_{\mathbf{a}}(\mathbf{v}))}_{\text{perpendicular to } \mathbf{a}}.$$

Pythagoras applies: $\|\mathbf{v}\|^2 = \|\mathrm{proj}_{\mathbf{a}}(\mathbf{v})\|^2 + \|\mathbf{v} - \mathrm{proj}_{\mathbf{a}}(\mathbf{v})\|^2$.
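A minimal NumPy sketch of the formula above, using arbitrary example vectors, that checks the matrix form, the orthogonality of the residual, and the Pythagoras identity:

```python
import numpy as np

v = np.array([3.0, 1.0, 2.0])
a = np.array([1.0, 2.0, 2.0])

proj = (a @ v) / (a @ a) * a           # proj_a(v) = (<a,v>/||a||^2) a
resid = v - proj                       # perpendicular component

P = np.outer(a, a) / (a @ a)           # rank-1 projection matrix a a^T / (a^T a)

print(np.allclose(P @ v, proj))                        # matrix form agrees with the formula
print(np.isclose(resid @ a, 0.0))                      # residual is orthogonal to a
print(np.isclose(v @ v, proj @ proj + resid @ resid))  # Pythagoras
```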

Read this carefully

An attention head's value-vector readout is, in this language, a projection. Each token's contribution to the residual stream lies along the direction of the value vector it produces; the component of the residual stream in the orthogonal complement of that direction is left untouched. This decomposition is part of why mechanistic interpretability can read transformers as a sum of "writes" along low-dimensional directions.

03

Projection onto a Subspace

Now $U$ is an $r$-dimensional subspace of $\mathbb{R}^n$. Stack a basis of $U$ as the columns of $A \in \mathbb{R}^{n \times r}$. Then the projection of $\mathbf{v}$ onto $U$ is

$$P_U \mathbf{v} = A(A^\top A)^{-1} A^\top \mathbf{v}, \qquad P_U = A(A^\top A)^{-1} A^\top.$$

If the columns of $A$ are orthonormal — we'll write $Q$ in that case — this collapses to the much friendlier

$$P_U = Q Q^\top.$$

Geometric reading

Read $Q^\top \mathbf{v}$ as "extract coordinates of $\mathbf{v}$ in the orthonormal basis of $U$" — a vector of $r$ inner products. Read $Q$ then as "rebuild a vector in $\mathbb{R}^n$ from those coordinates". The composition is "first compress, then re-expand using only $U$" — everything not in $U$ is discarded by the first step and never recovered.

Least squares in one line

Solving $A\mathbf{x} \approx \mathbf{b}$ when no exact solution exists: project $\mathbf{b}$ onto $\mathrm{Range}(A)$, then back out $\mathbf{x}$. Result: $\mathbf{x}^\star = (A^\top A)^{-1} A^\top \mathbf{b}$. The classic least-squares formula is a projection followed by an inverse change of coordinates.
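A small NumPy sketch of both formulas, with a random $6 \times 2$ example matrix: build $P_U$ from the general formula, check it equals $QQ^\top$ from a QR factorisation, and confirm that least squares is projection plus a change of coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2))        # columns span a 2-dim subspace U of R^6
b = rng.standard_normal(6)

# Projection onto Range(A) via the general formula
P = A @ np.linalg.inv(A.T @ A) @ A.T

# Same projector via an orthonormal basis Q of U (reduced QR)
Q, _ = np.linalg.qr(A)
print(np.allclose(P, Q @ Q.T))

# Least squares: x* = (A^T A)^{-1} A^T b, and A x* is the projection of b onto Range(A)
x_star = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(A @ x_star, P @ b))
print(np.allclose(x_star, np.linalg.lstsq(A, b, rcond=None)[0]))
```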

04

Projection Matrices & Their Two Properties

An orthogonal projection matrix $P$ is exactly a matrix that is idempotent ($P^2 = P$) and symmetric ($P = P^\top$).

Geometrically: projecting twice is the same as projecting once (idempotent), and the projection map is its own adjoint (symmetric).

Eigenstructure of a projection

From $P^2 = P$, the eigenvalues of $P$ satisfy $\lambda^2 = \lambda$, so each $\lambda$ is 0 or 1. Eigenvectors with eigenvalue 1 span $U$ (the projection's image); eigenvectors with eigenvalue 0 span $U^\perp$ (the kernel).
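A quick numerical confirmation (toy sizes, random orthonormal basis): build $P = QQ^\top$ for a 2-dimensional subspace of $\mathbb{R}^5$ and inspect its eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((5, 2)))   # orthonormal basis of a 2-dim U in R^5
P = Q @ Q.T

print(np.allclose(P @ P, P))                 # idempotent
print(np.allclose(P, P.T))                   # symmetric
print(np.round(np.linalg.eigvalsh(P), 6))    # eigenvalues: two 1s (dim U), three 0s (dim U_perp)
```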

Soft projections in ML

Many ML maps are not exact projections but are close enough to share the geometry: a low-rank weight matrix sends inputs roughly into its column span and discards the rest. LayerNorm projects onto the unit sphere within the affine hyperplane $\sum x_i = 0$ (deck 11 makes this precise). Attention's score matrix isn't a projection — but its row-stochastic structure plays an analogous "average within a subset" role (deck 10).

05

Down-Projection — Information that Doesn't Fit

A down-projection is a linear map $W : \mathbb{R}^{d_h} \to \mathbb{R}^{d}$ where $d < d_h$. The output dimension is smaller than the input. It is, by rank-nullity, guaranteed to throw away at least $d_h - d$ dimensions of input information.

Down-projections show up everywhere in ML:

The MLP's $W_2$ (down)

After the up-projection, the FFN's output projection $W_2 \in \mathbb{R}^{d_{model} \times d_{ff}}$ collapses the $4d_{model}$ hidden activations back to $d_{model}$ to write into the residual stream.

Attention's $W_O$ (down)

Multi-head attention concatenates $H$ heads, giving width $H d_h = d_{model}$, then $W_O \in \mathbb{R}^{d_{model} \times d_{model}}$ remixes back to $d_{model}$. (Same width here, but it's a "remix" projection — deck 11.)

QKV projections (down per head)

Per head: $W_Q, W_K, W_V \in \mathbb{R}^{d_h \times d_{model}}$ with $d_h = d_{model}/H$. Each head sees a $d_h$-dim slice of the residual stream — a down-projection.

LoRA's $A$ (down)

$\Delta W = BA$ with $A : \mathbb{R}^{d_{in}} \to \mathbb{R}^r$ and $B : \mathbb{R}^r \to \mathbb{R}^{d_{out}}$, $r \ll d_{in}, d_{out}$. The fine-tuning update lives on a low-dim subspace.

What survives a down-projection?

By the four-fundamental-subspaces theorem (deck 02): exactly the row space of $W$ — an $r$-dimensional subspace of input space, where $r = \mathrm{rank}\,W \le d$. Two inputs that differ only in the null space produce identical outputs.

The implication for transformers: the $W_2$ down-projection is a commitment about which $d_{model}$ directions of hidden activation get a vote in the residual stream. The trained weights pick these directions to maximise predictive value.
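A sketch of the null-space claim above, with toy sizes $d_h = 16$ and $d = 4$ standing in for $d_{ff}$ and $d_{model}$: two inputs that differ only along a null-space direction of $W$ produce identical outputs.

```python
import numpy as np

rng = np.random.default_rng(2)
d_h, d = 16, 4
W = rng.standard_normal((d, d_h))      # down-projection: R^16 -> R^4

# Right singular vectors beyond rank(W) span the null space of W.
_, _, Vt = np.linalg.svd(W)
n = Vt[-1]                             # a unit vector with W @ n ~= 0

x = rng.standard_normal(d_h)
x_shifted = x + 3.0 * n                # differs from x only inside the null space

print(np.allclose(W @ x, W @ x_shifted))   # identical outputs: the difference is discarded
print(np.linalg.matrix_rank(W))            # rank <= d = 4: only a 4-dim slice of input survives
```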

06

Up-Projection — Lifting into a Bigger Space

An up-projection is a linear map $W : \mathbb{R}^{d} \to \mathbb{R}^{d_h}$ where $d_h > d$. The output dimension is larger.

An up-projection cannot increase information — the rank is bounded by $\min(d_h, d) = d$, so at least $d_h - d$ dimensions of the output are linearly determined by the others. So why bother?

The answer: pair it with a non-linearity

If you stack two linear maps you get one linear map — the up and down collapse to a single $d \times d$ map (deck 02). What makes the FFN useful is the activation in the middle:

$$\mathrm{FFN}(\mathbf{x}) = W_{\text{down}}\, \sigma(W_{\text{up}} \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2.$$

The non-linear $\sigma$ acts component-wise in the lifted space. Different lift directions get different non-linear treatments — a hidden coordinate that ends up positive passes through; a negative one is suppressed (or smoothly clipped, in GELU/SwiGLU). This is what the lift was for: more independent non-linear "channels" to switch on or off.
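A quick numerical check of this, with toy dimensions, random weights, and ReLU standing in for $\sigma$: without the activation the up/down pair is additive (one linear map); with the activation it is not.

```python
import numpy as np

rng = np.random.default_rng(3)
d, d_h = 4, 16
W_up = rng.standard_normal((d_h, d))
W_down = rng.standard_normal((d, d_h))

# Without a non-linearity the up/down pair collapses to one d x d linear map.
M = W_down @ W_up                      # 4x4
x, y = rng.standard_normal(d), rng.standard_normal(d)
print(np.allclose(W_down @ (W_up @ (x + y)), M @ x + M @ y))   # True: purely linear

# With a ReLU in the lifted space, additivity fails: the map is genuinely non-linear.
ffn = lambda z: W_down @ np.maximum(W_up @ z, 0.0)
print(np.allclose(ffn(x + y), ffn(x) + ffn(y)))                # False (generically)
```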

Wide hidden layer = more switches

A $d_h$-dimensional hidden layer with ReLU-style activations can implement up to $2^{d_h}$ different "on/off patterns" of which neurons fire. With $d_h = 4d$, that's $2^{4d}$ possible activation patterns — vastly more than $2^d$. The up-projection's job is to give the non-linearity a wide enough keyboard to play on.
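A toy count of this effect, with random weights and a low-dimensional input (so the count stays well below the $2^{d_h}$ ceiling, but still grows quickly with width):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_samples = 3, 20000
X = rng.standard_normal((n_samples, d))

for d_h in (d, 4 * d):
    W_up = rng.standard_normal((d_h, d))
    patterns = (X @ W_up.T) > 0                     # which hidden ReLU neurons fire, per input
    n_distinct = len({p.tobytes() for p in patterns})
    print(d_h, n_distinct)                          # wider hidden layer -> many more patterns
```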

07

The Transformer FFN as Up-then-Down

The "feed-forward" sub-layer of a transformer block, in matrix form:

$$\mathrm{FFN}(\mathbf{x}) = W_{\text{down}}\,\sigma(W_{\text{up}} \mathbf{x}).$$

Shapes for a typical $d_{model} = 4096$ block at the classic 4× ratio:

$\mathbf{x}$: $(d = 4096)$ → $W_{\text{up}}$: $(4d \times d) = (16384 \times 4096)$ → $\mathbf{h}$: $(4d = 16384)$ → $\sigma(\mathbf{h})$ → $W_{\text{down}}$: $(d \times 4d) = (4096 \times 16384)$ → $\mathbf{y}$: $(d = 4096)$

Param count and compute

At ratio 4 the two matrices contribute $4d \cdot d + d \cdot 4d = 8d^2$ parameters, about 134M at $d = 4096$, and roughly $2 \times 8d^2 = 16d^2$ multiply-adds per token; compare attention's $4d^2$ projection parameters (slide 08). The same arithmetic is sketched below.
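A few lines of Python for the counts just quoted (illustrative arithmetic only, bias-free):

```python
# Parameter and per-token FLOP counts for the classic 4x FFN, matching the shapes above.
d_model = 4096
d_ff = 4 * d_model

ffn_params = d_ff * d_model + d_model * d_ff      # W_up + W_down = 8 d^2
attn_params = 4 * d_model * d_model               # W_Q, W_K, W_V, W_O = 4 d^2
ffn_flops_per_token = 2 * ffn_params              # ~2 multiply-adds per parameter

print(f"FFN params:      {ffn_params / 1e6:.1f}M")        # ~134.2M
print(f"attn params:     {attn_params / 1e6:.1f}M")       # ~67.1M
print(f"FFN FLOPs/token: {ffn_flops_per_token / 1e9:.2f} GFLOP")
```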

Function-class reading

$\mathrm{FFN}(\mathbf{x}) = \sum_{i=1}^{d_h} \sigma(\mathbf{w}_i^\top \mathbf{x})\, \mathbf{c}_i$ where $\mathbf{w}_i$ is row $i$ of $W_{\text{up}}$ and $\mathbf{c}_i$ is column $i$ of $W_{\text{down}}$. Read this carefully: each hidden neuron $i$ pairs a "feature detector" $\mathbf{w}_i$ (row of up-projection) with a "feature contribution" $\mathbf{c}_i$ (column of down-projection). When the detector fires (its score is positive enough that $\sigma$ doesn't squash it), the contribution direction $\mathbf{c}_i$ is added to the output, scaled by the activation magnitude.
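A NumPy check (toy sizes, random weights, ReLU standing in for $\sigma$) that the neuron-by-neuron sum really is the same computation as the matrix form:

```python
import numpy as np

rng = np.random.default_rng(5)
d, d_h = 8, 32
W_up = rng.standard_normal((d_h, d))
W_down = rng.standard_normal((d, d_h))
sigma = lambda z: np.maximum(z, 0.0)               # ReLU stand-in for sigma

x = rng.standard_normal(d)

# Matrix form: W_down @ sigma(W_up @ x)
y_matrix = W_down @ sigma(W_up @ x)

# Neuron-by-neuron form: sum_i sigma(w_i . x) * c_i,
# with w_i = row i of W_up (detector) and c_i = column i of W_down (contribution).
y_sum = sum(sigma(W_up[i] @ x) * W_down[:, i] for i in range(d_h))

print(np.allclose(y_matrix, y_sum))                # the two readings are the same computation
```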

Mechanistic-interpretability claim

Geva et al. (EMNLP 2021) and Anthropic's circuits work argue that the rows of $W_{\text{up}}$ act as "key" vectors detecting input patterns, and the columns of $W_{\text{down}}$ act as "value" vectors writing answers into the residual stream — an attention-shaped structure inside the FFN. This is exactly what the equation above says. The up-down view of the FFN is the cleanest way to see it.

08

Why 4× — Capacity, Numerics, Hardware

The original transformer (Vaswani et al. 2017) chose $d_{ff} = 4 d_{model}$. The ratio has stuck, exactly or approximately, across most subsequent architectures — PaLM, GPT-3/4, Llama-2 dense, Mistral, Falcon. Where does it come from?

Capacity

The FFN is the per-block "memory" / pattern-bank. A wider hidden makes more independent feature directions available. Empirical scaling work (Kaplan 2020, Hoffmann 2022) shows roughly flat performance for ratios $\in [2, 8]$ — 4 is a sweet spot.

Parameter / FLOP balance

At ratio 4, FFN params $= 8d^2$ and attention params $= 4d^2$ (without bias / norm). The FFN is twice the size of attention — roughly the empirical ratio that gives best perplexity/FLOP. Smaller ratios hand FLOPs back to attention; bigger ratios starve attention.

Hardware shape

$4d \times d$ and $d \times 4d$ matmuls have arithmetic intensity proportional to $\min(4d, d)$. For typical $d \in [1024, 16384]$ they sit comfortably above the tensor-core ridge point. Wider would help only if memory bandwidth weren't the bottleneck.

The architectural pendulum

Some architectures push back: SwiGLU adds a third matrix (gated unit), so to keep total FFN params constant Llama-2 uses a hidden width of roughly $\frac{8}{3} d_{model}$, a ratio of about 2.67. The 4:1 figure is a convention with sound but soft justification, not a constant of nature.

MoE breaks the rule

Mixture-of-experts FFNs (Mixtral, DeepSeek-V3) replace the single 4×-wide FFN with $E$ experts of which $k$ activate per token. Each expert can be much smaller (e.g. ratio 2) but the aggregate parameter count is huge. The 4× rule lives only at the dense-FFN limit. Deck 12 covers MoE briefly.

09

SwiGLU and the Modern FFN

Llama, Mistral and most modern dense LLMs replace the GELU FFN with SwiGLU (Shazeer 2020), which adds a multiplicative gating mechanism:

$$\mathrm{SwiGLU}(\mathbf{x}) = W_{\text{down}} \big[ \mathrm{Swish}(W_{\text{gate}} \mathbf{x}) \odot (W_{\text{up}} \mathbf{x}) \big]$$

where $\odot$ is the element-wise product and $\mathrm{Swish}(z) = z \cdot \sigma(z)$ (i.e. SiLU). Three matrices instead of two. To keep parameter count constant compared to GELU FFN at ratio 4, the inner width is cut to $\frac{8}{3} d$ instead of $4d$.

What the gate does, in linear-algebra terms

$W_{\text{up}} \mathbf{x}$ is the "value" path — the candidate contribution per hidden coordinate.
$W_{\text{gate}} \mathbf{x}$ runs through Swish and gates each coordinate of the value: how much of this hidden coordinate makes it through.

Without gating, every hidden coordinate gets the same activation function applied. With gating, each coordinate's gain is data-dependent: it can learn to attenuate when the input doesn't look like its target pattern. The gate gives the FFN more expressive power per parameter.
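A minimal NumPy sketch of the SwiGLU block above, with a toy width $d = 512$ and random weights; the $1/\sqrt{\cdot}$ scaling just keeps activations around unit size and is not part of the definition. The last two lines check the parameter parity from the previous slide: three matrices at width $\frac{8}{3}d$ cost about as much as two at width $4d$.

```python
import numpy as np

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU FFN: W_down [ Swish(W_gate x) * (W_up x) ], bias-free sketch."""
    swish = lambda z: z / (1.0 + np.exp(-z))       # Swish / SiLU: z * sigmoid(z)
    return W_down @ (swish(W_gate @ x) * (W_up @ x))

d = 512
d_ff = int(8 * d / 3)                              # ~2.67x hidden width for param parity

rng = np.random.default_rng(6)
W_gate = rng.standard_normal((d_ff, d)) / np.sqrt(d)
W_up   = rng.standard_normal((d_ff, d)) / np.sqrt(d)
W_down = rng.standard_normal((d, d_ff)) / np.sqrt(d_ff)

y = swiglu_ffn(rng.standard_normal(d), W_gate, W_up, W_down)
print(y.shape)                                     # (512,)

swiglu_params = 3 * d_ff * d                       # three matrices at width 8d/3
gelu_params = 2 * (4 * d) * d                      # two matrices at width 4d = 8 d^2
print(swiglu_params, gelu_params)                  # 2,096,640 vs 2,097,152: roughly equal
```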

"GLU Variants Improve Transformer", Shazeer 2020

Noam Shazeer's short paper compared a handful of GLU variants empirically; SwiGLU and GeGLU consistently topped the leaderboard. Swap in SwiGLU at the same parameter budget and perplexity falls by a few percent. The choice is an empirical "free lunch" that nearly everyone has since adopted — the only cost is a third matrix multiplication per FFN.

10

Interactive: 2D Projection & SwiGLU Toy

Two widgets in one panel.

(A) Projection visualiser: drag $\mathbf{v}$ and the line direction $\mathbf{a}$. The pink arrow is $\mathrm{proj}_{\mathbf{a}}(\mathbf{v})$; the dashed grey is the perpendicular residual.

Example readout: $\mathbf{v} = (1.6, 1.6)$, $\mathbf{a} = (2, 0)$, $\mathrm{proj}_{\mathbf{a}}(\mathbf{v}) = (1.6, 0)$, $\|\text{residual}\| = 1.60$.

(B) SwiGLU toy: a width-2 input lifted to width-$d_h$, gated, projected back to width-1. Adjust the hidden width and the gate cutoff to see the effective function the FFN computes — a piecewise-linear approximation to whatever target you draw.


Green dashed: target. Violet: the FFN output as a function of $x \in [-2, 2]$, using random weights. Hit "Random weights" a few times to see how the same architecture spans a class of piecewise-linear functions; the wider the hidden, the richer the class. SwiGLU's gate gives more flexibility per neuron than ReLU/GELU.

11

Cheat Sheet

Onto a vector: $\mathrm{proj}_{\mathbf{a}}(\mathbf{v}) = \frac{\langle \mathbf{a}, \mathbf{v} \rangle}{\|\mathbf{a}\|^2}\,\mathbf{a}$; projection matrix $\mathbf{a}\mathbf{a}^\top/(\mathbf{a}^\top\mathbf{a})$.
Onto a subspace: $P_U = A(A^\top A)^{-1}A^\top$; with orthonormal columns, $P_U = QQ^\top$.
$P$ is an orthogonal projection iff $P^2 = P$ and $P = P^\top$; its eigenvalues are 0 or 1.
Least squares: $\mathbf{x}^\star = (A^\top A)^{-1}A^\top\mathbf{b}$, i.e. project $\mathbf{b}$ onto $\mathrm{Range}(A)$, then change coordinates.
FFN: $\mathrm{FFN}(\mathbf{x}) = W_{\text{down}}\,\sigma(W_{\text{up}}\mathbf{x})$, up to $4d$, non-linearity, back down to $d$.
SwiGLU: $W_{\text{down}}\big[\mathrm{Swish}(W_{\text{gate}}\mathbf{x}) \odot (W_{\text{up}}\mathbf{x})\big]$, hidden width $\approx \frac{8}{3}d$ at matched params.

Read next

Deck 06 — Eigenvalues & Eigenvectors introduces the spectral theory we'll need for SVD (deck 07), where we'll see that the "best" basis for any matrix — the one in which it looks like a clean projection-with-scaling — comes from its singular vectors.