The Singular Value Decomposition is the most useful theorem in applied linear algebra. It gives the best low-rank approximation of any matrix — the foundation of LoRA, MLA, low-rank KV-cache compression, and almost every "just project to a smaller space" trick in modern ML.
The SVD theorem. Every matrix $A \in \mathbb{R}^{m\times n}$ — any shape, any rank — can be factorised as
$$A = U\Sigma V^\top,$$
where $U \in \mathbb{R}^{m\times m}$ is orthogonal ($U^\top U = I$), $V \in \mathbb{R}^{n\times n}$ is orthogonal ($V^\top V = I$), and $\Sigma \in \mathbb{R}^{m\times n}$ is "diagonal" — non-negative entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$ on the main diagonal, zeros elsewhere.
The columns of $U$ are left singular vectors, the columns of $V$ are right singular vectors, and the $\sigma_i$ are singular values. The number of non-zero singular values equals $\mathrm{rank}\,A$.
If $\mathrm{rank}\,A = r$, store only the first $r$ columns of $U$ and $V$ and the $r\times r$ leading block of $\Sigma$:
$$A = U_r \Sigma_r V_r^\top = \sum_{i=1}^r \sigma_i\, \mathbf{u}_i \mathbf{v}_i^\top.$$
The sum on the right is the rank-1 decomposition view: $A$ is a sum of $r$ rank-1 outer products, each weighted by its singular value. Often more useful than the full factored form.
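A quick numerical check of both views, in NumPy (the matrix and shapes here are arbitrary, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))                    # any rectangular matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: U (5,3), s (3,), Vt (3,3)

# Factored form: A = U Sigma V^T
assert np.allclose(A, U @ np.diag(s) @ Vt)

# Rank-1 sum form: A = sum_i sigma_i * u_i v_i^T
A_sum = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
assert np.allclose(A, A_sum)
```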
Eigendecomposition exists only for some matrices (must be square, must be diagonalisable). SVD exists for all matrices — rectangular, rank-deficient, even all-zero. This generality is why SVD is the workhorse: when you don't know what shape your matrix is, write SVD.
Multiplication by $A$ does three things in sequence:
1. Rewrite the input vector in the orthonormal "right singular" basis ($V^\top$). No stretching; just a rotation/reflection of axes.
2. Stretch each axis by its singular value ($\Sigma$). The $i$-th coordinate is multiplied by $\sigma_i$; coordinates beyond the rank are discarded (multiplied by 0).
3. Rewrite the result in the standard basis of the output space ($U$). Again a pure rotation/reflection.
Every linear map is, in the right pair of orthonormal bases, just a per-axis scaling. The two rotations are precisely the change-of-basis from "ordinary" to "principal" axes — on the input side and on the output side independently.
$A$ takes the unit sphere $\{\mathbf{x} : \|\mathbf{x}\| = 1\}$ to an ellipsoid (or a degenerate, flatter object if rank-deficient). The semi-axes of the ellipsoid have lengths equal to the singular values and directions given by the columns of $U$. The largest singular value $\sigma_1$ is the maximum stretch any input direction can experience — the spectral norm $\|A\|_2 = \sigma_1$. The smallest non-zero singular value is the minimum stretch on the row space.
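A small sketch verifying both readings numerically: the rotate, scale, rotate pipeline reproduces $A\mathbf{x}$, and $\sigma_1$ matches the spectral norm (shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # U (4,4), s (4,), Vt (4,6)

x = rng.standard_normal(6)
coords = Vt @ x          # 1. coordinates of x in the right-singular basis
scaled = s * coords      # 2. per-axis stretch by sigma_i
assert np.allclose(A @ x, U @ scaled)   # 3. re-express in the output space via U

# sigma_1 is the largest stretch any unit input experiences: the spectral norm.
assert np.isclose(np.linalg.norm(A, 2), s[0])
```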
For any $A$, the matrix $A^\top A$ is symmetric and PSD. Apply the spectral theorem (deck 06):
$$A^\top A = V \Sigma^\top \Sigma V^\top.$$
So the right singular vectors $V$ are the eigenvectors of $A^\top A$, and the singular values are the square roots of its eigenvalues.
Similarly $A A^\top = U \Sigma \Sigma^\top U^\top$ — left singular vectors are eigenvectors of $A A^\top$.
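Easy to confirm numerically; a minimal NumPy check:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigenvalues of A^T A are the squared singular values ...
evals = np.linalg.eigvalsh(A.T @ A)                # ascending order
assert np.allclose(np.sort(s**2), evals)

# ... and each right singular vector is an eigenvector of A^T A.
for i in range(len(s)):
    v = Vt[i]
    assert np.allclose(A.T @ A @ v, s[i]**2 * v)
```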
This is how sklearn.decomposition.PCA actually works: it takes the SVD of the centred data matrix rather than an eigendecomposition of the covariance.

Eckart-Young: the most-cited theorem in applied linear algebra.
Among all matrices of rank at most $k$, the closest one to $A$ in either Frobenius or spectral norm is the truncated SVD: keep only the $k$ largest singular values.
$$A_k = \sum_{i=1}^k \sigma_i\, \mathbf{u}_i \mathbf{v}_i^\top, \qquad \|A - A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2, \quad \|A - A_k\|_2 = \sigma_{k+1}.$$
This is the most general statement of "the best $k$-dimensional summary": truncate. The error is exactly the trailing singular values — nothing else.
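In code, truncation and the two error formulas look like this (a NumPy sketch on an arbitrary matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]           # keep the k largest singular values

# Frobenius error^2 = sum of discarded sigma_i^2; spectral error = sigma_{k+1}.
assert np.isclose(np.linalg.norm(A - A_k, 'fro')**2, np.sum(s[k:]**2))
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])
```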
A matrix's nominal rank is the count of non-zero singular values. Its effective rank is, informally, "how many singular values are big enough to matter".
Formal versions:
- Numerical rank: the number of singular values above a relative threshold, $\#\{i : \sigma_i \ge \epsilon\,\sigma_1\}$ for some tolerance $\epsilon$.
- Stable rank: $\|A\|_F^2 / \|A\|_2^2 = \sum_i \sigma_i^2 / \sigma_1^2$, always at most the nominal rank and insensitive to tiny singular values.
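A small numerical illustration of these measures (function names and tolerances are illustrative): a matrix that is nominally full rank but effectively rank 3.

```python
import numpy as np

def numerical_rank(A, rel_tol=1e-2):
    # count of singular values above a relative threshold
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s >= rel_tol * s[0]))

def stable_rank(A):
    # ||A||_F^2 / ||A||_2^2 = sum(sigma_i^2) / sigma_1^2
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.sum(s**2) / s[0]**2)

rng = np.random.default_rng(4)
A = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 80))   # true rank 3
A += 1e-3 * rng.standard_normal(A.shape)                           # plus tiny noise

print(np.linalg.matrix_rank(A))   # 80: every singular value is technically non-zero
print(numerical_rank(A))          # 3: only three singular values are "big enough to matter"
print(stable_rank(A))             # a small value between 1 and 3
```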
The "lottery ticket hypothesis" (Frankle & Carbin 2019), the success of LoRA, and the success of weight-sharing tricks like cross-attention KV reuse are all variations on a single empirical fact: trained transformers have far less degrees-of-freedom than their parameter count suggests. Linear-algebraically, that's a low effective rank.
LoRA (Hu et al., ICLR 2022) is the dominant parameter-efficient fine-tuning technique. Idea: freeze the pre-trained $W \in \mathbb{R}^{d_{out} \times d_{in}}$, add a low-rank trainable update.
$$W_{\text{ft}} = W + \Delta W, \qquad \Delta W = B A,\quad A \in \mathbb{R}^{r \times d_{in}},\ B \in \mathbb{R}^{d_{out} \times r}.$$
$\Delta W$ has rank $\le r$. Initialise $B = 0$, $A$ random Gaussian, so $\Delta W = 0$ at start. Train only $A$ and $B$.
Empirically, the rank of $\Delta W$ in full fine-tuning is very low — often $\le 32$. So the rank constraint loses very little. Aghajanyan et al. (2020) made this precise with their "intrinsic dimension" experiments before LoRA itself was published; LoRA generalised the observation into a clean parameterisation.
LoRA in practice is parameterised as $\Delta W = (\alpha/r)\, B A$, with a hyperparameter $\alpha$ — typically equal to $r$ (so the scale is 1) or $2r$. The $\alpha/r$ factor decouples the choice of rank from the effective learning rate; doubling $r$ doesn't change the effective scale of updates. A small detail, but one that matters.
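A minimal PyTorch sketch of this parameterisation (layer name, initialisation scale and defaults are illustrative, not the reference implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen W plus a trainable low-rank update (alpha/r) * B A."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_out, d_in) * 0.02,
                              requires_grad=False)            # stands in for the pretrained weight
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # random Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))          # zero init, so Delta W = 0 at the start
        self.scale = alpha / r                                 # decouples rank from update scale

    def forward(self, x):
        return x @ self.W.T + self.scale * ((x @ self.A.T) @ self.B.T)

layer = LoRALinear(d_in=1024, d_out=1024)
x = torch.randn(4, 1024)
y = layer(x)   # identical to the frozen layer at initialisation, since B = 0
# Only A and B receive gradients; the pretrained weight stays frozen.
```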
The LoRA family has spawned a small zoo of refinements, each addressing a specific limitation.
| Method | Idea | What it adds |
|---|---|---|
| QLoRA (Dettmers 2023) | 4-bit-quantise the frozen $W$; LoRA in fp16 on top. | Trains 65B-param models on a single 48 GB GPU. Now standard for hobbyist fine-tuning. |
| DoRA (Liu 2024) | Decompose $W = m \cdot \frac{V}{\|V\|}$ into magnitude $m$ and direction $\frac{V}{\|V\|}$. LoRA on the direction; full-rank update on $m$. | Closes most of the gap to full FT at the same param budget. The mag/dir split mirrors weight normalisation. |
| VeRA (Kopiczko 2024) | Share random projections $A_{\text{shared}}, B_{\text{shared}}$ across all layers; learn only per-layer scaling vectors $\mathbf{b}, \mathbf{d}$. | Even fewer trainable params (~10× fewer than LoRA) at minor accuracy cost. |
| LoRA+ (Hayou 2024) | Different learning rates for $A$ and $B$ to balance their gradient scales. | Better convergence at the same FLOPs. |
| PiSSA (Meng 2024) | Initialise $A, B$ from the top-$r$ SVD components of $W$. Learn the residual. | Better starting point; faster convergence than zero-init LoRA. |
All of them sit on the same linear-algebraic substrate: a rank-$r$ correction on top of a frozen full-rank pre-trained map. The differences are about how to budget the rank and where to put it.
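Of these, PiSSA leans most directly on the decomposition itself. A sketch of that style of initialisation (the exact split of $\Sigma^{1/2}$ between the two factors is illustrative, not necessarily the paper's convention):

```python
import torch

W = torch.randn(1024, 1024)                  # stands in for a pretrained weight
r = 16
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

B_init = U[:, :r] * S[:r].sqrt()             # (d_out, r)
A_init = S[:r].sqrt().unsqueeze(1) * Vh[:r]  # (r, d_in)
W_residual = W - B_init @ A_init             # frozen part; B A starts as the top-r slice of W

# The residual's largest singular value is sigma_{r+1}: the adapter owns the principal directions.
print(torch.linalg.matrix_norm(W_residual, ord=2).item(), S[r].item())
```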
Multi-head Latent Attention (DeepSeek-V2/V3, 2024) attacks a different problem: the KV cache that grows with sequence length. By pushing low-rank structure into activations rather than weights, MLA cuts the KV-cache footprint by an order of magnitude with negligible quality loss.
For a sequence of $L$ tokens, multi-head attention caches the keys and values at every layer:
$$\text{cache size per layer} = 2 \cdot L \cdot H \cdot d_h = 2L\, d_{model} \text{ floats},$$
where $H$ is the number of heads and $d_h$ the per-head dimension, so $H d_h = d_{model}$.
For a Llama-2-70B-scale model ($d_{model} = 8192$, 80 layers) at $L = 32\text{k}$, that's $2 \cdot 32768 \cdot 8192 \approx 5.4 \times 10^8$ values per layer, about 1 GB per layer in fp16, roughly 85 GB across all layers, just for the cache.
Instead of caching full $\mathbf{k}_i, \mathbf{v}_i$ vectors per token, cache a compressed latent
$$\mathbf{c}_i = W^{KV}_{\text{down}} \mathbf{h}_i \in \mathbb{R}^{r}$$
with $r \ll d_{model}$ (DeepSeek-V2 uses $r = 512$ vs $d_{model} = 5120$). At inference, recompute keys and values on demand:
$$\mathbf{k}_i = W^K_{\text{up}} \mathbf{c}_i, \quad \mathbf{v}_i = W^V_{\text{up}} \mathbf{c}_i.$$
Cache size is now $L \cdot r$ instead of $2L\, d_{model}$ — a factor of $\sim 20\times$ reduction. The KV up-projections happen at attention time, not at cache-write time, but they're cheap because they operate on the same latent vector that lives in the cache. RoPE doesn't commute with the up-projection, so DeepSeek-V2 handles position with a decoupled design: a small extra rotary key per token is computed directly from the hidden state and cached alongside the latent.
MLA assumes the keys (and values) for a given sequence position can be parameterised by a low-dimensional latent $\mathbf{c}$. This is exactly the rank-$r$ approximation argument applied to the joint $(\mathbf{k}, \mathbf{v})$ stack across heads. The weight matrices stay full rank; only the per-token activation is constrained. Same Eckart-Young idea, applied to a different object.
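A toy sketch of the latent-KV idea (single head, no RoPE path, no absorption of the up-projections into the query/output weights; module names and sizes are illustrative rather than DeepSeek's actual layout):

```python
import torch
import torch.nn as nn

d_model, r, L = 5120, 512, 2048

W_kv_down = nn.Linear(d_model, r, bias=False)   # applied once per token at cache-write time
W_k_up    = nn.Linear(r, d_model, bias=False)   # applied at attention time
W_v_up    = nn.Linear(r, d_model, bias=False)

h = torch.randn(1, L, d_model)                  # hidden states for L tokens
cache = W_kv_down(h)                            # (1, L, r): the only per-token state stored

k = W_k_up(cache)                               # keys recomputed on demand from the latent
v = W_v_up(cache)                               # values likewise

print((2 * L * d_model) / (L * r))              # 20.0: the cache-size reduction factor
```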
Pick a procedurally-generated image, then truncate its SVD to rank $k$. The image is recomposed as $A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top$. As $k$ grows you see the picture sharpen: the first few components capture the broad structure, later ones add detail and texture, and the smallest add noise.
The middle bar chart shows the singular values $\sigma_1 \ge \sigma_2 \ge \cdots$. Violet bars are kept; grey bars are truncated. Notice that for the "Random low-rank" image the spectrum drops sharply to noise after a few components — rank-4 is essentially perfect. For the gradient/text images the decay is more gradual; you'd need more components to recover sharp edges.
For $A = U\Sigma V^\top$, the Moore-Penrose pseudoinverse is
$$A^+ = V \Sigma^+ U^\top,$$
where $\Sigma^+$ is obtained by transposing $\Sigma$ and replacing each non-zero $\sigma_i$ with $1/\sigma_i$. Zero singular values stay zero.
The pseudoinverse gives the minimum-norm least-squares solution $A^+\mathbf{b}$ in closed form; the numerical instability of dividing by tiny singular values is exactly the issue ridge regression addresses, replacing $1/\sigma_i$ with $\sigma_i / (\sigma_i^2 + \lambda)$.
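Both formulas drop straight out of the SVD; a NumPy sketch (a full-column-rank example, so every $\sigma_i > 0$):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((10, 4))                 # full column rank (almost surely)
b = rng.standard_normal(10)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Pseudoinverse: invert each non-zero sigma.
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A))

# Ridge: shrink 1/sigma to sigma/(sigma^2 + lambda), so tiny sigmas stop exploding.
lam = 0.1
x_ridge = Vt.T @ ((s / (s**2 + lam)) * (U.T @ b))
assert np.allclose(x_ridge, np.linalg.solve(A.T @ A + lam * np.eye(4), A.T @ b))
```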
Use eigendecomp for square symmetric matrices (covariance, Hessian). Use SVD for everything else: rectangular weight matrices, Jacobians, data matrices, attention scores. SVD is strictly more general; for a symmetric PSD matrix the two coincide. Default to SVD unless you have a specific reason not to.
Deck 08 — Orthogonality, QR & RoPE picks up where SVD's "rotations" left off: how to actually compute orthonormal bases, and why a particular family of 2×2 rotations — RoPE — has become standard for transformer position encoding.