Linear Algebra for AI — Presentation 07

SVD, Low-Rank & LoRA

The Singular Value Decomposition is the most useful theorem in applied linear algebra. It gives the best low-rank approximation of any matrix — the foundation of LoRA, MLA, low-rank KV-cache compression, and almost every "just project to a smaller space" trick in modern ML.

[Title diagram: $A = U\Sigma V^\top$ → truncate to top $k$ (Eckart-Young best rank-$k$ approximation) → LoRA / DoRA / MLA]
00

Topics We'll Cover

01

The Theorem & the Three Matrices

The SVD theorem. Every matrix $A \in \mathbb{R}^{m\times n}$ — any shape, any rank — can be factorised as

$$A = U\Sigma V^\top,$$

where $U \in \mathbb{R}^{m\times m}$ is orthogonal ($U^\top U = I$), $V \in \mathbb{R}^{n\times n}$ is orthogonal ($V^\top V = I$), and $\Sigma \in \mathbb{R}^{m\times n}$ is "diagonal" — non-negative entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$ on the main diagonal, zeros elsewhere.

The columns of $U$ are left singular vectors, the columns of $V$ are right singular vectors, and the $\sigma_i$ are singular values. The number of non-zero singular values equals $\mathrm{rank}\,A$.

Compact (thin) SVD

If $\mathrm{rank}\,A = r$, store only the first $r$ columns of $U$ and $V$ and the $r\times r$ leading block of $\Sigma$:

$$A = U_r \Sigma_r V_r^\top = \sum_{i=1}^r \sigma_i\, \mathbf{u}_i \mathbf{v}_i^\top.$$

The sum on the right is the rank-1 decomposition view: $A$ is a sum of $r$ rank-1 outer products, each weighted by its singular value. This form is often more useful than the full factorisation.
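The thin SVD and its rank-1-sum form can be checked directly in numpy (the matrix here is an arbitrary illustrative example):

```python
import numpy as np

# Build a rank-deficient 6x4 matrix (rank <= 3 by construction).
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD
r = int(np.sum(s > 1e-10))                        # numerical rank

# Rebuild A as a weighted sum of r rank-1 outer products.
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))
assert r == 3
assert np.allclose(A, A_rebuilt)
```

Note that only $r = 3$ of the four singular values are (numerically) non-zero, matching $\mathrm{rank}\,A$.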

Why "every matrix"?

Eigendecomposition exists only for some matrices (must be square, must be diagonalisable). SVD exists for all matrices — rectangular, rank-deficient, even all-zero. This generality is why SVD is the workhorse: when you don't know what shape or rank your matrix has, reach for SVD.

02

Geometric Reading: Rotate, Scale, Rotate

Multiplication by $A$ does three things in sequence:

1. $V^\top$ — rotate

Rewrite the input vector in the orthonormal "right singular" basis. No stretching; just a rotation/reflection of axes.

2. $\Sigma$ — scale

Stretch each axis by its singular value. The $i$-th coordinate is multiplied by $\sigma_i$. Coordinates beyond rank are discarded (multiplied by 0).

3. $U$ — rotate

Rewrite the result in the standard basis of the output space. Again pure rotation/reflection.

Every linear map is, in the right pair of orthonormal bases, just a per-axis scaling. The two rotations are precisely the change-of-basis from "ordinary" to "principal" axes — on the input side and on the output side independently.

What this means for the unit ball

$A$ takes the unit sphere $\{\mathbf{x} : \|\mathbf{x}\| = 1\}$ to an ellipsoid (or a degenerate flatter object if rank-deficient). The semi-axes of the ellipsoid are the singular values, and their directions are $U$. The largest singular value $\sigma_1$ is the maximum stretch any input direction can experience — the spectral norm $\|A\|_2 = \sigma_1$. The smallest non-zero singular value is the minimum stretch on the row space.
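The "maximum stretch" reading of $\sigma_1$ is easy to verify numerically: no unit vector is stretched by more than $\sigma_1$, and the top right singular vector achieves it exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
sigma = np.linalg.svd(A, compute_uv=False)

# Sample many unit vectors; none is stretched beyond sigma_1.
xs = rng.standard_normal((3, 10000))
xs /= np.linalg.norm(xs, axis=0)
stretches = np.linalg.norm(A @ xs, axis=0)
assert stretches.max() <= sigma[0] + 1e-9

# The top right singular vector v_1 attains the maximum stretch,
# and sigma_1 is exactly the spectral norm ||A||_2.
_, _, Vt = np.linalg.svd(A)
assert np.isclose(np.linalg.norm(A @ Vt[0]), sigma[0])
assert np.isclose(np.linalg.norm(A, 2), sigma[0])
```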

03

Relation to Eigendecomposition

For any $A$, the matrix $A^\top A$ is symmetric and PSD. Apply the spectral theorem (deck 06):

$$A^\top A = V \Sigma^\top \Sigma V^\top.$$

So the right singular vectors $V$ are the eigenvectors of $A^\top A$, and the singular values are the square roots of its eigenvalues.

Similarly $A A^\top = U \Sigma \Sigma^\top U^\top$ — left singular vectors are eigenvectors of $A A^\top$.
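The eigenvalue/singular-value relationship is a one-liner to check:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))

sigma = np.linalg.svd(A, compute_uv=False)
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]  # sort descending

# Eigenvalues of A^T A are the squared singular values of A.
# (clip guards against tiny negative values from round-off.)
assert np.allclose(np.sqrt(np.clip(eigvals, 0, None)), sigma)
```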

What this gives in practice

04

Truncated SVD & the Eckart-Young Theorem

Arguably the most-used theorem in applied linear algebra:

Eckart-Young-Mirsky (1936/1960)

Among all matrices of rank at most $k$, the closest one to $A$ in either Frobenius or spectral norm is the truncated SVD: keep only the $k$ largest singular values.

$$A_k = \sum_{i=1}^k \sigma_i\, \mathbf{u}_i \mathbf{v}_i^\top, \qquad \|A - A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2, \quad \|A - A_k\|_2 = \sigma_{k+1}.$$

This is the most general statement of "the best $k$-dimensional summary": truncate. The error is exactly the trailing singular values — nothing else.
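Both error formulas can be verified exactly — the Frobenius error is the sum of the squared trailing singular values, and the spectral error is $\sigma_{k+1}$:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] * s[:k] @ Vt[:k]   # truncated SVD, rank k

# Eckart-Young-Mirsky: the error is exactly the tail of the spectrum.
assert np.isclose(np.linalg.norm(A - A_k, 'fro')**2, np.sum(s[k:]**2))
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])  # sigma_{k+1}
```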

Why this is the foundation of modern ML compression

05

Effective Rank & the Singular-Value Spectrum

A matrix's nominal rank is the count of non-zero singular values. Its effective rank is, informally, "how many singular values are big enough to matter".

Formal versions:

What real ML matrices look like

The lottery-ticket / low-rank coincidence

The "lottery ticket hypothesis" (Frankle & Carbin 2019), the success of LoRA, and the success of weight-sharing tricks like cross-attention KV reuse are all variations on a single empirical fact: trained transformers have far fewer degrees of freedom than their parameter count suggests. Linear-algebraically, that's a low effective rank.
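One common formalisation — a sketch, not the only definition — is the entropy-based effective rank of Roy & Vetterli (2007): the exponential of the Shannon entropy of the normalised singular values.

```python
import numpy as np

def effective_rank(A, eps=1e-12):
    """Entropy-based effective rank: exp of the Shannon entropy
    of the singular values normalised to sum to 1."""
    s = np.linalg.svd(A, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]
    return float(np.exp(-np.sum(p * np.log(p))))

# A nominally full-rank matrix dominated by one big direction:
# nominal rank 50, effective rank close to 1.
rng = np.random.default_rng(4)
u, v = rng.standard_normal(50), rng.standard_normal(50)
A = 100.0 * np.outer(u, v) + 0.01 * rng.standard_normal((50, 50))
assert np.linalg.matrix_rank(A) == 50   # nominal: full rank
assert effective_rank(A) < 2.0          # effective: ~1
```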

06

LoRA — Fine-Tuning as a Rank Constraint

LoRA (Hu et al., ICLR 2022) is the dominant parameter-efficient fine-tuning technique. Idea: freeze the pre-trained $W \in \mathbb{R}^{d_{out} \times d_{in}}$, add a low-rank trainable update.

$$W_{\text{ft}} = W + \Delta W, \qquad \Delta W = B A,\quad A \in \mathbb{R}^{r \times d_{in}},\ B \in \mathbb{R}^{d_{out} \times r}.$$

$\Delta W$ has rank $\le r$. Initialise $B = 0$, $A$ random Gaussian, so $\Delta W = 0$ at start. Train only $A$ and $B$.

The numbers

Why does it work?

Empirically, the update $\Delta W$ learned by full fine-tuning has a very low effective rank — often $\le 32$. So the rank constraint loses very little. Aghajanyan et al. (2020) made this precise with their "intrinsic dimension" experiments before LoRA itself was published; LoRA generalised the observation into a clean parameterisation.

The scale factor $\alpha/r$

LoRA in practice is parameterised as $\Delta W = (\alpha/r)\, B A$, with a hyperparameter $\alpha$ — typically equal to $r$ (so the scale is 1) or $2r$. The $\alpha/r$ factor decouples the choice of rank from the effective learning rate; doubling $r$ doesn't change the effective scale of updates. A small but consequential detail.
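A minimal numpy sketch of the parameterisation (variable names are illustrative): frozen $W$, a rank-$r$ update scaled by $\alpha/r$, with LoRA's init ($B = 0$, $A$ Gaussian) so the model is unchanged at step 0.

```python
import numpy as np

rng = np.random.default_rng(5)
d_in, d_out, r, alpha = 64, 32, 4, 8

W = rng.standard_normal((d_out, d_in))      # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, Gaussian init
B = np.zeros((d_out, r))                    # trainable, zero init

def forward(x):
    # Never materialise delta_W = B @ A: two skinny matmuls cost
    # O((d_in + d_out) * r) instead of O(d_in * d_out).
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
assert np.allclose(forward(x), W @ x)       # identity update at init

# Trainable params: r*(d_in + d_out) = 384, vs 2048 frozen.
assert A.size + B.size == r * (d_in + d_out)
```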

07

DoRA, VeRA & LoRA's Cousins

The LoRA family has spawned a small zoo of refinements, each addressing a specific limitation.

QLoRA (Dettmers 2023) — Idea: 4-bit-quantise the frozen $W$; run LoRA in fp16 on top. What it adds: trains 65B-param models on a single 48 GB GPU; now standard for hobbyist fine-tuning.

DoRA (Liu 2024) — Idea: decompose $W = m \cdot \frac{V}{\|V\|}$ into magnitude $m$ and direction $\frac{V}{\|V\|}$; LoRA on the direction, full-rank update on $m$. What it adds: closes most of the gap to full fine-tuning at the same parameter budget; the magnitude/direction split mirrors weight normalisation.

VeRA (Kopiczko 2024) — Idea: share frozen random projections $A_{\text{shared}}, B_{\text{shared}}$ across all layers; learn only per-layer scaling vectors $\mathbf{b}, \mathbf{d}$. What it adds: roughly 10× fewer trainable params than LoRA at minor accuracy cost.

LoRA+ (Hayou 2024) — Idea: different learning rates for $A$ and $B$ to balance their gradient scales. What it adds: better convergence at the same FLOPs.

PiSSA (Meng 2024) — Idea: initialise $A, B$ from the top-$r$ SVD components of $W$; learn the residual. What it adds: a better starting point and faster convergence than zero-init LoRA.

All of them sit on the same linear-algebraic substrate: a rank-$r$ correction on top of a frozen full-rank pre-trained map. The differences are about how to budget the rank and where to put it.
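As one concrete example from the list above, a PiSSA-style initialisation is a direct application of truncated SVD (a sketch with illustrative names, not the paper's exact code):

```python
import numpy as np

rng = np.random.default_rng(6)
W = rng.standard_normal((32, 64))   # stand-in pre-trained weight
r = 4

# Split the top-r singular directions into trainable B, A and
# freeze the residual W_res = W - B @ A.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
B = U[:, :r] * np.sqrt(s[:r])            # (d_out, r)
A = np.sqrt(s[:r])[:, None] * Vt[:r]     # (r, d_in)
W_res = W - B @ A                        # frozen residual

# At init, W_res + B @ A reproduces W exactly, and B @ A is the
# Eckart-Young best rank-r piece of W (spectral error sigma_{r+1}).
assert np.allclose(W_res + B @ A, W)
assert np.isclose(np.linalg.norm(W - B @ A, 2), s[r])
```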

08

DeepSeek MLA — Low-Rank KV Cache

Multi-head Latent Attention (DeepSeek-V2/V3, 2024) attacks a different problem: the KV cache that grows with sequence length. By pushing low-rank structure into activations rather than weights, MLA cuts the KV-cache footprint by an order of magnitude with negligible quality loss.

The standard KV cache

For a sequence of $L$ tokens, multi-head attention caches the keys and values at every layer:

$$\text{cache size per layer} = 2 \cdot L \cdot H \cdot d_h = 2L\, d_{model} \text{ floats}.$$

For a Llama-2-70B-scale model at $d_{model} = 8192$, $L = 32k$, that's $2 \cdot 32768 \cdot 8192 \approx 0.5$ billion values per layer — about 1 GB per layer in fp16, roughly 80 GB across 80 layers — just for the cache.

MLA's substitution

Instead of caching full $\mathbf{k}_i, \mathbf{v}_i$ vectors per token, cache a compressed latent

$$\mathbf{c}_i = W^{KV}_{\text{down}} \mathbf{h}_i \in \mathbb{R}^{r}$$

with $r \ll d_{model}$ (DeepSeek-V2 uses $r = 512$ vs $d_{model} = 5120$). At inference, recompute keys and values on demand:

$$\mathbf{k}_i = W^K_{\text{up}} \mathbf{c}_i, \quad \mathbf{v}_i = W^V_{\text{up}} \mathbf{c}_i.$$

Cache size is now $L \cdot r$ instead of $2L\, d_{model}$ — a factor of $\sim 20\times$ reduction. The KV up-projections happen at attention time, not at cache-write time, but they're cheap because they batch over the same latent vector that lives in the cache. RoPE doesn't commute with the up-projection, so DeepSeek adds a small decoupled per-token RoPE key, cached alongside the latent, to preserve positional information.

Why this is fundamentally a rank trick

MLA assumes the keys (and values) for a given sequence position can be parameterised by a low-dimensional latent $\mathbf{c}$. This is exactly the rank-$r$ approximation argument applied to the joint $(\mathbf{k}, \mathbf{v})$ stack across heads. The weight matrices stay full rank; only the per-token activation is constrained. Same Eckart-Young idea, applied to a different object.
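The down-project/cache/up-project flow can be sketched with random weights at DeepSeek-V2-like sizes (names hypothetical; the decoupled RoPE key is omitted here):

```python
import numpy as np

rng = np.random.default_rng(7)
d_model, r, L = 5120, 512, 1000   # DeepSeek-V2-like dims, short sequence

W_down = rng.standard_normal((r, d_model)) * 0.01   # shared KV down-proj
W_up_k = rng.standard_normal((d_model, r)) * 0.01   # key up-proj
W_up_v = rng.standard_normal((d_model, r)) * 0.01   # value up-proj

H = rng.standard_normal((L, d_model))   # token hidden states
C = H @ W_down.T                        # cached latents, shape (L, r)

# At attention time, recompute K and V from the cached latents.
K = C @ W_up_k.T
V = C @ W_up_v.T

# Cache shrinks from 2*L*d_model floats to L*r floats: 20x here.
assert (2 * L * d_model) / (L * r) == 20.0
assert C.shape == (L, r) and K.shape == (L, d_model)
```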

09

Interactive: Image Rank Slider

Pick a procedurally-generated image, then truncate its SVD to rank $k$. The image is recomposed as $A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top$. As $k$ grows you see the picture sharpen: the first few components capture the broad structure, later ones add detail and texture, and the smallest add noise.

[Interactive controls: image size 32×32, rank slider $k = 8$ → params (full) 1024, params (rank-$k$) 520, Frobenius error 0.20.]

The middle bar chart shows the singular values $\sigma_1 \ge \sigma_2 \ge \cdots$. Violet bars are kept; grey bars are truncated. Notice that for the "Random low-rank" image the spectrum drops sharply to noise after a few components — rank-4 is essentially perfect. For the gradient/text images the decay is more gradual; you'd need more components to recover sharp edges.
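The slider's bookkeeping is easy to reproduce offline (the image here is a synthetic low-rank-plus-noise stand-in, not the deck's actual picture):

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 32, 8
# Synthetic "image": low-rank structure plus noise.
img = (rng.standard_normal((n, 4)) @ rng.standard_normal((4, n))
       + 0.1 * rng.standard_normal((n, n)))

U, s, Vt = np.linalg.svd(img)
img_k = U[:, :k] * s[:k] @ Vt[:k]           # rank-k recomposition

params_full = n * n                         # 1024 pixels
params_k = k * (2 * n + 1)                  # k*(m+n+1) = 520: u_i, v_i, sigma_i
rel_err = np.linalg.norm(img - img_k, 'fro') / np.linalg.norm(img, 'fro')

assert params_full == 1024 and params_k == 520
# Relative Frobenius error^2 is the tail energy fraction (Eckart-Young).
assert np.isclose(rel_err**2, np.sum(s[k:]**2) / np.sum(s**2))
```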

10

The Pseudoinverse via SVD

For $A = U\Sigma V^\top$, the Moore-Penrose pseudoinverse is

$$A^+ = V \Sigma^+ U^\top,$$

where $\Sigma^+$ is obtained by transposing $\Sigma$ and replacing each non-zero $\sigma_i$ with $1/\sigma_i$. Zero singular values stay zero.

What it solves

The pseudoinverse is also the standard way to solve ridge regression in closed form; numerical stability of small singular values is exactly the issue ridge addresses (replacing $1/\sigma_i$ with $\sigma_i / (\sigma_i^2 + \lambda)$).
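Both the pseudoinverse and the ridge-filtered variant fall straight out of the SVD:

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((10, 4))   # full column rank (almost surely)
b = rng.standard_normal(10)

# Pseudoinverse via SVD: invert the non-zero singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A))

# Least-squares solution x = A^+ b matches lstsq.
x_ls = A_pinv @ b
assert np.allclose(x_ls, np.linalg.lstsq(A, b, rcond=None)[0])

# Ridge: replace 1/sigma with sigma/(sigma^2 + lambda), which equals
# the closed-form (A^T A + lambda I)^{-1} A^T b.
lam = 0.1
x_ridge = Vt.T @ ((s / (s**2 + lam)) * (U.T @ b))
assert np.allclose(x_ridge,
                   np.linalg.solve(A.T @ A + lam * np.eye(4), A.T @ b))
```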

When to reach for SVD vs. eigendecomp

Use eigendecomp for square symmetric matrices (covariance, Hessian). Use SVD for everything else: rectangular weight matrices, Jacobians, data matrices, attention scores. SVD is strictly more general; for a symmetric PSD matrix the two coincide. Default to SVD unless you have a specific reason not to.

11

Cheat Sheet

Read next

Deck 08 — Orthogonality, QR & RoPE picks up where SVD's "rotations" left off: how to actually compute orthonormal bases, and why a particular family of 2×2 rotations — RoPE — has become standard for transformer position encoding.