The Singular Value Decomposition is the most useful theorem in applied linear algebra. It gives the best low-rank approximation of any matrix — the foundation of LoRA, MLA, low-rank KV-cache compression, and almost every "just project to a smaller space" trick in modern ML.
The SVD theorem. Every matrix $A \in \mathbb{R}^{m\times n}$ — any shape, any rank — can be factorised as
$$A = U\Sigma V^\top,$$
where $U \in \mathbb{R}^{m\times m}$ is orthogonal ($U^\top U = I$), $V \in \mathbb{R}^{n\times n}$ is orthogonal ($V^\top V = I$), and $\Sigma \in \mathbb{R}^{m\times n}$ is "diagonal" — non-negative entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$ on the main diagonal, zeros elsewhere.
The columns of $U$ are left singular vectors, the columns of $V$ are right singular vectors, and the $\sigma_i$ are singular values. The number of non-zero singular values equals $\mathrm{rank}\,A$.
If $\mathrm{rank}\,A = r$, store only the first $r$ columns of $U$ and $V$ and the $r\times r$ leading block of $\Sigma$:
$$A = U_r \Sigma_r V_r^\top = \sum_{i=1}^r \sigma_i\, \mathbf{u}_i \mathbf{v}_i^\top.$$
The sum on the right is the rank-1 decomposition view: $A$ is a sum of $r$ rank-1 outer products, each weighted by its singular value. Often more useful than the full factored form.
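A quick numerical check of both views, in NumPy (the matrix and shapes here are arbitrary, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))                    # any rectangular matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: U (5,3), s (3,), Vt (3,3)

# Factored form: A = U Sigma V^T
assert np.allclose(A, U @ np.diag(s) @ Vt)

# Rank-1 sum form: A = sum_i sigma_i * u_i v_i^T
A_sum = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
assert np.allclose(A, A_sum)
```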
Eigendecomposition exists only for some matrices (must be square, must be diagonalisable). SVD exists for all matrices — rectangular, rank-deficient, even all-zero. This generality is why SVD is the workhorse: when you don't know what shape your matrix is, write SVD.
Multiplication by $A$ does three things in sequence:
1. Rewrite the input vector in the orthonormal "right singular" basis ($V^\top$). No stretching; just a rotation/reflection of axes.
2. Stretch each axis by its singular value ($\Sigma$). The $i$-th coordinate is multiplied by $\sigma_i$; coordinates beyond the rank are discarded (multiplied by 0).
3. Rewrite the result in the standard basis of the output space ($U$). Again a pure rotation/reflection.
Every linear map is, in the right pair of orthonormal bases, just a per-axis scaling. The two rotations are precisely the change-of-basis from "ordinary" to "principal" axes — on the input side and on the output side independently.
$A$ takes the unit sphere $\{\mathbf{x} : \|\mathbf{x}\| = 1\}$ to an ellipsoid (or a degenerate, flatter object if rank-deficient). The semi-axes of the ellipsoid have lengths equal to the singular values and directions given by the columns of $U$. The largest singular value $\sigma_1$ is the maximum stretch any input direction can experience — the spectral norm $\|A\|_2 = \sigma_1$. The smallest non-zero singular value is the minimum stretch on the row space.
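A small sketch verifying both readings numerically: the rotate, scale, rotate pipeline reproduces $A\mathbf{x}$, and $\sigma_1$ matches the spectral norm (shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # U (4,4), s (4,), Vt (4,6)

x = rng.standard_normal(6)
coords = Vt @ x          # 1. coordinates of x in the right-singular basis
scaled = s * coords      # 2. per-axis stretch by sigma_i
assert np.allclose(A @ x, U @ scaled)   # 3. re-express in the output space via U

# sigma_1 is the largest stretch any unit input experiences: the spectral norm.
assert np.isclose(np.linalg.norm(A, 2), s[0])
```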
For any $A$, the matrix $A^\top A$ is symmetric and PSD. Apply the spectral theorem (deck 06):
$$A^\top A = V \Sigma^\top \Sigma V^\top.$$
So the right singular vectors $V$ are the eigenvectors of $A^\top A$, and the singular values are the square roots of its eigenvalues.
Similarly $A A^\top = U \Sigma \Sigma^\top U^\top$ — left singular vectors are eigenvectors of $A A^\top$.
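Easy to confirm numerically; a minimal NumPy check:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigenvalues of A^T A are the squared singular values ...
evals = np.linalg.eigvalsh(A.T @ A)                # ascending order
assert np.allclose(np.sort(s**2), evals)

# ... and each right singular vector is an eigenvector of A^T A.
for i in range(len(s)):
    v = Vt[i]
    assert np.allclose(A.T @ A @ v, s[i]**2 * v)
```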
This is how sklearn.decomposition.PCA actually works: it takes the SVD of the centred data matrix rather than an eigendecomposition of the covariance.

Eckart-Young: the most-cited theorem in applied linear algebra.
Among all matrices of rank at most $k$, the closest one to $A$ in either Frobenius or spectral norm is the truncated SVD: keep only the $k$ largest singular values.
$$A_k = \sum_{i=1}^k \sigma_i\, \mathbf{u}_i \mathbf{v}_i^\top, \qquad \|A - A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2, \quad \|A - A_k\|_2 = \sigma_{k+1}.$$
This is the most general statement of "the best $k$-dimensional summary": truncate. The error is exactly the trailing singular values — nothing else.
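In code, truncation and the two error formulas look like this (a NumPy sketch on an arbitrary matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]           # keep the k largest singular values

# Frobenius error^2 = sum of discarded sigma_i^2; spectral error = sigma_{k+1}.
assert np.isclose(np.linalg.norm(A - A_k, 'fro')**2, np.sum(s[k:]**2))
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])
```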
A matrix's nominal rank is the count of non-zero singular values. Its effective rank is, informally, "how many singular values are big enough to matter".
Formal versions:
- Numerical rank: the number of singular values above a relative threshold, $\#\{i : \sigma_i \ge \epsilon\,\sigma_1\}$ for some tolerance $\epsilon$.
- Stable rank: $\|A\|_F^2 / \|A\|_2^2 = \sum_i \sigma_i^2 / \sigma_1^2$, always at most the nominal rank and insensitive to tiny singular values.
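A small numerical illustration of these measures (function names and tolerances are illustrative): a matrix that is nominally full rank but effectively rank 3.

```python
import numpy as np

def numerical_rank(A, rel_tol=1e-2):
    # count of singular values above a relative threshold
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s >= rel_tol * s[0]))

def stable_rank(A):
    # ||A||_F^2 / ||A||_2^2 = sum(sigma_i^2) / sigma_1^2
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.sum(s**2) / s[0]**2)

rng = np.random.default_rng(4)
A = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 80))   # true rank 3
A += 1e-3 * rng.standard_normal(A.shape)                           # plus tiny noise

print(np.linalg.matrix_rank(A))   # 80: every singular value is technically non-zero
print(numerical_rank(A))          # 3: only three singular values are "big enough to matter"
print(stable_rank(A))             # a small value between 1 and 3
```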
The "lottery ticket hypothesis" (Frankle & Carbin 2019), the success of LoRA, and the success of weight-sharing tricks like cross-attention KV reuse are all variations on a single empirical fact: trained transformers have far less degrees-of-freedom than their parameter count suggests. Linear-algebraically, that's a low effective rank.
LoRA (Hu et al., ICLR 2022) is the dominant parameter-efficient fine-tuning technique. Idea: freeze the pre-trained $W \in \mathbb{R}^{d_{out} \times d_{in}}$, add a low-rank trainable update.
$$W_{\text{ft}} = W + \Delta W, \qquad \Delta W = B A,\quad A \in \mathbb{R}^{r \times d_{in}},\ B \in \mathbb{R}^{d_{out} \times r}.$$
$\Delta W$ has rank $\le r$. Initialise $B = 0$, $A$ random Gaussian, so $\Delta W = 0$ at start. Train only $A$ and $B$.
Empirically, the rank of $\Delta W$ in full fine-tuning is very low — often $\le 32$. So the rank constraint loses very little. Aghajanyan et al. (2020) made this precise with their "intrinsic dimension" experiments before LoRA itself was published; LoRA generalised the observation into a clean parameterisation.
LoRA in practice is parameterised as $\Delta W = (\alpha/r)\, B A$, with a hyperparameter $\alpha$ — typically equal to $r$ (so the scale is 1) or $2r$. The $\alpha/r$ factor decouples the choice of rank from the effective learning rate; doubling $r$ doesn't change the effective scale of updates. A small detail, but one that matters.
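A minimal PyTorch sketch of this parameterisation (layer name, initialisation scale and defaults are illustrative, not the reference implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen W plus a trainable low-rank update (alpha/r) * B A."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_out, d_in) * 0.02,
                              requires_grad=False)            # stands in for the pretrained weight
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # random Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))          # zero init, so Delta W = 0 at the start
        self.scale = alpha / r                                 # decouples rank from update scale

    def forward(self, x):
        return x @ self.W.T + self.scale * ((x @ self.A.T) @ self.B.T)

layer = LoRALinear(d_in=1024, d_out=1024)
x = torch.randn(4, 1024)
y = layer(x)   # identical to the frozen layer at initialisation, since B = 0
# Only A and B receive gradients; the pretrained weight stays frozen.
```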
The LoRA family has spawned a small zoo of refinements, each addressing a specific limitation.
| Method | Idea | What it adds |
|---|---|---|
| QLoRA (Dettmers 2023) | 4-bit-quantise the frozen $W$; LoRA in fp16 on top. | Trains 65B-param models on a single 48 GB GPU. Now standard for hobbyist fine-tuning. |
| DoRA (Liu 2024) | Decompose $W = m \cdot \frac{V}{\|V\|}$ into magnitude $m$ and direction $\frac{V}{\|V\|}$. LoRA on the direction; full-rank update on $m$. | Closes most of the gap to full FT at the same param budget. The mag/dir split mirrors weight normalisation. |
| VeRA (Kopiczko 2024) | Share random projections $A_{\text{shared}}, B_{\text{shared}}$ across all layers; learn only per-layer scaling vectors $\mathbf{b}, \mathbf{d}$. | Even fewer trainable params (~10× fewer than LoRA) at minor accuracy cost. |
| LoRA+ (Hayou 2024) | Different learning rates for $A$ and $B$ to balance their gradient scales. | Better convergence at the same FLOPs. |
| PiSSA (Meng 2024) | Initialise $A, B$ from the top-$r$ SVD components of $W$. Learn the residual. | Better starting point; faster convergence than zero-init LoRA. |
All of them sit on the same linear-algebraic substrate: a rank-$r$ correction on top of a frozen full-rank pre-trained map. The differences are about how to budget the rank and where to put it.
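Of these, PiSSA leans most directly on the decomposition itself. A sketch of that style of initialisation (the exact split of $\Sigma^{1/2}$ between the two factors is illustrative, not necessarily the paper's convention):

```python
import torch

W = torch.randn(1024, 1024)                  # stands in for a pretrained weight
r = 16
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

B_init = U[:, :r] * S[:r].sqrt()             # (d_out, r)
A_init = S[:r].sqrt().unsqueeze(1) * Vh[:r]  # (r, d_in)
W_residual = W - B_init @ A_init             # frozen part; B A starts as the top-r slice of W

# The residual's largest singular value is sigma_{r+1}: the adapter owns the principal directions.
print(torch.linalg.matrix_norm(W_residual, ord=2).item(), S[r].item())
```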
Multi-head Latent Attention (DeepSeek-V2/V3, 2024) attacks a different problem: the KV cache that grows with sequence length. By pushing low-rank structure into activations rather than weights, MLA cuts the KV-cache footprint by an order of magnitude with negligible quality loss.
For a sequence of $L$ tokens, multi-head attention caches the keys and values at every layer:
$$\text{cache size per layer} = 2 \cdot L \cdot H \cdot d_h = 2L\, d_{model} \text{ floats},$$
where $H$ is the number of heads and $d_h$ the per-head dimension, so $H d_h = d_{model}$.
For a Llama-2-70B-scale model ($d_{model} = 8192$, 80 layers) at $L = 32\text{k}$, that's $2 \cdot 32768 \cdot 8192 \approx 5.4 \times 10^8$ values per layer, about 1 GB per layer in fp16, roughly 85 GB across all layers, just for the cache.
Instead of caching full $\mathbf{k}_i, \mathbf{v}_i$ vectors per token, cache a compressed latent
$$\mathbf{c}_i = W^{KV}_{\text{down}} \mathbf{h}_i \in \mathbb{R}^{r}$$
with $r \ll d_{model}$ (DeepSeek-V2 uses $r = 512$ vs $d_{model} = 5120$). At inference, recompute keys and values on demand:
$$\mathbf{k}_i = W^K_{\text{up}} \mathbf{c}_i, \quad \mathbf{v}_i = W^V_{\text{up}} \mathbf{c}_i.$$
Cache size is now $L \cdot r$ instead of $2L\, d_{model}$ — a factor of $\sim 20\times$ reduction. The KV up-projections happen at attention time, not at cache-write time, but they're cheap because they operate on the same latent vector that lives in the cache. RoPE doesn't commute with the up-projection, so DeepSeek-V2 handles position with a decoupled design: a small extra rotary key per token is computed directly from the hidden state and cached alongside the latent.
MLA assumes the keys (and values) for a given sequence position can be parameterised by a low-dimensional latent $\mathbf{c}$. This is exactly the rank-$r$ approximation argument applied to the joint $(\mathbf{k}, \mathbf{v})$ stack across heads. The weight matrices stay full rank; only the per-token activation is constrained. Same Eckart-Young idea, applied to a different object.
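A toy sketch of the latent-KV idea (single head, no RoPE path, no absorption of the up-projections into the query/output weights; module names and sizes are illustrative rather than DeepSeek's actual layout):

```python
import torch
import torch.nn as nn

d_model, r, L = 5120, 512, 2048

W_kv_down = nn.Linear(d_model, r, bias=False)   # applied once per token at cache-write time
W_k_up    = nn.Linear(r, d_model, bias=False)   # applied at attention time
W_v_up    = nn.Linear(r, d_model, bias=False)

h = torch.randn(1, L, d_model)                  # hidden states for L tokens
cache = W_kv_down(h)                            # (1, L, r): the only per-token state stored

k = W_k_up(cache)                               # keys recomputed on demand from the latent
v = W_v_up(cache)                               # values likewise

print((2 * L * d_model) / (L * r))              # 20.0: the cache-size reduction factor
```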
Pick a procedurally-generated image, then truncate its SVD to rank $k$. The image is recomposed as $A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top$. As $k$ grows you see the picture sharpen: the first few components capture the broad structure, later ones add detail and texture, and the smallest add noise.
The middle bar chart shows the singular values $\sigma_1 \ge \sigma_2 \ge \cdots$. Violet bars are kept; grey bars are truncated. Notice that for the "Random low-rank" image the spectrum drops sharply to noise after a few components — rank-4 is essentially perfect. For the gradient/text images the decay is more gradual; you'd need more components to recover sharp edges.
For $A = U\Sigma V^\top$, the Moore-Penrose pseudoinverse is
$$A^+ = V \Sigma^+ U^\top,$$
where $\Sigma^+$ is obtained by transposing $\Sigma$ and replacing each non-zero $\sigma_i$ with $1/\sigma_i$. Zero singular values stay zero.
The pseudoinverse gives the minimum-norm least-squares solution $A^+\mathbf{b}$ in closed form; the numerical instability of dividing by tiny singular values is exactly the issue ridge regression addresses, replacing $1/\sigma_i$ with $\sigma_i / (\sigma_i^2 + \lambda)$.
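Both formulas drop straight out of the SVD; a NumPy sketch (a full-column-rank example, so every $\sigma_i > 0$):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((10, 4))                 # full column rank (almost surely)
b = rng.standard_normal(10)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Pseudoinverse: invert each non-zero sigma.
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A))

# Ridge: shrink 1/sigma to sigma/(sigma^2 + lambda), so tiny sigmas stop exploding.
lam = 0.1
x_ridge = Vt.T @ ((s / (s**2 + lam)) * (U.T @ b))
assert np.allclose(x_ridge, np.linalg.solve(A.T @ A + lam * np.eye(4), A.T @ b))
```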
Use eigendecomp for square symmetric matrices (covariance, Hessian). Use SVD for everything else: rectangular weight matrices, Jacobians, data matrices, attention scores. SVD is strictly more general; for a symmetric PSD matrix the two coincide. Default to SVD unless you have a specific reason not to.
Deck 08 — Orthogonality, QR & RoPE picks up where SVD's "rotations" left off: how to actually compute orthonormal bases, and why a particular family of 2×2 rotations — RoPE — has become standard for transformer position encoding.