MML 04 — Matrix Decompositions

00

Topics We'll Cover

Determinant and trace
Eigenvalues and eigenvectors
The spectral theorem (symmetric matrices)
Cholesky factorisation
The singular value decomposition
Geometric picture of the SVD
Eckart–Young — the best rank-$k$ approximation
Interactive: SVD compressor
The decomposition taxonomy
Where this lands in Part II
Cheat sheet

01

Determinant and Trace

Two scalars that summarise a square matrix.

Determinant

$\det A$ is the signed volume of the parallelepiped spanned by the columns of $A$. Properties:

$\det(AB) = \det A \cdot \det B$ — volumes compose.
$\det A^\top = \det A$.
$\det A = 0$ iff $A$ has linearly dependent columns iff $A$ is non-invertible.
$\det A = \prod \lambda_i$ — the product of (complex) eigenvalues.

Trace

$\mathrm{tr}\, A = \sum_i A_{ii} = \sum_i \lambda_i$ — the sum of eigenvalues, and also the sum of diagonal entries. The single most useful identity:

$$\mathrm{tr}(AB) = \mathrm{tr}(BA) \quad \text{whenever both products are defined.}$$

This cyclic property is what makes the trace blind to basis change: $\mathrm{tr}(P^{-1}AP) = \mathrm{tr}(AP P^{-1}) = \mathrm{tr}(A)$.

A geometric reading of $\det = \prod\lambda_i$

A diagonal matrix with entries $\lambda_i$ scales axis $i$ by $\lambda_i$, so it scales volume by $\prod\lambda_i$. The eigendecomposition (next slide) is a basis change that turns $A$ into exactly such a diagonal — and basis change preserves volume.

02

Eigenvalues and Eigenvectors

For square $A$, a non-zero $\mathbf{v}$ is an eigenvector with eigenvalue $\lambda$ if

$$A\mathbf{v} = \lambda \mathbf{v}.$$

$\mathbf{v}$ is a special direction in which $A$ behaves as scalar multiplication.

Finding them

$A\mathbf{v} = \lambda\mathbf{v}$ iff $(A - \lambda I)\mathbf{v} = \mathbf{0}$, which has a non-zero solution iff $A - \lambda I$ is singular, i.e.

$$\det(A - \lambda I) = 0.$$

The left-hand side is the characteristic polynomial $\chi_A(\lambda)$, a degree-$n$ polynomial in $\lambda$. Its $n$ roots (counted with multiplicity, over $\mathbb{C}$) are the eigenvalues.

Diagonalisation

If $A$ has $n$ linearly independent eigenvectors $\mathbf{v}_1,\dots,\mathbf{v}_n$ with eigenvalues $\lambda_1,\dots,\lambda_n$, stack them as columns of $P$ and put the $\lambda_i$ on the diagonal of $\Lambda$. Then

$$A = P\Lambda P^{-1}.$$

This is the eigendecomposition. Not every matrix has one — defective matrices fail (eigenvectors don't span).

When does it fail?

The matrix $\bigl(\begin{smallmatrix}0&1\\0&0\end{smallmatrix}\bigr)$ has $\lambda=0$ with algebraic multiplicity 2 but only one eigenvector (up to scale). The general fix is the Jordan form, which the book doesn't cover and which ML rarely needs — the SVD on slide 05 sidesteps the issue.

03

The Spectral Theorem (Symmetric Matrices)

If $A$ is real and symmetric ($A^\top = A$), then much more is true:

All eigenvalues are real.
There exists an orthonormal basis of eigenvectors of $A$.
Hence $A = Q\Lambda Q^\top$ with $Q$ orthogonal ($Q^\top Q = I$) and $\Lambda$ diagonal.

This is the spectral theorem. Geometrically: every symmetric matrix is a stretching along orthogonal axes.

Positive (semi-) definite matrices

PSD ($A \succeq 0$): $A=A^\top$ and $\mathbf{x}^\top A\mathbf{x}\ge 0$ for all $\mathbf{x}$. Equivalently: all eigenvalues $\ge 0$.
PD ($A \succ 0$): strictly positive. All eigenvalues $> 0$, $A$ invertible.

Every covariance matrix is PSD; every invertible covariance is PD. We'll meet them at every step of decks 06 and 09.

The square root

For PSD $A$, define $A^{1/2} = Q\Lambda^{1/2} Q^\top$, where $\Lambda^{1/2}$ takes the square root of each diagonal. Then $A^{1/2}$ is PSD and $(A^{1/2})^2 = A$. The Mahalanobis distance $\mathbf{x}^\top \Sigma^{-1}\mathbf{x}$ becomes $\|\Sigma^{-1/2}\mathbf{x}\|^2$ — a plain Euclidean length after whitening.

04

Cholesky Factorisation

If $A$ is symmetric and positive definite there is a unique lower-triangular $L$ with positive diagonal such that

$$A = LL^\top.$$

This is the Cholesky factorisation. It's cheaper than the eigendecomposition (about half the flops of LU) and is the workhorse of Gaussian processes, kernel SVMs and any algorithm that needs to solve $A\mathbf{x}=\mathbf{b}$ with $A$ PD.

Sampling from a Gaussian

To sample $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \Sigma)$, compute the Cholesky $\Sigma = LL^\top$, draw $\mathbf{z}\sim\mathcal{N}(\mathbf{0}, I)$, and return $L\mathbf{z}$. Then $\mathbb{E}[L\mathbf{z}\mathbf{z}^\top L^\top] = LL^\top = \Sigma$ as required. Decks 06 and 11 use this trick.

Why "Cholesky" and not "$LDL^\top$"?

You can also write $A = L'DL'^\top$ with $L'$ unit-lower-triangular and $D$ diagonal positive — numerically equivalent and avoids a square root. Most ML libraries use Cholesky proper because $L$ ends up being directly useful (e.g. for log-determinants: $\log\det A = 2\sum_i \log L_{ii}$).

05

The Singular Value Decomposition

Every real $m\times n$ matrix $A$ admits a factorisation

$$A = U\Sigma V^\top,$$

with $U\in\mathbb{R}^{m\times m}$ orthogonal, $V\in\mathbb{R}^{n\times n}$ orthogonal, and $\Sigma\in\mathbb{R}^{m\times n}$ "diagonal" with non-negative singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$ on its main diagonal.

Where the pieces come from

$V$'s columns are an orthonormal eigenvector basis of $A^\top A$;
The non-zero $\sigma_i = \sqrt{\lambda_i}$ are the square roots of the non-zero eigenvalues of $A^\top A$;
$U$'s columns are an orthonormal eigenvector basis of $AA^\top$.

Two important variants

Thin / reduced SVD: $A = U_r \Sigma_r V_r^\top$ with only $r = \min(m,n)$ columns. Same factorisation, less storage.
Truncated SVD: keep only the top $k$ singular values, $A_k = U_k\Sigma_k V_k^\top$. This is the rank-$k$ approximation of slide 07.

SVD vs eigendecomposition

The SVD always exists; the eigendecomposition often doesn't. For square symmetric $A\succeq 0$ they coincide. For general rectangular $A$ the SVD is the only structure theorem you've got.

06

Geometric Picture of the SVD

Read $A = U\Sigma V^\top$ right to left as a sequence of three operations on a vector $\mathbf{x}\in\mathbb{R}^n$:

$V^\top$ — rotate

Express $\mathbf{x}$ in the orthonormal "right" basis $V$. No stretching.

$\Sigma$ — stretch axes

Scale coordinate $i$ by $\sigma_i$ (then possibly embed into $\mathbb{R}^m$ if $m\neq n$).

$U$ — rotate

Place the result into the "left" basis $U$ inside $\mathbb{R}^m$.

So every linear map is, in suitable orthonormal bases on each side, a non-negative diagonal scaling. This is the structural fact behind PCA (deck 10) and behind every "low-rank approximation" used in modern ML.

Reading off the four subspaces

If $\sigma_1,\dots,\sigma_r$ are non-zero and $\sigma_{r+1}=\dots=0$:

$\mathrm{Col}(A) = \mathrm{span}(\mathbf{u}_1,\dots,\mathbf{u}_r)$;
$\mathrm{Nul}(A) = \mathrm{span}(\mathbf{v}_{r+1},\dots,\mathbf{v}_n)$;
$\mathrm{Row}(A) = \mathrm{span}(\mathbf{v}_1,\dots,\mathbf{v}_r)$;
$\mathrm{Nul}(A^\top) = \mathrm{span}(\mathbf{u}_{r+1},\dots,\mathbf{u}_m)$.

07

Eckart–Young — the Best Rank-$k$ Approximation

Form $A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top = U_k\Sigma_k V_k^\top$ — the truncated SVD. Then

$$A_k = \arg\min_{\mathrm{rank}(B)\le k}\|A - B\|_F = \arg\min_{\mathrm{rank}(B)\le k}\|A - B\|_2.$$

In words: $A_k$ is the closest rank-$k$ matrix to $A$ in both the Frobenius (sum of squares) norm and the spectral (largest singular value) norm. The minimum errors are

$$\|A - A_k\|_F^2 = \sum_{i=k+1}^{r}\sigma_i^2, \qquad \|A - A_k\|_2 = \sigma_{k+1}.$$

Why this is one of the deepest theorems in ML

"Compress this object as much as you can while losing as little information as possible" is the single most reused goal in machine learning. PCA, LoRA, KV-cache compression, autoencoders — they all answer the same question by taking truncated SVDs of one matrix or another. Eckart–Young tells you the answer is optimal.

08

Interactive: SVD Compressor

Below is a 32×32 grayscale "image" (a synthetic logo). Slide $k$ from 1 to 32 to keep that many singular values; the right pane shows the truncated reconstruction, the metrics report compression ratio and error.

original (rank 32)

rank 8 approximation

singular value spectrum

rank $k$

8

compression

—

‖A−Ak‖_F / ‖A‖_F

—

energy kept

—

For most natural images the spectrum drops off fast — you can keep a small fraction of the singular values and still see most of the picture. That observation, applied to the data matrix instead of an image, is exactly PCA (deck 10).

09

The Decomposition Taxonomy

The book consolidates everything into table 4.2. Here it is, slightly reshaped, with a "use it for" column added.

Decomposition	Form	When it exists	Use it for
LU	$A = LU$ ($L$ lower-, $U$ upper-triangular)	$A$ square, no zero pivots	Solving $A\mathbf{x}=\mathbf{b}$
QR	$A = QR$ ($Q$ orthogonal, $R$ upper-triangular)	Always for full-column-rank $A$	Numerically stable least squares
Cholesky	$A = LL^\top$	$A$ symmetric PD	Solving with PD $A$; sampling from $\mathcal{N}(\mathbf{0},\Sigma)$
Eigendecomposition	$A = P\Lambda P^{-1}$	$A$ has $n$ independent eigenvectors	Power iteration, dynamical systems
Spectral (symmetric eigen)	$A = Q\Lambda Q^\top$	$A$ symmetric	PCA, whitening, principal axes
SVD	$A = U\Sigma V^\top$	Always	Best low-rank approx, pseudoinverse

10

Where This Lands in Part II

Concept	Used by	Where
Eigendecomposition of covariance	PCA (ch. 10)	Principal components are the eigenvectors of $\frac{1}{N}X^\top X$
Spectral theorem — PSD inverse	GMM (ch. 11), linear regression (ch. 9)	Inverting covariance via eigen reciprocal
Cholesky	GMM, GP regression	Sampling $\mathcal{N}(\boldsymbol\mu,\Sigma)$; stable log-det
SVD	PCA, low-rank decompositions everywhere	PCA via SVD of $X$ skips forming $X^\top X$
Eckart–Young	Compression, LoRA, KV-cache	Optimal truncated representation
Determinant / trace	GMM (ch. 11)	Gaussian log-density: $-\tfrac12\log\det(2\pi\Sigma)-\tfrac12(\mathbf{x}-\boldsymbol\mu)^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)$

11

Cheat Sheet

Object	Formula
$\det A$	$\prod_i \lambda_i$ — signed volume scaling
$\mathrm{tr}\, A$	$\sum_i A_{ii} = \sum_i \lambda_i$
Characteristic polynomial	$\chi_A(\lambda) = \det(A-\lambda I)$
Eigendecomposition	$A = P\Lambda P^{-1}$
Spectral theorem	$A = Q\Lambda Q^\top$ for symmetric $A$
Cholesky	$A = LL^\top$, $A$ PD
SVD	$A = U\Sigma V^\top$
$\sigma_i^2$	eigenvalues of $A^\top A$
Pseudoinverse	$A^+ = V \Sigma^+ U^\top$ — reciprocal non-zero singular values
Eckart–Young error	$\\|A-A_k\\|_F^2 = \sum_{i>k} \sigma_i^2$

Up next

Deck 05 returns to differentiation — gradients, Jacobians and the chain rule on a DAG (i.e. backpropagation). With Part I almost complete we're ready to start training models.