Mathematics for Machine Learning — Deck 04

Matrix Decompositions

Eigenvalues, Cholesky, SVD — the structure theorems that say "every matrix is, in disguise, something simple".

determinanteigenvalues spectralCholesky SVDEckart–Young
$\det, \mathrm{tr}$ eigen spectral thm SVD low rank
00

Topics We'll Cover

01

Determinant and Trace

Two scalars that summarise a square matrix.

Determinant

$\det A$ is the signed volume of the parallelepiped spanned by the columns of $A$. Properties:

Trace

$\mathrm{tr}\, A = \sum_i A_{ii} = \sum_i \lambda_i$ — the sum of eigenvalues, and also the sum of diagonal entries. The single most useful identity:

$$\mathrm{tr}(AB) = \mathrm{tr}(BA) \quad \text{whenever both products are defined.}$$

This cyclic property is what makes the trace blind to basis change: $\mathrm{tr}(P^{-1}AP) = \mathrm{tr}(AP P^{-1}) = \mathrm{tr}(A)$.

A geometric reading of $\det = \prod\lambda_i$

A diagonal matrix with entries $\lambda_i$ scales axis $i$ by $\lambda_i$, so it scales volume by $\prod\lambda_i$. The eigendecomposition (next slide) is a basis change that turns $A$ into exactly such a diagonal — and basis change preserves volume.

02

Eigenvalues and Eigenvectors

For square $A$, a non-zero $\mathbf{v}$ is an eigenvector with eigenvalue $\lambda$ if

$$A\mathbf{v} = \lambda \mathbf{v}.$$

$\mathbf{v}$ is a special direction in which $A$ behaves as scalar multiplication.

Finding them

$A\mathbf{v} = \lambda\mathbf{v}$ iff $(A - \lambda I)\mathbf{v} = \mathbf{0}$, which has a non-zero solution iff $A - \lambda I$ is singular, i.e.

$$\det(A - \lambda I) = 0.$$

The left-hand side is the characteristic polynomial $\chi_A(\lambda)$, a degree-$n$ polynomial in $\lambda$. Its $n$ roots (counted with multiplicity, over $\mathbb{C}$) are the eigenvalues.

Diagonalisation

If $A$ has $n$ linearly independent eigenvectors $\mathbf{v}_1,\dots,\mathbf{v}_n$ with eigenvalues $\lambda_1,\dots,\lambda_n$, stack them as columns of $P$ and put the $\lambda_i$ on the diagonal of $\Lambda$. Then

$$A = P\Lambda P^{-1}.$$

This is the eigendecomposition. Not every matrix has one — defective matrices fail (eigenvectors don't span).

When does it fail?

The matrix $\bigl(\begin{smallmatrix}0&1\\0&0\end{smallmatrix}\bigr)$ has $\lambda=0$ with algebraic multiplicity 2 but only one eigenvector (up to scale). The general fix is the Jordan form, which the book doesn't cover and which ML rarely needs — the SVD on slide 05 sidesteps the issue.

03

The Spectral Theorem (Symmetric Matrices)

If $A$ is real and symmetric ($A^\top = A$), then much more is true:

  1. All eigenvalues are real.
  2. There exists an orthonormal basis of eigenvectors of $A$.
  3. Hence $A = Q\Lambda Q^\top$ with $Q$ orthogonal ($Q^\top Q = I$) and $\Lambda$ diagonal.

This is the spectral theorem. Geometrically: every symmetric matrix is a stretching along orthogonal axes.

Positive (semi-) definite matrices

Every covariance matrix is PSD; every invertible covariance is PD. We'll meet them at every step of decks 06 and 09.

The square root

For PSD $A$, define $A^{1/2} = Q\Lambda^{1/2} Q^\top$, where $\Lambda^{1/2}$ takes the square root of each diagonal. Then $A^{1/2}$ is PSD and $(A^{1/2})^2 = A$. The Mahalanobis distance $\mathbf{x}^\top \Sigma^{-1}\mathbf{x}$ becomes $\|\Sigma^{-1/2}\mathbf{x}\|^2$ — a plain Euclidean length after whitening.

04

Cholesky Factorisation

If $A$ is symmetric and positive definite there is a unique lower-triangular $L$ with positive diagonal such that

$$A = LL^\top.$$

This is the Cholesky factorisation. It's cheaper than the eigendecomposition (about half the flops of LU) and is the workhorse of Gaussian processes, kernel SVMs and any algorithm that needs to solve $A\mathbf{x}=\mathbf{b}$ with $A$ PD.

Sampling from a Gaussian

To sample $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \Sigma)$, compute the Cholesky $\Sigma = LL^\top$, draw $\mathbf{z}\sim\mathcal{N}(\mathbf{0}, I)$, and return $L\mathbf{z}$. Then $\mathbb{E}[L\mathbf{z}\mathbf{z}^\top L^\top] = LL^\top = \Sigma$ as required. Decks 06 and 11 use this trick.

Why "Cholesky" and not "$LDL^\top$"?

You can also write $A = L'DL'^\top$ with $L'$ unit-lower-triangular and $D$ diagonal positive — numerically equivalent and avoids a square root. Most ML libraries use Cholesky proper because $L$ ends up being directly useful (e.g. for log-determinants: $\log\det A = 2\sum_i \log L_{ii}$).

05

The Singular Value Decomposition

Every real $m\times n$ matrix $A$ admits a factorisation

$$A = U\Sigma V^\top,$$

with $U\in\mathbb{R}^{m\times m}$ orthogonal, $V\in\mathbb{R}^{n\times n}$ orthogonal, and $\Sigma\in\mathbb{R}^{m\times n}$ "diagonal" with non-negative singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$ on its main diagonal.

Where the pieces come from

Two important variants

SVD vs eigendecomposition

The SVD always exists; the eigendecomposition often doesn't. For square symmetric $A\succeq 0$ they coincide. For general rectangular $A$ the SVD is the only structure theorem you've got.

06

Geometric Picture of the SVD

Read $A = U\Sigma V^\top$ right to left as a sequence of three operations on a vector $\mathbf{x}\in\mathbb{R}^n$:

$V^\top$ — rotate

Express $\mathbf{x}$ in the orthonormal "right" basis $V$. No stretching.

$\Sigma$ — stretch axes

Scale coordinate $i$ by $\sigma_i$ (then possibly embed into $\mathbb{R}^m$ if $m\neq n$).

$U$ — rotate

Place the result into the "left" basis $U$ inside $\mathbb{R}^m$.

So every linear map is, in suitable orthonormal bases on each side, a non-negative diagonal scaling. This is the structural fact behind PCA (deck 10) and behind every "low-rank approximation" used in modern ML.

Reading off the four subspaces

If $\sigma_1,\dots,\sigma_r$ are non-zero and $\sigma_{r+1}=\dots=0$:

07

Eckart–Young — the Best Rank-$k$ Approximation

Form $A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top = U_k\Sigma_k V_k^\top$ — the truncated SVD. Then

$$A_k = \arg\min_{\mathrm{rank}(B)\le k}\|A - B\|_F = \arg\min_{\mathrm{rank}(B)\le k}\|A - B\|_2.$$

In words: $A_k$ is the closest rank-$k$ matrix to $A$ in both the Frobenius (sum of squares) norm and the spectral (largest singular value) norm. The minimum errors are

$$\|A - A_k\|_F^2 = \sum_{i=k+1}^{r}\sigma_i^2, \qquad \|A - A_k\|_2 = \sigma_{k+1}.$$

Why this is one of the deepest theorems in ML

"Compress this object as much as you can while losing as little information as possible" is the single most reused goal in machine learning. PCA, LoRA, KV-cache compression, autoencoders — they all answer the same question by taking truncated SVDs of one matrix or another. Eckart–Young tells you the answer is optimal.

08

Interactive: SVD Compressor

Below is a 32×32 grayscale "image" (a synthetic logo). Slide $k$ from 1 to 32 to keep that many singular values; the right pane shows the truncated reconstruction, the metrics report compression ratio and error.

original (rank 32)
rank 8 approximation
singular value spectrum
8
compression
‖A−Ak‖_F / ‖A‖_F
energy kept

For most natural images the spectrum drops off fast — you can keep a small fraction of the singular values and still see most of the picture. That observation, applied to the data matrix instead of an image, is exactly PCA (deck 10).

09

The Decomposition Taxonomy

The book consolidates everything into table 4.2. Here it is, slightly reshaped, with a "use it for" column added.

DecompositionFormWhen it existsUse it for
LU$A = LU$ ($L$ lower-, $U$ upper-triangular)$A$ square, no zero pivotsSolving $A\mathbf{x}=\mathbf{b}$
QR$A = QR$ ($Q$ orthogonal, $R$ upper-triangular)Always for full-column-rank $A$Numerically stable least squares
Cholesky$A = LL^\top$$A$ symmetric PDSolving with PD $A$; sampling from $\mathcal{N}(\mathbf{0},\Sigma)$
Eigendecomposition$A = P\Lambda P^{-1}$$A$ has $n$ independent eigenvectorsPower iteration, dynamical systems
Spectral (symmetric eigen)$A = Q\Lambda Q^\top$$A$ symmetricPCA, whitening, principal axes
SVD$A = U\Sigma V^\top$AlwaysBest low-rank approx, pseudoinverse
10

Where This Lands in Part II

ConceptUsed byWhere
Eigendecomposition of covariancePCA (ch. 10)Principal components are the eigenvectors of $\frac{1}{N}X^\top X$
Spectral theorem — PSD inverseGMM (ch. 11), linear regression (ch. 9)Inverting covariance via eigen reciprocal
CholeskyGMM, GP regressionSampling $\mathcal{N}(\boldsymbol\mu,\Sigma)$; stable log-det
SVDPCA, low-rank decompositions everywherePCA via SVD of $X$ skips forming $X^\top X$
Eckart–YoungCompression, LoRA, KV-cacheOptimal truncated representation
Determinant / traceGMM (ch. 11)Gaussian log-density: $-\tfrac12\log\det(2\pi\Sigma)-\tfrac12(\mathbf{x}-\boldsymbol\mu)^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)$
11

Cheat Sheet

ObjectFormula
$\det A$$\prod_i \lambda_i$ — signed volume scaling
$\mathrm{tr}\, A$$\sum_i A_{ii} = \sum_i \lambda_i$
Characteristic polynomial$\chi_A(\lambda) = \det(A-\lambda I)$
Eigendecomposition$A = P\Lambda P^{-1}$
Spectral theorem$A = Q\Lambda Q^\top$ for symmetric $A$
Cholesky$A = LL^\top$, $A$ PD
SVD$A = U\Sigma V^\top$
$\sigma_i^2$eigenvalues of $A^\top A$
Pseudoinverse$A^+ = V \Sigma^+ U^\top$ — reciprocal non-zero singular values
Eckart–Young error$\|A-A_k\|_F^2 = \sum_{i>k} \sigma_i^2$
Up next

Deck 05 returns to differentiation — gradients, Jacobians and the chain rule on a DAG (i.e. backpropagation). With Part I almost complete we're ready to start training models.