Eigenvalues, Cholesky, SVD — the structure theorems that say "every matrix is, in disguise, something simple".
Two scalars that summarise a square matrix.
$\det A$ is the signed volume of the parallelepiped spanned by the columns of $A$. Properties:
$\mathrm{tr}\, A = \sum_i A_{ii} = \sum_i \lambda_i$ — the sum of eigenvalues, and also the sum of diagonal entries. The single most useful identity:
$$\mathrm{tr}(AB) = \mathrm{tr}(BA) \quad \text{whenever both products are defined.}$$
This cyclic property is what makes the trace blind to basis change: $\mathrm{tr}(P^{-1}AP) = \mathrm{tr}(AP P^{-1}) = \mathrm{tr}(A)$.
A diagonal matrix with entries $\lambda_i$ scales axis $i$ by $\lambda_i$, so it scales volume by $\prod\lambda_i$. The eigendecomposition (next slide) is a basis change that turns $A$ into exactly such a diagonal — and basis change preserves volume.
For square $A$, a non-zero $\mathbf{v}$ is an eigenvector with eigenvalue $\lambda$ if
$$A\mathbf{v} = \lambda \mathbf{v}.$$
$\mathbf{v}$ is a special direction in which $A$ behaves as scalar multiplication.
$A\mathbf{v} = \lambda\mathbf{v}$ iff $(A - \lambda I)\mathbf{v} = \mathbf{0}$, which has a non-zero solution iff $A - \lambda I$ is singular, i.e.
$$\det(A - \lambda I) = 0.$$
The left-hand side is the characteristic polynomial $\chi_A(\lambda)$, a degree-$n$ polynomial in $\lambda$. Its $n$ roots (counted with multiplicity, over $\mathbb{C}$) are the eigenvalues.
If $A$ has $n$ linearly independent eigenvectors $\mathbf{v}_1,\dots,\mathbf{v}_n$ with eigenvalues $\lambda_1,\dots,\lambda_n$, stack them as columns of $P$ and put the $\lambda_i$ on the diagonal of $\Lambda$. Then
$$A = P\Lambda P^{-1}.$$
This is the eigendecomposition. Not every matrix has one — defective matrices fail (eigenvectors don't span).
The matrix $\bigl(\begin{smallmatrix}0&1\\0&0\end{smallmatrix}\bigr)$ has $\lambda=0$ with algebraic multiplicity 2 but only one eigenvector (up to scale). The general fix is the Jordan form, which the book doesn't cover and which ML rarely needs — the SVD on slide 05 sidesteps the issue.
If $A$ is real and symmetric ($A^\top = A$), then much more is true:
This is the spectral theorem. Geometrically: every symmetric matrix is a stretching along orthogonal axes.
Every covariance matrix is PSD; every invertible covariance is PD. We'll meet them at every step of decks 06 and 09.
For PSD $A$, define $A^{1/2} = Q\Lambda^{1/2} Q^\top$, where $\Lambda^{1/2}$ takes the square root of each diagonal. Then $A^{1/2}$ is PSD and $(A^{1/2})^2 = A$. The Mahalanobis distance $\mathbf{x}^\top \Sigma^{-1}\mathbf{x}$ becomes $\|\Sigma^{-1/2}\mathbf{x}\|^2$ — a plain Euclidean length after whitening.
If $A$ is symmetric and positive definite there is a unique lower-triangular $L$ with positive diagonal such that
$$A = LL^\top.$$
This is the Cholesky factorisation. It's cheaper than the eigendecomposition (about half the flops of LU) and is the workhorse of Gaussian processes, kernel SVMs and any algorithm that needs to solve $A\mathbf{x}=\mathbf{b}$ with $A$ PD.
To sample $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \Sigma)$, compute the Cholesky $\Sigma = LL^\top$, draw $\mathbf{z}\sim\mathcal{N}(\mathbf{0}, I)$, and return $L\mathbf{z}$. Then $\mathbb{E}[L\mathbf{z}\mathbf{z}^\top L^\top] = LL^\top = \Sigma$ as required. Decks 06 and 11 use this trick.
You can also write $A = L'DL'^\top$ with $L'$ unit-lower-triangular and $D$ diagonal positive — numerically equivalent and avoids a square root. Most ML libraries use Cholesky proper because $L$ ends up being directly useful (e.g. for log-determinants: $\log\det A = 2\sum_i \log L_{ii}$).
Every real $m\times n$ matrix $A$ admits a factorisation
$$A = U\Sigma V^\top,$$
with $U\in\mathbb{R}^{m\times m}$ orthogonal, $V\in\mathbb{R}^{n\times n}$ orthogonal, and $\Sigma\in\mathbb{R}^{m\times n}$ "diagonal" with non-negative singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$ on its main diagonal.
The SVD always exists; the eigendecomposition often doesn't. For square symmetric $A\succeq 0$ they coincide. For general rectangular $A$ the SVD is the only structure theorem you've got.
Read $A = U\Sigma V^\top$ right to left as a sequence of three operations on a vector $\mathbf{x}\in\mathbb{R}^n$:
Express $\mathbf{x}$ in the orthonormal "right" basis $V$. No stretching.
Scale coordinate $i$ by $\sigma_i$ (then possibly embed into $\mathbb{R}^m$ if $m\neq n$).
Place the result into the "left" basis $U$ inside $\mathbb{R}^m$.
So every linear map is, in suitable orthonormal bases on each side, a non-negative diagonal scaling. This is the structural fact behind PCA (deck 10) and behind every "low-rank approximation" used in modern ML.
If $\sigma_1,\dots,\sigma_r$ are non-zero and $\sigma_{r+1}=\dots=0$:
Form $A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top = U_k\Sigma_k V_k^\top$ — the truncated SVD. Then
$$A_k = \arg\min_{\mathrm{rank}(B)\le k}\|A - B\|_F = \arg\min_{\mathrm{rank}(B)\le k}\|A - B\|_2.$$
In words: $A_k$ is the closest rank-$k$ matrix to $A$ in both the Frobenius (sum of squares) norm and the spectral (largest singular value) norm. The minimum errors are
$$\|A - A_k\|_F^2 = \sum_{i=k+1}^{r}\sigma_i^2, \qquad \|A - A_k\|_2 = \sigma_{k+1}.$$
"Compress this object as much as you can while losing as little information as possible" is the single most reused goal in machine learning. PCA, LoRA, KV-cache compression, autoencoders — they all answer the same question by taking truncated SVDs of one matrix or another. Eckart–Young tells you the answer is optimal.
Below is a 32×32 grayscale "image" (a synthetic logo). Slide $k$ from 1 to 32 to keep that many singular values; the right pane shows the truncated reconstruction, the metrics report compression ratio and error.
For most natural images the spectrum drops off fast — you can keep a small fraction of the singular values and still see most of the picture. That observation, applied to the data matrix instead of an image, is exactly PCA (deck 10).
The book consolidates everything into table 4.2. Here it is, slightly reshaped, with a "use it for" column added.
| Decomposition | Form | When it exists | Use it for |
|---|---|---|---|
| LU | $A = LU$ ($L$ lower-, $U$ upper-triangular) | $A$ square, no zero pivots | Solving $A\mathbf{x}=\mathbf{b}$ |
| QR | $A = QR$ ($Q$ orthogonal, $R$ upper-triangular) | Always for full-column-rank $A$ | Numerically stable least squares |
| Cholesky | $A = LL^\top$ | $A$ symmetric PD | Solving with PD $A$; sampling from $\mathcal{N}(\mathbf{0},\Sigma)$ |
| Eigendecomposition | $A = P\Lambda P^{-1}$ | $A$ has $n$ independent eigenvectors | Power iteration, dynamical systems |
| Spectral (symmetric eigen) | $A = Q\Lambda Q^\top$ | $A$ symmetric | PCA, whitening, principal axes |
| SVD | $A = U\Sigma V^\top$ | Always | Best low-rank approx, pseudoinverse |
| Concept | Used by | Where |
|---|---|---|
| Eigendecomposition of covariance | PCA (ch. 10) | Principal components are the eigenvectors of $\frac{1}{N}X^\top X$ |
| Spectral theorem — PSD inverse | GMM (ch. 11), linear regression (ch. 9) | Inverting covariance via eigen reciprocal |
| Cholesky | GMM, GP regression | Sampling $\mathcal{N}(\boldsymbol\mu,\Sigma)$; stable log-det |
| SVD | PCA, low-rank decompositions everywhere | PCA via SVD of $X$ skips forming $X^\top X$ |
| Eckart–Young | Compression, LoRA, KV-cache | Optimal truncated representation |
| Determinant / trace | GMM (ch. 11) | Gaussian log-density: $-\tfrac12\log\det(2\pi\Sigma)-\tfrac12(\mathbf{x}-\boldsymbol\mu)^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)$ |
| Object | Formula |
|---|---|
| $\det A$ | $\prod_i \lambda_i$ — signed volume scaling |
| $\mathrm{tr}\, A$ | $\sum_i A_{ii} = \sum_i \lambda_i$ |
| Characteristic polynomial | $\chi_A(\lambda) = \det(A-\lambda I)$ |
| Eigendecomposition | $A = P\Lambda P^{-1}$ |
| Spectral theorem | $A = Q\Lambda Q^\top$ for symmetric $A$ |
| Cholesky | $A = LL^\top$, $A$ PD |
| SVD | $A = U\Sigma V^\top$ |
| $\sigma_i^2$ | eigenvalues of $A^\top A$ |
| Pseudoinverse | $A^+ = V \Sigma^+ U^\top$ — reciprocal non-zero singular values |
| Eckart–Young error | $\|A-A_k\|_F^2 = \sum_{i>k} \sigma_i^2$ |
Deck 05 returns to differentiation — gradients, Jacobians and the chain rule on a DAG (i.e. backpropagation). With Part I almost complete we're ready to start training models.