MML 10 — Dimensionality Reduction with PCA

00

Topics We'll Cover

Setting up the problem
View 1 — maximise the projected variance
View 2 — minimise the reconstruction error
The two views give the same answer
PCA via the SVD
Interactive: 2D → 1D PCA
High-dimensional PCA, the dual trick
Probabilistic PCA
When PCA fails
Cheat sheet

01

Setting Up the Problem

We have data $X = [\mathbf{x}_1, \dots, \mathbf{x}_N]^\top \in\mathbb{R}^{N\times D}$. We want a projection to $M< D$ dimensions:

$$\mathbf{z}_n = B^\top(\mathbf{x}_n - \boldsymbol\mu),\qquad \tilde{\mathbf{x}}_n = \boldsymbol\mu + B\mathbf{z}_n,$$

where $B\in\mathbb{R}^{D\times M}$ has orthonormal columns and $\boldsymbol\mu$ is the (sample) mean. Three quantities are at play:

codes $\mathbf{z}_n\in\mathbb{R}^M$ — the reduced representation;
reconstruction $\tilde{\mathbf{x}}_n\in\mathbb{R}^D$ — what you'd recover after lossy compression;
residual $\mathbf{x}_n - \tilde{\mathbf{x}}_n$ — what was lost.

From deck 03 we know that for any fixed $B$ the choice $\mathbf{z}_n = B^\top(\mathbf{x}_n - \boldsymbol\mu)$ produces $\tilde{\mathbf{x}}_n$ as the orthogonal projection of $\mathbf{x}_n$ onto $\boldsymbol\mu + \mathrm{Col}(B)$. So the only design choice is $B$.

02

View 1 — Maximise the Projected Variance

Centre the data: set $\boldsymbol\mu = \tfrac1N\sum_n\mathbf{x}_n$, replace $\mathbf{x}_n \to \mathbf{x}_n - \boldsymbol\mu$. The sample covariance is

$$S = \frac{1}{N}\sum_n \mathbf{x}_n\mathbf{x}_n^\top.$$

For a unit-norm direction $\mathbf{b}$, the variance of $\mathbf{b}^\top\mathbf{x}$ is $\mathbf{b}^\top S\mathbf{b}$. We want to find

$$\mathbf{b}_1 = \arg\max_{\|\mathbf{b}\|=1}\;\mathbf{b}^\top S\mathbf{b}.$$

Form the Lagrangian (deck 07):

$$\mathcal L(\mathbf{b}, \lambda) = \mathbf{b}^\top S\mathbf{b} - \lambda(\mathbf{b}^\top\mathbf{b}-1), \qquad \nabla_{\mathbf{b}}\mathcal L = 2S\mathbf{b} - 2\lambda\mathbf{b} = 0.$$

So $S\mathbf{b}_1 = \lambda_1\mathbf{b}_1$: $\mathbf{b}_1$ is an eigenvector of $S$ with eigenvalue equal to the variance it captures. To maximise, pick the eigenvector with the largest eigenvalue. The next direction $\mathbf{b}_2$ is the largest eigenvector among those orthogonal to $\mathbf{b}_1$, etc.

03

View 2 — Minimise the Reconstruction Error

Same setup, but now minimise

$$J(B) = \frac{1}{N}\sum_n \|\mathbf{x}_n - BB^\top\mathbf{x}_n\|^2 = \frac{1}{N}\sum_n \mathbf{x}_n^\top(I - BB^\top)\mathbf{x}_n.$$

Use the trace identity $\mathbf{a}^\top A\mathbf{a} = \mathrm{tr}(\mathbf{a}\mathbf{a}^\top A)$ and cyclic trace:

$$J(B) = \mathrm{tr}\bigl((I - BB^\top)S\bigr) = \mathrm{tr}(S) - \mathrm{tr}(B^\top S B).$$

So minimising reconstruction error is the same as maximising $\mathrm{tr}(B^\top SB)$ — which, subject to $B^\top B = I$, is solved by taking the top $M$ eigenvectors of $S$ as the columns of $B$.

Why two views matter

The variance view says: "find directions of greatest spread". The reconstruction view says: "find the subspace closest to the data". They give the same answer in the Gaussian / squared-loss setting — but the two questions split when the loss changes (robust PCA, sparse PCA, autoencoders).

04

The Two Views Give the Same Answer

Combining: the optimal $B$ is the matrix whose columns are the top $M$ orthonormal eigenvectors of $S$.

$$S = U\Lambda U^\top, \quad \Lambda = \mathrm{diag}(\lambda_1\ge\lambda_2\ge\dots).$$

$$B_M = U[:, 1{:}M], \qquad \text{reconstruction error} = \sum_{i=M+1}^D \lambda_i.$$

Total variance is $\sum_i\lambda_i = \mathrm{tr}(S)$; the variance kept by $B_M$ is $\sum_{i=1}^M \lambda_i$, the variance lost is $\sum_{i=M+1}^D \lambda_i$ — which is exactly the reconstruction error.

Whitening

Once you have $B$ and $\Lambda$, the whitened data is $\mathbf{z}_n = \Lambda^{-1/2}B^\top\mathbf{x}_n$. It has identity sample covariance and is the input you want for any subsequent geometric algorithm.

05

PCA via the SVD

Stack the centred data as rows: $\tilde X\in\mathbb{R}^{N\times D}$. Then $S = \tilde X^\top\tilde X / N$.

Compute the SVD $\tilde X = U\Sigma V^\top$. Then:

$\tilde X^\top\tilde X = V\Sigma^2 V^\top$, so $V$ holds the eigenvectors of $NS$;
The singular values relate to eigenvalues by $\sigma_i^2 = N\lambda_i$;
The first $M$ principal components of $\tilde X$ are $\tilde X V[:, 1{:}M] = U[:,1{:}M]\Sigma[1{:}M, 1{:}M]$.

Computing the SVD of $\tilde X$ directly is more numerically stable than forming $\tilde X^\top\tilde X$ and eigendecomposing. Production libraries (NumPy, sklearn) implement PCA via the SVD.

Eckart–Young again

The rank-$M$ matrix $U[:,1{:}M]\Sigma[1{:}M,1{:}M]V[:,1{:}M]^\top$ is the closest rank-$M$ approximation of $\tilde X$ (deck 04). PCA is therefore Eckart–Young applied to the centred data matrix.

06

Interactive: 2D → 1D PCA

Click on the canvas to add points; the principal-axis line and the second axis update live, with the projections drawn as ticks on the major axis. Numerical readouts give the variance kept on the major axis vs the residual.

N

0

λ₁

—

λ₂

—

variance kept

—

For an isotropic blob, $\lambda_1 \approx \lambda_2$ and PCA is essentially uninformative — every direction is equally good. The more anisotropic the cloud, the better the rank-1 approximation.

07

High-Dimensional PCA: The Dual Trick

When $D \gg N$ (think images: $D = 10^6$ pixels, $N = 1000$ examples), $S$ is a $D\times D$ matrix — intractable to form, let alone eigendecompose.

But the eigenvalues of $\tilde X^\top\tilde X$ and $\tilde X\tilde X^\top$ are the same (up to zeros). The matrix $\tilde X\tilde X^\top$ is $N\times N$, often tractable:

$$\tilde X\tilde X^\top \mathbf{u}_i = \mu_i \mathbf{u}_i, \qquad \mathbf{v}_i = \tilde X^\top\mathbf{u}_i / \|\tilde X^\top\mathbf{u}_i\|.$$

So you can recover the principal directions $\mathbf{v}_i$ in $\mathbb{R}^D$ by eigendecomposing the $N\times N$ Gram matrix and then projecting back through $\tilde X^\top$. This is the foundation of kernel PCA, where $\tilde X\tilde X^\top$ is replaced by the kernel matrix.

08

Probabilistic PCA

Treat PCA as a generative model:

$$\mathbf{z}_n \sim \mathcal N(\mathbf 0, I_M), \qquad \mathbf{x}_n \mid \mathbf{z}_n \sim \mathcal N(W\mathbf{z}_n + \boldsymbol\mu, \sigma^2 I_D).$$

Latent factors $\mathbf{z}_n$ are sampled, then a linear map plus isotropic noise produces $\mathbf{x}_n$. Marginalising out $\mathbf{z}_n$ (deck 06):

$$\mathbf{x}_n \sim \mathcal N(\boldsymbol\mu, WW^\top + \sigma^2 I).$$

The maximum-likelihood solution recovers exactly classical PCA (with a corrective $\sigma^2 I$ in the covariance). As $\sigma^2 \to 0$, probabilistic PCA → classical PCA.

Why bother with the generative version?

It lets you sample new data, handle missing entries by EM (deck 11 generalises), evaluate a likelihood, and naturally extend to mixtures of PCAs — each cluster gets its own low-dim subspace.

09

When PCA Fails

Non-Gaussian structure

Variance is a second-moment summary. For heavy-tailed or skewed distributions it can be a poor proxy for "interesting structure". ICA, robust PCA fix this.

Non-linear manifolds

A 2D swiss roll embedded in $\mathbb{R}^3$ has variance in every direction. PCA flattens it; you want a manifold method (t-SNE, UMAP, kernel PCA, autoencoders).

Scale-sensitive data

Features measured in different units dominate based on their numerical scale, not their information content. Always normalise (z-score) before PCA on tabular data.

PCA is still — by a long margin — the most-used dimensionality reduction tool in ML. It is well-understood, fast, and gives an explicit basis to interpret. Use it as a sanity check before trying anything fancier.

10

Cheat Sheet

Object	Formula
Sample covariance	$S = \tfrac1N \sum_n (\mathbf x_n-\boldsymbol\mu)(\mathbf x_n-\boldsymbol\mu)^\top$
PCA directions	$\mathbf b_i$ = top eigenvectors of $S$
Variance captured	$\sum_{i=1}^M \lambda_i / \sum_{i=1}^D \lambda_i$
Reconstruction error	$\sum_{i=M+1}^D \lambda_i$
Code	$\mathbf z_n = B^\top(\mathbf x_n - \boldsymbol\mu)$
Whitening	$\mathbf w_n = \Lambda^{-1/2}B^\top(\mathbf x_n - \boldsymbol\mu)$
PCA via SVD	$\tilde X = U\Sigma V^\top$, principal axes $=$ columns of $V$
Dual / kernel	$\tilde X\tilde X^\top$ — size $N\times N$ instead of $D\times D$
Probabilistic PCA	$\mathbf x \sim \mathcal N(\boldsymbol\mu, WW^\top + \sigma^2 I)$