Two objectives, one matrix decomposition — PCA is the projection that loses the least and the direction of greatest variance, simultaneously.
We have data $X = [\mathbf{x}_1, \dots, \mathbf{x}_N]^\top \in\mathbb{R}^{N\times D}$. We want a projection to $M< D$ dimensions:
$$\mathbf{z}_n = B^\top(\mathbf{x}_n - \boldsymbol\mu),\qquad \tilde{\mathbf{x}}_n = \boldsymbol\mu + B\mathbf{z}_n,$$
where $B\in\mathbb{R}^{D\times M}$ has orthonormal columns and $\boldsymbol\mu$ is the (sample) mean. Three quantities are at play:
From deck 03 we know that for any fixed $B$ the choice $\mathbf{z}_n = B^\top(\mathbf{x}_n - \boldsymbol\mu)$ produces $\tilde{\mathbf{x}}_n$ as the orthogonal projection of $\mathbf{x}_n$ onto $\boldsymbol\mu + \mathrm{Col}(B)$. So the only design choice is $B$.
Centre the data: set $\boldsymbol\mu = \tfrac1N\sum_n\mathbf{x}_n$, replace $\mathbf{x}_n \to \mathbf{x}_n - \boldsymbol\mu$. The sample covariance is
$$S = \frac{1}{N}\sum_n \mathbf{x}_n\mathbf{x}_n^\top.$$
For a unit-norm direction $\mathbf{b}$, the variance of $\mathbf{b}^\top\mathbf{x}$ is $\mathbf{b}^\top S\mathbf{b}$. We want to find
$$\mathbf{b}_1 = \arg\max_{\|\mathbf{b}\|=1}\;\mathbf{b}^\top S\mathbf{b}.$$
Form the Lagrangian (deck 07):
$$\mathcal L(\mathbf{b}, \lambda) = \mathbf{b}^\top S\mathbf{b} - \lambda(\mathbf{b}^\top\mathbf{b}-1), \qquad \nabla_{\mathbf{b}}\mathcal L = 2S\mathbf{b} - 2\lambda\mathbf{b} = 0.$$
So $S\mathbf{b}_1 = \lambda_1\mathbf{b}_1$: $\mathbf{b}_1$ is an eigenvector of $S$ with eigenvalue equal to the variance it captures. To maximise, pick the eigenvector with the largest eigenvalue. The next direction $\mathbf{b}_2$ is the largest eigenvector among those orthogonal to $\mathbf{b}_1$, etc.
Same setup, but now minimise
$$J(B) = \frac{1}{N}\sum_n \|\mathbf{x}_n - BB^\top\mathbf{x}_n\|^2 = \frac{1}{N}\sum_n \mathbf{x}_n^\top(I - BB^\top)\mathbf{x}_n.$$
Use the trace identity $\mathbf{a}^\top A\mathbf{a} = \mathrm{tr}(\mathbf{a}\mathbf{a}^\top A)$ and cyclic trace:
$$J(B) = \mathrm{tr}\bigl((I - BB^\top)S\bigr) = \mathrm{tr}(S) - \mathrm{tr}(B^\top S B).$$
So minimising reconstruction error is the same as maximising $\mathrm{tr}(B^\top SB)$ — which, subject to $B^\top B = I$, is solved by taking the top $M$ eigenvectors of $S$ as the columns of $B$.
The variance view says: "find directions of greatest spread". The reconstruction view says: "find the subspace closest to the data". They give the same answer in the Gaussian / squared-loss setting — but the two questions split when the loss changes (robust PCA, sparse PCA, autoencoders).
Combining: the optimal $B$ is the matrix whose columns are the top $M$ orthonormal eigenvectors of $S$.
$$S = U\Lambda U^\top, \quad \Lambda = \mathrm{diag}(\lambda_1\ge\lambda_2\ge\dots).$$
$$B_M = U[:, 1{:}M], \qquad \text{reconstruction error} = \sum_{i=M+1}^D \lambda_i.$$
Total variance is $\sum_i\lambda_i = \mathrm{tr}(S)$; the variance kept by $B_M$ is $\sum_{i=1}^M \lambda_i$, the variance lost is $\sum_{i=M+1}^D \lambda_i$ — which is exactly the reconstruction error.
Once you have $B$ and $\Lambda$, the whitened data is $\mathbf{z}_n = \Lambda^{-1/2}B^\top\mathbf{x}_n$. It has identity sample covariance and is the input you want for any subsequent geometric algorithm.
Stack the centred data as rows: $\tilde X\in\mathbb{R}^{N\times D}$. Then $S = \tilde X^\top\tilde X / N$.
Compute the SVD $\tilde X = U\Sigma V^\top$. Then:
Computing the SVD of $\tilde X$ directly is more numerically stable than forming $\tilde X^\top\tilde X$ and eigendecomposing. Production libraries (NumPy, sklearn) implement PCA via the SVD.
The rank-$M$ matrix $U[:,1{:}M]\Sigma[1{:}M,1{:}M]V[:,1{:}M]^\top$ is the closest rank-$M$ approximation of $\tilde X$ (deck 04). PCA is therefore Eckart–Young applied to the centred data matrix.
Click on the canvas to add points; the principal-axis line and the second axis update live, with the projections drawn as ticks on the major axis. Numerical readouts give the variance kept on the major axis vs the residual.
For an isotropic blob, $\lambda_1 \approx \lambda_2$ and PCA is essentially uninformative — every direction is equally good. The more anisotropic the cloud, the better the rank-1 approximation.
When $D \gg N$ (think images: $D = 10^6$ pixels, $N = 1000$ examples), $S$ is a $D\times D$ matrix — intractable to form, let alone eigendecompose.
But the eigenvalues of $\tilde X^\top\tilde X$ and $\tilde X\tilde X^\top$ are the same (up to zeros). The matrix $\tilde X\tilde X^\top$ is $N\times N$, often tractable:
$$\tilde X\tilde X^\top \mathbf{u}_i = \mu_i \mathbf{u}_i, \qquad \mathbf{v}_i = \tilde X^\top\mathbf{u}_i / \|\tilde X^\top\mathbf{u}_i\|.$$
So you can recover the principal directions $\mathbf{v}_i$ in $\mathbb{R}^D$ by eigendecomposing the $N\times N$ Gram matrix and then projecting back through $\tilde X^\top$. This is the foundation of kernel PCA, where $\tilde X\tilde X^\top$ is replaced by the kernel matrix.
Treat PCA as a generative model:
$$\mathbf{z}_n \sim \mathcal N(\mathbf 0, I_M), \qquad \mathbf{x}_n \mid \mathbf{z}_n \sim \mathcal N(W\mathbf{z}_n + \boldsymbol\mu, \sigma^2 I_D).$$
Latent factors $\mathbf{z}_n$ are sampled, then a linear map plus isotropic noise produces $\mathbf{x}_n$. Marginalising out $\mathbf{z}_n$ (deck 06):
$$\mathbf{x}_n \sim \mathcal N(\boldsymbol\mu, WW^\top + \sigma^2 I).$$
The maximum-likelihood solution recovers exactly classical PCA (with a corrective $\sigma^2 I$ in the covariance). As $\sigma^2 \to 0$, probabilistic PCA → classical PCA.
It lets you sample new data, handle missing entries by EM (deck 11 generalises), evaluate a likelihood, and naturally extend to mixtures of PCAs — each cluster gets its own low-dim subspace.
Variance is a second-moment summary. For heavy-tailed or skewed distributions it can be a poor proxy for "interesting structure". ICA, robust PCA fix this.
A 2D swiss roll embedded in $\mathbb{R}^3$ has variance in every direction. PCA flattens it; you want a manifold method (t-SNE, UMAP, kernel PCA, autoencoders).
Features measured in different units dominate based on their numerical scale, not their information content. Always normalise (z-score) before PCA on tabular data.
PCA is still — by a long margin — the most-used dimensionality reduction tool in ML. It is well-understood, fast, and gives an explicit basis to interpret. Use it as a sanity check before trying anything fancier.
| Object | Formula |
|---|---|
| Sample covariance | $S = \tfrac1N \sum_n (\mathbf x_n-\boldsymbol\mu)(\mathbf x_n-\boldsymbol\mu)^\top$ |
| PCA directions | $\mathbf b_i$ = top eigenvectors of $S$ |
| Variance captured | $\sum_{i=1}^M \lambda_i / \sum_{i=1}^D \lambda_i$ |
| Reconstruction error | $\sum_{i=M+1}^D \lambda_i$ |
| Code | $\mathbf z_n = B^\top(\mathbf x_n - \boldsymbol\mu)$ |
| Whitening | $\mathbf w_n = \Lambda^{-1/2}B^\top(\mathbf x_n - \boldsymbol\mu)$ |
| PCA via SVD | $\tilde X = U\Sigma V^\top$, principal axes $=$ columns of $V$ |
| Dual / kernel | $\tilde X\tilde X^\top$ — size $N\times N$ instead of $D\times D$ |
| Probabilistic PCA | $\mathbf x \sim \mathcal N(\boldsymbol\mu, WW^\top + \sigma^2 I)$ |