Empirical risk, the bias-variance trade-off, regularisation, cross-validation, and the MLE / MAP split — the language of generalisation.
The setup, used unchanged by every Part II chapter:
$$R(\boldsymbol\theta) = \mathbb{E}_{(\mathbf{x},y)\sim p}[\ell(f_{\boldsymbol\theta}(\mathbf{x}), y)].$$
We can't compute the expectation — we don't know $p$ — so we replace it with an average. That gives the next slide.
$$\hat R_N(\boldsymbol\theta) = \frac{1}{N}\sum_{n=1}^N \ell(f_{\boldsymbol\theta}(\mathbf{x}_n), y_n),\qquad \hat{\boldsymbol\theta} = \arg\min_{\boldsymbol\theta\in\Theta} \hat R_N(\boldsymbol\theta).$$
Why it works: by the law of large numbers, $\hat R_N(\boldsymbol\theta) \to R(\boldsymbol\theta)$ pointwise as $N\to\infty$. With a bit more machinery (uniform convergence over $\Theta$), $\hat{\boldsymbol\theta}\to\boldsymbol\theta^\star$.
| Problem | Loss | Predictor at min |
|---|---|---|
| Regression | squared $\ell(\hat y, y) = (\hat y - y)^2$ | $\mathbb{E}[y\mid \mathbf{x}]$ (conditional mean) |
| Robust regression | absolute $\ell = |\hat y - y|$ | conditional median |
| Classification | 0-1 (intractable) / cross-entropy (proxy) | argmax of class posterior |
If the hypothesis class is too rich, $\hat R_N$ can be made arbitrarily small (e.g. memorising the training set) while the true $R$ stays large. The bias-variance picture on the next slide and the regularisation picture two slides later are both responses to this failure.
For squared loss, the expected test error at $\mathbf{x}$ decomposes into three additive pieces:
$$\underbrace{\mathbb{E}\bigl[(y - f_D(\mathbf{x}))^2\bigr]}_{\text{expected error}} = \underbrace{\bigl(\mathbb{E}[f_D(\mathbf{x})] - f^\star(\mathbf{x})\bigr)^2}_{\text{bias}^2} + \underbrace{\mathbb{V}[f_D(\mathbf{x})]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}.$$
$D$ is the random training set; $f_D$ is the predictor we'd fit to it; $f^\star = \mathbb{E}[y\mid\mathbf{x}]$ is the Bayes optimum.
Adding capacity drops bias and increases variance. The total error has a U shape. Regularisation, cross-validation and ensembling are all attempts to land at the bottom of that U.
Click on the canvas to add data points; slide the polynomial degree and the ridge $\lambda$. The fit, training error and (independently sampled) test error update live.
The "truth" curve is $y = \sin(\pi x) + \tfrac12 x$ plus Gaussian noise. Push degree past 8 with no regularisation and you'll see test MSE blow up while train MSE keeps falling — the textbook overfit. Add a tiny ridge and the high-degree fit calms down.
Add a penalty $\Omega(\boldsymbol\theta)$ to ERM:
$$\hat{\boldsymbol\theta} = \arg\min_{\boldsymbol\theta}\;\hat R_N(\boldsymbol\theta) + \lambda\,\Omega(\boldsymbol\theta).$$
| Penalty | Effect | Equivalent prior |
|---|---|---|
| $\|\boldsymbol\theta\|_2^2$ — ridge / L2 | shrinks toward 0; closed-form | Gaussian prior $\mathcal N(\mathbf 0, \lambda^{-1}I)$ |
| $\|\boldsymbol\theta\|_1$ — lasso / L1 | encourages sparsity | Laplace prior |
| $\|D\boldsymbol\theta\|_2^2$ — smoothness | penalises differences | GP prior (in function space) |
Reading the penalty as a log-prior:
$$\arg\min \;\hat R_N + \lambda\Omega \;=\; \arg\max\;\sum_n \log p(y_n\mid f_{\boldsymbol\theta}(\mathbf{x}_n)) + \log p(\boldsymbol\theta).$$
So regularised ERM = MAP estimation with the appropriate prior. The Bayesian framework gives the same answer at the mode, and an uncertainty estimate too.
Split once into train/validation. Fast, noisy with small $N$.
Split into $k$ folds; train on $k-1$ and test on the held-out fold; average. Standard for moderate $N$. $k=5$ or $10$ are the typical choices.
$k=N$. Almost unbiased; high variance; expensive. Cheap for linear models thanks to a closed-form (the hat matrix trick).
Cross-validation estimates the test error of a hyperparameter setting. For unbiased estimates of final test error, hold out a separate, untouched test set.
If you cross-validate over many hyperparameters, the best CV score is itself an overfit estimate. The "nested" CV ($k$-fold inside $k$-fold) is the safe pattern.
$$\hat{\boldsymbol\theta}_{\text{ML}} = \arg\max_{\boldsymbol\theta} \log p(\mathcal D\mid \boldsymbol\theta) = \arg\max \sum_n \log p(y_n\mid \mathbf{x}_n, \boldsymbol\theta).$$
For Gaussian noise it reduces to least squares. Consistent under model correctness; can overfit when $N$ is small.
$$\hat{\boldsymbol\theta}_{\text{MAP}} = \arg\max_{\boldsymbol\theta} \log p(\boldsymbol\theta\mid\mathcal D) = \arg\max \bigl[\log p(\mathcal D\mid\boldsymbol\theta) + \log p(\boldsymbol\theta)\bigr].$$
Adds a prior; equivalent to regularised MLE. Still a point estimate — doesn't capture uncertainty.
Compute (or approximate) the full posterior $p(\boldsymbol\theta\mid\mathcal D)$. Decks 09 and 11 do this in closed form for the Gaussian / mixture cases.
Model $p(y\mid\mathbf{x},\boldsymbol\theta)$ directly. Cheaper and usually more accurate when you only care about predicting $y$.
Examples: linear / logistic regression, SVM, neural classifier.
Model the joint $p(\mathbf{x}, y\mid\boldsymbol\theta) = p(\mathbf{x}\mid y,\boldsymbol\theta)\,p(y\mid\boldsymbol\theta)$. Generates new data; gives a natural way to handle missing inputs.
Examples: naïve Bayes, LDA / QDA, GMM, generative models (VAEs, diffusion).
The book reminds us that the choice isn't binary: many discriminative models (logistic regression) are best-derived from generative ones (LDA), and conversely many generative models train discriminatively today.
A directed graphical model (Bayesian network) is a DAG whose nodes are random variables and whose edges encode conditional dependence. The joint factorises:
$$p(\mathbf{x}_1,\dots,\mathbf{x}_n) = \prod_{i=1}^n p(\mathbf{x}_i \mid \mathrm{Pa}(\mathbf{x}_i)).$$
Two readings:
Once you have a family parameterised by a hyperparameter (regularisation strength, number of components, polynomial degree…), you have to pick one. Three strategies:
| Strategy | Idea | Strengths |
|---|---|---|
| Cross-validation | Estimate test error by held-out data; pick min | Model-agnostic; just an extra fit |
| Information criteria (AIC, BIC) | Penalise log-likelihood by parameter count | No extra fits; needs likelihood |
| Marginal likelihood / "evidence" | $p(\mathcal D) = \int p(\mathcal D\mid\boldsymbol\theta)p(\boldsymbol\theta)d\boldsymbol\theta$ | Fully Bayesian; automatic Occam's razor |
The evidence integral is the same $p(\mathcal D)$ that we discarded as a normaliser in Bayes' rule. When you can compute it (Gaussian linear regression in deck 09), it picks the model size that balances data fit and prior plausibility — without needing held-out data.
| Concept | Used by | Where |
|---|---|---|
| ERM with MSE | Linear regression (ch. 9) | Becomes the normal equation |
| Ridge = MAP | Linear regression, kernel methods | Penalty $\lambda \|\boldsymbol\theta\|^2$ |
| Generative likelihood | GMM (ch. 11) | EM maximises the data likelihood |
| Bias-variance & CV | All | Choosing model size / kernel / $\lambda$ |
| Posterior | Bayesian linear regression (ch. 9) | Closed-form Gaussian posterior |
| Graphical model | GMM, PCA | Latent variables, plate notation |
| Quantity | Formula |
|---|---|
| Empirical risk | $\hat R_N(\boldsymbol\theta) = \tfrac1N\sum_n \ell(f_{\boldsymbol\theta}(\mathbf x_n), y_n)$ |
| Squared loss min | at $f^\star(\mathbf x) = \mathbb E[y\mid\mathbf x]$ |
| Bias-variance | err $=$ bias$^2 +$ variance $+$ noise |
| MLE | $\arg\max \sum\log p(y_n\mid\mathbf x_n,\boldsymbol\theta)$ |
| MAP | $\arg\max [\log p(\mathcal D\mid\boldsymbol\theta) + \log p(\boldsymbol\theta)]$ |
| Ridge = Gaussian prior | $\lambda\|\boldsymbol\theta\|^2 \leftrightarrow \mathcal N(\mathbf 0, \lambda^{-1}I)$ |
| Lasso = Laplace prior | $\lambda\|\boldsymbol\theta\|_1$ |
| DAG factorisation | $p(\mathbf x) = \prod_i p(\mathbf x_i\mid \mathrm{Pa}(\mathbf x_i))$ |
| Evidence | $p(\mathcal D) = \int p(\mathcal D\mid\boldsymbol\theta)p(\boldsymbol\theta)d\boldsymbol\theta$ |