Mathematics for Machine Learning — Deck 08

When Models Meet Data

Empirical risk, the bias-variance trade-off, regularisation, cross-validation, and the MLE / MAP split — the language of generalisation.

ERMbias-variance regularisationcross-val MLEMAPgraphical models
00

Topics We'll Cover

01

Data, Model, Predictor — the Contract

The setup, used unchanged by every Part II chapter:

$$R(\boldsymbol\theta) = \mathbb{E}_{(\mathbf{x},y)\sim p}[\ell(f_{\boldsymbol\theta}(\mathbf{x}), y)].$$

We can't compute the expectation — we don't know $p$ — so we replace it with an average. That gives the next slide.

02

Empirical Risk Minimisation

$$\hat R_N(\boldsymbol\theta) = \frac{1}{N}\sum_{n=1}^N \ell(f_{\boldsymbol\theta}(\mathbf{x}_n), y_n),\qquad \hat{\boldsymbol\theta} = \arg\min_{\boldsymbol\theta\in\Theta} \hat R_N(\boldsymbol\theta).$$

Why it works: by the law of large numbers, $\hat R_N(\boldsymbol\theta) \to R(\boldsymbol\theta)$ pointwise as $N\to\infty$. With a bit more machinery (uniform convergence over $\Theta$), $\hat{\boldsymbol\theta}\to\boldsymbol\theta^\star$.

Three canonical losses

ProblemLossPredictor at min
Regressionsquared $\ell(\hat y, y) = (\hat y - y)^2$$\mathbb{E}[y\mid \mathbf{x}]$ (conditional mean)
Robust regressionabsolute $\ell = |\hat y - y|$conditional median
Classification0-1 (intractable) / cross-entropy (proxy)argmax of class posterior
When ERM fails

If the hypothesis class is too rich, $\hat R_N$ can be made arbitrarily small (e.g. memorising the training set) while the true $R$ stays large. The bias-variance picture on the next slide and the regularisation picture two slides later are both responses to this failure.

03

The Bias-Variance Decomposition

For squared loss, the expected test error at $\mathbf{x}$ decomposes into three additive pieces:

$$\underbrace{\mathbb{E}\bigl[(y - f_D(\mathbf{x}))^2\bigr]}_{\text{expected error}} = \underbrace{\bigl(\mathbb{E}[f_D(\mathbf{x})] - f^\star(\mathbf{x})\bigr)^2}_{\text{bias}^2} + \underbrace{\mathbb{V}[f_D(\mathbf{x})]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}.$$

$D$ is the random training set; $f_D$ is the predictor we'd fit to it; $f^\star = \mathbb{E}[y\mid\mathbf{x}]$ is the Bayes optimum.

Reading the decomposition

Adding capacity drops bias and increases variance. The total error has a U shape. Regularisation, cross-validation and ensembling are all attempts to land at the bottom of that U.

04

Interactive: Polynomial Fitting

Click on the canvas to add data points; slide the polynomial degree and the ridge $\lambda$. The fit, training error and (independently sampled) test error update live.

3
1e-6
N (training)
train MSE
test MSE

The "truth" curve is $y = \sin(\pi x) + \tfrac12 x$ plus Gaussian noise. Push degree past 8 with no regularisation and you'll see test MSE blow up while train MSE keeps falling — the textbook overfit. Add a tiny ridge and the high-degree fit calms down.

05

Regularisation as a Prior

Add a penalty $\Omega(\boldsymbol\theta)$ to ERM:

$$\hat{\boldsymbol\theta} = \arg\min_{\boldsymbol\theta}\;\hat R_N(\boldsymbol\theta) + \lambda\,\Omega(\boldsymbol\theta).$$

PenaltyEffectEquivalent prior
$\|\boldsymbol\theta\|_2^2$ — ridge / L2shrinks toward 0; closed-formGaussian prior $\mathcal N(\mathbf 0, \lambda^{-1}I)$
$\|\boldsymbol\theta\|_1$ — lasso / L1encourages sparsityLaplace prior
$\|D\boldsymbol\theta\|_2^2$ — smoothnesspenalises differencesGP prior (in function space)

Reading the penalty as a log-prior:

$$\arg\min \;\hat R_N + \lambda\Omega \;=\; \arg\max\;\sum_n \log p(y_n\mid f_{\boldsymbol\theta}(\mathbf{x}_n)) + \log p(\boldsymbol\theta).$$

So regularised ERM = MAP estimation with the appropriate prior. The Bayesian framework gives the same answer at the mode, and an uncertainty estimate too.

06

Cross-Validation

Hold-out

Split once into train/validation. Fast, noisy with small $N$.

$k$-fold

Split into $k$ folds; train on $k-1$ and test on the held-out fold; average. Standard for moderate $N$. $k=5$ or $10$ are the typical choices.

Leave-one-out

$k=N$. Almost unbiased; high variance; expensive. Cheap for linear models thanks to a closed-form (the hat matrix trick).

Cross-validation estimates the test error of a hyperparameter setting. For unbiased estimates of final test error, hold out a separate, untouched test set.

Common foot-gun

If you cross-validate over many hyperparameters, the best CV score is itself an overfit estimate. The "nested" CV ($k$-fold inside $k$-fold) is the safe pattern.

07

MLE vs MAP

Maximum likelihood

$$\hat{\boldsymbol\theta}_{\text{ML}} = \arg\max_{\boldsymbol\theta} \log p(\mathcal D\mid \boldsymbol\theta) = \arg\max \sum_n \log p(y_n\mid \mathbf{x}_n, \boldsymbol\theta).$$

For Gaussian noise it reduces to least squares. Consistent under model correctness; can overfit when $N$ is small.

Maximum a posteriori

$$\hat{\boldsymbol\theta}_{\text{MAP}} = \arg\max_{\boldsymbol\theta} \log p(\boldsymbol\theta\mid\mathcal D) = \arg\max \bigl[\log p(\mathcal D\mid\boldsymbol\theta) + \log p(\boldsymbol\theta)\bigr].$$

Adds a prior; equivalent to regularised MLE. Still a point estimate — doesn't capture uncertainty.

Fully Bayesian

Compute (or approximate) the full posterior $p(\boldsymbol\theta\mid\mathcal D)$. Decks 09 and 11 do this in closed form for the Gaussian / mixture cases.

08

Probabilistic Modelling: Generative vs Discriminative

Discriminative

Model $p(y\mid\mathbf{x},\boldsymbol\theta)$ directly. Cheaper and usually more accurate when you only care about predicting $y$.

Examples: linear / logistic regression, SVM, neural classifier.

Generative

Model the joint $p(\mathbf{x}, y\mid\boldsymbol\theta) = p(\mathbf{x}\mid y,\boldsymbol\theta)\,p(y\mid\boldsymbol\theta)$. Generates new data; gives a natural way to handle missing inputs.

Examples: naïve Bayes, LDA / QDA, GMM, generative models (VAEs, diffusion).

The book reminds us that the choice isn't binary: many discriminative models (logistic regression) are best-derived from generative ones (LDA), and conversely many generative models train discriminatively today.

09

Directed Graphical Models

A directed graphical model (Bayesian network) is a DAG whose nodes are random variables and whose edges encode conditional dependence. The joint factorises:

$$p(\mathbf{x}_1,\dots,\mathbf{x}_n) = \prod_{i=1}^n p(\mathbf{x}_i \mid \mathrm{Pa}(\mathbf{x}_i)).$$

Two readings:

Examples used in this book

10

Model Selection

Once you have a family parameterised by a hyperparameter (regularisation strength, number of components, polynomial degree…), you have to pick one. Three strategies:

StrategyIdeaStrengths
Cross-validationEstimate test error by held-out data; pick minModel-agnostic; just an extra fit
Information criteria (AIC, BIC)Penalise log-likelihood by parameter countNo extra fits; needs likelihood
Marginal likelihood / "evidence"$p(\mathcal D) = \int p(\mathcal D\mid\boldsymbol\theta)p(\boldsymbol\theta)d\boldsymbol\theta$Fully Bayesian; automatic Occam's razor

The evidence integral is the same $p(\mathcal D)$ that we discarded as a normaliser in Bayes' rule. When you can compute it (Gaussian linear regression in deck 09), it picks the model size that balances data fit and prior plausibility — without needing held-out data.

11

Where This Lands in Part II

ConceptUsed byWhere
ERM with MSELinear regression (ch. 9)Becomes the normal equation
Ridge = MAPLinear regression, kernel methodsPenalty $\lambda \|\boldsymbol\theta\|^2$
Generative likelihoodGMM (ch. 11)EM maximises the data likelihood
Bias-variance & CVAllChoosing model size / kernel / $\lambda$
PosteriorBayesian linear regression (ch. 9)Closed-form Gaussian posterior
Graphical modelGMM, PCALatent variables, plate notation
12

Cheat Sheet

QuantityFormula
Empirical risk$\hat R_N(\boldsymbol\theta) = \tfrac1N\sum_n \ell(f_{\boldsymbol\theta}(\mathbf x_n), y_n)$
Squared loss minat $f^\star(\mathbf x) = \mathbb E[y\mid\mathbf x]$
Bias-varianceerr $=$ bias$^2 +$ variance $+$ noise
MLE$\arg\max \sum\log p(y_n\mid\mathbf x_n,\boldsymbol\theta)$
MAP$\arg\max [\log p(\mathcal D\mid\boldsymbol\theta) + \log p(\boldsymbol\theta)]$
Ridge = Gaussian prior$\lambda\|\boldsymbol\theta\|^2 \leftrightarrow \mathcal N(\mathbf 0, \lambda^{-1}I)$
Lasso = Laplace prior$\lambda\|\boldsymbol\theta\|_1$
DAG factorisation$p(\mathbf x) = \prod_i p(\mathbf x_i\mid \mathrm{Pa}(\mathbf x_i))$
Evidence$p(\mathcal D) = \int p(\mathcal D\mid\boldsymbol\theta)p(\boldsymbol\theta)d\boldsymbol\theta$