MML 08 — When Models Meet Data

00

Topics We'll Cover

Data, model, predictor — the contract
Empirical risk minimisation
The bias-variance decomposition
Interactive: polynomial fitting
Regularisation as a prior
Cross-validation
MLE vs MAP
Probabilistic modelling & the gen/disc split
Directed graphical models
Model selection
Where this lands in Part II
Cheat sheet

01

Data, Model, Predictor — the Contract

The setup, used unchanged by every Part II chapter:

Data: an i.i.d. sample $\{(\mathbf{x}_n, y_n)\}_{n=1}^N$ from an unknown joint $p(\mathbf{x}, y)$.
Hypothesis class: a set of candidate predictors $\mathcal{H} = \{f_{\boldsymbol\theta} : \boldsymbol\theta\in\Theta\}$.
Loss: $\ell:\mathcal Y\times\mathcal Y\to\mathbb{R}_{\ge 0}$ scoring a prediction against ground truth.
Goal: find $\boldsymbol\theta$ minimising the expected risk

$$R(\boldsymbol\theta) = \mathbb{E}_{(\mathbf{x},y)\sim p}[\ell(f_{\boldsymbol\theta}(\mathbf{x}), y)].$$

We can't compute the expectation — we don't know $p$ — so we replace it with an average. That gives the next slide.

02

Empirical Risk Minimisation

$$\hat R_N(\boldsymbol\theta) = \frac{1}{N}\sum_{n=1}^N \ell(f_{\boldsymbol\theta}(\mathbf{x}_n), y_n),\qquad \hat{\boldsymbol\theta} = \arg\min_{\boldsymbol\theta\in\Theta} \hat R_N(\boldsymbol\theta).$$

Why it works: by the law of large numbers, $\hat R_N(\boldsymbol\theta) \to R(\boldsymbol\theta)$ pointwise as $N\to\infty$. With a bit more machinery (uniform convergence over $\Theta$), $\hat{\boldsymbol\theta}\to\boldsymbol\theta^\star$.

Three canonical losses

Problem	Loss	Predictor at min
Regression	squared $\ell(\hat y, y) = (\hat y - y)^2$	$\mathbb{E}[y\mid \mathbf{x}]$ (conditional mean)
Robust regression	absolute $\ell = \|\hat y - y\|$	conditional median
Classification	0-1 (intractable) / cross-entropy (proxy)	argmax of class posterior

When ERM fails

If the hypothesis class is too rich, $\hat R_N$ can be made arbitrarily small (e.g. memorising the training set) while the true $R$ stays large. The bias-variance picture on the next slide and the regularisation picture two slides later are both responses to this failure.

03

The Bias-Variance Decomposition

For squared loss, the expected test error at $\mathbf{x}$ decomposes into three additive pieces:

$$\underbrace{\mathbb{E}\bigl[(y - f_D(\mathbf{x}))^2\bigr]}_{\text{expected error}} = \underbrace{\bigl(\mathbb{E}[f_D(\mathbf{x})] - f^\star(\mathbf{x})\bigr)^2}_{\text{bias}^2} + \underbrace{\mathbb{V}[f_D(\mathbf{x})]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}.$$

$D$ is the random training set; $f_D$ is the predictor we'd fit to it; $f^\star = \mathbb{E}[y\mid\mathbf{x}]$ is the Bayes optimum.

Reading the decomposition

Bias is how far our average predictor is from the truth — high when the model class is too rigid (underfitting).
Variance is how much $f_D$ wobbles when we redraw the data — high when the class is too flexible (overfitting).
Noise is irreducible.

Adding capacity drops bias and increases variance. The total error has a U shape. Regularisation, cross-validation and ensembling are all attempts to land at the bottom of that U.

04

Interactive: Polynomial Fitting

Click on the canvas to add data points; slide the polynomial degree and the ridge $\lambda$. The fit, training error and (independently sampled) test error update live.

degree

3

ridge λ (log)

1e-6

N (training)

—

train MSE

—

test MSE

—

The "truth" curve is $y = \sin(\pi x) + \tfrac12 x$ plus Gaussian noise. Push degree past 8 with no regularisation and you'll see test MSE blow up while train MSE keeps falling — the textbook overfit. Add a tiny ridge and the high-degree fit calms down.

05

Regularisation as a Prior

Add a penalty $\Omega(\boldsymbol\theta)$ to ERM:

$$\hat{\boldsymbol\theta} = \arg\min_{\boldsymbol\theta}\;\hat R_N(\boldsymbol\theta) + \lambda\,\Omega(\boldsymbol\theta).$$

Penalty	Effect	Equivalent prior
$\\|\boldsymbol\theta\\|_2^2$ — ridge / L2	shrinks toward 0; closed-form	Gaussian prior $\mathcal N(\mathbf 0, \lambda^{-1}I)$
$\\|\boldsymbol\theta\\|_1$ — lasso / L1	encourages sparsity	Laplace prior
$\\|D\boldsymbol\theta\\|_2^2$ — smoothness	penalises differences	GP prior (in function space)

Reading the penalty as a log-prior:

$$\arg\min \;\hat R_N + \lambda\Omega \;=\; \arg\max\;\sum_n \log p(y_n\mid f_{\boldsymbol\theta}(\mathbf{x}_n)) + \log p(\boldsymbol\theta).$$

So regularised ERM = MAP estimation with the appropriate prior. The Bayesian framework gives the same answer at the mode, and an uncertainty estimate too.

06

Cross-Validation

Hold-out

Split once into train/validation. Fast, noisy with small $N$.

$k$-fold

Split into $k$ folds; train on $k-1$ and test on the held-out fold; average. Standard for moderate $N$. $k=5$ or $10$ are the typical choices.

Leave-one-out

$k=N$. Almost unbiased; high variance; expensive. Cheap for linear models thanks to a closed-form (the hat matrix trick).

Cross-validation estimates the test error of a hyperparameter setting. For unbiased estimates of final test error, hold out a separate, untouched test set.

Common foot-gun

If you cross-validate over many hyperparameters, the best CV score is itself an overfit estimate. The "nested" CV ($k$-fold inside $k$-fold) is the safe pattern.

07

MLE vs MAP

Maximum likelihood

$$\hat{\boldsymbol\theta}_{\text{ML}} = \arg\max_{\boldsymbol\theta} \log p(\mathcal D\mid \boldsymbol\theta) = \arg\max \sum_n \log p(y_n\mid \mathbf{x}_n, \boldsymbol\theta).$$

For Gaussian noise it reduces to least squares. Consistent under model correctness; can overfit when $N$ is small.

Maximum a posteriori

$$\hat{\boldsymbol\theta}_{\text{MAP}} = \arg\max_{\boldsymbol\theta} \log p(\boldsymbol\theta\mid\mathcal D) = \arg\max \bigl[\log p(\mathcal D\mid\boldsymbol\theta) + \log p(\boldsymbol\theta)\bigr].$$

Adds a prior; equivalent to regularised MLE. Still a point estimate — doesn't capture uncertainty.

Fully Bayesian

Compute (or approximate) the full posterior $p(\boldsymbol\theta\mid\mathcal D)$. Decks 09 and 11 do this in closed form for the Gaussian / mixture cases.

08

Probabilistic Modelling: Generative vs Discriminative

Discriminative

Model $p(y\mid\mathbf{x},\boldsymbol\theta)$ directly. Cheaper and usually more accurate when you only care about predicting $y$.

Examples: linear / logistic regression, SVM, neural classifier.

Generative

Model the joint $p(\mathbf{x}, y\mid\boldsymbol\theta) = p(\mathbf{x}\mid y,\boldsymbol\theta)\,p(y\mid\boldsymbol\theta)$. Generates new data; gives a natural way to handle missing inputs.

Examples: naïve Bayes, LDA / QDA, GMM, generative models (VAEs, diffusion).

The book reminds us that the choice isn't binary: many discriminative models (logistic regression) are best-derived from generative ones (LDA), and conversely many generative models train discriminatively today.

09

Directed Graphical Models

A directed graphical model (Bayesian network) is a DAG whose nodes are random variables and whose edges encode conditional dependence. The joint factorises:

$$p(\mathbf{x}_1,\dots,\mathbf{x}_n) = \prod_{i=1}^n p(\mathbf{x}_i \mid \mathrm{Pa}(\mathbf{x}_i)).$$

Two readings:

Modelling: writing down the DAG is writing down independence assumptions.
Inference: efficient algorithms (variable elimination, belief propagation, message passing) exploit the factorisation.

Examples used in this book

Linear regression: $\boldsymbol\theta \to y_n \leftarrow \mathbf{x}_n$, plus shared $\sigma^2$.
GMM: $\pi \to z_n \to \mathbf{x}_n \leftarrow \boldsymbol\mu_k, \Sigma_k$.
Probabilistic PCA: $\mathbf{z}_n \to \mathbf{x}_n \leftarrow W, \boldsymbol\mu, \sigma^2$.

10

Model Selection

Once you have a family parameterised by a hyperparameter (regularisation strength, number of components, polynomial degree…), you have to pick one. Three strategies:

Strategy	Idea	Strengths
Cross-validation	Estimate test error by held-out data; pick min	Model-agnostic; just an extra fit
Information criteria (AIC, BIC)	Penalise log-likelihood by parameter count	No extra fits; needs likelihood
Marginal likelihood / "evidence"	$p(\mathcal D) = \int p(\mathcal D\mid\boldsymbol\theta)p(\boldsymbol\theta)d\boldsymbol\theta$	Fully Bayesian; automatic Occam's razor

The evidence integral is the same $p(\mathcal D)$ that we discarded as a normaliser in Bayes' rule. When you can compute it (Gaussian linear regression in deck 09), it picks the model size that balances data fit and prior plausibility — without needing held-out data.

11

Where This Lands in Part II

Concept	Used by	Where
ERM with MSE	Linear regression (ch. 9)	Becomes the normal equation
Ridge = MAP	Linear regression, kernel methods	Penalty $\lambda \\|\boldsymbol\theta\\|^2$
Generative likelihood	GMM (ch. 11)	EM maximises the data likelihood
Bias-variance & CV	All	Choosing model size / kernel / $\lambda$
Posterior	Bayesian linear regression (ch. 9)	Closed-form Gaussian posterior
Graphical model	GMM, PCA	Latent variables, plate notation

12

Cheat Sheet

Quantity	Formula
Empirical risk	$\hat R_N(\boldsymbol\theta) = \tfrac1N\sum_n \ell(f_{\boldsymbol\theta}(\mathbf x_n), y_n)$
Squared loss min	at $f^\star(\mathbf x) = \mathbb E[y\mid\mathbf x]$
Bias-variance	err $=$ bias$^2 +$ variance $+$ noise
MLE	$\arg\max \sum\log p(y_n\mid\mathbf x_n,\boldsymbol\theta)$
MAP	$\arg\max [\log p(\mathcal D\mid\boldsymbol\theta) + \log p(\boldsymbol\theta)]$
Ridge = Gaussian prior	$\lambda\\|\boldsymbol\theta\\|^2 \leftrightarrow \mathcal N(\mathbf 0, \lambda^{-1}I)$
Lasso = Laplace prior	$\lambda\\|\boldsymbol\theta\\|_1$
DAG factorisation	$p(\mathbf x) = \prod_i p(\mathbf x_i\mid \mathrm{Pa}(\mathbf x_i))$
Evidence	$p(\mathcal D) = \int p(\mathcal D\mid\boldsymbol\theta)p(\boldsymbol\theta)d\boldsymbol\theta$