MML 09 — Linear Regression

00

Topics We'll Cover

The model and its assumptions
Maximum-likelihood = least squares
Least squares as orthogonal projection
Ridge regression = MAP under a Gaussian prior
Feature maps
Bayesian linear regression: prior & posterior
Posterior predictive distribution
Interactive: posterior shrink as data arrives
Marginal likelihood & model selection
Cheat sheet

01

The Model and Its Assumptions

$$y_n = \boldsymbol\theta^\top \boldsymbol\phi(\mathbf{x}_n) + \varepsilon_n, \qquad \varepsilon_n \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2).$$

$\boldsymbol\phi:\mathbb{R}^D\to\mathbb{R}^M$ is a fixed feature map;
$\boldsymbol\theta\in\mathbb{R}^M$ are the parameters we'll learn;
$\varepsilon_n$ is independent Gaussian noise.

"Linear" refers to linearity in $\boldsymbol\theta$, not in $\mathbf{x}$. Quadratic, sinusoidal and RBF regression all sit inside this model with the right $\boldsymbol\phi$.

Vector form

Stack the data: $\mathbf{y} = (y_1,\dots,y_N)^\top$, $\Phi = [\boldsymbol\phi(\mathbf{x}_1)^\top; \dots; \boldsymbol\phi(\mathbf{x}_N)^\top]\in\mathbb{R}^{N\times M}$. Then

$$\mathbf{y} = \Phi\boldsymbol\theta + \boldsymbol\varepsilon, \qquad \boldsymbol\varepsilon\sim\mathcal{N}(\mathbf{0}, \sigma^2 I).$$

$\Phi$ is the design matrix. The entire chapter is a study of this single equation.
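As a minimal sketch of the vector form, the snippet below builds a design matrix and simulates $\mathbf{y} = \Phi\boldsymbol\theta + \boldsymbol\varepsilon$. The polynomial feature map, the helper name `design_matrix`, the "true" parameters and the noise level are all illustrative assumptions, not part of the deck.

```python
import numpy as np

def design_matrix(x, M):
    """Hypothetical helper: rows are phi(x_n) = (1, x_n, ..., x_n^(M-1))."""
    return np.vander(x, M, increasing=True)

rng = np.random.default_rng(0)
N, M, sigma = 20, 3, 0.1
theta_true = np.array([0.5, -1.0, 2.0])   # assumed parameters, for illustration only

x = rng.uniform(-1.0, 1.0, size=N)
Phi = design_matrix(x, M)                 # shape (N, M)
eps = rng.normal(0.0, sigma, size=N)      # iid Gaussian noise with variance sigma^2
y = Phi @ theta_true + eps                # y = Phi theta + eps

print(Phi.shape, y.shape)
```

The model is linear in `theta_true` even though each row of `Phi` is a nonlinear function of the scalar input.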

02

Maximum-Likelihood = Least Squares

Under the Gaussian noise model the log-likelihood is

$$\log p(\mathbf{y}\mid\boldsymbol\theta) = -\frac{1}{2\sigma^2}\|\mathbf{y}-\Phi\boldsymbol\theta\|^2 + \text{const.}$$

So maximising likelihood is exactly minimising the sum of squared residuals. Setting the gradient (deck 05) to zero:

$$\nabla_{\boldsymbol\theta}\|\mathbf{y}-\Phi\boldsymbol\theta\|^2 = -2\Phi^\top(\mathbf{y}-\Phi\boldsymbol\theta) = 0 \;\Longrightarrow\; \Phi^\top\Phi\,\boldsymbol\theta = \Phi^\top\mathbf{y}.$$

This is the normal equation. If $\Phi^\top\Phi$ is invertible, which holds exactly when the columns of $\Phi$ are linearly independent:

$$\hat{\boldsymbol\theta}_{\text{ML}} = (\Phi^\top\Phi)^{-1}\Phi^\top\mathbf{y}.$$
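One way to check the closed form numerically, assuming synthetic quadratic data (the ground-truth vector and noise level are made up for the example): solve the normal equation directly and compare with NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma = 50, 0.1
theta_true = np.array([1.0, -2.0, 0.5])   # assumed ground truth
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.vander(x, 3, increasing=True)    # features (1, x, x^2)
y = Phi @ theta_true + rng.normal(0.0, sigma, size=N)

# Normal equation: solve (Phi^T Phi) theta = Phi^T y without forming an inverse.
theta_ne = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Library least-squares solver, which works on Phi directly.
theta_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# The gradient of the squared error should vanish at the solution.
grad = -2.0 * Phi.T @ (y - Phi @ theta_ne)

print(theta_ne, theta_ls, np.abs(grad).max())
```

Forming $\Phi^\top\Phi$ squares the condition number of $\Phi$, so for badly conditioned features a solver that works on $\Phi$ itself (as `lstsq` does) is usually the safer choice.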

When $\Phi$ is rank-deficient

The solution is no longer unique; the set of minimisers is an affine subspace. The minimum-norm solution is $\hat{\boldsymbol\theta} = \Phi^+ \mathbf{y}$, where $\Phi^+ = V\Sigma^+ U^\top$ is the Moore-Penrose pseudoinverse built from the SVD $\Phi = U\Sigma V^\top$ (deck 04). Equivalently, it is the limit of the ridge solution as $\lambda\to 0^+$.
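A small sketch of the rank-deficient case, assuming a design matrix with a deliberately duplicated column (an artificial construction for illustration). It shows that different minimisers give the same fit, that the pseudoinverse picks the shortest one, and that ridge with a small $\lambda$ lands close to it.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 30
x = rng.normal(size=N)
# The third column duplicates the second, so Phi has rank 2 and Phi^T Phi is singular.
Phi = np.column_stack([np.ones(N), x, x])
y = 1.0 + 3.0 * x + rng.normal(0.0, 0.1, size=N)

# Minimum-norm least-squares solution; rcond is set explicitly to drop the zero singular value.
theta_pinv = np.linalg.pinv(Phi, rcond=1e-10) @ y

# Another minimiser: shift along the null-space direction (0, 1, -1).
theta_other = theta_pinv + np.array([0.0, 1.0, -1.0])

# Ridge with a small lambda approximates the pseudoinverse solution.
lam = 1e-6
theta_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ y)

print(theta_pinv, theta_ridge)
print(np.linalg.norm(theta_pinv), np.linalg.norm(theta_other))
```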

03

Least Squares as Orthogonal Projection

The fitted values are

$$\hat{\mathbf{y}} = \Phi\hat{\boldsymbol\theta}_{\text{ML}} = \Phi(\Phi^\top\Phi)^{-1}\Phi^\top \mathbf{y}.$$

The matrix multiplying $\mathbf{y}$ is exactly the orthogonal projection onto $\mathrm{Col}(\Phi)$ from deck 03. So:

Geometric statement

$\hat{\mathbf{y}}$ is the closest point in the column space $\mathrm{Col}(\Phi)$ to the observation vector $\mathbf{y}$. The residual $\mathbf{y}-\hat{\mathbf{y}}$ is perpendicular to $\mathrm{Col}(\Phi)$ — that's why $\Phi^\top(\mathbf{y}-\Phi\hat{\boldsymbol\theta}_{\text{ML}}) = 0$.

Once you see this, every property of linear regression follows by reading the deck-03 picture:

$\hat{\mathbf{y}}$ is unique even when $\hat{\boldsymbol\theta}$ isn't;
The "hat matrix" $H = \Phi(\Phi^\top\Phi)^{-1}\Phi^\top$ (in general, $H = \Phi\Phi^+$) is idempotent ($H^2 = H$) and symmetric;
$\mathrm{tr}(H) = \mathrm{rank}(\Phi)$ — the number of "effective degrees of freedom".
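The projection properties listed above can be verified numerically. This is a sketch on a random full-column-rank design matrix (an assumption made purely so that $\Phi^\top\Phi$ is invertible).

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 15, 4
Phi = rng.normal(size=(N, M))                   # random design, full column rank almost surely
y = rng.normal(size=N)

# Hat matrix H = Phi (Phi^T Phi)^{-1} Phi^T, computed with a linear solve.
H = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)
y_hat = H @ y
residual = y - y_hat

print(np.allclose(H @ H, H))                    # checks idempotence
print(np.allclose(H, H.T))                      # checks symmetry
print(np.trace(H), np.linalg.matrix_rank(Phi))  # trace versus rank
print(np.abs(Phi.T @ residual).max())           # residual against Col(Phi), numerically near zero
```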

04

Ridge Regression = MAP under a Gaussian Prior

Place a prior $\boldsymbol\theta \sim \mathcal{N}(\mathbf{0}, \tau^2 I)$. The MAP estimator (deck 08) is

$$\hat{\boldsymbol\theta}_{\text{MAP}} = \arg\min_{\boldsymbol\theta}\;\frac{1}{2\sigma^2}\|\mathbf{y}-\Phi\boldsymbol\theta\|^2 + \frac{1}{2\tau^2}\|\boldsymbol\theta\|^2.$$

Setting the gradient to zero:

$$\hat{\boldsymbol\theta}_{\text{MAP}} = \Bigl(\Phi^\top\Phi + \frac{\sigma^2}{\tau^2} I\Bigr)^{-1}\Phi^\top\mathbf{y}.$$

This is ridge regression with $\lambda = \sigma^2/\tau^2$. Three observations:

The added $\lambda I$ guarantees invertibility — even when $\Phi^\top\Phi$ is singular.
The Gaussian prior shrinks $\boldsymbol\theta$ toward $\mathbf{0}$; the strength $\lambda = \sigma^2/\tau^2$ is the ratio of the prior precision $1/\tau^2$ to the noise precision $1/\sigma^2$.
As $N\to\infty$, $\Phi^\top\Phi$ typically grows in proportion to $N$ while $\lambda$ stays fixed, so the data overwhelms the prior and $\hat{\boldsymbol\theta}_{\text{MAP}}\to\hat{\boldsymbol\theta}_{\text{ML}}$.
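The three observations can be illustrated with a short experiment. This is a sketch under assumed values for $\sigma$, $\tau$ and a random Gaussian design; the helper name `ridge` is ours.

```python
import numpy as np

def ridge(Phi, y, lam):
    """Solve (Phi^T Phi + lam I) theta = Phi^T y."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

rng = np.random.default_rng(4)
sigma, tau = 0.5, 1.0                       # assumed noise and prior scales
lam = sigma**2 / tau**2
theta_true = rng.normal(0.0, tau, size=5)   # one draw from the prior N(0, tau^2 I)

results = []
for N in (10, 100, 10_000):
    Phi = rng.normal(size=(N, 5))
    y = Phi @ theta_true + rng.normal(0.0, sigma, size=N)
    theta_ml = np.linalg.lstsq(Phi, y, rcond=None)[0]
    theta_map = ridge(Phi, y, lam)
    gap = np.linalg.norm(theta_map - theta_ml)
    results.append((N, np.linalg.norm(theta_map), np.linalg.norm(theta_ml), gap))
    print(results[-1])
```

In each row the MAP norm is no larger than the ML norm (shrinkage), and the gap between the two estimates becomes very small at the largest $N$.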

05

Feature Maps

The same machinery works for any $\boldsymbol\phi$. Three useful choices:

Feature map	Form	Comment
Polynomial	$\boldsymbol\phi(x) = (1, x, x^2, \dots, x^{M-1})$	Cheap to evaluate; the basis is numerically ill-conditioned for large $M$. Orthogonal polynomials fix this.
Radial basis	$\phi_m(x) = e^{-\|x-\mu_m\|^2 / (2s^2)}$	Place RBF centres on the data; localised features.
Fourier	$\phi_m(x) \in \{\sin(mx), \cos(mx)\}$	Orthogonal basis on $[0, 2\pi]$ (orthonormal after scaling by $1/\sqrt{\pi}$); well-conditioned.
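The three feature maps in the table might be implemented as follows for scalar inputs. The function names, grid, centres and bandwidth are illustrative assumptions. The last lines compare the conditioning of the monomial and Fourier design matrices on a uniform grid over $[0, 2\pi)$.

```python
import numpy as np

def poly_features(x, M):
    """Monomials 1, x, ..., x^(M-1)."""
    return np.vander(x, M, increasing=True)

def rbf_features(x, centres, s):
    """Gaussian bumps exp(-(x - mu_m)^2 / (2 s^2)) for scalar inputs."""
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2.0 * s**2))

def fourier_features(x, K):
    """Columns sin(m x), cos(m x) for m = 1..K."""
    cols = []
    for m in range(1, K + 1):
        cols += [np.sin(m * x), np.cos(m * x)]
    return np.column_stack(cols)

x = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
Phi_poly = poly_features(x, 10)
Phi_rbf = rbf_features(x, centres=x[::20], s=0.5)   # centres placed on a subset of the inputs
Phi_four = fourier_features(x, 5)

cond_poly = np.linalg.cond(Phi_poly)
cond_four = np.linalg.cond(Phi_four)
print(cond_poly, cond_four)
```

On this grid the Fourier columns are mutually orthogonal with equal norm, so their condition number is close to 1, while the monomial basis is far worse conditioned.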

Kernel ridge regression

Ridge regression with feature map $\boldsymbol\phi$ has the dual form (Woodbury identity):

$$\hat{\mathbf{y}}_\star = K(\mathbf{x}_\star, X)\,(K(X,X) + \lambda I)^{-1}\mathbf{y},$$

where $K(\mathbf{x}, \mathbf{x}') = \boldsymbol\phi(\mathbf{x})^\top\boldsymbol\phi(\mathbf{x}')$ is the kernel. This is the road into kernel methods and Gaussian process regression — the SVM (deck 12) is the classification analogue.
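To see that the primal and dual forms agree, the sketch below computes both predictions with an explicit polynomial feature map. The data, the test inputs and $\lambda$ are assumed values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
N, M, lam = 12, 4, 0.3
X = rng.uniform(-1.0, 1.0, size=N)
y = np.sin(3.0 * X) + rng.normal(0.0, 0.1, size=N)
x_star = np.array([0.25, -0.7])                 # two test inputs

def phi(x):
    """Explicit feature map (1, x, x^2, x^3), one row per input."""
    return np.vander(np.atleast_1d(x), M, increasing=True)

Phi, Phi_star = phi(X), phi(x_star)

# Primal form: solve an M x M system for theta, then predict.
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)
pred_primal = Phi_star @ theta

# Dual form: solve an N x N system using only inner products of features.
K = Phi @ Phi.T                                 # K(X, X)
K_star = Phi_star @ Phi.T                       # K(x_star, X)
pred_dual = K_star @ np.linalg.solve(K + lam * np.eye(N), y)

print(pred_primal, pred_dual)
```

The dual form only touches the features through `K` and `K_star`, so any kernel function can replace the explicit products, including ones whose feature space is infinite-dimensional.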

06

Bayesian Linear Regression: Prior & Posterior

Prior $\boldsymbol\theta\sim\mathcal N(\mathbf m_0, S_0)$; likelihood $\mathbf y\mid\boldsymbol\theta\sim\mathcal N(\Phi\boldsymbol\theta, \sigma^2 I)$. Both are Gaussian in $\boldsymbol\theta$, so the posterior is Gaussian too (deck 06 conjugacy).

$$\boldsymbol\theta\mid\mathbf{y}\sim\mathcal N(\mathbf m_N, S_N), \qquad S_N^{-1} = S_0^{-1} + \frac{1}{\sigma^2}\Phi^\top\Phi,$$

$$\mathbf m_N = S_N\Bigl(S_0^{-1}\mathbf m_0 + \frac{1}{\sigma^2}\Phi^\top \mathbf{y}\Bigr).$$

Two specialisations:

Zero-mean isotropic prior $\mathbf m_0=\mathbf 0$, $S_0 = \tau^2 I$: $\mathbf m_N = (\Phi^\top\Phi + (\sigma^2/\tau^2)I)^{-1}\Phi^\top\mathbf{y}$. The posterior mean coincides with the MAP / ridge estimate of slide 04.
Flat prior ($S_0^{-1}\to 0$, i.e. $\tau^2\to\infty$ in the isotropic case): $\mathbf m_N \to \hat{\boldsymbol\theta}_{\text{ML}}$.

Why this is one of the most useful formulae in ML

$S_N^{-1}$ is the sum of two precision matrices: the prior precision and the data precision $\Phi^\top\Phi / \sigma^2$. Precisions add — that's why uncertainty (variance) shrinks as data arrives. Every Gaussian Bayesian update has this shape.
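The update can be written in a few lines. This sketch (the helper name `posterior` and all numerical values are assumptions) also checks two consequences of "precisions add": processing the data in two batches gives the same posterior as one batch, and the zero-mean isotropic prior reproduces ridge.

```python
import numpy as np

def posterior(Phi, y, m0, S0, sigma):
    """Return (m_N, S_N) for prior N(m0, S0) and noise variance sigma^2."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + Phi.T @ Phi / sigma**2      # prior precision + data precision
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + Phi.T @ y / sigma**2)
    return mN, SN

rng = np.random.default_rng(6)
N, M, sigma, tau = 40, 3, 0.3, 2.0
Phi = rng.normal(size=(N, M))
y = Phi @ np.array([1.0, -1.0, 0.5]) + rng.normal(0.0, sigma, size=N)
m0, S0 = np.zeros(M), tau**2 * np.eye(M)

# Batch update with all N points.
mN, SN = posterior(Phi, y, m0, S0, sigma)

# Sequential update: the posterior after the first half is the prior for the second half.
m_half, S_half = posterior(Phi[:20], y[:20], m0, S0, sigma)
m_seq, S_seq = posterior(Phi[20:], y[20:], m_half, S_half, sigma)

# Zero-mean isotropic prior: the posterior mean equals the ridge estimate.
lam = sigma**2 / tau**2
theta_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

print(mN, m_seq, theta_ridge)
```

Explicit inverses keep the code close to the formulae; for larger $M$ a Cholesky-based solve would be the more usual choice.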

07

Posterior Predictive Distribution

For a new input $\mathbf{x}_\star$, integrating out $\boldsymbol\theta$ under the posterior yields a Gaussian predictive distribution:

$$p(y_\star\mid\mathbf{x}_\star, \mathbf{y}) = \mathcal N\bigl(\boldsymbol\phi(\mathbf{x}_\star)^\top\mathbf m_N,\; \sigma^2 + \boldsymbol\phi(\mathbf{x}_\star)^\top S_N \boldsymbol\phi(\mathbf{x}_\star)\bigr).$$

Mean and variance split cleanly:

$\boldsymbol\phi(\mathbf{x}_\star)^\top\mathbf m_N$ — the posterior mean prediction, same as plug-in MAP.
$\sigma^2$ — aleatoric uncertainty: irreducible noise.
$\boldsymbol\phi(\mathbf{x}_\star)^\top S_N \boldsymbol\phi(\mathbf{x}_\star)$ — epistemic uncertainty: how much we still don't know about $\boldsymbol\theta$. This term grows at inputs far from any training point and shrinks toward zero where data is plentiful.
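The split into aleatoric and epistemic parts can be made concrete. In this sketch the quadratic feature map, the training interval $[-1, 0]$ and the two test inputs are assumptions chosen so that one test point lies inside the data and one lies far outside it.

```python
import numpy as np

def features(x):
    """Quadratic features (1, x, x^2); an illustrative choice."""
    return np.vander(np.atleast_1d(x), 3, increasing=True)

def predictive(Phi_star, mN, SN, sigma):
    """Return predictive mean, total variance and its epistemic part."""
    mean = Phi_star @ mN
    # phi(x*)^T S_N phi(x*) evaluated row by row.
    epistemic = np.einsum("ij,jk,ik->i", Phi_star, SN, Phi_star)
    return mean, sigma**2 + epistemic, epistemic

rng = np.random.default_rng(7)
sigma, tau = 0.2, 1.0
x = rng.uniform(-1.0, 0.0, size=40)               # training inputs confined to [-1, 0]
Phi = features(x)
y = 0.5 - x + 0.3 * x**2 + rng.normal(0.0, sigma, size=40)

# Posterior for the zero-mean isotropic prior N(0, tau^2 I).
SN = np.linalg.inv(np.eye(3) / tau**2 + Phi.T @ Phi / sigma**2)
mN = SN @ (Phi.T @ y / sigma**2)

x_star = np.array([-0.5, 3.0])                    # inside the data, then far outside
mean, var, epistemic = predictive(features(x_star), mN, SN, sigma)
print(mean, var, epistemic)
```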

08

Interactive: Posterior Shrink as Data Arrives

Click on the canvas to add an $(x, y)$ point. The black curve is the posterior mean; the purple band is the 95% predictive interval. Each point pulls the mean toward itself and narrows the band locally. The demo uses RBF features by default; switch to polynomial features in the dropdown.

Controls: noise $\sigma$ (initial value 0.20), prior $\tau$ (initial value 1.00). Readouts: $N$ (initially 0), mean $|m_N|$, posterior width.

With no data, the prior dominates: the band is uniformly wide. With many points, the band's half-width collapses toward roughly $2\sigma$ ($1.96\sigma$ for a 95% interval), the irreducible noise from slide 07.
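The same behaviour can be reproduced offline. The sketch below is not the demo's code: it assumes RBF features with nine evenly spaced centres, the slider values $\sigma = 0.2$ and $\tau = 1$, and uniformly drawn inputs. It tracks the 95% half-width at $x_\star = 0$ as the first $n$ points are included. The predictive variance depends only on the inputs, so no targets are needed.

```python
import numpy as np

def rbf(x, centres, s):
    """RBF design matrix for scalar inputs."""
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2.0 * s**2))

rng = np.random.default_rng(8)
sigma, tau, s = 0.2, 1.0, 0.25
centres = np.linspace(-1.0, 1.0, 9)              # assumed centre placement
x_all = rng.uniform(-1.0, 1.0, size=5000)
Phi_all = rbf(x_all, centres, s)
phi_star = rbf(np.array([0.0]), centres, s)[0]   # features at x_star = 0

half_widths = []
for n in (0, 5, 50, 5000):
    Phi = Phi_all[:n]                            # nested data sets: the first n points
    SN = np.linalg.inv(np.eye(len(centres)) / tau**2 + Phi.T @ Phi / sigma**2)
    pred_var = sigma**2 + phi_star @ SN @ phi_star
    half_widths.append(1.96 * np.sqrt(pred_var))
    print(n, half_widths[-1])

print("noise floor:", 1.96 * sigma)
```

Because each added point contributes a positive semi-definite term to the precision, the half-width can only stay the same or shrink as $n$ grows.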

09

Marginal Likelihood & Model Selection

The Bayesian evidence (deck 08, slide 10) for linear regression under the zero-mean isotropic prior $\boldsymbol\theta\sim\mathcal N(\mathbf 0, \tau^2 I)$ has a closed form:

$$p(\mathbf{y}\mid \Phi, \sigma^2, \tau^2) = \mathcal N\bigl(\mathbf{0},\, \sigma^2 I + \tau^2 \Phi\Phi^\top\bigr).$$

This is a Gaussian on $\mathbf{y}$ with covariance $K + \sigma^2 I$, where $K = \tau^2 \Phi\Phi^\top$ is the kernel matrix again (scaled by the prior variance). Maximising the log-evidence over $\sigma^2, \tau^2$ performs model selection automatically. This is type-II maximum likelihood, also known as empirical Bayes, the engine behind Gaussian process hyperparameter learning.

Automatic Occam's razor

A flexible model (large $\tau^2$) can fit the data well but spreads the prior mass thinly, so the evidence integral is small. A rigid model (small $\tau^2$) concentrates the mass but may fit poorly. The maximum lies in between, and finding it needs no held-out data.
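A sketch of type-II maximum likelihood by grid search over $\tau^2$, with $\sigma^2$ held at its true value for simplicity. The helper name `log_evidence`, the polynomial features, the grid and the simulated data are all assumptions made for the example.

```python
import numpy as np

def log_evidence(Phi, y, sigma2, tau2):
    """log N(y | 0, sigma2 I + tau2 Phi Phi^T)."""
    N = len(y)
    C = sigma2 * np.eye(N) + tau2 * Phi @ Phi.T
    _, logdet = np.linalg.slogdet(C)              # log-determinant, numerically stable
    quad = y @ np.linalg.solve(C, y)              # y^T C^{-1} y
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + quad)

rng = np.random.default_rng(9)
N, M, sigma = 30, 6, 0.2
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.vander(x, M, increasing=True)
theta = rng.normal(0.0, 1.0, size=M)              # parameters drawn with tau = 1 (assumed)
y = Phi @ theta + rng.normal(0.0, sigma, size=N)

tau2_grid = np.logspace(-4, 4, 33)
scores = np.array([log_evidence(Phi, y, sigma**2, t2) for t2 in tau2_grid])
best = int(np.argmax(scores))
print(tau2_grid[best], scores[best])
```

Both ends of the grid score poorly: a tiny $\tau^2$ cannot explain the signal, and a huge $\tau^2$ pays a large log-determinant penalty. In practice the grid would be replaced by gradient-based optimisation over both hyperparameters.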

10

Cheat Sheet

Estimator	Formula
$\hat{\boldsymbol\theta}_{\text{ML}}$	$(\Phi^\top\Phi)^{-1}\Phi^\top\mathbf{y}$ (or $\Phi^+\mathbf y$ for rank-deficient $\Phi$)
$\hat{\boldsymbol\theta}_{\text{MAP / ridge}}$	$(\Phi^\top\Phi + \lambda I)^{-1}\Phi^\top\mathbf y$, $\lambda = \sigma^2/\tau^2$
Posterior precision	$S_N^{-1} = S_0^{-1} + \sigma^{-2}\Phi^\top\Phi$
Posterior mean	$\mathbf m_N = S_N(S_0^{-1}\mathbf m_0 + \sigma^{-2}\Phi^\top\mathbf y)$
Predictive mean	$\boldsymbol\phi(\mathbf x_\star)^\top \mathbf m_N$
Predictive variance	$\sigma^2 + \boldsymbol\phi(\mathbf x_\star)^\top S_N \boldsymbol\phi(\mathbf x_\star)$
Hat matrix	$H = \Phi(\Phi^\top\Phi)^{-1}\Phi^\top$, idempotent, symmetric
DoF	$\mathrm{tr}(H) = \mathrm{rank}(\Phi)$
Dual / kernel form	$\hat{\mathbf y}_\star = K(\mathbf x_\star, X)(K(X,X)+\lambda I)^{-1}\mathbf y$