Least squares as projection, ridge as a Gaussian prior, and the full Bayesian posterior — all from Part I in five steps.
$$y_n = \boldsymbol\theta^\top \boldsymbol\phi(\mathbf{x}_n) + \varepsilon_n, \qquad \varepsilon_n \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2).$$
"Linear" refers to linearity in $\boldsymbol\theta$, not in $\mathbf{x}$. Quadratic, sinusoidal and RBF regression all sit inside this model with the right $\boldsymbol\phi$.
Stack the data: $\mathbf{y} = (y_1,\dots,y_N)^\top$, $\Phi = [\boldsymbol\phi(\mathbf{x}_1)^\top; \dots; \boldsymbol\phi(\mathbf{x}_N)^\top]\in\mathbb{R}^{N\times M}$. Then
$$\mathbf{y} = \Phi\boldsymbol\theta + \boldsymbol\varepsilon, \qquad \boldsymbol\varepsilon\sim\mathcal{N}(\mathbf{0}, \sigma^2 I).$$
$\Phi$ is the design matrix. The entire chapter is a study of this single equation.
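To make the stacking concrete, here is a minimal sketch that builds a design matrix from a feature map. The helper name `poly_features`, the inputs and the parameter vector are all illustrative assumptions, not part of the deck.

```python
import numpy as np

def poly_features(x, M):
    """Hypothetical helper: map scalar inputs to the M monomial
    features (1, x, ..., x^(M-1)), one row per input."""
    x = np.asarray(x, dtype=float)
    # increasing=True orders the columns as x^0, x^1, ..., x^(M-1)
    return np.vander(x, N=M, increasing=True)

x = np.array([0.0, 1.0, 2.0, 3.0])   # N = 4 scalar inputs
Phi = poly_features(x, M=3)          # design matrix, shape (N, M) = (4, 3)

theta = np.array([1.0, -2.0, 0.5])   # illustrative parameter vector
y_clean = Phi @ theta                # noise-free targets: Phi @ theta
```

The model is quadratic in `x` but `y_clean` is still a linear function of `theta`, which is all the rest of the derivation relies on.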
Under the Gaussian noise model the log-likelihood is
$$\log p(\mathbf{y}\mid\boldsymbol\theta) = -\frac{1}{2\sigma^2}\|\mathbf{y}-\Phi\boldsymbol\theta\|^2 + \text{const.}$$
So maximising likelihood is exactly minimising the sum of squared residuals. Setting the gradient (deck 05) to zero:
$$\nabla_{\boldsymbol\theta}\|\mathbf{y}-\Phi\boldsymbol\theta\|^2 = -2\Phi^\top(\mathbf{y}-\Phi\boldsymbol\theta) = 0 \;\Longrightarrow\; \Phi^\top\Phi\,\boldsymbol\theta = \Phi^\top\mathbf{y}.$$
This is the normal equation. If $\Phi^\top\Phi$ is invertible (the columns of $\Phi$ are linearly independent):
$$\hat{\boldsymbol\theta}_{\text{ML}} = (\Phi^\top\Phi)^{-1}\Phi^\top\mathbf{y}.$$
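A short numerical sketch of the closed form, on synthetic data chosen purely for illustration. It solves the normal equation with a linear solve rather than an explicit inverse, and compares the result with NumPy's SVD-based `lstsq`, which is usually the more robust choice in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 3
Phi = rng.normal(size=(N, M))            # random design, full column rank
theta_true = np.array([1.0, -2.0, 0.5])  # illustrative ground truth
y = Phi @ theta_true + 0.1 * rng.normal(size=N)

# Solve Phi^T Phi theta = Phi^T y without forming the inverse.
theta_ne = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Same minimiser from an SVD-based least-squares routine.
theta_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Gradient of the squared error at the solution; should vanish.
grad = -2.0 * Phi.T @ (y - Phi @ theta_ne)
```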
If $\Phi^\top\Phi$ is singular (the columns of $\Phi$ are linearly dependent, for example when $M > N$), the solution is no longer unique; the set of minimisers is an affine subspace. The minimum-norm solution is $\hat{\boldsymbol\theta} = \Phi^+ \mathbf{y}$ where $\Phi^+ = V\Sigma^+ U^\top$ is the Moore-Penrose pseudoinverse (deck 04). Equivalently, it is the limit of the ridge solution as $\lambda\to 0^+$.
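The rank-deficient case can be checked on a toy design with a duplicated column. This sketch builds $\Phi^+ = V\Sigma^+U^\top$ from the SVD by hand; the data, the relative tolerance `1e-10` and the value of `lam` are assumptions made for the example.

```python
import numpy as np

# Rank-deficient design: the third column duplicates the first.
Phi = np.array([[1.0, 0.0, 1.0],
                [1.0, 1.0, 1.0],
                [1.0, 2.0, 1.0],
                [1.0, 3.0, 1.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])       # exactly 1 + 2x

# Pseudoinverse via the SVD: invert only the non-negligible singular values.
U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
keep = s > 1e-10 * s.max()
s_plus = np.zeros_like(s)
s_plus[keep] = 1.0 / s[keep]
Phi_plus = Vt.T @ np.diag(s_plus) @ U.T
theta_min = Phi_plus @ y                 # minimum-norm minimiser

# Moving along the null space of Phi gives another minimiser, with larger norm.
null_dir = np.array([1.0, 0.0, -1.0])    # Phi @ null_dir == 0
theta_other = theta_min + null_dir

# Ridge with a small lambda approaches the minimum-norm solution.
lam = 1e-6
theta_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ y)
```

Here the intercept weight is shared equally between the two identical columns, which is what minimising the norm over the affine solution set amounts to.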
The fitted values are
$$\hat{\mathbf{y}} = \Phi\hat{\boldsymbol\theta}_{\text{ML}} = \Phi(\Phi^\top\Phi)^{-1}\Phi^\top \mathbf{y}.$$
The matrix multiplying $\mathbf{y}$, $H = \Phi(\Phi^\top\Phi)^{-1}\Phi^\top$, is exactly the orthogonal projection onto $\mathrm{Col}(\Phi)$ from deck 03. So:
$\hat{\mathbf{y}}$ is the closest point in the column space $\mathrm{Col}(\Phi)$ to the observation vector $\mathbf{y}$. The residual $\mathbf{y}-\hat{\mathbf{y}}$ is perpendicular to $\mathrm{Col}(\Phi)$ — that's why $\Phi^\top(\mathbf{y}-\Phi\hat{\boldsymbol\theta}_{\text{ML}}) = 0$.
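The projection picture can be verified numerically. The sketch below, on random data assumed only for illustration, forms the hat matrix and checks the properties that define an orthogonal projector, plus the orthogonality of the residual.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 20, 4
Phi = rng.normal(size=(N, M))
y = rng.normal(size=N)

# Hat matrix H = Phi (Phi^T Phi)^{-1} Phi^T, built with a linear solve.
H = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)
y_hat = H @ y
residual = y - y_hat

# Any other point of Col(Phi) is at least as far from y as y_hat is.
theta_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
theta_other = theta_ls + 0.1 * np.ones(M)
dist_ls = np.linalg.norm(y - Phi @ theta_ls)
dist_other = np.linalg.norm(y - Phi @ theta_other)
```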
Once you see this, every property of linear regression follows by reading the deck-03 picture:
Place a prior $\boldsymbol\theta \sim \mathcal{N}(\mathbf{0}, \tau^2 I)$. The MAP estimator (deck 08) is
$$\hat{\boldsymbol\theta}_{\text{MAP}} = \arg\min_{\boldsymbol\theta}\;\frac{1}{2\sigma^2}\|\mathbf{y}-\Phi\boldsymbol\theta\|^2 + \frac{1}{2\tau^2}\|\boldsymbol\theta\|^2.$$
Setting the gradient to zero:
$$\hat{\boldsymbol\theta}_{\text{MAP}} = \Bigl(\Phi^\top\Phi + \frac{\sigma^2}{\tau^2} I\Bigr)^{-1}\Phi^\top\mathbf{y}.$$
This is ridge regression with $\lambda = \sigma^2/\tau^2$. Three observations:
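As a numerical sanity check of the ridge formula, this sketch computes the MAP estimate, confirms that the gradient of the negative log-posterior vanishes there, and compares its norm with the unregularised estimate. The data and the values of `sigma2` and `tau2` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 30, 5
Phi = rng.normal(size=(N, M))
y = rng.normal(size=N)

sigma2, tau2 = 0.25, 1.0     # assumed noise and prior variances
lam = sigma2 / tau2          # ridge strength implied by the prior

theta_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)
theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Gradient of the negative log-posterior at theta_map; should vanish.
grad = -Phi.T @ (y - Phi @ theta_map) / sigma2 + theta_map / tau2

norm_map = np.linalg.norm(theta_map)
norm_ml = np.linalg.norm(theta_ml)   # ridge shrinks the estimate toward 0
```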
The same machinery works for any $\boldsymbol\phi$. Three useful choices:
| Feature map | Form | Comment |
|---|---|---|
| Polynomial | $\boldsymbol\phi(x) = (1, x, x^2, \dots, x^{M-1})$ | Cheap to evaluate; the monomial basis is numerically ill-conditioned for large $M$. Orthogonal polynomials fix this. |
| Radial basis | $\phi_m(x) = \exp\bigl(-\|x-\mu_m\|^2 / (2s^2)\bigr)$ | Localised features; a common choice is to place the centres $\mu_m$ on the data. |
| Fourier | $\phi_m(x) \in \{\sin(mx), \cos(mx)\}$ | Orthogonal on $[0, 2\pi]$ (orthonormal after scaling by $1/\sqrt{\pi}$); well-conditioned. |
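
The three feature maps can be sketched as small functions and their conditioning compared. The function names, the grid, the number of features and the constant column added to the Fourier map are all assumptions made for this example.

```python
import numpy as np

def poly_features(x, M):
    """Monomials 1, x, ..., x^(M-1)."""
    return np.vander(x, N=M, increasing=True)

def rbf_features(x, centres, s):
    """Gaussian bumps of width s centred at `centres`."""
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))

def fourier_features(x, K):
    """Columns 1, cos(x), sin(x), ..., cos(Kx), sin(Kx)."""
    cols = [np.ones_like(x)]
    for m in range(1, K + 1):
        cols += [np.cos(m * x), np.sin(m * x)]
    return np.stack(cols, axis=1)

# Equispaced grid on [0, 2*pi), endpoint excluded.
x = np.linspace(0.0, 2 * np.pi, 64, endpoint=False)

Phi_poly = poly_features(x, 11)                    # degree-10 polynomial
Phi_rbf = rbf_features(x, centres=x[::8], s=0.5)   # 8 centres taken from the data
Phi_four = fourier_features(x, 5)                  # 11 Fourier features

cond_poly = np.linalg.cond(Phi_poly)   # very large for the monomial basis
cond_four = np.linalg.cond(Phi_four)   # close to 1 on this grid
```

On an equispaced grid the Fourier columns are exactly orthogonal, so the condition number is just the ratio of the largest to the smallest column norm.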
Ridge regression with feature map $\boldsymbol\phi$ has the dual form (Woodbury identity):
$$\hat{\mathbf{y}}_\star = K(\mathbf{x}_\star, X)\,(K(X,X) + \lambda I)^{-1}\mathbf{y},$$
where $K(\mathbf{x}, \mathbf{x}') = \boldsymbol\phi(\mathbf{x})^\top\boldsymbol\phi(\mathbf{x}')$ is the kernel. This is the road into kernel methods and Gaussian process regression — the SVM (deck 12) is the classification analogue.
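The equivalence of the primal and dual forms can be checked directly. This sketch uses random features with more features than data points, a setting assumed here because it is where the $N\times N$ dual solve is the cheaper one.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 8, 20                      # fewer data points than features
Phi = rng.normal(size=(N, M))     # rows are phi(x_n)
y = rng.normal(size=N)
phi_star = rng.normal(size=M)     # features of the test input
lam = 0.1

# Primal: solve an M x M system for theta, then predict.
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)
pred_primal = phi_star @ theta

# Dual: only inner products of feature vectors are needed.
K = Phi @ Phi.T                   # K(X, X), shape (N, N)
k_star = Phi @ phi_star           # K(x_star, X), shape (N,)
pred_dual = k_star @ np.linalg.solve(K + lam * np.eye(N), y)
```

Because the dual form touches the features only through `K` and `k_star`, the explicit feature map can be replaced by any kernel function that computes those inner products.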
Prior $\boldsymbol\theta\sim\mathcal N(\mathbf m_0, S_0)$; likelihood $\mathbf y\mid\boldsymbol\theta\sim\mathcal N(\Phi\boldsymbol\theta, \sigma^2 I)$. Both are Gaussian in $\boldsymbol\theta$, so the posterior is Gaussian too (deck 06 conjugacy):
$$\boldsymbol\theta\mid\mathbf{y}\sim\mathcal N(\mathbf m_N, S_N), \qquad S_N^{-1} = S_0^{-1} + \frac{1}{\sigma^2}\Phi^\top\Phi,$$
$$\mathbf m_N = S_N\Bigl(S_0^{-1}\mathbf m_0 + \frac{1}{\sigma^2}\Phi^\top \mathbf{y}\Bigr).$$
Two specialisations:
$S_N^{-1}$ is the sum of two precision matrices: the prior precision and the data precision $\Phi^\top\Phi / \sigma^2$. Precisions add — that's why uncertainty (variance) shrinks as data arrives. Every Gaussian Bayesian update has this shape.
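A minimal sketch of the update, with a hypothetical helper `posterior` and synthetic data. It also checks two consequences of the formulas: a zero-mean isotropic prior reproduces the ridge estimate, and processing the data in two batches (using the first posterior as the next prior) gives the same result as one batch update.

```python
import numpy as np

def posterior(Phi, y, m0, S0, sigma2):
    """Hypothetical helper: Gaussian posterior (m_N, S_N) over theta."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + Phi.T @ Phi / sigma2        # precisions add
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + Phi.T @ y / sigma2)
    return mN, SN

rng = np.random.default_rng(4)
N, M = 40, 3
Phi = rng.normal(size=(N, M))
y = Phi @ np.array([1.0, -1.0, 0.5]) + 0.3 * rng.normal(size=N)

sigma2, tau2 = 0.09, 2.0
m0, S0 = np.zeros(M), tau2 * np.eye(M)

# Batch update with all N points.
mN, SN = posterior(Phi, y, m0, S0, sigma2)

# Sequential update: the first posterior becomes the second prior.
m_half, S_half = posterior(Phi[:20], y[:20], m0, S0, sigma2)
m_seq, S_seq = posterior(Phi[20:], y[20:], m_half, S_half, sigma2)

# Ridge estimate with lambda = sigma^2 / tau^2.
theta_ridge = np.linalg.solve(Phi.T @ Phi + (sigma2 / tau2) * np.eye(M), Phi.T @ y)
```

Explicit inverses keep the sketch close to the formulas; for larger $M$ a Cholesky-based solve would be the more careful choice.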
For a new input $\mathbf{x}_\star$, integrating out $\boldsymbol\theta$ under the posterior yields a Gaussian predictive distribution:
$$p(y_\star\mid\mathbf{x}_\star, \mathbf{y}) = \mathcal N\bigl(\boldsymbol\phi(\mathbf{x}_\star)^\top\mathbf m_N,\; \sigma^2 + \boldsymbol\phi(\mathbf{x}_\star)^\top S_N \boldsymbol\phi(\mathbf{x}_\star)\bigr).$$
Mean and variance split cleanly:
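The split can be seen numerically. This sketch computes the predictive mean and the two variance terms separately, at one test input inside the training range and one outside it. The RBF centres, widths, variances and training data are assumptions chosen for illustration.

```python
import numpy as np

def rbf_features(x, centres, s):
    """Gaussian bumps of width s centred at `centres`."""
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(5)
x_train = rng.uniform(-1.0, 1.0, size=25)            # data only on [-1, 1]
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=25)

centres = np.linspace(-3.0, 3.0, 13)
s, sigma2, tau2 = 0.5, 0.01, 1.0

# Posterior for the prior N(0, tau2 * I).
Phi = rbf_features(x_train, centres, s)
SN = np.linalg.inv(np.eye(len(centres)) / tau2 + Phi.T @ Phi / sigma2)
mN = SN @ (Phi.T @ y_train / sigma2)

x_star = np.array([0.0, 2.5])                        # inside vs. outside the data
Phi_star = rbf_features(x_star, centres, s)

pred_mean = Phi_star @ mN                            # phi(x*)^T m_N
# phi(x*)^T S_N phi(x*) for each test input: parameter uncertainty.
param_var = np.einsum("ij,jk,ik->i", Phi_star, SN, Phi_star)
pred_var = sigma2 + param_var                        # noise + parameter uncertainty
```

The noise term `sigma2` is the same everywhere, while `param_var` depends on the test input and is much larger at `x = 2.5`, where no training data constrain the nearby weights.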
Click on the canvas to add an $(x, y)$ point. The black curve is the posterior mean; the purple band is the 95% predictive interval. Each point pulls the mean toward itself and narrows the band locally. The demo uses RBF features by default; switch to polynomial features in the dropdown.
With no data, the prior dominates: the band is uniformly wide. With many points, the band's half-width shrinks toward roughly $2\sigma$ (more precisely $1.96\sigma$ for a 95% interval), the irreducible noise from slide 07.
With the prior $\boldsymbol\theta \sim \mathcal{N}(\mathbf{0}, \tau^2 I)$, the Bayesian evidence (deck 08, slide 10) for linear regression has a closed form:
$$p(\mathbf{y}\mid \Phi, \sigma^2, \tau^2) = \mathcal N\bigl(\mathbf{0},\, \sigma^2 I + \tau^2 \Phi\Phi^\top\bigr).$$
This is a Gaussian in $\mathbf{y}$ with covariance $K + \sigma^2 I$, where $K = \tau^2 \Phi\Phi^\top$ is the kernel matrix again, scaled by the prior variance. Maximising the log-evidence over $\sigma^2, \tau^2$ performs model selection automatically. This is type-II maximum likelihood, also known as empirical Bayes, the engine behind Gaussian process hyperparameter learning.
A flexible model ($\tau^2$ large) can fit the data well but spreads its prior mass thinly, so the evidence integral is small. A rigid model ($\tau^2$ small) concentrates its mass but may fit poorly. The evidence peaks in between, and finding that peak needs no held-out data.
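A minimal sketch of evidence maximisation, assuming a simple grid search over $\tau^2$ with $\sigma^2$ held fixed; a gradient-based optimiser over both would be the more usual choice. The helper name `log_evidence`, the data and the grid are illustrative.

```python
import numpy as np

def log_evidence(Phi, y, sigma2, tau2):
    """log N(y | 0, sigma2 * I + tau2 * Phi Phi^T), via a Cholesky factor."""
    N = len(y)
    C = sigma2 * np.eye(N) + tau2 * Phi @ Phi.T
    L = np.linalg.cholesky(C)                     # C = L L^T, L lower triangular
    alpha = np.linalg.solve(L, y)                 # y^T C^{-1} y = alpha^T alpha
    log_det = 2.0 * np.sum(np.log(np.diag(L)))    # log |C|
    return -0.5 * (alpha @ alpha + log_det + N * np.log(2 * np.pi))

rng = np.random.default_rng(6)
Phi = rng.normal(size=(30, 4))
theta_true = np.array([1.0, -1.0, 0.5, 2.0])      # illustrative ground truth
y = Phi @ theta_true + 0.5 * rng.normal(size=30)  # noise variance 0.25

# Grid search over the prior variance with the noise variance fixed.
tau2_grid = np.logspace(-3, 3, 25)
scores = np.array([log_evidence(Phi, y, 0.25, t) for t in tau2_grid])
best_tau2 = tau2_grid[np.argmax(scores)]
```

The scores fall off at both ends of the grid: very small `tau2` cannot explain the size of `y`, and very large `tau2` pays a log-determinant penalty for prior mass that the data do not need.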
| Estimator | Formula |
|---|---|
| $\hat{\boldsymbol\theta}_{\text{ML}}$ | $(\Phi^\top\Phi)^{-1}\Phi^\top\mathbf{y}$ (or $\Phi^+\mathbf y$ for rank-deficient $\Phi$) |
| $\hat{\boldsymbol\theta}_{\text{MAP / ridge}}$ | $(\Phi^\top\Phi + \lambda I)^{-1}\Phi^\top\mathbf y$, $\lambda = \sigma^2/\tau^2$ |
| Posterior covariance | $S_N = \bigl(S_0^{-1} + \sigma^{-2}\Phi^\top\Phi\bigr)^{-1}$ |
| Posterior mean | $\mathbf m_N = S_N(S_0^{-1}\mathbf m_0 + \sigma^{-2}\Phi^\top\mathbf y)$ |
| Predictive mean | $\boldsymbol\phi(\mathbf x_\star)^\top \mathbf m_N$ |
| Predictive variance | $\sigma^2 + \boldsymbol\phi(\mathbf x_\star)^\top S_N \boldsymbol\phi(\mathbf x_\star)$ |
| Hat matrix | $H = \Phi(\Phi^\top\Phi)^{-1}\Phi^\top$, idempotent, symmetric |
| DoF | $\mathrm{tr}(H) = \mathrm{rank}(\Phi)$ |
| Dual / kernel form | $\hat{\mathbf y}_\star = K(\mathbf x_\star, X)(K(X,X)+\lambda I)^{-1}\mathbf y$ |