From the sum rule, product rule, and Bayes' theorem to the multivariate Gaussian's marginals and conditionals: the toolkit of every probabilistic ML model.
A sample space $\Omega$ is the set of possible outcomes of an experiment; an event is a subset $A\subseteq\Omega$. A probability measure $P$ assigns a number $P(A)\in[0,1]$ to every event, with $P(\Omega)=1$ and $P\bigl(\bigcup_i A_i\bigr) = \sum_i P(A_i)$ for disjoint $A_i$.
Most of the book glosses over the measure-theoretic details because, in practice, ML works with two well-behaved cases:
Discrete (probability mass function): $p(x) = P(X = x) \in [0,1]$ is the probability of the outcome $x$ itself.
$$\sum_{x}p(x) = 1.$$
Examples: Bernoulli, Binomial, Categorical, Poisson.
Continuous (probability density function): $p(x) \ge 0$ is a density, not a probability, and can exceed 1. The probability of a region is the integral:
$$P(X\in A) = \int_A p(x)\,dx, \qquad \int p(x)\,dx = 1.$$
Examples: Gaussian, Exponential, Beta, Gamma.
In both cases we'll write $p(x)$ and let context disambiguate.
The cumulative distribution function $F(x) = P(X\le x)$ is well-defined for both cases. In the continuous case $p = F'$ wherever $F$ is differentiable.
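A minimal numerical sketch of the density-versus-probability distinction, using a narrow Gaussian as the example. The values `mu = 0.0` and `sigma = 0.1` and the helper names are illustrative assumptions, not taken from the text.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_cdf(x, mu, sigma):
    """F(x) = P(X <= x), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu, sigma = 0.0, 0.1                      # illustrative values
peak = gaussian_pdf(mu, mu, sigma)        # about 3.99: a density can exceed 1

# Midpoint-rule integral over [-1, 1], i.e. ten standard deviations each side.
n = 20_000
width = 2.0 / n
total = sum(gaussian_pdf(-1.0 + (i + 0.5) * width, mu, sigma)
            for i in range(n)) * width    # close to 1

# p = F': a central finite difference of the CDF reproduces the density.
x0, h = 0.05, 1e-5
slope = (gaussian_cdf(x0 + h, mu, sigma) - gaussian_cdf(x0 - h, mu, sigma)) / (2 * h)
```

The peak value is larger than 1, yet the area under the curve is still 1, which is the only constraint a density has to satisfy.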
All of probabilistic ML is built from three rules: the sum rule, the product rule, and Bayes' theorem. We state them once for a joint $p(x,y)$ and reuse them everywhere.
$$p(x) = \sum_{y} p(x,y) \quad\text{(or}\;\int p(x,y)\,dy\;\text{)}.$$
$$p(x,y) = p(x\mid y)\, p(y) = p(y\mid x)\, p(x).$$
Equate the two product-rule decompositions, set $x=\theta$ and $y=\mathcal D$, and divide by $p(\mathcal D)$ to get Bayes' theorem:
$$\underbrace{p(\theta\mid \mathcal D)}_{\text{posterior}} = \frac{\overbrace{p(\mathcal D\mid \theta)}^{\text{likelihood}}\,\overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(\mathcal D)}_{\text{evidence}}}.$$
$p(\mathcal D)$ is just a normalising constant: $\int p(\mathcal D\mid\theta)\,p(\theta)\,d\theta$. It doesn't depend on $\theta$, which is why posterior maximisation (MAP) doesn't need to compute it.
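The three rules can be checked by hand on a small discrete joint. The sketch below uses a made-up $2\times 2$ table (the numbers are purely illustrative) and verifies that Bayes' theorem recovers the same conditional as dividing the joint by the marginal.

```python
# Hypothetical joint p(x, y) over x in {0, 1} and y in {0, 1}; values are made up.
joint = {(0, 0): 0.30, (0, 1): 0.10,
         (1, 0): 0.20, (1, 1): 0.40}

xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

# Sum rule: marginalise out the other variable.
p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}

# Product rule, rearranged: conditional = joint / marginal.
p_x_given_y = {(x, y): joint[(x, y)] / p_y[y] for x in xs for y in ys}
p_y_given_x = {(y, x): joint[(x, y)] / p_x[x] for x in xs for y in ys}

# Bayes: p(x | y) = p(y | x) p(x) / p(y), computed without touching the joint.
bayes = {(x, y): p_y_given_x[(y, x)] * p_x[x] / p_y[y] for x in xs for y in ys}
```

Here `p_y[y]` plays the role of the evidence: it rescales each column so that it sums to one, and it does not depend on `x`.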
Pick parameters $\theta$; pick likelihood $p(\mathcal D\mid\theta)$ and prior $p(\theta)$. Bayes hands you $p(\theta\mid\mathcal D)$, which is a complete description of what the data have taught you about $\theta$. All downstream questions — predictions, decisions, uncertainty — reduce to integrals against this posterior.
A coin has unknown head probability $\theta\in[0,1]$. The prior is $\mathrm{Beta}(\alpha_0,\beta_0)$. After observing $h$ heads and $t$ tails, the posterior is $\mathrm{Beta}(\alpha_0+h,\, \beta_0+t)$.
The Beta is the conjugate prior for a Bernoulli likelihood: prior and posterior live in the same family, and the posterior pseudo-counts are just $(\alpha_0+h,\beta_0+t)$. Slide 09 explains why this works.
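The pseudo-count update is short enough to write out in full. This is a sketch under assumed inputs: the function names, the $\mathrm{Beta}(2,2)$ prior and the flip sequence are all illustrative.

```python
def beta_bernoulli_update(alpha0, beta0, flips):
    """Posterior Beta parameters after observing flips (1 = head, 0 = tail)."""
    flips = list(flips)
    heads = sum(1 for f in flips if f == 1)
    tails = sum(1 for f in flips if f == 0)
    return alpha0 + heads, beta0 + tails

def beta_mean(alpha, beta):
    """Posterior mean of theta."""
    return alpha / (alpha + beta)

def beta_mode(alpha, beta):
    """Posterior mode (the MAP estimate); valid for alpha > 1 and beta > 1."""
    return (alpha - 1.0) / (alpha + beta - 2.0)

flips = [1, 1, 0, 1, 0, 1, 1]                          # 5 heads, 2 tails (made up)
alpha_n, beta_n = beta_bernoulli_update(2.0, 2.0, flips)   # (7.0, 4.0)

# Feeding the flips one at a time gives the same posterior as the batch update.
a, b = 2.0, 2.0
for f in flips:
    a, b = beta_bernoulli_update(a, b, [f])
```

Because the update only adds counts, the order of the observations does not matter, and yesterday's posterior can serve as today's prior.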
$$\mathbb{E}_p[f(X)] = \int f(x)\, p(x)\, dx.$$
Linear: $\mathbb{E}[\alpha X + \beta Y] = \alpha\mathbb{E}[X] + \beta\mathbb{E}[Y]$ — with or without independence.
$$\mathbb{V}[X] = \mathbb{E}[(X - \mathbb{E}[X])^2], \qquad \mathrm{Cov}(X,Y) = \mathbb{E}[(X-\mathbb{E}X)(Y-\mathbb{E}Y)].$$
For a random vector $\mathbf{X}$ the covariance is the matrix $\Sigma = \mathbb{E}[(\mathbf{X}-\boldsymbol\mu)(\mathbf{X}-\boldsymbol\mu)^\top]$ — symmetric and PSD.
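Two short derivations follow directly from these definitions. The fixed vector $\mathbf a$ is introduced here only for illustration. Expanding the square and using linearity gives the computational form of the variance, and projecting onto $\mathbf a$ shows why $\Sigma$ is PSD:

$$\mathbb{V}[X] = \mathbb{E}[X^2] - 2\,\mathbb{E}[X]\,\mathbb{E}[X] + \mathbb{E}[X]^2 = \mathbb{E}[X^2] - \mathbb{E}[X]^2,$$
$$\mathbf a^\top\Sigma\,\mathbf a = \mathbb{E}\bigl[(\mathbf a^\top(\mathbf X-\boldsymbol\mu))^2\bigr] = \mathbb{V}[\mathbf a^\top\mathbf X] \ge 0.$$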
$$p(x\mid\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$
$$p(\mathbf{x}\mid\boldsymbol\mu,\Sigma) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\!\left(-\tfrac12 (\mathbf{x}-\boldsymbol\mu)^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\right).$$
Three facts you'll use constantly:
It is the maximum-entropy distribution for a given mean and covariance; suitably normalised sums of many independent, finite-variance terms tend to a Gaussian (CLT); and the family is closed under marginalisation, conditioning, sums of independent Gaussians, and linear maps. That algebraic closure is why deck 09 (Bayesian linear regression) has any closed-form solution at all.
Interactive figure: a bivariate Gaussian with mean $\boldsymbol\mu$, standard deviations $\sigma_x$, $\sigma_y$ and correlation $\rho\in(-1,1)$, showing the density contours, the principal axes (eigenvectors of $\Sigma$) and the marginal densities.
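The contours are ellipses; their axes point along the eigenvectors of $\Sigma$, with lengths proportional to the square roots of the eigenvalues. When $\rho=0$ the axes are aligned with $(x,y)$; increasing $|\rho|$ tilts the ellipse away from the coordinate axes, lengthening one principal axis and shortening the other.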
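The geometry can be reproduced numerically with an eigendecomposition. The sketch below assumes NumPy; the helper names and the values $\sigma_x=2$, $\sigma_y=1$, $\rho\in\{0, 0.8\}$ are illustrative choices.

```python
import numpy as np

def bivariate_cov(sigma_x, sigma_y, rho):
    """2x2 covariance built from standard deviations and correlation."""
    c = rho * sigma_x * sigma_y
    return np.array([[sigma_x**2, c], [c, sigma_y**2]])

def principal_axes(cov):
    """Half-lengths of the 1-sigma ellipse and unit axis directions (columns),
    ordered from the longest axis to the shortest."""
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return np.sqrt(eigvals[order]), eigvecs[:, order]

lengths0, axes0 = principal_axes(bivariate_cov(2.0, 1.0, 0.0))
lengths8, axes8 = principal_axes(bivariate_cov(2.0, 1.0, 0.8))

# Orientation of the major axis in degrees, folded into [0, 180) because an
# eigenvector and its negation describe the same axis.
angle8 = np.degrees(np.arctan2(axes8[1, 0], axes8[0, 0])) % 180.0
```

With $\rho=0$ the half-lengths are simply $(\sigma_x, \sigma_y)$ and the major axis lies along $x$; with $\rho=0.8$ the major axis grows past $\sigma_x$, the minor axis shrinks below $\sigma_y$, and the ellipse tilts by roughly 23 degrees.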
Partition a joint Gaussian $\mathbf{X} = (\mathbf{X}_a, \mathbf{X}_b)$ with
$$\boldsymbol\mu = \begin{bmatrix}\boldsymbol\mu_a\\\boldsymbol\mu_b\end{bmatrix}, \quad \Sigma = \begin{bmatrix}\Sigma_{aa} & \Sigma_{ab}\\\Sigma_{ba} & \Sigma_{bb}\end{bmatrix}.$$
$\mathbf{X}_a \sim \mathcal{N}(\boldsymbol\mu_a, \Sigma_{aa})$. Just throw away the rows and columns you don't want.
$$\mathbf{X}_a \mid \mathbf{X}_b=\mathbf{x}_b \sim \mathcal{N}\bigl(\boldsymbol\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(\mathbf{x}_b-\boldsymbol\mu_b),\; \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\bigr).$$
The conditional mean is the linear regression of $\mathbf{X}_a$ on $\mathbf{X}_b$; the conditional covariance is the Schur complement of $\Sigma_{bb}$ in $\Sigma$.
The Bayesian-linear-regression posterior of deck 09 is exactly the conditional distribution of weights given observed targets, when both are Gaussian. The formulae on this slide give it in one line.
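The conditioning formulae translate almost line by line into NumPy. This is a sketch, not a reference implementation: `gaussian_condition`, `idx_a` and `idx_b` are hypothetical names, and the example numbers are made up.

```python
import numpy as np

def gaussian_condition(mu, cov, idx_a, idx_b, x_b):
    """Mean and covariance of X_a | X_b = x_b for a joint Gaussian N(mu, cov).

    idx_a and idx_b are lists of integer indices selecting the two blocks.
    """
    mu = np.asarray(mu, dtype=float)
    cov = np.asarray(cov, dtype=float)
    mu_a, mu_b = mu[idx_a], mu[idx_b]
    S_aa = cov[np.ix_(idx_a, idx_a)]
    S_ab = cov[np.ix_(idx_a, idx_b)]
    S_bb = cov[np.ix_(idx_b, idx_b)]

    # K = S_ab S_bb^{-1}, obtained with a linear solve rather than an explicit
    # inverse (uses the symmetry of S_bb).
    K = np.linalg.solve(S_bb, S_ab.T).T

    cond_mean = mu_a + K @ (np.asarray(x_b, dtype=float) - mu_b)
    cond_cov = S_aa - K @ S_ab.T              # Schur complement of S_bb
    return cond_mean, cond_cov

# Example: sigma_x = 2, sigma_y = 1, rho = 0.8; condition x on y = 0.
mu = [1.0, -1.0]
cov = [[4.0, 1.6], [1.6, 1.0]]
cond_mean, cond_cov = gaussian_condition(mu, cov, [0], [1], [0.0])
```

In this two-dimensional case the result can be checked by hand: the mean is $1 + 1.6\,(0-(-1)) = 2.6$ and the variance is $4 - 1.6^2 = 1.44 = \sigma_x^2(1-\rho^2)$. The marginal needs no computation at all: it is `mu[idx_a]` and `cov[np.ix_(idx_a, idx_a)]`.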
A prior $p(\theta)$ is conjugate to a likelihood $p(\mathcal D\mid\theta)$ if the posterior is in the same family as the prior. The update is then a simple parameter shift — no integrals.
| Likelihood | Conjugate prior | Update |
|---|---|---|
| Bernoulli | Beta($\alpha,\beta$) | $\alpha\to\alpha+h$, $\beta\to\beta+t$ |
| Gaussian (known $\sigma^2$) | Gaussian on $\mu$ | standard formula in deck 09 |
| Gaussian (known $\mu$) | Inverse-Gamma($a,b$) on $\sigma^2$ | $a\to a+\tfrac{n}{2}$, $b\to b+\tfrac12\sum_i (x_i-\mu)^2$ |
| Multivariate Gaussian | Normal-Inverse-Wishart | Joint update on $\boldsymbol\mu,\Sigma$ |
| Categorical | Dirichlet | $\alpha_k\to\alpha_k + n_k$ |
| Poisson | Gamma($a,b$), rate $b$ | $a\to a+\sum_i x_i$, $b\to b+n$ |
A family with density $p(x\mid\boldsymbol\theta) = h(x)\exp(\boldsymbol\eta(\boldsymbol\theta)^\top T(x) - A(\boldsymbol\theta))$ has a conjugate prior automatically (by inspection of $T$). All the distributions in the table above are exponential-family members.
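As a worked instance, the Bernoulli likelihood can be put in this form and its conjugate prior read off. The prior hyperparameters $\nu$ and $\lambda$ are notation introduced here for illustration:

$$p(x\mid\theta) = \theta^{x}(1-\theta)^{1-x} = \exp\!\Bigl(x\,\log\tfrac{\theta}{1-\theta} + \log(1-\theta)\Bigr),$$

so $h(x)=1$, $T(x)=x$, $\eta(\theta)=\log\tfrac{\theta}{1-\theta}$ and $A(\theta)=-\log(1-\theta)$. A prior of the matching shape is

$$p(\theta)\;\propto\;\exp\bigl(\nu\,\eta(\theta) - \lambda\,A(\theta)\bigr) = \theta^{\nu}(1-\theta)^{\lambda-\nu},$$

which is $\mathrm{Beta}(\nu+1,\;\lambda-\nu+1)$. Multiplying by $n$ Bernoulli likelihoods with $h$ heads and $t = n-h$ tails sends $\nu\to\nu+h$ and $\lambda\to\lambda+n$, which is the Beta update $\alpha\to\alpha+h$, $\beta\to\beta+t$.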
Push-forward of a density under an invertible $\mathbf{y} = \mathbf{g}(\mathbf{x})$ with Jacobian $J_\mathbf{g}$:
$$p_Y(\mathbf{y}) = p_X(\mathbf{g}^{-1}(\mathbf{y}))\,\bigl|\det J_{\mathbf{g}^{-1}}(\mathbf{y})\bigr| = p_X(\mathbf{g}^{-1}(\mathbf{y}))\,\bigl|\det J_{\mathbf{g}}(\mathbf{g}^{-1}(\mathbf{y}))\bigr|^{-1}.$$
For an affine $\mathbf{y} = A\mathbf{x}+\mathbf{c}$ with $A$ invertible:
$$p_Y(\mathbf{y}) = p_X(A^{-1}(\mathbf{y}-\mathbf{c}))\,|\det A|^{-1}.$$
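A one-dimensional sketch of the change-of-variables formula, checked two ways: for $y = e^x$ against the derivative of the CDF, and for an affine map against the Gaussian density it should produce. The base distribution $\mathcal N(0,1)$, the helper names, and the values `a = -2.0`, `c = 3.0` are illustrative assumptions.

```python
import math

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def pushforward_pdf(y, p_x, g_inv, dg_inv):
    """1-D change of variables: p_Y(y) = p_X(g^{-1}(y)) * |d g^{-1} / dy|."""
    return p_x(g_inv(y)) * abs(dg_inv(y))

# Case 1: y = exp(x), so g^{-1}(y) = log(y) with derivative 1 / y.
def lognormal_pdf(y):
    return pushforward_pdf(y, std_normal_pdf, math.log, lambda t: 1.0 / t)

# Case 2: affine y = a * x + c with a != 0 (a negative slope is fine because
# the formula takes the absolute value of the Jacobian).
a, c = -2.0, 3.0

def affine_pdf(y):
    return pushforward_pdf(y, std_normal_pdf,
                           lambda t: (t - c) / a, lambda t: 1.0 / a)

# Check for case 1: the density equals the derivative of F_Y(y) = Phi(log y).
y0, h = 1.5, 1e-6
finite_diff = (std_normal_cdf(math.log(y0 + h))
               - std_normal_cdf(math.log(y0 - h))) / (2 * h)
```

Dropping the Jacobian factor is the classic mistake here: `std_normal_pdf(math.log(y))` alone is not a density in `y` and does not integrate to one.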
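To sample $\mathbf{x}\sim\mathcal{N}(\boldsymbol\mu, \Sigma)$, sample $\mathbf{z}\sim\mathcal{N}(\mathbf{0}, I)$ and return $\boldsymbol\mu + L\mathbf{z}$, where $\Sigma = LL^\top$ is the Cholesky factorisation; by the affine rule the result has mean $\boldsymbol\mu$ and covariance $L I L^\top = \Sigma$. Because the sample is a deterministic, differentiable function of $(\boldsymbol\mu, L)$ and parameter-free noise, this is the reparameterisation trick behind VAEs; score-based generative models draw their Gaussian perturbations the same way.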
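A minimal NumPy sketch of this sampler, with an empirical check of the mean and covariance. The function name, the seed and the example parameters are illustrative; `np.linalg.cholesky` requires a positive-definite matrix, so a merely PSD $\Sigma$ would need a different factorisation.

```python
import numpy as np

def sample_gaussian(mu, cov, n, rng):
    """Draw n samples from N(mu, cov) as mu + L z with z ~ N(0, I)."""
    mu = np.asarray(mu, dtype=float)
    L = np.linalg.cholesky(cov)               # lower-triangular, cov = L @ L.T
    z = rng.standard_normal((n, mu.size))     # one standard-normal row per sample
    return mu + z @ L.T                       # row i equals mu + L @ z[i]

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
cov = np.array([[4.0, 1.6], [1.6, 1.0]])

samples = sample_gaussian(mu, cov, 200_000, rng)
emp_mean = samples.mean(axis=0)               # should be close to mu
emp_cov = np.cov(samples, rowvar=False)       # should be close to cov
```

All of the randomness lives in `z`, which does not depend on `mu` or `L`; that separation is what allows gradients to flow through the sample in an autodiff framework.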
For any continuous $X$, $F_X(X)\sim\mathrm{Uniform}(0,1)$. Inverting: $X = F_X^{-1}(U)$ samples $X$ from a uniform $U$. This underlies inverse-transform sampling.
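Inverse-transform sampling is easiest to see for a distribution whose CDF inverts in closed form. The sketch below uses the Exponential with rate $\lambda$, where $F(x) = 1 - e^{-\lambda x}$ and $F^{-1}(u) = -\log(1-u)/\lambda$; the rate, seed and sample size are illustrative.

```python
import math
import random

def exponential_inverse_cdf(u, rate):
    """F^{-1}(u) for Exponential(rate), valid for u in [0, 1)."""
    return -math.log1p(-u) / rate             # log1p(-u) = log(1 - u), stable near 0

rng = random.Random(0)
rate = 2.0

# Push uniform draws through the inverse CDF to obtain exponential draws.
draws = [exponential_inverse_cdf(rng.random(), rate) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)         # should be near 1 / rate = 0.5

# Going the other way, F(X) should look uniform: about a quarter of the
# transformed values should fall below 0.25.
cdf_values = [1.0 - math.exp(-rate * x) for x in draws]
frac_below_quarter = sum(v < 0.25 for v in cdf_values) / len(cdf_values)
```

The method needs only a uniform generator and $F^{-1}$, so it applies whenever the inverse CDF is available in closed form or can be computed numerically.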
| Concept | Used by | Where |
|---|---|---|
| Likelihood & prior | Every Bayesian Part II algorithm | MAP / posterior inference |
| Conjugate Gauss-Gauss update | Bayesian linear regression (ch. 9) | Closed-form posterior on weights |
| Multivariate Gaussian conditional | Linear regression, GP | Predictive distribution |
| Gaussian density | GMM (ch. 11) | Mixture components |
| Reparameterisation | Generative models | Backprop through sampling |
| Exponential family | Generalised linear models | Natural parameterisation |
| Quantity | Formula |
|---|---|
| sum rule | $p(x) = \int p(x,y)\,dy$ |
| product rule | $p(x,y) = p(x\mid y)p(y)$ |
| Bayes | $p(\theta\mid\mathcal D)\propto p(\mathcal D\mid\theta)p(\theta)$ |
| Gaussian density | $\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}e^{-\tfrac12 (\mathbf x-\boldsymbol\mu)^\top\Sigma^{-1}(\mathbf x-\boldsymbol\mu)}$ |
| affine of Gaussian | $A\mathbf X+\mathbf c\sim\mathcal N(A\boldsymbol\mu+\mathbf c, A\Sigma A^\top)$ |
| Gaussian sum (independent) | $\mathbf X_1+\mathbf X_2\sim\mathcal N(\boldsymbol\mu_1+\boldsymbol\mu_2,\Sigma_1+\Sigma_2)$ for independent $\mathbf X_i\sim\mathcal N(\boldsymbol\mu_i,\Sigma_i)$ |
| conditional Gaussian mean | $\boldsymbol\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(\mathbf x_b-\boldsymbol\mu_b)$ |
| conditional Gaussian covariance | $\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$ |
| Beta-Bernoulli update | $\alpha\to\alpha+h,\;\beta\to\beta+t$ |
| change of variables | $p_Y(\mathbf y)=p_X(\mathbf g^{-1}(\mathbf y))|\det J_{\mathbf g^{-1}}|$ |