Mathematics for Machine Learning — Deck 06

Probability & Distributions

From the sum rule, product rule, and Bayes' theorem to the multivariate Gaussian's marginals and conditionals: the toolkit of every probabilistic ML model.

sum rule · product rule · Bayes · Gaussian · conjugacy · exp. family
[Figure: sample space, density, Bayes, Gaussian, conjugate update]
00

Topics We'll Cover

01

Sample Spaces, Events, Probability

A sample space $\Omega$ is the set of possible outcomes of an experiment; an event is a subset $A\subseteq\Omega$. A probability measure $P$ assigns a number $P(A)\in[0,1]$ to every event, with $P(\Omega)=1$ and $P\bigl(\bigcup_i A_i\bigr) = \sum_i P(A_i)$ for disjoint $A_i$.

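Most of the book glosses over the measure-theoretic details because, in practice, ML works with two well-behaved cases: discrete random variables, described by a probability mass function (pmf), and continuous random variables, described by a probability density function (pdf).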

In both cases we'll write $p(x)$ and let context disambiguate.
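As a small illustration of the axioms on slide 01, the sketch below builds a finite sample space and checks normalisation and additivity for disjoint events. The fair six-sided die, the uniform measure and the names `omega` and `prob` are assumptions chosen for the example, not part of the deck.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (illustrative choice).
omega = frozenset(range(1, 7))

def prob(event):
    """Uniform probability measure: P(A) = |A| / |Omega| for A a subset of Omega."""
    event = frozenset(event)
    if not event <= omega:
        raise ValueError("an event must be a subset of the sample space")
    return Fraction(len(event), len(omega))

evens = {2, 4, 6}
low_odds = {1, 3}                      # disjoint from `evens`

p_omega = prob(omega)                  # normalisation: P(Omega) = 1
p_union = prob(evens | low_odds)       # P(A ∪ B)
p_sum = prob(evens) + prob(low_odds)   # P(A) + P(B), valid because A and B are disjoint
```

Using `Fraction` keeps the arithmetic exact, so the additivity check is an equality rather than a floating-point comparison.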

02

Discrete vs Continuous — pmf vs pdf

Discrete: pmf

$p(x) = P(X = x) \in [0,1]$ is the probability of the outcome $x$ itself.

$$\sum_{x}p(x) = 1.$$

Examples: Bernoulli, Binomial, Categorical, Poisson.

Continuous: pdf

$p(x) \ge 0$ is a density, not a probability, so it can exceed 1. The probability of a region is the integral:

$$P(X\in A) = \int_A p(x)\,dx, \qquad \int p(x)\,dx = 1.$$

Examples: Gaussian, Exponential, Beta, Gamma.

The CDF unifies them

The cumulative distribution function $F(x) = P(X\le x)$ is well-defined for both cases. In the continuous case $p = F'$ wherever $F$ is differentiable.
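To make the pdf and CDF statements concrete, here is a minimal sketch with a narrow Gaussian. The value $\sigma = 0.1$ and the interval $[-0.1, 0.1]$ are arbitrary choices for illustration. The density at the mode is well above 1, yet the probability of the interval, computed either by integrating the pdf or as $F(b) - F(a)$, stays below 1.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """F(x) = P(X <= x) for N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

sigma = 0.1
peak = gaussian_pdf(0.0, sigma=sigma)      # a density value, not a probability

# P(a <= X <= b) two ways: midpoint Riemann sum of the pdf, and F(b) - F(a).
a, b, n = -0.1, 0.1, 10_000
h = (b - a) / n
riemann = sum(gaussian_pdf(a + (i + 0.5) * h, sigma=sigma) for i in range(n)) * h
via_cdf = gaussian_cdf(b, sigma=sigma) - gaussian_cdf(a, sigma=sigma)
```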

03

Sum, Product, Bayes

The three rules that all of probabilistic ML is built from. State them once for joint $p(x,y)$ and read them everywhere.

Sum rule (marginalisation)

$$p(x) = \sum_{y} p(x,y) \quad\text{(or}\;\int p(x,y)\,dy\;\text{)}.$$

Product rule (chain rule of probability)

$$p(x,y) = p(x\mid y)\, p(y) = p(y\mid x)\, p(x).$$
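The sum and product rules can be checked mechanically on a small joint table. The sketch below uses a made-up joint pmf over $x\in\{0,1\}$ and $y\in\{a,b\}$; the numbers are hypothetical and only need to sum to 1.

```python
# Hypothetical joint pmf p(x, y); the four entries sum to 1.
joint = {
    (0, "a"): 0.10, (0, "b"): 0.30,
    (1, "a"): 0.40, (1, "b"): 0.20,
}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

# Sum rule: marginalise out the other variable.
p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}

# Product rule, rearranged to define the conditionals.
p_x_given_y = {(x, y): joint[(x, y)] / p_y[y] for x in xs for y in ys}
p_y_given_x = {(x, y): joint[(x, y)] / p_x[x] for x in xs for y in ys}

# Both factorisations should reproduce the joint.
rebuilt_from_y = {k: p_x_given_y[k] * p_y[k[1]] for k in joint}
rebuilt_from_x = {k: p_y_given_x[k] * p_x[k[0]] for k in joint}
```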

Bayes' theorem

Equate the two product-rule decompositions:

$$\underbrace{p(\theta\mid \mathcal D)}_{\text{posterior}} = \frac{\overbrace{p(\mathcal D\mid \theta)}^{\text{likelihood}}\,\overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(\mathcal D)}_{\text{evidence}}}.$$

$p(\mathcal D)$ is just a normalising constant: $\int p(\mathcal D\mid\theta)\,p(\theta)\,d\theta$. It doesn't depend on $\theta$, which is why posterior maximisation (MAP) doesn't need to compute it.

Read every Bayesian model this way

Pick parameters $\theta$; pick likelihood $p(\mathcal D\mid\theta)$ and prior $p(\theta)$. Bayes hands you $p(\theta\mid\mathcal D)$, which is a complete description of what the data have taught you about $\theta$. All downstream questions — predictions, decisions, uncertainty — reduce to integrals against this posterior.
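A minimal sketch of this recipe on a discrete parameter grid, where the evidence integral becomes a sum. The three candidate values of $\theta$, the uniform prior and the data (3 heads, 1 tail) are assumptions made up for the example. It also shows that the MAP estimate is the same with or without dividing by the evidence.

```python
# Candidate head probabilities and a uniform prior over them (illustrative).
thetas = [0.2, 0.5, 0.8]
prior = {t: 1.0 / len(thetas) for t in thetas}

heads, tails = 3, 1                      # hypothetical observed data D

def likelihood(theta):
    """p(D | theta) for independent coin flips."""
    return theta ** heads * (1.0 - theta) ** tails

# Numerator of Bayes' theorem: likelihood times prior.
unnormalised = {t: likelihood(t) * prior[t] for t in thetas}

# Evidence p(D): the sum rule applied to the numerator.
evidence = sum(unnormalised.values())
posterior = {t: u / evidence for t, u in unnormalised.items()}

# MAP does not need the evidence: the argmax is unchanged by a constant factor.
map_unnormalised = max(unnormalised, key=unnormalised.get)
map_posterior = max(posterior, key=posterior.get)
```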

04

Interactive: Bayes for a Biased Coin

A coin has unknown head probability $\theta\in[0,1]$. The prior is $\mathrm{Beta}(\alpha_0,\beta_0)$. Click the buttons to feed in observed heads / tails; the posterior $\mathrm{Beta}(\alpha_0+h,\, \beta_0+t)$ updates live.

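[Interactive widget: controls for the prior parameters $\alpha_0$ and $\beta_0$ (both shown as 2), counters for observed heads and tails (both 0), and readouts for the posterior mean and the 95% credible interval.]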

The Beta is the conjugate prior for a Bernoulli likelihood: prior and posterior live in the same family, and the posterior pseudo-counts are just $(\alpha_0+h,\beta_0+t)$. Slide 09 explains why this works.
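The update performed by the widget can be sketched in a few lines. The prior $\mathrm{Beta}(2,2)$ and the counts (7 heads, 3 tails) are example inputs, and the credible interval here is an equal-tailed Monte Carlo estimate, which is one reasonable choice rather than necessarily what the slide's widget computes.

```python
import random

def beta_update(alpha0, beta0, heads, tails):
    """Conjugate update: Beta(alpha0, beta0) prior -> Beta(alpha0 + h, beta0 + t)."""
    return alpha0 + heads, beta0 + tails

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

def credible_interval(alpha, beta, level=0.95, n_samples=50_000, seed=0):
    """Equal-tailed interval estimated from sorted Monte Carlo draws."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(alpha, beta) for _ in range(n_samples))
    lo = draws[int((1.0 - level) / 2.0 * n_samples)]
    hi = draws[int((1.0 + level) / 2.0 * n_samples) - 1]
    return lo, hi

alpha_n, beta_n = beta_update(2, 2, heads=7, tails=3)
post_mean = beta_mean(alpha_n, beta_n)          # (2 + 7) / (2 + 2 + 10)
lo, hi = credible_interval(alpha_n, beta_n)
```

With an exact Beta quantile function (for example from a statistics library) the interval would not need sampling; the standard library has none, hence the Monte Carlo estimate.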

05

Expectation, Variance, Covariance

Expectation

$$\mathbb{E}_p[f(X)] = \int f(x)\, p(x)\, dx.$$

Linear: $\mathbb{E}[\alpha X + \beta Y] = \alpha\mathbb{E}[X] + \beta\mathbb{E}[Y]$ — with or without independence.

Variance and covariance

$$\mathbb{V}[X] = \mathbb{E}[(X - \mathbb{E}[X])^2], \qquad \mathrm{Cov}(X,Y) = \mathbb{E}[(X-\mathbb{E}X)(Y-\mathbb{E}Y)].$$

For a random vector $\mathbf{X}$ the covariance is the matrix $\Sigma = \mathbb{E}[(\mathbf{X}-\boldsymbol\mu)(\mathbf{X}-\boldsymbol\mu)^\top]$ — symmetric and PSD.

Useful identities
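Standard identities that follow directly from the definitions above include $\mathbb{V}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$, $\mathrm{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$, $\mathbb{V}[X+Y] = \mathbb{V}[X] + \mathbb{V}[Y] + 2\,\mathrm{Cov}(X,Y)$ and $\mathbb{V}[aX+b] = a^2\,\mathbb{V}[X]$. The sketch below checks them numerically on the empirical distribution of a synthetic sample, where they hold exactly up to rounding. The data-generating choices (`rng`, the 0.5 slope, the constants `a` and `b`) are arbitrary.

```python
import random

rng = random.Random(0)                               # fixed seed for repeatability
xs = [rng.gauss(0.0, 1.0) for _ in range(1000)]
ys = [0.5 * x + rng.gauss(0.0, 1.0) for x in xs]     # correlated with xs by construction

def mean(v):
    return sum(v) / len(v)

def var(v):
    """Population variance: E[(V - E[V])^2] under the empirical distribution."""
    m = mean(v)
    return mean([(a - m) ** 2 for a in v])

def cov(u, v):
    """Population covariance under the empirical distribution."""
    mu, mv = mean(u), mean(v)
    return mean([(a - mu) * (b - mv) for a, b in zip(u, v)])

a, b = 3.0, -2.0
# Each entry pairs the left-hand side with the right-hand side of one identity.
checks = {
    "V[X] = E[X^2] - E[X]^2": (var(xs), mean([x * x for x in xs]) - mean(xs) ** 2),
    "Cov = E[XY] - E[X]E[Y]": (cov(xs, ys),
                               mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)),
    "V[X+Y] = V[X] + V[Y] + 2 Cov": (var([x + y for x, y in zip(xs, ys)]),
                                     var(xs) + var(ys) + 2 * cov(xs, ys)),
    "V[aX+b] = a^2 V[X]": (var([a * x + b for x in xs]), a * a * var(xs)),
}
```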

06

The Univariate and Multivariate Gaussian

Univariate

$$p(x\mid\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

Multivariate

$$p(\mathbf{x}\mid\boldsymbol\mu,\Sigma) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\!\left(-\tfrac12 (\mathbf{x}-\boldsymbol\mu)^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\right).$$

Three facts you'll use constantly:

  1. Sum: if $\mathbf{X}\sim\mathcal{N}(\boldsymbol\mu_1,\Sigma_1)$ and $\mathbf{Y}\sim\mathcal{N}(\boldsymbol\mu_2,\Sigma_2)$ are independent, then $\mathbf{X}+\mathbf{Y}\sim\mathcal{N}(\boldsymbol\mu_1+\boldsymbol\mu_2, \Sigma_1+\Sigma_2)$.
  2. Affine: if $\mathbf{X}\sim\mathcal{N}(\boldsymbol\mu,\Sigma)$, then $A\mathbf{X}+\mathbf{c}\sim\mathcal{N}(A\boldsymbol\mu+\mathbf{c}, A\Sigma A^\top)$.
  3. Product: the product of two Gaussian densities in $\mathbf{x}$ is proportional to a Gaussian density, so the posterior of $\mathbf{x}$ under a Gaussian likelihood and a Gaussian prior is Gaussian.

Why the Gaussian is everywhere

It is the maximum-entropy distribution for a given mean and covariance; suitably normalised sums of many independent, finite-variance terms tend to a Gaussian (CLT); and the family is closed under marginalisation, conditioning, sums of independent Gaussians, and linear maps. That algebraic closure is why deck 09 (Bayesian linear regression) has a closed-form solution at all.
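Fact 3 can be checked in one dimension. For two univariate Gaussian densities in $x$, precisions add and the mean is the precision-weighted average: $\sigma^{-2} = \sigma_1^{-2} + \sigma_2^{-2}$ and $\mu = \sigma^2(\mu_1/\sigma_1^2 + \mu_2/\sigma_2^2)$. The sketch below, with arbitrary example parameters, confirms that the pointwise product divided by the resulting Gaussian density is the same constant at every $x$.

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) at x (var is the variance)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def product_params(mu1, var1, mu2, var2):
    """Mean and variance of the Gaussian proportional to N(x; mu1, var1) N(x; mu2, var2)."""
    precision = 1.0 / var1 + 1.0 / var2
    var = 1.0 / precision
    mu = var * (mu1 / var1 + mu2 / var2)
    return mu, var

mu1, var1 = 0.0, 4.0        # example "prior"
mu2, var2 = 2.0, 1.0        # example "likelihood", viewed as a function of x
mu, var = product_params(mu1, var1, mu2, var2)

# If the product is proportional to N(x; mu, var), this ratio must not depend on x.
ratios = [
    gaussian_pdf(x, mu1, var1) * gaussian_pdf(x, mu2, var2) / gaussian_pdf(x, mu, var)
    for x in (-1.0, 0.0, 1.5, 3.0)
]
# The constant itself is a Gaussian in the two means: N(mu1; mu2, var1 + var2).
constant = gaussian_pdf(mu1, mu2, var1 + var2)
```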

07

Interactive: Bivariate Gaussian Explorer

Drag the cross to move $\boldsymbol\mu$. Use the sliders to set $\sigma_x$, $\sigma_y$ and the correlation $\rho\in(-1,1)$. The density contours, principal axes (eigenvectors of $\Sigma$) and marginal densities update live.

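[Interactive figure: density contours, principal axes and marginal densities of a bivariate Gaussian; slider readouts 1.00, 0.60, 0.60.]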

The contours are ellipses; their axes are the eigenvectors of $\Sigma$. When $\rho=0$ the axes are aligned with $(x,y)$; turning up $|\rho|$ rotates the ellipse and stretches one axis.
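The geometry described above can be reproduced with a closed-form eigen-decomposition of the $2\times 2$ covariance. This is a sketch under the usual parameterisation $\Sigma_{xy} = \rho\,\sigma_x\sigma_y$; the function names and the example values are illustrative. The contour semi-axes are proportional to the square roots of the eigenvalues.

```python
import math

def covariance(sigma_x, sigma_y, rho):
    """2x2 covariance matrix built from standard deviations and correlation."""
    c = rho * sigma_x * sigma_y
    return [[sigma_x ** 2, c], [c, sigma_y ** 2]]

def principal_axes(sigma):
    """Eigenvalues and unit eigenvectors of a symmetric 2x2 matrix, major axis first."""
    a, b, d = sigma[0][0], sigma[0][1], sigma[1][1]
    half_trace = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    lam_major, lam_minor = half_trace + radius, half_trace - radius
    angle = 0.5 * math.atan2(2.0 * b, a - d)       # orientation of the major axis
    major = (math.cos(angle), math.sin(angle))
    minor = (-math.sin(angle), math.cos(angle))
    return (lam_major, major), (lam_minor, minor)

sigma = covariance(1.0, 0.6, 0.6)                  # example slider settings
(lam_major, major), (lam_minor, minor) = principal_axes(sigma)

# With rho = 0 and sigma_x > sigma_y the major axis lies along x.
(_, axis_uncorrelated), _ = principal_axes(covariance(1.0, 0.6, 0.0))
```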

08

Marginal and Conditional Gaussians

Partition a joint Gaussian $\mathbf{X} = (\mathbf{X}_a, \mathbf{X}_b)$ with

$$\boldsymbol\mu = \begin{bmatrix}\boldsymbol\mu_a\\\boldsymbol\mu_b\end{bmatrix}, \quad \Sigma = \begin{bmatrix}\Sigma_{aa} & \Sigma_{ab}\\\Sigma_{ba} & \Sigma_{bb}\end{bmatrix}.$$

Marginal

$\mathbf{X}_a \sim \mathcal{N}(\boldsymbol\mu_a, \Sigma_{aa})$. Just throw away the rows and columns you don't want.

Conditional

$$\mathbf{X}_a \mid \mathbf{X}_b=\mathbf{x}_b \sim \mathcal{N}\bigl(\boldsymbol\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(\mathbf{x}_b-\boldsymbol\mu_b),\; \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\bigr).$$

The conditional mean is the linear regression of $\mathbf{X}_a$ on $\mathbf{X}_b$; the conditional covariance is the Schur complement of $\Sigma_{bb}$ in $\Sigma$.
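In the bivariate case every block is a scalar, so the conditional formulas reduce to ordinary arithmetic. The sketch below uses example values $\sigma_a = 2$, $\sigma_b = 1$, $\rho = 0.5$ and zero means, all chosen for illustration.

```python
def condition_bivariate(mu_a, mu_b, s_aa, s_ab, s_bb, x_b):
    """Mean and variance of X_a | X_b = x_b for a bivariate Gaussian.

    Scalar version of
        mu_a + S_ab S_bb^{-1} (x_b - mu_b)   and   S_aa - S_ab S_bb^{-1} S_ba.
    """
    gain = s_ab / s_bb                        # regression coefficient of X_a on X_b
    cond_mean = mu_a + gain * (x_b - mu_b)
    cond_var = s_aa - gain * s_ab             # Schur complement of S_bb
    return cond_mean, cond_var

sigma_a, sigma_b, rho = 2.0, 1.0, 0.5
s_aa, s_bb = sigma_a ** 2, sigma_b ** 2
s_ab = rho * sigma_a * sigma_b

cond_mean, cond_var = condition_bivariate(0.0, 0.0, s_aa, s_ab, s_bb, x_b=1.0)
marginal_var = s_aa                           # marginal of X_a: just keep its own block
```

Observing $X_b$ never increases the variance of $X_a$: here it drops from 4 to $4(1-\rho^2) = 3$, and the drop does not depend on the observed value $x_b$.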

Linear regression preview

The Bayesian-linear-regression posterior of deck 09 is exactly the conditional distribution of the weights given the observed targets, when the two are jointly Gaussian. The formulae on this slide give it in one line.

09

Conjugacy and the Exponential Family

Conjugate prior

A prior $p(\theta)$ is conjugate to a likelihood $p(\mathcal D\mid\theta)$ if the posterior is in the same family as the prior. The update is then a simple parameter shift — no integrals.

| Likelihood | Conjugate prior | Update |
| --- | --- | --- |
| Bernoulli | Beta($\alpha,\beta$) | $\alpha\to\alpha+h$, $\beta\to\beta+t$ |
| Gaussian (known $\sigma^2$) | Gaussian on $\mu$ | standard formula in deck 09 |
| Gaussian (known $\mu$) | Inverse-Gamma on $\sigma^2$ | Inverse-Gamma update |
| Multivariate Gaussian | Normal-Inverse-Wishart | Joint update on $\boldsymbol\mu,\Sigma$ |
| Categorical | Dirichlet | $\alpha_k\to\alpha_k + n_k$ |
| Poisson | Gamma | standard Gamma update |
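Two of the table's rows written out as code, to show that a conjugate update is nothing more than adding counts to hyperparameters. For the Poisson row this sketch assumes the shape and rate parameterisation of the Gamma, under which the update is shape $\to$ shape $+ \sum_n x_n$ and rate $\to$ rate $+ N$. The function names and the example data are illustrative.

```python
from collections import Counter

def gamma_poisson_update(shape, rate, counts):
    """Gamma(shape, rate) prior on a Poisson rate, updated with observed counts."""
    return shape + sum(counts), rate + len(counts)

def dirichlet_categorical_update(alpha, observations):
    """Dirichlet pseudo-counts alpha_k, updated to alpha_k + n_k."""
    n = Counter(observations)
    return {k: a + n.get(k, 0) for k, a in alpha.items()}

# Example: four Poisson observations under a Gamma(2, 1) prior.
post_shape, post_rate = gamma_poisson_update(2.0, 1.0, [3, 0, 2, 4])

# Example: four categorical observations under a flat Dirichlet(1, 1, 1) prior.
post_alpha = dirichlet_categorical_update(
    {"a": 1.0, "b": 1.0, "c": 1.0}, ["a", "c", "a", "a"]
)
```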

The exponential family

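A family with density $p(x\mid\boldsymbol\theta) = h(x)\exp(\boldsymbol\eta(\boldsymbol\theta)^\top T(x) - A(\boldsymbol\theta))$ has a conjugate prior that can be read off from its form: $p(\boldsymbol\theta\mid\boldsymbol\chi,\nu)\propto\exp(\boldsymbol\eta(\boldsymbol\theta)^\top\boldsymbol\chi - \nu A(\boldsymbol\theta))$, whenever this is normalisable. Observing $x_1,\dots,x_N$ only shifts the hyperparameters, $\boldsymbol\chi\to\boldsymbol\chi+\sum_n T(x_n)$ and $\nu\to\nu+N$. All the distributions in the table above are exponential-family members.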
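As a concrete instance, the Bernoulli pmf $\theta^x(1-\theta)^{1-x}$ fits this template with $h(x) = 1$, $T(x) = x$, natural parameter $\eta = \log\frac{\theta}{1-\theta}$ and log-partition $A = \log(1+e^{\eta}) = -\log(1-\theta)$. The sketch below checks numerically that the two forms agree; the function names are illustrative.

```python
import math

def bernoulli_pmf(x, theta):
    """Standard form: theta^x (1 - theta)^(1 - x) for x in {0, 1}."""
    return theta ** x * (1.0 - theta) ** (1 - x)

def bernoulli_expfam(x, theta):
    """Exponential-family form h(x) exp(eta * T(x) - A) with h = 1 and T(x) = x."""
    eta = math.log(theta / (1.0 - theta))        # natural parameter (log-odds)
    log_partition = math.log1p(math.exp(eta))    # A = log(1 + e^eta) = -log(1 - theta)
    return math.exp(eta * x - log_partition)

# Largest disagreement between the two forms over a few parameter values.
max_gap = max(
    abs(bernoulli_pmf(x, theta) - bernoulli_expfam(x, theta))
    for theta in (0.1, 0.5, 0.9)
    for x in (0, 1)
)
```

Plugging $T(x) = x$ and $A = -\log(1-\theta)$ into the generic conjugate prior gives a density proportional to $\theta^{\chi}(1-\theta)^{\nu-\chi}$, which is the Beta family from slide 04.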

10

Change of Variables

Push-forward of a density under an invertible $\mathbf{y} = \mathbf{g}(\mathbf{x})$ with Jacobian $J_\mathbf{g}$:

$$p_Y(\mathbf{y}) = p_X(\mathbf{g}^{-1}(\mathbf{y}))\,\bigl|\det J_{\mathbf{g}^{-1}}(\mathbf{y})\bigr| = p_X(\mathbf{g}^{-1}(\mathbf{y}))\,\bigl|\det J_{\mathbf{g}}(\mathbf{g}^{-1}(\mathbf{y}))\bigr|^{-1}.$$

For an affine $\mathbf{y} = A\mathbf{x}+\mathbf{c}$ with $A$ invertible:

$$p_Y(\mathbf{y}) = p_X(A^{-1}(\mathbf{y}-\mathbf{c}))\,|\det A|^{-1}.$$
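A one-dimensional check of the affine formula, where $A$ becomes a nonzero scalar $a$ and $|\det A|$ becomes $|a|$. Pushing $X\sim\mathcal{N}(\mu,\sigma^2)$ through $y = ax + c$ should reproduce the density of $\mathcal{N}(a\mu + c, a^2\sigma^2)$, in agreement with the affine rule on slide 06. The numbers are arbitrary examples.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x (sigma is the standard deviation)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def pushforward_affine_pdf(y, a, c, p_x):
    """Density of Y = a X + c: p_Y(y) = p_X((y - c) / a) / |a|, for a != 0."""
    if a == 0:
        raise ValueError("the map must be invertible, so a must be nonzero")
    return p_x((y - c) / a) / abs(a)

mu, sigma = 1.0, 2.0
a, c = -3.0, 0.5                       # a negative slope exercises the absolute value

ys = [-10.0, -2.5, 0.0, 4.0]
via_change_of_variables = [
    pushforward_affine_pdf(y, a, c, lambda x: normal_pdf(x, mu, sigma)) for y in ys
]
via_affine_rule = [normal_pdf(y, a * mu + c, abs(a) * sigma) for y in ys]
```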

Reparameterisation

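To sample $\mathbf{x}\sim\mathcal{N}(\boldsymbol\mu, \Sigma)$, sample $\mathbf{z}\sim\mathcal{N}(\mathbf{0}, I)$ and return $\boldsymbol\mu + L\mathbf{z}$, where $\Sigma = LL^\top$ is the Cholesky factorisation. By the affine rule the result has mean $\boldsymbol\mu$ and covariance $LIL^\top = \Sigma$. Because the randomness sits in $\mathbf{z}$ and the output is a differentiable function of $\boldsymbol\mu$ and $L$, gradients can flow through the sample; this is the reparameterisation trick behind VAEs and many other generative models.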
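A minimal two-dimensional sketch of this sampler, with the $2\times 2$ Cholesky factor written out by hand so that no linear-algebra library is needed. The mean, covariance, sample size and seed are example values; in practice a library routine would compute the factor.

```python
import math
import random

def cholesky_2x2(sigma):
    """Lower-triangular L with L L^T = sigma, for a symmetric positive-definite 2x2 matrix."""
    a, b, d = sigma[0][0], sigma[0][1], sigma[1][1]
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 ** 2)
    return [[l11, 0.0], [l21, l22]]

def sample_gaussian(mu, L, rng):
    """Return mu + L z with z ~ N(0, I)."""
    z0, z1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mu[0] + L[0][0] * z0, mu[1] + L[1][0] * z0 + L[1][1] * z1)

mu = (1.0, -1.0)
sigma = [[4.0, 1.2], [1.2, 1.0]]
L = cholesky_2x2(sigma)

rng = random.Random(0)                            # fixed seed for repeatability
samples = [sample_gaussian(mu, L, rng) for _ in range(20_000)]

n = len(samples)
mean_x = sum(s[0] for s in samples) / n
mean_y = sum(s[1] for s in samples) / n
emp_cov_xy = sum((s[0] - mean_x) * (s[1] - mean_y) for s in samples) / n
```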

Probability integral transform

For any continuous $X$, $F_X(X)\sim\mathrm{Uniform}(0,1)$. Inverting: $X = F_X^{-1}(U)$ samples $X$ from a uniform $U$. This underlies inverse-transform sampling.
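Inverse-transform sampling is easiest to see for a distribution with a closed-form inverse CDF. For the exponential distribution with rate $\lambda$, $F(x) = 1 - e^{-\lambda x}$ and $F^{-1}(u) = -\log(1-u)/\lambda$. The sketch below uses $\lambda = 2$ as an example value and checks the sample mean against $1/\lambda$.

```python
import math
import random

def exp_cdf(x, rate):
    """F(x) = 1 - exp(-rate * x) for x >= 0, else 0."""
    return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0

def exp_inverse_cdf(u, rate):
    """F^{-1}(u) = -log(1 - u) / rate for u in [0, 1)."""
    return -math.log1p(-u) / rate

rate = 2.0
rng = random.Random(0)                       # fixed seed for repeatability

# random() returns values in [0, 1), so log1p(-u) is always defined.
samples = [exp_inverse_cdf(rng.random(), rate) for _ in range(20_000)]
sample_mean = sum(samples) / len(samples)    # should be close to 1 / rate

# Round trip: F(F^{-1}(u)) recovers u.
round_trip_gap = max(abs(exp_cdf(exp_inverse_cdf(u, rate), rate) - u)
                     for u in (0.0, 0.1, 0.5, 0.9, 0.999))
```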

11

Where This Lands in Part II

| Concept | Used by | Where |
| --- | --- | --- |
| Likelihood & prior | Every Bayesian Part II algorithm | MAP / posterior inference |
| Conjugate Gauss-Gauss update | Bayesian linear regression (ch. 9) | Closed-form posterior on weights |
| Multivariate Gaussian conditional | Linear regression, GP | Predictive distribution |
| Gaussian density | GMM (ch. 11) | Mixture components |
| Reparameterisation | Generative models | Backprop through sampling |
| Exponential family | Generalised linear models | Natural parameterisation |

12

Cheat Sheet

| Quantity | Formula |
| --- | --- |
| sum rule | $p(x) = \int p(x,y)\,dy$ |
| product rule | $p(x,y) = p(x\mid y)p(y)$ |
| Bayes | $p(\theta\mid\mathcal D)\propto p(\mathcal D\mid\theta)p(\theta)$ |
| Gaussian density | $\frac{1}{(2\pi)^{D/2}\lvert\Sigma\rvert^{1/2}}e^{-\tfrac12 (\mathbf x-\boldsymbol\mu)^\top\Sigma^{-1}(\mathbf x-\boldsymbol\mu)}$ |
| affine of Gaussian | $A\mathbf X+\mathbf c\sim\mathcal N(A\boldsymbol\mu+\mathbf c, A\Sigma A^\top)$ |
| Gaussian sum | $\mathcal N(\boldsymbol\mu_1,\Sigma_1)+\mathcal N(\boldsymbol\mu_2,\Sigma_2)=\mathcal N(\boldsymbol\mu_1+\boldsymbol\mu_2,\Sigma_1+\Sigma_2)$ |
| conditional Gaussian mean | $\boldsymbol\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(\mathbf x_b-\boldsymbol\mu_b)$ |
| conditional Gaussian covariance | $\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$ |
| Beta-Bernoulli update | $\alpha\to\alpha+h,\;\beta\to\beta+t$ |
| change of variables | $p_Y(\mathbf y)=p_X(\mathbf g^{-1}(\mathbf y))\lvert\det J_{\mathbf g^{-1}}\rvert$ |