MML 06 — Probability and Distributions

00

Topics We'll Cover

Sample spaces, events, probability
Discrete vs continuous — pmf vs pdf
Sum, product, Bayes
Interactive: Bayes for a biased coin
Expectation, variance, covariance
The univariate and multivariate Gaussian
Interactive: bivariate Gaussian explorer
Marginal and conditional Gaussians
Conjugacy and the exponential family
Change of variables
Where this lands in Part II
Cheat sheet

01

Sample Spaces, Events, Probability

A sample space $\Omega$ is the set of possible outcomes of an experiment; an event is a subset $A\subseteq\Omega$. A probability measure $P$ assigns a number $P(A)\in[0,1]$ to every event, with $P(\Omega)=1$ and $P\bigl(\bigcup_i A_i\bigr) = \sum_i P(A_i)$ for any countable collection of pairwise disjoint events $A_i$.

Most of the book glosses over the measure-theoretic details because, in practice, ML works with two well-behaved cases:

Discrete: $\Omega$ is countable; $P$ is fixed by a probability mass function $p:\Omega\to[0,1]$ with $\sum_x p(x)=1$.
Continuous: $\Omega\subseteq\mathbb{R}^D$; $P$ is fixed by a density $p:\mathbb{R}^D\to\mathbb{R}_{\ge 0}$ with $\int p(\mathbf{x})\,d\mathbf{x}=1$, and $P(A) = \int_A p(\mathbf{x})\,d\mathbf{x}$.

In both cases we'll write $p(x)$ and let context disambiguate.

02

Discrete vs Continuous — pmf vs pdf

Discrete: pmf

$p(x) = P(X = x) \in [0,1]$ is the probability of the outcome $x$ itself.

$$\sum_{x}p(x) = 1.$$

Examples: Bernoulli, Binomial, Categorical, Poisson.

Continuous: pdf

$p(x) \ge 0$ is a density, not a probability, so it can exceed 1. The probability of a region is the integral:

$$P(X\in A) = \int_A p(x)\,dx, \qquad \int p(x)\,dx = 1.$$

Examples: Gaussian, Exponential, Beta, Gamma.

The CDF unifies them

The cumulative distribution function $F(x) = P(X\le x)$ is well-defined for both cases. In the continuous case $p = F'$ wherever $F$ is differentiable.
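A small numerical sketch of the two points above, using Python's standard-library `statistics.NormalDist`. The narrow Gaussian (standard deviation 0.1) and the evaluation point are illustrative choices, not values from the slides.

```python
from statistics import NormalDist

# A narrow Gaussian: mean 0, standard deviation 0.1 (illustrative values).
narrow = NormalDist(mu=0.0, sigma=0.1)

# The density at the mean is about 3.99, so a pdf value is not a probability.
peak_density = narrow.pdf(0.0)

# Probabilities come from integrating the density, i.e. from CDF differences.
prob_within_one_sigma = narrow.cdf(0.1) - narrow.cdf(-0.1)

# p = F': a central finite difference of the CDF approximates the density.
eps = 1e-5
x = 0.05
finite_diff = (narrow.cdf(x + eps) - narrow.cdf(x - eps)) / (2 * eps)

print(peak_density, prob_within_one_sigma, finite_diff, narrow.pdf(x))
```

The density exceeds 1 at the peak, yet the probability of the interval $[-0.1, 0.1]$ stays below 1 (roughly 0.68, the usual one-sigma mass).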

03

Sum, Product, Bayes

The three rules that all of probabilistic ML is built from. State them once for joint $p(x,y)$ and read them everywhere.

Sum rule (marginalisation)

$$p(x) = \sum_{y} p(x,y) \quad\text{(or}\;\int p(x,y)\,dy\;\text{)}.$$

Product rule (chain rule of probability)

$$p(x,y) = p(x\mid y)\, p(y) = p(y\mid x)\, p(x).$$

Bayes' theorem

Equate the two product-rule decompositions:

$$\underbrace{p(\theta\mid \mathcal D)}_{\text{posterior}} = \frac{\overbrace{p(\mathcal D\mid \theta)}^{\text{likelihood}}\,\overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(\mathcal D)}_{\text{evidence}}}.$$

$p(\mathcal D)$ is just a normalising constant: $\int p(\mathcal D\mid\theta)\,p(\theta)\,d\theta$. It doesn't depend on $\theta$, which is why posterior maximisation (MAP) doesn't need to compute it.

Read every Bayesian model this way

Pick parameters $\theta$; pick likelihood $p(\mathcal D\mid\theta)$ and prior $p(\theta)$. Bayes hands you $p(\theta\mid\mathcal D)$, which is a complete description of what the data have taught you about $\theta$. All downstream questions — predictions, decisions, uncertainty — reduce to integrals against this posterior.
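As a minimal sketch of this recipe, take a parameter that can only be one of two values. The hypotheses ("fair" with $\theta=0.5$, "biased" with $\theta=0.8$), the equal prior, and the data (three heads in a row) are all made up for illustration.

```python
def posterior(prior, likelihood):
    """Discrete Bayes: prior and likelihood are dicts keyed by hypothesis."""
    joint = {h: likelihood[h] * prior[h] for h in prior}  # product rule
    evidence = sum(joint.values())                        # sum rule
    post = {h: joint[h] / evidence for h in joint}        # Bayes' theorem
    return post, joint, evidence

# Hypothetical setup: two candidate values of theta, equal prior weight.
prior = {"fair": 0.5, "biased": 0.5}
# D = three heads in a row, so p(D | theta) = theta ** 3.
likelihood = {"fair": 0.5 ** 3, "biased": 0.8 ** 3}

post, joint, evidence = posterior(prior, likelihood)

# MAP needs only the unnormalised joint: dividing by the evidence
# rescales every entry equally and cannot change the argmax.
map_from_joint = max(joint, key=joint.get)
map_from_posterior = max(post, key=post.get)

print(post, evidence, map_from_joint)
```

Here the evidence is $0.5\cdot0.125 + 0.5\cdot0.512 = 0.3185$, and the posterior puts roughly 80% of its mass on the biased coin.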

04

Interactive: Bayes for a Biased Coin

A coin has unknown head probability $\theta\in[0,1]$. The prior is $\mathrm{Beta}(\alpha_0,\beta_0)$. Click the buttons to feed in observed heads / tails; the posterior $\mathrm{Beta}(\alpha_0+h,\, \beta_0+t)$ updates live.

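[Interactive widget: controls for prior α₀ (default 2) and prior β₀ (default 2); counters for heads (0) and tails (0); readouts for the posterior mean and the 95% credible interval.]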

The Beta is the conjugate prior for a Bernoulli likelihood: prior and posterior live in the same family, and the posterior pseudo-counts are just $(\alpha_0+h,\beta_0+t)$. Slide 09 explains why this works.
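The update and the two readouts can be sketched offline as follows. The prior $\mathrm{Beta}(2,2)$ matches the widget defaults; the counts (7 heads, 3 tails) are made up. The credible interval is an assumption about what the widget reports: this sketch uses an equal-tailed interval estimated by Monte Carlo, because the standard library has no Beta quantile function.

```python
import random

def beta_posterior(alpha0, beta0, heads, tails):
    """Conjugate update: add observed counts to the prior pseudo-counts."""
    return alpha0 + heads, beta0 + tails

def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)

def credible_interval(alpha, beta, level=0.95, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the equal-tailed credible interval."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(alpha, beta) for _ in range(n_samples))
    lo_idx = int((1 - level) / 2 * n_samples)
    hi_idx = int((1 + level) / 2 * n_samples) - 1
    return draws[lo_idx], draws[hi_idx]

# Prior Beta(2, 2), then 7 heads and 3 tails (illustrative counts).
alpha, beta = beta_posterior(2, 2, heads=7, tails=3)
mean = beta_mean(alpha, beta)          # 9 / 14
low, high = credible_interval(alpha, beta)

print(alpha, beta, mean, low, high)
```

The posterior is $\mathrm{Beta}(9,5)$ with mean $9/14\approx0.64$, pulled slightly toward $0.5$ relative to the raw frequency $0.7$ by the prior pseudo-counts.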

05

Expectation, Variance, Covariance

Expectation

$$\mathbb{E}_p[f(X)] = \int f(x)\, p(x)\, dx.$$

Linear: $\mathbb{E}[\alpha X + \beta Y] = \alpha\mathbb{E}[X] + \beta\mathbb{E}[Y]$ — with or without independence.

Variance and covariance

$$\mathbb{V}[X] = \mathbb{E}[(X - \mathbb{E}[X])^2], \qquad \mathrm{Cov}(X,Y) = \mathbb{E}[(X-\mathbb{E}X)(Y-\mathbb{E}Y)].$$

For a random vector $\mathbf{X}$ the covariance is the matrix $\Sigma = \mathbb{E}[(\mathbf{X}-\boldsymbol\mu)(\mathbf{X}-\boldsymbol\mu)^\top]$ — symmetric and PSD.

Useful identities

$\mathbb{V}[\alpha X + \beta Y] = \alpha^2\mathbb{V}[X] + \beta^2\mathbb{V}[Y] + 2\alpha\beta\,\mathrm{Cov}(X,Y)$;
For a matrix $A$: $\mathrm{Cov}(A\mathbf{X}) = A\,\mathrm{Cov}(\mathbf{X})\,A^\top$;
Independence implies $\mathrm{Cov} = 0$, but $\mathrm{Cov} = 0$ doesn't imply independence (unless $X$ and $Y$ are jointly Gaussian).
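The linearity and variance identities can be checked exactly on a small discrete joint distribution. The joint pmf and the coefficients below are made-up illustrative values.

```python
# Joint pmf p(x, y) on a 2x2 grid (illustrative values summing to 1).
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def expect(f):
    """E[f(X, Y)] under the joint pmf."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mean_x) ** 2)
var_y = expect(lambda x, y: (y - mean_y) ** 2)
cov_xy = expect(lambda x, y: (x - mean_x) * (y - mean_y))

a, b = 2.0, -3.0

# Linearity of expectation (X and Y are dependent here, cov_xy != 0).
mean_z = expect(lambda x, y: a * x + b * y)

# Variance of the linear combination, computed directly and via the identity.
var_z_direct = expect(lambda x, y: (a * x + b * y - mean_z) ** 2)
var_z_identity = a**2 * var_x + b**2 * var_y + 2 * a * b * cov_xy

print(mean_z, var_z_direct, var_z_identity)
```

Both routes give the same variance, and the cross term $2\alpha\beta\,\mathrm{Cov}(X,Y)$ is what would be lost by wrongly assuming independence.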

06

The Univariate and Multivariate Gaussian

Univariate

$$p(x\mid\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

Multivariate

$$p(\mathbf{x}\mid\boldsymbol\mu,\Sigma) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\!\left(-\tfrac12 (\mathbf{x}-\boldsymbol\mu)^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\right).$$

Three facts you'll use constantly:

Sum: if $\mathbf{X}\sim\mathcal{N}(\boldsymbol\mu_1,\Sigma_1)$ and $\mathbf{Y}\sim\mathcal{N}(\boldsymbol\mu_2,\Sigma_2)$ are independent, then $\mathbf{X}+\mathbf{Y}\sim\mathcal{N}(\boldsymbol\mu_1+\boldsymbol\mu_2, \Sigma_1+\Sigma_2)$.
Affine: if $\mathbf{X}\sim\mathcal{N}(\boldsymbol\mu,\Sigma)$, then $A\mathbf{X}+\mathbf{c}\sim\mathcal{N}(A\boldsymbol\mu+\mathbf{c}, A\Sigma A^\top)$.
Product: the product of two Gaussian densities in $\mathbf{x}$ is proportional to a Gaussian density in $\mathbf{x}$, so a Gaussian likelihood combined with a Gaussian prior on $\mathbf{x}$ gives a Gaussian posterior.
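For reference, the product fact can be written out explicitly. The symbols $\mathbf a, \mathbf b, A, B, \mathbf m, C, c$ below are notation introduced here for illustration:

$$\mathcal{N}(\mathbf{x}\mid\mathbf{a},A)\,\mathcal{N}(\mathbf{x}\mid\mathbf{b},B) = c\,\mathcal{N}(\mathbf{x}\mid\mathbf{m},C), \qquad C = (A^{-1}+B^{-1})^{-1}, \quad \mathbf{m} = C\,(A^{-1}\mathbf{a}+B^{-1}\mathbf{b}),$$

with a constant $c = \mathcal{N}(\mathbf{a}\mid\mathbf{b},A+B)$ that does not depend on $\mathbf{x}$. Precisions add, and the mean is the precision-weighted average. A univariate example: multiplying $\mathcal{N}(x\mid 0,1)$ by $\mathcal{N}(x\mid 2,1)$ gives precision $1+1=2$, so

$$C = \tfrac12, \qquad m = \tfrac12\,(0 + 2) = 1.$$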

Why the Gaussian is everywhere

It is the maximum-entropy distribution given a mean and covariance; the suitably normalised sum of many independent, finite-variance terms tends to a Gaussian (CLT); and the family is closed under marginalisation, conditioning, sums of independent variables, and linear maps. That algebraic closure is why deck 09 (Bayesian linear regression) has any closed-form solution at all.

07

Interactive: Bivariate Gaussian Explorer

Drag the cross to move $\boldsymbol\mu$. Use the sliders to set $\sigma_x$, $\sigma_y$ and the correlation $\rho\in(-1,1)$. The density contours, principal axes (eigenvectors of $\Sigma$) and marginal densities update live.

[Interactive widget: sliders for σ_x (default 1.00), σ_y (default 0.60) and correlation ρ (default 0.60).]

The contours are ellipses whose axes point along the eigenvectors of $\Sigma$, with half-lengths proportional to the square roots of the eigenvalues. When $\rho=0$ the axes are aligned with $(x,y)$; increasing $|\rho|$ rotates the ellipse and elongates it along its major axis.
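What the explorer draws can be reproduced by hand for the $2\times2$ case. This is a sketch using the closed-form eigen-decomposition of a symmetric $2\times2$ matrix, not the widget's actual code; the function names are hypothetical, and the inputs are the slider defaults $\sigma_x=1.0$, $\sigma_y=0.6$, $\rho=0.6$.

```python
import math

def covariance_2d(sigma_x, sigma_y, rho):
    """Build the 2x2 covariance matrix from standard deviations and correlation."""
    off = rho * sigma_x * sigma_y
    return [[sigma_x**2, off], [off, sigma_y**2]]

def principal_axes(cov):
    """Eigenvalues (major, minor) and major-axis angle of a symmetric 2x2 matrix."""
    a, b, d = cov[0][0], cov[0][1], cov[1][1]
    half_trace = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    lam_major, lam_minor = half_trace + radius, half_trace - radius
    # Angle of the major axis measured from the x-axis, in radians.
    angle = 0.5 * math.atan2(2 * b, a - d)
    return lam_major, lam_minor, angle

cov = covariance_2d(1.0, 0.6, 0.6)
lam_major, lam_minor, angle = principal_axes(cov)

# Unit eigenvector along the major axis.
v = (math.cos(angle), math.sin(angle))

print(cov, lam_major, lam_minor, math.degrees(angle))
```

Setting `rho = 0` with $\sigma_x > \sigma_y$ gives an angle of zero, so the major axis lies along $x$, consistent with the axis-aligned case described above.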

08

Marginal and Conditional Gaussians

Partition a joint Gaussian $\mathbf{X} = (\mathbf{X}_a, \mathbf{X}_b)$ with

$$\boldsymbol\mu = \begin{bmatrix}\boldsymbol\mu_a\\\boldsymbol\mu_b\end{bmatrix}, \quad \Sigma = \begin{bmatrix}\Sigma_{aa} & \Sigma_{ab}\\\Sigma_{ba} & \Sigma_{bb}\end{bmatrix}.$$

Marginal

$\mathbf{X}_a \sim \mathcal{N}(\boldsymbol\mu_a, \Sigma_{aa})$. Just throw away the rows and columns you don't want.

Conditional

$$\mathbf{X}_a \mid \mathbf{X}_b=\mathbf{x}_b \sim \mathcal{N}\bigl(\boldsymbol\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(\mathbf{x}_b-\boldsymbol\mu_b),\; \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\bigr).$$

The conditional mean is the linear regression of $\mathbf{X}_a$ on $\mathbf{X}_b$; the conditional covariance is the Schur complement of $\Sigma_{bb}$ in $\Sigma$.
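For a bivariate Gaussian every block is a scalar, so the conditional formulas reduce to a few arithmetic operations. The sketch below assumes scalar blocks; the function name and the numbers ($\sigma_a=1$, $\sigma_b=0.6$, $\rho=0.6$, zero means, observed $x_b=0.6$) are illustrative.

```python
def condition_bivariate(mu_a, mu_b, s_aa, s_ab, s_bb, x_b):
    """Mean and variance of X_a | X_b = x_b for scalar blocks of a joint Gaussian."""
    gain = s_ab / s_bb                      # Sigma_ab Sigma_bb^{-1}
    cond_mean = mu_a + gain * (x_b - mu_b)  # linear regression of X_a on X_b
    cond_var = s_aa - gain * s_ab           # Schur complement (s_ba == s_ab)
    return cond_mean, cond_var

# Sigma = [[1.0, 0.36], [0.36, 0.36]] corresponds to sigma_a = 1, sigma_b = 0.6, rho = 0.6.
cond_mean, cond_var = condition_bivariate(
    mu_a=0.0, mu_b=0.0, s_aa=1.0, s_ab=0.36, s_bb=0.36, x_b=0.6
)

print(cond_mean, cond_var)
```

The conditional variance equals $\sigma_a^2(1-\rho^2) = 0.64$: it does not depend on the observed value $x_b$ and is never larger than the marginal variance $\Sigma_{aa}$.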

Linear regression preview

The Bayesian-linear-regression posterior of deck 09 is exactly the conditional distribution of the weights given the observed targets, when the two are jointly Gaussian. The formulae on this slide give it in one line.

09

Conjugacy and the Exponential Family

Conjugate prior

A prior $p(\theta)$ is conjugate to a likelihood $p(\mathcal D\mid\theta)$ if the posterior is in the same family as the prior. The update is then a simple parameter shift — no integrals.

Likelihood	Conjugate prior	Update
Bernoulli	Beta($\alpha,\beta$)	$\alpha\to\alpha+h$, $\beta\to\beta+t$
Gaussian (known $\sigma^2$)	Gaussian on $\mu$	standard formula in deck 09
Gaussian (known $\mu$)	Inverse-Gamma($a,b$) on $\sigma^2$	$a\to a+\tfrac{n}{2}$, $b\to b+\tfrac12\sum_i (x_i-\mu)^2$
Multivariate Gaussian	Normal-Inverse-Wishart	Joint update on $\boldsymbol\mu,\Sigma$
Categorical	Dirichlet	$\alpha_k\to\alpha_k + n_k$
Poisson	Gamma($a,b$) with rate $b$	$a\to a+\sum_i x_i$, $b\to b+n$

The exponential family

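A family with density $p(x\mid\boldsymbol\theta) = h(x)\exp(\boldsymbol\eta(\boldsymbol\theta)^\top T(x) - A(\boldsymbol\theta))$ automatically has a conjugate prior, which can be read off from the sufficient statistic $T$ and the log-partition function $A$: $p(\boldsymbol\theta\mid\boldsymbol\chi,\nu)\propto\exp(\boldsymbol\eta(\boldsymbol\theta)^\top\boldsymbol\chi - \nu A(\boldsymbol\theta))$. All the distributions in the table above are exponential-family members.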
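As a worked instance, the Bernoulli likelihood of the coin example can be put in this form. The hyperparameter names $\chi$ and $\nu$ are notation assumed here for illustration.

$$p(x\mid\theta) = \theta^{x}(1-\theta)^{1-x} = \exp\!\Bigl(x\,\log\tfrac{\theta}{1-\theta} + \log(1-\theta)\Bigr),$$

so $h(x)=1$, $T(x)=x$, $\eta(\theta)=\log\tfrac{\theta}{1-\theta}$ and $A(\theta)=-\log(1-\theta)$. A prior with the same functional form in $\theta$ is

$$p(\theta\mid\chi,\nu)\propto\exp\bigl(\chi\,\eta(\theta)-\nu A(\theta)\bigr) = \theta^{\chi}(1-\theta)^{\nu-\chi},$$

which is a $\mathrm{Beta}(\chi+1,\ \nu-\chi+1)$ density. Multiplying by one more Bernoulli observation $x$ sends $\chi\to\chi+x$ and $\nu\to\nu+1$, which is the pseudo-count update from the table.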

10

Change of Variables

Push-forward of a density under an invertible $\mathbf{y} = \mathbf{g}(\mathbf{x})$ with Jacobian $J_\mathbf{g}$:

$$p_Y(\mathbf{y}) = p_X(\mathbf{g}^{-1}(\mathbf{y}))\,\bigl|\det J_{\mathbf{g}^{-1}}(\mathbf{y})\bigr| = p_X(\mathbf{g}^{-1}(\mathbf{y}))\,\bigl|\det J_{\mathbf{g}}(\mathbf{g}^{-1}(\mathbf{y}))\bigr|^{-1}.$$

For an affine $\mathbf{y} = A\mathbf{x}+\mathbf{c}$ with $A$ invertible:

$$p_Y(\mathbf{y}) = p_X(A^{-1}(\mathbf{y}-\mathbf{c}))\,|\det A|^{-1}.$$
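A one-dimensional check of the affine formula, where $A$ becomes a nonzero scalar `a` and $|\det A|$ becomes `abs(a)`. The helper name and the values $a=-2$, $c=3$ are illustrative. For a standard normal $X$ the affine rule says $Y = aX + c \sim \mathcal{N}(c, a^2)$, which gives an independent reference density.

```python
from statistics import NormalDist

def affine_pushforward_pdf(p_x, a, c):
    """Density of Y = a*X + c, given the density p_x of X (a must be nonzero)."""
    return lambda y: p_x((y - c) / a) / abs(a)

standard = NormalDist(0.0, 1.0)
a, c = -2.0, 3.0

# Density of Y from the change-of-variables formula.
p_y = affine_pushforward_pdf(standard.pdf, a, c)

# Reference: Y ~ N(c, a^2), i.e. standard deviation |a|.
expected = NormalDist(mu=c, sigma=abs(a))

test_points = [-1.0, 0.0, 3.0, 4.5]
max_gap = max(abs(p_y(y) - expected.pdf(y)) for y in test_points)
print(max_gap)
```

Dropping the `1 / abs(a)` factor would leave a function that no longer integrates to 1, which is the usual symptom of a forgotten Jacobian.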

Reparameterisation

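To sample $\mathbf{x}\sim\mathcal{N}(\boldsymbol\mu, \Sigma)$, sample $\mathbf{z}\sim\mathcal{N}(\mathbf{0}, I)$ and return $\boldsymbol\mu + L\mathbf{z}$, where $\Sigma = LL^\top$ is the Cholesky factorisation. By the affine rule, $\boldsymbol\mu + L\mathbf{z}$ has mean $\boldsymbol\mu$ and covariance $L I L^\top = \Sigma$. This is the reparameterisation trick behind VAEs and other generative models that backpropagate through sampling.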
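A dependency-free sketch of this sampler in two dimensions. The $2\times2$ Cholesky factor is written out by hand; the helper names, the mean and the covariance are illustrative assumptions, and in practice a linear-algebra library would supply the factorisation.

```python
import math
import random

def cholesky_2x2(cov):
    """Lower-triangular L with L L^T = cov, for a 2x2 positive-definite cov."""
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21**2)
    return [[l11, 0.0], [l21, l22]]

def sample_gaussian_2d(mu, chol, rng):
    """Return mu + L z with z ~ N(0, I); all randomness lives in z."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mu[0] + chol[0][0] * z1,
            mu[1] + chol[1][0] * z1 + chol[1][1] * z2)

rng = random.Random(0)
mu = (1.0, -1.0)
cov = [[1.0, 0.36], [0.36, 0.36]]
chol = cholesky_2x2(cov)
samples = [sample_gaussian_2d(mu, chol, rng) for _ in range(50_000)]

print(chol)
```

Because the sample is a deterministic function of $(\boldsymbol\mu, L)$ and the parameter-free noise $\mathbf{z}$, gradients with respect to $\boldsymbol\mu$ and $L$ can flow through it.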

Probability integral transform

For any $X$ with continuous CDF $F_X$, $F_X(X)\sim\mathrm{Uniform}(0,1)$. Inverting: if $U\sim\mathrm{Uniform}(0,1)$, then $X = F_X^{-1}(U)$ has CDF $F_X$. This underlies inverse-transform sampling.
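A minimal sketch of inverse-transform sampling for the Exponential distribution, whose CDF $F(x) = 1 - e^{-\lambda x}$ inverts in closed form to $F^{-1}(u) = -\ln(1-u)/\lambda$. The rate $\lambda = 2$, the seed and the sample size are illustrative choices.

```python
import math
import random

def sample_exponential(rate, rng):
    """Draw from Exponential(rate) by pushing a uniform through the inverse CDF."""
    u = rng.random()               # U ~ Uniform[0, 1)
    return -math.log1p(-u) / rate  # F^{-1}(u) = -ln(1 - u) / rate

rng = random.Random(0)
rate = 2.0
draws = [sample_exponential(rate, rng) for _ in range(100_000)]

# The Exponential(rate) mean is 1 / rate, so this should be close to 0.5.
sample_mean = sum(draws) / len(draws)
print(sample_mean)
```

Using `log1p(-u)` rather than `log(1 - u)` keeps precision when `u` is close to zero; since `random()` never returns exactly 1, the logarithm is always finite.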

11

Where This Lands in Part II

Concept	Used by	Where
Likelihood & prior	Every Bayesian Part II algorithm	MAP / posterior inference
Conjugate Gauss-Gauss update	Bayesian linear regression (ch. 9)	Closed-form posterior on weights
Multivariate Gaussian conditional	Linear regression, GP	Predictive distribution
Gaussian density	GMM (ch. 11)	Mixture components
Reparameterisation	Generative models	Backprop through sampling
Exponential family	Generalised linear models	Natural parameterisation

12

Cheat Sheet

Quantity	Formula
sum rule	$p(x) = \int p(x,y)\,dy$
product rule	$p(x,y) = p(x\mid y)p(y)$
Bayes	$p(\theta\mid\mathcal D)\propto p(\mathcal D\mid\theta)p(\theta)$
Gaussian density	$\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}e^{-\tfrac12 (\mathbf x-\boldsymbol\mu)^\top\Sigma^{-1}(\mathbf x-\boldsymbol\mu)}$
affine of Gaussian	$A\mathbf X+\mathbf c\sim\mathcal N(A\boldsymbol\mu+\mathbf c, A\Sigma A^\top)$
Gaussian sum	$\mathcal N(\boldsymbol\mu_1,\Sigma_1)+\mathcal N(\boldsymbol\mu_2,\Sigma_2)=\mathcal N(\boldsymbol\mu_1+\boldsymbol\mu_2,\Sigma_1+\Sigma_2)$
conditional Gaussian mean	$\boldsymbol\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(\mathbf x_b-\boldsymbol\mu_b)$
conditional Gaussian covariance	$\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$
Beta-Bernoulli update	$\alpha\to\alpha+h,\;\beta\to\beta+t$
change of variables	$p_Y(\mathbf y)=p_X(\mathbf g^{-1}(\mathbf y))\,\bigl|\det J_{\mathbf g^{-1}}(\mathbf y)\bigr|$