Mathematics for Machine Learning — Deck 11

Density Estimation with GMMs

Mix Gaussians, introduce latent assignments, derive EM as a lower-bound maximiser — and watch it cluster a 2D scatter live.

mixturelatent z EMresponsibilities ELBOBIC
00

Topics We'll Cover

01

Why Mix Gaussians

A single Gaussian is unimodal. Real-world densities (e.g. height of adults, distribution of pixel intensity in a face image, embeddings clustering by topic) are multi-modal. The simplest fix: write the density as a weighted sum of Gaussians.

What you get

  • Multi-modal density estimator;
  • Soft clustering (each point partly belongs to each cluster);
  • Generative model: sample $\mathbf{x}\sim p(\mathbf{x}\mid\boldsymbol\theta)$;
  • Closed-form M-step under EM.

What you trade away

  • The likelihood is no longer concave — multiple local optima;
  • Identifiability up to permutation of cluster labels;
  • Cluster covariances can collapse to a point on a single sample (singularity);
  • Choosing $K$ is now part of the problem.
02

The Mixture Density

$$p(\mathbf{x}\mid\boldsymbol\theta) = \sum_{k=1}^K \pi_k \,\mathcal N(\mathbf{x}\mid\boldsymbol\mu_k, \Sigma_k),$$

with mixing weights $\pi_k\ge 0$, $\sum_k \pi_k = 1$, means $\boldsymbol\mu_k\in\mathbb{R}^D$, and covariances $\Sigma_k \succeq 0$. Parameters: $\boldsymbol\theta = \{\pi_k, \boldsymbol\mu_k, \Sigma_k\}_{k=1}^K$.

Generative reading

  1. Sample a cluster $z\in\{1,\dots,K\}$ with $\mathbb{P}(z=k) = \pi_k$.
  2. Sample $\mathbf{x}\mid z=k \sim \mathcal N(\boldsymbol\mu_k, \Sigma_k)$.

The marginal $p(\mathbf{x})$ is exactly the mixture density on top — that's the sum rule from deck 06.

03

The Latent-Variable View

Introduce a latent cluster assignment $z_n\in\{1,\dots,K\}$ for each data point. The complete-data likelihood factorises:

$$p(\mathbf{x}_n, z_n\mid\boldsymbol\theta) = \pi_{z_n}\,\mathcal N(\mathbf{x}_n\mid\boldsymbol\mu_{z_n}, \Sigma_{z_n}).$$

This is one term, not a sum, because we condition on $z_n$. If we knew $z_n$ for every point we could maximise this complete-data log-likelihood in closed form (just MLE for each cluster's Gaussian, plus counting for $\pi_k$).

The graphical model

Plate diagram: $\pi \to z_n \to \mathbf{x}_n$, with $\boldsymbol\mu_k, \Sigma_k$ feeding $\mathbf{x}_n$ via the chosen $z_n$. EM alternates between treating $z_n$ as known (M-step) and as unknown but averaged (E-step).

04

The Likelihood — Why Direct Optimisation is Hard

$$\log p(\mathcal D\mid\boldsymbol\theta) = \sum_{n=1}^N \log \sum_{k=1}^K \pi_k\,\mathcal N(\mathbf{x}_n\mid\boldsymbol\mu_k, \Sigma_k).$$

The "log of sum" prevents the log from splitting linearly across components — we cannot just write down the MLE for each cluster. The function is not concave in $\boldsymbol\theta$, has many local maxima, and the gradient w.r.t. $\boldsymbol\mu_k$ couples through every term.

Worse: with no constraint, sending one $\boldsymbol\mu_k$ to a data point and shrinking that component's covariance to zero makes the likelihood $\to \infty$ — a degenerate solution. EM and a small covariance floor handle this in practice.

The EM idea

Instead of maximising $\log p(\mathcal D\mid\boldsymbol\theta)$ directly, alternate between (E) estimating $p(z_n\mid\mathbf{x}_n,\boldsymbol\theta)$ and (M) maximising the expected complete-data log-likelihood. Each round strictly increases $\log p(\mathcal D\mid\boldsymbol\theta)$ until a stationary point.

05

EM from a Variational Lower Bound

For any distribution $q(z_n)$ on the latents, Jensen's inequality gives

$$\log p(\mathbf{x}_n\mid\boldsymbol\theta) = \log \sum_{z_n} p(\mathbf{x}_n, z_n\mid\boldsymbol\theta) \;\ge\; \sum_{z_n} q(z_n)\log\frac{p(\mathbf{x}_n, z_n\mid\boldsymbol\theta)}{q(z_n)} \;=\; \mathcal{L}(q, \boldsymbol\theta).$$

$\mathcal L$ is the evidence lower bound (ELBO). The gap to the true log-likelihood is exactly $\mathrm{KL}(q\,\|\,p(z_n\mid\mathbf{x}_n,\boldsymbol\theta))$.

EM = coordinate ascent on the ELBO

Each step increases $\log p(\mathcal D\mid\boldsymbol\theta)$ — the E-step closes the gap, the M-step pushes the bound up. EM monotonically increases the likelihood; it converges to a local optimum.

06

E-step: Responsibilities

By Bayes' theorem (deck 06):

$$r_{nk} \equiv p(z_n=k\mid\mathbf{x}_n,\boldsymbol\theta) = \frac{\pi_k\,\mathcal N(\mathbf{x}_n\mid\boldsymbol\mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j\,\mathcal N(\mathbf{x}_n\mid\boldsymbol\mu_j, \Sigma_j)}.$$

$r_{nk}\in[0,1]$ with $\sum_k r_{nk} = 1$. The responsibility $r_{nk}$ is "how much cluster $k$ owns point $n$" under the current parameters.

Connecting to $k$-means

If the Gaussians have identical isotropic covariance $\sigma^2 I$ and you let $\sigma^2\to 0$, the softmax-like responsibility collapses into a hard assignment ($r_{nk}=1$ for the nearest centroid, 0 elsewhere). EM on this degenerate GMM is exactly Lloyd's $k$-means algorithm.

07

M-step: Closed-Form Updates

Maximising the expected complete-data log-likelihood subject to $\sum_k \pi_k = 1$ gives, with $N_k = \sum_n r_{nk}$:

$$\pi_k = \frac{N_k}{N}, \qquad \boldsymbol\mu_k = \frac{1}{N_k}\sum_n r_{nk}\,\mathbf{x}_n, \qquad \Sigma_k = \frac{1}{N_k}\sum_n r_{nk}\,(\mathbf{x}_n - \boldsymbol\mu_k)(\mathbf{x}_n-\boldsymbol\mu_k)^\top.$$

Each component's mean and covariance are weighted MLEs of a Gaussian, with the responsibilities as the weights. The mixing weights are normalised counts.

Why the formulae are this simple

The expected complete-data log-likelihood is a sum over $n$ and $k$ of $r_{nk}\log\bigl[\pi_k\,\mathcal N(\mathbf{x}_n\mid\boldsymbol\mu_k,\Sigma_k)\bigr]$. For fixed $r$ it is a sum of independent weighted-Gaussian MLEs (one per cluster) plus a categorical MLE on $\pi$. The decoupling is the whole point of introducing the latents.

08

Interactive: 2D EM Animator

Click in the canvas to add data points. Press E-step to recompute responsibilities (colours soft-blend by component); press M-step to refit the means and covariances. The log-likelihood is reported below.

3
N
0
log-lik
iter
0

Watch what happens with $K$ larger than the truth: spurious components either grab a single point with a tiny covariance (degeneracy) or split a true cluster. With $K$ smaller, a single component covers two true clusters with a larger covariance — the bias-variance trade-off from deck 08 on a clustering problem.

09

Initialisation & Local Optima

Three common initialisation strategies:

  1. Random: draw initial $\boldsymbol\mu_k$ from the data uniformly. Cheap; often fine.
  2. $k$-means++: draw the first centre uniformly; each subsequent centre with probability proportional to squared distance to the nearest existing centre. Spreads the centres out; theoretically near-optimal in expectation.
  3. Multiple restarts: run EM from $R$ different inits and keep the run with highest likelihood.

Without a covariance floor or a Bayesian prior on $\Sigma_k$, a component can latch onto a single point and shrink — the likelihood diverges. The simple fix is to add a small $\epsilon I$ to every M-step covariance estimate; the principled fix is an inverse-Wishart prior on $\Sigma_k$.

10

Choosing $K$

MethodIdeaNotes
BIC$-2\log\hat L + p\log N$, $p$ = parameter countCheap; penalises overfitting
AIC$-2\log\hat L + 2p$Less aggressive penalty; tends to over-pick $K$
Cross-validated log-likelihoodFit on training, score on held-outMost generally trustworthy
Variational Bayes / Dirichlet processPrior over $K$; many components get $\pi_k\approx 0$Automatically prunes; needs more machinery

For the book's running example, BIC works well; for image / embedding clusters it's not unusual to see hundreds or thousands of components, and BIC drastically under-picks.

11

Cheat Sheet

QuantityFormula
density$p(\mathbf x) = \sum_k \pi_k \mathcal N(\mathbf x\mid\boldsymbol\mu_k,\Sigma_k)$
responsibility$r_{nk} = \pi_k\mathcal N(\mathbf x_n\mid\boldsymbol\mu_k,\Sigma_k) / \sum_j(\dots)$
$N_k$$\sum_n r_{nk}$
$\pi_k$ update$N_k / N$
$\boldsymbol\mu_k$ update$\tfrac1{N_k}\sum_n r_{nk}\mathbf x_n$
$\Sigma_k$ update$\tfrac1{N_k}\sum_n r_{nk}(\mathbf x_n-\boldsymbol\mu_k)(\mathbf x_n-\boldsymbol\mu_k)^\top$
ELBO$\sum_n \mathbb E_{q_n}\log p(\mathbf x_n, z_n\mid\boldsymbol\theta) - \mathbb E_{q_n}\log q_n$
EM monotonicity$\log p(\mathcal D\mid\boldsymbol\theta_{t+1}) \ge \log p(\mathcal D\mid\boldsymbol\theta_t)$
$k$-means limitEM with shared isotropic $\Sigma\to 0$