Mix Gaussians, introduce latent assignments, derive EM as a lower-bound maximiser — and watch it cluster a 2D scatter live.
A single Gaussian is unimodal. Real-world densities (e.g. height of adults, distribution of pixel intensity in a face image, embeddings clustering by topic) are multi-modal. The simplest fix: write the density as a weighted sum of Gaussians.
$$p(\mathbf{x}\mid\boldsymbol\theta) = \sum_{k=1}^K \pi_k \,\mathcal N(\mathbf{x}\mid\boldsymbol\mu_k, \Sigma_k),$$
with mixing weights $\pi_k\ge 0$, $\sum_k \pi_k = 1$, means $\boldsymbol\mu_k\in\mathbb{R}^D$, and covariances $\Sigma_k \succeq 0$. Parameters: $\boldsymbol\theta = \{\pi_k, \boldsymbol\mu_k, \Sigma_k\}_{k=1}^K$.
The marginal $p(\mathbf{x})$ is exactly the mixture density on top — that's the sum rule from deck 06.
Introduce a latent cluster assignment $z_n\in\{1,\dots,K\}$ for each data point. The complete-data likelihood factorises:
$$p(\mathbf{x}_n, z_n\mid\boldsymbol\theta) = \pi_{z_n}\,\mathcal N(\mathbf{x}_n\mid\boldsymbol\mu_{z_n}, \Sigma_{z_n}).$$
This is one term, not a sum, because we condition on $z_n$. If we knew $z_n$ for every point we could maximise this complete-data log-likelihood in closed form (just MLE for each cluster's Gaussian, plus counting for $\pi_k$).
Plate diagram: $\pi \to z_n \to \mathbf{x}_n$, with $\boldsymbol\mu_k, \Sigma_k$ feeding $\mathbf{x}_n$ via the chosen $z_n$. EM alternates between treating $z_n$ as known (M-step) and as unknown but averaged (E-step).
$$\log p(\mathcal D\mid\boldsymbol\theta) = \sum_{n=1}^N \log \sum_{k=1}^K \pi_k\,\mathcal N(\mathbf{x}_n\mid\boldsymbol\mu_k, \Sigma_k).$$
The "log of sum" prevents the log from splitting linearly across components — we cannot just write down the MLE for each cluster. The function is not concave in $\boldsymbol\theta$, has many local maxima, and the gradient w.r.t. $\boldsymbol\mu_k$ couples through every term.
Worse: with no constraint, sending one $\boldsymbol\mu_k$ to a data point and shrinking that component's covariance to zero makes the likelihood $\to \infty$ — a degenerate solution. EM and a small covariance floor handle this in practice.
Instead of maximising $\log p(\mathcal D\mid\boldsymbol\theta)$ directly, alternate between (E) estimating $p(z_n\mid\mathbf{x}_n,\boldsymbol\theta)$ and (M) maximising the expected complete-data log-likelihood. Each round strictly increases $\log p(\mathcal D\mid\boldsymbol\theta)$ until a stationary point.
For any distribution $q(z_n)$ on the latents, Jensen's inequality gives
$$\log p(\mathbf{x}_n\mid\boldsymbol\theta) = \log \sum_{z_n} p(\mathbf{x}_n, z_n\mid\boldsymbol\theta) \;\ge\; \sum_{z_n} q(z_n)\log\frac{p(\mathbf{x}_n, z_n\mid\boldsymbol\theta)}{q(z_n)} \;=\; \mathcal{L}(q, \boldsymbol\theta).$$
$\mathcal L$ is the evidence lower bound (ELBO). The gap to the true log-likelihood is exactly $\mathrm{KL}(q\,\|\,p(z_n\mid\mathbf{x}_n,\boldsymbol\theta))$.
Each step increases $\log p(\mathcal D\mid\boldsymbol\theta)$ — the E-step closes the gap, the M-step pushes the bound up. EM monotonically increases the likelihood; it converges to a local optimum.
By Bayes' theorem (deck 06):
$$r_{nk} \equiv p(z_n=k\mid\mathbf{x}_n,\boldsymbol\theta) = \frac{\pi_k\,\mathcal N(\mathbf{x}_n\mid\boldsymbol\mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j\,\mathcal N(\mathbf{x}_n\mid\boldsymbol\mu_j, \Sigma_j)}.$$
$r_{nk}\in[0,1]$ with $\sum_k r_{nk} = 1$. The responsibility $r_{nk}$ is "how much cluster $k$ owns point $n$" under the current parameters.
If the Gaussians have identical isotropic covariance $\sigma^2 I$ and you let $\sigma^2\to 0$, the softmax-like responsibility collapses into a hard assignment ($r_{nk}=1$ for the nearest centroid, 0 elsewhere). EM on this degenerate GMM is exactly Lloyd's $k$-means algorithm.
Maximising the expected complete-data log-likelihood subject to $\sum_k \pi_k = 1$ gives, with $N_k = \sum_n r_{nk}$:
$$\pi_k = \frac{N_k}{N}, \qquad \boldsymbol\mu_k = \frac{1}{N_k}\sum_n r_{nk}\,\mathbf{x}_n, \qquad \Sigma_k = \frac{1}{N_k}\sum_n r_{nk}\,(\mathbf{x}_n - \boldsymbol\mu_k)(\mathbf{x}_n-\boldsymbol\mu_k)^\top.$$
Each component's mean and covariance are weighted MLEs of a Gaussian, with the responsibilities as the weights. The mixing weights are normalised counts.
The expected complete-data log-likelihood is a sum over $n$ and $k$ of $r_{nk}\log\bigl[\pi_k\,\mathcal N(\mathbf{x}_n\mid\boldsymbol\mu_k,\Sigma_k)\bigr]$. For fixed $r$ it is a sum of independent weighted-Gaussian MLEs (one per cluster) plus a categorical MLE on $\pi$. The decoupling is the whole point of introducing the latents.
Click in the canvas to add data points. Press E-step to recompute responsibilities (colours soft-blend by component); press M-step to refit the means and covariances. The log-likelihood is reported below.
Watch what happens with $K$ larger than the truth: spurious components either grab a single point with a tiny covariance (degeneracy) or split a true cluster. With $K$ smaller, a single component covers two true clusters with a larger covariance — the bias-variance trade-off from deck 08 on a clustering problem.
Three common initialisation strategies:
Without a covariance floor or a Bayesian prior on $\Sigma_k$, a component can latch onto a single point and shrink — the likelihood diverges. The simple fix is to add a small $\epsilon I$ to every M-step covariance estimate; the principled fix is an inverse-Wishart prior on $\Sigma_k$.
| Method | Idea | Notes |
|---|---|---|
| BIC | $-2\log\hat L + p\log N$, $p$ = parameter count | Cheap; penalises overfitting |
| AIC | $-2\log\hat L + 2p$ | Less aggressive penalty; tends to over-pick $K$ |
| Cross-validated log-likelihood | Fit on training, score on held-out | Most generally trustworthy |
| Variational Bayes / Dirichlet process | Prior over $K$; many components get $\pi_k\approx 0$ | Automatically prunes; needs more machinery |
For the book's running example, BIC works well; for image / embedding clusters it's not unusual to see hundreds or thousands of components, and BIC drastically under-picks.
| Quantity | Formula |
|---|---|
| density | $p(\mathbf x) = \sum_k \pi_k \mathcal N(\mathbf x\mid\boldsymbol\mu_k,\Sigma_k)$ |
| responsibility | $r_{nk} = \pi_k\mathcal N(\mathbf x_n\mid\boldsymbol\mu_k,\Sigma_k) / \sum_j(\dots)$ |
| $N_k$ | $\sum_n r_{nk}$ |
| $\pi_k$ update | $N_k / N$ |
| $\boldsymbol\mu_k$ update | $\tfrac1{N_k}\sum_n r_{nk}\mathbf x_n$ |
| $\Sigma_k$ update | $\tfrac1{N_k}\sum_n r_{nk}(\mathbf x_n-\boldsymbol\mu_k)(\mathbf x_n-\boldsymbol\mu_k)^\top$ |
| ELBO | $\sum_n \mathbb E_{q_n}\log p(\mathbf x_n, z_n\mid\boldsymbol\theta) - \mathbb E_{q_n}\log q_n$ |
| EM monotonicity | $\log p(\mathcal D\mid\boldsymbol\theta_{t+1}) \ge \log p(\mathcal D\mid\boldsymbol\theta_t)$ |
| $k$-means limit | EM with shared isotropic $\Sigma\to 0$ |