A map of the book: which mathematics, for which machine learning problem, and in what order.
Deisenroth, Faisal & Ong frame the book around a single observation:
"Most machine learning algorithms can be cast as a combination of four pillars — regression, dimensionality reduction, density estimation and classification — and each of them is built from four mathematical foundations: linear algebra, analytic geometry & vector calculus, probability & distributions, and continuous optimisation."
— paraphrased from the book's prefacePart I (chapters 2–7) develops the four foundations from scratch. Part II (chapters 8–12) takes each algorithm in turn and re-derives it using the tools of Part I — not as a list of recipes, but as something that had to look this way.
This deck series follows the same arc. Decks 02–07 visualise the foundations; decks 08–12 visualise how the four algorithms are built from them. The chapter and section numbers in every deck correspond to the book, so you can flip between them.
A theme of the book is that every ML algorithm can be cast in two equivalent forms, and the choice of form is largely a matter of taste — until you need uncertainty.
A deterministic map $f:\mathbb{R}^D \to \mathbb{R}$ or $\{1,\dots,K\}$.
$$\hat y = f(\mathbf{x};\boldsymbol\theta)$$
"Given $\mathbf{x}$, return the best guess for $y$." Risk-minimisation, regularisation, the bias-variance picture all live here.
A conditional distribution $p(y\mid\mathbf{x};\boldsymbol\theta)$, or a joint $p(\mathbf{x},y\mid\boldsymbol\theta)$.
$$\hat y \sim p(y\mid \mathbf{x};\boldsymbol\theta)$$
Same predictor at the mode — but now you also have a posterior, an entropy, and a calibrated uncertainty.
The book gives both versions of every Part II algorithm. The companion does the same: where it matters, you'll see the deterministic picture and the Bayesian posterior side by side.
The book's section 1.1 walks through the standard ML pipeline. Every later algorithm is some specialisation of this picture.
Raw observations $\{(\mathbf{x}_n, y_n)\}_{n=1}^N$. Numerical, after preprocessing.
$\boldsymbol\phi(\mathbf{x}_n)$ — a vector representation suitable for the model class.
A function (or distribution) class parameterised by $\boldsymbol\theta$.
The instance $f(\,\cdot\,;\hat{\boldsymbol\theta})$ produced once training has chosen $\hat{\boldsymbol\theta}$.
Training is the inverse problem: given data, find $\boldsymbol\theta$ that makes the model fit well, without overfitting. That single sentence hides three of the book's deepest themes — the loss / likelihood, regularisation / prior, and generalisation. Chapter 8 (deck 08) makes them explicit.
Six chapters, four foundations. The book splits "analytic geometry & vector calculus" across two chapters, and adds optimisation as a fifth foundation; we group them here exactly as the book does.
Vectors, matrices, systems of equations, vector spaces, basis & rank, linear maps. The language every later chapter uses.
Inner products, norms, projections, rotations, then determinants, eigenvalues, Cholesky, SVD. The structure theorems.
Partial derivatives, gradients, Jacobians, chain rule, backpropagation, multivariate Taylor.
Sample spaces, sum / product / Bayes, the Gaussian; then gradient descent, Lagrange multipliers, convex optimisation.
None of these chapters mention machine learning — but every example is chosen because Part II will use it. The Gaussian gets ten pages because deck 11 needs every line of them.
Each Part II chapter takes one algorithm and re-derives it. The structure of every chapter is the same: state the problem, write the model, set up the loss / likelihood, solve, then critique.
| Algorithm | Pillar | Heaviest Part I dependencies |
|---|---|---|
| Linear regression (ch. 9) | Regression | linear algebra (ch. 2), analytic geometry (ch. 3 — projection), probability (ch. 6 — Gaussian) |
| PCA (ch. 10) | Dim. reduction | matrix decompositions (ch. 4 — eigendecomposition, SVD), analytic geometry (ch. 3 — orthogonal projection) |
| GMM (ch. 11) | Density estimation | probability (ch. 6 — Gaussian, marginalisation), optimisation (ch. 7 — EM) |
| SVM (ch. 12) | Classification | optimisation (ch. 7 — Lagrange dual, convexity), analytic geometry (ch. 3 — hyperplane distance) |
Notice that every Part II algorithm uses at least two Part I chapters. This is why the book is the size it is: you can't shortcut to PCA without knowing what an eigenvector is.
Hover or tap any chapter node. Its prerequisites in Part I light up; the algorithms it feeds in Part II light up too. This is the same dependency graph the authors draw in figure 1.1 of the book, made live.
The graph is not quite a tree — chapter 9 (regression) depends on both projection (ch. 3) and the Gaussian (ch. 6), and the SVM in chapter 12 depends on chapter 3, 7 and a hint of chapter 6. The shortest path through the book is therefore not 1 → 2 → 3 → ... but rather a topological order over this DAG.
The book is careful about notation; so is the companion. The conventions below match the book exactly.
| Symbol | Meaning |
|---|---|
| $\mathbf{x}, \mathbf{v}, \boldsymbol\theta$ | Column vectors (bold lowercase) |
| $A, B, \Sigma$ | Matrices (capital letters) |
| $\alpha, \lambda, \sigma$ | Scalars (Greek lowercase) |
| $\mathbb{R}^D$ | $D$-dimensional real space |
| $\mathcal{N}(\boldsymbol\mu, \Sigma)$ | Multivariate Gaussian with mean $\boldsymbol\mu$ and covariance $\Sigma$ |
| $p(\mathbf{x}\mid\boldsymbol\theta)$ | Conditional density of $\mathbf{x}$ given parameter $\boldsymbol\theta$ |
| $\mathbb{E}_p[\cdot]$, $\mathbb{V}_p[\cdot]$ | Expectation and variance under $p$ |
| $\langle \mathbf{u}, \mathbf{v}\rangle$ | Inner product; equals $\mathbf{u}^\top\mathbf{v}$ in $\mathbb{R}^D$ with the standard inner product |
| $\|\mathbf{v}\|$ | Norm of $\mathbf{v}$ — defaults to $\sqrt{\langle\mathbf{v},\mathbf{v}\rangle}$ unless specified |
Two minor abuses we accept (the book accepts them too): writing $p(\mathbf{x})$ for both a probability mass function and a density when context is clear, and using the same symbol for a function and its evaluation when no confusion can arise.
Keep the free PDF on one screen and the deck on another. Every deck's slide titles match the book's section titles.
Every deck has at least one interactive widget. The maths is in the book; the widget is what builds your intuition. Drag, scrub, click.
The book's end-of-chapter exercises are excellent. The deck does not reproduce them — it builds the geometric and numerical intuition that makes them tractable.
If you already know chapter 2, skim deck 02 for the conventions and dive into deck 03. The DAG on slide 06 tells you what is safe to skip.
Just to make the abstract talk concrete, here's the simplest model in the book, with every piece of vocabulary labelled.
$$y_n \;=\; \theta\, x_n \;+\; \varepsilon_n, \qquad \varepsilon_n \sim \mathcal{N}(0,\sigma^2)$$
Six chapters of the book just appeared in one paragraph. Every step is a thing we'll meet again at higher dimension and with more sophistication.
Each row links to the corresponding deck in this series. Build the foundations first; everything in Part II then lands easily.
| # | Chapter | Companion deck |
|---|---|---|
| 2 | Linear Algebra | MML_02 → |
| 3 | Analytic Geometry | MML_03 → |
| 4 | Matrix Decompositions | MML_04 → |
| 5 | Vector Calculus | MML_05 → |
| 6 | Probability & Distributions | MML_06 → |
| 7 | Continuous Optimisation | MML_07 → |
| 8 | When Models Meet Data | MML_08 → |
| 9 | Linear Regression | MML_09 → |
| 10 | PCA | MML_10 → |
| 11 | GMM | MML_11 → |
| 12 | SVM | MML_12 → |
Deck 02 picks up where the book's chapter 2 starts: a single system of linear equations, the question "when does it have a solution?", and the journey from there to vector spaces, basis, rank and linear maps. By the end of deck 02 we have all the machinery needed to write every later algorithm.