MML 01 — Introduction and Motivation

00

Topics We'll Cover

The book in one sentence
Two languages: predictor and probabilistic model
The data → features → predictor pipeline
Part I — the four foundations
Part II — the four algorithms
Interactive: the chapter dependency graph
Notation we'll use throughout
How to read the book and the deck together
A worked preview: a one-feature linear model
Chapter map and what's next

01

The Book in One Sentence

Deisenroth, Faisal & Ong frame the book around a single observation:

"Most machine learning algorithms can be cast as a combination of four pillars — regression, dimensionality reduction, density estimation and classification — and each of them is built from four mathematical foundations: linear algebra, analytic geometry & vector calculus, probability & distributions, and continuous optimisation."

— paraphrased from the book's preface

Part I (chapters 2–7) develops the four foundations from scratch. Part II (chapters 8–12) takes each algorithm in turn and re-derives it using the tools of Part I — not as a list of recipes, but as something that had to look this way.

Companion philosophy

This deck series follows the same arc. Decks 02–07 visualise the foundations; decks 08–12 visualise how the four algorithms are built from them. The chapter and section numbers in every deck correspond to the book, so you can flip between them.

02

Two Languages: Predictor and Probabilistic Model

A recurring theme of the book is that many ML algorithms can be written in two closely related forms, and the choice between them is largely a matter of taste until you need uncertainty.

1. Predictor function

A deterministic map $f:\mathbb{R}^D \to \mathbb{R}$ or $\{1,\dots,K\}$.

$$\hat y = f(\mathbf{x};\boldsymbol\theta)$$

"Given $\mathbf{x}$, return the best guess for $y$." Risk-minimisation, regularisation, the bias-variance picture all live here.

2. Probabilistic model

A conditional distribution $p(y\mid\mathbf{x};\boldsymbol\theta)$, or a joint $p(\mathbf{x},y\mid\boldsymbol\theta)$.

$$\hat y \sim p(y\mid \mathbf{x};\boldsymbol\theta)$$

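Taking the mode recovers a point prediction, but now you also have a full predictive distribution, its entropy, and (with a prior on $\boldsymbol\theta$) a posterior. The uncertainty is calibrated only if the model is well specified.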

The book presents both views in Part II wherever they apply. The companion does the same: where it matters, you'll see the deterministic picture and the Bayesian posterior side by side.
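The two views can be sketched for a one-dimensional linear model. This is only an illustration: the names `predict` and `log_density`, the parameter values, and the Gaussian noise assumption are choices made here, not taken from the book.

```python
import math

def predict(x, theta):
    """Predictor view: return a single best guess for y."""
    return theta * x

def log_density(y, x, theta, sigma):
    """Probabilistic view: log p(y | x; theta) for y ~ N(theta * x, sigma^2)."""
    mean = theta * x
    return -0.5 * math.log(2 * math.pi * sigma**2) - (y - mean) ** 2 / (2 * sigma**2)

theta, sigma, x = 2.0, 0.5, 1.5   # assumed values, for illustration only

y_hat = predict(x, theta)

# The mode of the Gaussian conditional sits at the predictor's output:
# scan a grid of candidate y values and keep the most probable one.
grid = [y_hat + 0.01 * k for k in range(-300, 301)]
y_mode = max(grid, key=lambda y: log_density(y, x, theta, sigma))

# The probabilistic view also carries a spread, here a roughly 95% interval.
interval = (y_hat - 1.96 * sigma, y_hat + 1.96 * sigma)
```

Under these assumptions `y_mode` coincides with `y_hat`, while `interval` is information the bare predictor cannot provide.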

03

The Data → Features → Predictor Pipeline

The book's section 1.1 walks through the standard ML pipeline. Every later algorithm is some specialisation of this picture.

Data

Raw observations $\{(\mathbf{x}_n, y_n)\}_{n=1}^N$. Numerical, after preprocessing.

Features

$\boldsymbol\phi(\mathbf{x}_n)$ — a vector representation suitable for the model class.

Model

A function (or distribution) class parameterised by $\boldsymbol\theta$.

Predictor

The instance $f(\,\cdot\,;\hat{\boldsymbol\theta})$ produced once training has chosen $\hat{\boldsymbol\theta}$.

Training is the inverse problem: given data, find $\boldsymbol\theta$ that makes the model fit well, without overfitting. That single sentence hides three of the book's deepest themes — the loss / likelihood, regularisation / prior, and generalisation. Chapter 8 (deck 08) makes them explicit.
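The four stages can be sketched end to end in a few lines. Everything specific here is an assumption for illustration: the synthetic line $y = 1 + 2x$, the feature map `phi`, and the choice of plain gradient descent on the mean squared error as the training procedure.

```python
import random

random.seed(0)

# Data: N raw pairs (x_n, y_n); the generating line y = 1 + 2x is an assumption.
xs = [random.uniform(-1, 1) for _ in range(200)]
ys = [1.0 + 2.0 * x + random.gauss(0, 0.1) for x in xs]

def phi(x):
    """Features: map a raw input to the vector (1, x)."""
    return [1.0, x]

def model(features, theta):
    """Model class: linear in the features, parameterised by theta."""
    return sum(t * f for t, f in zip(theta, features))

def train(xs, ys, steps=500, lr=0.1):
    """Training: choose theta by gradient descent on the mean squared error."""
    theta = [0.0, 0.0]
    n = len(xs)
    feats = [phi(x) for x in xs]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for f, y in zip(feats, ys):
            err = model(f, theta) - y
            for j in range(2):
                grad[j] += 2 * err * f[j] / n
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

theta_hat = train(xs, ys)

def predictor(x):
    """Predictor: the model instance with theta fixed at theta_hat."""
    return model(phi(x), theta_hat)
```

The sketch has no regularisation and no held-out data, so it says nothing yet about overfitting or generalisation.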

04

Part I — The Four Foundations

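Six chapters, four foundations. The book spreads "analytic geometry & vector calculus" over chapters 3 and 5, gives matrix decompositions a chapter of its own (ch. 4), and treats probability and optimisation separately (chs. 6 and 7); the four groups below bundle those six chapters for this slide.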

Linear algebra (ch. 2)

Vectors, matrices, systems of equations, vector spaces, basis & rank, linear maps. The language every later chapter uses.

Analytic geometry (ch. 3) & matrix decompositions (ch. 4)

Inner products, norms, projections, rotations, then determinants, eigenvalues, Cholesky, SVD. The structure theorems.

Vector calculus (ch. 5)

Partial derivatives, gradients, Jacobians, chain rule, backpropagation, multivariate Taylor.

Probability (ch. 6) & optimisation (ch. 7)

Sample spaces, sum / product / Bayes, the Gaussian; then gradient descent, Lagrange multipliers, convex optimisation.

None of these chapters mention machine learning — but every example is chosen because Part II will use it. The Gaussian gets ten pages because deck 11 needs every line of them.

05

Part II — The Four Algorithms

Each Part II chapter takes one algorithm and re-derives it. The structure of every chapter is the same: state the problem, write the model, set up the loss / likelihood, solve, then critique.

Algorithm	Pillar	Heaviest Part I dependencies
Linear regression (ch. 9)	Regression	linear algebra (ch. 2), analytic geometry (ch. 3: projection), probability (ch. 6: Gaussian)
PCA (ch. 10)	Dim. reduction	matrix decompositions (ch. 4: eigendecomposition, SVD), analytic geometry (ch. 3: orthogonal projection)
GMM (ch. 11)	Density estimation	probability (ch. 6: Gaussian, marginalisation), optimisation (ch. 7: Lagrange multipliers for the mixture weights; the EM algorithm itself is introduced in ch. 11)
SVM (ch. 12)	Classification	optimisation (ch. 7: Lagrange dual, convexity), analytic geometry (ch. 3: hyperplane distance)

Notice that every Part II algorithm uses at least two Part I chapters. This is why the book is the size it is: you can't shortcut to PCA without knowing what an eigenvector is.

06

Interactive: The Chapter Dependency Graph

Hover or tap any chapter node. Its prerequisites in Part I light up; the algorithms it feeds in Part II light up too. This is the same dependency graph the authors draw in figure 1.1 of the book, made live.

Pedagogical note

The graph is not a tree: chapter 9 (regression) depends on both projection (ch. 3) and the Gaussian (ch. 6), and the SVM in chapter 12 depends on chapters 3 and 7 and a hint of chapter 6. Reading front to back (1 → 2 → 3 → ...) is one valid route, but so is any other topological order of this DAG, and to reach a single Part II chapter you only need its ancestors.
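A topological order can be computed directly from the dependency table on slide 05. The sketch below uses only the Part I → Part II edges listed there; edges among the Part I chapters are left out, which is a simplifying assumption, and the helper `needed_for` is a hypothetical name introduced here.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Prerequisites per Part II chapter, copied from the dependency table.
# Edges among Part I chapters are omitted (an assumption of this sketch).
prereqs = {
    9: {2, 3, 6},    # linear regression
    10: {3, 4},      # PCA
    11: {6, 7},      # GMM
    12: {3, 6, 7},   # SVM (ch. 6 is the "hint" mentioned above)
}

# Any order in which every chapter comes after all of its prerequisites.
order = list(TopologicalSorter(prereqs).static_order())

def needed_for(target):
    """Chapters to read, in a valid order, up to and including `target`."""
    keep = prereqs[target] | {target}
    return [ch for ch in order if ch in keep]
```

For example, `needed_for(10)` returns chapters 3 and 4 in some order followed by 10, which is the minimal route to PCA under these assumptions.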

07

Notation We'll Use Throughout

The book is careful about notation; so is the companion. The conventions below match the book exactly.

Symbol	Meaning
$\mathbf{x}, \mathbf{v}, \boldsymbol\theta$	Column vectors (bold lowercase)
$A, B, \Sigma$	Matrices (capital letters)
$\alpha, \lambda, \sigma$	Scalars (Greek lowercase)
$\mathbb{R}^D$	$D$-dimensional real space
$\mathcal{N}(\boldsymbol\mu, \Sigma)$	Multivariate Gaussian with mean $\boldsymbol\mu$ and covariance $\Sigma$
$p(\mathbf{x}\mid\boldsymbol\theta)$	Conditional density of $\mathbf{x}$ given parameter $\boldsymbol\theta$
$\mathbb{E}_p[\cdot]$, $\mathbb{V}_p[\cdot]$	Expectation and variance under $p$
$\langle \mathbf{u}, \mathbf{v}\rangle$	Inner product; equals $\mathbf{u}^\top\mathbf{v}$ in $\mathbb{R}^D$ with the standard inner product
$\|\mathbf{v}\|$	Norm of $\mathbf{v}$; defaults to $\sqrt{\langle\mathbf{v},\mathbf{v}\rangle}$ unless specified

Two minor abuses we accept (the book accepts them too): writing $p(\mathbf{x})$ for both a probability mass function and a density when context is clear, and using the same symbol for a function and its evaluation when no confusion can arise.
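The last two rows of the table translate directly into code. A small sketch with plain Python lists, where the function names `inner` and `norm` are chosen here for illustration:

```python
import math

def inner(u, v):
    """Standard inner product on R^D: <u, v> = u^T v."""
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(v):
    """Norm induced by the inner product: ||v|| = sqrt(<v, v>)."""
    return math.sqrt(inner(v, v))

u = [3.0, 4.0]
v = [1.0, 0.0]

print(inner(u, v))  # 3.0
print(norm(u))      # 5.0
```

Swapping `inner` for a different inner product changes `norm` with it, which is what "unless specified" refers to.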

08

How to Read the Book and the Deck Together

Open both

Keep the free PDF on one screen and the deck on another. Every deck's slide titles match the book's section titles.

Touch the widgets

Every deck has at least one interactive widget. The maths is in the book; the widget is what builds your intuition. Drag, scrub, click.

Do the exercises

The book's end-of-chapter exercises are excellent. The deck does not reproduce them — it builds the geometric and numerical intuition that makes them tractable.

Skip strategically

If you already know chapter 2, skim deck 02 for the conventions and dive into deck 03. The DAG on slide 06 tells you what is safe to skip.

09

A Worked Preview — A One-Feature Linear Model

Just to make the abstract talk concrete, here's the simplest model in the book, with every piece of vocabulary labelled.

$$y_n \;=\; \theta\, x_n \;+\; \varepsilon_n, \qquad \varepsilon_n \sim \mathcal{N}(0,\sigma^2)$$

Data: $\{(x_n, y_n)\}_{n=1}^N$, with $x_n, y_n\in\mathbb{R}$ (deck 08).
Features: trivial here — $\phi(x)=x$. Deck 09 generalises to $\phi(x) = (1, x, x^2, \dots)$.
Model: linear with one parameter $\theta\in\mathbb{R}$ and Gaussian noise (deck 09).
Loss: empirical risk = mean squared error $\;\hat R(\theta) = \tfrac{1}{N}\sum_n (y_n-\theta x_n)^2$ (deck 08).
Solution: gradient zero ⇒ $\hat\theta = \sum_n x_n y_n / \sum_n x_n^2$. This is the simplest case of the normal equation (deck 09); geometrically it's the orthogonal projection of $\mathbf{y}$ onto the line spanned by $\mathbf{x}$ (deck 03).
Uncertainty: assuming the noise model is right and placing a Gaussian prior on $\theta$, the posterior $p(\theta\mid\text{data})$ is Gaussian (deck 06 & 09).

Chapters 3, 6, 8 and 9 of the book just appeared in one short list. Every step is something we'll meet again in higher dimension and with more sophistication.
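The steps of the preview can be checked numerically on simulated data. This is a sketch under stated assumptions: the true slope `theta_true`, the noise level `sigma` (treated as known), and the zero-mean Gaussian prior with standard deviation `tau` are all values picked here for illustration.

```python
import math
import random

random.seed(1)

theta_true, sigma = 1.5, 0.3          # assumed ground truth for the simulation
xs = [random.uniform(-2, 2) for _ in range(500)]
ys = [theta_true * x + random.gauss(0, sigma) for x in xs]

sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Solution: setting the gradient of the mean squared error to zero.
theta_hat = sxy / sxx

# Geometry: the residual y - theta_hat * x is orthogonal to x.
residual_dot_x = sum((y - theta_hat * x) * x for x, y in zip(xs, ys))

# Uncertainty: with an assumed prior theta ~ N(0, tau^2) and known sigma,
# the posterior over theta is Gaussian with the precision and mean below.
tau = 10.0
post_precision = sxx / sigma**2 + 1 / tau**2
post_mean = (sxy / sigma**2) / post_precision
post_std = math.sqrt(1 / post_precision)
```

With a broad prior such as this one, `post_mean` is almost identical to `theta_hat`; a tighter prior would pull it towards zero, which is the regularisation / prior pairing mentioned on slide 03.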

10

Chapter Map and What's Next

Each row links to the corresponding deck in this series. Build the foundations first; everything in Part II then lands easily.

#	Chapter	Companion deck
2	Linear Algebra	MML_02 →
3	Analytic Geometry	MML_03 →
4	Matrix Decompositions	MML_04 →
5	Vector Calculus	MML_05 →
6	Probability & Distributions	MML_06 →
7	Continuous Optimisation	MML_07 →
8	When Models Meet Data	MML_08 →
9	Linear Regression	MML_09 →
10	PCA	MML_10 →
11	GMM	MML_11 →
12	SVM	MML_12 →

Up next

Deck 02 picks up where the book's chapter 2 starts: a single system of linear equations, the question "when does it have a solution?", and the journey from there to vector spaces, basis, rank and linear maps. By the end of deck 02 we have the linear-algebra language in which every later algorithm is written.