From a single linear equation to vector spaces, rank, and linear maps — the language every later chapter of the book is written in.
Consider the system the book uses as its running example:
$$\begin{aligned} 2x_1 + 4x_2 - 2x_3 &= \phantom{-}2 \\ 4x_1 + 9x_2 - 3x_3 &= \phantom{-}8 \\ -2x_1 + -3x_2 + 7x_3 &= 10 \end{aligned}$$
"Find numbers $x_1, x_2, x_3$ that make every line simultaneously true." This is what the system says, and it's how you'd write it on paper in school.
Stack coefficients into a matrix:
$$A\mathbf{x} = \mathbf{b}, \quad A=\begin{bmatrix}2 & 4 & -2\\4 & 9 & -3\\-2 & -3 & 7\end{bmatrix},\; \mathbf{b}=\begin{bmatrix}2\\8\\10\end{bmatrix}.$$
Each row is one equation; row operations preserve the solution set.
Re-read $A\mathbf{x}=\mathbf{b}$ as "find weights $x_1, x_2, x_3$ on the columns of $A$ so they sum to $\mathbf{b}$":
$$x_1\begin{bmatrix}2\\4\\-2\end{bmatrix} + x_2\begin{bmatrix}4\\9\\-3\end{bmatrix} + x_3\begin{bmatrix}-2\\-3\\7\end{bmatrix} = \begin{bmatrix}2\\8\\10\end{bmatrix}.$$
The solvability question becomes: is $\mathbf{b}$ in the span of the columns?
The book and this deck use all three views. The column view is the one that prepares us for vector spaces; the row view is the one we manipulate algorithmically.
A matrix is an $m\times n$ rectangular array of numbers. Three operations matter, and one identity matters.
$$(A+B)_{ij} = A_{ij} + B_{ij}, \qquad (\alpha A)_{ij} = \alpha A_{ij}.$$
If $A\in\mathbb{R}^{m\times k}$ and $B\in\mathbb{R}^{k\times n}$, then $(AB)_{ij} = \sum_{\ell=1}^k A_{i\ell} B_{\ell j}$. The inner dimension $k$ must match. Multiplication is associative and distributive but not commutative.
$I_n$ has ones on the diagonal and zeros elsewhere; $I_n A = A I_n = A$ for $A$ of compatible size. A square matrix $A$ has an inverse $A^{-1}$ iff its columns span $\mathbb{R}^n$, equivalently $\det A\neq 0$, equivalently the system $A\mathbf{x}=\mathbf{0}$ has only the zero solution.
(i) entry-by-entry dot products; (ii) $B$ acts on the columns of $A^\top$; (iii) the columns of $AB$ are linear combinations of the columns of $A$ with weights given by the columns of $B$. Every later chapter uses at least one of these.
$(A^\top)_{ij}=A_{ji}$. The identity that comes back endlessly is $(AB)^\top = B^\top A^\top$ — written the way an attention layer multiplies query against key transpose.
Drag the buttons to step through Gaussian elimination on the system. Three row operations are allowed; one preserves the solution set:
The system $A\mathbf{x}=\mathbf{b}$ has a unique solution exactly when elimination produces three pivots and no zero row with a non-zero right-hand side. The next slide makes the "pivot" picture precise.
After elimination a matrix is in row-echelon form (REF) if
It's in reduced row-echelon form (RREF) if additionally every pivot equals 1 and every column containing a pivot has zeros elsewhere. RREF is unique — the row-reduction algorithm produces the same RREF regardless of which row operations you use to get there.
Every variable column has a pivot, and no inconsistency. The system has exactly one solution — $A$ is invertible if it's square.
At least one variable column has no pivot — it's a free variable. The solution set is a translated subspace ("an affine hyperplane").
A row reads $0 = c$ for some $c\ne0$ — the right-hand side $\mathbf{b}$ is not in the column space of $A$.
The number of pivots equals the rank of $A$. The number of free variables equals $n - \mathrm{rank}(A)$; this is the dimension of the null space (slide 07).
A real vector space $V$ is a set with two operations: a way to add elements and a way to scale them by real numbers. They must satisfy eight axioms.
| Axiom | Statement |
|---|---|
| associativity | $(\mathbf{u}+\mathbf{v})+\mathbf{w} = \mathbf{u}+(\mathbf{v}+\mathbf{w})$ |
| commutativity | $\mathbf{u}+\mathbf{v} = \mathbf{v}+\mathbf{u}$ |
| identity | $\exists\, \mathbf{0} \in V$ with $\mathbf{v}+\mathbf{0}=\mathbf{v}$ |
| inverse | $\forall\, \mathbf{v}\in V, \;\exists\,-\mathbf{v}$ with $\mathbf{v}+(-\mathbf{v})=\mathbf{0}$ |
| compatibility | $\alpha(\beta\mathbf{v})=(\alpha\beta)\mathbf{v}$ |
| scalar identity | $1\cdot\mathbf{v}=\mathbf{v}$ |
| scalar distributivity | $\alpha(\mathbf{u}+\mathbf{v})=\alpha\mathbf{u}+\alpha\mathbf{v}$ |
| vector distributivity | $(\alpha+\beta)\mathbf{v}=\alpha\mathbf{v}+\beta\mathbf{v}$ |
Examples that recur in ML: $\mathbb{R}^D$ (every embedding); polynomials up to degree $k$ (feature maps); continuous functions on an interval (function spaces); and the space of $m\times n$ matrices (weight matrices).
Polynomials are vectors, but no polynomial is "an arrow at the origin". The abstract definition is exactly what lets us take ML algorithms designed for $\mathbb{R}^D$ — e.g. PCA — and apply them inside function spaces (kernel PCA), inside matrix spaces (low-rank weight factorisation), or inside infinite-dimensional Hilbert spaces (RKHS).
A linear combination of vectors $\mathbf{v}_1, \dots, \mathbf{v}_k$ is any $\alpha_1\mathbf{v}_1+\dots+\alpha_k\mathbf{v}_k$. The span is the set of all such combinations — always a subspace of $V$.
$\mathbf{v}_1,\dots,\mathbf{v}_k$ are linearly independent if the only $\alpha$'s with $\sum\alpha_i\mathbf{v}_i=\mathbf{0}$ are $\alpha_i=0$ for all $i$. Equivalently, no $\mathbf{v}_i$ is in the span of the others — nothing is redundant.
A basis $\mathcal{B}=(\mathbf{b}_1,\dots,\mathbf{b}_n)$ is an ordered linearly independent set whose span is the whole space. Every basis has the same number of vectors $n=\dim V$, the dimension.
Once a basis is chosen, every $\mathbf{v}\in V$ has a unique tuple of coordinates $[\mathbf{v}]_{\mathcal{B}}=(\alpha_1,\dots,\alpha_n)$ with $\mathbf{v}=\sum\alpha_i\mathbf{b}_i$. Picking a different basis re-labels every vector; the underlying geometry is unchanged. This is exactly the operation PCA performs in deck 10.
Stack the vectors as columns of a matrix $V$. They are independent iff $\mathrm{rank}(V) = k$, iff $V^\top V$ is invertible, iff the only solution of $V\boldsymbol\alpha = \mathbf{0}$ is $\boldsymbol\alpha=\mathbf{0}$. Gaussian elimination on $V$ gives all three answers simultaneously.
For an $m\times n$ matrix $A$ the book defines:
$\mathrm{Col}(A) = \mathrm{span}(\text{columns of } A) \subseteq \mathbb{R}^m$.
$\mathbf{b}\in\mathrm{Col}(A)$ iff $A\mathbf{x}=\mathbf{b}$ has a solution. $\dim\mathrm{Col}(A) = \mathrm{rank}(A)$.
$\mathrm{Nul}(A) = \{\mathbf{x}\in\mathbb{R}^n : A\mathbf{x}=\mathbf{0}\}$.
$\dim\mathrm{Nul}(A) = n - \mathrm{rank}(A)$ — the number of free variables.
$\mathrm{Row}(A) = \mathrm{Col}(A^\top) \subseteq \mathbb{R}^n$. Same dimension as the column space: row-rank = column-rank.
$\mathrm{Nul}(A^\top)=\{\mathbf{y}\in\mathbb{R}^m : A^\top\mathbf{y}=\mathbf{0}\}$. Dimension $m-\mathrm{rank}(A)$.
The two facts that come back constantly in ML:
$$\mathbb{R}^n = \mathrm{Row}(A) \oplus \mathrm{Nul}(A), \qquad \mathbb{R}^m = \mathrm{Col}(A) \oplus \mathrm{Nul}(A^\top).$$
The row space and null space are orthogonal complements inside $\mathbb{R}^n$; same for column & left-null inside $\mathbb{R}^m$. Deck 03 turns this into the orthogonal-projection picture that linear regression in deck 09 uses.
A function $\Phi:V\to W$ between vector spaces is linear if
$$\Phi(\alpha\mathbf{u}+\beta\mathbf{v}) = \alpha\Phi(\mathbf{u}) + \beta\Phi(\mathbf{v}).$$
Fix a basis $\mathcal{B}=(\mathbf{b}_1,\dots,\mathbf{b}_n)$ for $V$ and $\mathcal{C}=(\mathbf{c}_1,\dots,\mathbf{c}_m)$ for $W$. Write the image of each basis vector in the $\mathcal{C}$-coordinates:
$$\Phi(\mathbf{b}_j) = \sum_{i=1}^{m} A_{ij}\, \mathbf{c}_i.$$
The numbers $A_{ij}$ are the entries of the matrix representation $A=M_\mathcal{C\leftarrow B}(\Phi)$. Once we have $A$, applying $\Phi$ in coordinates is matrix-vector multiplication: $[\Phi(\mathbf{v})]_\mathcal{C} = A\,[\mathbf{v}]_\mathcal{B}$.
Every fully-connected layer of a neural network is a linear map $\Phi$ followed by a non-linearity. The weight matrix $W$ is exactly $M_\mathcal{C\leftarrow B}(\Phi)$ — with the standard basis on both sides. Choosing a clever basis on either side (RoPE, PCA, whitening) gives the same $\Phi$ a different, often sparser, matrix.
If $\Phi:V\to W$ and $\Psi:W\to U$ are linear with matrices $A$ and $B$ respectively, then $\Psi\circ\Phi$ has matrix $BA$ (note the order). This is why matrix multiplication exists: it is composition of linear maps in coordinates.
A linear subspace must contain the origin. An affine subspace doesn't:
$$L = \mathbf{x}_0 + U = \{\mathbf{x}_0 + \mathbf{u} : \mathbf{u} \in U\}$$
where $U$ is a linear subspace and $\mathbf{x}_0$ is a fixed offset. Lines, planes, hyperplanes in general position are all affine.
$\boldsymbol\phi(\mathbf{x}) = A\mathbf{x} + \mathbf{c}$. Equivalently, a linear map followed by translation. The set of affine maps is what ML actually uses: every nn.Linear in PyTorch has a bias term.
An affine hyperplane in $\mathbb{R}^D$ is $\{\mathbf{x} : \mathbf{w}^\top\mathbf{x}=b\}$, defined by a normal $\mathbf{w}\ne\mathbf{0}$ and an offset $b\in\mathbb{R}$. The SVM in deck 12 is built around finding the affine hyperplane with the largest margin.
The book is honest that almost everything ML calls "linear" is in fact affine. We continue this small abuse — it never causes confusion in practice as long as you remember to absorb the bias by appending a 1 to $\mathbf{x}$.
| Concept | Used by | Where |
|---|---|---|
| Matrix-vector $A\mathbf{x}$ | Every algorithm | Defines features → activations |
| Column space | Linear regression (ch. 9) | Least-squares prediction = projection onto $\mathrm{Col}(\Phi)$ |
| Null space | Ridge regression (ch. 9) | Regularisation picks a unique solution out of the affine solution set |
| Rank | PCA (ch. 10) | The optimal rank-$k$ approximation has rank exactly $k$ |
| Linear map | Feature map (ch. 8, 9, 12) | $\boldsymbol\phi(\mathbf{x})$ before the model is a linear (or kernelised linear) map |
| Hyperplane | SVM (ch. 12) | The decision boundary is an affine hyperplane $\mathbf{w}^\top\mathbf{x}=b$ |
Every later deck quietly assumes the contents of this one. If you find yourself stuck on a step in deck 09, the answer is almost always: "what does this look like in the row / column / null-space picture?"
| Object | Definition | One-line use |
|---|---|---|
| $A\in\mathbb{R}^{m\times n}$ | Rectangular array of numbers | Stores any linear map between coordinate spaces |
| $\mathrm{rank}(A)$ | Number of pivots in RREF | Dimension of the column space |
| $\mathrm{Nul}(A)$ | $\{\mathbf{x}:A\mathbf{x}=\mathbf{0}\}$ | Direction(s) the system is blind to |
| Basis | Linearly independent spanning set | Gives every vector coordinates |
| Dimension | Size of any basis | Counts how many degrees of freedom the space has |
| Linear map $\Phi$ | $\Phi(\alpha\mathbf{u}+\beta\mathbf{v}) = \alpha\Phi\mathbf{u}+\beta\Phi\mathbf{v}$ | The right notion of "homomorphism" for vector spaces |
| Matrix of $\Phi$ | Coordinates of $\Phi(\mathbf{b}_j)$ in $\mathcal{C}$ | Lets you apply $\Phi$ by matrix multiplication |
| Composition | $\Psi\circ\Phi \leftrightarrow BA$ | Stacking layers is matrix product |
| Affine subspace | $\mathbf{x}_0+U$ | Solution set of a consistent under-determined system |
Deck 03 puts a geometry on the vector space — inner products, lengths, angles, projections. From there every "shortest distance" question (closest point on a line, closest line through a cloud of points, closest separating hyperplane) has the same answer: orthogonal projection.