Mathematics for Machine Learning — Deck 02

Linear Algebra

From a single linear equation to vector spaces, rank, and linear maps — the language every later chapter of the book is written in.

systemsmatrices Gaussian eliminationvector spaces basisrank linear mapsaffine
$Ax=b$ row echelon vector space basis & rank linear map
00

Topics We'll Cover

01

Three Views of a Linear System

Consider the system the book uses as its running example:

$$\begin{aligned} 2x_1 + 4x_2 - 2x_3 &= \phantom{-}2 \\ 4x_1 + 9x_2 - 3x_3 &= \phantom{-}8 \\ -2x_1 + -3x_2 + 7x_3 &= 10 \end{aligned}$$

1. Algebraic view

"Find numbers $x_1, x_2, x_3$ that make every line simultaneously true." This is what the system says, and it's how you'd write it on paper in school.

2. Row view (matrix)

Stack coefficients into a matrix:

$$A\mathbf{x} = \mathbf{b}, \quad A=\begin{bmatrix}2 & 4 & -2\\4 & 9 & -3\\-2 & -3 & 7\end{bmatrix},\; \mathbf{b}=\begin{bmatrix}2\\8\\10\end{bmatrix}.$$

Each row is one equation; row operations preserve the solution set.

3. Column view

Re-read $A\mathbf{x}=\mathbf{b}$ as "find weights $x_1, x_2, x_3$ on the columns of $A$ so they sum to $\mathbf{b}$":

$$x_1\begin{bmatrix}2\\4\\-2\end{bmatrix} + x_2\begin{bmatrix}4\\9\\-3\end{bmatrix} + x_3\begin{bmatrix}-2\\-3\\7\end{bmatrix} = \begin{bmatrix}2\\8\\10\end{bmatrix}.$$

The solvability question becomes: is $\mathbf{b}$ in the span of the columns?

The book and this deck use all three views. The column view is the one that prepares us for vector spaces; the row view is the one we manipulate algorithmically.

02

Matrices and Matrix Operations

A matrix is an $m\times n$ rectangular array of numbers. Three operations matter, and one identity matters.

Sum and scalar multiple

$$(A+B)_{ij} = A_{ij} + B_{ij}, \qquad (\alpha A)_{ij} = \alpha A_{ij}.$$

Matrix multiplication

If $A\in\mathbb{R}^{m\times k}$ and $B\in\mathbb{R}^{k\times n}$, then $(AB)_{ij} = \sum_{\ell=1}^k A_{i\ell} B_{\ell j}$. The inner dimension $k$ must match. Multiplication is associative and distributive but not commutative.

Identity and inverse

$I_n$ has ones on the diagonal and zeros elsewhere; $I_n A = A I_n = A$ for $A$ of compatible size. A square matrix $A$ has an inverse $A^{-1}$ iff its columns span $\mathbb{R}^n$, equivalently $\det A\neq 0$, equivalently the system $A\mathbf{x}=\mathbf{0}$ has only the zero solution.

Three ways to read $AB$

(i) entry-by-entry dot products; (ii) $B$ acts on the columns of $A^\top$; (iii) the columns of $AB$ are linear combinations of the columns of $A$ with weights given by the columns of $B$. Every later chapter uses at least one of these.

The transpose

$(A^\top)_{ij}=A_{ji}$. The identity that comes back endlessly is $(AB)^\top = B^\top A^\top$ — written the way an attention layer multiplies query against key transpose.

03

Interactive: Gaussian Elimination

Drag the buttons to step through Gaussian elimination on the system. Three row operations are allowed; one preserves the solution set:

click Step to begin elimination → row echelon form
Pivots are amber; reduce below each pivot, then sweep above for reduced row-echelon form.
Solution: x = ?

The system $A\mathbf{x}=\mathbf{b}$ has a unique solution exactly when elimination produces three pivots and no zero row with a non-zero right-hand side. The next slide makes the "pivot" picture precise.

04

Reduced Row-Echelon Form and Pivots

After elimination a matrix is in row-echelon form (REF) if

  1. every all-zero row sits at the bottom,
  2. each leading non-zero entry (the pivot) sits strictly to the right of the pivot above it.

It's in reduced row-echelon form (RREF) if additionally every pivot equals 1 and every column containing a pivot has zeros elsewhere. RREF is unique — the row-reduction algorithm produces the same RREF regardless of which row operations you use to get there.

Three solvability cases

Unique solution

Every variable column has a pivot, and no inconsistency. The system has exactly one solution — $A$ is invertible if it's square.

Infinitely many

At least one variable column has no pivot — it's a free variable. The solution set is a translated subspace ("an affine hyperplane").

No solution

A row reads $0 = c$ for some $c\ne0$ — the right-hand side $\mathbf{b}$ is not in the column space of $A$.

The number of pivots equals the rank of $A$. The number of free variables equals $n - \mathrm{rank}(A)$; this is the dimension of the null space (slide 07).

05

Vector Spaces — The Eight Axioms

A real vector space $V$ is a set with two operations: a way to add elements and a way to scale them by real numbers. They must satisfy eight axioms.

AxiomStatement
associativity$(\mathbf{u}+\mathbf{v})+\mathbf{w} = \mathbf{u}+(\mathbf{v}+\mathbf{w})$
commutativity$\mathbf{u}+\mathbf{v} = \mathbf{v}+\mathbf{u}$
identity$\exists\, \mathbf{0} \in V$ with $\mathbf{v}+\mathbf{0}=\mathbf{v}$
inverse$\forall\, \mathbf{v}\in V, \;\exists\,-\mathbf{v}$ with $\mathbf{v}+(-\mathbf{v})=\mathbf{0}$
compatibility$\alpha(\beta\mathbf{v})=(\alpha\beta)\mathbf{v}$
scalar identity$1\cdot\mathbf{v}=\mathbf{v}$
scalar distributivity$\alpha(\mathbf{u}+\mathbf{v})=\alpha\mathbf{u}+\alpha\mathbf{v}$
vector distributivity$(\alpha+\beta)\mathbf{v}=\alpha\mathbf{v}+\beta\mathbf{v}$

Examples that recur in ML: $\mathbb{R}^D$ (every embedding); polynomials up to degree $k$ (feature maps); continuous functions on an interval (function spaces); and the space of $m\times n$ matrices (weight matrices).

Why the abstraction matters

Polynomials are vectors, but no polynomial is "an arrow at the origin". The abstract definition is exactly what lets us take ML algorithms designed for $\mathbb{R}^D$ — e.g. PCA — and apply them inside function spaces (kernel PCA), inside matrix spaces (low-rank weight factorisation), or inside infinite-dimensional Hilbert spaces (RKHS).

06

Span, Linear Independence, Basis

Linear combination and span

A linear combination of vectors $\mathbf{v}_1, \dots, \mathbf{v}_k$ is any $\alpha_1\mathbf{v}_1+\dots+\alpha_k\mathbf{v}_k$. The span is the set of all such combinations — always a subspace of $V$.

Linear independence

$\mathbf{v}_1,\dots,\mathbf{v}_k$ are linearly independent if the only $\alpha$'s with $\sum\alpha_i\mathbf{v}_i=\mathbf{0}$ are $\alpha_i=0$ for all $i$. Equivalently, no $\mathbf{v}_i$ is in the span of the others — nothing is redundant.

Basis and dimension

A basis $\mathcal{B}=(\mathbf{b}_1,\dots,\mathbf{b}_n)$ is an ordered linearly independent set whose span is the whole space. Every basis has the same number of vectors $n=\dim V$, the dimension.

Coordinates

Once a basis is chosen, every $\mathbf{v}\in V$ has a unique tuple of coordinates $[\mathbf{v}]_{\mathcal{B}}=(\alpha_1,\dots,\alpha_n)$ with $\mathbf{v}=\sum\alpha_i\mathbf{b}_i$. Picking a different basis re-labels every vector; the underlying geometry is unchanged. This is exactly the operation PCA performs in deck 10.

Test for independence

Stack the vectors as columns of a matrix $V$. They are independent iff $\mathrm{rank}(V) = k$, iff $V^\top V$ is invertible, iff the only solution of $V\boldsymbol\alpha = \mathbf{0}$ is $\boldsymbol\alpha=\mathbf{0}$. Gaussian elimination on $V$ gives all three answers simultaneously.

07

Rank and the Four Subspaces

For an $m\times n$ matrix $A$ the book defines:

Column space (range)

$\mathrm{Col}(A) = \mathrm{span}(\text{columns of } A) \subseteq \mathbb{R}^m$.

$\mathbf{b}\in\mathrm{Col}(A)$ iff $A\mathbf{x}=\mathbf{b}$ has a solution. $\dim\mathrm{Col}(A) = \mathrm{rank}(A)$.

Null space (kernel)

$\mathrm{Nul}(A) = \{\mathbf{x}\in\mathbb{R}^n : A\mathbf{x}=\mathbf{0}\}$.

$\dim\mathrm{Nul}(A) = n - \mathrm{rank}(A)$ — the number of free variables.

Row space

$\mathrm{Row}(A) = \mathrm{Col}(A^\top) \subseteq \mathbb{R}^n$. Same dimension as the column space: row-rank = column-rank.

Left null space

$\mathrm{Nul}(A^\top)=\{\mathbf{y}\in\mathbb{R}^m : A^\top\mathbf{y}=\mathbf{0}\}$. Dimension $m-\mathrm{rank}(A)$.

The two facts that come back constantly in ML:

$$\mathbb{R}^n = \mathrm{Row}(A) \oplus \mathrm{Nul}(A), \qquad \mathbb{R}^m = \mathrm{Col}(A) \oplus \mathrm{Nul}(A^\top).$$

The row space and null space are orthogonal complements inside $\mathbb{R}^n$; same for column & left-null inside $\mathbb{R}^m$. Deck 03 turns this into the orthogonal-projection picture that linear regression in deck 09 uses.

08

Linear Maps as Matrices

A function $\Phi:V\to W$ between vector spaces is linear if

$$\Phi(\alpha\mathbf{u}+\beta\mathbf{v}) = \alpha\Phi(\mathbf{u}) + \beta\Phi(\mathbf{v}).$$

Fix a basis $\mathcal{B}=(\mathbf{b}_1,\dots,\mathbf{b}_n)$ for $V$ and $\mathcal{C}=(\mathbf{c}_1,\dots,\mathbf{c}_m)$ for $W$. Write the image of each basis vector in the $\mathcal{C}$-coordinates:

$$\Phi(\mathbf{b}_j) = \sum_{i=1}^{m} A_{ij}\, \mathbf{c}_i.$$

The numbers $A_{ij}$ are the entries of the matrix representation $A=M_\mathcal{C\leftarrow B}(\Phi)$. Once we have $A$, applying $\Phi$ in coordinates is matrix-vector multiplication: $[\Phi(\mathbf{v})]_\mathcal{C} = A\,[\mathbf{v}]_\mathcal{B}$.

Why this matters for ML

Every fully-connected layer of a neural network is a linear map $\Phi$ followed by a non-linearity. The weight matrix $W$ is exactly $M_\mathcal{C\leftarrow B}(\Phi)$ — with the standard basis on both sides. Choosing a clever basis on either side (RoPE, PCA, whitening) gives the same $\Phi$ a different, often sparser, matrix.

Composition

If $\Phi:V\to W$ and $\Psi:W\to U$ are linear with matrices $A$ and $B$ respectively, then $\Psi\circ\Phi$ has matrix $BA$ (note the order). This is why matrix multiplication exists: it is composition of linear maps in coordinates.

09

Affine Spaces

A linear subspace must contain the origin. An affine subspace doesn't:

$$L = \mathbf{x}_0 + U = \{\mathbf{x}_0 + \mathbf{u} : \mathbf{u} \in U\}$$

where $U$ is a linear subspace and $\mathbf{x}_0$ is a fixed offset. Lines, planes, hyperplanes in general position are all affine.

Affine maps

$\boldsymbol\phi(\mathbf{x}) = A\mathbf{x} + \mathbf{c}$. Equivalently, a linear map followed by translation. The set of affine maps is what ML actually uses: every nn.Linear in PyTorch has a bias term.

Hyperplanes

An affine hyperplane in $\mathbb{R}^D$ is $\{\mathbf{x} : \mathbf{w}^\top\mathbf{x}=b\}$, defined by a normal $\mathbf{w}\ne\mathbf{0}$ and an offset $b\in\mathbb{R}$. The SVM in deck 12 is built around finding the affine hyperplane with the largest margin.

A small confession

The book is honest that almost everything ML calls "linear" is in fact affine. We continue this small abuse — it never causes confusion in practice as long as you remember to absorb the bias by appending a 1 to $\mathbf{x}$.

10

Where This Lands in Part II

ConceptUsed byWhere
Matrix-vector $A\mathbf{x}$Every algorithmDefines features → activations
Column spaceLinear regression (ch. 9)Least-squares prediction = projection onto $\mathrm{Col}(\Phi)$
Null spaceRidge regression (ch. 9)Regularisation picks a unique solution out of the affine solution set
RankPCA (ch. 10)The optimal rank-$k$ approximation has rank exactly $k$
Linear mapFeature map (ch. 8, 9, 12)$\boldsymbol\phi(\mathbf{x})$ before the model is a linear (or kernelised linear) map
HyperplaneSVM (ch. 12)The decision boundary is an affine hyperplane $\mathbf{w}^\top\mathbf{x}=b$

Every later deck quietly assumes the contents of this one. If you find yourself stuck on a step in deck 09, the answer is almost always: "what does this look like in the row / column / null-space picture?"

11

Cheat Sheet

ObjectDefinitionOne-line use
$A\in\mathbb{R}^{m\times n}$Rectangular array of numbersStores any linear map between coordinate spaces
$\mathrm{rank}(A)$Number of pivots in RREFDimension of the column space
$\mathrm{Nul}(A)$$\{\mathbf{x}:A\mathbf{x}=\mathbf{0}\}$Direction(s) the system is blind to
BasisLinearly independent spanning setGives every vector coordinates
DimensionSize of any basisCounts how many degrees of freedom the space has
Linear map $\Phi$$\Phi(\alpha\mathbf{u}+\beta\mathbf{v}) = \alpha\Phi\mathbf{u}+\beta\Phi\mathbf{v}$The right notion of "homomorphism" for vector spaces
Matrix of $\Phi$Coordinates of $\Phi(\mathbf{b}_j)$ in $\mathcal{C}$Lets you apply $\Phi$ by matrix multiplication
Composition$\Psi\circ\Phi \leftrightarrow BA$Stacking layers is matrix product
Affine subspace$\mathbf{x}_0+U$Solution set of a consistent under-determined system
Up next

Deck 03 puts a geometry on the vector space — inner products, lengths, angles, projections. From there every "shortest distance" question (closest point on a line, closest line through a cloud of points, closest separating hyperplane) has the same answer: orthogonal projection.