MML 03 — Analytic Geometry

00

Topics We'll Cover

Norms and the unit ball
Inner products — abstract definition
Lengths, distances, angles, Cauchy–Schwarz
Orthonormal bases & Gram–Schmidt
Orthogonal complement
Interactive: orthogonal projection
Projection onto a subspace
Rotations
Inner products on function spaces
Where this lands in Part II
Cheat sheet

01

Norms and the Unit Ball

A norm on a real vector space $V$ is a function $\|\cdot\| : V \to \mathbb{R}_{\ge 0}$ satisfying

Positive definiteness: $\|\mathbf{v}\|=0 \iff \mathbf{v}=\mathbf{0}$;
Absolute homogeneity: $\|\alpha\mathbf{v}\| = |\alpha|\,\|\mathbf{v}\|$;
Triangle inequality: $\|\mathbf{u}+\mathbf{v}\| \le \|\mathbf{u}\| + \|\mathbf{v}\|$.

$\ell^1$ — "Manhattan"

$\|\mathbf{v}\|_1 = \sum_i |v_i|$.

Unit ball: diamond. Encourages sparsity (lasso).

$\ell^2$ — "Euclidean"

$\|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}$.

Unit ball: circle (disk). Smooth, rotationally invariant, and the norm that the dot product induces.

$\ell^\infty$ — "Chebyshev"

$\|\mathbf{v}\|_\infty = \max_i |v_i|$.

Unit ball: square. Useful for adversarial perturbations and robust ML.

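In finite dimensions all norms are equivalent: for any two norms $\|\cdot\|_a$ and $\|\cdot\|_b$ there are constants $0 < c \le C$ with $c\,\|\mathbf{v}\|_a \le \|\mathbf{v}\|_b \le C\,\|\mathbf{v}\|_a$ for every $\mathbf{v}$. They therefore induce the same notion of convergence, but very different notions of "small": minimising under $\ell^1$ tends to pick a sparse vector, while minimising under $\ell^2$ spreads the weight over many small entries. ML uses both.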
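
To make the three norms and their equivalence concrete, here is a minimal sketch in plain Python. The function names and the test vector are choices made for this example; the constants $1$ and $\sqrt{n}$ are the standard ones for $\mathbb{R}^n$.

```python
import math

def norm_l1(v):
    return sum(abs(x) for x in v)

def norm_l2(v):
    return math.sqrt(sum(x * x for x in v))

def norm_linf(v):
    return max(abs(x) for x in v)

v = [3.0, -4.0, 0.0]
n = len(v)
l1, l2, linf = norm_l1(v), norm_l2(v), norm_linf(v)
print(l1, l2, linf)  # 7.0 5.0 4.0

# Equivalence with explicit constants in R^n:
assert linf <= l2 <= l1             # ||v||_inf <= ||v||_2 <= ||v||_1
assert l1 <= math.sqrt(n) * l2      # ||v||_1 <= sqrt(n) ||v||_2
assert l2 <= math.sqrt(n) * linf    # ||v||_2 <= sqrt(n) ||v||_inf
```

The three values differ, but each is bounded by a fixed multiple of the others, which is why a sequence that converges in one norm converges in all of them.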

02

Inner Products — The Abstract Definition

An inner product on $V$ is a map $\langle\cdot,\cdot\rangle : V\times V \to \mathbb{R}$ that is

Bilinear: $\langle\alpha\mathbf{u}+\beta\mathbf{v},\mathbf{w}\rangle = \alpha\langle\mathbf{u},\mathbf{w}\rangle + \beta\langle\mathbf{v},\mathbf{w}\rangle$ and linear in the second slot too;
Symmetric: $\langle\mathbf{u},\mathbf{v}\rangle = \langle\mathbf{v},\mathbf{u}\rangle$;
Positive definite: $\langle\mathbf{v},\mathbf{v}\rangle \ge 0$, with equality iff $\mathbf{v}=\mathbf{0}$.

Examples

Dot product on $\mathbb{R}^D$: $\langle\mathbf{u},\mathbf{v}\rangle = \mathbf{u}^\top\mathbf{v}$.
Weighted: $\langle\mathbf{u},\mathbf{v}\rangle_A = \mathbf{u}^\top A \mathbf{v}$ for any symmetric positive-definite $A$. Mahalanobis distance uses $A=\Sigma^{-1}$.
$L^2$ on functions: $\langle f,g\rangle = \int_a^b f(x)g(x)\,dx$. Fourier and kernel methods live here.

The induced norm

Every inner product produces a norm by $\|\mathbf{v}\| = \sqrt{\langle\mathbf{v},\mathbf{v}\rangle}$. Not every norm comes from an inner product: $\ell^1$ and $\ell^\infty$ don't.
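
One standard way to test whether a norm comes from an inner product (a fact not stated in the deck) is the parallelogram law $\|\mathbf{u}+\mathbf{v}\|^2 + \|\mathbf{u}-\mathbf{v}\|^2 = 2\|\mathbf{u}\|^2 + 2\|\mathbf{v}\|^2$, which holds exactly for induced norms. The sketch below, with helper names chosen for this example, checks it on the two standard basis vectors of $\mathbb{R}^2$.

```python
import math

def l1(v):
    return sum(abs(x) for x in v)

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def linf(v):
    return max(abs(x) for x in v)

def parallelogram_gap(norm, u, v):
    """||u+v||^2 + ||u-v||^2 - 2||u||^2 - 2||v||^2; zero when the law holds."""
    s = [a + b for a, b in zip(u, v)]
    d = [a - b for a, b in zip(u, v)]
    return norm(s) ** 2 + norm(d) ** 2 - 2 * norm(u) ** 2 - 2 * norm(v) ** 2

u, v = [1.0, 0.0], [0.0, 1.0]
gap_l1 = parallelogram_gap(l1, u, v)      # 4.0: the law fails for l1
gap_l2 = parallelogram_gap(l2, u, v)      # about 0: the law holds for l2
gap_linf = parallelogram_gap(linf, u, v)  # -2.0: the law fails for l-infinity
```

A single counterexample is enough to rule out an inner product behind $\ell^1$ or $\ell^\infty$.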

Why this matters for ML

The inner product is the only object needed for projection, angle, and orthogonality. The kernel trick (deck 12) generalises by letting $\langle\mathbf{u},\mathbf{v}\rangle_k = k(\mathbf{u},\mathbf{v})$ — an inner product without ever writing down the corresponding vectors.

03

Lengths, Distances, Angles, Cauchy–Schwarz

Length

$$\|\mathbf{v}\| = \sqrt{\langle\mathbf{v},\mathbf{v}\rangle}.$$

Distance

$$d(\mathbf{u},\mathbf{v}) = \|\mathbf{u}-\mathbf{v}\|.$$

Angle

$$\cos\theta = \frac{\langle\mathbf{u},\mathbf{v}\rangle}{\|\mathbf{u}\|\,\|\mathbf{v}\|} \quad \in [-1, 1].$$

The fact that this ratio is in $[-1,1]$ is the Cauchy–Schwarz inequality:

$$|\langle\mathbf{u},\mathbf{v}\rangle| \le \|\mathbf{u}\|\,\|\mathbf{v}\|.$$

Equality holds iff $\mathbf{u}$ and $\mathbf{v}$ are linearly dependent. This single inequality appears in many convergence proofs in optimisation and in the regret bounds of online learning.
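
For readers who want the missing step, one standard argument (a sketch, and not necessarily the proof this deck has in mind) uses only the inner-product axioms. For $\mathbf{v}\neq\mathbf{0}$ and any $t\in\mathbb{R}$,

$$0 \le \|\mathbf{u}-t\mathbf{v}\|^2 = \|\mathbf{u}\|^2 - 2t\,\langle\mathbf{u},\mathbf{v}\rangle + t^2\,\|\mathbf{v}\|^2.$$

Choosing $t = \langle\mathbf{u},\mathbf{v}\rangle / \|\mathbf{v}\|^2$ gives

$$0 \le \|\mathbf{u}\|^2 - \frac{\langle\mathbf{u},\mathbf{v}\rangle^2}{\|\mathbf{v}\|^2} \quad\Longrightarrow\quad \langle\mathbf{u},\mathbf{v}\rangle^2 \le \|\mathbf{u}\|^2\,\|\mathbf{v}\|^2,$$

and equality forces $\mathbf{u} = t\mathbf{v}$. The case $\mathbf{v}=\mathbf{0}$ is trivial because both sides are zero.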

Orthogonality

$\mathbf{u}\perp\mathbf{v}$ iff $\langle\mathbf{u},\mathbf{v}\rangle=0$. When $\mathbf{u}\perp\mathbf{v}$, Pythagoras applies: $\|\mathbf{u}+\mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2$.

Cosine similarity in ML

Cauchy–Schwarz is what makes "cosine similarity" between embedding vectors well defined: $\cos\theta(\mathbf{u},\mathbf{v}) \in [-1,1]$ measures how aligned the directions are, independent of magnitude. Two embeddings can be a thousand units apart in $\ell^2$ and still have $\cos\theta=1$.
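
The magnitude-independence is easy to see numerically. The following is a minimal sketch with the dot product as the inner product; the vectors are made up for the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """cos(theta) between u and v; undefined for a zero vector."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if nu == 0.0 or nv == 0.0:
        raise ValueError("cosine similarity is undefined for the zero vector")
    return dot(u, v) / (nu * nv)

u = [3.0, 4.0]
v = [600.0, 800.0]   # same direction as u, 200 times the length

distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
similarity = cosine_similarity(u, v)
print(distance)      # 995.0
print(similarity)    # 1.0
```

The two vectors are almost a thousand units apart, yet perfectly aligned. Retrieval systems that normalise embeddings to unit length are relying on exactly this.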

04

Orthonormal Bases & Gram–Schmidt

A basis $(\mathbf{b}_1,\dots,\mathbf{b}_n)$ is orthonormal if $\langle\mathbf{b}_i,\mathbf{b}_j\rangle = \delta_{ij}$ (1 if $i=j$, 0 otherwise). In coordinates relative to an orthonormal basis, the inner product collapses to a dot product and a vector's coordinates are just its inner products with the basis vectors:

$$\mathbf{v} = \sum_{i=1}^n \langle \mathbf{v}, \mathbf{b}_i\rangle\, \mathbf{b}_i.$$

Gram–Schmidt

Given any basis $(\mathbf{v}_1, \dots, \mathbf{v}_n)$, produce an orthonormal one:

$$\mathbf{u}_k = \mathbf{v}_k - \sum_{j=1}^{k-1} \langle\mathbf{v}_k, \mathbf{b}_j\rangle\,\mathbf{b}_j, \qquad \mathbf{b}_k = \frac{\mathbf{u}_k}{\|\mathbf{u}_k\|}.$$

Each step strips off the component of $\mathbf{v}_k$ that already lies in the span of the previous basis vectors, then normalises. Stack the $\mathbf{b}_k$'s as columns and you get the $Q$ of the $QR$ factorisation (deck 04).
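
The recursion can be sketched in a few lines of plain Python. This version assumes the dot product as the inner product and computes each coefficient from the running residual `u` rather than from the original `v`; in exact arithmetic that gives the same result as the formula above. The function name and tolerance are choices for this example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors, tol=1e-12):
    """Orthonormalise linearly independent vectors w.r.t. the dot product."""
    basis = []
    for v in vectors:
        u = list(v)
        for b in basis:
            coeff = dot(u, b)  # coefficient taken from the running residual
            u = [ui - coeff * bi for ui, bi in zip(u, b)]
        length = math.sqrt(dot(u, u))
        if length < tol:
            raise ValueError("input vectors are (numerically) linearly dependent")
        basis.append([ui / length for ui in u])
    return basis

Q = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])

# Check <b_i, b_j> = delta_ij up to rounding.
for i, bi in enumerate(Q):
    for j, bj in enumerate(Q):
        expected = 1.0 if i == j else 0.0
        assert abs(dot(bi, bj) - expected) < 1e-12
```

Replacing `dot(u, b)` with `dot(v, b)` gives the textbook formula verbatim.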

Numerical caveat

"Classical" Gram–Schmidt is numerically unstable for nearly-parallel vectors. The modified version subtracts each projection one at a time and accumulates much less round-off error; production libraries use modified GS or Householder reflections.

05

Orthogonal Complement

For a subspace $U\subseteq V$, the orthogonal complement is

$$U^\perp = \{\mathbf{w}\in V : \langle\mathbf{w},\mathbf{u}\rangle=0 \text{ for all } \mathbf{u}\in U\}.$$

Three facts:

$U^\perp$ is a subspace.
$\dim U + \dim U^\perp = \dim V$.
$V = U \oplus U^\perp$ — every $\mathbf{v}\in V$ has a unique decomposition $\mathbf{v} = \mathbf{u} + \mathbf{w}$ with $\mathbf{u}\in U$, $\mathbf{w}\in U^\perp$.

Tying back to deck 02

For any matrix $A\in\mathbb{R}^{m\times n}$:

$$\mathrm{Row}(A)^\perp = \mathrm{Nul}(A), \qquad \mathrm{Col}(A)^\perp = \mathrm{Nul}(A^\top).$$

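The four fundamental subspaces are therefore two orthogonal pairs: $\mathrm{Row}(A)$ and $\mathrm{Nul}(A)$ inside $\mathbb{R}^n$, and $\mathrm{Col}(A)$ and $\mathrm{Nul}(A^\top)$ inside $\mathbb{R}^m$, for $A$ with $m$ rows and $n$ columns. Combined with $\dim U + \dim U^\perp = \dim V$, the first pair gives $\mathrm{rank}(A) + \dim\mathrm{Nul}(A) = n$, which is the rank-nullity theorem.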
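
A small hand-picked example makes $\mathrm{Row}(A)^\perp = \mathrm{Nul}(A)$ tangible: each entry of $A\mathbf{x}$ is the dot product of one row of $A$ with $\mathbf{x}$, so $A\mathbf{x}=\mathbf{0}$ says exactly that $\mathbf{x}$ is orthogonal to every row. The matrix and null vector below are illustrative choices.

```python
def matvec(A, x):
    """Each output entry is the dot product of one row of A with x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

A = [[1, 2, 3],
     [4, 5, 6]]   # two independent rows, so rank 2
x = [1, -2, 1]    # chosen by hand so that A x = 0

result = matvec(A, x)
print(result)     # [0, 0]
```

Here $\dim\mathrm{Row}(A) = 2$ and $\dim\mathrm{Nul}(A) = 1$, which add up to $n = 3$.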

06

Interactive: Orthogonal Projection

Drag the tip of $\mathbf{v}$ (purple). The projection onto the green line $L = \mathrm{span}(\mathbf{u})$ is computed live, and the residual is drawn as a dashed amber segment. Notice three things:

The residual is always perpendicular to $L$ — that's why the projection minimises distance.
Pythagoras holds: $\|\mathbf{v}\|^2 = \|\pi_L(\mathbf{v})\|^2 + \|\mathbf{v}-\pi_L(\mathbf{v})\|^2$.
Projection is idempotent: projecting twice gives the same answer.

Interactive figure: the green line is $\mathrm{span}(\mathbf{u})$, where $\mathbf{u}$ is the green vector. Live readouts show $\|\mathbf{v}\|$, $\|\pi_L(\mathbf{v})\|$, $\|\mathbf{v}-\pi_L(\mathbf{v})\|$ and $\cos\theta$.

The formula for projection onto the line $L$:

$$\pi_L(\mathbf{v}) = \frac{\langle\mathbf{v},\mathbf{u}\rangle}{\langle\mathbf{u},\mathbf{u}\rangle}\,\mathbf{u}.$$
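
The three observations listed for the interactive figure can also be checked on one fixed pair of vectors. This is a minimal sketch with the dot product as the inner product; the vectors are arbitrary example values.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_onto_line(v, u):
    """Orthogonal projection of v onto span(u), for u != 0."""
    scale = dot(v, u) / dot(u, u)
    return [scale * ui for ui in u]

v = [3.0, 1.0]
u = [1.0, 1.0]
p = project_onto_line(v, u)             # [2.0, 2.0]
r = [vi - pi for vi, pi in zip(v, p)]   # residual [1.0, -1.0]

assert dot(r, u) == 0.0                      # residual is perpendicular to L
assert dot(v, v) == dot(p, p) + dot(r, r)    # Pythagoras: 10 = 8 + 2
assert project_onto_line(p, u) == p          # idempotent
```

The exact equalities work here because the example values are small integers stored as floats; for general inputs, compare up to a tolerance.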

07

Projection onto a Subspace

For a $k$-dimensional subspace $U = \mathrm{Col}(B)$ where $B\in\mathbb{R}^{D\times k}$ has linearly independent columns, the orthogonal projection of $\mathbf{v}$ onto $U$ is

$$\pi_U(\mathbf{v}) = B(B^\top B)^{-1} B^\top \mathbf{v}.$$

If the columns of $B$ are orthonormal the formula simplifies to $BB^\top \mathbf{v}$ — one matrix-vector multiply.

The minimising property

$\pi_U(\mathbf{v})$ is the unique element of $U$ closest to $\mathbf{v}$:

$$\pi_U(\mathbf{v}) = \arg\min_{\mathbf{u}\in U}\|\mathbf{v}-\mathbf{u}\|.$$

Proof: the residual $\mathbf{v}-\pi_U(\mathbf{v})$ lies in $U^\perp$; Pythagoras gives $\|\mathbf{v}-\mathbf{u}\|^2 = \|\mathbf{v}-\pi_U(\mathbf{v})\|^2 + \|\pi_U(\mathbf{v})-\mathbf{u}\|^2$, minimised at $\mathbf{u}=\pi_U(\mathbf{v})$.
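
For $k=2$ the formula $B(B^\top B)^{-1}B^\top\mathbf{v}$ can be evaluated by hand: solve the $2\times 2$ system $(B^\top B)\,\mathbf{c} = B^\top\mathbf{v}$ and return $B\mathbf{c}$. The sketch below does this with Cramer's rule in plain Python; the function name and the example vectors are assumptions of this illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_onto_span(v, b1, b2):
    """Project v onto span(b1, b2) by solving (B^T B) c = B^T v."""
    g11, g12, g22 = dot(b1, b1), dot(b1, b2), dot(b2, b2)  # entries of B^T B
    r1, r2 = dot(b1, v), dot(b2, v)                        # entries of B^T v
    det = g11 * g22 - g12 * g12
    if det == 0:
        raise ValueError("b1 and b2 are linearly dependent")
    c1 = (r1 * g22 - g12 * r2) / det
    c2 = (g11 * r2 - g12 * r1) / det
    return [c1 * x + c2 * y for x, y in zip(b1, b2)]

b1 = [1, 0, 0]
b2 = [1, 1, 0]     # not orthogonal to b1, so B^T B is not the identity
v = [1, 2, 3]

p = project_onto_span(v, b1, b2)        # [1.0, 2.0, 0.0]
residual = [vi - pi for vi, pi in zip(v, p)]
assert dot(residual, b1) == 0 and dot(residual, b2) == 0
```

The columns span the $xy$-plane, so the projection simply drops the third coordinate, and the residual lies in $U^\perp$ as the proof requires.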

Linear regression preview

The least-squares solution $\hat{\boldsymbol\theta} = (\Phi^\top\Phi)^{-1}\Phi^\top\mathbf{y}$ in deck 09 gives the fitted values $\Phi\hat{\boldsymbol\theta} = \Phi(\Phi^\top\Phi)^{-1}\Phi^\top\mathbf{y}$, which is exactly the projection of $\mathbf{y}$ onto $\mathrm{Col}(\Phi)$. The "$(\Phi^\top\Phi)^{-1}$" in the middle is there because $\Phi$ need not have orthonormal columns; if it did, $\Phi^\top\Phi = I$ and the fitted values would reduce to $\Phi\Phi^\top\mathbf{y}$.
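
As a concrete instance, take $\Phi$ with a column of ones and a column of inputs $x_i$, so the model is a straight line. The sketch below solves the resulting $2\times 2$ normal equations in closed form and checks that the residual is orthogonal to both columns of $\Phi$. The data points and the name `fit_line` are made up for this example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ~ theta0 + theta1 * x via the normal equations."""
    n = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx   # det(Phi^T Phi); zero if all x are equal
    if det == 0:
        raise ValueError("need at least two distinct x values")
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (n * sxy - sx * sy) / det
    return theta0, theta1

xs = [0.0, 1.0, 2.0]
ys = [1.0, 2.0, 4.0]
theta0, theta1 = fit_line(xs, ys)                 # 5/6 and 1.5
fitted = [theta0 + theta1 * x for x in xs]        # projection of y onto Col(Phi)
residual = [y - f for y, f in zip(ys, fitted)]

# The residual is orthogonal to the ones column and to the x column.
assert abs(sum(residual)) < 1e-12
assert abs(sum(x * r for x, r in zip(xs, residual))) < 1e-12
```

The two assertions are the normal equations $\Phi^\top(\mathbf{y}-\Phi\hat{\boldsymbol\theta}) = \mathbf{0}$ written out column by column.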

08

Rotations

A rotation is a linear map that preserves both lengths and orientation. In matrix form: $R\in\mathbb{R}^{D\times D}$ is a rotation iff $R^\top R = I$ and $\det R = +1$. The set of all such $R$ forms a group called $SO(D)$.

In $\mathbb{R}^2$

$$R(\theta) = \begin{bmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \phantom{-}\cos\theta\end{bmatrix}.$$

$R(\theta_1)R(\theta_2)=R(\theta_1+\theta_2)$. A 2D rotation is uniquely determined by one angle.
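
These properties are quick to verify numerically. The following is a minimal sketch with $2\times 2$ matrices stored as nested lists; the helper names and angles are choices for this example.

```python
import math

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s],
            [s, c]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [list(col) for col in zip(*A)]

def close(A, B, tol=1e-12):
    return all(abs(a - b) < tol for ra, rb in zip(A, B) for a, b in zip(ra, rb))

a, b = 0.3, 1.1
R = rotation(a)
identity = [[1.0, 0.0], [0.0, 1.0]]
det = R[0][0] * R[1][1] - R[0][1] * R[1][0]

assert close(matmul(rotation(a), rotation(b)), rotation(a + b))  # angles add
assert close(matmul(transpose(R), R), identity)                  # R^T R = I
assert abs(det - 1.0) < 1e-12                                    # det R = +1
```

Because angles add, 2D rotations commute; that is special to $\mathbb{R}^2$ and fails in general from $\mathbb{R}^3$ onward.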

In $\mathbb{R}^3$

A 3D rotation can be parameterised by three angles (Euler angles, or roll/pitch/yaw), by a unit axis and a single angle (Rodrigues), or by a unit quaternion.

Orthogonal but not a rotation

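If $R^\top R = I$ but $\det R = -1$, $R$ reverses orientation: it is a reflection, possibly combined with a rotation. Rotations and these orientation-reversing maps together form the orthogonal group $O(D)$.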

RoPE in transformers

Rotary Position Embeddings (used by Llama, Gemma, DeepSeek) rotate pairs of dimensions in $\mathbb{R}^d$ by an angle that depends on position. The position-dependent rotation matrix is block-diagonal with $d/2$ blocks, each of the $2\times 2$ form above. Every block is orthogonal with determinant $+1$ and the blocks act on disjoint coordinates, so the full matrix is itself a rotation in $SO(d)$.
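
The block-diagonal rotation can be sketched without forming the matrix. The version below makes two assumptions that the deck does not specify: it pairs adjacent coordinates $(x_{2i}, x_{2i+1})$, and it uses the commonly cited frequency schedule $\text{base}^{-2i/d}$ with $\text{base}=10000$. Real implementations may pair coordinates differently.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rope(x, position, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by angle position * base**(-2i/d)."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("dimension must be even")
    out = []
    for i in range(d // 2):
        angle = position * base ** (-2.0 * i / d)   # assumed schedule
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out += [c * a - s * b, s * a + c * b]       # one 2x2 rotation block
    return out

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# A rotation preserves length.
assert abs(dot(rope(q, 7), rope(q, 7)) - dot(q, q)) < 1e-9

# The query-key inner product depends only on the position difference (here 4).
score_a = dot(rope(q, 3), rope(k, 7))
score_b = dot(rope(q, 13), rope(k, 17))
assert abs(score_a - score_b) < 1e-9
```

The second check is the reason rotations are attractive here: since $R(\theta_1)^\top R(\theta_2) = R(\theta_2-\theta_1)$ in each block, attention scores see relative rather than absolute position.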

09

Inner Products on Function Spaces

The abstract definition of inner product means we can put one on spaces of functions. The space $L^2([a,b])$ of square-integrable functions on $[a,b]$ comes with

$$\langle f, g\rangle = \int_a^b f(x)\, g(x)\, dx.$$

This makes "two functions are orthogonal" a sensible statement. Three uses in ML:

Fourier

$\{1, \sin nx, \cos nx : n \ge 1\}$ is an orthogonal basis of $L^2([0,2\pi])$. Each Fourier coefficient is an inner product with a basis function, divided by that function's squared norm.

Polynomial features

Orthogonal polynomials (Legendre, Hermite) give numerically stable feature maps for regression.

Kernel methods

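Mercer's theorem: a continuous, symmetric, positive-semidefinite kernel $k(\mathbf{x},\mathbf{x}')$ can be written as an inner product of feature maps, $k(\mathbf{x},\mathbf{x}') = \langle\phi(\mathbf{x}),\phi(\mathbf{x}')\rangle$, in a possibly infinite-dimensional feature space (the RKHS). The SVM in deck 12 only ever uses $k$, never the features themselves.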

The same theorems (Pythagoras, Cauchy–Schwarz, projection minimises distance) hold in these function spaces, with the proviso that projection needs the subspace to be finite-dimensional or closed. That's the power of the abstract definition: every result that uses only the inner-product axioms carries over unchanged.
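
As a numerical sanity check of the Fourier example, the sketch below approximates the $L^2([0,2\pi])$ inner product with a midpoint rule and recovers a coefficient of a test signal. The helper `l2_inner`, the grid size, and the signal are choices made for this illustration.

```python
import math

def l2_inner(f, g, a=0.0, b=2.0 * math.pi, n=10_000):
    """Approximate the integral of f(x) g(x) over [a, b] by the midpoint rule."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * h
        total += f(x) * g(x)
    return h * total

def sin2(x):
    return math.sin(2.0 * x)

def signal(x):
    return 3.0 * math.sin(x) + 0.5 * math.cos(x)

# Distinct basis functions are orthogonal.
assert abs(l2_inner(math.sin, math.cos)) < 1e-9
assert abs(l2_inner(math.sin, sin2)) < 1e-9

# sin x has squared norm pi, so the basis is orthogonal but not orthonormal.
assert abs(l2_inner(math.sin, math.sin) - math.pi) < 1e-9

# Coefficient = inner product with the basis function / its squared norm.
coeff = l2_inner(signal, math.sin) / l2_inner(math.sin, math.sin)
assert abs(coeff - 3.0) < 1e-9
```

This is the function-space version of reading off a coordinate with an inner product, with one extra division because the basis functions are not normalised.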

10

Where This Lands in Part II

Concept	Used by	Where
$\ell^2$ norm	Ridge regression	Regulariser $\|\boldsymbol\theta\|^2$ shrinks toward zero
$\ell^1$ norm	Sparse models	Lasso selects features
Inner product	SVM (ch. 12)	Replaced by kernel $k(\mathbf{x},\mathbf{x}')$
Cosine similarity	Embeddings, retrieval	Length-invariant similarity
Cauchy–Schwarz	Convergence proofs	SGD, perceptron, regret bounds
Orthonormal basis	PCA (ch. 10)	Principal components are an orthonormal basis
Orthogonal projection	Linear regression (ch. 9)	$\hat{\mathbf{y}}$ is the projection of $\mathbf{y}$ onto $\mathrm{Col}(\Phi)$
Distance to hyperplane	SVM (ch. 12)	Margin = $\frac{|\mathbf{w}^\top\mathbf{x}+b|}{\|\mathbf{w}\|}$
Rotation	PCA, decoders	Whitening / RoPE / orthogonal weight init

11

Cheat Sheet

Quantity	Formula
length	$\|\mathbf{v}\| = \sqrt{\langle\mathbf{v},\mathbf{v}\rangle}$
distance	$d(\mathbf{u},\mathbf{v}) = \|\mathbf{u}-\mathbf{v}\|$
angle	$\cos\theta = \frac{\langle\mathbf{u},\mathbf{v}\rangle}{\|\mathbf{u}\|\,\|\mathbf{v}\|}$
Cauchy–Schwarz	$|\langle\mathbf{u},\mathbf{v}\rangle| \le \|\mathbf{u}\|\,\|\mathbf{v}\|$
projection onto a line	$\pi_L(\mathbf{v})=\frac{\langle\mathbf{v},\mathbf{u}\rangle}{\langle\mathbf{u},\mathbf{u}\rangle}\mathbf{u}$
projection onto $\mathrm{Col}(B)$	$B(B^\top B)^{-1}B^\top \mathbf{v}$
orthonormal projection	$BB^\top \mathbf{v}$ (when $B^\top B=I$)
residual is perpendicular	$\mathbf{v}-\pi_U(\mathbf{v})\in U^\perp$
rotation in $\mathbb{R}^2$	$\begin{bmatrix}\cos\theta & -\sin\theta\\\sin\theta & \cos\theta\end{bmatrix}$
$SO(D)$	$R^\top R=I,\;\det R=+1$

Up next

Deck 04 takes the same geometry one step further: when does a matrix have a complete orthonormal basis of eigenvectors? The answer (the spectral theorem) and its generalisation to rectangular matrices (the SVD) sit at the heart of PCA and low-rank approximation.