Length, angle, and orthogonality — the geometric language behind cosine similarity, attention scores, the 1/√d scale, and why random high-dimensional vectors are almost orthogonal.
The dot product on $\mathbb{R}^n$ has three equivalent definitions; the elegance is that they all give the same number.
$$\langle \mathbf{u}, \mathbf{v} \rangle = \sum_{i=1}^n u_i v_i.$$
The component-wise sum of products. The thing the GPU computes.
$$\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^\top \mathbf{v}.$$
One row times one column. The matmul atom.
$$\langle \mathbf{u}, \mathbf{v} \rangle = \|\mathbf{u}\|\,\|\mathbf{v}\|\,\cos\theta.$$
Length-times-length-times-cosine-of-angle. The picture in $\mathbb{R}^2$.
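A quick NumPy check (not part of the deck's widgets): in $\mathbb{R}^2$ the angle can be measured independently with `arctan2`, so all three definitions can be compared without circularity.

```python
import numpy as np

u = np.array([3.0, 1.0])
v = np.array([1.0, 2.0])

d1 = np.sum(u * v)      # definition 1: component-wise sum of products
d2 = u @ v              # definition 2: row times column

theta = np.arctan2(v[1], v[0]) - np.arctan2(u[1], u[0])     # angle measured directly
d3 = np.linalg.norm(u) * np.linalg.norm(v) * np.cos(theta)  # definition 3

print(d1, d2, d3)       # all three print 5.0 (up to float error)
```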
An inner product on a real vector space $V$ is a function $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$ satisfying three properties: symmetry, $\langle \mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{v}, \mathbf{u} \rangle$; linearity in each argument; and positive-definiteness, $\langle \mathbf{v}, \mathbf{v} \rangle \ge 0$ with equality iff $\mathbf{v} = \mathbf{0}$.
Any function satisfying these defines a "geometry" on the space — a notion of length and angle. The standard dot product on $\mathbb{R}^n$ is one example, the Mahalanobis inner product (slide 06) is another.
A norm $\|\cdot\| : V \to [0, \infty)$ satisfies three properties: $\|\mathbf{v}\| = 0 \iff \mathbf{v} = \mathbf{0}$; $\|\alpha \mathbf{v}\| = |\alpha|\,\|\mathbf{v}\|$; and $\|\mathbf{u} + \mathbf{v}\| \le \|\mathbf{u}\| + \|\mathbf{v}\|$ (the triangle inequality).
Every inner product gives a norm: $\|\mathbf{v}\| = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}$. But not every norm comes from an inner product (of the $L^p$ family, only $L^2$ does).
| Norm | Formula | Where it shows up in ML |
|---|---|---|
| $L^1$ (taxicab) | $\sum_i |v_i|$ | Lasso regularisation, sparse-coding objective, total-variation losses. |
| $L^2$ (Euclidean) | $\sqrt{\sum_i v_i^2}$ | Default everywhere — weight decay, MSE, gradient norms, embedding distance, attention scaling. |
| $L^\infty$ (max) | $\max_i |v_i|$ | Adversarial $\varepsilon$-balls, max-perturbation attack budgets, PGD attacks. |
| $L^p$ ($1 \le p < \infty$) | $(\sum_i |v_i|^p)^{1/p}$ | Less common in ML; appears in regularisation theory and analysis of generalisation bounds. |
| Frobenius (matrix) | $\sqrt{\sum_{ij} A_{ij}^2}$ | Treats a matrix as a flat vector. Standard ML matrix norm; the Eckart-Young theorem (deck 07) is in Frobenius norm. |
| Spectral (matrix) | $\sigma_1(A)$ (largest singular value) | Lipschitz constant of a linear layer; used in spectral normalisation (Miyato et al. 2018) for GAN stability. |
$L^1$, $L^2$ and $L^\infty$ unit balls in $\mathbb{R}^2$ are a square rotated 45° (a diamond), a circle, and an axis-aligned square. The corners of the $L^1$ ball sit on the axes — this is the geometric reason Lasso encourages sparse solutions: the optimum tends to land at a corner where some coordinates are exactly zero.
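For reference, a short sketch of the table's norms in NumPy (the vector and matrix below are arbitrary examples):

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])
A = np.array([[2.0, 0.0], [1.0, 3.0]])

l1   = np.sum(np.abs(v))                  # L1 (taxicab): 7
l2   = np.linalg.norm(v)                  # L2 (Euclidean): 5
linf = np.max(np.abs(v))                  # L-infinity (max): 4
lp   = np.sum(np.abs(v) ** 3) ** (1 / 3)  # general L^p, here p = 3

fro  = np.linalg.norm(A, ord="fro")       # Frobenius: sqrt of the sum of squared entries
spec = np.linalg.norm(A, ord=2)           # spectral: largest singular value of A
```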
The most-cited inequality in linear algebra:
$$|\langle \mathbf{u}, \mathbf{v} \rangle| \le \|\mathbf{u}\|\,\|\mathbf{v}\|.$$
Equality holds iff $\mathbf{u}$ and $\mathbf{v}$ are linearly dependent (parallel, or one of them zero). Reading: the absolute dot product is bounded by the product of the lengths — equivalently, $|\cos\theta| \le 1$.
Assume $\mathbf{v} \neq \mathbf{0}$ (otherwise both sides are zero). For any $t \in \mathbb{R}$, $0 \le \|\mathbf{u} - t\mathbf{v}\|^2 = \|\mathbf{u}\|^2 - 2t \langle \mathbf{u}, \mathbf{v} \rangle + t^2 \|\mathbf{v}\|^2$. Choose $t = \langle \mathbf{u}, \mathbf{v} \rangle / \|\mathbf{v}\|^2$ to minimise the RHS; the minimum is $\|\mathbf{u}\|^2 - \langle \mathbf{u}, \mathbf{v} \rangle^2 / \|\mathbf{v}\|^2 \ge 0$. Rearranging gives Cauchy-Schwarz.
$$\|\mathbf{u} + \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + 2\langle \mathbf{u}, \mathbf{v} \rangle + \|\mathbf{v}\|^2 \le \|\mathbf{u}\|^2 + 2\|\mathbf{u}\|\|\mathbf{v}\| + \|\mathbf{v}\|^2 = (\|\mathbf{u}\| + \|\mathbf{v}\|)^2.$$
Cauchy-Schwarz is doing the work in the middle.
Cauchy-Schwarz bounds everything built from dot products. Attention scores live in $[-\|\mathbf{q}\|\|\mathbf{k}\|, +\|\mathbf{q}\|\|\mathbf{k}\|]$. A normalisation step ($\|\mathbf{q}\| = \|\mathbf{k}\| = 1$) tightens this to $[-1, 1]$. The Lipschitz-bounded layers used in GAN stabilisation lean on Cauchy-Schwarz to get worst-case behaviour. The attention $1/\sqrt{d}$ scaling depends on what Cauchy-Schwarz doesn't tell you — the typical case — which we cover on slide 08.
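A two-line numerical check of both claims, with random stand-ins for $\mathbf{q}$ and $\mathbf{k}$ (nothing here is a real attention layer):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
q, k = rng.standard_normal(d), rng.standard_normal(d)

# Cauchy-Schwarz: |q.k| <= ||q|| ||k||
assert abs(q @ k) <= np.linalg.norm(q) * np.linalg.norm(k)

# After normalisation the score is pinned to [-1, 1].
q_hat = q / np.linalg.norm(q)
k_hat = k / np.linalg.norm(k)
assert -1.0 <= q_hat @ k_hat <= 1.0
```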
The cosine similarity of $\mathbf{u}, \mathbf{v}$ is the cosine of the angle between them:
$$\mathrm{cos\_sim}(\mathbf{u}, \mathbf{v}) = \frac{\langle \mathbf{u}, \mathbf{v} \rangle}{\|\mathbf{u}\|\,\|\mathbf{v}\|}.$$
Range $[-1, 1]$ (by Cauchy-Schwarz). $+1$ = same direction, $-1$ = opposite, $0$ = orthogonal. This is the standard similarity score for embeddings in retrieval, RAG, deduplication, and most metric-learning losses.
For embeddings, direction is meaningful but length is often a side-effect of token frequency, training instability, or the length of an input sequence. A short query and a long document can have similar topical content but very different L2 norms; cosine ignores that. Equivalently, cosine is the dot product after L2-normalising both vectors.
If $\hat{\mathbf{u}}, \hat{\mathbf{v}}$ are unit vectors,
$$\|\hat{\mathbf{u}} - \hat{\mathbf{v}}\|^2 = 2 - 2\,\mathrm{cos\_sim}(\hat{\mathbf{u}}, \hat{\mathbf{v}}).$$
Squared Euclidean distance on the unit sphere is a monotone function of cosine similarity. So nearest-neighbour search by cosine and by L2 distance give identical orderings (after normalisation) — a fact that vector-DB indexes (FAISS, ScaNN, HNSW) exploit.
OpenAI text-embedding-3, Cohere embed-v3, Voyage and most modern embedders return L2-normalised vectors. Cosine similarity reduces to a plain dot product, which is faster and SIMD-friendly. Always check whether your embedder returns normalised vectors before reaching for cos_sim — if it does, `v1 @ v2` is enough.
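A minimal sketch of that equivalence, plus the unit-sphere identity above, using random stand-in embeddings (the dimension 768 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
u, v = rng.standard_normal(768), rng.standard_normal(768)

cos_sim = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)

# Cosine similarity = plain dot product after L2 normalisation.
assert np.isclose(cos_sim, u_hat @ v_hat)

# Squared distance on the unit sphere = 2 - 2 * cosine similarity.
assert np.isclose(np.sum((u_hat - v_hat) ** 2), 2 - 2 * cos_sim)
```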
Two vectors are orthogonal when $\langle \mathbf{u}, \mathbf{v} \rangle = 0$. A set $\{\mathbf{e}_1, \ldots, \mathbf{e}_n\}$ is orthonormal when its members are unit vectors and pairwise orthogonal — the simplest possible basis.
The closest point in a subspace $U$ to a given vector $\mathbf{v}$ is the orthogonal projection of $\mathbf{v}$ onto $U$ — the unique $\hat{\mathbf{v}} \in U$ with $\mathbf{v} - \hat{\mathbf{v}} \perp U$. This is the foundation of least-squares, of the FFN's projection structure (deck 05) and of every "best low-rank approximation" theorem.
Deck 05 is entirely about projections. This slide is the bridge: projection is geometry built out of inner products, and that's all it is.
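A minimal sketch of the closest-point property: project onto the column span of an arbitrary matrix $A$ with a least-squares solve, then check the residual is orthogonal to the subspace (the sizes here are made up).

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 2))   # columns span a 2-D subspace U of R^5
v = rng.standard_normal(5)

coeffs, *_ = np.linalg.lstsq(A, v, rcond=None)
v_hat = A @ coeffs                # closest point in U to v

# The residual v - v_hat is orthogonal to every column of A, hence to all of U.
assert np.allclose(A.T @ (v - v_hat), 0)
```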
The standard dot product on $\mathbb{R}^n$ is one inner product. Many others exist. The general form on $\mathbb{R}^n$ is
$$\langle \mathbf{u}, \mathbf{v} \rangle_M = \mathbf{u}^\top M \mathbf{v}$$
for any symmetric positive-definite $M$. The standard one is $M = I$.
Taking $M = \Sigma^{-1}$, with $\Sigma$ the covariance of the data, gives the Mahalanobis distance
$$d_{\Sigma}(\mathbf{u}, \mathbf{v}) = \sqrt{(\mathbf{u} - \mathbf{v})^\top \Sigma^{-1} (\mathbf{u} - \mathbf{v})}.$$
Distances are measured in standard deviations along the principal axes of the data; whitening makes Mahalanobis equal to Euclidean.
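A sketch with synthetic data showing both facts: the Mahalanobis distance computed from $\Sigma^{-1}$, and its agreement with plain Euclidean distance after whitening by $\Sigma^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(4)
# Correlated synthetic data: standard Gaussian pushed through a mixing matrix.
X = rng.standard_normal((500, 3)) @ np.array([[2.0, 0.3, 0.0],
                                              [0.0, 1.0, 0.5],
                                              [0.0, 0.0, 0.2]])
Sigma = np.cov(X, rowvar=False)

u, v = X[0], X[1]
diff = u - v
d_mahal = np.sqrt(diff @ np.linalg.solve(Sigma, diff))

# Whitening: multiply by Sigma^(-1/2); then ordinary L2 distance agrees.
evals, evecs = np.linalg.eigh(Sigma)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T
assert np.isclose(d_mahal, np.linalg.norm(W @ u - W @ v))
```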
For a list of vectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$, the Gram matrix has entries $G_{ij} = \langle \mathbf{v}_i, \mathbf{v}_j \rangle$. Symmetric, positive semi-definite, captures all pairwise geometry. The kernel matrix in kernel methods is exactly a Gram matrix in feature space.
Define a feature map $\phi : \mathcal{X} \to V$ and an inner product on $V$. Many ML algorithms (SVM, kernel ridge, GP) only ever access $\phi(\mathbf{x}_i)$ through inner products $\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$, so you can use them without ever computing $\phi$ explicitly — just a kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$ that returns the inner product directly. This is what made SVMs powerful in the 90s and is the precise mathematical content of "attention is approximate kernel evaluation".
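A textbook-sized sketch of the trick (the quadratic feature map and kernel below are standard examples, not anything from this deck): the Gram matrix built from explicit features matches the one built from the kernel alone.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the polynomial kernel k(x, y) = (x . y)^2 in 2-D.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def kernel(x, y):
    return (x @ y) ** 2

rng = np.random.default_rng(5)
X = rng.standard_normal((4, 2))

# Gram matrix in feature space vs. the same numbers from the kernel, no phi needed.
G_explicit = np.array([[phi(a) @ phi(b) for b in X] for a in X])
G_kernel   = np.array([[kernel(a, b)    for b in X] for a in X])

assert np.allclose(G_explicit, G_kernel)
```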
A foundational fact about high dimensions:
Let $\mathbf{u}, \mathbf{v}$ be two independent vectors with i.i.d. $\mathcal{N}(0, 1)$ entries in $\mathbb{R}^d$. Then
$$\mathbb{E}[\langle \mathbf{u}, \mathbf{v} \rangle] = 0, \quad \mathrm{Var}[\langle \mathbf{u}, \mathbf{v} \rangle] = d.$$
So the dot product has standard deviation $\sqrt{d}$, and after normalisation by lengths (each $\approx \sqrt{d}$), the cosine similarity has standard deviation $1/\sqrt{d}$.
For $d = 4096$, $1/\sqrt{d} \approx 0.0156$. Two random Gaussian vectors in $\mathbb{R}^{4096}$ have cosine similarity within $\pm 0.05$ of zero with overwhelming probability — they are almost orthogonal.
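An empirical check of the concentration claim (the sample count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
d, trials = 4096, 1000

U = rng.standard_normal((trials, d))
V = rng.standard_normal((trials, d))
cos = np.sum(U * V, axis=1) / (np.linalg.norm(U, axis=1) * np.linalg.norm(V, axis=1))

print(cos.std(), 1 / np.sqrt(d))    # both around 0.0156
print(np.mean(np.abs(cos) < 0.05))  # ~0.999: almost every pair is within ±0.05 of orthogonal
```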
The scaled dot-product attention score:
$$\mathrm{score}(\mathbf{q}, \mathbf{k}) = \frac{\mathbf{q}^\top \mathbf{k}}{\sqrt{d}}.$$
Why $\sqrt{d}$ and not $d$ or $d^2$? It's the previous slide.
Suppose $\mathbf{q}, \mathbf{k}$ have entries with mean 0 and variance 1. Then
$$\mathbf{q}^\top \mathbf{k} = \sum_{i=1}^d q_i k_i, \qquad \mathrm{Var}[\mathbf{q}^\top \mathbf{k}] = d.$$
So the un-scaled dot product has standard deviation $\sqrt{d}$. Without the scale, the magnitudes that go into softmax grow with $d$ — for $d = 64$ they're already on the order of $\pm 8$. Softmax saturates: all probability mass concentrates on the single largest argument, gradients through everything else go to zero, learning stalls.
Dividing by $\sqrt{d}$ keeps the score's standard deviation at $\approx 1$ regardless of head dimension. Softmax stays in its responsive range; gradients flow through all keys, not just the winner.
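A sketch contrasting softmax over raw and scaled scores, with random vectors standing in for a trained layer:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(7)
d, n_keys = 64, 16
q = rng.standard_normal(d)
K = rng.standard_normal((n_keys, d))

raw    = K @ q              # scores with std ~ sqrt(d) = 8
scaled = raw / np.sqrt(d)   # scores with std ~ 1

print(softmax(raw).max())     # typically near 1: one key hogs almost all the mass
print(softmax(scaled).max())  # noticeably flatter: mass spread across many keys
```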
Train two attention layers, one with $1/\sqrt{d}$, one without. The unscaled one stalls inside the first thousand steps for $d \ge 64$ — the score range is too wide for softmax to be useful. The Vaswani et al. 2017 paper (Section 3.2.1) gives this as the explicit motivation. The whole choice is downstream of "random vectors are almost orthogonal but not exactly — the leftover dot product grows like $\sqrt{d}$".
Drag $\mathbf{u}$ and $\mathbf{v}$. The widget shows their L2 lengths, dot product, cosine similarity (= $\cos\theta$), and Euclidean distance. Notice that as $\mathbf{u}$ rotates around $\mathbf{v}$, the dot product traces a cosine curve.
Deck 05 — Projections, Up & Down takes the geometry of inner products and uses it to project. We'll build the projection-matrix view of an FFN's up- and down-projections, the structural reason transformer MLPs go up by 4× and back down.