Linear Algebra for AI — Presentation 04

Inner Products, Norms & Geometry

Length, angle, and orthogonality — the geometric language behind cosine similarity, attention scores, the 1/√d scale, and why random high-dimensional vectors are almost orthogonal.

00

Topics We'll Cover

01

The Dot Product Three Ways

The dot product on $\mathbb{R}^n$ has three equivalent definitions; the elegance is that they all give the same number.

Algebraic

$$\langle \mathbf{u}, \mathbf{v} \rangle = \sum_{i=1}^n u_i v_i.$$

The component-wise sum of products. The thing the GPU computes.

Matrix

$$\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^\top \mathbf{v}.$$

One row times one column. The matmul atom.

Geometric

$$\langle \mathbf{u}, \mathbf{v} \rangle = \|\mathbf{u}\|\,\|\mathbf{v}\|\,\cos\theta.$$

Length-times-length-times-cosine-of-angle. The picture in $\mathbb{R}^2$.
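A quick numerical sketch (using NumPy, which the original slides do not show) that the three definitions agree. The geometric version recovers $\theta$ from the vectors themselves, so this is a consistency check of the identity rather than an independent derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(3)
v = rng.standard_normal(3)

# 1. Algebraic: component-wise sum of products.
algebraic = sum(u[i] * v[i] for i in range(3))

# 2. Matrix: row times column (for 1-D arrays, @ is exactly this).
matrix = u @ v

# 3. Geometric: |u| |v| cos(theta), with theta recovered from the vectors.
norm_u, norm_v = np.linalg.norm(u), np.linalg.norm(v)
theta = np.arccos(np.clip((u @ v) / (norm_u * norm_v), -1.0, 1.0))
geometric = norm_u * norm_v * np.cos(theta)

assert np.isclose(algebraic, matrix) and np.isclose(matrix, geometric)
```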

The three axioms

An inner product on a real vector space $V$ is a function $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$ satisfying:

  1. Symmetry: $\langle \mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{v}, \mathbf{u} \rangle$.
  2. Linearity in the first argument: $\langle \alpha \mathbf{u} + \beta \mathbf{w}, \mathbf{v} \rangle = \alpha \langle \mathbf{u}, \mathbf{v} \rangle + \beta \langle \mathbf{w}, \mathbf{v} \rangle$.
  3. Positive definiteness: $\langle \mathbf{u}, \mathbf{u} \rangle \ge 0$, with equality iff $\mathbf{u} = \mathbf{0}$.

(Bilinearity needs no separate axiom: symmetry plus linearity in the first argument gives linearity in the second for free.)

Any function satisfying these defines a "geometry" on the space — a notion of length and angle. The standard dot product on $\mathbb{R}^n$ is one example, the Mahalanobis inner product (slide 06) is another.

02

Norms — Lp and Friends

A norm $\|\cdot\| : V \to [0, \infty)$ satisfies three properties: $\|\mathbf{v}\| = 0 \iff \mathbf{v} = \mathbf{0}$; $\|\alpha \mathbf{v}\| = |\alpha|\,\|\mathbf{v}\|$; and $\|\mathbf{u} + \mathbf{v}\| \le \|\mathbf{u}\| + \|\mathbf{v}\|$ (the triangle inequality).

Every inner product gives a norm: $\|\mathbf{v}\| = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}$. But not every norm comes from an inner product: among the $L^p$ family, only $L^2$ does. (A norm arises from an inner product exactly when it satisfies the parallelogram law $\|\mathbf{u}+\mathbf{v}\|^2 + \|\mathbf{u}-\mathbf{v}\|^2 = 2\|\mathbf{u}\|^2 + 2\|\mathbf{v}\|^2$.)

| Norm | Formula | Where it shows up in ML |
| --- | --- | --- |
| $L^1$ (taxicab) | $\sum_i \lvert v_i \rvert$ | Lasso regularisation, sparse-coding objectives, total-variation losses. |
| $L^2$ (Euclidean) | $\sqrt{\sum_i v_i^2}$ | Default everywhere: weight decay, MSE, gradient norms, embedding distance, attention scaling. |
| $L^\infty$ (max) | $\max_i \lvert v_i \rvert$ | Adversarial $\varepsilon$-balls, max-perturbation attack budgets, PGD attacks. |
| $L^p$ ($1 \le p < \infty$) | $(\sum_i \lvert v_i \rvert^p)^{1/p}$ | Less common in ML; appears in regularisation theory and analysis of generalisation bounds. |
| Frobenius (matrix) | $\sqrt{\sum_{ij} A_{ij}^2}$ | Treats a matrix as a flat vector. Standard ML matrix norm; the Eckart-Young theorem (deck 07) is stated in Frobenius norm. |
| Spectral (matrix) | $\sigma_1(A)$, the largest singular value | Lipschitz constant of a linear layer; used in spectral normalisation (Miyato et al. 2018) for GAN stability. |

$L^1$, $L^2$ and $L^\infty$ unit balls in $\mathbb{R}^2$ are a square rotated 45° (a diamond), a circle, and an axis-aligned square. The corners of the $L^1$ ball sit on the axes — this is the geometric reason Lasso encourages sparse solutions: the optimum tends to land at a corner where some coordinates are exactly zero.
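The table's norms map directly onto NumPy's `ord` argument; a minimal sketch (the specific vectors here are illustrative, not from the slides):

```python
import numpy as np

v = np.array([3.0, -4.0])

l1   = np.linalg.norm(v, 1)        # |3| + |-4| = 7
l2   = np.linalg.norm(v)           # sqrt(9 + 16) = 5
linf = np.linalg.norm(v, np.inf)   # max(|3|, |-4|) = 4

A = np.array([[3.0, 0.0], [4.0, 0.0]])
fro  = np.linalg.norm(A, 'fro')    # flat-vector L2 of the entries
spec = np.linalg.norm(A, 2)        # largest singular value

# Spectral <= Frobenius always: Frobenius sums *all* squared singular values.
assert spec <= fro + 1e-12
```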

03

Cauchy-Schwarz & the Triangle Inequality

The most-cited inequality in linear algebra:

$$|\langle \mathbf{u}, \mathbf{v} \rangle| \le \|\mathbf{u}\|\,\|\mathbf{v}\|.$$

Equality holds iff $\mathbf{u}$ and $\mathbf{v}$ are parallel. In words: the absolute dot product is bounded by the product of the lengths; equivalently, $|\cos\theta| \le 1$.

Proof in two lines (real, $\mathbf{v} \neq \mathbf{0}$)

For any $t \in \mathbb{R}$, $0 \le \|\mathbf{u} - t\mathbf{v}\|^2 = \|\mathbf{u}\|^2 - 2t \langle \mathbf{u}, \mathbf{v} \rangle + t^2 \|\mathbf{v}\|^2$. Choose $t = \langle \mathbf{u}, \mathbf{v} \rangle / \|\mathbf{v}\|^2$ to minimise the RHS; the minimum is $\|\mathbf{u}\|^2 - \langle \mathbf{u}, \mathbf{v} \rangle^2 / \|\mathbf{v}\|^2 \ge 0$. Rearranging gives Cauchy-Schwarz.

Triangle inequality, as a corollary

$$\|\mathbf{u} + \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + 2\langle \mathbf{u}, \mathbf{v} \rangle + \|\mathbf{v}\|^2 \le \|\mathbf{u}\|^2 + 2\|\mathbf{u}\|\|\mathbf{v}\| + \|\mathbf{v}\|^2 = (\|\mathbf{u}\| + \|\mathbf{v}\|)^2.$$

Cauchy-Schwarz is doing the work in the middle.
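Both inequalities are easy to stress-test numerically; a sketch over random vectors (NumPy assumed, dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    u = rng.standard_normal(8)
    v = rng.standard_normal(8)
    # Cauchy-Schwarz: |<u, v>| <= ||u|| ||v||
    assert abs(u @ v) <= np.linalg.norm(u) * np.linalg.norm(v) + 1e-12
    # Triangle inequality: ||u + v|| <= ||u|| + ||v||
    assert np.linalg.norm(u + v) <= np.linalg.norm(u) + np.linalg.norm(v) + 1e-12
```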

Why ML practitioners care

Cauchy-Schwarz bounds everything built from dot products. Attention scores live in $[-\|\mathbf{q}\|\|\mathbf{k}\|, +\|\mathbf{q}\|\|\mathbf{k}\|]$. A normalisation step ($\|\mathbf{q}\| = \|\mathbf{k}\| = 1$) tightens this to $[-1, 1]$. The Lipschitz-bounded layers used in GAN stabilisation lean on Cauchy-Schwarz to get worst-case behaviour. The attention $1/\sqrt{d}$ scaling depends on what Cauchy-Schwarz doesn't tell you — the typical case — which we cover on slide 08.

04

Cosine Similarity

The cosine similarity of $\mathbf{u}, \mathbf{v}$ is the cosine of the angle between them:

$$\mathrm{cos\_sim}(\mathbf{u}, \mathbf{v}) = \frac{\langle \mathbf{u}, \mathbf{v} \rangle}{\|\mathbf{u}\|\,\|\mathbf{v}\|}.$$

Range $[-1, 1]$ (by Cauchy-Schwarz). $+1$ = same direction, $-1$ = opposite, $0$ = orthogonal. This is the standard similarity score for embeddings in retrieval, RAG, deduplication, and most metric-learning losses.

Why cosine, not Euclidean distance?

For embeddings, direction is meaningful but length is often a side-effect of token frequency, training instability, or the length of an input sequence. A short query and a long document can have similar topical content but very different L2 norms; cosine ignores that. Equivalently, cosine is the dot product after L2-normalising both vectors.

Cosine vs dot vs Euclidean — the relations

If $\hat{\mathbf{u}}, \hat{\mathbf{v}}$ are unit vectors,

$$\|\hat{\mathbf{u}} - \hat{\mathbf{v}}\|^2 = 2 - 2\,\mathrm{cos\_sim}(\mathbf{u}, \mathbf{v}).$$

Squared Euclidean distance on the unit sphere is a monotone function of cosine similarity. So nearest-neighbour search by cosine and by L2 distance give identical orderings (after normalisation) — a fact that vector-DB indexes (FAISS, ScaNN, HNSW) exploit.
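A sketch of both facts at once: the unit-sphere identity, and the ranking equivalence a vector index relies on (random data, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
q = rng.standard_normal(64)
docs = rng.standard_normal((100, 64))

def unit(x):
    # L2-normalise along the last axis.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

qh, dh = unit(q), unit(docs)

# On the unit sphere: ||qh - dh||^2 = 2 - 2 cos_sim, row by row.
cos = dh @ qh
sq_dist = np.sum((dh - qh) ** 2, axis=1)
assert np.allclose(sq_dist, 2 - 2 * cos)

# Ranking by cosine (descending) == ranking by L2 distance (ascending).
assert np.array_equal(np.argsort(-cos), np.argsort(sq_dist))
```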

Embedding-space conventions

OpenAI text-embedding-3, Cohere embed-v3, Voyage and most modern embedders return L2-normalised vectors. Cosine similarity reduces to a plain dot product, which is faster and SIMD-friendly. Always check whether your embedder returns normalised vectors before reaching for cos_sim — if it does, v1 @ v2 is enough.

05

Orthogonality & Projection (Preview)

Two vectors are orthogonal when $\langle \mathbf{u}, \mathbf{v} \rangle = 0$. A set $\{\mathbf{e}_1, \ldots, \mathbf{e}_n\}$ is orthonormal when its members are unit vectors and pairwise orthogonal — the simplest possible basis.

Why orthonormal is so much nicer

The closest point in a subspace $U$ to a given vector $\mathbf{v}$ is the orthogonal projection of $\mathbf{v}$ onto $U$ — the unique $\hat{\mathbf{v}} \in U$ with $\mathbf{v} - \hat{\mathbf{v}} \perp U$. This is the foundation of least-squares, of the FFN's projection structure (deck 05) and of every "best low-rank approximation" theorem.
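The defining property is checkable in a few lines; a sketch with a random 2-D subspace of $\mathbb{R}^5$ (the QR construction is just a convenient way to get an orthonormal basis, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(7)
v = rng.standard_normal(5)

# Orthonormal basis for a 2-D subspace U of R^5, via QR of random vectors.
Q, _ = np.linalg.qr(rng.standard_normal((5, 2)))

# Orthogonal projection of v onto U: sum of <v, e_i> e_i over the basis.
v_hat = Q @ (Q.T @ v)

# The residual is orthogonal to every basis vector of U...
assert np.allclose(Q.T @ (v - v_hat), 0)
# ...and v_hat is at least as close to v as another point of U we try.
other = Q @ rng.standard_normal(2)
assert np.linalg.norm(v - v_hat) <= np.linalg.norm(v - other) + 1e-12
```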

Why this slide is short

Deck 05 is entirely about projections. This slide is the bridge: projection is geometry built out of inner products, and that's all it is.

06

General Inner Products: Mahalanobis & the Gram Matrix

The standard dot product on $\mathbb{R}^n$ is one inner product. Many others exist. The general form on $\mathbb{R}^n$ is

$$\langle \mathbf{u}, \mathbf{v} \rangle_M = \mathbf{u}^\top M \mathbf{v}$$

for any symmetric positive-definite $M$. The standard one is $M = I$.

Mahalanobis distance

$$d_M(\mathbf{u}, \mathbf{v}) = \sqrt{(\mathbf{u} - \mathbf{v})^\top \Sigma^{-1} (\mathbf{u} - \mathbf{v})}$$

with $\Sigma$ the covariance of the data. It measures distance in standard deviations along the data's principal axes; whitening the data first makes Mahalanobis distance coincide with ordinary Euclidean distance.
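A sketch of the whitening equivalence, using a Cholesky factor $\Sigma = LL^\top$ (the specific covariance here is illustrative): Mahalanobis distance under $\Sigma$ equals Euclidean distance after multiplying by $L^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(3)
# A hand-picked 2-D covariance for illustration.
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
L = np.linalg.cholesky(Sigma)            # Sigma = L @ L.T
X = rng.standard_normal((500, 2)) @ L.T  # samples with covariance ~ Sigma

u, v = X[0], X[1]
Sigma_inv = np.linalg.inv(Sigma)
d_maha = np.sqrt((u - v) @ Sigma_inv @ (u - v))

# Whitening: map by L^{-1} (covariance becomes I), then measure Euclidean.
W = np.linalg.inv(L)
d_eucl_white = np.linalg.norm(W @ (u - v))

assert np.isclose(d_maha, d_eucl_white)
```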

Gram matrix

For a list of vectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$, the Gram matrix has entries $G_{ij} = \langle \mathbf{v}_i, \mathbf{v}_j \rangle$. Symmetric, positive semi-definite, captures all pairwise geometry. The kernel matrix in kernel methods is exactly a Gram matrix in feature space.

The kernel trick, in two sentences

Define a feature map $\phi : \mathcal{X} \to V$ and an inner product on $V$. Many ML algorithms (SVM, kernel ridge, GP) only ever access $\phi(\mathbf{x}_i)$ through inner products $\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$, so you can use them without ever computing $\phi$ explicitly — just a kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$ that returns the inner product directly. This is what made SVMs powerful in the 90s and is the precise mathematical content of "attention is approximate kernel evaluation".
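A sketch of both objects: an explicit Gram matrix $G = XX^\top$, and an RBF kernel matrix, which is a Gram matrix in an implicit feature space that we never materialise. The RBF bandwidth of $0.5$ is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((20, 5))

# Gram matrix of the raw vectors: G_ij = <x_i, x_j>.
G = X @ X.T
assert np.allclose(G, G.T)                       # symmetric
assert np.min(np.linalg.eigvalsh(G)) >= -1e-10   # positive semi-definite

# RBF kernel matrix: k(x_i, x_j) = exp(-0.5 ||x_i - x_j||^2).
# A Gram matrix in feature space, computed without ever building phi.
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq)
assert np.min(np.linalg.eigvalsh(K)) >= -1e-10   # also PSD
```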

07

Concentration: Random Vectors are Almost Orthogonal

A foundational fact about high dimensions:

Let $\mathbf{u}, \mathbf{v}$ be two independent vectors with i.i.d. $\mathcal{N}(0, 1)$ entries in $\mathbb{R}^d$. Then

$$\mathbb{E}[\langle \mathbf{u}, \mathbf{v} \rangle] = 0, \quad \mathrm{Var}[\langle \mathbf{u}, \mathbf{v} \rangle] = d.$$

So the dot product has standard deviation $\sqrt{d}$, and after normalisation by lengths (each $\approx \sqrt{d}$), the cosine similarity has standard deviation $1/\sqrt{d}$.

For $d = 4096$, $1/\sqrt{d} \approx 0.0156$. Two random Gaussian vectors in $\mathbb{R}^{4096}$ have cosine similarity within $\pm 0.05$ of zero with overwhelming probability — they are almost orthogonal.
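The concentration claim is directly simulable; a sketch drawing 1,000 independent Gaussian pairs in $\mathbb{R}^{4096}$ and checking that their cosine similarities cluster around 0 with spread $\approx 1/\sqrt{d}$:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_pairs = 4096, 1000

U = rng.standard_normal((n_pairs, d))
V = rng.standard_normal((n_pairs, d))

cos = np.sum(U * V, axis=1) / (
    np.linalg.norm(U, axis=1) * np.linalg.norm(V, axis=1)
)

# Cosine similarities concentrate near 0 with std ~ 1/sqrt(d) = 1/64.
assert abs(cos.mean()) < 0.01
assert abs(cos.std() - 1 / np.sqrt(d)) < 0.005
assert np.all(np.abs(cos) < 0.1)  # every pair is nearly orthogonal
```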

Why this matters
Near-orthogonality is the default state of high dimensions: two directions chosen at random interfere only at the $1/\sqrt{d}$ level. But the interference is not exactly zero; the leftover dot product grows like $\sqrt{d}$, and cancelling that growth is precisely what the attention scale on the next slide does.

08

Why Attention Scales by 1/√d

The scaled dot-product attention score:

$$\mathrm{score}(\mathbf{q}, \mathbf{k}) = \frac{\mathbf{q}^\top \mathbf{k}}{\sqrt{d}}.$$

Why $\sqrt{d}$ and not $d$ or $d^2$? It's the previous slide.

The argument from "Attention Is All You Need" (§3.2.1)

Suppose $\mathbf{q}, \mathbf{k}$ have entries with mean 0 and variance 1. Then

$$\mathbf{q}^\top \mathbf{k} = \sum_{i=1}^d q_i k_i, \qquad \mathrm{Var}[\mathbf{q}^\top \mathbf{k}] = d.$$

So the un-scaled dot product has standard deviation $\sqrt{d}$. Without the scale, the magnitudes that go into softmax grow with $d$ — for $d = 64$ they're already on the order of $\pm 8$. Softmax saturates: all probability mass concentrates on the single largest argument, gradients through everything else go to zero, learning stalls.

Dividing by $\sqrt{d}$ keeps the score's standard deviation at $\approx 1$ regardless of head dimension. Softmax stays in its responsive range; gradients flow through all keys, not just the winner.
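The saturation effect is easy to see without training anything; a sketch comparing the softmax entropy of raw versus $1/\sqrt{d}$-scaled scores for one random query against 32 random keys (dimensions chosen for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

rng = np.random.default_rng(6)
d, n_keys = 64, 32
q = rng.standard_normal(d)
K = rng.standard_normal((n_keys, d))

raw    = K @ q             # scores with std ~ sqrt(d) = 8
scaled = raw / np.sqrt(d)  # scores with std ~ 1

p_raw, p_scaled = softmax(raw), softmax(scaled)

# Unscaled scores saturate softmax: mass piles onto one key, entropy drops.
# Scaled scores keep the distribution spread out, so gradients reach all keys.
assert entropy(p_raw) < entropy(p_scaled)
assert p_raw.max() > p_scaled.max()
```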

Empirical confirmation, the easy way

Train two attention layers, one with $1/\sqrt{d}$, one without. The unscaled one stalls inside the first thousand steps for $d \ge 64$ — the score range is too wide for softmax to be useful. The Vaswani et al. 2017 paper (Section 3.2.1) gives this as the explicit motivation. The whole choice is downstream of "random vectors are almost orthogonal but not exactly — the leftover dot product grows like $\sqrt{d}$".

09

Interactive: Cosine Similarity Explorer

Drag $\mathbf{u}$ and $\mathbf{v}$. The widget shows their L2 lengths, dot product, cosine similarity (= $\cos\theta$), and Euclidean distance. Notice that as $\mathbf{u}$ rotates around $\mathbf{v}$, the dot product traces a cosine curve.

[Widget readout at one configuration: ‖u‖ = ‖v‖ = 2.60, u·v = 3.50, cos θ ≈ 0.52, θ ≈ 59°, ‖u−v‖ ≈ 2.55.]
10

Cheat Sheet

Read next

Deck 05 — Projections, Up & Down takes the geometry of inner products and uses it to project. We'll build the projection-matrix view of an FFN's up- and down-projections, the structural reason transformer MLPs go up by 4× and back down.