Linear Algebra for AI — Presentation 06

Eigenvalues & Eigenvectors

A matrix's eigenvectors are the directions it only scales, never rotates. Find them and the matrix becomes diagonal — and a thousand questions about gradients, stability, PCA and Hessians become trivial.

00

Topics We'll Cover

  1. Definition & Geometric Picture
  2. Computing Eigenvalues: the Characteristic Polynomial
  3. Diagonalisation
  4. The Spectral Theorem (Symmetric Matrices)
  5. Positive (Semi)-Definite Matrices
  6. Power Iteration
  7. PCA
  8. Eigenvalues & Gradient Flow
  9. Interactive: 2D Eigenvector Animation
  10. Cheat Sheet

01

Definition & Geometric Picture

An eigenvector of a square matrix $A$ is a non-zero $\mathbf{v}$ such that $A$ acting on $\mathbf{v}$ produces a scaled copy of $\mathbf{v}$:

$$A\mathbf{v} = \lambda \mathbf{v}.$$

The scalar $\lambda$ is the eigenvalue. Geometrically: $\mathbf{v}$ is a direction that $A$ leaves invariant up to scaling. It might be stretched ($|\lambda| > 1$), shrunk ($|\lambda| < 1$), flipped ($\lambda < 0$), or held fixed ($\lambda = 1$) — but its direction is preserved.

What this excludes and includes

Excluded: $\mathbf{v} = \mathbf{0}$, which satisfies the equation for every $\lambda$ and so carries no information. Included: $\lambda = 0$ is a perfectly good eigenvalue (its eigenvectors form the null space of $A$), and $\lambda$ may be negative or, for non-symmetric $A$, complex.

Why we care

If you write $\mathbf{x} = c_1\mathbf{v}_1 + \cdots + c_n\mathbf{v}_n$ in an eigenbasis, then $A\mathbf{x} = c_1 \lambda_1 \mathbf{v}_1 + \cdots + c_n \lambda_n \mathbf{v}_n$ — each component scales independently. Repeated application gives $A^k\mathbf{x} = \sum_i c_i \lambda_i^k \mathbf{v}_i$. The eigenvalue with the largest $|\lambda|$ dominates. This is why eigen-analysis tells you about stability, dynamics, fixed points, and long-time behaviour.
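The per-component scaling is easy to verify numerically. A minimal NumPy sketch (the matrix and vector are arbitrary choices, not from the deck):

```python
import numpy as np

# A symmetric matrix, so eigh gives an orthonormal eigenbasis.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, V = np.linalg.eigh(A)           # eigenvalues ascending: [1., 3.]

x = np.array([0.7, -0.3])
c = V.T @ x                          # coordinates of x in the eigenbasis

k = 5
direct = np.linalg.matrix_power(A, k) @ x   # A^k x computed directly ...
via_eigen = V @ (c * lam**k)                # ... equals sum_i c_i lam_i^k v_i
assert np.allclose(direct, via_eigen)
```

For large $k$ the $\lambda = 3$ term dwarfs the $\lambda = 1$ term, which is the dominance argument in miniature.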

02

Computing Eigenvalues: the Characteristic Polynomial

$A\mathbf{v} = \lambda \mathbf{v}$ rearranges to $(A - \lambda I)\mathbf{v} = \mathbf{0}$. A non-zero $\mathbf{v}$ exists iff $A - \lambda I$ is singular — iff its determinant is zero:

$$\det(A - \lambda I) = 0.$$

This is the characteristic polynomial of $A$, of degree $n$. Its $n$ roots (over $\mathbb{C}$) are the eigenvalues. For each $\lambda_i$, the null space $\mathrm{Null}(A - \lambda_i I)$ is the corresponding eigenspace.

2×2 example

For $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$,

$$\det(A - \lambda I) = (a - \lambda)(d - \lambda) - bc = \lambda^2 - (a+d)\lambda + (ad - bc).$$

Coefficients: $a + d = \mathrm{tr}(A)$ and $ad - bc = \det(A)$. So

$$\lambda^2 - \mathrm{tr}(A)\lambda + \det(A) = 0.$$
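The trace/determinant quadratic can be checked against the library in a few lines (the example matrix is an arbitrary choice):

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])
tr, det = np.trace(A), np.linalg.det(A)

# Roots of lambda^2 - tr*lambda + det = 0 via the quadratic formula.
disc = np.sqrt(tr**2 - 4 * det)
roots = np.array([(tr + disc) / 2, (tr - disc) / 2])

assert np.allclose(sorted(roots), sorted(np.linalg.eigvals(A).real))
```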

Two facts that hold in any dimension

The sum of the eigenvalues equals $\mathrm{tr}(A)$, and their product equals $\det(A)$ (counting roots with multiplicity, over $\mathbb{C}$). In the 2×2 case both can be read straight off the quadratic above.

Don't compute eigenvalues by characteristic polynomial

For numerical work, forming the characteristic polynomial and finding its roots is badly conditioned — tiny perturbations of the coefficients can move the roots a long way — and for degree $\ge 5$ there is no closed-form solution anyway (Abel–Ruffini), so iterative methods are unavoidable. Real numerical libraries use the QR algorithm with shifts: a sequence of QR decompositions of shifted matrices that converges to upper-triangular form, where the eigenvalues sit on the diagonal. numpy.linalg.eig wraps LAPACK's geev.
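A toy version of the idea, stripped of the shifts and Hessenberg reduction that real LAPACK code uses (the example matrix is an arbitrary symmetric choice with well-separated eigenvalues):

```python
import numpy as np

# Unshifted QR iteration: A_{k+1} = R_k Q_k, where A_k = Q_k R_k.
# Each step is a similarity transform (R Q = Q^T A_k Q), so the
# spectrum is preserved while A_k drifts towards triangular form.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
Ak = A.copy()
for _ in range(200):
    Q, R = np.linalg.qr(Ak)
    Ak = R @ Q

# The diagonal has converged to the eigenvalues (3 and 3 ± sqrt(3)).
assert np.allclose(sorted(np.diag(Ak)), sorted(np.linalg.eigvalsh(A)), atol=1e-8)
```

Production code adds shifts precisely because the unshifted convergence rate depends on eigenvalue ratios and can be painfully slow.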

03

Diagonalisation

If $A$ has $n$ linearly independent eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$ with eigenvalues $\lambda_1, \ldots, \lambda_n$, stack the eigenvectors as columns of $V$ and the eigenvalues as the diagonal of $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$. Then

$$A = V \Lambda V^{-1}.$$

Reading from the right: $V^{-1}\mathbf{x}$ rewrites $\mathbf{x}$ in the eigenbasis; $\Lambda$ scales each coordinate by its eigenvalue; $V$ rewrites back into the standard basis. Same vector, different lens, the action is just per-axis scaling.
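The factorisation, and the cheap matrix powers it buys, can be sanity-checked directly (the matrix is an arbitrary diagonalisable example):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])             # distinct eigenvalues => diagonalisable
lam, V = np.linalg.eig(A)
Lam = np.diag(lam)

# A = V Lam V^{-1}
assert np.allclose(A, V @ Lam @ np.linalg.inv(V))

# Powers are trivial in the eigenbasis: A^10 = V Lam^10 V^{-1}.
assert np.allclose(np.linalg.matrix_power(A, 10),
                   V @ np.diag(lam**10) @ np.linalg.inv(V))
```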

What diagonalisation gives you

Matrix powers become trivial, $A^k = V\Lambda^k V^{-1}$, so iterating a linear map costs one decomposition instead of $k$ multiplications. The same trick defines matrix functions such as $e^A = V e^\Lambda V^{-1}$, solves linear ODEs $\dot{\mathbf{x}} = A\mathbf{x}$, and makes long-run behaviour readable directly from the $\lambda_i$.

When $A$ can't be diagonalised

Some matrices have fewer linearly independent eigenvectors than dimensions — the canonical example being $\begin{pmatrix}0&1\\0&0\end{pmatrix}$, which has eigenvalue 0 (twice) but only one eigenvector $\mathbf{e}_1$. Such "defective" matrices admit only a Jordan-form decomposition. Symmetric matrices are never defective (next slide), which is the saving grace for most ML applications: covariance matrices, Hessians and Gram matrices are all symmetric.
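You can watch the defectiveness numerically: `eig` still returns two eigenvector columns for the canonical example, but they span only a line.

```python
import numpy as np

J = np.array([[0.0, 1.0],
              [0.0, 0.0]])
lam, V = np.linalg.eig(J)

assert np.allclose(lam, 0.0)            # eigenvalue 0 with multiplicity 2 ...
# ... but the two returned "eigenvectors" are numerically parallel:
# the eigenspace is one-dimensional, so no eigenbasis exists.
assert abs(np.linalg.det(V)) < 1e-10
```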

04

The Spectral Theorem (Symmetric Matrices)

If $A \in \mathbb{R}^{n \times n}$ is symmetric ($A^\top = A$), then:

  1. All eigenvalues of $A$ are real.
  2. $A$ has an orthonormal basis of eigenvectors.
  3. $A$ can be written $A = Q \Lambda Q^\top$ with $Q$ orthogonal ($Q^\top Q = I$) and $\Lambda$ diagonal real.

This is the most-used theorem in numerical linear algebra. It says symmetric matrices are always diagonalisable, with an orthogonal change of basis — the cheapest, most numerically stable kind.
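All three claims are one `eigh` call away from verification; a quick sketch on a random symmetrised matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                      # symmetrise an arbitrary matrix

lam, Q = np.linalg.eigh(A)             # the symmetric-only routine

assert np.all(np.isreal(lam))                   # 1. eigenvalues are real
assert np.allclose(Q.T @ Q, np.eye(5))          # 2. orthonormal eigenvectors
assert np.allclose(A, Q @ np.diag(lam) @ Q.T)   # 3. A = Q Lam Q^T
```

Note `eigh`, not `eig`: the symmetric routine is faster, more accurate, and guarantees real output.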

What the theorem buys you

Guaranteed diagonalisability, real eigenvalues, and an orthonormal eigenbasis: for symmetric matrices you never have to worry about defectiveness or complex arithmetic.

Compare with the general formula $A = V\Lambda V^{-1}$: there $V$ must be inverted and may be ill-conditioned; here $Q^{-1} = Q^\top$ is free to apply and perfectly conditioned.

Where symmetry shows up in ML

Covariance matrices ($X^\top X$), kernel/Gram matrices, and Hessians of twice-differentiable losses (mixed partials commute) are all symmetric, so the spectral theorem covers virtually every matrix ML actually eigendecomposes.

05

Positive (Semi)-Definite Matrices

A symmetric matrix $A$ is positive semi-definite (PSD) if $\mathbf{x}^\top A \mathbf{x} \ge 0$ for all $\mathbf{x}$, and positive definite (PD) if $\mathbf{x}^\top A \mathbf{x} > 0$ for all $\mathbf{x} \neq \mathbf{0}$. Equivalently: all eigenvalues $\lambda_i \ge 0$ (PSD) or all $\lambda_i > 0$ (PD).

PSD matrices are exactly those that can be written $A = B^\top B$ for some $B$. (Take $B = \Lambda^{1/2} Q^\top$ from the spectral decomposition.) $A$ is PD iff $B$ has full column rank.
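The $B^\top B$ characterisation is quick to probe numerically; a sketch with an arbitrary tall random $B$:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((10, 4))
A = B.T @ B                           # Gram matrix: symmetric PSD by construction

lam = np.linalg.eigvalsh(A)
assert np.all(lam >= -1e-10)          # all eigenvalues non-negative (up to rounding)

# Here B has full column rank (a generic 10x4 matrix), so A is in fact PD:
assert lam.min() > 1e-6
```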

Why ML cares

Covariance matrices are PSD by construction; valid kernel matrices must be PSD; a PD Hessian certifies a strict local minimum; and PD is exactly the condition for Cholesky factorisation, the workhorse of Gaussian-process and least-squares solvers.

Numerical caution

Floating-point rounding can turn a "should be PD" matrix into a barely-not-PD one (a tiny negative eigenvalue). Production code defends with a small ridge, factorising $A + \varepsilon I$ instead of $A$ before Cholesky. The eigenvalues of $A + \varepsilon I$ are $\lambda_i + \varepsilon$: the ridge lifts the whole spectrum by $\varepsilon$, pushing the offending eigenvalue back above zero.
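A sketch of the failure mode and the fix, with a matrix deliberately built to have one tiny negative eigenvalue (the $\varepsilon$ value is an illustrative choice, not a universal recommendation):

```python
import numpy as np

# Construct a "should be PD" matrix with min eigenvalue -1e-10.
Q, _ = np.linalg.qr(np.random.default_rng(2).standard_normal((3, 3)))
A = Q @ np.diag([2.0, 1.0, -1e-10]) @ Q.T

try:
    np.linalg.cholesky(A)             # fails: A is (barely) not PD
    raise AssertionError("expected Cholesky to fail")
except np.linalg.LinAlgError:
    pass

eps = 1e-8
L = np.linalg.cholesky(A + eps * np.eye(3))     # the ridge rescues it
assert np.allclose(L @ L.T, A + eps * np.eye(3))
```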

06

Power Iteration: How Libraries Find the Dominant Eigenpair

The conceptually simplest eigenvalue algorithm. To find the eigenvector of largest $|\lambda|$:

  1. Pick a random vector $\mathbf{v}_0$.
  2. Repeat: $\mathbf{v}_{k+1} = A \mathbf{v}_k / \|A \mathbf{v}_k\|$.
  3. Eigenvalue estimate: $\lambda \approx \mathbf{v}_k^\top A \mathbf{v}_k$ (the Rayleigh quotient).

Why it works: write $\mathbf{v}_0 = \sum_i c_i \mathbf{v}_i$ in the eigenbasis. Then $A^k \mathbf{v}_0 = \sum_i c_i \lambda_i^k \mathbf{v}_i$. Dividing by the dominant $\lambda_1^k$ gives $\sum_i c_i (\lambda_i / \lambda_1)^k \mathbf{v}_i$ — all but the first term go to zero. Convergence rate $|\lambda_2/\lambda_1|$ — fast when there's a clear leader.
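The three steps above translate almost line-for-line into code; a minimal sketch (matrix, seed and step count are arbitrary choices):

```python
import numpy as np

def power_iteration(A, steps=200, seed=0):
    """Estimate the dominant eigenvalue/eigenvector of A."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])   # 1. random start
    for _ in range(steps):
        v = A @ v                         # 2. multiply ...
        v /= np.linalg.norm(v)            #    ... and normalise
    return v @ A @ v, v                   # 3. Rayleigh quotient, direction

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
lam_est, v_est = power_iteration(A)

lam_true = np.linalg.eigvalsh(A)[-1]      # largest eigenvalue
assert abs(lam_est - lam_true) < 1e-8
```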

Variants

Inverse iteration (apply $A^{-1}$, i.e. solve linear systems) finds the smallest eigenvalue; shift-and-invert targets eigenvalues near a chosen $\sigma$; deflation removes found eigenpairs so the next ones can be extracted; and Lanczos/Arnoldi methods build a whole Krylov subspace from the same mat-vec products — this is what scipy.sparse.linalg.eigsh (ARPACK) uses.

Power iteration in transformer-land

Spectral normalisation (Miyato et al. 2018) caps the spectral norm of each weight matrix during training by running one step of power iteration per minibatch and dividing the weights by the resulting estimate of $\sigma_1$. Two mat-vec products per step — about the cheapest possible spectral regulariser. Used in GANs, BigGAN, and many modern stability tricks.
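A sketch of the estimator, not the authors' code (in the real method the vector `u` persists across minibatches, so a single step per batch suffices; here we run many steps to show the estimate converges to the true spectral norm):

```python
import numpy as np

def spectral_norm_estimate(W, u, n_steps=1):
    """Power-iteration estimate of sigma_1(W); returns (estimate, updated u)."""
    for _ in range(n_steps):
        v = W.T @ u; v /= np.linalg.norm(v)
        u = W @ v;   u /= np.linalg.norm(u)
    return float(u @ W @ v), u

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
u = rng.standard_normal(6)

sigma, _ = spectral_norm_estimate(W, u, n_steps=200)
assert abs(sigma - np.linalg.svd(W, compute_uv=False)[0]) < 1e-8

W_sn = W / sigma                      # the normalised weight used in training
```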

07

PCA — the most-used eigenproblem in ML

Given centred data $X \in \mathbb{R}^{n \times d}$ (rows = samples, columns = features), the empirical covariance is

$$\Sigma = \tfrac{1}{n-1} X^\top X \in \mathbb{R}^{d \times d}.$$

The eigenvectors of $\Sigma$ are the principal components; the eigenvalues are the variances along each direction. Sorted by eigenvalue, the top $k$ components form the optimal $k$-dimensional subspace for capturing variance.

The optimisation interpretation

The first principal component is the unit vector $\mathbf{w}_1$ maximising $\mathrm{Var}(X \mathbf{w}_1) = \mathbf{w}_1^\top \Sigma \mathbf{w}_1$. By the Rayleigh quotient, this is exactly the top eigenvector of $\Sigma$. Subsequent components are constrained to be orthogonal to all previous ones — a sequence of constrained Rayleigh maximisations.
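PCA via the eigendecomposition of the covariance, end to end, on synthetic anisotropic data (the data-generating choices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3)) @ np.diag([3.0, 1.0, 0.1])   # anisotropic
X -= X.mean(axis=0)                   # centre

Sigma = X.T @ X / (len(X) - 1)        # empirical covariance
lam, W = np.linalg.eigh(Sigma)        # ascending order
lam, W = lam[::-1], W[:, ::-1]        # sort descending: principal components

# The variance along the first component equals its eigenvalue:
scores = X @ W[:, 0]
assert np.isclose(scores.var(ddof=1), lam[0])

# The top-2 subspace captures nearly all the variance of this data:
assert lam[:2].sum() / lam.sum() > 0.95
```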

What it's used for

Dimensionality reduction and visualisation (project onto the top 2–3 components), decorrelation and whitening (rotate into the eigenbasis, rescale by $\lambda_i^{-1/2}$), denoising and compression (drop low-variance directions), and preprocessing before clustering or linear models.

08

Eigenvalues & Gradient Flow

For a loss $\mathcal{L}(\theta)$, gradient descent near a critical point obeys (to leading order)

$$\theta_{t+1} - \theta^* \approx (I - \eta H)(\theta_t - \theta^*),$$

where $H$ is the Hessian at $\theta^*$. Diagonalise $H = Q\Lambda Q^\top$ and read off: each eigendirection of $H$ contracts by $1 - \eta\lambda_i$ per step.

Three regimes per eigendirection

  1. $0 < \eta\lambda < 1$: smooth contraction. $|1 - \eta\lambda| < 1$ and the error keeps its sign; the direction converges.
  2. $1 < \eta\lambda < 2$: oscillating contraction. The sign of the error flips each step but the magnitude still shrinks.
  3. $\eta\lambda > 2$: divergence. $|1 - \eta\lambda| > 1$, so the error grows. This is what "the learning rate is too high" means — the largest Hessian eigenvalue exceeds $2/\eta$.

The condition-number argument

Convergence speed is governed by the slowest-contracting direction, the one with the smallest $\lambda_i$. Stability caps the learning rate at $\eta < 2/\lambda_{\max}$, so the slow direction contracts by at best about $1 - 2\lambda_{\min}/\lambda_{\max} = 1 - 2/\kappa$ per step, where $\kappa = \lambda_{\max}/\lambda_{\min}$ is the condition number of $H$. (The optimal $\eta = 2/(\lambda_{\min} + \lambda_{\max})$ gives contraction factor $(\kappa - 1)/(\kappa + 1)$, the same thing for large $\kappa$.) Ill-conditioned losses (high $\kappa$) train slowly — pre-conditioning, normalisation and adaptive optimisers are designed to reduce $\kappa$.
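The stability threshold is easy to demonstrate on a quadratic toy loss (Hessian eigenvalues 1 and 100, so $\kappa = 100$ and the critical learning rate is $2/\lambda_{\max} = 0.02$; all numbers here are illustrative choices):

```python
import numpy as np

# Quadratic model loss L(theta) = 0.5 theta^T H theta, minimum at 0.
H = np.diag([1.0, 100.0])             # Hessian eigenvalues 1 and 100

def gd_errors(eta, steps=200):
    theta = np.array([1.0, 1.0])      # start with error (1, 1)
    for _ in range(steps):
        theta = theta - eta * (H @ theta)   # gradient of the quadratic is H @ theta
    return np.abs(theta)              # per-eigendirection error

# eta just below 2 / lambda_max = 0.02: both eigendirections contract.
assert np.all(gd_errors(eta=0.019) < 1.0)

# eta just above it: the stiff direction (lambda = 100) blows up.
assert gd_errors(eta=0.021)[1] > 1e3
```

Note how close to the edge the stable run sits: the stiff direction contracts by $|1 - 0.019 \cdot 100| = 0.9$ per step while the soft one barely moves, which is the ill-conditioning story in two dimensions.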

Why deep nets train at all

The Hessian of a deep network's loss has eigenvalues spanning many orders of magnitude. SGD with momentum and adaptive optimisers (Adam, AdamW) implicitly precondition by per-coordinate scaling, smoothing out the eigenvalue spectrum. LayerNorm and residual connections both help by keeping the effective Hessian better conditioned across layers (Santurkar et al. 2018). The eigenvalue picture is the right way to read these architectural decisions.

09

Interactive: 2D Eigenvector Animation

The widget plots a unit circle of vectors and animates their evolution under repeated multiplication by a 2×2 matrix $A$. Eigenvectors stay on lines through the origin; everything else gets pulled towards (or pushed by) the dominant eigendirection.

Widget readout at the default matrix $A = \begin{pmatrix} 1.50 & 0.40 \\ 0.40 & 0.80 \end{pmatrix}$: $\mathrm{tr}(A) = 2.30$, $\det(A) = 1.04$, eigenvalues $\lambda_1 \approx 1.68$, $\lambda_2 \approx 0.62$, step 0.

The yellow and green dashed lines are the eigendirections (real eigenvalues). Step a few times: the circle deforms into an ellipse aligned with the eigenvectors; the long axis scales as $\lambda_1^k$ and the short as $\lambda_2^k$. Try a rotation matrix ($a = d = \cos\theta$, $b = -c = -\sin\theta$): no real eigenvectors, the circle just rotates.
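The rotation case is worth confirming outside the widget: the eigenvalues exist, but they are the complex pair $e^{\pm i\theta}$, with unit modulus and no real eigendirection.

```python
import numpy as np

theta = np.pi / 3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

lam = np.linalg.eigvals(R)

# The eigenvalues are e^{+i theta} and e^{-i theta} ...
expected = np.array([np.exp(1j * theta), np.exp(-1j * theta)])
assert np.allclose(np.sort_complex(lam), np.sort_complex(expected))

# ... with |lambda| = 1: vectors neither grow nor shrink, they rotate.
assert np.allclose(np.abs(lam), 1.0)
```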

10

Cheat Sheet

  1. Definition: $A\mathbf{v} = \lambda\mathbf{v}$ with $\mathbf{v} \neq \mathbf{0}$; eigenvalues are the roots of $\det(A - \lambda I) = 0$.
  2. Invariants: $\sum_i \lambda_i = \mathrm{tr}(A)$, $\prod_i \lambda_i = \det(A)$.
  3. Diagonalisation: $A = V\Lambda V^{-1}$ when $n$ independent eigenvectors exist; then $A^k = V\Lambda^k V^{-1}$.
  4. Spectral theorem: symmetric $A = Q\Lambda Q^\top$ with real $\lambda_i$ and orthonormal $Q$.
  5. PSD: all $\lambda_i \ge 0$, equivalently $A = B^\top B$; PD: all $\lambda_i > 0$.
  6. Power iteration: $\mathbf{v} \leftarrow A\mathbf{v}/\|A\mathbf{v}\|$ converges at rate $|\lambda_2/\lambda_1|$.
  7. PCA: principal components are the top eigenvectors of $\Sigma = \tfrac{1}{n-1}X^\top X$.
  8. Gradient descent: stable iff $\eta < 2/\lambda_{\max}(H)$; speed limited by $\kappa = \lambda_{\max}/\lambda_{\min}$.

Read next

Deck 07 — SVD, Low-Rank & LoRA generalises eigendecomposition to non-square matrices. The SVD is what made Eckart-Young, principal-component regression, LoRA, and DeepSeek's MLA possible — the right way to find the "best $r$-dimensional summary" of a linear map.