Linear Algebra for AI — Presentation 06

Eigenvalues & Eigenvectors

A matrix's eigenvectors are the directions it only scales, never rotates. Find them and the matrix becomes diagonal — and a thousand questions about gradients, stability, PCA and Hessians become trivial.

00

Topics We'll Cover

  1. Definition & Geometric Picture
  2. Computing Eigenvalues: the Characteristic Polynomial
  3. Diagonalisation
  4. The Spectral Theorem (Symmetric Matrices)
  5. Positive (Semi)-Definite Matrices
  6. Power Iteration
  7. PCA
  8. Eigenvalues & Gradient Flow
  9. Interactive: 2D Eigenvector Animation
  10. Cheat Sheet

01

Definition & Geometric Picture

An eigenvector of a square matrix $A$ is a non-zero $\mathbf{v}$ such that $A$ acting on $\mathbf{v}$ produces a scaled copy of $\mathbf{v}$:

$$A\mathbf{v} = \lambda \mathbf{v}.$$

The scalar $\lambda$ is the eigenvalue. Geometrically: $\mathbf{v}$ is a direction that $A$ leaves invariant up to scaling. It might be stretched ($|\lambda| > 1$), shrunk ($|\lambda| < 1$), flipped ($\lambda < 0$), or held fixed ($\lambda = 1$) — but its direction is preserved.

What this excludes and includes

Excluded: $\mathbf{v} = \mathbf{0}$, which satisfies the equation for every $\lambda$ and so carries no information. Included: $\lambda = 0$ is a perfectly good eigenvalue (its eigenvectors form the null space of $A$), and $\lambda$ may be negative or, for non-symmetric $A$, complex.

Why we care

If you write $\mathbf{x} = c_1\mathbf{v}_1 + \cdots + c_n\mathbf{v}_n$ in an eigenbasis, then $A\mathbf{x} = c_1 \lambda_1 \mathbf{v}_1 + \cdots + c_n \lambda_n \mathbf{v}_n$ — each component scales independently. Repeated application gives $A^k\mathbf{x} = \sum_i c_i \lambda_i^k \mathbf{v}_i$. The eigenvalue with the largest $|\lambda|$ dominates. This is why eigen-analysis tells you about stability, dynamics, fixed points, and long-time behaviour.
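The per-component scaling is easy to verify numerically. A minimal NumPy sketch (the matrix and vector are arbitrary choices, not from the deck):

```python
import numpy as np

# A symmetric matrix, so eigh gives an orthonormal eigenbasis.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, V = np.linalg.eigh(A)           # eigenvalues ascending: [1., 3.]

x = np.array([0.7, -0.3])
c = V.T @ x                          # coordinates of x in the eigenbasis

k = 5
direct = np.linalg.matrix_power(A, k) @ x   # A^k x computed directly ...
via_eigen = V @ (c * lam**k)                # ... equals sum_i c_i lam_i^k v_i
assert np.allclose(direct, via_eigen)
```

For large $k$ the $\lambda = 3$ term dwarfs the $\lambda = 1$ term, which is the dominance argument in miniature.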

02

Computing Eigenvalues: the Characteristic Polynomial

$A\mathbf{v} = \lambda \mathbf{v}$ rearranges to $(A - \lambda I)\mathbf{v} = \mathbf{0}$. A non-zero $\mathbf{v}$ exists iff $A - \lambda I$ is singular — iff its determinant is zero:

$$\det(A - \lambda I) = 0.$$

This is the characteristic polynomial of $A$, of degree $n$. Its $n$ roots (over $\mathbb{C}$) are the eigenvalues. For each $\lambda_i$, the null space $\mathrm{Null}(A - \lambda_i I)$ is the corresponding eigenspace.

2×2 example

For $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$,

$$\det(A - \lambda I) = (a - \lambda)(d - \lambda) - bc = \lambda^2 - (a+d)\lambda + (ad - bc).$$

Coefficients: $a + d = \mathrm{tr}(A)$ and $ad - bc = \det(A)$. So

$$\lambda^2 - \mathrm{tr}(A)\lambda + \det(A) = 0.$$
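The trace/determinant quadratic can be checked against the library in a few lines (the example matrix is an arbitrary choice):

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])
tr, det = np.trace(A), np.linalg.det(A)

# Roots of lambda^2 - tr*lambda + det = 0 via the quadratic formula.
disc = np.sqrt(tr**2 - 4 * det)
roots = np.array([(tr + disc) / 2, (tr - disc) / 2])

assert np.allclose(sorted(roots), sorted(np.linalg.eigvals(A).real))
```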

Two facts that hold in any dimension

The sum of the eigenvalues equals $\mathrm{tr}(A)$, and their product equals $\det(A)$ (counting roots with multiplicity, over $\mathbb{C}$). In the 2×2 case both can be read straight off the quadratic above.

Don't compute eigenvalues by characteristic polynomial

For numerical work, forming the characteristic polynomial and finding its roots is badly conditioned — tiny perturbations of the coefficients can move the roots a long way — and for degree $\ge 5$ there is no closed-form solution anyway (Abel–Ruffini), so iterative methods are unavoidable. Real numerical libraries use the QR algorithm with shifts: a sequence of QR decompositions of shifted matrices that converges to upper-triangular form, where the eigenvalues sit on the diagonal. numpy.linalg.eig wraps LAPACK's geev.
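A toy version of the idea, stripped of the shifts and Hessenberg reduction that real LAPACK code uses (the example matrix is an arbitrary symmetric choice with well-separated eigenvalues):

```python
import numpy as np

# Unshifted QR iteration: A_{k+1} = R_k Q_k, where A_k = Q_k R_k.
# Each step is a similarity transform (R Q = Q^T A_k Q), so the
# spectrum is preserved while A_k drifts towards triangular form.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
Ak = A.copy()
for _ in range(200):
    Q, R = np.linalg.qr(Ak)
    Ak = R @ Q

# The diagonal has converged to the eigenvalues (3 and 3 ± sqrt(3)).
assert np.allclose(sorted(np.diag(Ak)), sorted(np.linalg.eigvalsh(A)), atol=1e-8)
```

Production code adds shifts precisely because the unshifted convergence rate depends on eigenvalue ratios and can be painfully slow.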

03

Diagonalisation

If $A$ has $n$ linearly independent eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$ with eigenvalues $\lambda_1, \ldots, \lambda_n$, stack the eigenvectors as columns of $V$ and the eigenvalues as the diagonal of $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$. Then

$$A = V \Lambda V^{-1}.$$

Reading from the right: $V^{-1}\mathbf{x}$ rewrites $\mathbf{x}$ in the eigenbasis; $\Lambda$ scales each coordinate by its eigenvalue; $V$ rewrites back into the standard basis. Same vector, different lens, the action is just per-axis scaling.
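The factorisation, and the cheap matrix powers it buys, can be sanity-checked directly (the matrix is an arbitrary diagonalisable example):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])             # distinct eigenvalues => diagonalisable
lam, V = np.linalg.eig(A)
Lam = np.diag(lam)

# A = V Lam V^{-1}
assert np.allclose(A, V @ Lam @ np.linalg.inv(V))

# Powers are trivial in the eigenbasis: A^10 = V Lam^10 V^{-1}.
assert np.allclose(np.linalg.matrix_power(A, 10),
                   V @ np.diag(lam**10) @ np.linalg.inv(V))
```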

What diagonalisation gives you

Matrix powers become trivial, $A^k = V\Lambda^k V^{-1}$, so iterating a linear map costs one decomposition instead of $k$ multiplications. The same trick defines matrix functions such as $e^A = V e^\Lambda V^{-1}$, solves linear ODEs $\dot{\mathbf{x}} = A\mathbf{x}$, and makes long-run behaviour readable directly from the $\lambda_i$.

When $A$ can't be diagonalised

Some matrices have fewer linearly independent eigenvectors than dimensions — the canonical example being $\begin{pmatrix}0&1\\0&0\end{pmatrix}$, which has eigenvalue 0 (twice) but only one eigenvector $\mathbf{e}_1$. Such "defective" matrices admit only a Jordan-form decomposition. Symmetric matrices are never defective (next slide), which is the saving grace for most ML applications: covariance matrices, Hessians and Gram matrices are all symmetric.
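You can watch the defectiveness numerically: `eig` still returns two eigenvector columns for the canonical example, but they span only a line.

```python
import numpy as np

J = np.array([[0.0, 1.0],
              [0.0, 0.0]])
lam, V = np.linalg.eig(J)

assert np.allclose(lam, 0.0)            # eigenvalue 0 with multiplicity 2 ...
# ... but the two returned "eigenvectors" are numerically parallel:
# the eigenspace is one-dimensional, so no eigenbasis exists.
assert abs(np.linalg.det(V)) < 1e-10
```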

04

The Spectral Theorem (Symmetric Matrices)

If $A \in \mathbb{R}^{n \times n}$ is symmetric ($A^\top = A$), then:

  1. All eigenvalues of $A$ are real.
  2. $A$ has an orthonormal basis of eigenvectors.
  3. $A$ can be written $A = Q \Lambda Q^\top$ with $Q$ orthogonal ($Q^\top Q = I$) and $\Lambda$ diagonal real.

This is the most-used theorem in numerical linear algebra. It says symmetric matrices are always diagonalisable, with an orthogonal change of basis — the cheapest, most numerically stable kind.
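All three claims are one `eigh` call away from verification; a quick sketch on a random symmetrised matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                      # symmetrise an arbitrary matrix

lam, Q = np.linalg.eigh(A)             # the symmetric-only routine

assert np.all(np.isreal(lam))                   # 1. eigenvalues are real
assert np.allclose(Q.T @ Q, np.eye(5))          # 2. orthonormal eigenvectors
assert np.allclose(A, Q @ np.diag(lam) @ Q.T)   # 3. A = Q Lam Q^T
```

Note `eigh`, not `eig`: the symmetric routine is faster, more accurate, and guarantees real output.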

What the theorem buys you

Guaranteed diagonalisability, real eigenvalues, and an orthonormal eigenbasis: for symmetric matrices you never have to worry about defectiveness or complex arithmetic.

Compare with the general formula $A = V\Lambda V^{-1}$: there $V$ must be inverted and may be ill-conditioned; here $Q^{-1} = Q^\top$ is free to apply and perfectly conditioned.

Where symmetry shows up in ML

Covariance matrices ($X^\top X$), kernel/Gram matrices, and Hessians of twice-differentiable losses (mixed partials commute) are all symmetric, so the spectral theorem covers virtually every matrix ML actually eigendecomposes.

05

Positive (Semi)-Definite Matrices

A symmetric matrix $A$ is positive semi-definite (PSD) if $\mathbf{x}^\top A \mathbf{x} \ge 0$ for all $\mathbf{x}$, and positive definite (PD) if $\mathbf{x}^\top A \mathbf{x} > 0$ for all $\mathbf{x} \neq \mathbf{0}$. Equivalently: all eigenvalues $\lambda_i \ge 0$ (PSD) or all $\lambda_i > 0$ (PD).

PSD matrices are exactly those that can be written $A = B^\top B$ for some $B$. (Take $B = \Lambda^{1/2} Q^\top$ from the spectral decomposition.) $A$ is PD iff $B$ has full column rank.
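The $B^\top B$ characterisation is quick to probe numerically; a sketch with an arbitrary tall random $B$:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((10, 4))
A = B.T @ B                           # Gram matrix: symmetric PSD by construction

lam = np.linalg.eigvalsh(A)
assert np.all(lam >= -1e-10)          # all eigenvalues non-negative (up to rounding)

# Here B has full column rank (a generic 10x4 matrix), so A is in fact PD:
assert lam.min() > 1e-6
```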

Why ML cares

Covariance matrices are PSD by construction; valid kernel matrices must be PSD; a PD Hessian certifies a strict local minimum; and PD is exactly the condition for Cholesky factorisation, the workhorse of Gaussian-process and least-squares solvers.

Numerical caution

Floating-point rounding can turn a "should be PD" matrix into a barely-not-PD one (a tiny negative eigenvalue). Production code defends with a small ridge, factorising $A + \varepsilon I$ instead of $A$ before Cholesky. The eigenvalues of $A + \varepsilon I$ are $\lambda_i + \varepsilon$: the ridge lifts the whole spectrum by $\varepsilon$, pushing the offending eigenvalue back above zero.
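A sketch of the failure mode and the fix, with a matrix deliberately built to have one tiny negative eigenvalue (the $\varepsilon$ value is an illustrative choice, not a universal recommendation):

```python
import numpy as np

# Construct a "should be PD" matrix with min eigenvalue -1e-10.
Q, _ = np.linalg.qr(np.random.default_rng(2).standard_normal((3, 3)))
A = Q @ np.diag([2.0, 1.0, -1e-10]) @ Q.T

try:
    np.linalg.cholesky(A)             # fails: A is (barely) not PD
    raise AssertionError("expected Cholesky to fail")
except np.linalg.LinAlgError:
    pass

eps = 1e-8
L = np.linalg.cholesky(A + eps * np.eye(3))     # the ridge rescues it
assert np.allclose(L @ L.T, A + eps * np.eye(3))
```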

06

Power Iteration: How Libraries Find the Dominant Eigenpair

The conceptually simplest eigenvalue algorithm. To find the eigenvector of largest $|\lambda|$:

  1. Pick a random vector $\mathbf{v}_0$.
  2. Repeat: $\mathbf{v}_{k+1} = A \mathbf{v}_k / \|A \mathbf{v}_k\|$.
  3. Eigenvalue estimate: $\lambda \approx \mathbf{v}_k^\top A \mathbf{v}_k$ (the Rayleigh quotient).

Why it works: write $\mathbf{v}_0 = \sum_i c_i \mathbf{v}_i$ in the eigenbasis. Then $A^k \mathbf{v}_0 = \sum_i c_i \lambda_i^k \mathbf{v}_i$. Dividing by the dominant $\lambda_1^k$ gives $\sum_i c_i (\lambda_i / \lambda_1)^k \mathbf{v}_i$ — all but the first term go to zero. Convergence rate $|\lambda_2/\lambda_1|$ — fast when there's a clear leader.
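The three steps above translate almost line-for-line into code; a minimal sketch (matrix, seed and step count are arbitrary choices):

```python
import numpy as np

def power_iteration(A, steps=200, seed=0):
    """Estimate the dominant eigenvalue/eigenvector of A."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])   # 1. random start
    for _ in range(steps):
        v = A @ v                         # 2. multiply ...
        v /= np.linalg.norm(v)            #    ... and normalise
    return v @ A @ v, v                   # 3. Rayleigh quotient, direction

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
lam_est, v_est = power_iteration(A)

lam_true = np.linalg.eigvalsh(A)[-1]      # largest eigenvalue
assert abs(lam_est - lam_true) < 1e-8
```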

Variants

Inverse iteration (apply $A^{-1}$, i.e. solve linear systems) finds the smallest eigenvalue; shift-and-invert targets eigenvalues near a chosen $\sigma$; deflation removes found eigenpairs so the next ones can be extracted; and Lanczos/Arnoldi methods build a whole Krylov subspace from the same mat-vec products — this is what scipy.sparse.linalg.eigsh (ARPACK) uses.

Power iteration in transformer-land

Spectral normalisation (Miyato et al. 2018) caps the spectral norm of each weight matrix during training by running one step of power iteration per minibatch and dividing the weights by the resulting estimate of $\sigma_1$. Two mat-vec products per step — about the cheapest possible spectral regulariser. Used in GANs, BigGAN, and many modern stability tricks.
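A sketch of the estimator, not the authors' code (in the real method the vector `u` persists across minibatches, so a single step per batch suffices; here we run many steps to show the estimate converges to the true spectral norm):

```python
import numpy as np

def spectral_norm_estimate(W, u, n_steps=1):
    """Power-iteration estimate of sigma_1(W); returns (estimate, updated u)."""
    for _ in range(n_steps):
        v = W.T @ u; v /= np.linalg.norm(v)
        u = W @ v;   u /= np.linalg.norm(u)
    return float(u @ W @ v), u

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
u = rng.standard_normal(6)

sigma, _ = spectral_norm_estimate(W, u, n_steps=200)
assert abs(sigma - np.linalg.svd(W, compute_uv=False)[0]) < 1e-8

W_sn = W / sigma                      # the normalised weight used in training
```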

07

PCA — the most-used eigenproblem in ML

Given centred data $X \in \mathbb{R}^{n \times d}$ (rows = samples, columns = features), the empirical covariance is

$$\Sigma = \tfrac{1}{n-1} X^\top X \in \mathbb{R}^{d \times d}.$$

The eigenvectors of $\Sigma$ are the principal components; the eigenvalues are the variances along each direction. Sorted by eigenvalue, the top $k$ components form the optimal $k$-dimensional subspace for capturing variance.

The optimisation interpretation

The first principal component is the unit vector $\mathbf{w}_1$ maximising $\mathrm{Var}(X \mathbf{w}_1) = \mathbf{w}_1^\top \Sigma \mathbf{w}_1$. By the Rayleigh quotient, this is exactly the top eigenvector of $\Sigma$. Subsequent components are constrained to be orthogonal to all previous ones — a sequence of constrained Rayleigh maximisations.
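PCA via the eigendecomposition of the covariance, end to end, on synthetic anisotropic data (the data-generating choices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3)) @ np.diag([3.0, 1.0, 0.1])   # anisotropic
X -= X.mean(axis=0)                   # centre

Sigma = X.T @ X / (len(X) - 1)        # empirical covariance
lam, W = np.linalg.eigh(Sigma)        # ascending order
lam, W = lam[::-1], W[:, ::-1]        # sort descending: principal components

# The variance along the first component equals its eigenvalue:
scores = X @ W[:, 0]
assert np.isclose(scores.var(ddof=1), lam[0])

# The top-2 subspace captures nearly all the variance of this data:
assert lam[:2].sum() / lam.sum() > 0.95
```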

What it's used for

Dimensionality reduction and visualisation (project onto the top 2–3 components), decorrelation and whitening (rotate into the eigenbasis, rescale by $\lambda_i^{-1/2}$), denoising and compression (drop low-variance directions), and preprocessing before clustering or linear models.

08

Eigenvalues & Gradient Flow

For a loss $\mathcal{L}(\theta)$, gradient descent near a critical point obeys (to leading order)

$$\theta_{t+1} - \theta^* \approx (I - \eta H)(\theta_t - \theta^*),$$

where $H$ is the Hessian at $\theta^*$. Diagonalise $H = Q\Lambda Q^\top$ and read off: each eigendirection of $H$ contracts by $1 - \eta\lambda_i$ per step.

Three regimes per eigendirection

  1. $0 < \eta\lambda < 1$: smooth contraction. $|1 - \eta\lambda| < 1$ and the error keeps its sign; the direction converges.
  2. $1 < \eta\lambda < 2$: oscillating contraction. The sign of the error flips each step but the magnitude still shrinks.
  3. $\eta\lambda > 2$: divergence. $|1 - \eta\lambda| > 1$, so the error grows. This is what "the learning rate is too high" means — the largest Hessian eigenvalue exceeds $2/\eta$.

The condition-number argument

Convergence speed is governed by the slowest-contracting direction, the one with the smallest $\lambda_i$. Stability caps the learning rate at $\eta < 2/\lambda_{\max}$, so the slow direction contracts by at best about $1 - 2\lambda_{\min}/\lambda_{\max} = 1 - 2/\kappa$ per step, where $\kappa = \lambda_{\max}/\lambda_{\min}$ is the condition number of $H$. (The optimal $\eta = 2/(\lambda_{\min} + \lambda_{\max})$ gives contraction factor $(\kappa - 1)/(\kappa + 1)$, the same thing for large $\kappa$.) Ill-conditioned losses (high $\kappa$) train slowly — pre-conditioning, normalisation and adaptive optimisers are designed to reduce $\kappa$.
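The stability threshold is easy to demonstrate on a quadratic toy loss (Hessian eigenvalues 1 and 100, so $\kappa = 100$ and the critical learning rate is $2/\lambda_{\max} = 0.02$; all numbers here are illustrative choices):

```python
import numpy as np

# Quadratic model loss L(theta) = 0.5 theta^T H theta, minimum at 0.
H = np.diag([1.0, 100.0])             # Hessian eigenvalues 1 and 100

def gd_errors(eta, steps=200):
    theta = np.array([1.0, 1.0])      # start with error (1, 1)
    for _ in range(steps):
        theta = theta - eta * (H @ theta)   # gradient of the quadratic is H @ theta
    return np.abs(theta)              # per-eigendirection error

# eta just below 2 / lambda_max = 0.02: both eigendirections contract.
assert np.all(gd_errors(eta=0.019) < 1.0)

# eta just above it: the stiff direction (lambda = 100) blows up.
assert gd_errors(eta=0.021)[1] > 1e3
```

Note how close to the edge the stable run sits: the stiff direction contracts by $|1 - 0.019 \cdot 100| = 0.9$ per step while the soft one barely moves, which is the ill-conditioning story in two dimensions.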

Why deep nets train at all

The Hessian of a deep network's loss has eigenvalues spanning many orders of magnitude. SGD with momentum and adaptive optimisers (Adam, AdamW) implicitly precondition by per-coordinate scaling, smoothing out the eigenvalue spectrum. LayerNorm and residual connections both help by keeping the effective Hessian better conditioned across layers (Santurkar et al. 2018). The eigenvalue picture is the right way to read these architectural decisions.

09

Interactive: 2D Eigenvector Animation

The widget plots a unit circle of vectors and animates their evolution under repeated multiplication by a 2×2 matrix $A$. Eigenvectors stay on lines through the origin; everything else gets pulled towards (or pushed by) the dominant eigendirection.

Widget readout at the default matrix $A = \begin{pmatrix} 1.50 & 0.40 \\ 0.40 & 0.80 \end{pmatrix}$: $\mathrm{tr}(A) = 2.30$, $\det(A) = 1.04$, eigenvalues $\lambda_1 \approx 1.68$, $\lambda_2 \approx 0.62$, step 0.

The yellow and green dashed lines are the eigendirections (real eigenvalues). Step a few times: the circle deforms into an ellipse aligned with the eigenvectors; the long axis scales as $\lambda_1^k$ and the short as $\lambda_2^k$. Try a rotation matrix ($a = d = \cos\theta$, $b = -c = -\sin\theta$): no real eigenvectors, the circle just rotates.
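The rotation case is worth confirming outside the widget: the eigenvalues exist, but they are the complex pair $e^{\pm i\theta}$, with unit modulus and no real eigendirection.

```python
import numpy as np

theta = np.pi / 3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

lam = np.linalg.eigvals(R)

# The eigenvalues are e^{+i theta} and e^{-i theta} ...
expected = np.array([np.exp(1j * theta), np.exp(-1j * theta)])
assert np.allclose(np.sort_complex(lam), np.sort_complex(expected))

# ... with |lambda| = 1: vectors neither grow nor shrink, they rotate.
assert np.allclose(np.abs(lam), 1.0)
```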

10

Cheat Sheet

  1. Definition: $A\mathbf{v} = \lambda\mathbf{v}$ with $\mathbf{v} \neq \mathbf{0}$; eigenvalues are the roots of $\det(A - \lambda I) = 0$.
  2. Invariants: $\sum_i \lambda_i = \mathrm{tr}(A)$, $\prod_i \lambda_i = \det(A)$.
  3. Diagonalisation: $A = V\Lambda V^{-1}$ when $n$ independent eigenvectors exist; then $A^k = V\Lambda^k V^{-1}$.
  4. Spectral theorem: symmetric $A = Q\Lambda Q^\top$ with real $\lambda_i$ and orthonormal $Q$.
  5. PSD: all $\lambda_i \ge 0$, equivalently $A = B^\top B$; PD: all $\lambda_i > 0$.
  6. Power iteration: $\mathbf{v} \leftarrow A\mathbf{v}/\|A\mathbf{v}\|$ converges at rate $|\lambda_2/\lambda_1|$.
  7. PCA: principal components are the top eigenvectors of $\Sigma = \tfrac{1}{n-1}X^\top X$.
  8. Gradient descent: stable iff $\eta < 2/\lambda_{\max}(H)$; speed limited by $\kappa = \lambda_{\max}/\lambda_{\min}$.

Read next

Deck 07 — SVD, Low-Rank & LoRA generalises eigendecomposition to non-square matrices. The SVD is what made Eckart-Young, principal-component regression, LoRA, and DeepSeek's MLA possible — the right way to find the "best $r$-dimensional summary" of a linear map.