A matrix's eigenvectors are the directions it doesn't rotate. Find them and the matrix becomes diagonal, and a thousand questions about gradients, stability, PCA and Hessians become trivial.
An eigenvector of a square matrix $A$ is a non-zero $\mathbf{v}$ such that $A$ acting on $\mathbf{v}$ produces a scaled copy of $\mathbf{v}$:
$$A\mathbf{v} = \lambda \mathbf{v}.$$
The scalar $\lambda$ is the eigenvalue. Geometrically: $\mathbf{v}$ is a direction that $A$ leaves invariant up to scaling. It might be stretched ($|\lambda| > 1$), shrunk ($|\lambda| < 1$), flipped ($\lambda < 0$), or held fixed ($\lambda = 1$) — but its direction is preserved.
If you write $\mathbf{x} = c_1\mathbf{v}_1 + \cdots + c_n\mathbf{v}_n$ in an eigenbasis, then $A\mathbf{x} = c_1 \lambda_1 \mathbf{v}_1 + \cdots + c_n \lambda_n \mathbf{v}_n$ — each component scales independently. Repeated application gives $A^k\mathbf{x} = \sum_i c_i \lambda_i^k \mathbf{v}_i$. The eigenvalue with the largest $|\lambda|$ dominates. This is why eigen-analysis tells you about stability, dynamics, fixed points, and long-time behaviour.
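A quick numerical check of the per-component scaling (the 2×2 matrix below is my own example, chosen so the eigenvalues are 3 and 0.5):

```python
import numpy as np

# Expand x in the eigenbasis and check A^k x = sum_i c_i * lambda_i^k * v_i.
A = np.array([[1.75, 1.25],
              [1.25, 1.75]])            # eigenvalues 3.0 and 0.5
lam, V = np.linalg.eig(A)               # columns of V are eigenvectors

x = np.array([2.0, -1.0])
c = np.linalg.solve(V, x)               # coordinates of x in the eigenbasis

k = 5
lhs = np.linalg.matrix_power(A, k) @ x
rhs = V @ (c * lam**k)                  # each component scaled by lambda_i^k
print(np.allclose(lhs, rhs))            # True; the lambda = 3 term dominates
```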
$A\mathbf{v} = \lambda \mathbf{v}$ rearranges to $(A - \lambda I)\mathbf{v} = \mathbf{0}$. A non-zero $\mathbf{v}$ exists iff $A - \lambda I$ is singular — iff its determinant is zero:
$$\det(A - \lambda I) = 0.$$
This is the characteristic polynomial of $A$, of degree $n$. Its $n$ roots (over $\mathbb{C}$) are the eigenvalues. For each $\lambda_i$, the null space $\mathrm{Null}(A - \lambda_i I)$ is the corresponding eigenspace.
For $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$,
$$\det(A - \lambda I) = (a - \lambda)(d - \lambda) - bc = \lambda^2 - (a+d)\lambda + (ad - bc).$$
Coefficients: $a + d = \mathrm{tr}(A)$ and $ad - bc = \det(A)$. So
$$\lambda^2 - \mathrm{tr}(A)\lambda + \det(A) = 0.$$
For numerical work, finding roots of polynomials of degree $> 4$ is unstable and (by Abel-Ruffini) requires iterative methods anyway. Real numerical libraries use the QR algorithm with shifts: a sequence of QR decompositions of shifted matrices that converges to upper-triangular form, where eigenvalues sit on the diagonal. numpy.linalg.eig wraps LAPACK's geev.
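As a sanity check (arbitrary 2×2 values, for illustration only), the roots of $\lambda^2 - \mathrm{tr}(A)\lambda + \det(A)$ agree with what the QR-based routine returns:

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])
tr, det = np.trace(A), np.linalg.det(A)

roots = np.roots([1.0, -tr, det])       # roots of the characteristic polynomial
eigs = np.linalg.eigvals(A)             # QR algorithm via LAPACK
print(np.sort(roots), np.sort(eigs))    # both give 2.0 and 5.0
```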
If $A$ has $n$ linearly independent eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$ with eigenvalues $\lambda_1, \ldots, \lambda_n$, stack the eigenvectors as columns of $V$ and the eigenvalues as the diagonal of $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$. Then
$$A = V \Lambda V^{-1}.$$
Reading from the right: $V^{-1}\mathbf{x}$ rewrites $\mathbf{x}$ in the eigenbasis; $\Lambda$ scales each coordinate by its eigenvalue; $V$ rewrites back into the standard basis. Same vector, different lens: the action is just per-axis scaling.
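The three-step reading, spelled out on a small example of my own (any diagonalisable matrix works):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])
lam, V = np.linalg.eig(A)

x = np.array([1.0, 1.0])
coords = np.linalg.solve(V, x)      # 1. rewrite x in the eigenbasis (V^{-1} x)
scaled = lam * coords               # 2. scale each coordinate by its eigenvalue
y = V @ scaled                      # 3. rewrite back into the standard basis
print(np.allclose(y, A @ x))        # True
```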
Some matrices have fewer linearly independent eigenvectors than dimensions — the canonical example being $\begin{pmatrix}0&1\\0&0\end{pmatrix}$, which has eigenvalue 0 (twice) but only one eigenvector $\mathbf{e}_1$. Such "defective" matrices admit only a Jordan-form decomposition. Symmetric matrices are never defective (next slide), which is the saving grace for most ML applications: covariance matrices, Hessians and Gram matrices are all symmetric.
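You can see the defect numerically (the rank check depends on floating-point tolerances, but on a typical LAPACK build it reports 1):

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
lam, V = np.linalg.eig(A)
print(lam)                          # [0. 0.]: eigenvalue 0 with multiplicity 2
print(np.linalg.matrix_rank(V))     # 1: both columns are (numerically) parallel to e1
```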
If $A \in \mathbb{R}^{n \times n}$ is symmetric ($A^\top = A$), then:
All eigenvalues of $A$ are real.
Eigenvectors belonging to distinct eigenvalues are orthogonal, and an orthonormal eigenbasis always exists.
Equivalently, $A = Q \Lambda Q^\top$ with $Q$ orthogonal ($Q^\top Q = I$) and $\Lambda$ real diagonal.
This is the most-used theorem in numerical linear algebra. It says symmetric matrices are always diagonalisable, with an orthogonal change of basis — the cheapest, most numerically stable kind.
Compare with the general formula $A = V \Lambda V^{-1}$: for a symmetric matrix the eigenvector matrix is orthogonal, so the inverse is just a transpose ($Q^{-1} = Q^\top$), and the change of basis is an isometry that doesn't amplify rounding error.
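A minimal check with numpy.linalg.eigh (random symmetric matrix, seed chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B + B.T                                     # symmetric by construction

lam, Q = np.linalg.eigh(A)                      # real eigenvalues, orthonormal columns
print(np.allclose(Q.T @ Q, np.eye(4)))          # True: Q is orthogonal
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))   # True: A = Q Lambda Q^T
```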
A symmetric matrix $A$ is positive semi-definite (PSD) if $\mathbf{x}^\top A \mathbf{x} \ge 0$ for every $\mathbf{x}$, equivalently if all its eigenvalues satisfy $\lambda_i \ge 0$; it is positive definite (PD) if the inequalities are strict.
PSD matrices are exactly those that can be written $A = B^\top B$ for some $B$. (Take $B = \Lambda^{1/2} Q^\top$ from the spectral decomposition.) PD if $B$ has full column rank.
Floating-point rounding can turn a "should be PD" matrix into a barely-not-PD one (a tiny negative eigenvalue). Production code defends with a small ridge $A + \varepsilon I$ before Cholesky. The eigenvalues of $A + \varepsilon I$ are $\lambda_i + \varepsilon$ — lift them all by $\varepsilon$.
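A sketch of that defence, assuming the standard numpy Cholesky; the function name and the retry schedule are my own choices, not a library API:

```python
import numpy as np

def safe_cholesky(A, eps=1e-8, max_tries=5):
    """Retry Cholesky with a growing ridge if A is barely not PD."""
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(A)
        except np.linalg.LinAlgError:
            A = A + eps * np.eye(A.shape[0])   # lifts every eigenvalue by eps
            eps *= 10
    raise np.linalg.LinAlgError("matrix not PD even after adding jitter")
```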
The conceptually simplest eigenvalue algorithm. To find the eigenvector of largest $|\lambda|$: start from a random $\mathbf{v}_0$, repeatedly apply $\mathbf{v}_{k+1} = A\mathbf{v}_k / \|A\mathbf{v}_k\|$, and read off the eigenvalue estimate from the Rayleigh quotient $\mathbf{v}_k^\top A \mathbf{v}_k$.
Why it works: write $\mathbf{v}_0 = \sum_i c_i \mathbf{v}_i$ in the eigenbasis. Then $A^k \mathbf{v}_0 = \sum_i c_i \lambda_i^k \mathbf{v}_i$. Dividing by the dominant $\lambda_1^k$ gives $\sum_i c_i (\lambda_i / \lambda_1)^k \mathbf{v}_i$ — all but the first term go to zero. Convergence rate $|\lambda_2/\lambda_1|$ — fast when there's a clear leader.
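A minimal sketch of the iteration (the tolerance, iteration cap and test matrix are arbitrary choices):

```python
import numpy as np

def power_iteration(A, num_iters=1000, tol=1e-10):
    v = np.random.default_rng(0).normal(size=A.shape[0])
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(num_iters):
        w = A @ v
        v_new = w / np.linalg.norm(w)
        lam_new = v_new @ A @ v_new        # Rayleigh-quotient estimate of lambda_1
        if abs(lam_new - lam) < tol:
            break
        v, lam = v_new, lam_new
    return lam_new, v_new

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
lam1, v1 = power_iteration(A)
print(lam1)                                # ~3.618, the dominant eigenvalue
```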
For large sparse symmetric matrices, Lanczos-type methods compute a few extreme eigenpairs from matrix-vector products alone; scipy.sparse.linalg.eigsh is the standard interface. Spectral normalisation (Miyato et al. 2018) caps the spectral norm of each weight matrix during training by running one step of power iteration per minibatch and dividing by that estimate of $\sigma_1$. One mat-vec product per step: the cheapest possible spectral regulariser. Used in GANs, BigGAN, and many modern stability tricks.
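A simplified numpy sketch of the idea (my own paraphrase, not Miyato et al.'s exact implementation; spectral_normalise and the u bookkeeping are illustrative names):

```python
import numpy as np

def spectral_normalise(W, u, eps=1e-12):
    """One power-iteration step on W, then divide W by the sigma_1 estimate."""
    v = W.T @ u
    v /= (np.linalg.norm(v) + eps)
    u = W @ v
    u /= (np.linalg.norm(u) + eps)
    sigma1 = u @ W @ v                 # estimate of the largest singular value
    return W / sigma1, u               # carry u over to the next minibatch
```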
Given centred data $X \in \mathbb{R}^{n \times d}$ (rows = samples, columns = features), the empirical covariance is
$$\Sigma = \tfrac{1}{n-1} X^\top X \in \mathbb{R}^{d \times d}.$$
The eigenvectors of $\Sigma$ are the principal components; the eigenvalues are the variances along each direction. Sorted by eigenvalue, the top $k$ components form the optimal $k$-dimensional subspace for capturing variance.
The first principal component is the unit vector $\mathbf{w}_1$ maximising $\mathrm{Var}(X \mathbf{w}_1) = \mathbf{w}_1^\top \Sigma \mathbf{w}_1$. By the Rayleigh quotient, this is exactly the top eigenvector of $\Sigma$. Subsequent components are constrained to be orthogonal to all previous ones — a sequence of constrained Rayleigh maximisations.
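PCA by eigendecomposition in a few lines (the synthetic data and the choice of $k$ are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # correlated features
X = X - X.mean(axis=0)                                     # centre the data

Sigma = X.T @ X / (X.shape[0] - 1)        # empirical covariance, d x d
lam, W = np.linalg.eigh(Sigma)            # eigh: ascending eigenvalues
order = np.argsort(lam)[::-1]             # sort descending
lam, W = lam[order], W[:, order]

k = 2
Z = X @ W[:, :k]                          # projection onto the top-k components
print(lam / lam.sum())                    # variance fraction per component
```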
For a loss $\mathcal{L}(\theta)$, gradient descent near a critical point obeys (to leading order)
$$\theta_{t+1} - \theta^* \approx (I - \eta H)(\theta_t - \theta^*),$$
where $H$ is the Hessian at $\theta^*$. Diagonalise $H = Q\Lambda Q^\top$ and read off: each eigendirection of $H$ contracts by $1 - \eta\lambda_i$ per step.
$0 < \eta\lambda_i < 1$: smooth contraction. $1 - \eta\lambda_i$ lies between 0 and 1, so the error shrinks monotonically and the direction converges.
$1 < \eta\lambda_i < 2$: oscillating contraction. $1 - \eta\lambda_i$ is negative but less than 1 in magnitude, so the sign of the error flips each step but its magnitude shrinks.
$\eta\lambda_i > 2$: divergence. $|1 - \eta\lambda_i| > 1$, so the error grows. This is what "the learning rate is too high" means: the largest Hessian eigenvalue exceeds $2/\eta$.
Convergence speed is governed by the slowest-converging direction, the one with the smallest $|\lambda_i|$. The maximum stable learning rate is $\eta < 2/\lambda_{\max}$; at that rate the slowest direction contracts per step by a factor of roughly $1 - 2\lambda_{\min}/\lambda_{\max} = 1 - 2/\kappa$, where $\kappa = \lambda_{\max}/\lambda_{\min}$ is the condition number of $H$. Ill-conditioned losses (high $\kappa$) train slowly; pre-conditioning, normalisation and adaptive optimisers are designed to reduce $\kappa$.
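The whole picture in a dozen lines, on a quadratic of my own construction with $\kappa = 100$:

```python
import numpy as np

H = np.diag([1.0, 100.0])              # Hessian eigenvalues 1 and 100
eta = 1.9 / 100.0                      # just under the stability limit 2 / lambda_max

theta = np.array([1.0, 1.0])           # error relative to the optimum theta* = 0
for _ in range(100):
    theta = theta - eta * (H @ theta)  # gradient step on 0.5 * theta^T H theta

# The lambda = 100 direction contracts by |1 - 1.9| = 0.9 per step (oscillating);
# the lambda = 1 direction contracts by 0.981 and still retains ~15% of its error.
print(theta)
```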
The Hessian of a deep network's loss has eigenvalues spanning many orders of magnitude. SGD with momentum and adaptive optimisers (Adam, AdamW) implicitly precondition by per-coordinate scaling, smoothing out the eigenvalue spectrum. LayerNorm and residual connections both help by keeping the effective Hessian better conditioned across layers (Santurkar et al. 2018). The eigenvalue picture is the right way to read these architectural decisions.
The widget plots a unit circle of vectors and animates their evolution under repeated multiplication by a 2×2 matrix $A$. Eigenvectors stay on lines through the origin; everything else gets pulled towards (or pushed by) the dominant eigendirection.
The yellow and green dashed lines are the eigendirections (real eigenvalues). Step a few times: the circle deforms into an ellipse aligned with the eigenvectors; the long axis grows like $\lambda_1^k$ and the short like $\lambda_2^k$. Try a rotation matrix ($a = d = \cos\theta$, $b = -c = -\sin\theta$): no real eigenvectors, the circle just rotates.
Deck 07 — SVD, Low-Rank & LoRA generalises eigendecomposition to non-square matrices. The SVD is what made Eckart-Young, principal-component regression, LoRA, and DeepSeek's MLA possible — the right way to find the "best $r$-dimensional summary" of a linear map.