A derivative is the best linear approximation of a function. The Jacobian is what that linear approximation looks like as a matrix. The chain rule is matrix multiplication. Backprop is just multiplying these matrices in the cheap order.
For a smooth function $f : \mathbb{R}^n \to \mathbb{R}^m$, the derivative at a point $\mathbf{x}_0$ is the unique linear map $J : \mathbb{R}^n \to \mathbb{R}^m$ such that
$$f(\mathbf{x}_0 + \mathbf{h}) = f(\mathbf{x}_0) + J\mathbf{h} + o(\|\mathbf{h}\|).$$
"Best linear approximation": no other linear map gets the second term right to higher order. The little-$o$ tail vanishes faster than $\mathbf{h}$ as $\mathbf{h} \to 0$.
This definition is the right one to internalise. A derivative is not a number, a slope, or a partial derivative — it's a linear map. For scalar-to-scalar $f : \mathbb{R} \to \mathbb{R}$ that linear map is multiplication by a single number, and we call that number "the derivative". For vector-valued and vector-input functions the linear map is multi-dimensional and is encoded by a matrix.
Once you read derivatives as linear maps, the chain rule becomes composition of linear maps — matrix multiplication. Backprop becomes "multiply the matrices in a smart order". Vector-Jacobian products and Jacobian-vector products become "apply $J^\top$ to a vector" and "apply $J$ to a vector". All of autograd is a single piece of vocabulary applied consistently.
For $f : \mathbb{R}^n \to \mathbb{R}^m$ with components $f = (f_1, \ldots, f_m)$, the Jacobian at $\mathbf{x}$ is the $m \times n$ matrix of partial derivatives:
$$J_f(\mathbf{x}) = \begin{pmatrix} \partial f_1/\partial x_1 & \cdots & \partial f_1/\partial x_n \\ \vdots & \ddots & \vdots \\ \partial f_m/\partial x_1 & \cdots & \partial f_m/\partial x_n \end{pmatrix}.$$
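A concrete look at the shape, using PyTorch's `torch.autograd.functional.jacobian` on a made-up $f : \mathbb{R}^3 \to \mathbb{R}^2$ (the function is an arbitrary example):

```python
import torch
from torch.autograd.functional import jacobian

# Illustrative f : R^3 -> R^2.
def f(x):
    return torch.stack([x[0] * x[1] + x[2], x.pow(2).sum()])

x = torch.tensor([1.0, 2.0, 3.0])
J = jacobian(f, x)
print(J.shape)   # torch.Size([2, 3]): m rows (one per output), n columns (one per input)
print(J)         # row 0 = [x1, x0, 1] = [2, 1, 1]; row 1 = [2x0, 2x1, 2x2] = [2, 4, 6]
```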
One special case recurs constantly:
For a loss $\mathcal{L}(\theta) : \mathbb{R}^P \to \mathbb{R}$ (the case ML cares about), $J_\mathcal{L} \in \mathbb{R}^{1 \times P}$. The gradient $\nabla\mathcal{L} = J_\mathcal{L}^\top \in \mathbb{R}^P$ is what your optimiser steps along. The convention "gradient = column vector, derivative = row vector" comes from this: the gradient is the transpose of the Jacobian, but only when the output is scalar.
If $f = g \circ h$ — i.e. $f(\mathbf{x}) = g(h(\mathbf{x}))$ — with $h : \mathbb{R}^n \to \mathbb{R}^k$ and $g : \mathbb{R}^k \to \mathbb{R}^m$, then
$$J_f(\mathbf{x}) = J_g(h(\mathbf{x}))\, J_h(\mathbf{x}).$$
The Jacobian of a composition is the product of the Jacobians. Composition of linear maps is matrix multiplication. That's all this is.
A network with $L$ layers: $\mathbf{y} = f_L \circ f_{L-1} \circ \cdots \circ f_1(\mathbf{x})$. Then
$$J_f = J_L \, J_{L-1} \, \cdots \, J_1.$$
That's the chain rule. The forward pass evaluates the $f_i$'s in order; the backward pass multiplies the $J_i$'s in reverse order.
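A sketch verifying this numerically on a throwaway 3-layer stack (dimensions and weights are arbitrary): the Jacobian of the composition, computed in one shot, matches the product of per-layer Jacobians, each evaluated at the activation the forward pass produced.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
W1, W2, W3 = torch.randn(4, 3), torch.randn(4, 4), torch.randn(2, 4)
f1 = lambda x: torch.tanh(W1 @ x)
f2 = lambda x: torch.tanh(W2 @ x)
f3 = lambda x: W3 @ x
f = lambda x: f3(f2(f1(x)))

x = torch.randn(3)
a1, a2 = f1(x), f2(f1(x))              # forward-pass activations

J1 = jacobian(f1, x)                   # 4 x 3
J2 = jacobian(f2, a1)                  # 4 x 4
J3 = jacobian(f3, a2)                  # 2 x 4
J = jacobian(f, x)                     # 2 x 3

print(torch.allclose(J, J3 @ J2 @ J1, atol=1e-5))   # True: J_f = J_3 J_2 J_1
```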
Vanishing/exploding gradients follow directly: the singular values of the product of Jacobians behave roughly like products of the per-layer singular values (up to alignment between layers). If each layer's top singular value is $\sigma_1 = 0.5$, the gradient norm shrinks by at least half per layer — $2^{-100}$ at depth 100. If the singular values sit around 2, it explodes just as fast. Residual connections, LayerNorm, careful init, gradient clipping — all are attempts to keep this product near 1.
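A minimal demonstration of the scaling argument, using scaled random orthogonal matrices as stand-in per-layer Jacobians so every singular value is exactly the chosen scale (dimensions and scales are illustrative):

```python
import torch

torch.manual_seed(0)

def backprop_norm(scale, depth=100, dim=64):
    # Stand-in per-layer Jacobian: scale * (random orthogonal matrix),
    # so each layer multiplies every norm by exactly `scale`.
    v = torch.randn(dim)
    v = v / torch.linalg.norm(v)
    for _ in range(depth):
        Q, _ = torch.linalg.qr(torch.randn(dim, dim))
        v = scale * (Q @ v)
    return torch.linalg.norm(v).item()

print(backprop_norm(0.5))   # ~ 2**-100 ≈ 8e-31: the signal has vanished
print(backprop_norm(1.0))   # ~ 1: norm preserved
print(backprop_norm(2.0))   # ~ 2**100 ≈ 1.3e30: exploded
```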
A residual layer computes $\mathbf{y} = \mathbf{x} + f(\mathbf{x})$. Its Jacobian is $J = I + J_f$. As long as $J_f$ has small singular values, $J \approx I$ — the product across many layers stays well-conditioned. The identity is the safest matrix to raise to a high power: all its singular values are exactly 1. He et al. (2015) made essentially this argument; the trick has stuck.
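The same experiment with a residual stack, again with made-up numbers: a plain product of small-singular-value Jacobians collapses, while the product of $(I + J_f)$ factors stays near its starting norm.

```python
import torch

torch.manual_seed(0)
dim, depth = 64, 100
v_plain = torch.ones(dim)
v_res = torch.ones(dim)
for _ in range(depth):
    # Branch Jacobian with small singular values (the 0.1 scale is arbitrary).
    Jf = 0.1 * torch.randn(dim, dim) / dim ** 0.5
    v_plain = Jf @ v_plain                      # plain stack: J_f J_f ... J_f
    v_res = (torch.eye(dim) + Jf) @ v_res       # residual stack: (I + J_f) ...

print(torch.linalg.norm(v_plain).item())   # underflows toward 0
print(torch.linalg.norm(v_res).item())     # stays within a small factor of its start (8.0)
```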
Imagine the chain rule for a 4-layer net: $J = J_4 J_3 J_2 J_1$, with input dim $n$, output dim $m$, hidden dims $h_i$. You almost never want to materialise $J$. Instead you want one of two products:
JVP (forward mode): compute $J \mathbf{v}$ for a single input direction $\mathbf{v} \in \mathbb{R}^n$.
Strategy: $\mathbf{u}_0 = \mathbf{v}$; for $i = 1..L$, $\mathbf{u}_i = J_i \mathbf{u}_{i-1}$. One mat-vec per layer, going forward in lockstep with the forward pass.
Cost: about one forward pass per JVP, regardless of the output dim. The full Jacobian needs $n$ of them, so this direction is cheap when $n$ is small.
VJP (reverse mode): compute $\mathbf{w}^\top J$ for a single output direction $\mathbf{w} \in \mathbb{R}^m$ — equivalently, $J^\top \mathbf{w}$.
Strategy: $\mathbf{u}_L = \mathbf{w}$; for $i = L..1$, $\mathbf{u}_{i-1} = J_i^\top \mathbf{u}_i$. One mat-vec per layer, going backward through the layers.
Cost: about one forward pass plus one backward pass per VJP. The full Jacobian needs $m$ of them, so this direction is cheap when $m$ is small.
For a function $f : \mathbb{R}^n \to \mathbb{R}^m$, computing the full Jacobian costs $\min(n, m)$ JVPs or VJPs (whichever is cheaper). Computing one product costs $\sim 1\times$ a forward pass.
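Both products are one-liners in PyTorch via `torch.autograd.functional.jvp` and `vjp` (JAX's `jax.jvp`/`jax.vjp` are analogous). A sketch with arbitrary dimensions, checked against the materialised Jacobian — exactly the object you would normally avoid building:

```python
import torch
from torch.autograd.functional import jvp, vjp, jacobian

torch.manual_seed(0)
W1, W2 = torch.randn(64, 32), torch.randn(8, 64)
f = lambda x: W2 @ torch.tanh(W1 @ x)     # R^32 -> R^8, purely illustrative

x = torch.randn(32)
v = torch.randn(32)                       # input direction
w = torch.randn(8)                        # output direction

_, Jv = jvp(f, x, v)                      # forward mode: J v
_, JTw = vjp(f, x, w)                     # reverse mode: J^T w

J = jacobian(f, x)                        # 8 x 32, materialised only to check
print(torch.allclose(Jv, J @ v, atol=1e-4))
print(torch.allclose(JTw, J.T @ w, atol=1e-4))
```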
The loss $\mathcal{L}(\theta) : \mathbb{R}^P \to \mathbb{R}$ has $P$ inputs (parameters) and 1 output (the loss). The Jacobian $J_\mathcal{L} \in \mathbb{R}^{1 \times P}$; its transpose is the gradient $\nabla\mathcal{L} \in \mathbb{R}^P$.
Reverse mode computes $J^\top \mathbf{w}$ for any $\mathbf{w}$. Pick $\mathbf{w} = 1$ (the only output) and you get the full gradient in $\sim 1\times$ the forward pass. Forward mode would need $P$ JVPs — one per parameter — to get the same thing. For modern transformers $P \in [10^9, 10^{12}]$, so reverse mode is up to $10^{12}\times$ cheaper. This factor is why deep learning trains at all.
The win isn't free. Reverse mode needs to remember activations from the forward pass to apply $J_i^\top$ on the way back, and this activation memory is the dominant memory cost of training. The standard mitigation is activation (gradient) checkpointing: store only some activations and recompute the rest during the backward pass, trading extra FLOPs for memory.
Reverse-mode automatic differentiation was developed in the 1970s in the optimal-control literature (Linnainmaa 1970, Speelpenning 1980). Its application to neural networks was popularised by Rumelhart, Hinton & Williams 1986. Modern frameworks (PyTorch's autograd, JAX's grad/vjp/jvp) implement essentially the same algorithm, with engineering effort going into efficient graph capture, fusion, and scheduling.
The Jacobians of the operations a transformer is built from. You don't compute these explicitly — the framework does — but reading them in matrix form is the cleanest way to understand how gradient information flows.
| Forward | Jacobian | VJP (reverse) |
|---|---|---|
| $\mathbf{y} = W\mathbf{x}$ | $\partial\mathbf{y}/\partial\mathbf{x} = W$, $\partial\mathbf{y}/\partial W = \cdots$ (rank-3) | $\nabla_\mathbf{x} = W^\top \nabla_\mathbf{y}$, $\nabla_W = \nabla_\mathbf{y}\, \mathbf{x}^\top$ |
| $\mathbf{y} = \mathbf{x} + \mathbf{b}$ | $\partial\mathbf{y}/\partial\mathbf{x} = I$, $\partial\mathbf{y}/\partial\mathbf{b} = I$ | $\nabla_\mathbf{x} = \nabla_\mathbf{y}$, $\nabla_\mathbf{b} = \nabla_\mathbf{y}$ |
| $y_i = \sigma(x_i)$ elementwise | $\mathrm{diag}(\sigma'(x_i))$ | $\nabla_\mathbf{x} = \sigma'(\mathbf{x}) \odot \nabla_\mathbf{y}$ |
| $\mathbf{y} = \mathbf{a} \odot \mathbf{x}$ | $\mathrm{diag}(\mathbf{a})$ | $\nabla_\mathbf{x} = \mathbf{a} \odot \nabla_\mathbf{y}$, $\nabla_\mathbf{a} = \mathbf{x} \odot \nabla_\mathbf{y}$ |
| $y = \mathrm{sum}(\mathbf{x})$ | $\mathbf{1}^\top$ | $\nabla_\mathbf{x} = \nabla_y \cdot \mathbf{1}$ (broadcast) |
| $\mathbf{y} = \mathrm{softmax}(\mathbf{x})$ | $\mathrm{diag}(\mathbf{y}) - \mathbf{y}\mathbf{y}^\top$ | $\nabla_\mathbf{x} = \mathbf{y} \odot (\nabla_\mathbf{y} - (\mathbf{y}^\top \nabla_\mathbf{y}))$ |
| $y = \frac{1}{2}\|\mathbf{x}\|^2$ | $\mathbf{x}^\top$ | $\nabla_\mathbf{x} = \nabla_y \cdot \mathbf{x}$ |
Notice: for elementwise ops the Jacobian is diagonal — you don't store the matrix, just the diagonal. For matrix multiply, the VJP with respect to the input is just multiplication by $W^\top$: a transposed matmul is the input-gradient half of a linear layer's backward pass (the weight gradient, an outer product, is the other half; see the next section).
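A spot-check of one row of the table, the softmax VJP, against autograd (the vector sizes are arbitrary):

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, requires_grad=True)
g = torch.randn(5)                        # arbitrary upstream gradient, playing the role of ∇_y

y = torch.softmax(x, dim=0)
y.backward(g)                             # autograd computes the VJP J^T g

# Closed form from the table: ∇_x = y ⊙ (∇_y − y^T ∇_y)
manual = y.detach() * (g - (y.detach() * g).sum())
print(torch.allclose(x.grad, manual, atol=1e-6))   # True
```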
The single most-evaluated piece of math in modern deep learning. Forward:
$$\mathbf{y} = W\mathbf{x} + \mathbf{b}.$$
Suppose we have the upstream gradient $\nabla_\mathbf{y} = \partial \mathcal{L}/\partial \mathbf{y}$. The backward pass produces three things:
$$\nabla_\mathbf{x} = W^\top \nabla_\mathbf{y}, \qquad \nabla_W = \nabla_\mathbf{y}\, \mathbf{x}^\top, \qquad \nabla_\mathbf{b} = \nabla_\mathbf{y}.$$
For a batch of inputs $X \in \mathbb{R}^{B \times d_{in}}$, weights $W \in \mathbb{R}^{d_{out} \times d_{in}}$, and outputs $Y = XW^\top + \mathbf{b}$ (PyTorch convention), the backward is
$$\nabla_X = \nabla_Y \, W, \qquad \nabla_W = \nabla_Y^\top X, \qquad \nabla_\mathbf{b} = \mathbf{1}^\top \nabla_Y.$$
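A sketch checking the three batched formulas against autograd, with throwaway dimensions:

```python
import torch

torch.manual_seed(0)
B, d_in, d_out = 4, 3, 5
X = torch.randn(B, d_in, requires_grad=True)
W = torch.randn(d_out, d_in, requires_grad=True)
b = torch.randn(d_out, requires_grad=True)
gY = torch.randn(B, d_out)                    # upstream gradient ∇_Y

Y = X @ W.T + b                               # forward: Y = X W^T + b
Y.backward(gY)

print(torch.allclose(X.grad, gY @ W))         # ∇_X = ∇_Y W
print(torch.allclose(W.grad, gY.T @ X))       # ∇_W = ∇_Y^T X
print(torch.allclose(b.grad, gY.sum(dim=0)))  # ∇_b = 1^T ∇_Y
```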
One matmul forward, two matmuls backward. That's why training a transformer takes ~3× the FLOPs of inference.
Forward: 1 matmul per linear layer. Backward: 1 matmul to compute $\nabla_X$ (input gradient), 1 matmul to compute $\nabla_W$ (weight gradient). Total: three matmuls of equal FLOP cost (the shapes differ, the multiply-add count doesn't). Storing activations adds memory on top, not FLOPs. The famous "training is ~3× inference FLOPs" rule comes directly from this.
The combined softmax-then-cross-entropy is the most common output layer in classification — including next-token prediction in every LLM. Done correctly, the gradient is shockingly simple.
Forward (logits $\mathbf{z} \in \mathbb{R}^V$, target one-hot $\mathbf{y}^\star$):
$$\mathbf{p} = \mathrm{softmax}(\mathbf{z}), \qquad \mathcal{L} = -\sum_i y^\star_i \log p_i.$$
Backward:
$$\nabla_\mathbf{z} = \mathbf{p} - \mathbf{y}^\star.$$
The gradient of cross-entropy with respect to the logits is just (predicted distribution) minus (target one-hot). No softmax-Jacobian to expand, no division, no logarithm to differentiate. The complexity all cancels.
The softmax Jacobian $J_{\mathrm{sm}} = \mathrm{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top$ (symmetric, so transposing it changes nothing) acts on the cross-entropy gradient $\partial \mathcal{L}/\partial \mathbf{p} = -\mathbf{y}^\star/\mathbf{p}$ (elementwise division). Multiplying out:
$$J_{\mathrm{sm}}^\top \cdot (-\mathbf{y}^\star/\mathbf{p}) = -(\mathrm{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top)(\mathbf{y}^\star/\mathbf{p}) = -\mathbf{y}^\star + \mathbf{p}\,(\mathbf{1}^\top\mathbf{y}^\star) = \mathbf{p} - \mathbf{y}^\star,$$
using $\sum y^\star_i = 1$ for one-hot targets.
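A quick check of the identity with `F.cross_entropy` and autograd (vocabulary size and target index made up):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V = 7
z = torch.randn(V, requires_grad=True)          # logits
target = torch.tensor([3])                      # class index; y* is one-hot at position 3

loss = F.cross_entropy(z.unsqueeze(0), target)  # fused softmax + cross-entropy
loss.backward()

p = torch.softmax(z.detach(), dim=0)
y_star = torch.zeros(V)
y_star[3] = 1.0
print(torch.allclose(z.grad, p - y_star, atol=1e-6))   # True: ∇_z = p − y*
```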
Numerical-stability note: implementing softmax then cross-entropy as separate steps loses precision (the log of a tiny probability is a huge negative number, and small softmax probabilities underflow to 0 in fp16, where the log becomes $-\infty$). The fused operator `F.cross_entropy` takes logits directly, computes $\log\sum_i e^{z_i}$ stably, and never materialises the probabilities.
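A minimal sketch of the standard stabilisation (the max-shift form of log-sum-exp, the usual way fused implementations avoid overflow; the numbers are contrived to break a naive version):

```python
import torch

def cross_entropy_from_logits(z, target_idx):
    # log p_t = z_t − logsumexp(z); subtracting the max keeps every exp() in range.
    m = z.max()
    logsumexp = m + torch.log(torch.exp(z - m).sum())
    return logsumexp - z[target_idx]

z = torch.tensor([1000.0, -5.0, 3.0])       # exp(1000) overflows a naive softmax
print(cross_entropy_from_logits(z, 0))      # ≈ 0: the huge logit takes nearly all the mass
```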
A toy 3-layer MLP with random weights and a fake target. Press "Forward" to see activations propagate. Press "Backward" to see gradients flow in reverse. Each layer's contribution is shown alongside the chain-rule formula being applied.
Press Forward to start. Activations flow left to right; gradients right to left.
The violet edges light up when their layer is computing the forward pass; pink edges light up during the backward pass. Watch how the gradient $\mathrm{d}\mathcal{L}/\mathrm{d}\mathbf{y}$ at the right propagates step-by-step to $\mathrm{d}\mathcal{L}/\mathrm{d}\mathbf{x}$ at the left, multiplied by $W_i^\top$ at each linear layer and gated by the ReLU derivative.
Deck 10 — Attention as Linear Algebra applies all of the above to one specific operation: scaled dot-product attention. By the end you can read every matmul, every gradient, every shape annotation in a real attention implementation.