A derivative is the best linear approximation of a function. The Jacobian is what that linear approximation looks like as a matrix. The chain rule is matrix multiplication. Backprop is just multiplying these matrices in the cheap order.
For a smooth function $f : \mathbb{R}^n \to \mathbb{R}^m$, the derivative at a point $\mathbf{x}_0$ is the unique linear map $J : \mathbb{R}^n \to \mathbb{R}^m$ such that
$$f(\mathbf{x}_0 + \mathbf{h}) = f(\mathbf{x}_0) + J\mathbf{h} + o(\|\mathbf{h}\|).$$
"Best linear approximation": no other linear map gets the second term right to higher order. The little-$o$ tail vanishes faster than $\mathbf{h}$ as $\mathbf{h} \to 0$.
This definition is the right one to internalise. A derivative is not a number, a slope, or a partial derivative — it's a linear map. For scalar-to-scalar $f : \mathbb{R} \to \mathbb{R}$ that linear map is multiplication by a single number, and we call that number "the derivative". For vector-valued and vector-input functions the linear map is multi-dimensional and is encoded by a matrix.
Once you read derivatives as linear maps, the chain rule becomes composition of linear maps — matrix multiplication. Backprop becomes "multiply the matrices in a smart order". Vector-Jacobian products and Jacobian-vector products become "apply $J^\top$ to a vector" and "apply $J$ to a vector". All of autograd is a single piece of vocabulary applied consistently.
For $f : \mathbb{R}^n \to \mathbb{R}^m$ with components $f = (f_1, \ldots, f_m)$, the Jacobian at $\mathbf{x}$ is the $m \times n$ matrix of partial derivatives:
$$J_f(\mathbf{x}) = \begin{pmatrix} \partial f_1/\partial x_1 & \cdots & \partial f_1/\partial x_n \\ \vdots & \ddots & \vdots \\ \partial f_m/\partial x_1 & \cdots & \partial f_m/\partial x_n \end{pmatrix}.$$
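A concrete look at the shape, using PyTorch's `torch.autograd.functional.jacobian` on a made-up $f : \mathbb{R}^3 \to \mathbb{R}^2$ (the function is an arbitrary example):

```python
import torch
from torch.autograd.functional import jacobian

# Illustrative f : R^3 -> R^2.
def f(x):
    return torch.stack([x[0] * x[1] + x[2], x.pow(2).sum()])

x = torch.tensor([1.0, 2.0, 3.0])
J = jacobian(f, x)
print(J.shape)   # torch.Size([2, 3]): m rows (one per output), n columns (one per input)
print(J)         # row 0 = [x1, x0, 1] = [2, 1, 1]; row 1 = [2x0, 2x1, 2x2] = [2, 4, 6]
```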
One special case recurs constantly:
For a loss $\mathcal{L}(\theta) : \mathbb{R}^P \to \mathbb{R}$ (the case ML cares about), $J_\mathcal{L} \in \mathbb{R}^{1 \times P}$. The gradient $\nabla\mathcal{L} = J_\mathcal{L}^\top \in \mathbb{R}^P$ is what your optimiser steps along. The convention "gradient = column vector, derivative = row vector" comes from this: the gradient is the transpose of the Jacobian, but only when the output is scalar.
If $f = g \circ h$ — i.e. $f(\mathbf{x}) = g(h(\mathbf{x}))$ — with $h : \mathbb{R}^n \to \mathbb{R}^k$ and $g : \mathbb{R}^k \to \mathbb{R}^m$, then
$$J_f(\mathbf{x}) = J_g(h(\mathbf{x}))\, J_h(\mathbf{x}).$$
The Jacobian of a composition is the product of the Jacobians. Composition of linear maps is matrix multiplication. That's all this is.
A network with $L$ layers: $\mathbf{y} = f_L \circ f_{L-1} \circ \cdots \circ f_1(\mathbf{x})$. Then
$$J_f = J_L \, J_{L-1} \, \cdots \, J_1.$$
That's the chain rule. The forward pass evaluates the $f_i$'s in order; the backward pass multiplies the $J_i$'s in reverse order.
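A sketch verifying this numerically on a throwaway 3-layer stack (dimensions and weights are arbitrary): the Jacobian of the composition, computed in one shot, matches the product of per-layer Jacobians, each evaluated at the activation the forward pass produced.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
W1, W2, W3 = torch.randn(4, 3), torch.randn(4, 4), torch.randn(2, 4)
f1 = lambda x: torch.tanh(W1 @ x)
f2 = lambda x: torch.tanh(W2 @ x)
f3 = lambda x: W3 @ x
f = lambda x: f3(f2(f1(x)))

x = torch.randn(3)
a1, a2 = f1(x), f2(f1(x))              # forward-pass activations

J1 = jacobian(f1, x)                   # 4 x 3
J2 = jacobian(f2, a1)                  # 4 x 4
J3 = jacobian(f3, a2)                  # 2 x 4
J = jacobian(f, x)                     # 2 x 3

print(torch.allclose(J, J3 @ J2 @ J1, atol=1e-5))   # True: J_f = J_3 J_2 J_1
```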
Vanishing/exploding gradients follow directly: the singular values of the product of Jacobians behave roughly like products of the per-layer singular values (up to alignment between layers). If each layer's top singular value is $\sigma_1 = 0.5$, the gradient norm shrinks by at least half per layer — $2^{-100}$ at depth 100. If the singular values sit around 2, it explodes just as fast. Residual connections, LayerNorm, careful init, gradient clipping — all are attempts to keep this product near 1.
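A minimal demonstration of the scaling argument, using scaled random orthogonal matrices as stand-in per-layer Jacobians so every singular value is exactly the chosen scale (dimensions and scales are illustrative):

```python
import torch

torch.manual_seed(0)

def backprop_norm(scale, depth=100, dim=64):
    # Stand-in per-layer Jacobian: scale * (random orthogonal matrix),
    # so each layer multiplies every norm by exactly `scale`.
    v = torch.randn(dim)
    v = v / torch.linalg.norm(v)
    for _ in range(depth):
        Q, _ = torch.linalg.qr(torch.randn(dim, dim))
        v = scale * (Q @ v)
    return torch.linalg.norm(v).item()

print(backprop_norm(0.5))   # ~ 2**-100 ≈ 8e-31: the signal has vanished
print(backprop_norm(1.0))   # ~ 1: norm preserved
print(backprop_norm(2.0))   # ~ 2**100 ≈ 1.3e30: exploded
```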
A residual layer computes $\mathbf{y} = \mathbf{x} + f(\mathbf{x})$. Its Jacobian is $J = I + J_f$. As long as $J_f$ has small singular values, $J \approx I$ — the product across many layers stays well-conditioned. The identity is the safest matrix to raise to a high power: all its singular values are exactly 1. He et al. (2015) made essentially this argument; the trick has stuck.
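The same experiment with a residual stack, again with made-up numbers: a plain product of small-singular-value Jacobians collapses, while the product of $(I + J_f)$ factors stays near its starting norm.

```python
import torch

torch.manual_seed(0)
dim, depth = 64, 100
v_plain = torch.ones(dim)
v_res = torch.ones(dim)
for _ in range(depth):
    # Branch Jacobian with small singular values (the 0.1 scale is arbitrary).
    Jf = 0.1 * torch.randn(dim, dim) / dim ** 0.5
    v_plain = Jf @ v_plain                      # plain stack: J_f J_f ... J_f
    v_res = (torch.eye(dim) + Jf) @ v_res       # residual stack: (I + J_f) ...

print(torch.linalg.norm(v_plain).item())   # underflows toward 0
print(torch.linalg.norm(v_res).item())     # stays within a small factor of its start (8.0)
```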
Imagine the chain rule for a 4-layer net: $J = J_4 J_3 J_2 J_1$, with input dim $n$, output dim $m$, hidden dims $h_i$. You almost never want to materialise $J$. Instead you want one of two products:
JVP (forward mode): compute $J \mathbf{v}$ for a single input direction $\mathbf{v} \in \mathbb{R}^n$.
Strategy: $\mathbf{u}_0 = \mathbf{v}$; for $i = 1..L$, $\mathbf{u}_i = J_i \mathbf{u}_{i-1}$. One mat-vec per layer, going forward in lockstep with the forward pass.
Cost: about one forward pass per JVP, regardless of the output dim. The full Jacobian needs $n$ of them, so this direction is cheap when $n$ is small.
VJP (reverse mode): compute $\mathbf{w}^\top J$ for a single output direction $\mathbf{w} \in \mathbb{R}^m$ — equivalently, $J^\top \mathbf{w}$.
Strategy: $\mathbf{u}_L = \mathbf{w}$; for $i = L..1$, $\mathbf{u}_{i-1} = J_i^\top \mathbf{u}_i$. One mat-vec per layer, going backward through the layers.
Cost: about one forward pass plus one backward pass per VJP. The full Jacobian needs $m$ of them, so this direction is cheap when $m$ is small.
For a function $f : \mathbb{R}^n \to \mathbb{R}^m$, computing the full Jacobian costs $\min(n, m)$ JVPs or VJPs (whichever is cheaper). Computing one product costs $\sim 1\times$ a forward pass.
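Both products are one-liners in PyTorch via `torch.autograd.functional.jvp` and `vjp` (JAX's `jax.jvp`/`jax.vjp` are analogous). A sketch with arbitrary dimensions, checked against the materialised Jacobian — exactly the object you would normally avoid building:

```python
import torch
from torch.autograd.functional import jvp, vjp, jacobian

torch.manual_seed(0)
W1, W2 = torch.randn(64, 32), torch.randn(8, 64)
f = lambda x: W2 @ torch.tanh(W1 @ x)     # R^32 -> R^8, purely illustrative

x = torch.randn(32)
v = torch.randn(32)                       # input direction
w = torch.randn(8)                        # output direction

_, Jv = jvp(f, x, v)                      # forward mode: J v
_, JTw = vjp(f, x, w)                     # reverse mode: J^T w

J = jacobian(f, x)                        # 8 x 32, materialised only to check
print(torch.allclose(Jv, J @ v, atol=1e-4))
print(torch.allclose(JTw, J.T @ w, atol=1e-4))
```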
The loss $\mathcal{L}(\theta) : \mathbb{R}^P \to \mathbb{R}$ has $P$ inputs (parameters) and 1 output (the loss). The Jacobian $J_\mathcal{L} \in \mathbb{R}^{1 \times P}$; its transpose is the gradient $\nabla\mathcal{L} \in \mathbb{R}^P$.
Reverse mode computes $J^\top \mathbf{w}$ for any $\mathbf{w}$. Pick $\mathbf{w} = 1$ (the only output) and you get the full gradient in $\sim 1\times$ the forward pass. Forward mode would need $P$ JVPs — one per parameter — to get the same thing. For modern transformers $P \in [10^9, 10^{12}]$, so reverse mode is up to $10^{12}\times$ cheaper. This factor is why deep learning trains at all.
The win isn't free. Reverse mode needs to remember activations from the forward pass to apply $J_i^\top$ on the way back, and this activation memory is the dominant memory cost of training. The standard mitigation is activation (gradient) checkpointing: store only some activations and recompute the rest during the backward pass, trading extra FLOPs for memory.
Reverse-mode automatic differentiation was developed in the 1970s in the optimal-control literature (Linnainmaa 1970, Speelpenning 1980). Its application to neural networks was popularised by Rumelhart, Hinton & Williams 1986. Modern frameworks (PyTorch's autograd, JAX's grad/vjp/jvp) implement essentially the same algorithm, with engineering effort going into efficient graph capture, fusion, and scheduling.
The Jacobians of the operations a transformer is built from. You don't compute these explicitly — the framework does — but reading them in matrix form is the cleanest way to understand how gradient information flows.
| Forward | Jacobian | VJP (reverse) |
|---|---|---|
| $\mathbf{y} = W\mathbf{x}$ | $\partial\mathbf{y}/\partial\mathbf{x} = W$, $\partial\mathbf{y}/\partial W = \cdots$ (rank-3) | $\nabla_\mathbf{x} = W^\top \nabla_\mathbf{y}$, $\nabla_W = \nabla_\mathbf{y}\, \mathbf{x}^\top$ |
| $\mathbf{y} = \mathbf{x} + \mathbf{b}$ | $\partial\mathbf{y}/\partial\mathbf{x} = I$, $\partial\mathbf{y}/\partial\mathbf{b} = I$ | $\nabla_\mathbf{x} = \nabla_\mathbf{y}$, $\nabla_\mathbf{b} = \nabla_\mathbf{y}$ |
| $y_i = \sigma(x_i)$ elementwise | $\mathrm{diag}(\sigma'(x_i))$ | $\nabla_\mathbf{x} = \sigma'(\mathbf{x}) \odot \nabla_\mathbf{y}$ |
| $\mathbf{y} = \mathbf{a} \odot \mathbf{x}$ | $\mathrm{diag}(\mathbf{a})$ | $\nabla_\mathbf{x} = \mathbf{a} \odot \nabla_\mathbf{y}$, $\nabla_\mathbf{a} = \mathbf{x} \odot \nabla_\mathbf{y}$ |
| $y = \mathrm{sum}(\mathbf{x})$ | $\mathbf{1}^\top$ | $\nabla_\mathbf{x} = \nabla_y \cdot \mathbf{1}$ (broadcast) |
| $\mathbf{y} = \mathrm{softmax}(\mathbf{x})$ | $\mathrm{diag}(\mathbf{y}) - \mathbf{y}\mathbf{y}^\top$ | $\nabla_\mathbf{x} = \mathbf{y} \odot (\nabla_\mathbf{y} - (\mathbf{y}^\top \nabla_\mathbf{y}))$ |
| $y = \frac{1}{2}\|\mathbf{x}\|^2$ | $\mathbf{x}^\top$ | $\nabla_\mathbf{x} = \nabla_y \cdot \mathbf{x}$ |
Notice: for elementwise ops the Jacobian is diagonal — you don't store the matrix, just the diagonal. For matrix multiply, the VJP with respect to the input is just multiplication by $W^\top$: a transposed matmul is the input-gradient half of a linear layer's backward pass (the weight gradient, an outer product, is the other half; see the next section).
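A spot-check of one row of the table, the softmax VJP, against autograd (the vector sizes are arbitrary):

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, requires_grad=True)
g = torch.randn(5)                        # arbitrary upstream gradient, playing the role of ∇_y

y = torch.softmax(x, dim=0)
y.backward(g)                             # autograd computes the VJP J^T g

# Closed form from the table: ∇_x = y ⊙ (∇_y − y^T ∇_y)
manual = y.detach() * (g - (y.detach() * g).sum())
print(torch.allclose(x.grad, manual, atol=1e-6))   # True
```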
The single most-evaluated piece of math in modern deep learning. Forward:
$$\mathbf{y} = W\mathbf{x} + \mathbf{b}.$$
Suppose we have the upstream gradient $\nabla_\mathbf{y} = \partial \mathcal{L}/\partial \mathbf{y}$. The backward pass produces three things:
$$\nabla_\mathbf{x} = W^\top \nabla_\mathbf{y}, \qquad \nabla_W = \nabla_\mathbf{y}\, \mathbf{x}^\top, \qquad \nabla_\mathbf{b} = \nabla_\mathbf{y}.$$
For a batch of inputs $X \in \mathbb{R}^{B \times d_{in}}$, weights $W \in \mathbb{R}^{d_{out} \times d_{in}}$, and outputs $Y = XW^\top + \mathbf{b}$ (PyTorch convention), the backward is
$$\nabla_X = \nabla_Y \, W, \qquad \nabla_W = \nabla_Y^\top X, \qquad \nabla_\mathbf{b} = \mathbf{1}^\top \nabla_Y.$$
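A sketch checking the three batched formulas against autograd, with throwaway dimensions:

```python
import torch

torch.manual_seed(0)
B, d_in, d_out = 4, 3, 5
X = torch.randn(B, d_in, requires_grad=True)
W = torch.randn(d_out, d_in, requires_grad=True)
b = torch.randn(d_out, requires_grad=True)
gY = torch.randn(B, d_out)                    # upstream gradient ∇_Y

Y = X @ W.T + b                               # forward: Y = X W^T + b
Y.backward(gY)

print(torch.allclose(X.grad, gY @ W))         # ∇_X = ∇_Y W
print(torch.allclose(W.grad, gY.T @ X))       # ∇_W = ∇_Y^T X
print(torch.allclose(b.grad, gY.sum(dim=0)))  # ∇_b = 1^T ∇_Y
```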
One matmul forward, two matmuls backward. That's why training a transformer takes ~3× the FLOPs of inference.
Forward: 1 matmul per linear layer. Backward: 1 matmul to compute $\nabla_X$ (input gradient), 1 matmul to compute $\nabla_W$ (weight gradient). Total: three matmuls of equal FLOP cost (the shapes differ, the multiply-add count doesn't). Storing activations adds memory on top, not FLOPs. The famous "training is ~3× inference FLOPs" rule comes directly from this.
The combined softmax-then-cross-entropy is the most common output layer in classification — including next-token prediction in every LLM. Done correctly, the gradient is shockingly simple.
Forward (logits $\mathbf{z} \in \mathbb{R}^V$, target one-hot $\mathbf{y}^\star$):
$$\mathbf{p} = \mathrm{softmax}(\mathbf{z}), \qquad \mathcal{L} = -\sum_i y^\star_i \log p_i.$$
Backward:
$$\nabla_\mathbf{z} = \mathbf{p} - \mathbf{y}^\star.$$
The gradient of cross-entropy with respect to the logits is just (predicted distribution) minus (target one-hot). No softmax-Jacobian to expand, no division, no logarithm to differentiate. The complexity all cancels.
The softmax Jacobian $J_{\mathrm{sm}} = \mathrm{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top$ (symmetric, so transposing it changes nothing) acts on the cross-entropy gradient $\partial \mathcal{L}/\partial \mathbf{p} = -\mathbf{y}^\star/\mathbf{p}$ (elementwise division). Multiplying out:
$$J_{\mathrm{sm}}^\top \cdot (-\mathbf{y}^\star/\mathbf{p}) = -(\mathrm{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top)(\mathbf{y}^\star/\mathbf{p}) = -\mathbf{y}^\star + \mathbf{p}\,(\mathbf{1}^\top\mathbf{y}^\star) = \mathbf{p} - \mathbf{y}^\star,$$
using $\sum y^\star_i = 1$ for one-hot targets.
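A quick check of the identity with `F.cross_entropy` and autograd (vocabulary size and target index made up):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V = 7
z = torch.randn(V, requires_grad=True)          # logits
target = torch.tensor([3])                      # class index; y* is one-hot at position 3

loss = F.cross_entropy(z.unsqueeze(0), target)  # fused softmax + cross-entropy
loss.backward()

p = torch.softmax(z.detach(), dim=0)
y_star = torch.zeros(V)
y_star[3] = 1.0
print(torch.allclose(z.grad, p - y_star, atol=1e-6))   # True: ∇_z = p − y*
```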
Numerical-stability note: implementing softmax then cross-entropy as separate steps loses precision (the log of a tiny probability is a huge negative number, and small softmax probabilities underflow to 0 in fp16, where the log becomes $-\infty$). The fused operator `F.cross_entropy` takes logits directly, computes $\log\sum_i e^{z_i}$ stably, and never materialises the probabilities.
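A minimal sketch of the standard stabilisation (the max-shift form of log-sum-exp, the usual way fused implementations avoid overflow; the numbers are contrived to break a naive version):

```python
import torch

def cross_entropy_from_logits(z, target_idx):
    # log p_t = z_t − logsumexp(z); subtracting the max keeps every exp() in range.
    m = z.max()
    logsumexp = m + torch.log(torch.exp(z - m).sum())
    return logsumexp - z[target_idx]

z = torch.tensor([1000.0, -5.0, 3.0])       # exp(1000) overflows a naive softmax
print(cross_entropy_from_logits(z, 0))      # ≈ 0: the huge logit takes nearly all the mass
```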
A toy 3-layer MLP with random weights and a fake target. Press "Forward" to see activations propagate. Press "Backward" to see gradients flow in reverse. Each layer's contribution is shown alongside the chain-rule formula being applied.
Press Forward to start. Activations flow left to right; gradients right to left.
The violet edges light up when their layer is computing the forward pass; pink edges light up during the backward pass. Watch how the gradient $\mathrm{d}\mathcal{L}/\mathrm{d}\mathbf{y}$ at the right propagates step-by-step to $\mathrm{d}\mathcal{L}/\mathrm{d}\mathbf{x}$ at the left, multiplied by $W_i^\top$ at each linear layer and gated by the ReLU derivative.
Deck 10 — Attention as Linear Algebra applies all of the above to one specific operation: scaled dot-product attention. By the end you can read every matmul, every gradient, every shape annotation in a real attention implementation.