Mathematics for Machine Learning — Deck 07

Continuous Optimisation

Gradient descent, Lagrange multipliers, convexity and the dual — the optimisation toolkit Part II needs.

grad descentmomentum LagrangeKKT convexdual
00

Topics We'll Cover

01

The General Problem

The chapter studies

$$\min_{\mathbf{x}\in\mathbb{R}^d}\; f(\mathbf{x}) \quad \text{subject to} \quad g_i(\mathbf{x}) \le 0,\; h_j(\mathbf{x}) = 0,$$

where $f, g_i, h_j$ are differentiable. Three cases of practical interest:

If $f$ is convex and the constraint set is convex, the problem is a convex programme — every local minimum is global, and powerful solvers exist (interior-point, projected gradient, ADMM…).

02

Gradient Descent — Convergence under Smoothness

The iteration:

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta\,\nabla f(\mathbf{x}_t)^\top.$$

If $f$ is $L$-smooth ($\|\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\| \le L\|\mathbf{x}-\mathbf{y}\|$) and $\eta = 1/L$:

$$f(\mathbf{x}_{t+1}) \le f(\mathbf{x}_t) - \frac{1}{2L}\|\nabla f(\mathbf{x}_t)\|^2.$$

So $f$ strictly decreases; for convex $f$ the rate is $\mathcal{O}(1/t)$.

Strong convexity

If $f$ is also $m$-strongly convex (the Hessian satisfies $H_f \succeq mI$), the rate becomes linear:

$$f(\mathbf{x}_t) - f^\star \le \left(1 - \frac{m}{L}\right)^t (f(\mathbf{x}_0) - f^\star).$$

The ratio $\kappa = L/m$ is the condition number; it controls how many iterations you need. The book's example: $f(\mathbf{x}) = \tfrac12\mathbf{x}^\top A\mathbf{x}$ has $L=\lambda_{\max}(A)$ and $m=\lambda_{\min}(A)$.

03

Step Size: Fixed, Line Search, Decay

ChoiceRecipeNotes
Fixed$\eta_t = \eta$Simple. Needs $\eta \le 1/L$ for guarantee. Underused in deep learning.
Diminishing$\eta_t = c/t$ or $c/\sqrt{t}$For stochastic gradients. The $1/\sqrt{t}$ schedule is what gives SGD's classical $\mathcal{O}(1/\sqrt{t})$ rate.
Backtracking line searchHalve $\eta$ until Armijo: $f(\mathbf{x}-\eta\mathbf{g}) \le f(\mathbf{x}) - \alpha\eta\|\mathbf{g}\|^2$No prior knowledge of $L$; widely used outside deep learning.
Adam / AdamWper-coordinate $\eta$, adaptive from historyWhat every transformer trainer uses; not analysed in the book but implements the same idea.
04

Momentum and Nesterov Acceleration

Heavy ball / momentum:

$$\mathbf{v}_{t+1} = \beta\mathbf{v}_t + \nabla f(\mathbf{x}_t), \qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \eta\,\mathbf{v}_{t+1}.$$

Smooths out oscillations along sharp directions and accumulates progress along flat ones.

Nesterov:

$$\mathbf{y}_t = \mathbf{x}_t + \beta(\mathbf{x}_t-\mathbf{x}_{t-1}),\qquad \mathbf{x}_{t+1} = \mathbf{y}_t - \eta\nabla f(\mathbf{y}_t).$$

Look ahead before taking the gradient. Achieves the optimal rate $\mathcal{O}(1/t^2)$ for smooth convex $f$ — provably the best first-order rate.

In one line

Plain GD: $\mathcal{O}(1/t)$. Nesterov: $\mathcal{O}(1/t^2)$. Stochastic GD: $\mathcal{O}(1/\sqrt{t})$. Strongly-convex GD: linear ($e^{-t/\kappa}$).

05

Constrained Optimisation: Equality & the Lagrangian

For $\min f(\mathbf{x})$ subject to $h(\mathbf{x})=0$, form the Lagrangian

$$\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) - \lambda\, h(\mathbf{x}).$$

At a stationary point:

$$\nabla_{\mathbf{x}}\mathcal{L} = 0 \;\Longleftrightarrow\; \nabla f(\mathbf{x}^\star) = \lambda^\star\,\nabla h(\mathbf{x}^\star), \quad h(\mathbf{x}^\star) = 0.$$

Geometrically: the gradient of $f$ must be parallel to the gradient of $h$ at the optimum. Otherwise you could move along the constraint surface and decrease $f$ further.

For multiple equality constraints, sum: $\mathcal L = f - \sum_j \lambda_j h_j$ and $\nabla f = \sum_j \lambda_j \nabla h_j$ at the optimum.

06

KKT Conditions for Inequality Constraints

For $\min f(\mathbf{x})$ s.t. $g_i(\mathbf{x})\le 0$, $h_j(\mathbf{x})=0$, the Karush–Kuhn–Tucker conditions are necessary at any local optimum (under regularity).

  1. Stationarity: $\nabla f + \sum_i \alpha_i \nabla g_i + \sum_j \lambda_j \nabla h_j = 0$.
  2. Primal feasibility: $g_i(\mathbf{x})\le 0$, $h_j(\mathbf{x})=0$.
  3. Dual feasibility: $\alpha_i \ge 0$.
  4. Complementary slackness: $\alpha_i\, g_i(\mathbf{x}) = 0$ for every $i$.

Complementary slackness is the key one: either constraint $i$ is active ($g_i=0$, multiplier $\alpha_i>0$) or it's inactive ($g_i<0$, multiplier $\alpha_i=0$). The SVM in deck 12 reads off "support vectors" from exactly this rule.

07

Interactive: Lagrange Multipliers

Minimise $f(\mathbf{x}) = (x_1-c_1)^2 + (x_2-c_2)^2$ on the constraint $h(\mathbf{x}) = x_1^2 + x_2^2 - r^2 = 0$ (the circle of radius $r$). Drag the centre of $f$; change $r$. The optimum is plotted — notice $\nabla f$ and $\nabla h$ aligned at the optimum.

1.40
x*
f(x*)
λ*

The Lagrangian condition reads $2(\mathbf{x}^\star - \mathbf{c}) = \lambda^\star \cdot 2\mathbf{x}^\star$, i.e. $\mathbf{x}^\star$ is collinear with $\mathbf{c}$. The optimum is $\mathbf{x}^\star = r\,\mathbf{c}/\|\mathbf{c}\|$ (project $\mathbf{c}$ onto the circle).

08

Convex Sets and Convex Functions

Convex set

$C\subseteq\mathbb{R}^d$ is convex if $\alpha\mathbf{x}+(1-\alpha)\mathbf{y}\in C$ for all $\mathbf{x},\mathbf{y}\in C$, $\alpha\in[0,1]$. Examples: half-spaces, balls, the PSD cone, the simplex.

Convex function

$f:C\to\mathbb{R}$ on a convex domain $C$ is convex if

$$f(\alpha\mathbf{x}+(1-\alpha)\mathbf{y}) \le \alpha f(\mathbf{x}) + (1-\alpha) f(\mathbf{y}).$$

Equivalent (assume differentiability):

Why convexity matters

Every local minimum is a global minimum. Many algorithms have global convergence guarantees on convex problems. Most "classical" ML — linear/logistic regression, SVMs, lasso/ridge — is convex; modern neural networks are not.

09

Linear and Quadratic Programmes

Linear programme (LP)

$$\min\;\mathbf{c}^\top\mathbf{x} \quad \text{s.t.}\quad A\mathbf{x} \le \mathbf{b}.$$

Feasible region is a polyhedron; the optimum is at a vertex. Solved by simplex (combinatorial) or interior-point (polynomial-time).

Quadratic programme (QP)

$$\min\;\tfrac12\mathbf{x}^\top Q\mathbf{x} + \mathbf{c}^\top\mathbf{x} \quad \text{s.t.}\quad A\mathbf{x} \le \mathbf{b}.$$

If $Q\succeq 0$, it's convex. The SVM (deck 12) is exactly a QP with this shape; ridge regression is a QP with no constraints.

Why ML loves QPs

Many ML losses are quadratic or piecewise-linear; many constraints are linear. The intersection — QP — is solvable to global optimality, fast, and has well-understood software.

10

The Lagrange Dual

For the constrained problem with Lagrangian $\mathcal{L}(\mathbf{x},\boldsymbol\alpha,\boldsymbol\lambda) = f(\mathbf{x}) + \sum_i\alpha_i g_i(\mathbf{x}) + \sum_j\lambda_j h_j(\mathbf{x})$, define

$$\mathcal{D}(\boldsymbol\alpha,\boldsymbol\lambda) = \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x},\boldsymbol\alpha,\boldsymbol\lambda).$$

$\mathcal D$ is the dual function; it is always concave (even if $f$ isn't convex). The dual problem:

$$\max_{\boldsymbol\alpha\ge 0,\boldsymbol\lambda}\;\mathcal D(\boldsymbol\alpha,\boldsymbol\lambda).$$

Weak and strong duality

Why SVMs solve the dual

The primal SVM is a QP in $d+1$ variables; the dual is a QP in $N$ variables (number of training points). For small $d$ and large $N$ the primal is faster; with kernels the dual is the natural place to insert $k(\mathbf{x}_i,\mathbf{x}_j)$ without ever computing the feature map.

11

Where This Lands in Part II

ConceptUsed byWhere
Gradient descentAny deep modelTraining loop
StationarityLinear regression (ch. 9)$\nabla R(\boldsymbol\theta) = 0$ gives the normal equation
Lagrange multipliersPCA (ch. 10)Maximise variance subject to unit-norm constraint
KKT & complementary slacknessSVM (ch. 12)Support vectors = active inequality constraints
Lagrange dualSVM (ch. 12)Dual SVM is a QP in $N$ vars allowing kernels
ConvexityMost of Part IIWhy each algorithm's optimum is unique & tractable
EMGMM (ch. 11)Coordinate ascent on a lower bound — another convex device
12

Cheat Sheet

IdeaStatement
GD step$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla f(\mathbf{x}_t)^\top$
Smoothness $L$$\eta \le 1/L$ ensures monotonic decrease
Condition number$\kappa = L/m$ (smoothness / strong convexity)
Lagrangian$\mathcal L = f + \sum\alpha_i g_i + \sum\lambda_j h_j$
Stationarity$\nabla f = \sum\alpha_i\nabla g_i + \sum\lambda_j\nabla h_j$
Comp. slackness$\alpha_i g_i = 0$
Convexity (1st-order)$f(\mathbf y)\ge f(\mathbf x) + \nabla f(\mathbf x)(\mathbf y - \mathbf x)$
Strong dualityholds for convex + Slater's condition
QP form$\min \tfrac12 \mathbf x^\top Q\mathbf x + \mathbf c^\top \mathbf x$ s.t. $A\mathbf x\le\mathbf b$