MML 07 — Continuous Optimisation

00

Topics We'll Cover

The general problem
Gradient descent — convergence under smoothness
Step size: fixed, line search, decay
Momentum and Nesterov acceleration
Constrained optimisation: equality & the Lagrangian
KKT conditions for inequality constraints
Interactive: Lagrange multipliers
Convex sets and convex functions
Linear and quadratic programmes
The Lagrange dual
Where this lands in Part II
Cheat sheet

01

The General Problem

The chapter studies

$$\min_{\mathbf{x}\in\mathbb{R}^d}\; f(\mathbf{x}) \quad \text{subject to} \quad g_i(\mathbf{x}) \le 0,\; h_j(\mathbf{x}) = 0,$$

where $f, g_i, h_j$ are differentiable. Three cases of practical interest:

Unconstrained: no $g$ or $h$. Solved by gradient descent / Newton.
Equality-constrained: only $h_j(\mathbf{x})=0$. Lagrange multipliers.
Constrained with inequalities: SVMs and many ML problems. KKT, projection, dual.

If $f$ is convex and the constraint set is convex, the problem is a convex programme — every local minimum is global, and powerful solvers exist (interior-point, projected gradient, ADMM…).

02

Gradient Descent — Convergence under Smoothness

The iteration:

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta\,\nabla f(\mathbf{x}_t)^\top.$$

If $f$ is $L$-smooth ($\|\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\| \le L\|\mathbf{x}-\mathbf{y}\|$) and $\eta = 1/L$:

$$f(\mathbf{x}_{t+1}) \le f(\mathbf{x}_t) - \frac{1}{2L}\|\nabla f(\mathbf{x}_t)\|^2.$$

So $f$ strictly decreases; for convex $f$ the rate is $\mathcal{O}(1/t)$.

Strong convexity

If $f$ is also $m$-strongly convex (the Hessian satisfies $H_f \succeq mI$), the rate becomes linear:

$$f(\mathbf{x}_t) - f^\star \le \left(1 - \frac{m}{L}\right)^t (f(\mathbf{x}_0) - f^\star).$$

The ratio $\kappa = L/m$ is the condition number; it controls how many iterations you need. The book's example: $f(\mathbf{x}) = \tfrac12\mathbf{x}^\top A\mathbf{x}$ has $L=\lambda_{\max}(A)$ and $m=\lambda_{\min}(A)$.

03

Step Size: Fixed, Line Search, Decay

Choice	Recipe	Notes
Fixed	$\eta_t = \eta$	Simple. Needs $\eta \le 1/L$ for guarantee. Underused in deep learning.
Diminishing	$\eta_t = c/t$ or $c/\sqrt{t}$	For stochastic gradients. The $1/\sqrt{t}$ schedule is what gives SGD's classical $\mathcal{O}(1/\sqrt{t})$ rate.
Backtracking line search	Halve $\eta$ until Armijo: $f(\mathbf{x}-\eta\mathbf{g}) \le f(\mathbf{x}) - \alpha\eta\\|\mathbf{g}\\|^2$	No prior knowledge of $L$; widely used outside deep learning.
Adam / AdamW	per-coordinate $\eta$, adaptive from history	What every transformer trainer uses; not analysed in the book but implements the same idea.

04

Momentum and Nesterov Acceleration

Heavy ball / momentum:

$$\mathbf{v}_{t+1} = \beta\mathbf{v}_t + \nabla f(\mathbf{x}_t), \qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \eta\,\mathbf{v}_{t+1}.$$

Smooths out oscillations along sharp directions and accumulates progress along flat ones.

Nesterov:

$$\mathbf{y}_t = \mathbf{x}_t + \beta(\mathbf{x}_t-\mathbf{x}_{t-1}),\qquad \mathbf{x}_{t+1} = \mathbf{y}_t - \eta\nabla f(\mathbf{y}_t).$$

Look ahead before taking the gradient. Achieves the optimal rate $\mathcal{O}(1/t^2)$ for smooth convex $f$ — provably the best first-order rate.

In one line

Plain GD: $\mathcal{O}(1/t)$. Nesterov: $\mathcal{O}(1/t^2)$. Stochastic GD: $\mathcal{O}(1/\sqrt{t})$. Strongly-convex GD: linear ($e^{-t/\kappa}$).

05

Constrained Optimisation: Equality & the Lagrangian

For $\min f(\mathbf{x})$ subject to $h(\mathbf{x})=0$, form the Lagrangian

$$\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) - \lambda\, h(\mathbf{x}).$$

At a stationary point:

$$\nabla_{\mathbf{x}}\mathcal{L} = 0 \;\Longleftrightarrow\; \nabla f(\mathbf{x}^\star) = \lambda^\star\,\nabla h(\mathbf{x}^\star), \quad h(\mathbf{x}^\star) = 0.$$

Geometrically: the gradient of $f$ must be parallel to the gradient of $h$ at the optimum. Otherwise you could move along the constraint surface and decrease $f$ further.

For multiple equality constraints, sum: $\mathcal L = f - \sum_j \lambda_j h_j$ and $\nabla f = \sum_j \lambda_j \nabla h_j$ at the optimum.

06

KKT Conditions for Inequality Constraints

For $\min f(\mathbf{x})$ s.t. $g_i(\mathbf{x})\le 0$, $h_j(\mathbf{x})=0$, the Karush–Kuhn–Tucker conditions are necessary at any local optimum (under regularity).

Stationarity: $\nabla f + \sum_i \alpha_i \nabla g_i + \sum_j \lambda_j \nabla h_j = 0$.
Primal feasibility: $g_i(\mathbf{x})\le 0$, $h_j(\mathbf{x})=0$.
Dual feasibility: $\alpha_i \ge 0$.
Complementary slackness: $\alpha_i\, g_i(\mathbf{x}) = 0$ for every $i$.

Complementary slackness is the key one: either constraint $i$ is active ($g_i=0$, multiplier $\alpha_i>0$) or it's inactive ($g_i<0$, multiplier $\alpha_i=0$). The SVM in deck 12 reads off "support vectors" from exactly this rule.

07

Interactive: Lagrange Multipliers

Minimise $f(\mathbf{x}) = (x_1-c_1)^2 + (x_2-c_2)^2$ on the constraint $h(\mathbf{x}) = x_1^2 + x_2^2 - r^2 = 0$ (the circle of radius $r$). Drag the centre of $f$; change $r$. The optimum is plotted — notice $\nabla f$ and $\nabla h$ aligned at the optimum.

constraint radius $r$

1.40

x*

—

f(x*)

—

λ*

—

The Lagrangian condition reads $2(\mathbf{x}^\star - \mathbf{c}) = \lambda^\star \cdot 2\mathbf{x}^\star$, i.e. $\mathbf{x}^\star$ is collinear with $\mathbf{c}$. The optimum is $\mathbf{x}^\star = r\,\mathbf{c}/\|\mathbf{c}\|$ (project $\mathbf{c}$ onto the circle).

08

Convex Sets and Convex Functions

Convex set

$C\subseteq\mathbb{R}^d$ is convex if $\alpha\mathbf{x}+(1-\alpha)\mathbf{y}\in C$ for all $\mathbf{x},\mathbf{y}\in C$, $\alpha\in[0,1]$. Examples: half-spaces, balls, the PSD cone, the simplex.

Convex function

$f:C\to\mathbb{R}$ on a convex domain $C$ is convex if

$$f(\alpha\mathbf{x}+(1-\alpha)\mathbf{y}) \le \alpha f(\mathbf{x}) + (1-\alpha) f(\mathbf{y}).$$

Equivalent (assume differentiability):

First-order: $f(\mathbf{y}) \ge f(\mathbf{x}) + \nabla f(\mathbf{x})(\mathbf{y}-\mathbf{x})$ — the function sits above its tangent.
Second-order: $H_f(\mathbf{x}) \succeq 0$ everywhere.

Why convexity matters

Every local minimum is a global minimum. Many algorithms have global convergence guarantees on convex problems. Most "classical" ML — linear/logistic regression, SVMs, lasso/ridge — is convex; modern neural networks are not.

09

Linear and Quadratic Programmes

Linear programme (LP)

$$\min\;\mathbf{c}^\top\mathbf{x} \quad \text{s.t.}\quad A\mathbf{x} \le \mathbf{b}.$$

Feasible region is a polyhedron; the optimum is at a vertex. Solved by simplex (combinatorial) or interior-point (polynomial-time).

Quadratic programme (QP)

$$\min\;\tfrac12\mathbf{x}^\top Q\mathbf{x} + \mathbf{c}^\top\mathbf{x} \quad \text{s.t.}\quad A\mathbf{x} \le \mathbf{b}.$$

If $Q\succeq 0$, it's convex. The SVM (deck 12) is exactly a QP with this shape; ridge regression is a QP with no constraints.

Why ML loves QPs

Many ML losses are quadratic or piecewise-linear; many constraints are linear. The intersection — QP — is solvable to global optimality, fast, and has well-understood software.

10

The Lagrange Dual

For the constrained problem with Lagrangian $\mathcal{L}(\mathbf{x},\boldsymbol\alpha,\boldsymbol\lambda) = f(\mathbf{x}) + \sum_i\alpha_i g_i(\mathbf{x}) + \sum_j\lambda_j h_j(\mathbf{x})$, define

$$\mathcal{D}(\boldsymbol\alpha,\boldsymbol\lambda) = \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x},\boldsymbol\alpha,\boldsymbol\lambda).$$

$\mathcal D$ is the dual function; it is always concave (even if $f$ isn't convex). The dual problem:

$$\max_{\boldsymbol\alpha\ge 0,\boldsymbol\lambda}\;\mathcal D(\boldsymbol\alpha,\boldsymbol\lambda).$$

Weak and strong duality

Weak duality: $\mathcal D(\boldsymbol\alpha,\boldsymbol\lambda) \le f^\star$ always. The dual provides a lower bound.
Strong duality: equality holds; the primal and dual optima coincide. Holds for convex problems under Slater's condition (a strictly feasible point exists for the inequality constraints).

Why SVMs solve the dual

The primal SVM is a QP in $d+1$ variables; the dual is a QP in $N$ variables (number of training points). For small $d$ and large $N$ the primal is faster; with kernels the dual is the natural place to insert $k(\mathbf{x}_i,\mathbf{x}_j)$ without ever computing the feature map.

11

Where This Lands in Part II

Concept	Used by	Where
Gradient descent	Any deep model	Training loop
Stationarity	Linear regression (ch. 9)	$\nabla R(\boldsymbol\theta) = 0$ gives the normal equation
Lagrange multipliers	PCA (ch. 10)	Maximise variance subject to unit-norm constraint
KKT & complementary slackness	SVM (ch. 12)	Support vectors = active inequality constraints
Lagrange dual	SVM (ch. 12)	Dual SVM is a QP in $N$ vars allowing kernels
Convexity	Most of Part II	Why each algorithm's optimum is unique & tractable
EM	GMM (ch. 11)	Coordinate ascent on a lower bound — another convex device

12

Cheat Sheet

Idea	Statement
GD step	$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla f(\mathbf{x}_t)^\top$
Smoothness $L$	$\eta \le 1/L$ ensures monotonic decrease
Condition number	$\kappa = L/m$ (smoothness / strong convexity)
Lagrangian	$\mathcal L = f + \sum\alpha_i g_i + \sum\lambda_j h_j$
Stationarity	$\nabla f = \sum\alpha_i\nabla g_i + \sum\lambda_j\nabla h_j$
Comp. slackness	$\alpha_i g_i = 0$
Convexity (1st-order)	$f(\mathbf y)\ge f(\mathbf x) + \nabla f(\mathbf x)(\mathbf y - \mathbf x)$
Strong duality	holds for convex + Slater's condition
QP form	$\min \tfrac12 \mathbf x^\top Q\mathbf x + \mathbf c^\top \mathbf x$ s.t. $A\mathbf x\le\mathbf b$