The maximum-margin hyperplane, the dual quadratic programme, support vectors, and the kernel trick — all four ideas in one demo.
Data $\{(\mathbf{x}_n, y_n)\}_{n=1}^N$ with $\mathbf{x}_n\in\mathbb{R}^D$ and $y_n\in\{-1, +1\}$. We want a classifier of the form
$$\hat y(\mathbf{x}) = \mathrm{sign}\bigl(\mathbf{w}^\top\mathbf{x} + b\bigr).$$
The set $\{\mathbf{x} : \mathbf{w}^\top\mathbf{x}+b = 0\}$ is an affine hyperplane (deck 02). When the classes are linearly separable, infinitely many hyperplanes work; SVMs pick the one with the largest margin.
With $y_n\in\{-1,+1\}$, "correct classification" is exactly $y_n(\mathbf{w}^\top\mathbf{x}_n+b) > 0$. The product trick (instead of $\{0,1\}$) is what makes the whole SVM algebra clean.
The (signed) distance from $\mathbf{x}_n$ to the hyperplane $\mathbf{w}^\top\mathbf{x}+b=0$ is
$$\frac{\mathbf{w}^\top\mathbf{x}_n+b}{\|\mathbf{w}\|} \;\in\; \mathbb{R}.$$
Multiplying by $y_n$ makes correct classifications positive. The geometric margin of the training set is
$$\gamma = \min_n \;\frac{y_n(\mathbf{w}^\top\mathbf{x}_n+b)}{\|\mathbf{w}\|}.$$
By rescaling $(\mathbf{w}, b)$ we can assume the closest points satisfy $y_n(\mathbf{w}^\top\mathbf{x}_n + b) = 1$ — the canonical form. Then $\gamma = 1/\|\mathbf{w}\|$ and "maximise margin" $\iff$ "minimise $\|\mathbf{w}\|$".
$$\min_{\mathbf{w}, b}\;\tfrac12\|\mathbf{w}\|^2 \quad \text{s.t.}\quad y_n(\mathbf{w}^\top\mathbf{x}_n + b) \ge 1,\;\; n=1,\dots,N.$$
This is a convex quadratic programme (deck 07): quadratic objective, linear inequality constraints. The objective $\tfrac12\|\mathbf{w}\|^2$ is strictly convex; the optimum is unique.
The fix is the soft-margin SVM — slide 04.
Allow violations $\xi_n \ge 0$ and penalise them:
$$\min_{\mathbf{w},b,\boldsymbol\xi}\;\tfrac12\|\mathbf{w}\|^2 + C\sum_n \xi_n \quad \text{s.t.}\quad y_n(\mathbf{w}^\top\mathbf{x}_n+b)\ge 1-\xi_n,\;\;\xi_n\ge 0.$$
$C>0$ is a hyperparameter: large $C$ means few violations (closer to hard-margin), small $C$ means a wider margin even at the cost of misclassifying training points.
At the optimum, $\xi_n = \max\bigl(0, 1 - y_n(\mathbf{w}^\top\mathbf{x}_n+b)\bigr)$. So the SVM solves
$$\min_{\mathbf{w},b}\;\sum_n \underbrace{\max\bigl(0, 1 - y_n(\mathbf{w}^\top\mathbf{x}_n+b)\bigr)}_{\text{hinge loss}} + \frac{1}{2C}\|\mathbf{w}\|^2.$$
This is the canonical "loss + L2 regularisation" form — same shape as ridge regression, with the hinge instead of squared loss.
Form the Lagrangian (deck 07) with multipliers $\alpha_n \ge 0$ for the inequality constraints and $\mu_n\ge 0$ for $\xi_n\ge 0$:
$$\mathcal L = \tfrac12\|\mathbf{w}\|^2 + C\sum_n \xi_n - \sum_n \alpha_n[y_n(\mathbf{w}^\top\mathbf{x}_n+b) - 1 + \xi_n] - \sum_n \mu_n \xi_n.$$
Setting $\nabla_{\mathbf{w}} = 0$, $\partial_b = 0$, $\partial_{\xi_n} = 0$:
$$\mathbf{w} = \sum_n \alpha_n y_n \mathbf{x}_n, \qquad \sum_n \alpha_n y_n = 0, \qquad \alpha_n + \mu_n = C.$$
Substituting back yields the dual problem:
$$\max_{\boldsymbol\alpha}\;\sum_n \alpha_n - \tfrac12 \sum_{m,n} \alpha_m\alpha_n y_m y_n \mathbf{x}_m^\top\mathbf{x}_n \quad \text{s.t.}\;\; 0\le\alpha_n\le C,\;\sum_n \alpha_n y_n = 0.$$
Convex QP in $N$ variables; strong duality (deck 07) holds because the primal is convex and Slater's condition is satisfied.
Complementary slackness (deck 07) gives, at the optimum:
From $\mathbf{w} = \sum_n \alpha_n y_n \mathbf{x}_n$, the prediction is
$$\hat y(\mathbf{x}) = \mathrm{sign}\Bigl(\sum_n \alpha_n y_n\,\mathbf{x}_n^\top\mathbf{x} + b\Bigr).$$
Only support vectors (those with $\alpha_n > 0$) appear in the sum. The SVM's representation is sparse in the training set.
In every formula above, $\mathbf{x}$ only ever appears as an inner product $\mathbf{x}_m^\top\mathbf{x}_n$. Replace it with any positive-definite kernel $k(\mathbf{x}_m, \mathbf{x}_n)$:
$$\max_{\boldsymbol\alpha}\;\sum_n \alpha_n - \tfrac12 \sum_{m,n} \alpha_m\alpha_n y_m y_n\,k(\mathbf{x}_m, \mathbf{x}_n).$$
The classifier becomes
$$\hat y(\mathbf{x}) = \mathrm{sign}\Bigl(\sum_{n\in\mathrm{SV}}\alpha_n y_n\,k(\mathbf{x}_n, \mathbf{x}) + b\Bigr).$$
Mercer's theorem says a kernel $k$ has the form $k(\mathbf{x},\mathbf{x}') = \boldsymbol\phi(\mathbf{x})^\top\boldsymbol\phi(\mathbf{x}')$ for some (possibly infinite-dimensional) feature map $\boldsymbol\phi$, as long as the kernel matrix $K_{ij} = k(\mathbf{x}_i,\mathbf{x}_j)$ is positive semi-definite for all data sets. The SVM never computes $\boldsymbol\phi$; it only needs $K$.
It lets a linear algorithm operate in an arbitrarily complicated feature space at the cost of a kernel evaluation per pair of points — no representation of the feature space is ever materialised. Gaussian processes (kernel ridge regression), kernel PCA and kernel logistic regression are the same trick applied to different objectives.
| Name | Definition | What it does |
|---|---|---|
| Linear | $k(\mathbf x,\mathbf x') = \mathbf x^\top\mathbf x'$ | Identical to the un-kernelised SVM; baseline. |
| Polynomial | $k(\mathbf x,\mathbf x') = (\mathbf x^\top\mathbf x' + c)^p$ | Decision boundary is a degree-$p$ polynomial. Be careful with $p>3$ — conditioning suffers. |
| RBF / Gaussian | $k(\mathbf x,\mathbf x') = e^{-\|\mathbf x-\mathbf x'\|^2 / (2\sigma^2)}$ | Infinite-dimensional feature space; classifier is a smooth bump per support vector. The most popular default. |
Cross-validation (deck 08) picks the sweet spot.
Hold the green or pink buttons to add points of each class. Choose a kernel and play with $C$ and (for RBF) $\sigma$. The decision boundary, the margin and the support vectors update live. The dual is solved by projected coordinate ascent under the box constraints.
Click outside the canvas to deactivate. Try the spiral / XOR pattern with linear vs RBF kernels: linear can't separate it, RBF can — the kernel trick at work.
The dual QP has $N$ variables, with one equality constraint and box constraints $0\le\alpha_n\le C$. General-purpose QP solvers are $\mathcal O(N^3)$ — too slow for $N$ in the tens of thousands.
SMO is what libsvm and sklearn run by default. For very large data, coordinate descent on the primal (LibLinear) is faster — gives up some accuracy in exchange for speed.
The biggest cost is $K\in\mathbb{R}^{N\times N}$. Production SVMs use working-set methods and cache only the rows for active variables. For $N$ in the millions you stop using kernel SVMs and switch to deep networks or kernel approximations (Nyström, random features).
| Object | Formula |
|---|---|
| Decision function | $\hat y(\mathbf x) = \mathrm{sign}(\mathbf w^\top\mathbf x + b)$ |
| Geometric margin | $\gamma = 1/\|\mathbf w\|$ (canonical form) |
| Hard-margin primal | $\min \tfrac12\|\mathbf w\|^2$ s.t. $y_n(\mathbf w^\top\mathbf x_n+b)\ge 1$ |
| Soft-margin primal | $\min \tfrac12\|\mathbf w\|^2 + C\sum\xi_n$ s.t. margin−slack |
| Hinge loss | $\max(0, 1 - y(\mathbf w^\top\mathbf x + b))$ |
| Dual QP | $\max\sum\alpha_n - \tfrac12\sum\alpha_m\alpha_n y_m y_n k(\mathbf x_m, \mathbf x_n)$ |
| Constraint | $0\le\alpha_n\le C$, $\sum\alpha_n y_n = 0$ |
| Weights from dual | $\mathbf w = \sum_n \alpha_n y_n \mathbf x_n$ |
| Kernel classifier | $\hat y(\mathbf x) = \mathrm{sign}(\sum_{\text{SV}} \alpha_n y_n k(\mathbf x_n, \mathbf x) + b)$ |
| RBF kernel | $\exp(-\|\mathbf x-\mathbf x'\|^2/(2\sigma^2))$ |
You've worked through every chapter. Part I built the language: vectors, geometry, matrix decompositions, calculus, probability and optimisation. Part II turned each of those tools into an algorithm: regression as a projection, PCA as eigenvectors, GMM as EM on a latent model, SVM as a kernel-aware quadratic programme. The same toolkit reappears, restructured, in every model that came after — right up to today's transformers.