# Support vector machines

## Introduction

Support vector machines were one of the most important contributions to machine learning in the late 1990s. These maximum-margin classifiers produce accurate predictive models, even in cases that are not linearly separable, through the kernel trick that we will study shortly. In this article, we will focus on support vector classifiers, and then discuss their extension to regression problems.

## Prerequisites

To understand the support vector machine classifier, we recommend familiarity with the prerequisite concepts. Follow the corresponding links to first get acquainted with them.

## Problem setting

In classification, the goal of the predictive model is to identify the class that generated a particular instance. Support vector classifiers are binary classifiers.

Consider such an instance $\vx$, a vector consisting of $\ndim$ features, $\vx = [x_1, x_2, \ldots, x_\ndim]$. The features may be real-valued, binary, categorical, or mixed-type. We need to assign the instance to one of two classes $\set{-1,1}$.

## Predictive model

The SVM classifier model is a simple parametric model with a weight vector $\vw \in \real^\ndim$ and bias $b \in \real$.

An instance $\vx$ is classified based on the following expression into the classes $\set{-1,1}$.

\begin{align} \hat{y} &= \sign \left(f(\vx)\right) \\\\ &= \sign\left( \vw^T \vx + b \right) \label{eqn:class-pred} \end{align}
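The decision rule above is a one-liner in code. Here is a minimal sketch; the weight vector and bias are made-up values for illustration, since a trained SVM would learn them from data.

```python
import numpy as np

# Hypothetical parameters for illustration; a trained SVM would learn these.
w = np.array([2.0, -1.0])   # weight vector, one entry per feature
b = 0.5                     # bias

def predict(x, w, b):
    """Classify x into {-1, 1} via sign(w^T x + b)."""
    return 1 if np.dot(w, x) + b > 0 else -1

x = np.array([1.0, 0.5])
y_hat = predict(x, w, b)    # f(x) = 2*1.0 - 1*0.5 + 0.5 = 2.0, so y_hat = 1
```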

## Nature of the predictive model

Now, let us try to understand the effect of changing the weight vector $\vw$ and the bias $b$ on the predictive model of the support vector machine classifier.

## Observations about the predictive model

If you have tried the interactive demo of the predictive model in the previous section, you should note a few things.

• The weight vector $\vw$ is always perpendicular to the decision boundary, the so-called separating hyperplane between the orange and blue classes. The separating hyperplane is indicated with a red line. Thus, a rotation of the weight vector results in a corresponding rotation of the decision boundary.
• The weight vector $\vw$ points in the direction of increasing value of the function $\vw^T\vx + b$.
• Increasing the magnitude of the weight vector (dragging it in the same direction away from the origin) does not change the decision boundary. This intuition plays a key role in regularization, as we shall see when we study norm-based regularization in machine learning.
• The bias term $b$ has the net effect of sliding the decision boundary away from the origin. When $b = 0$, the decision boundary passes through the origin.
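The scale-invariance observation above can be checked numerically: scaling $\vw$ and $b$ by the same positive constant leaves every prediction unchanged, because $\sign(c(\vw^T\vx + b)) = \sign(\vw^T\vx + b)$ for $c > 0$. A small sketch with made-up parameters and points:

```python
import numpy as np

# Scaling w and b by the same positive factor leaves every prediction
# unchanged: sign(c * (w^T x + b)) = sign(w^T x + b) for c > 0.
w, b = np.array([2.0, -1.0]), 0.5          # hypothetical parameters
points = np.array([[1.0, 0.5], [-1.0, 2.0], [0.3, -0.7]])

preds = np.sign(points @ w + b)
preds_scaled = np.sign(points @ (10 * w) + 10 * b)

assert np.array_equal(preds, preds_scaled)
```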

## The separating hyperplane

With the predictive model defined above, the hyperplane separating the two classes is represented by the following set.

$$\sH = \set{\vx : \vw^T \vx + b = 0}$$

Why? Because the predictive model implies that when $\vw^T \vx + b < 0$, the example $\vx$ belongs to the negative class and when $\vw^T \vx + b > 0$, the example belongs to the positive class.

Let's observe some important properties of the hyperplane $\sH$ that help in the design of SVMs.

## The weight vector is orthogonal to the hyperplane

Consider any two points $\vx_1$ and $\vx_2$ on the hyperplane. Since $\vx_1 \in \sH$ and $\vx_2 \in \sH$, it is the case that

\begin{align} \vw^T \vx_1 + b &= 0 \\\\ \vw^T \vx_2 + b &= 0 \\\\ \end{align}

Subtracting the bottom relation from the top one, we get

$$\vw^T (\vx_1 - \vx_2) = 0$$

Now, note that $\vx_1 - \vx_2$ is a vector that lies in the hyperplane $\sH$. For the dot product to be zero, the weight vector $\vw$ must be orthogonal to this vector. Since $\vx_1$ and $\vx_2$ were arbitrary points of $\sH$, we conclude that $\vw$ is orthogonal to the hyperplane $\sH$.

More specifically, the unit vector $\frac{\vw}{||\vw||}$ is normal to the surface of the hyperplane $\sH$.
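A quick numerical check of this orthogonality, using a made-up hyperplane for illustration:

```python
import numpy as np

# Numerical check that w is orthogonal to the hyperplane w^T x + b = 0.
# For w = [1, 1], b = -1, two points on the hyperplane are
# x1 = [1, 0] and x2 = [0, 1] (both satisfy w^T x + b = 0).
w, b = np.array([1.0, 1.0]), -1.0
x1, x2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

assert np.isclose(w @ x1 + b, 0.0) and np.isclose(w @ x2 + b, 0.0)
# The vector x1 - x2 lies in the hyperplane; its dot product with w vanishes.
assert np.isclose(w @ (x1 - x2), 0.0)
```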

## Distance of point from the hyperplane

Note that for any point $\vx_0$ on the hyperplane $\sH$, it is the case that $\vw^T \vx_0 + b = 0$. By rearranging, the following expression is true for any point $\vx_0$ on the hyperplane $\sH$.

\begin{equation} \vw^T \vx_0 = -b \label{eqn:x0} \end{equation}

Consider another point $\vx$, which may or may not be on the hyperplane. By basic geometry, the distance of this point to the hyperplane is

\begin{equation} \text{dist}(\vx,\sH) = \frac{\vw^T}{\norm{\vw}{}} (\vx - \vx_0) \label{eqn:distance} \end{equation}

Why? Because $\vx - \vx_0$ is the vector joining the two points. The above quantity is the projection of $\vx - \vx_0$ along the unit direction $\frac{\vw}{\norm{\vw}{}}$.

We showed in the previous section that the unit vector $\frac{\vw^T}{\norm{\vw}{}}$ is orthogonal to the hyperplane $\sH$. Therefore, the projection must be the shortest distance between the point $\vx$ and the plane $\sH$.

Further expanding Equation \eqref{eqn:distance}, we get

\begin{equation} \text{dist}(\vx,\sH) = \frac{\vw^T}{\norm{\vw}{}} (\vx - \vx_0) = \frac{1}{\norm{\vw}{}} (\vw^T \vx + b) \label{eqn:distance-2} \end{equation}

where, we have used the fact that $\vw^T \vx_0 = -b$ from Equation \eqref{eqn:x0}.
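Equation \eqref{eqn:distance-2} can be checked numerically. The parameters below are made up for illustration:

```python
import numpy as np

# Signed distance of x from the hyperplane H: (w^T x + b) / ||w||.
w, b = np.array([3.0, 4.0]), -5.0   # hypothetical parameters; ||w|| = 5

def signed_dist(x, w, b):
    return (w @ x + b) / np.linalg.norm(w)

# The origin is (3*0 + 4*0 - 5) / 5 = -1, i.e. one unit from the
# hyperplane, on the negative side.
d = signed_dist(np.array([0.0, 0.0]), w, b)   # -1.0
# A point on the hyperplane itself has distance zero: 3*1 + 4*0.5 - 5 = 0.
d_on = signed_dist(np.array([1.0, 0.5]), w, b)   # 0.0
```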

## Understanding distance of point from hyperplane

Now that we know the math, let's see this in practice.

## Designing the first model constraint

Intuitively, we want the separating hyperplane $\sH$ to be inferred such that all training instances are on the correct side of the hyperplane. This constraint ensures that all training instances are correctly classified.

This requirement suggests that for any instance $(\vx_\nlabeledsmall, y_\nlabeledsmall)$ in the training set

\begin{equation} y_\nlabeledsmall = \sign \left( \vw^T\vx_\nlabeledsmall + b \right) \end{equation}

In other words, the first requirement is satisfied if

\begin{equation} y_\nlabeledsmall \left( \vw^T\vx_\nlabeledsmall + b \right) > 0 \label{eqn:first-condition} \end{equation}

## Designing the second model constraint

As a stretch goal, we also want the classifier to generalize better on future examples. We can hope to achieve this goal if all the training instances are comfortably classified by the hyperplane. In other words, all training instances are as far as possible from the hyperplane, on their correct side of the classifier.

For the second condition, suppose we want each training instance to be at least $D$ units away from the hyperplane $\sH$, on its correct side. Since the distance in Equation \eqref{eqn:distance-2} is signed, multiplying it by the label gives the requirement

$$y_\nlabeledsmall \, \text{dist}(\vx_\nlabeledsmall, \sH) \ge D, ~~~~ \forall \nlabeledsmall \in \set{1,\ldots,\nlabeled}$$

Because we know from the previous section that $\text{dist}(\vx,\sH)$ is $\frac{ \vw^T\vx + b }{||\vw||}$, we can rewrite this constraint. For all $\nlabeledsmall \in \set{1,\ldots,\nlabeled}$,

\begin{aligned} & y_\nlabeledsmall \frac{ \vw^T\vx_\nlabeledsmall + b }{||\vw||} \ge D \\\\ \implies& y_\nlabeledsmall \left( \vw^T\vx_\nlabeledsmall + b \right) \ge D||\vw|| \\\\ \implies& y_\nlabeledsmall \left( \vw^T\vx_\nlabeledsmall + b \right) \ge 1 \\\\ \end{aligned}

where, in the last step, we have set $D = \frac{1}{||\vw||}$.

The quantity $D$ is known as the margin of the classifier since there are no training points within the margin of the hyperplane. For a good generalizable classifier, we want to maximize the margin.

## Model constraints

Taken together, the two constraints imply that

$$y_\nlabeledsmall\left(\vw^T\vx_\nlabeledsmall + b\right) \ge 1, ~~~~ \forall \nlabeledsmall \in \set{1,\ldots,\nlabeled}$$

where, we have assumed that the minimum distance of every training point from the hyperplane is $D = \frac{1}{||\vw||}$.

Since we wish to maximize the distance $D$, we need to minimize $||\vw||$.

This leads to the optimization problem for support vector classification

\begin{equation} \min_{\vw,b} ||\vw|| \\\\ \text{subject to } y_\nlabeledsmall (\vw^T \vx_\nlabeledsmall + b) \ge 1, ~~~~\forall \nlabeledsmall \in \set{1,\ldots,\nlabeled} \end{equation}

## Understanding model constraints

Now that we know the math, let's see if you can discover the optimal hyperplane — manually!

In the classification problem below, there are 4 training points: $p_1, p_2, p_3,$ and $p_4$. Of these, $p_1$ and $p_2$ belong to the negative class, and the others belong to the positive class. Adjust $\vw$ (by dragging the arrowhead) and $b$ (with the slider) to arrive at an optimal classifier. As defined earlier, the optimal classifier maximizes the margin (minimizes the value of $||\vw||$), while still satisfying all constraints. Satisfied constraints are marked green and violated constraints are marked red.

Feel free to move the training points around to clearly understand linear separability — the scenarios in which the training examples can be correctly classified by a linear separating hyperplane, and those in which they cannot.

## The inseparable case

If you have played enough with the interactive demo in the previous section, you know that it is impossible to satisfy all constraints if the problem is not linearly separable. Thus, in non-linearly separable scenarios, the optimization problem is infeasible: no choice of $\vw$ and $b$ satisfies all the constraints.

To deal with inseparable cases, we need to relax the strict condition above. How? Simple: by allowing some errors.

$$y_\nlabeledsmall (\vw^T \vx_\nlabeledsmall + b) \ge 1 - \xi_\nlabeledsmall, ~~~~\forall \nlabeledsmall \in \set{1,\ldots,\nlabeled}$$

The variables $\xi_\nlabeledsmall$ are known as slack variables because they loosen, or provide slack in, the strict constraint above.

Now we need two conditions on the slack variables.

1. They are all required to be nonnegative; so $\xi_\nlabeledsmall \ge 0$. This ensures that they actually act as slack and do not instead make the constraint stricter.
2. We want only the hardest to classify examples, the truly inseparable ones, to have nonzero slack. This means, we want to limit the total slack allowed across all training examples.

Thus, in the inseparable case, we have a more refined optimization problem for the support vector classifier.

\begin{equation} \min_{\vw,b} \norm{\vw}{} + C \sum_{\nlabeledsmall = 1 }^\nlabeled \xi_\nlabeledsmall \\\\ \text{subject to }~~ y_\nlabeledsmall (\vw^T \vx_\nlabeledsmall + b) \ge 1 - \xi_\nlabeledsmall, ~~~~\forall \nlabeledsmall \in \set{1,\ldots,\nlabeled} \\\\ \text{ and }~~ \xi_\nlabeledsmall \ge 0, ~~~~\forall \nlabeledsmall \in \set{1,\ldots,\nlabeled} \end{equation}

where $C$ is a tunable hyperparameter of the model. Higher values of $C$ penalize the slack variables more heavily, and lower values of $C$ allow for more slack.
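The constrained problem above can equivalently be written as an unconstrained minimization of the hinge loss, since at the optimum each slack takes the value $\xi_\nlabeledsmall = \max(0, 1 - y_\nlabeledsmall(\vw^T\vx_\nlabeledsmall + b))$. A minimal subgradient-descent sketch of this, using the squared-norm variant of the objective and a made-up, linearly separable dataset:

```python
import numpy as np

# Soft-margin SVM trained by subgradient descent on the hinge-loss
# objective (1/2)||w||^2 + C * sum(max(0, 1 - y * (w^T x + b))).
# The dataset below is made up for illustration.
X = np.array([[2.0, 2.0], [1.5, 1.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C, lr, steps = 1.0, 0.01, 5000

w, b = np.zeros(2), 0.0
for _ in range(steps):
    margins = y * (X @ w + b)
    viol = margins < 1                            # points violating the margin
    grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

preds = np.sign(X @ w + b)   # all four training points end up correctly classified
```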

## Understanding model constraints: inseparable case

Now that we know the math, let's see if you can discover the optimal hyperplane in the presence of slack variables — manually!

In the classification problem below, there are 4 training points: $p_1, p_2, p_3,$ and $p_4$. Of these, $p_1$ and $p_2$ belong to the negative class, and the others belong to the positive class. Adjust $\vw$ (by dragging the arrowhead) and $b$ (with the slider) to arrive at an optimal classifier.

As defined earlier, the optimal classifier minimizes the loss — $||\vw|| + C\sum_{\nlabeledsmall} \xi_\nlabeledsmall$, while still satisfying all constraints. Satisfied constraints are marked green and violated constraints are marked red.

Feel free to move the training points around to clearly understand the impact on slack variables for the inseparable scenarios.

Notice how the slack variables ensure that the constraints are satisfied no matter where you place the training points with respect to the hyperplane. The only way to guide the hyperplane now is to minimize the loss. The optimizer can now converge to the minimum loss in spite of misclassified training points.

Also, notice the effect of $C$. Higher values of $C$ increase the loss contributions of the slack variables. This will ensure that the optimization algorithm will try to find the hyperplane that minimizes the values of the slacks.

## Solving constrained optimization: Primal

Now that the optimization problem has been set up, training involves solving for the solution. Being a constrained optimization problem, we find the solution using the method of Lagrange multipliers that we have studied before.

The Lagrange primal function of the above optimization problem, using the equivalent differentiable objective $\frac{1}{2}\norm{\vw}{}^2$ in place of $\norm{\vw}{}$, is

\begin{align} L_{\text{primal}} = \frac{1}{2}\norm{\vw}{}^2 &+ C \sum_{\nlabeledsmall = 1 }^\nlabeled \xi_\nlabeledsmall \\\\ &- \sum_{\nlabeledsmall = 1 }^{\nlabeled} \alpha_\nlabeledsmall \left[y_\nlabeledsmall (\vw^T \vx_\nlabeledsmall + b) - (1 - \xi_\nlabeledsmall)\right] \\\\ &- \sum_{\nlabeledsmall=1}^\nlabeled \mu_\nlabeledsmall \xi_\nlabeledsmall \label{eqn:primal} \end{align}

The first line of this primal is the function to be optimized and the latter two lines are the constraints along with their Lagrangian multipliers $\alpha_\nlabeledsmall$ and $\mu_\nlabeledsmall$.

## Solving constrained optimization: Dual

We need to minimize the primal objective w.r.t. $\vw$, $b$, and $\xi_\nlabeledsmall$. Setting the respective derivatives to zero, we get

\begin{align} \vw &= \sum_{\nlabeledsmall=1}^\nlabeled \alpha_\nlabeledsmall y_\nlabeledsmall \vx_\nlabeledsmall \\\\ 0 &= \sum_{\nlabeledsmall=1}^\nlabeled \alpha_\nlabeledsmall y_\nlabeledsmall \\\\ \alpha_\nlabeledsmall &= C - \mu_\nlabeledsmall, \forall \nlabeledsmall \end{align}

We also get the positivity constraints on the Lagrange multipliers $\alpha_\nlabeledsmall$ and $\mu_\nlabeledsmall$ as well as the slack variables $\xi_\nlabeledsmall$, for all $\nlabeledsmall$.

By substituting these identities into the primal, we get the dual objective function

$$L_{\text{dual}} = \sum_{\nlabeledsmall=1}^\nlabeled \alpha_\nlabeledsmall - \frac{1}{2} \sum_{\nlabeledsmall = 1}^\nlabeled \sum_{\dash{\nlabeledsmall}=1}^{\nlabeled} \alpha_\nlabeledsmall \alpha_\dash{\nlabeledsmall} y_\nlabeledsmall y_\dash{\nlabeledsmall} \vx_\nlabeledsmall^T \vx_\dash{\nlabeledsmall}$$

Typically, it is easier to optimize the SVM in its dual form.
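To make the dual concrete, here is a numerical evaluation of $L_{\text{dual}}$ for illustrative (not optimal) multipliers on a tiny made-up dataset, vectorized with the Gram matrix of pairwise dot products:

```python
import numpy as np

# Evaluating the dual objective
#   L_dual = sum(alpha) - (1/2) * sum_{l, l'} alpha_l alpha_l' y_l y_l' x_l^T x_l'
# using the Gram matrix G = X X^T. All values below are made up.
X = np.array([[2.0, 2.0], [1.5, 1.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.0, 0.2, 0.2, 0.0])    # chosen so that sum(alpha * y) = 0

G = X @ X.T                                # G[l, l'] = x_l^T x_l'
v = alpha * y
L_dual = alpha.sum() - 0.5 * v @ G @ v     # = 0.4 - 0.25 = 0.15
```

The constraint $\sum_\nlabeledsmall \alpha_\nlabeledsmall y_\nlabeledsmall = 0$ from the previous section holds for these multipliers by construction.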

## Support vectors

We know that the Lagrange multipliers $\alpha_\nlabeledsmall \ge 0$. But are they always nonzero? Let's find out.

We have studied in the comprehensive article on Karush-Kuhn-Tucker (KKT) conditions that for the dual defined in previous section, the following constraints hold

\begin{align} \alpha_\nlabeledsmall\left[ y_\nlabeledsmall(\vx_\nlabeledsmall^T \vw + b) - (1 - \xi_\nlabeledsmall) \right] &= 0 \\\\ \mu_\nlabeledsmall \xi_\nlabeledsmall &= 0 \\\\ y_\nlabeledsmall(\vx_\nlabeledsmall^T \vw + b) - (1 - \xi_\nlabeledsmall) &\ge 0 \end{align}

The first and third KKT conditions imply that,

• $\alpha_\nlabeledsmall$ is zero for some $\nlabeledsmall$.
• In fact, it is zero exactly for those labeled examples that satisfy the inequality in the third KKT condition strictly. That means,

$$y_\nlabeledsmall(\vx_\nlabeledsmall^T \vw + b) - (1 - \xi_\nlabeledsmall) > 0 \Rightarrow \alpha_\nlabeledsmall = 0$$

Only this way can the first KKT condition hold true.

This further implies that $\alpha_\nlabeledsmall$ can be nonzero only for those examples that satisfy the third KKT condition with an equality. That means,

$$\alpha_\nlabeledsmall \ne 0 \Rightarrow y_\nlabeledsmall(\vx_\nlabeledsmall^T \vw + b) - (1 - \xi_\nlabeledsmall) = 0$$

What does that mean? It means, $\alpha_\nlabeledsmall \ne 0$ only for examples that satisfy

$$y_\nlabeledsmall(\vx_\nlabeledsmall^T \vw + b) = (1 - \xi_\nlabeledsmall)$$

Well, we know that these are examples on the margin of the separating hyperplane! So, $\alpha_\nlabeledsmall$ is nonzero only for examples on the margin and zero for the remaining examples that are well outside the margin of the separating hyperplane.

These examples, with nonzero $\alpha_\nlabeledsmall$, are, in some sense, supporting the hyperplane. In their absence, the hyperplane may move. Thus, the examples with nonzero $\alpha_\nlabeledsmall$ are known as the support vectors of the classifier, and the classifier itself is known as the support vector classifier.

## Kernel trick

When we set the derivative of the primal w.r.t. $\vw$ to zero, we got

$$\vw = \sum_{\nlabeledsmall=1}^\nlabeled \alpha_\nlabeledsmall y_\nlabeledsmall \vx_\nlabeledsmall$$

Note that the weight vector of the SVM, $\vw$ is dependent on the observations $\vx_\nlabeledsmall$. Moreover, the dependence is based on the value of the Lagrange multipliers $\alpha_\nlabeledsmall$. We also now know that only some of them, the so-called support vectors, will be active in the calculation.

Remember that our predictive model is

\begin{align} \yhat &= \sign\left(f(\vx)\right) \\\\ &= \sign\left( \vw^T \vx + b \right) \\\\ & = \sign\left( \sum_{\nlabeledsmall=1}^\nlabeled \alpha_\nlabeledsmall y_\nlabeledsmall \vx_\nlabeledsmall^T \vx + b \right) \end{align}

where, in the last step, we have substituted the expression for $\vw$ obtained from the primal.

This is a very interesting (and important) form.

Note the inner product $\vx_\nlabeledsmall^T \vx$. It is a dot product of a training example and the example to be predicted on (the test case). This means, even if we transform the training examples (as well as the test case) into some other feature space, say $\phi(\vx)$, we can still use SVMs for prediction by merely replacing this quantity as

$$\yhat = \sign\left( \sum_{\nlabeledsmall=1}^\nlabeled \alpha_\nlabeledsmall y_\nlabeledsmall \phi(\vx_\nlabeledsmall)^T \phi(\vx) + b \right)$$

Amazing! This opens up a whole range of tricks to go beyond the linear models that we have studied so far. This form of the predictive model is typically represented in terms of a function known as a kernel, $K(\dash{\vx}, \vx)$, in the following manner.

$$\yhat = \sign\left( \sum_{\nlabeledsmall=1}^\nlabeled \alpha_\nlabeledsmall y_\nlabeledsmall K(\vx_\nlabeledsmall, \vx) + b \right)$$

In this case, our kernel is linear because $\phi$ is the identity map, so that $K(\dash{\vx},\vx) = \dash{\vx}^T \vx$. But the options are many, as we will see next.
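In code, kernelized prediction needs only the kernel values between the support vectors and the test point. A minimal sketch, where the support vectors, multipliers, and bias are all made-up placeholders rather than the output of an actual solver:

```python
import numpy as np

# Kernelized prediction: yhat = sign(sum_l alpha_l y_l K(x_l, x) + b).
# Support vectors, multipliers, and bias are hypothetical placeholders.
def linear_kernel(xa, xb):
    return xa @ xb

def predict(x, X_sv, y_sv, alpha, b, kernel):
    s = sum(a * yl * kernel(xl, x) for a, yl, xl in zip(alpha, y_sv, X_sv))
    return 1 if s + b > 0 else -1

X_sv = np.array([[1.5, 1.0], [-1.0, -1.5]])   # hypothetical support vectors
y_sv = np.array([1.0, -1.0])
alpha = np.array([0.2, 0.2])
b = 0.0

pred = predict(np.array([1.0, 1.0]), X_sv, y_sv, alpha, b, linear_kernel)   # 1
```

Swapping `linear_kernel` for any other kernel function changes the model without touching the prediction code.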

## Polynomial kernel

Owing to the kernel trick, a popular kernel is the $d$-th degree polynomial kernel

$$K(\dash{\vx},\vx) = \left(1 + \vx^T\dash{\vx} \right)^d$$

On expansion, a kernel such as this will have elements that will include cross-term interactions resulting in a nonlinear model.

For example, for a two-dimensional input $\vx \in \real^2$, such that $\vx = [x_1,x_2]$ and $\dash{\vx} = [\dash{x_1}, \dash{x_2}]$, with a two-degree polynomial, so that $d=2$, we will have terms such as

\begin{align} K(\dash{\vx},\vx) &= (1 + x_1\dash{x_1} + x_2\dash{x_2})^2 \\\\ &= 1 + 2x_1\dash{x_1} + 2x_2\dash{x_2} + (x_1\dash{x_1})^2 + (x_2\dash{x_2})^2 + 2x_1\dash{x_1}x_2\dash{x_2} \end{align}

Due to these cross-term interactions involving both $x_1$ and $x_2$, the resulting SVM model will be nonlinear.
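We can verify this expansion numerically: the degree-2 polynomial kernel equals an ordinary dot product of the explicit feature maps $\phi(\vx) = [1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1x_2]$, which is exactly what the kernel trick exploits. The test points are made up.

```python
import numpy as np

# Degree-2 polynomial kernel (1 + x^T x')^2 equals the dot product of
# explicit feature maps phi(x) = [1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2].
def poly_kernel(xa, xb):
    return (1 + xa @ xb) ** 2

def phi(x):
    s = np.sqrt(2)
    return np.array([1, s * x[0], s * x[1], x[0] ** 2, x[1] ** 2, s * x[0] * x[1]])

x, xp = np.array([0.5, -1.0]), np.array([2.0, 1.5])
assert np.isclose(poly_kernel(x, xp), phi(x) @ phi(xp))
```

The kernel computes this 6-dimensional inner product at the cost of a 2-dimensional one.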

## Dealing with feature types

Note that the predictive model involves a simple dot product between the weight vector $\vw$ and the instance $\vx$. This is easy for binary and continuous features since both can be treated as real-valued features.

In the case of categorical features, a direct dot product is not meaningful. Therefore, we need to first preprocess the categorical variables using one-hot encoding to arrive at a binary feature representation.
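One-hot encoding maps each category to its own binary feature. A minimal sketch in plain Python; the category list is purely illustrative:

```python
# One-hot encoding a categorical feature into binary features.
# The categories below are made up for illustration.
categories = ["red", "green", "blue"]

def one_hot(value, categories):
    """Return a binary indicator vector with a 1 at the position of value."""
    return [1.0 if value == c else 0.0 for c in categories]

vec = one_hot("green", categories)   # [0.0, 1.0, 0.0]
```

The resulting binary vector can then enter the dot product $\vw^T\vx$ like any real-valued feature.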