Kernel trick
When we set the derivative of the primal Lagrangian w.r.t. \( \vw \) to zero, we got
$$ \vw = \sum_{\nlabeledsmall=1}^\nlabeled \alpha_\nlabeledsmall y_\nlabeledsmall \vx_\nlabeledsmall $$
Note that the weight vector of the SVM, \( \vw \), depends on the observations \( \vx_\nlabeledsmall \): it is a linear combination of them, weighted by the labels \( y_\nlabeledsmall \) and the Lagrange multipliers \( \alpha_\nlabeledsmall \).
We also know that only some of these multipliers are nonzero; the corresponding examples, the so-called support vectors, are the only ones active in the calculation.
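As a concrete illustration, here is a minimal NumPy sketch. The arrays `alpha`, `y`, and `X` are made-up stand-ins for the multipliers, labels, and training inputs of an already-solved SVM; the sketch only shows how \( \vw \) is assembled from them and which examples are support vectors.

```python
import numpy as np

# Hypothetical solution of the SVM dual: these numbers are made up for illustration.
alpha = np.array([0.0, 0.7, 0.0, 0.3])   # Lagrange multipliers (most are zero)
y = np.array([+1.0, +1.0, -1.0, -1.0])   # labels in {-1, +1}
X = np.array([[1.0, 2.0],                # training inputs, one row per example
              [2.0, 3.0],
              [0.0, 0.0],
              [1.0, 0.5]])

# w = sum_m alpha_m * y_m * x_m
w = (alpha * y) @ X

# Only the examples with alpha_m > 0 (the support vectors) contribute to w.
support = np.flatnonzero(alpha > 1e-8)
print("w =", w, "support vectors:", support)
```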
Remember that our predictive model is
\begin{align}
\yhat &= \sign\left(f(\vx)\right) \\\\
&= \sign\left( \vw^T \vx + b \right) \\\\
& = \sign\left( \sum_{\nlabeledsmall=1}^\nlabeled \alpha_\nlabeledsmall y_\nlabeledsmall \vx_\nlabeledsmall^T \vx + b \right)
\end{align}
where, in the last step, we have substituted the expression for \( \vw \) obtained from the primal.
This is a very interesting (and important) form.
Note the inner product \( \vx_\nlabeledsmall^T \vx \). It is a dot product of a training example and the example to be predicted on (the test case).
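Continuing the sketch above (with a made-up bias `b` and test point `x_test`), we can check that the two forms of the score agree: the weighted sum of dot products \( \alpha_\nlabeledsmall y_\nlabeledsmall \vx_\nlabeledsmall^T \vx \) gives exactly the same value as \( \vw^T \vx + b \).

```python
b = -0.5                         # hypothetical bias term
x_test = np.array([1.5, 1.0])    # the example to be predicted on

# Form 1: score from the weight vector directly, f(x) = w^T x + b
f_primal = w @ x_test + b

# Form 2: score as a weighted sum of dot products with the training examples
f_dual = np.sum(alpha * y * (X @ x_test)) + b

print(f_primal, f_dual, np.sign(f_dual))   # the two scores coincide
```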
Because the training examples enter the prediction only through this dot product, even if we transform the training examples (as well as the test case) into some other feature space, say \( \phi(\vx) \), we can still use the SVM for prediction by merely replacing this quantity:
$$ \yhat = \sign\left( \sum_{\nlabeledsmall=1}^\nlabeled \alpha_\nlabeledsmall y_\nlabeledsmall \phi(\vx_\nlabeledsmall)^T \phi(\vx) + b \right) $$
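For instance, here is the same sketch with a made-up quadratic feature map \( \phi \). (Strictly speaking, the multipliers \( \alpha_\nlabeledsmall \) and the bias \( b \) would have to be re-learned in the new feature space; the point here is only the shape of the prediction rule.)

```python
def phi(x):
    # Hypothetical feature map: original coordinates plus (scaled) second-order terms.
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

Phi_X = np.array([phi(x_m) for x_m in X])              # transform the training examples
f_phi = np.sum(alpha * y * (Phi_X @ phi(x_test))) + b  # same rule, with phi(x) in place of x
y_hat = np.sign(f_phi)
```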
Amazing! This opens up a whole range of tricks to go beyond the linear models that we have studied so far.
This form of the predictive model is typically represented in terms of a function known as a kernel \( K(\dash{\vx}, \vx) \) in the following manner.
$$ \yhat = \sign\left( \sum_{\nlabeledsmall=1}^\nlabeled \alpha_\nlabeledsmall y_\nlabeledsmall K(\vx_\nlabeledsmall, \vx) + b \right) $$
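In code, the prediction now only ever needs \( K(\vx_\nlabeledsmall, \vx) \), never \( \phi \) itself. For the made-up quadratic map from the sketch above, the corresponding kernel has a simple closed form; a plain linear kernel \( K(\dash{\vx}, \vx) = \dash{\vx}^T \vx \) would recover the original model.

```python
def K(x1, x2):
    # Kernel matching the quadratic feature map above:
    # phi(x')^T phi(x) = x'^T x + (x'^T x)^2
    s = x1 @ x2
    return s + s ** 2

f_kernel = np.sum(alpha * y * np.array([K(x_m, x_test) for x_m in X])) + b
print(np.isclose(f_kernel, f_phi))   # same score as with the explicit feature map
```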
In the original model, the kernel is simply linear, \( K(\dash{\vx}, \vx) = \dash{\vx}^T \vx \), which corresponds to \( \phi \) being the identity map. More generally, any kernel of the form \( K(\dash{\vx}, \vx) = \phi(\dash{\vx})^T \phi(\vx) \) will do, and the options are many, as we will see next.