## Kernel trick

When we set the derivative of the primal w.r.t. \( \vw \) to zero, we got

$$ \vw = \sum_{\nlabeledsmall=1}^\nlabeled \alpha_\nlabeledsmall y_\nlabeledsmall \vx_\nlabeledsmall $$

Note that the weight vector of the SVM, \( \vw \), depends on the observations \( \vx_\nlabeledsmall \).
Moreover, the dependence is weighted by the Lagrange multipliers \( \alpha_\nlabeledsmall \).
We also now know that only some of these multipliers are nonzero; the corresponding observations, the so-called support vectors, are the only ones active in the calculation.
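As a quick illustration, here is a minimal numpy sketch of this reconstruction. The arrays `alphas`, `y`, and `X` are hypothetical outputs of an already-trained SVM, made up for this example; they are not part of the derivation above.

```python
import numpy as np

# Hypothetical quantities from a trained SVM: Lagrange multipliers `alphas`,
# labels `y` in {-1, +1}, and training inputs `X` (one row per example).
alphas = np.array([0.0, 0.7, 0.0, 0.3])   # only two multipliers are nonzero
y      = np.array([+1, -1, +1, +1])
X      = np.array([[1.0, 2.0],
                   [2.0, 0.5],
                   [0.0, 1.0],
                   [1.5, 1.5]])

# w = sum_i alpha_i * y_i * x_i : a weighted sum of the observations.
# Examples with alpha = 0 contribute nothing, so only support vectors matter.
w = (alphas * y) @ X
print(w)
```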

Remember that our predictive model is

\begin{align}
\yhat &= \sign\left(f(\vx)\right) \\\\
&= \sign\left( \vw^T \vx + b \right) \\\\
& = \sign\left( \sum_{\nlabeledsmall=1}^\nlabeled \alpha_\nlabeledsmall y_\nlabeledsmall \vx_\nlabeledsmall^T \vx + b \right)
\end{align}

where, in the last step, we have substituted the expression for \( \vw \) obtained from the primal.

This is a very interesting (and important) form.

Note the inner product \( \vx_\nlabeledsmall^T \vx \). It is a dot product between a training example and the example to be predicted on (the test case).
This means that even if we transform the training examples (as well as the test case) into some other *feature space*, say via \( \phi(\vx) \), we can still use the SVM for prediction by merely replacing this inner product:

$$ \yhat = \sign\left( \sum_{\nlabeledsmall=1}^\nlabeled \alpha_\nlabeledsmall y_\nlabeledsmall \phi(\vx_\nlabeledsmall)^T \phi(\vx) + b \right) $$
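To make this substitution concrete, here is a small sketch with an illustrative quadratic feature map for 2-D inputs (chosen for this example, not taken from the text above). Note that the dot product in feature space can be computed directly from the original inputs, which is exactly the shortcut the kernel will give us next.

```python
import numpy as np

def phi(x):
    """Illustrative quadratic feature map for 2-D inputs:
    phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x_train = np.array([1.0, 2.0])
x_test  = np.array([0.5, -1.0])

# Dot product computed explicitly in the feature space ...
explicit = phi(x_train) @ phi(x_test)
# ... equals a simple function of the original inputs: (x_train^T x_test)^2.
implicit = (x_train @ x_test) ** 2
print(explicit, implicit)   # both are 2.25
```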

Amazing! This opens up a whole range of tricks to go beyond the linear models we have studied so far.
This form of the predictive model is typically written in terms of a function known as a **kernel**, \( K(\dash{\vx}, \vx) \), in the following manner.

$$ \yhat = \sign\left( \sum_{\nlabeledsmall=1}^\nlabeled \alpha_\nlabeledsmall y_\nlabeledsmall K(\vx_\nlabeledsmall, \vx) + b \right) $$

In this case, our kernel is \( K(\dash{\vx},\vx) = \phi(\dash{\vx})^T \phi(\vx) \); if \( \phi \) is simply the identity map, this reduces to the linear kernel \( K(\dash{\vx},\vx) = \dash{\vx}^T \vx \). But the options are many, as we will see next.
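Putting it all together, the sketch below implements the kernel-form prediction above with a linear kernel. The support vectors, labels, multipliers, and bias are made-up placeholders standing in for the output of a trained SVM.

```python
import numpy as np

def linear_kernel(x1, x2):
    """Linear kernel: K(x', x) = x'^T x (i.e. phi is the identity map)."""
    return x1 @ x2

def predict(x, X_sv, y_sv, alpha_sv, b, kernel=linear_kernel):
    """Kernel-form SVM prediction: sign( sum_i alpha_i y_i K(x_i, x) + b ).

    X_sv, y_sv, alpha_sv are the support vectors, their labels, and their
    (nonzero) Lagrange multipliers from a hypothetical trained SVM."""
    f = sum(a * yi * kernel(xi, x)
            for a, yi, xi in zip(alpha_sv, y_sv, X_sv)) + b
    return np.sign(f)

# Example with made-up numbers (two support vectors in 2-D).
X_sv     = np.array([[1.0, 2.0], [2.0, 0.5]])
y_sv     = np.array([+1, -1])
alpha_sv = np.array([0.7, 0.3])
b        = 0.1

print(predict(np.array([1.0, 1.0]), X_sv, y_sv, alpha_sv, b))   # +1.0
```

Swapping in a different kernel (for example one corresponding to the quadratic feature map above) requires no other change to the prediction code, which is the point of writing the model in this form.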