Introduction to kernel methods

This learning module has many interactive demos. It is easier to work with them on a larger screen. Bookmark and revisit if you are currently on a small screen device.

\(\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\asterisk}{\ast} \newcommand{\sup}{\text{sup}} \newcommand{\inf}{\text{inf}} \newcommand{\min}{\text{min}\;} \newcommand{\max}{\text{max}\;} \newcommand{\maxunder}[1]{\underset{#1}{\max}} \newcommand{\minunder}[1]{\underset{#1}{\min}} \newcommand{\real}{\mathbb{R}} \newcommand{\natural}{\mathbb{N}} \newcommand{\integer}{\mathbb{Z}} \newcommand{\rational}{\mathbb{Q}} \newcommand{\irrational}{\mathbb{I}} \newcommand{\complex}{\mathbb{C}} \newcommand{\cardinality}[1]{|#1|} \newcommand{\vec}[1]{\mathbf{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\star}[1]{#1^*} \newcommand{\inv}[1]{#1^{-1}} \newcommand{\indicator}[1]{\mathcal{I}(#1)} \renewcommand{\BigO}[1]{\mathcal{O}(#1)} \renewcommand{\BigOsymbol}{\mathcal{O}} \renewcommand{\smallo}[1]{\mathcal{o}(#1)} \renewcommand{\smallosymbol}[1]{\mathcal{o}} \newcommand{\set}[1]{\mathbb{#1}} \newcommand{\complement}[1]{#1^c} \newcommand{\powerset}[1]{\mathcal{P}(#1)} \newcommand{\setdiff}{\setminus} \newcommand{\setsymmdiff}{\oplus} \newcommand{\dash}[1]{#1^{'}} \newcommand{\permutation}[2]{{}_{#1} \mathrm{ P }_{#2}} \newcommand{\combination}[2]{{}_{#1} \mathrm{ C }_{#2}} \newcommand{\prob}[1]{P(#1)} \newcommand{\pmf}[1]{P(#1)} \newcommand{\pdf}[1]{p(#1)} \newcommand{\cdf}[1]{F(#1)} \newcommand{\expect}[2]{E_{#1}\left[#2\right]} \newcommand{\entropy}[1]{\mathcal{H}\left[#1\right]} \newcommand{\expe}[1]{\mathrm{e}^{#1}} \newcommand{\textexp}[1]{\text{exp}\left(#1\right)} \def\independent{\perp\!\!\!\perp} \def\notindependent{\not\!\independent} \newcommand{\yhat}{\hat{y}} \newcommand{\vs}{\vec{s}} \newcommand{\vt}{\vec{t}} \newcommand{\vu}{\vec{u}} \newcommand{\vv}{\vec{v}} \newcommand{\vw}{\vec{w}} \newcommand{\vx}{\vec{x}} \newcommand{\vy}{\vec{y}} \newcommand{\vz}{\vec{z}} \newcommand{\va}{\vec{a}} \newcommand{\vb}{\vec{b}} \newcommand{\vc}{\vec{c}} \newcommand{\vd}{\vec{d}} \newcommand{\ve}{\vec{e}} \newcommand{\vg}{\vec{g}} \newcommand{\vh}{\vec{h}} \newcommand{\vi}{\vec{i}} \newcommand{\vk}{\vec{k}} \newcommand{\vo}{\vec{o}} \newcommand{\vp}{\vec{p}} \newcommand{\vq}{\vec{q}} \newcommand{\vr}{\vec{r}} \newcommand{\vs}{\vec{s}} \newcommand{\vmu}{\vec{\mu}} \newcommand{\vsigma}{\vec{\sigma}} \newcommand{\vphi}{\vec{\phi}} \newcommand{\vtau}{\vec{\tau}} \newcommand{\vtheta}{\vec{\theta}} \newcommand{\mA}{\mat{A}} \newcommand{\mB}{\mat{B}} \newcommand{\mC}{\mat{C}} \newcommand{\mD}{\mat{D}} \newcommand{\mE}{\mat{E}} \newcommand{\mH}{\mat{H}} \newcommand{\mK}{\mat{K}} \newcommand{\mP}{\mat{P}} \newcommand{\mQ}{\mat{Q}} \newcommand{\mR}{\mat{R}} \newcommand{\mS}{\mat{S}} \newcommand{\mU}{\mat{U}} \newcommand{\mV}{\mat{V}} \newcommand{\mW}{\mat{W}} \newcommand{\mX}{\mat{X}} \newcommand{\mY}{\mat{Y}} \newcommand{\mZ}{\mat{Z}} \newcommand{\mI}{\mat{I}} \newcommand{\mLambda}{\mat{\Lambda}} \newcommand{\mSigma}{\mat{\Sigma}} \newcommand{\mTheta}{\mat{\theta}} \newcommand{\setsymb}[1]{#1} \newcommand{\sA}{\setsymb{A}} \newcommand{\sB}{\setsymb{B}} \newcommand{\sC}{\setsymb{C}} \newcommand{\sO}{\setsymb{O}} \newcommand{\sP}{\setsymb{P}} \newcommand{\sQ}{\setsymb{Q}} \newcommand{\sH}{\setsymb{H}} \newcommand{\sX}{\setsymb{X}} \newcommand{\sY}{\setsymb{Y}} \newcommand{\norm}[2]{||{#1}||_{#2}} \newcommand{\infnorm}[1]{\norm{#1}{\infty}} \newcommand{\fillinblank}{\text{ }\underline{\text{ ? }}\text{ }} \newcommand{\lbrace}{\left\{} \newcommand{\rbrace}{\right\}} \newcommand{\set}[1]{\lbrace #1 \rbrace} \newcommand{\seq}[1]{\left( #1 \right)} \newcommand{\ndim}{N} \newcommand{\ndimsmall}{n} \newcommand{\dataset}{\mathbb{D}} \newcommand{\ndata}{D} \newcommand{\ndatasmall}{d} \newcommand{\labeledset}{\mathbb{L}} \newcommand{\nlabeled}{L} \newcommand{\nlabeledsmall}{l} \newcommand{\unlabeledset}{\mathbb{U}} \newcommand{\nunlabeled}{U} \newcommand{\nunlabeledsmall}{u} \newcommand{\nclass}{M} \newcommand{\nclasssmall}{m} \newcommand{\loss}{\mathcal{L}} \newcommand{\sign}{\text{sign}} \newcommand{\Gauss}{\mathcal{N}} \newcommand{\hadamard}{\circ} \newcommand{\doh}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\dox}[1]{\doh{#1}{x}} \newcommand{\doy}[1]{\doh{#1}{y}} \newcommand{\doxx}[1]{\doh{#1}{x^2}} \newcommand{\doyy}[1]{\doh{#1}{y^2}} \newcommand{\doxy}[1]{\frac{\partial #1}{\partial x \partial y}} \newcommand{\doyx}[1]{\frac{\partial #1}{\partial y \partial x}} \newcommand{\qed}{\tag*{$\blacksquare$}}\)

        Kernel methods
        Machine Learning
      

Introduction

Kernel methods are a group of approaches in machine learning that make predictions on previously unseen data based on their similarity to observations in the training set. The pairwise similarity is encoded in terms of a function known as a kernel. The predictive capabilities of kernel methods can be affected by modifying the kernel and its parameters.

Kernel methods are used for a wide variety of applications, including classification, regression, and density estimation. In this article, we develop the intuition for kernel methods using the example of regression.

Prerequisites

To understand kernel methods, we recommend familiarity with the concepts in

Introduction to machine learning: An introduction to basic concepts in machine learning such as classification, training instances, features, and feature types.
Nearest neighbor methods

Follow the above links to first get acquainted with the corresponding concepts.

Problem setting

Consider an instance $ \vx \in \real^\ndim $, a vector consisting of $ \ndim $ features, $\vx = [x_1, x_2, \ldots, x_\ndim] $.

In classification, the goal of the predictive model is to identify the class that generated a particular instance. We need to assign it to one of the $ M $ classes $ C_1, C_2, \ldots, C_M $ depending on the values of the $ N $ features .

In regression, the goal of the predictive model is to predict a continuous valued output for a given multivariate instance. We need to predict a real-valued output $ \hat{y} \in \real $ that is as close as possible to the true target $ y \in \real $. The hat $ \hat{ } $ denotes that $ \hat{y} $ is an estimate, to distinguish it from the truth.

For both supervised learning settings, the model is inferred over a collection of labeled observations provided as tuples $ (\vx_i,y_i) $ containing the instance vector $ \vx_i $ and the true target variable $ y_i $. This collection of labeled observations is known as the training set $ \labeledset = \set{(\vx_1,y_1), \ldots (\vx_\nlabeled,y_\nlabeled)} $.

Motivation

As we have explained in our comprehensive overview, the predictive model for $K$-nearest neighbors regression is simply the average of target variables of the $K$-nearest observations from the training set. Specifically, if $\mathcal{N}_K(\vx) $ denotes the set of $K$-nearest neighbors of the point $ \vx $, then,

$$ \yhat = \frac{1}{K} \sum_{\vx_\nlabeledsmall \in \mathcal{N}_K(\vx)} y_\nlabeledsmall $$

An alternative neighborhood based approach can be based on the average of target variables of points in a fixed-size neighborhood of the test point.

\begin{equation} f(\vx) = \text{Average}(y_\nlabeledsmall|\vx_\nlabeledsmall \in \text{Neighborhood}(\vx)) \label{eqn:pred-local} \end{equation}

Both these formulations result in predictions that are not smooth over the feature space, due to the discretization imposed by $K$ or the limits of the neighborhood being considered.

Intuition

Instead, a smoother prediction can be achieved by computing the weighted average of the training set examples. Neighborhood-based localization could be ensured by assigning higher weights to training examples that are closer to the test example.

If $ w_\nlabeledsmall $ denotes the weight assigned to the example $ \vx_\nlabeledsmall $, we can arrive at a more refined predictive model.

$$ \yhat = \frac{\sum_{\nlabeledsmall = 1}^{\nlabeled} w_\nlabeledsmall y_\nlabeledsmall}{ \sum_{\nlabeledsmall = 1}^{\nlabeled} w_\nlabeledsmall} $$

In this formulation, each label $ y_\nlabeledsmall $ is weighed by the nearness of the corresponding example $ \vx_\nlabeledsmall $ to the test point $ \vx $. The normalization factor (denominator) is the sum of all the weights assigned.

Intuitively, $ w_\nlabeledsmall $ is a similarity score of the example $ \vx_\nlabeledsmall $ to the test example $ \vx $. Since the weights are pairwise similarity scores, let's denote these as a function of the data points involved

$$ w_\nlabeledsmall = K(\vx,\vx_\nlabeledsmall) $$

This pairwise similarity score function is known as a kernel.

Predictive model

With the definition of kernel as defined above, the predictive model of kernel methods for regression is quite straightforward.

\begin{equation} \yhat = \frac{\sum_{\nlabeledsmall = 1}^{\nlabeled} K(\vx, \vx_\nlabeledsmall) y_\nlabeledsmall}{ \sum_{\nlabeledsmall = 1}^{\nlabeled} K(\vx,\vx_\nlabeledsmall)} $$ \label{eqn:reg-pred} \end{equation}

This formulation of kernel regression is known as the Nadaraya-Watson kernel-weighted average.

Kernels

The crucial building block of a kernel method is the eponymous kernel — the similarity scoring metric.

A typical kernel has the form

$$ K(\vx,\dash{\vx}) = D \left( \frac{|\vx - \dash{\vx}|}{\lambda_{\vx}} \right) $$

In this formulation, $ D $ is the kernel scoring function. We will study several formulations for scoring function $ D $ in the example kernels we investigate in the upcoming sections.

In the equation above, the variable $ \lambda_{\vx} \in \real $ is known as the kernel bandwidth. A larger value of the bandwidth allows for a wider neighborhood — relatively higher weights being assigned to more examples. A lower value constrains the neighborhood by assigning higher weights to only a few nearby examples. The considerations in choosing $ \lambda $ are about the same as those in choosing the number of nearest neighbors in the case of nearest neighbor methods. Some kernels use a fixed value of the bandwidth, while others use a more adaptive value based on $ \vx $, as has been denoted here with $ \lambda_{\vx} $.

Many kernel formulations have been investigated in the context of kernel methods. We will cover the important ones here.

Epanechnikov quadratic kernel

A popular kernel for the Nadaraya Watson approach, this kernel is defined as

$$ K_\lambda(\vx,\dash{\vx}) = D \left( \frac{|\vx - \dash{\vx}|}{\lambda} \right) $$

where, $D $ is defined as

\begin{equation} D(t) = \begin{cases} \frac{3}{4}(1-t^2), &~~~~ \text{if}~~|t| \le 1; \\\\ 0 &~~~~ \text{otherwise} \end{cases} \label{eqn:epanechnikov-kernel} \end{equation}

In the case of the Epanechnikov kernel, the bandwidth parameter $ \lambda $ is a constant.

The Epanechnikov kernel has a compact support. This means, it abruptly goes to zero if the bandwidth-normalized distance between the two points is greater than 1. It more closely resembles a nearest-neighbor like approach because the support of points beyond the kernel boundary is zero.

Tri-cube kernel

The tri-cube kernel defines $ D $ as

\begin{equation} D(t) = \begin{cases} (1-|t|^3)^3, &~~~~ \text{if}~~|t| \le 1; \\\\ 0 &~~~~ \text{otherwise} \end{cases} \label{eqn:tri-cube-kernel} \end{equation}

The tri-cube kernel also uses a fixed bandwidth like the Epanechnikov kernel. Compared to the Epanechnikov kernel, the tri-cube kernel is differentiable in the neighborhood boundary controlled by the bandwidth parameter. But similar to the Epanechnikov kernel, the support of points beyond the neighborhood described by the bandwidth is zero.

Gaussian kernel

A popular smoother choice for the kernel is the Gaussian kernel

$$ K_\lambda(\vx,\dash{\vx}) = \phi_\lambda\left(|\vx - \dash{\vx}|\right) $$

where, $ \phi_\lambda(\cdot) $ is the density function for $ \Gauss(0,\lambda) $ — a Gaussian with zero mean and standard deviation $ \lambda $.

Thus, in the case of Gaussian kernel, the similarity scoring function is

\begin{equation} D(t) = \frac{1}{\sqrt{2\pi\lambda^2}}\expe{-\frac{1}{2} \left(\frac{t}{\lambda}\right)^2} \label{eqn:gaussian-kernel} \end{equation}

Being a smoother alternative, there is no fixed boundary for the Gaussian kernel and all available examples receive some weight in the final prediction. The relative weighting of examples is controlled by changing the bandwidth parameter — the standard deviation of the Gaussian.

Selecting the kernel bandwidth

The bandwidth is a crucial parameter in the case of kernel methods

If the kernel bandwidth is small, then the neighborhood supported by the kernel is small. In other words, the predictions will be averaged over a few training examples and the model will be highly localized. This is similar to choosing a very small value of $ K $ in the case of $ K $-nearest neighbors.
If the kernel bandwidth is large, then the prediction of the model is averaged over numerous examples. This could be troublesome for multimodal regression problems or those with minority classes in the case of classification. The majority class or target values will tend to dominate the overall prediction, just by being larger in population. This will defeat the localization goal of the kernel method.

Thus, it can be observed that there is an inherent bias-variance trade-off when using the kernel bandwidth parameter. And it is better to be selected appropriately for the task and dataset at hand.

Therefore, as with other hyperparameter in machine learning, the kernel bandwidth is typically tuned via cross-validation.

Training

Kernel methods require minimal training as all predictions depend directly on the examples in the neighborhood defined by the kernel. The only hyperparameters that need to be tuned are the kernel type and its bandwidth. As suggested earlier, these are best chosen using cross-validation.

Kernel regression: demo

In this next interaction, understand the nature of the kernel regression model. The blue line indicates the predicted function. The red line indicates the true function that generated the data, with some added noise.

Understand the effect of the size of the bandwidth and kernel type on the prediction function.

Predictive model of kernel regression for chosen bandwidth

Note that increasing the value of $ \lambda $ leads to a smoother predictive model, as it allows for more training data points to affect the function at a point. Conversely, small values of $ \lambda $ lead to high weights for nearby examples and limited impact of distant examples. This effectively results in less smooth predictor that is quite noisy.

You may note that the Epanechnikov and tricube kernel both have limited support regions. If there are no training point within some neigbhorhood, the kernel method fails to present a good prediction. On the other hand, the Gaussian kernel has infinite support and it does not fail even if some region is devoid of training examples.

Dealing with feature types

The example kernels we defined earlier all used a similarity scoring function that calculates distances $ |\vx - \dash{\vx}| $ in the Euclidean space. This is easy for binary and continuous features since both can be treated as real-valued features.

But kernel methods are not limited to real-valued features. Numerous specialized kernels have been designed for categorical and structured observations such as strings, graphs, and trees.

Please support us

Help us create more engaging and effective content and keep it free of paywalls and advertisements!

Please donate

Subscribe for article updates

Stay up to date with new material for free.