Introduction to nearest neighbor classifier


\(\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\asterisk}{\ast} \newcommand{\sup}{\text{sup}} \newcommand{\inf}{\text{inf}} \newcommand{\min}{\text{min}\;} \newcommand{\max}{\text{max}\;} \newcommand{\maxunder}[1]{\underset{#1}{\max}} \newcommand{\minunder}[1]{\underset{#1}{\min}} \newcommand{\real}{\mathbb{R}} \newcommand{\natural}{\mathbb{N}} \newcommand{\integer}{\mathbb{Z}} \newcommand{\rational}{\mathbb{Q}} \newcommand{\irrational}{\mathbb{I}} \newcommand{\complex}{\mathbb{C}} \newcommand{\cardinality}[1]{|#1|} \newcommand{\vec}[1]{\mathbf{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\star}[1]{#1^*} \newcommand{\inv}[1]{#1^{-1}} \newcommand{\indicator}[1]{\mathcal{I}(#1)} \renewcommand{\BigO}[1]{\mathcal{O}(#1)} \renewcommand{\BigOsymbol}{\mathcal{O}} \renewcommand{\smallo}[1]{\mathcal{o}(#1)} \renewcommand{\smallosymbol}[1]{\mathcal{o}} \newcommand{\set}[1]{\mathbb{#1}} \newcommand{\complement}[1]{#1^c} \newcommand{\powerset}[1]{\mathcal{P}(#1)} \newcommand{\setdiff}{\setminus} \newcommand{\setsymmdiff}{\oplus} \newcommand{\dash}[1]{#1^{'}} \newcommand{\permutation}[2]{{}_{#1} \mathrm{ P }_{#2}} \newcommand{\combination}[2]{{}_{#1} \mathrm{ C }_{#2}} \newcommand{\prob}[1]{P(#1)} \newcommand{\pmf}[1]{P(#1)} \newcommand{\pdf}[1]{p(#1)} \newcommand{\cdf}[1]{F(#1)} \newcommand{\expect}[2]{E_{#1}\left[#2\right]} \newcommand{\entropy}[1]{\mathcal{H}\left[#1\right]} \newcommand{\expe}[1]{\mathrm{e}^{#1}} \newcommand{\textexp}[1]{\text{exp}\left(#1\right)} \def\independent{\perp\!\!\!\perp} \def\notindependent{\not\!\independent} \newcommand{\yhat}{\hat{y}} \newcommand{\vs}{\vec{s}} \newcommand{\vt}{\vec{t}} \newcommand{\vu}{\vec{u}} \newcommand{\vv}{\vec{v}} \newcommand{\vw}{\vec{w}} \newcommand{\vx}{\vec{x}} \newcommand{\vy}{\vec{y}} \newcommand{\vz}{\vec{z}} \newcommand{\va}{\vec{a}} \newcommand{\vb}{\vec{b}} \newcommand{\vc}{\vec{c}} \newcommand{\vd}{\vec{d}} \newcommand{\ve}{\vec{e}} 
\newcommand{\vg}{\vec{g}} \newcommand{\vh}{\vec{h}} \newcommand{\vi}{\vec{i}} \newcommand{\vk}{\vec{k}} \newcommand{\vo}{\vec{o}} \newcommand{\vp}{\vec{p}} \newcommand{\vq}{\vec{q}} \newcommand{\vr}{\vec{r}} \newcommand{\vs}{\vec{s}} \newcommand{\vmu}{\vec{\mu}} \newcommand{\vsigma}{\vec{\sigma}} \newcommand{\vphi}{\vec{\phi}} \newcommand{\vtau}{\vec{\tau}} \newcommand{\vtheta}{\vec{\theta}} \newcommand{\mA}{\mat{A}} \newcommand{\mB}{\mat{B}} \newcommand{\mC}{\mat{C}} \newcommand{\mD}{\mat{D}} \newcommand{\mE}{\mat{E}} \newcommand{\mH}{\mat{H}} \newcommand{\mK}{\mat{K}} \newcommand{\mP}{\mat{P}} \newcommand{\mQ}{\mat{Q}} \newcommand{\mR}{\mat{R}} \newcommand{\mS}{\mat{S}} \newcommand{\mU}{\mat{U}} \newcommand{\mV}{\mat{V}} \newcommand{\mW}{\mat{W}} \newcommand{\mX}{\mat{X}} \newcommand{\mY}{\mat{Y}} \newcommand{\mZ}{\mat{Z}} \newcommand{\mI}{\mat{I}} \newcommand{\mLambda}{\mat{\Lambda}} \newcommand{\mSigma}{\mat{\Sigma}} \newcommand{\mTheta}{\mat{\theta}} \newcommand{\setsymb}[1]{#1} \newcommand{\sA}{\setsymb{A}} \newcommand{\sB}{\setsymb{B}} \newcommand{\sC}{\setsymb{C}} \newcommand{\sO}{\setsymb{O}} \newcommand{\sP}{\setsymb{P}} \newcommand{\sQ}{\setsymb{Q}} \newcommand{\sH}{\setsymb{H}} \newcommand{\sX}{\setsymb{X}} \newcommand{\sY}{\setsymb{Y}} \newcommand{\norm}[2]{||{#1}||_{#2}} \newcommand{\infnorm}[1]{\norm{#1}{\infty}} \newcommand{\fillinblank}{\text{ }\underline{\text{ ? 
}}\text{ }} \newcommand{\lbrace}{\left\{} \newcommand{\rbrace}{\right\}} \newcommand{\set}[1]{\lbrace #1 \rbrace} \newcommand{\seq}[1]{\left( #1 \right)} \newcommand{\ndim}{N} \newcommand{\ndimsmall}{n} \newcommand{\dataset}{\mathbb{D}} \newcommand{\ndata}{D} \newcommand{\ndatasmall}{d} \newcommand{\labeledset}{\mathbb{L}} \newcommand{\nlabeled}{L} \newcommand{\nlabeledsmall}{l} \newcommand{\unlabeledset}{\mathbb{U}} \newcommand{\nunlabeled}{U} \newcommand{\nunlabeledsmall}{u} \newcommand{\nclass}{M} \newcommand{\nclasssmall}{m} \newcommand{\loss}{\mathcal{L}} \newcommand{\sign}{\text{sign}} \newcommand{\Gauss}{\mathcal{N}} \newcommand{\hadamard}{\circ} \newcommand{\doh}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\dox}[1]{\doh{#1}{x}} \newcommand{\doy}[1]{\doh{#1}{y}} \newcommand{\doxx}[1]{\doh{#1}{x^2}} \newcommand{\doyy}[1]{\doh{#1}{y^2}} \newcommand{\doxy}[1]{\frac{\partial #1}{\partial x \partial y}} \newcommand{\doyx}[1]{\frac{\partial #1}{\partial y \partial x}} \newcommand{\qed}{\tag*{$\blacksquare$}}\)


Introduction

All supervised classification models in machine learning rest on a primary assumption: examples belonging to the same class are similar to one another. In the training phase, a classifier learns the most dominant similarities among examples of the same class, so that new examples can be checked for such similarities. The nearest neighbor classifier works directly from this assumption. Given an unlabeled example, it finds the closest neighbors in the feature space and assigns the majority label among them. Although simple, the nearest neighbor classifier is quite a strong classifier, albeit with some severe practical challenges.

Prerequisites

To understand the nearest neighbor model, we recommend familiarity with the concepts in

Introduction to machine learning: An introduction to basic concepts in machine learning such as classification, training instances, features, and feature types.
Distance and similarity metrics: A comprehensive overview of distance metrics.

Follow the above links to first get acquainted with the corresponding concepts.

Problem setting

In classification, the goal of the predictive model is to identify the class that a particular given instance belongs to.

Consider such an instance $ \vx \in \real^N $, a vector consisting of $ N $ features, $\vx = [x_1, x_2, \ldots, x_N] $.

We need to assign it to one of the $ M $ classes $ C_1, C_2, \ldots, C_M $ depending on the values of the $ N $ features.

Predictive model

As we described earlier, the nearest neighbor classifier labels an unlabeled example in two steps:

Sort labeled examples from the training set based on their nearness to the given unlabeled example.
Identify the majority label among top $ K $ nearest neighbors. This is the prediction.

This means that the unlabeled instance is assigned to the class with the most examples among the $ K $ training examples closest to it.
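The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `knn_predict`, the toy data, the use of Euclidean distance, and the tie-breaking rule are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest training examples."""
    # Step 1: sort training examples by Euclidean distance to the query.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    # Step 2: take the majority label among the top k. Ties go to the label
    # encountered first among the nearest neighbors (an arbitrary choice here).
    top_labels = [train_y[i] for i in order[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups in a 2-D feature space.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
train_y = ["red", "red", "red", "blue", "blue", "blue"]

print(knn_predict(train_X, train_y, (2.0, 2.0), k=3))  # red
print(knn_predict(train_X, train_y, (6.0, 5.0), k=3))  # blue
```

Note that nothing is learned before the call: the function needs the full training set every time it makes a prediction.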

Training nearest-neighbor classifier

Naturally, the questions to ask are

How do we quantify nearness?
How many neighbors to consider for the prediction? In other words, what is the value of $ K $?

Apart from answering these important questions, the training phase is almost non-existent for the nearest-neighbor classifier, because the nearest neighbors are identified only when the unlabeled example is presented to the model. There is no learning as such, except the choice of hyperparameters and metrics that address the questions posed above.

Nearness

Nearness is quantified by calculating a distance or similarity metric such as the Euclidean distance, Mahalanobis distance, or the cosine similarity metric. As an example, the Euclidean distance between two examples $ \vx $ and $ \vx' $ is calculated as

$$ d(\vx, \vx') = ||\vx - \vx'||_2 = \sqrt{\sum_{n=1}^N \left(x_n - {x'}_n \right)^2} $$

This distance is computed from the unlabeled example to every example in the training set $ \set{(\vx_1, y_1), (\vx_2, y_2), \ldots, (\vx_L, y_L)} $. The distances are then sorted in ascending order and the first $ K $ examples are chosen. (If a similarity metric is used to measure nearness, the similarity scores are instead sorted in descending order and the first $ K $ examples are chosen.)
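As a rough sketch of the sorting rule just described, the following Python computes either a distance (sorted ascending) or a similarity (sorted descending) to every training example and keeps the first $ K $ indices. The helper names `euclidean`, `cosine_similarity`, and `k_nearest`, the `higher_is_nearer` flag, and the toy points are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance: smaller means nearer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine similarity: larger means nearer. Assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def k_nearest(train_X, query, k, score=euclidean, higher_is_nearer=False):
    """Return the indices of the k training examples nearest to `query`."""
    scores = [score(x, query) for x in train_X]
    # Distances sort ascending; similarities sort descending.
    order = sorted(range(len(train_X)), key=lambda i: scores[i], reverse=higher_is_nearer)
    return order[:k]


train_X = [(3.0, 4.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
query = (1.0, 1.0)

print(k_nearest(train_X, query, k=2))  # [1, 2]
print(k_nearest(train_X, query, k=2, score=cosine_similarity, higher_is_nearer=True))  # [3, 0]
```

The two calls select different neighbors for the same query: Euclidean distance favors points that are close in position, while cosine similarity favors points that lie in a similar direction from the origin. The choice of metric is therefore part of the model.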

Choosing $ K $

The value of $ K $ plays a significant role in the performance of the nearest neighbor classifier.

When $ K $ equals 1, the prediction relies on only one neighbor and is very localized. Such a model has very high variance, because the prediction could be different if the nearest training example that led to the current prediction were missing from the training set. Changing just a few examples in the training set could change the predictions for many unseen examples.

When the value of $ K $ is set very high, the prediction relies on a significant fraction of the training set, rendering it less local. This may provide significant support for the prediction of the unlabeled instance, but it may create problems in sub-regions that have very few examples of the correct class, especially near class boundaries. It may also be a problem when dealing with imbalanced or rare classes that have very few training examples.

Thus, it is important to arrive at a value of $ K $ that is just right. As with all hyperparameter tuning in machine learning, we use cross-validation to arrive at a suitable value for $ K $. We try several candidate values for $ K $ and choose the one with the highest cross-validation accuracy.
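The selection procedure can be sketched as follows. To keep the example short and deterministic, it uses leave-one-out cross-validation, which is one simple form of cross-validation and an assumption of this sketch. The function names, the candidate list, and the toy data (including one deliberately mislabeled point) are also illustrative assumptions.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Majority label among the k training examples nearest to `query`."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: predict each example from all the others."""
    correct = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_X, rest_y, X[i], k) == y[i]:
            correct += 1
    return correct / len(X)


def choose_k(X, y, candidates):
    """Return the candidate K with the highest accuracy (first one wins ties) and all scores."""
    scores = {k: loo_accuracy(X, y, k) for k in candidates}
    best = max(candidates, key=lambda k: scores[k])
    return best, scores


# Two clusters, plus one mislabeled point in the middle of the "a" cluster.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
     (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0)]
y = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]

best_k, scores = choose_k(X, y, candidates=[1, 3, 5])
print(best_k, scores)
```

On this toy data, $ K=1 $ scores poorly because the single mislabeled point is the nearest neighbor of every point in its cluster, which mirrors the high-variance behavior described above. Larger values of $ K $ outvote it.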

On that note, one interesting property of the $ K=1 $ nearest neighbor classifier is that, given infinite training data, its error rate is never more than twice the minimum achievable error rate of an optimal classifier. That is, as the number of training examples $ L \to \infty $, the error rate of the $ K=1 $ classifier is at most twice that of a classifier that uses the true class distributions for prediction [Cover and Hart, 1967].
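In symbols, using notation introduced here for illustration: let $ R^* $ be the error rate of the optimal classifier and $ R_{\text{1NN}} $ the limiting error rate of the $ K=1 $ classifier as the training set grows without bound. For $ M $ classes, the bound in [Cover and Hart, 1967] can be written as

$$ R^* \le R_{\text{1NN}} \le R^* \left( 2 - \frac{M}{M-1} R^* \right) \le 2 R^* $$

For example, with $ M = 2 $ and $ R^* = 0.1 $, the middle expression gives $ 0.1 \times (2 - 0.2) = 0.18 $, slightly tighter than $ 2 R^* = 0.2 $.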

K-nearest neighbor classifier demo

Let us understand the predictive model of $K$-nearest neighbor classifier on some data. The training phase is non-existent. Choose a value of $ K $ by adjusting the slider and then view the regions of the feature space assigned to each class. You may also adjust the number of categories for the classification problem.

KNN classifier

Note that the KNN boundaries are nonlinear. As you reduce the value of $ K $, the boundaries get rougher, and as you increase it, the boundaries get smoother. This happens because larger values of $ K $ provide more support for each prediction: a majority vote over many points does not change as rapidly across the feature space as one that depends on only a few points.

Nearest neighbors for regression

In regression, the goal of the predictive model is to predict a continuous valued output for a given multivariate instance.

Consider such an instance $ \vx \in \real^N $, a vector consisting of $ N $ features, $\vx = [x_1, x_2, \ldots, x_N] $.

We need to predict a real-valued output $ \hat{y} \in \real $ that is as close as possible to the true target $ y \in \real $. The hat $ \hat{ } $ denotes that $ \hat{y} $ is an estimate, to distinguish it from the truth.

The extension of the nearest neighbor model to the regression problem is straightforward.

Instead of identifying the majority label among nearest neighbors, we choose the mean of the target variable of the nearest neighbors.

Thus, the predictive model for nearest neighbor regression is

$$ \yhat = \frac{1}{K} \sum_{\vx_i \in \mathcal{N}_K(\vx)} y_i $$

where $ \mathcal{N}_K(\vx) $ is the set of $ K $ nearest neighbors of the unlabeled example $ \vx $.
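The regression rule translates almost directly into code. The sketch below assumes Euclidean distance and an unweighted mean, as in the formula above. The name `knn_regress` and the toy data are made up for the example.

```python
import math


def knn_regress(train_X, train_y, query, k):
    """Predict a real value as the mean target of the k nearest training examples."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    neighbors = order[:k]
    # Divide by the number of neighbors actually used (equals k when k <= len(train_X)).
    return sum(train_y[i] for i in neighbors) / len(neighbors)


# One feature per example; the last point is far from the query and has a large target.
train_X = [(1.0,), (2.0,), (3.0,), (4.0,), (10.0,)]
train_y = [1.0, 2.0, 3.0, 4.0, 20.0]

# The three nearest neighbors of x = 2.4 are x = 2, 3, and 1, so the
# prediction is the mean of their targets: (2 + 3 + 1) / 3.
print(knn_regress(train_X, train_y, (2.4,), k=3))  # 2.0
```

With `k=5`, the distant example with target 20.0 would enter the average and pull the prediction up, which illustrates how a large $ K $ makes the model less local.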

Dealing with feature types

Note that the predictive model involves calculation of a distance or similarity metric. For the most popularly used metrics, such as Euclidean distance or cosine similarity, this is easy for binary and continuous features since both can be treated as real-valued features.

In the case of categorical features, a direct metric score calculation is not possible. Therefore, we first need to preprocess the categorical variables using one-hot encoding to arrive at a binary feature representation.
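A minimal sketch of one-hot encoding for a single categorical feature is shown below. The function name `one_hot_encode` and the example values are assumptions. In practice the category list would be fixed from the training set and reused at prediction time.

```python
import math


def one_hot_encode(values):
    """Map each categorical value to a binary vector containing a single 1."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0.0] * len(categories)
        row[index[v]] = 1.0
        encoded.append(row)
    return categories, encoded


colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)  # ['blue', 'green', 'red']
print(encoded[0])  # [0.0, 0.0, 1.0]

# After encoding, the usual metrics apply: identical categories are at
# distance 0, and any two different categories are equally far apart.
print(math.dist(encoded[1], encoded[3]))  # 0.0
print(math.dist(encoded[0], encoded[1]))
```

Every pair of distinct categories ends up at the same Euclidean distance, $ \sqrt{2} $, so the encoding does not impose an artificial ordering on the categories.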

Practical challenges

Despite being an easy-to-understand approach, the nearest neighbor classifier is quite challenging to use in practical applications.

The training phase is nonexistent. This may seem like a good thing, but it is not. That's because all the training examples need to be available to the model for identifying the nearest examples. Depending on the training set size, this may rule out several deployment scenarios, such as those in embedded systems or those where training data is proprietary.
Discovering the nearest examples at prediction time means the computational platform for the predictive model needs to be powerful enough to quickly compute distances to all training examples for each prediction. Compare this to parametric classification approaches, where the predictive model is a simple calculation involving the unlabeled instance and the learnt parameters of the model.
In high-dimensional feature spaces, the nearest neighbors may actually be quite far from the unlabeled example, far enough that they should not be considered neighbors at all. In other words, the curse of dimensionality is a severe challenge for the nearest neighbor classifier due to the infeasibility of discovering substantially similar neighbors in high-dimensional spaces.
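The last challenge can be observed in a small simulation. The sketch below rests on assumptions chosen only for illustration: points drawn uniformly at random from the unit hypercube, Euclidean distance, and a fixed random seed. It compares the distance from a query to its nearest and to its farthest training point as the number of features grows.

```python
import math
import random


def nearest_to_farthest_ratio(n_points, n_dims, seed=0):
    """Ratio of nearest to farthest distance from a random query to random points."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = [rng.random() for _ in range(n_dims)]
    # Brute force: one distance per training point, as the classifier itself requires.
    distances = [math.dist(p, query) for p in points]
    return min(distances) / max(distances)


for n_dims in (2, 10, 100, 1000):
    ratio = nearest_to_farthest_ratio(n_points=200, n_dims=n_dims)
    print(f"N = {n_dims:4d}  nearest/farthest = {ratio:.3f}")
```

A ratio near 0 means the nearest neighbor is much closer than the farthest point. As the ratio approaches 1, the nearest neighbor is barely closer than anything else in the training set, and it carries little information about the query's class.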
