Introduction to distance metrics and similarity metrics

This learning module has many interactive demos. It is easier to work with them on a larger screen. Bookmark and revisit if you are currently on a small screen device.

\(\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\asterisk}{\ast} \newcommand{\sup}{\text{sup}} \newcommand{\inf}{\text{inf}} \newcommand{\min}{\text{min}\;} \newcommand{\max}{\text{max}\;} \newcommand{\maxunder}[1]{\underset{#1}{\max}} \newcommand{\minunder}[1]{\underset{#1}{\min}} \newcommand{\real}{\mathbb{R}} \newcommand{\natural}{\mathbb{N}} \newcommand{\integer}{\mathbb{Z}} \newcommand{\rational}{\mathbb{Q}} \newcommand{\irrational}{\mathbb{I}} \newcommand{\complex}{\mathbb{C}} \newcommand{\cardinality}[1]{|#1|} \newcommand{\vec}[1]{\mathbf{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\star}[1]{#1^*} \newcommand{\inv}[1]{#1^{-1}} \newcommand{\indicator}[1]{\mathcal{I}(#1)} \renewcommand{\BigO}[1]{\mathcal{O}(#1)} \renewcommand{\BigOsymbol}{\mathcal{O}} \renewcommand{\smallo}[1]{\mathcal{o}(#1)} \renewcommand{\smallosymbol}[1]{\mathcal{o}} \newcommand{\set}[1]{\mathbb{#1}} \newcommand{\complement}[1]{#1^c} \newcommand{\powerset}[1]{\mathcal{P}(#1)} \newcommand{\setdiff}{\setminus} \newcommand{\setsymmdiff}{\oplus} \newcommand{\dash}[1]{#1^{'}} \newcommand{\permutation}[2]{{}_{#1} \mathrm{ P }_{#2}} \newcommand{\combination}[2]{{}_{#1} \mathrm{ C }_{#2}} \newcommand{\prob}[1]{P(#1)} \newcommand{\pmf}[1]{P(#1)} \newcommand{\pdf}[1]{p(#1)} \newcommand{\cdf}[1]{F(#1)} \newcommand{\expect}[2]{E_{#1}\left[#2\right]} \newcommand{\entropy}[1]{\mathcal{H}\left[#1\right]} \newcommand{\expe}[1]{\mathrm{e}^{#1}} \newcommand{\textexp}[1]{\text{exp}\left(#1\right)} \def\independent{\perp\!\!\!\perp} \def\notindependent{\not\!\independent} \newcommand{\yhat}{\hat{y}} \newcommand{\vs}{\vec{s}} \newcommand{\vt}{\vec{t}} \newcommand{\vu}{\vec{u}} \newcommand{\vv}{\vec{v}} \newcommand{\vw}{\vec{w}} \newcommand{\vx}{\vec{x}} \newcommand{\vy}{\vec{y}} \newcommand{\vz}{\vec{z}} \newcommand{\va}{\vec{a}} \newcommand{\vb}{\vec{b}} \newcommand{\vc}{\vec{c}} \newcommand{\vd}{\vec{d}} \newcommand{\ve}{\vec{e}} \newcommand{\vg}{\vec{g}} \newcommand{\vh}{\vec{h}} \newcommand{\vi}{\vec{i}} \newcommand{\vk}{\vec{k}} \newcommand{\vo}{\vec{o}} \newcommand{\vp}{\vec{p}} \newcommand{\vq}{\vec{q}} \newcommand{\vr}{\vec{r}} \newcommand{\vs}{\vec{s}} \newcommand{\vmu}{\vec{\mu}} \newcommand{\vsigma}{\vec{\sigma}} \newcommand{\vphi}{\vec{\phi}} \newcommand{\vtau}{\vec{\tau}} \newcommand{\vtheta}{\vec{\theta}} \newcommand{\mA}{\mat{A}} \newcommand{\mB}{\mat{B}} \newcommand{\mC}{\mat{C}} \newcommand{\mD}{\mat{D}} \newcommand{\mE}{\mat{E}} \newcommand{\mH}{\mat{H}} \newcommand{\mK}{\mat{K}} \newcommand{\mP}{\mat{P}} \newcommand{\mQ}{\mat{Q}} \newcommand{\mR}{\mat{R}} \newcommand{\mS}{\mat{S}} \newcommand{\mU}{\mat{U}} \newcommand{\mV}{\mat{V}} \newcommand{\mW}{\mat{W}} \newcommand{\mX}{\mat{X}} \newcommand{\mY}{\mat{Y}} \newcommand{\mZ}{\mat{Z}} \newcommand{\mI}{\mat{I}} \newcommand{\mLambda}{\mat{\Lambda}} \newcommand{\mSigma}{\mat{\Sigma}} \newcommand{\mTheta}{\mat{\theta}} \newcommand{\setsymb}[1]{#1} \newcommand{\sA}{\setsymb{A}} \newcommand{\sB}{\setsymb{B}} \newcommand{\sC}{\setsymb{C}} \newcommand{\sO}{\setsymb{O}} \newcommand{\sP}{\setsymb{P}} \newcommand{\sQ}{\setsymb{Q}} \newcommand{\sH}{\setsymb{H}} \newcommand{\sX}{\setsymb{X}} \newcommand{\sY}{\setsymb{Y}} \newcommand{\norm}[2]{||{#1}||_{#2}} \newcommand{\infnorm}[1]{\norm{#1}{\infty}} \newcommand{\fillinblank}{\text{ }\underline{\text{ ? }}\text{ }} \newcommand{\lbrace}{\left\{} \newcommand{\rbrace}{\right\}} \newcommand{\set}[1]{\lbrace #1 \rbrace} \newcommand{\seq}[1]{\left( #1 \right)} \newcommand{\ndim}{N} \newcommand{\ndimsmall}{n} \newcommand{\dataset}{\mathbb{D}} \newcommand{\ndata}{D} \newcommand{\ndatasmall}{d} \newcommand{\labeledset}{\mathbb{L}} \newcommand{\nlabeled}{L} \newcommand{\nlabeledsmall}{l} \newcommand{\unlabeledset}{\mathbb{U}} \newcommand{\nunlabeled}{U} \newcommand{\nunlabeledsmall}{u} \newcommand{\nclass}{M} \newcommand{\nclasssmall}{m} \newcommand{\loss}{\mathcal{L}} \newcommand{\sign}{\text{sign}} \newcommand{\Gauss}{\mathcal{N}} \newcommand{\hadamard}{\circ} \newcommand{\doh}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\dox}[1]{\doh{#1}{x}} \newcommand{\doy}[1]{\doh{#1}{y}} \newcommand{\doxx}[1]{\doh{#1}{x^2}} \newcommand{\doyy}[1]{\doh{#1}{y^2}} \newcommand{\doxy}[1]{\frac{\partial #1}{\partial x \partial y}} \newcommand{\doyx}[1]{\frac{\partial #1}{\partial y \partial x}} \newcommand{\qed}{\tag*{$\blacksquare$}}\)

        Distance and similarity metrics
        Machine Learning
      

Introduction

Central to the foundations of machine learning is the concept of distance, or its opposite, similarity. Supervised machine learning, such as approaches to classification, relies on the assumption that there is some similarity among examples of the same class, or conversely, distant examples belong to separate classes. Similarly, unsupervised machine learning assumes that there are groups of similar examples and that such groups are distant from each other. This notion manifests explicitly in distance-based models such as nearest neighbors, support vector machines and K-means clustering.

In this article, we will describe some popular ways of compute similarity or distance among examples, in the context of machine learning.

Prerequisites

To understand distance metrics introduced in this article, we recommend familiarity with the concepts in

Introduction to machine learning: An introduction to basic concepts in machine learning such as classification, training instances, features, and feature types.

Follow the above links to first get acquainted with the corresponding concepts.

Problem setting

In machine learning, multivariate instances are represented vectors consisting of attributes or features. In this article, we will consider $ \ndim$-dimensional real-valued feature vectors that we will denote with bold-faced lower case letters. For example, $ \vx \in \real^\ndim $, where $ \vx $ is a shorthand for the vector $ \vx = [x_1, \ldots, x_\ndim] $.

Euclidean distance

Quite possibly the most popular distance in machine learning is the Euclidean distance, also casually known as the L2 distance due to the usage of $ L_2 $-norm in the formula. It is the straight-line distance between two points in the Euclidean space.

For real-valued vectors $ \vx \in \real^\ndim $ and $ \vy \in \real^\ndim $, it is computed as their $ L_2 $-norm

$$ d_{\text{Euclidean}}(\vx, \vy) = \norm{\vx - \vy}{2} $$

where, the $ L_2 $ norm is defined as

$$ \norm{\vx - \vy}{2} = \sqrt{\sum_{\ndimsmall=1}^{\ndim} \left(x_\ndimsmall - y_\ndimsmall\right)^2} $$

It can be shown that the Euclidean distance is equivalent to

$$ \norm{\vx - \vy}{2} = \sqrt{(\vx - \vy) \cdot (\vx - \vy)} = \sqrt{\norm{\vx}{2}^2 + \norm{\vy}{2}^2 - 2\vx \cdot \vy} $$

where, $ \cdot $ denotes a dot product of the operands.

It should be noted that the Euclidean distance lies in is unbounded and may take any positive real-value in $ [0,\infty] $.

Minkowski distance

Euclidean distance is $ L_2 $-norm of the difference of vectors. The generalization of this distance to the $L_p$-norm is known as the Minkowski distance.

For real-valued vectors $ \vx \in \real^\ndim $ and $ \vy \in \real^\ndim $, their Minkowski distance is computed as

$$ d_{\text{Minkowski}}(\vx, \vy) = \norm{\vx - \vy}{p} $$

where, the $ L_p $ norm is defined as

$$ \norm{\vx - \vy}{p} = \left[ \sum_{\ndimsmall=1}^{\ndim} \left|x_\ndimsmall - y_\ndimsmall\right|^p \right]^{\frac{1}{p}} $$

For $p=2$, we have the Euclidean distance.

Manhattan distance

A special case of Minkowski distance when $ p = 1 $ is known as the Manhattan distance, computed as the $L_1$ norm of the difference of two vectors.

For real-valued vectors $ \vx \in \real^\ndim $ and $ \vy \in \real^\ndim $, their Manhattan distance is computed as

$$ d_{\text{Manhattan}}(\vx, \vy) = \norm{\vx - \vy}{1} $$

where, the $ L_p $ norm is defined as

$$ \norm{\vx - \vy}{p} = \sum_{\ndimsmall=1}^{\ndim} \left|x_\ndimsmall - y_\ndimsmall\right|^p $$

It should be noted that the distance is the sum of absolute difference along each dimension. This distance is akin to walking from point $ \vx $ to point $ \vy $ in the Euclidean space so that we start the walk along the $ 1$-st dimension, then along the $ 2$-nd dimension, and so on. This situation is analogous to walking between two points in a metropolitan area such as Manhattan by walking alongside city blocks. Hence the name, Manhattan distance. For the same reason, the $ L_1 $ distance is also known by alternative names such as city block distance, taxicab metric, or rectilinear distance.

Chebyshev distance

The Minkowski distance with $ p = \infty $ is known as the Chebyshev distance. It is computed as the $L_\infty$-norm of the difference of two vectors.

For real-valued vectors $ \vx \in \real^\ndim $ and $ \vy \in \real^\ndim $, their Chebyshev distance is computed as

$$ d_{\text{Chebyshev}}(\vx, \vy) = \norm{\vx - \vy}{\infty} $$

where, the $ L_\infty $ norm is defined as

$$ \norm{\vx - \vy}{\infty} = \underset{\ndimsmall=1,\ldots,\ndim}{\max} \left|x_\ndimsmall - y_\ndimsmall\right| $$

While Euclidean and Manhattan distance are an aggregate over differences along all dimensions, the Chebyshev distance is the difference along the dimension that is most different among the two vectors.

Weighted Minkowski distance

The Minkowski distance gives equal weighting to each dimension. Some tasks may require specialized weighting per dimension, for example if some dimensions are more important than others. The generalization of Minkowski distance that enables such per-dimension weighting is known as weighted Minkowski distance.

For real-valued vectors $ \vx \in \real^\ndim $ and $ \vy \in \real^\ndim $ and weights $ \vw \in \real^\ndim $, the weighted Minkowski distance is computed as

$$ d_{\text{weightedMinkowski}}(\vx, \vy, \vw) = \left[ \sum_{\ndimsmall=1}^{\ndim} w_\ndimsmall \left|x_\ndimsmall - y_\ndimsmall\right|^p \right]^{\frac{1}{p}} $$

Cosine similarity

Cosine similarity is particularly popular in text-mining approaches that utilize the bag-of-words strategy for modeling text, for example to build text classifiers or for information retrieval.

The cosine similarity of two vectors is computed as the cosine of the angle of the angle between the two vectors. For real-valued vectors $ \vx \in \real^\ndim $ and $ \vy \in \real^\ndim $, it is computed as

$$ \text{cosim}(\vx, \vy) = \cos \theta $$

where, $ \theta $ is the angle between the vectors $ \vx $ and $ \vy $.

\begin{equation} \text{cosim}(\vx, \vy) = \frac{\vx \cdot \vy}{\norm{\vx}{2} \norm{\vy}{2}} \label{eqn:cosine-similarity} \end{equation}

Being a normalized score, a cosine of the angle between two vectors, the cosine similarity is bounded in the region $ [0,1] $.

Relationship: Cosine similiarty and Euclidean distance

Euclidean distance can be written in terms of cosine similarity through these simple transformations.

\begin{aligned} d_{\text{Euclidean}}(\vx, \vy) &= \sqrt{\norm{\vx - \vy}{2}} \\\\ &= \sqrt{\norm{\vx}{2}^2 + \norm{\vy}{2}^2 - 2\vx \cdot \vy} \\\\ &= \sqrt{\norm{\vx}{2}^2 + \norm{\vy}{2}^2 - 2\norm{\vx}{2}\norm{\vy}{2} \cos \theta} \\\\ \end{aligned}

If $ \vx $ and $ \vy $ are unit vectors, then

$$ d_{\text{Euclidean}}(\vx, \vy) = \sqrt{2(1 - \cos \theta)} $$

This means, for unit vectors, cosine similarity is akin to the Euclidean distance. For unit vectors, high cosine similarity will imply low Euclidean distance and vice versa. When magnitudes of vectors should not matter, we should use the magnitude-invariant metric — the cosine similarity metric. When they do, we should use the magnitude-respecting Euclidean distance.

Mahalanobis distance

Euclidean distance gives equal weighting to every dimension. In some settings, it might be useful to provide preferential weighting to certain dimensions.

The Mahalanobis distance weighs each dimension by the inverse of the variance in the data along that dimension. This leads to a scale-invariant distance because dimensions with higher variance (stretch or scaling) have lowered contributions to the overall distance.

For real-valued vectors $ \vx \in \real^\ndim $ and $ \vy \in \real^\ndim $, their Mahalanobis distance is computed as

$$ d_{\text{Mahalanobis}}(\vx, \vy, \mSigma) = \sqrt{(\vx - \vy)\mSigma^{-1}(\vx - \vy)} $$

where, $ \mSigma $ is the covariance of the data containing these vectors.

In the case of a diagonal covariance matrix, we ignore interactions among dimensions, $ \mSigma = \vsigma^2 \mI $. Here, the vector $ \vsigma^2 = [\sigma_1^2, \ldots, \sigma_\ndim^2] $,

\begin{aligned} d_{\text{Mahalanobis}}(\vx, \vy, \vsigma^2 \mI) &= \sqrt{(\vx - \vy)\mSigma^{-1}(\vx - \vy)} \\\\ &= \sqrt{\sum_{\ndimsmall=1}^{\ndim} \frac{\left(x_\ndimsmall - y_\ndimsmall\right)^2}{\sigma_\ndimsmall^2}} \end{aligned}

where, $ \sigma_\ndimsmall $ is the variance of the sample set over the $ \ndimsmall $-th dimension.

Relationship: Mahalanobis distance and Euclidean distance

Just like cosine similarity is a magnitude-invariant version of Euclidean distance, the Mahalanobis is a scale-invariant version of Euclidean distance.

For real-valued vectors $ \vx \in \real^\ndim $ and $ \vy \in \real^\ndim $, consider first standardizing the data. This is typically done by subtracting the mean $ \mu_\ndimsmall $ for each dimension $ \ndimsmall $ and then dividing by the corresponding standard deviation $ \sigma_\ndimsmall $. Standardization results in the updated vectors $ \dash{\vx} $ and $ \dash{\vy} $ as

$$ \dash{\vx} = [\frac{x_1 - \mu_1}{\sigma_1}, \ldots, \frac{x_\ndim - \mu_\ndim}{\sigma_\ndim}] $$ $$ \dash{\vy} = [\frac{y_1 - \mu_1}{\sigma_1}, \ldots, \frac{y_\ndim - \mu_\ndim}{\sigma_\ndim}] $$

The Euclidean distance of these normalized vectors is

\begin{aligned} d_{\text{Euclidean}}(\dash{\vx}, \dash{\vy}) &= \sqrt{\norm{\dash{\vx} - \dash{\vy}}{2}} \\\\ &= \sqrt{\sum_{\ndimsmall=1}^{\ndim} \left(\dash{x}_\ndimsmall - \dash{y}_\ndimsmall\right)^2} \\\\ &= \sqrt{\sum_{\ndimsmall=1}^{\ndim} \left(\frac{x_\ndimsmall - \mu_\ndimsmall}{\sigma} - \frac{y_\ndimsmall - \mu_\ndimsmall}{\sigma}\right)^2} \\\\ &= \sqrt{\sum_{\ndimsmall=1}^{\ndim} \left(\frac{x_\ndimsmall - \mu_\ndimsmall - y_\ndimsmall + \mu_\ndimsmall}{\sigma}\right)^2} \\\\ &= \sqrt{\sum_{\ndimsmall=1}^{\ndim} \frac{(x_\ndimsmall - y_\ndimsmall)^2}{\sigma^2}} \\\\ &= d_{\text{Mahalanobis}}(\vx, \vy, \vsigma^2 \mI) \end{aligned}

Thus, the Mahalanobis distance with a diagonal covariance matrix $ \vsigma^2 \mI $ is effectively Euclidean distance between standardized vectors — vectors that have been transformed by subtracting their mean and divided by their standard deviation. Therefore, Mahalanobis distance with a diagonal covariance matrix is also known as standardized Euclidean distance.

Please support us

Help us create more engaging and effective content and keep it free of paywalls and advertisements!

Please donate

Let's connect

Please share your comments, questions, encouragement, and feedback.