Introduction to loss functions for classification


\(\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\asterisk}{\ast} \newcommand{\sup}{\text{sup}} \newcommand{\inf}{\text{inf}} \newcommand{\min}{\text{min}\;} \newcommand{\max}{\text{max}\;} \newcommand{\maxunder}[1]{\underset{#1}{\max}} \newcommand{\minunder}[1]{\underset{#1}{\min}} \newcommand{\real}{\mathbb{R}} \newcommand{\natural}{\mathbb{N}} \newcommand{\integer}{\mathbb{Z}} \newcommand{\rational}{\mathbb{Q}} \newcommand{\irrational}{\mathbb{I}} \newcommand{\complex}{\mathbb{C}} \newcommand{\cardinality}[1]{|#1|} \newcommand{\vec}[1]{\mathbf{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\star}[1]{#1^*} \newcommand{\inv}[1]{#1^{-1}} \newcommand{\indicator}[1]{\mathcal{I}(#1)} \renewcommand{\BigO}[1]{\mathcal{O}(#1)} \renewcommand{\BigOsymbol}{\mathcal{O}} \renewcommand{\smallo}[1]{\mathcal{o}(#1)} \renewcommand{\smallosymbol}[1]{\mathcal{o}} \newcommand{\set}[1]{\mathbb{#1}} \newcommand{\complement}[1]{#1^c} \newcommand{\powerset}[1]{\mathcal{P}(#1)} \newcommand{\setdiff}{\setminus} \newcommand{\setsymmdiff}{\oplus} \newcommand{\dash}[1]{#1^{'}} \newcommand{\permutation}[2]{{}_{#1} \mathrm{ P }_{#2}} \newcommand{\combination}[2]{{}_{#1} \mathrm{ C }_{#2}} \newcommand{\prob}[1]{P(#1)} \newcommand{\pmf}[1]{P(#1)} \newcommand{\pdf}[1]{p(#1)} \newcommand{\cdf}[1]{F(#1)} \newcommand{\expect}[2]{E_{#1}\left[#2\right]} \newcommand{\entropy}[1]{\mathcal{H}\left[#1\right]} \newcommand{\expe}[1]{\mathrm{e}^{#1}} \newcommand{\textexp}[1]{\text{exp}\left(#1\right)} \def\independent{\perp\!\!\!\perp} \def\notindependent{\not\!\independent} \newcommand{\yhat}{\hat{y}} \newcommand{\vs}{\vec{s}} \newcommand{\vt}{\vec{t}} \newcommand{\vu}{\vec{u}} \newcommand{\vv}{\vec{v}} \newcommand{\vw}{\vec{w}} \newcommand{\vx}{\vec{x}} \newcommand{\vy}{\vec{y}} \newcommand{\vz}{\vec{z}} \newcommand{\va}{\vec{a}} \newcommand{\vb}{\vec{b}} \newcommand{\vc}{\vec{c}} \newcommand{\vd}{\vec{d}} \newcommand{\ve}{\vec{e}} 
\newcommand{\vg}{\vec{g}} \newcommand{\vh}{\vec{h}} \newcommand{\vi}{\vec{i}} \newcommand{\vk}{\vec{k}} \newcommand{\vo}{\vec{o}} \newcommand{\vp}{\vec{p}} \newcommand{\vq}{\vec{q}} \newcommand{\vr}{\vec{r}} \newcommand{\vs}{\vec{s}} \newcommand{\vmu}{\vec{\mu}} \newcommand{\vsigma}{\vec{\sigma}} \newcommand{\vphi}{\vec{\phi}} \newcommand{\vtau}{\vec{\tau}} \newcommand{\vtheta}{\vec{\theta}} \newcommand{\mA}{\mat{A}} \newcommand{\mB}{\mat{B}} \newcommand{\mC}{\mat{C}} \newcommand{\mD}{\mat{D}} \newcommand{\mE}{\mat{E}} \newcommand{\mH}{\mat{H}} \newcommand{\mK}{\mat{K}} \newcommand{\mP}{\mat{P}} \newcommand{\mQ}{\mat{Q}} \newcommand{\mR}{\mat{R}} \newcommand{\mS}{\mat{S}} \newcommand{\mU}{\mat{U}} \newcommand{\mV}{\mat{V}} \newcommand{\mW}{\mat{W}} \newcommand{\mX}{\mat{X}} \newcommand{\mY}{\mat{Y}} \newcommand{\mZ}{\mat{Z}} \newcommand{\mI}{\mat{I}} \newcommand{\mLambda}{\mat{\Lambda}} \newcommand{\mSigma}{\mat{\Sigma}} \newcommand{\mTheta}{\mat{\theta}} \newcommand{\setsymb}[1]{#1} \newcommand{\sA}{\setsymb{A}} \newcommand{\sB}{\setsymb{B}} \newcommand{\sC}{\setsymb{C}} \newcommand{\sO}{\setsymb{O}} \newcommand{\sP}{\setsymb{P}} \newcommand{\sQ}{\setsymb{Q}} \newcommand{\sH}{\setsymb{H}} \newcommand{\sX}{\setsymb{X}} \newcommand{\sY}{\setsymb{Y}} \newcommand{\norm}[2]{||{#1}||_{#2}} \newcommand{\infnorm}[1]{\norm{#1}{\infty}} \newcommand{\fillinblank}{\text{ }\underline{\text{ ? 
}}\text{ }} \newcommand{\lbrace}{\left\{} \newcommand{\rbrace}{\right\}} \newcommand{\set}[1]{\lbrace #1 \rbrace} \newcommand{\seq}[1]{\left( #1 \right)} \newcommand{\ndim}{N} \newcommand{\ndimsmall}{n} \newcommand{\dataset}{\mathbb{D}} \newcommand{\ndata}{D} \newcommand{\ndatasmall}{d} \newcommand{\labeledset}{\mathbb{L}} \newcommand{\nlabeled}{L} \newcommand{\nlabeledsmall}{l} \newcommand{\unlabeledset}{\mathbb{U}} \newcommand{\nunlabeled}{U} \newcommand{\nunlabeledsmall}{u} \newcommand{\nclass}{M} \newcommand{\nclasssmall}{m} \newcommand{\loss}{\mathcal{L}} \newcommand{\sign}{\text{sign}} \newcommand{\Gauss}{\mathcal{N}} \newcommand{\hadamard}{\circ} \newcommand{\doh}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\dox}[1]{\doh{#1}{x}} \newcommand{\doy}[1]{\doh{#1}{y}} \newcommand{\doxx}[1]{\doh{#1}{x^2}} \newcommand{\doyy}[1]{\doh{#1}{y^2}} \newcommand{\doxy}[1]{\frac{\partial #1}{\partial x \partial y}} \newcommand{\doyx}[1]{\frac{\partial #1}{\partial y \partial x}} \newcommand{\qed}{\tag*{$\blacksquare$}}\)

        Loss functions for training classification models
        Machine Learning
      

Introduction

Here's a common recipe for training classifiers: acquire a labeled data set (the training set), define a loss function, and then adapt the classifier parameters to minimize the loss function on the training set. The result is a trained classifier. In this article, we focus on loss functions for learning classification models.

Prerequisites

To understand the various loss functions for classification, we recommend familiarity with the concepts in

Follow the above links to first get acquainted with the corresponding concepts.

Problem setting

In classification, the goal of the predictive model is to assign a given instance to the appropriate class. In this analysis, we will assume that the instances have real-valued features. Loss functions typically work on the space of target variables, so the input feature space does not matter much for this presentation anyway.

Consider such an instance $ \vx \in \real^\ndim $, a vector consisting of $ N $ features, $\vx = [x_1, x_2, \ldots, x_\ndim] $. In classification scenarios, each instance belongs to one of the $ \nclass $ classes $ C_1, C_2, \ldots, C_\nclass $.

The classifier model is inferred over a collection of labeled observations provided as tuples $ (\vx_i,y_i) $ containing the instance vector $ \vx_i $ and the actual class label, the target variable, $ y_i $. This collection of labeled observations is known as the training set. We will denote the labeled training set as $ \labeledset = \set{(\vx_1,y_1), \ldots, (\vx_\nlabeled,y_\nlabeled)} $.

To distinguish between actual target values and predicted outputs, we will denote the actuals with $ y $ and the predicted as $ \yhat $, where the $ \hat{} $ on $ y $ indicates that it is an estimate of the actual $ y $.

The function learned by the model is $ f: \real^\ndim \to \real $. It calculates the class score for an instance and based on this score the class label $ \yhat $ is predicted. For example, in binary classification, if the score $ f(\vx) $ is positive, then $ \yhat = 1 $, else it is zero. As another example, in multiclass classification, each class may receive a score, and the instance is assigned to the class with the highest score, $ \yhat = \argmax_{C_\nclasssmall \in \set{C_1,\ldots,C_\nclass}} f_{C_\nclasssmall}(\vx) $.
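The two decision rules above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the function names `predict_binary` and `predict_multiclass` are hypothetical, and the scores are assumed to be plain Python floats.

```python
def predict_binary(score):
    """Return 1 if the score f(x) is positive, else 0."""
    return 1 if score > 0 else 0


def predict_multiclass(scores):
    """Return the index of the class with the highest score (argmax)."""
    return max(range(len(scores)), key=lambda m: scores[m])


print(predict_binary(0.7))                   # 1
print(predict_binary(-1.2))                  # 0
print(predict_multiclass([0.1, 2.3, -0.5]))  # 1
```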

Characteristics of a good loss function

As we described earlier, a classifier is trained by adapting its parameters to minimize the value of the chosen loss function on the training set.

A loss function is a function over actual targets and the corresponding predicted outputs in the training set. In general, it is an aggregate over pairwise comparisons of predicted output values and the actual desired values.

A good loss function has some desirable characteristics. Above all, it should closely imitate the metrics that the classifier will be evaluated on. Refer to our comprehensive overview of evaluation metrics for classification. For example, if the end goal is to evaluate the model by precision or recall, then it is futile to train the model by maximizing accuracy.

The $0-1$ loss

The simplest loss function is the $ 0-1 $ loss, read as zero one loss. It is computed as

\begin{aligned} \loss_{01 \text{loss}}(y, \yhat) = \indicator{y \ne \yhat} \end{aligned}

where, $ \indicator{a} $ is the indicator function that takes on the value 1 if $ a $ is true, and 0 otherwise. In this formula, the value will be $ 1 $ when the prediction is incorrect, $ y \ne \yhat $, and zero otherwise.

Calculating over the entire training set is easy: just sum over the loss of each pair.

\begin{aligned} \loss_{01 \text{loss}} = \sum_{\nlabeledsmall=1}^{\nlabeled} \indicator{y_\nlabeledsmall \ne \yhat_\nlabeledsmall} \end{aligned}

Minimizing the zero-one loss is equivalent to minimizing the number of incorrect predictions, in other words, maximizing the accuracy of the classifier.
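As a small sketch of this relationship, the snippet below counts the mistakes on a made-up set of labels and derives the accuracy from that count. The function name `zero_one_loss` and the example labels are assumptions chosen for illustration.

```python
def zero_one_loss(y_true, y_pred):
    """Sum of indicator(y != y_hat) over all pairs: the number of mistakes."""
    return sum(1 for y, y_hat in zip(y_true, y_pred) if y != y_hat)


# Hypothetical labels and predictions for five instances.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]

errors = zero_one_loss(y_true, y_pred)
accuracy = 1 - errors / len(y_true)  # fewer errors means higher accuracy
print(errors)  # 2
```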

Although simple and intuitive, the 0-1 loss poses one big challenge: because of the indicator function, it is not differentiable where the prediction flips and is constant everywhere else, so it provides no useful gradient. This severely limits its use in gradient-based optimization strategies for training. Moreover, its value is either zero or one, so all incorrect predictions receive the same loss of $ 1 $. Ideally, the more incorrect a prediction is, the higher its loss should be.

Hinge loss

The hinge loss function is calculated on the score $ f(\vx) $ of the class, as opposed to the final prediction $ \yhat $. It attempts to rectify the shortcomings of the zero-one loss for binary classification by assigning incorrect predictions a loss that grows linearly with the magnitude of the score on the wrong side. It is computed as

\begin{aligned} \loss_{\text{hinge}}(y, f(\vx)) = \max(0,1 - yf(\vx)) = [1-yf(\vx)]_{+} \end{aligned}

Simply put, the hinge loss is equal to the positive part of $ 1 - yf(\vx) $: it equals $ 1 - yf(\vx) $ when that quantity is positive, and zero otherwise. Hence the notation $ [1 - yf(\vx)]_{+} $.

This formulation applies when the actual class labels are encoded as the positive and negative class (and not as 0 or 1), that is, $ y \in \set{-1,1} $. It is the loss function used in support vector machines.
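A minimal sketch of the per-instance hinge loss, assuming labels in $ \set{-1,1} $ and a real-valued score. The function name `hinge_loss` and the example scores are illustrative choices, not part of any particular library.

```python
def hinge_loss(y, score):
    """Hinge loss max(0, 1 - y * f(x)) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)


print(hinge_loss(+1, 2.0))  # 0.0, correct and beyond the margin
print(hinge_loss(+1, 0.5))  # 0.5, correct but inside the margin
print(hinge_loss(-1, 2.0))  # 3.0, confidently wrong
```

Note that a correct prediction can still incur a loss when its score lies inside the margin, and that the loss of a wrong prediction grows with the magnitude of its score.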

Negative log-likelihood

Both the hinge loss and the $0-1$ loss are non-differentiable at specific points. A continuously differentiable alternative is the negative log-likelihood (NLL) loss function. A good model should assign a high log-likelihood to the data. Equivalently, the negative of the log-likelihood should be low for a good model.

Consider a binary classification problem with the classes $ C_1 $ and $ C_2 $. Let $ y_i = 1 $ if $ \vx_i $ belongs to the class $ C_1 $, and zero if it belongs to the class $ C_2 $. If $ \mTheta $ denotes the parameters of the model, then, the likelihood of the training data $ \labeledset $ given these parameters is

$$ P(\labeledset|\mTheta) = \prod_{\nlabeledsmall=1}^\nlabeled P(\yhat = 1|\vx_\nlabeledsmall,\mTheta)^{y_\nlabeledsmall} \left(1 - P(\yhat = 1|\vx_\nlabeledsmall,\mTheta)\right)^{1 - y_\nlabeledsmall} $$

As mentioned earlier, maximizing the log-likelihood is equivalent to minimizing the negative of the log-likelihood, the loss.

\begin{equation} \text{NLL}(\labeledset|\mTheta) = - \sum_{\nlabeledsmall=1}^{\nlabeled} \left[ y_\nlabeledsmall \log P(\yhat = 1|\vx_\nlabeledsmall,\mTheta) + \left(1 - y_\nlabeledsmall \right) \log \left(1 - P(\yhat = 1|\vx_\nlabeledsmall,\mTheta)\right) \right] \end{equation}

Thus, the per-instance loss is \begin{aligned} \loss_{\text{NLL}}(y_\nlabeledsmall, \mTheta) = - y_\nlabeledsmall \log P(\yhat = 1|\vx_\nlabeledsmall,\mTheta) - \left(1 - y_\nlabeledsmall \right) \log \left(1 - P(\yhat = 1|\vx_\nlabeledsmall,\mTheta)\right) \end{aligned}

Negative log-likelihood is the loss function used for training the logistic regression classifier. (Strictly speaking, logistic regression is usually described as maximizing the log-likelihood, but that is the same thing.) Owing to its differentiability, it is also commonly used as a loss function in training deep neural networks.
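The sum in the NLL formula can be sketched directly. This is a minimal illustration that assumes labels in $ \set{0,1} $ and predicted probabilities strictly between 0 and 1. The function name `nll` and the example numbers are made up for the demonstration.

```python
import math


def nll(y_true, p_pos):
    """Negative log-likelihood for labels in {0, 1} and predicted P(y=1|x)."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, p_pos)
    )


y_true = [1, 0, 1]
confident = [0.9, 0.1, 0.8]  # probabilities close to the actual labels
hesitant = [0.6, 0.4, 0.5]   # probabilities close to 0.5

# Predictions that agree more strongly with the labels get a lower NLL.
print(nll(y_true, confident) < nll(y_true, hesitant))  # True
```

Exponentiating the negated result recovers the likelihood of the data, here $ 0.9 \times 0.9 \times 0.8 $ for the confident predictions.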

Binary cross entropy

The negative log-likelihood is differentiable, but it requires probabilistic outputs for estimating $ P(\yhat=1|\mTheta) $ and $ P(\yhat=0|\mTheta) $. An alternative with a similar formulation for non-probabilistic models is the binary cross-entropy (BCE) loss, which works directly with the score of the positive class $ f(\vx,\mTheta) $, where $ \mTheta $ is the set of parameters of the model. For the logarithms in the formula to be defined, this score must lie between 0 and 1.

The per-instance BCE Loss is \begin{aligned} \loss_{\text{BCE}}(y_\nlabeledsmall, \vx_\nlabeledsmall, \mTheta) = - y_\nlabeledsmall \log f(\vx_\nlabeledsmall,\mTheta) - \left(1 - y_\nlabeledsmall \right) \log \left(1 - f(\vx_\nlabeledsmall,\mTheta)\right) \end{aligned}

Just like other losses, the per-instance BCE loss is aggregated as a sum over all the examples in the training set (or minibatch, if optimizing using minibatch stochastic gradient descent).

Note that the formulation is similar to the negative log-likelihood loss, with the probability $ P(\yhat|\mTheta) $ now replaced with $ f(\vx_\nlabeledsmall,\mTheta) $, the score of the positive class.

For a training instance $ \vx_\nlabeledsmall $, only one of the two terms is active, depending on the value of $ y_\nlabeledsmall $, and the other becomes zero. Suppose $ f_{y}(\vx,\mTheta) $ denotes the score of the model for the class $ y $, so that $ \sum_{y \in \set{0,1}} f_y(\vx,\mTheta) = 1 $. In this case, we can rewrite the BCE loss as

\begin{aligned} \loss_{\text{BCE}}(y_\nlabeledsmall, \vx_\nlabeledsmall, \mTheta) = - \log f_{y_\nlabeledsmall}(\vx_\nlabeledsmall,\mTheta) \label{eqn:bce-alternative} \end{aligned}

It is called binary cross-entropy because its formulation is similar to the cross-entropy between two discrete distributions $ p $ and $ q $, calculated as $ \entropy{p,q} = -\sum_{x \in \mathcal{X}} p(x) \log q(x) $. In our case, the cross-entropy is between the distribution of $ y $ and the distribution of $ f(\vx,\mTheta) $.
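The connection can be made explicit with a short worked step. As an illustrative assumption, treat the label as a distribution $ p $ over the two classes with $ p(1) = y $ and $ p(0) = 1 - y $, and the model output as a distribution $ q $ with $ q(1) = f(\vx,\mTheta) $ and $ q(0) = 1 - f(\vx,\mTheta) $. Then

\begin{aligned} \entropy{p,q} &= -\sum_{c \in \set{0,1}} p(c) \log q(c) \\ &= - y \log f(\vx,\mTheta) - \left(1 - y\right) \log \left(1 - f(\vx,\mTheta)\right) \end{aligned}

which matches the per-instance BCE loss.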

The BCE loss is differentiable, but has a numerical challenge. If $ f(\vx,\mTheta) = 0 $ or $ f(\vx, \mTheta) = 1 $, then one of the terms becomes $ \log 0 $, which is mathematically undefined. Some packages like PyTorch get over this issue by setting $ \log 0 = -\infty $ and then clamping the output of the logarithm to be greater than or equal to $ -100 $, an arbitrary choice that works just fine.
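The clamping idea can be sketched in plain Python. This is only an illustration of the technique, not PyTorch's actual implementation: the names `LOG_FLOOR`, `clamped_log`, and `bce_loss` are hypothetical, and the score is assumed to lie in the closed interval from 0 to 1.

```python
import math

LOG_FLOOR = -100.0  # the floor value mentioned in the text


def clamped_log(value):
    """log(value), with log(0) treated as -inf and then clamped to LOG_FLOOR."""
    log_value = math.log(value) if value > 0.0 else -math.inf
    return max(log_value, LOG_FLOOR)


def bce_loss(y, score):
    """Per-instance BCE for a label y in {0, 1} and a score in [0, 1]."""
    return -(y * clamped_log(score) + (1 - y) * clamped_log(1.0 - score))


# A positive instance scored exactly 0 gets a large but finite loss.
print(bce_loss(1, 0.0))  # 100.0
```

Without the clamp, the inactive term could evaluate to zero times negative infinity, which is not a number, and the active term could be infinite. Bounding the logarithm keeps both the loss and its use in further arithmetic finite.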

Cross-entropy

A natural extension of the binary cross entropy loss to multiclass problems is the cross-entropy loss.

Typically, multi-class classifiers, such as those implemented with neural networks, will arrive at a score for each class $ \nclasssmall=1,\ldots,\nclass $ as $ f_{\nclasssmall}(\vx,\mTheta) $. The cross-entropy loss first calculates the softmax of these scores to normalize them, so that they sum to 1. Then, the negative logarithm of the softmax value of the actual class is used as the loss.

The per-instance cross-entropy loss is calculated as

\begin{aligned} \loss_{\text{cross-entropy}}(y_\nlabeledsmall, \vx_\nlabeledsmall, \mTheta) = - \log \frac{\textexp{f_{y_\nlabeledsmall}(\vx_\nlabeledsmall,\mTheta)}}{\sum_{\nclasssmall=1}^{\nclass} \textexp{f_{C_\nclasssmall}(\vx_\nlabeledsmall,\mTheta)}} \end{aligned}

Note that this formulation is similar to the alternative BCE loss formulation that we described in Equation \eqref{eqn:bce-alternative}.
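A minimal sketch of the per-instance cross-entropy loss follows. It computes the same quantity as the formula above, but subtracts the largest score before exponentiating. This is a common stability trick added here as an assumption, not something the formula itself prescribes. The function name `cross_entropy_loss` and the example scores are illustrative.

```python
import math


def cross_entropy_loss(scores, true_class):
    """-log softmax(scores)[true_class], computed via a stable log-sum-exp."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    log_sum_exp = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return log_sum_exp - scores[true_class]


scores = [2.0, 1.0, 0.1]  # hypothetical scores for three classes

# The loss is lower when the actual class is the one with the highest score.
print(cross_entropy_loss(scores, 0) < cross_entropy_loss(scores, 2))  # True
```

With two classes and equal scores, the softmax assigns probability one half to each class, so the loss equals $ \log 2 $.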

In modern deep learning, the cross-entropy loss is the default recommendation for a good differentiable loss function for training classifiers.
