Evaluation metrics for classification



An informed choice of a suitable metric can help define an appropriate loss to optimize the model for a given task during training. By rigorously evaluating the generalization performance of candidate models, using techniques such as cross-validation, we can identify the model that is superior to the others and choose it for deployment on the particular task. Making these informed choices during training and testing requires a clear understanding of evaluation metrics.

In this article, we cover the many metrics used to evaluate the performance of machine learning models for classification. We also comment on their suitability for various tasks and scenarios. We have a separate comprehensive article on evaluation metrics for regression.


To understand the various evaluation metrics for classification, we recommend familiarity with the concepts in

  • Introduction to machine learning: An introduction to basic concepts in machine learning such as classification, training instances, features, and feature types.

Follow the above links to first get acquainted with the corresponding concepts.

Problem setting

To evaluate the performance of a model, we compare the predictions of the model to the actual target values on a set of examples. If the set of examples has been used for training the model, then we are effectively measuring the performance on the training set. If the set of examples has not been used for training the model, the so-called unseen or held-out examples, then the metric measures the performance on the test set.

In regression, the goal of the predictive model is to predict a continuous valued output for a given multivariate instance. We need to predict a real-valued output \( \hat{y} \in \real \) that is as close as possible to the true target \( y \in \real \). The hat \( \hat{ } \) denotes that \( \hat{y} \) is an estimate, to distinguish it from the truth.

In classification, the goal of the predictive model is to identify the class that a particular instance belongs to. For example, the model may be required to choose one class among the set of \( \nclass \) classes, \( \set{C_1,\ldots,C_\nclass} \). In binary classification problems, the classifier chooses between two classes, typically \( \set{-1,1} \) or \( \set{0,1} \). In multi-labeled classification scenarios, the classifier predicts multiple categories for the same instance. Irrespective of the particular setting, from the evaluation perspective, we compare the predicted class labels to actual categories of those instances.


The most straightforward and commonly used evaluation metric for classification performance is the accuracy score. Accuracy is calculated as the fraction of predictions that are correct.

If there are \( \nlabeled \) instances in the evaluation set, then accuracy is computed as

\begin{aligned} \text{Accuracy} &= \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \\\\ &= \frac{\sum_{\nlabeledsmall=1}^{\nlabeled} \indicator{y_\nlabeledsmall == \yhat_\nlabeledsmall}}{\nlabeled} \end{aligned}

where \( \indicator{a} \) is the indicator function that takes on the value 1 if \( a \) is true, and 0 otherwise. In this formula, the indicator will be \( 1 \) when the true label \( y_\nlabeledsmall \) is equal to the predicted class label \( \yhat_\nlabeledsmall \).
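The formula above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name `accuracy` and the example labels are made up for this sketch.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("need at least one example")
    # The comparison plays the role of the indicator function:
    # it contributes 1 for a correct prediction and 0 otherwise.
    correct = sum(1 for y, y_hat in zip(y_true, y_pred) if y == y_hat)
    return correct / len(y_true)


# Hypothetical labels for a three-class problem.
y_true = ["cat", "dog", "bird", "dog", "cat"]
y_pred = ["cat", "dog", "dog", "dog", "bird"]

print(accuracy(y_true, y_pred))  # 3 of the 5 predictions are correct
```

Because the function only tests labels for equality, the same code works for binary and multiclass problems.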

Accuracy is suitable for evaluating multiclass and binary classification scenarios. It is also the default evaluation metric for such scenarios.

Nevertheless, accuracy may not be suitable under certain conditions.

  • If the classes are imbalanced. Accuracy assigns equal weight to each instance, not to each class, in the calculation of the score. This has the unintended consequence of the majority class dominating the score. For example, if \( 95\% \) of the examples belong to one class, then a classifier that assigns all examples to that dominant class will have an accuracy of \( 95\% \), even though it never identifies a single example of the minority class. Such a model is useless in practice.
  • In some classification scenarios, such as anomaly detection, we may need a more nuanced metric that specifically measures performance with respect to individual classes and particular errors. In those cases, an all encompassing metric such as accuracy is not suitable.
  • The accuracy metric should not be used for multi-labeled classification scenarios since that involves multiple predictions per instance. In the multi-labeled setting, \( \yhat_\nlabeledsmall \) and \( y_\nlabeledsmall \) are vectors.
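The class-imbalance pitfall is easy to reproduce. The sketch below uses made-up data with 95 negative and 5 positive examples and a trivial "classifier" that always predicts the majority class; all variable names are illustrative.

```python
# 95 negative examples and 5 positive ones (made-up data).
y_true = [-1] * 95 + [1] * 5

# A trivial "classifier" that always predicts the majority class.
y_pred = [-1] * len(y_true)

accuracy = sum(y == y_hat for y, y_hat in zip(y_true, y_pred)) / len(y_true)
positives_found = sum(
    y == 1 and y_hat == 1 for y, y_hat in zip(y_true, y_pred)
)

print(f"accuracy: {accuracy:.2f}")
print(f"positives found: {positives_found} of {y_true.count(1)}")
```

The accuracy looks high, yet none of the positive examples is detected, which is why accuracy alone can be misleading here.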

Contingency table

At the heart of all scores for binary classification performance is the concept of the contingency table. A contingency table is a 4-cell table that lists the number of instances that satisfy certain conditions.

For the design of the contingency table, we consider the binary classes to be \( \set{-1, 1} \), the negative and positive classes, respectively.

| | Actually positive: \(y = 1\) | Actually negative: \(y = -1\) |
| Predicted positive: \(\yhat = 1\) | True positive \( \text{TP} \) | False positive \( \text{FP} \) |
| Predicted negative: \(\yhat = -1\) | False negative \( \text{FN} \) | True negative \( \text{TN} \) |

The values of the cells \( \text{TP}, \text{TN}, \text{FP}, \text{FN} \) are calculated by counting the number of instances that fit the criteria for that particular cell.

  • True positives \(\text{TP}\): The number of positive examples that were also predicted to be positive. That is the count of examples where \( \yhat_\nlabeledsmall = 1 \) and \( y_\nlabeledsmall = 1 \).
  • True negatives \(\text{TN}\): The number of negative examples that were also predicted to be negative. That is the count of examples where \( \yhat_\nlabeledsmall = -1 \) and \( y_\nlabeledsmall = -1 \).
  • False positives \(\text{FP}\): The number of negative examples that were incorrectly predicted to be positive. That is the count of examples where \( \yhat_\nlabeledsmall = 1 \) and \( y_\nlabeledsmall = -1 \).
  • False negatives \(\text{FN}\): The number of positive examples that were incorrectly predicted to be negative. That is the count of examples where \( \yhat_\nlabeledsmall = -1 \) and \( y_\nlabeledsmall = 1 \).

Equipped with a filled-in contingency table, we can compute several useful evaluation metrics, as we shall see next. The accuracy metric for binary classification is simple to calculate in terms of the contingency table values as

$$ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}} $$
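As a rough sketch, the four cells can be counted directly from the definitions of TP, TN, FP, and FN given above. The function name `contingency_counts`, its `positive` parameter, and the labels below are assumptions made for this illustration.

```python
def contingency_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for binary labels.

    Any label other than `positive` is treated as the negative class.
    """
    tp = tn = fp = fn = 0
    for y, y_hat in zip(y_true, y_pred):
        if y_hat == positive and y == positive:
            tp += 1  # predicted positive, actually positive
        elif y_hat != positive and y != positive:
            tn += 1  # predicted negative, actually negative
        elif y_hat == positive:
            fp += 1  # predicted positive, actually negative
        else:
            fn += 1  # predicted negative, actually positive
    return tp, tn, fp, fn


# Made-up labels using the {-1, 1} convention.
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, -1, 1, -1, 1, -1, -1, 1]

tp, tn, fp, fn = contingency_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(tp, tn, fp, fn, accuracy)
```

The accuracy computed from the four counts agrees with the fraction of matching labels, since TP and TN together are exactly the correct predictions.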


The recall or true positive rate measures the fraction of examples that were correctly identified to be positive among all examples that are actually positive. With the contingency table defined earlier, the recall is computed as

$$ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$

In anomaly detection or fault detection scenarios, with the positive class denoting the anomalous examples, the recall score is also known as sensitivity. In the detection setting, intuitively, the recall measures how sensitive the model is in detecting anomalies or faults from among examples with normal operation.


The precision score measures how precisely the classifier identifies positive examples by avoiding the incorrect identification of negative examples as positive. It is calculated as the fraction of examples that are correctly predicted to be positive from among all the examples that have been predicted positive (correctly or incorrectly).

Using the contingency table, the precision score is calculated as

$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $$
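A small sketch of recall and precision computed from contingency counts follows. The function names and the counts are made up for illustration. Returning 0.0 when a denominator is zero is an assumption of this sketch, since the ratio is undefined in that case.

```python
def recall(tp, fn):
    """TP / (TP + FN): share of actual positives that were found."""
    actual_positives = tp + fn
    # Assumption: report 0.0 when there are no actual positives.
    return tp / actual_positives if actual_positives else 0.0


def precision(tp, fp):
    """TP / (TP + FP): share of predicted positives that were right."""
    predicted_positives = tp + fp
    # Assumption: report 0.0 when nothing was predicted positive.
    return tp / predicted_positives if predicted_positives else 0.0


# Made-up counts: 50 actual positives, 40 predicted positives.
tp, fp, fn = 30, 10, 20

print(f"recall:    {recall(tp, fn):.2f}")     # 30 / 50
print(f"precision: {precision(tp, fp):.2f}")  # 30 / 40
```

The two scores differ only in the denominator: recall divides by the actual positives, precision by the predicted positives.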


Both recall and precision are focused on getting a higher proportion of true positives. In problems where we need to evaluate from the perspective of the negative class, we compute specificity, also known as the true negative rate.

$$ \text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} $$

Equipped with specificity and sensitivity (recall), we can evaluate the classifier from the perspective of both classes.
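To illustrate why both perspectives matter, the sketch below evaluates a made-up classifier that always predicts the negative class on 95 negative and 5 positive examples. The function names are illustrative, and returning 0.0 for an empty denominator is an assumption of this sketch.

```python
def sensitivity(tp, fn):
    """True positive rate (recall): TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp) if (tn + fp) else 0.0


# Always-negative classifier on 95 negatives and 5 positives (made-up).
tp, fn, tn, fp = 0, 5, 95, 0

print(f"sensitivity: {sensitivity(tp, fn):.2f}")
print(f"specificity: {specificity(tn, fp):.2f}")
```

The perfect specificity and zero sensitivity together reveal that the classifier handles only the negative class, something a single accuracy number would hide.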

F1 score

Recall and precision are two separate scores. A single aggregate of the two is the \( F_1 \) score, more commonly written as the F1 score. It is computed as the harmonic mean of precision and recall.

\begin{aligned} F_1 &= \left(\frac{\text{Recall}^{-1} + \text{Precision}^{-1}}{2} \right)^{-1} \\\\ &= \frac{2}{\frac{1}{\text{Recall}} + \frac{1}{\text{Precision}}} \\\\ &= \frac{2~~\text{Precision}~~\text{Recall}}{\text{Recall}~ +~ \text{Precision}} \end{aligned}

By incorporating both measures, we get a single score to report and evaluate the performance of the model. In binary classification problems with imbalanced classes, it is better to evaluate the model in terms of its \( F_1 \) score instead of the accuracy score.
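A minimal sketch of the F1 computation, using the product-over-sum form \( 2 \cdot \text{Precision} \cdot \text{Recall} / (\text{Precision} + \text{Recall}) \). The function name and the input values are made up; returning 0.0 when both inputs are zero is an assumption of this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Assumption: define F1 as 0.0 when both scores are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.75, 0.6)
lopsided = f1_score(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2

print(f"balanced F1: {balanced:.3f}")
print(f"lopsided F1: {lopsided:.3f} (arithmetic mean: {arithmetic_mean:.3f})")
```

The lopsided case shows the effect of the harmonic mean: a very low recall drags the F1 score down far more than an arithmetic mean would, so a model cannot score well by excelling at only one of the two.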

\( F_\beta \) score

The \( F_1 \) score treats each constituent score (precision and recall) as equally important. A generalization of the \( F_1 \) score is the \( F_\beta \) score, which allows for preferential treatment of one of the constituent scores. It is calculated as a weighted harmonic mean, with a real-valued positive weight \( \beta \) treating the recall as \( \beta \) times more important than the precision.

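\begin{aligned} F_\beta &= \left(\frac{\beta^2~\text{Recall}^{-1} + \text{Precision}^{-1}}{1 + \beta^2} \right)^{-1} \\\\ &= \frac{(1 + \beta^2)~~\text{Precision}~~\text{Recall}}{\beta^2~\text{Precision}~ +~ \text{Recall}} \end{aligned}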
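The sketch below uses the commonly cited closed form \( F_\beta = (1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall} / (\beta^2 \cdot \text{Precision} + \text{Recall}) \), which reduces to the \( F_1 \) score for \( \beta = 1 \). The function name, the zero-denominator convention, and the input values are assumptions made for illustration.

```python
def f_beta_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        # Assumption: define the score as 0.0 when both inputs are zero.
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


precision, recall = 0.75, 0.6  # made-up scores

f_half = f_beta_score(precision, recall, beta=0.5)  # favors precision
f_one = f_beta_score(precision, recall, beta=1.0)   # same as F1
f_two = f_beta_score(precision, recall, beta=2.0)   # favors recall

print(f"F0.5={f_half:.3f}  F1={f_one:.3f}  F2={f_two:.3f}")
```

Since recall is the lower of the two scores in this example, increasing \( \beta \) pulls the aggregate toward recall and lowers it.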

Receiver operating characteristic (ROC) curve

Many classifiers can be made to work off a threshold. For example, in anomaly detection, if the anomaly scores computed by the predictive model for certain observations cross a certain threshold, then we may deem those observations to be anomalous. In the context of such threshold-based classifiers, a single precision/recall number may not provide the complete performance profile. Precision and recall are computed for a particular assignment of examples into positive and negative classes. By changing the threshold, the assignments will change. The receiver operating characteristic (ROC) curve allows the evaluation of such a complete performance profile of the model.

The ROC curve is the plot of the true positive rate (recall) against the false positive rate, which equals \( 1 - \text{specificity} \), for changing values of the threshold. We compute the recall and specificity for each value of the threshold between the maximum and minimum possible values. This exercise provides a list of paired recall and specificity values, which when plotted as recall against \( 1 - \text{specificity} \) result in the ROC curve.

For the highest value of the threshold, all examples are classified as negative. So the true negative rate is 1.0, and both the true positive rate and the false positive rate are 0.0. Conversely, for the lowest value of the threshold, all examples are classified as positive. So the true negative rate is 0.0, and both the true positive rate and the false positive rate are 1.0.

If the coordinates are indicated as tuples of paired values \( (\text{true positive rate}, \text{false positive rate}) \), then the diagonal line from \( (0,0) \) to \( (1,1) \) will be the ROC curve of a random model, one that assigns the scores randomly to observations. A good classifier should achieve a true positive rate higher than its false positive rate, so that, with the true positive rate plotted on the vertical axis, its ROC curve lies above this random line, implying learning capability that is better than random categorization. Among multiple good classifiers, the one with the ROC curve that passes closest to the coordinate \( (1,0) \) will be the winner.

In addition to being an evaluation metric, the ROC curve is also used to discover optimal thresholds. If giving equal weighting to recall and specificity, the threshold corresponding to the point along the ROC curve that is closest to the coordinate \( (1,0) \) is considered suitable.
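The threshold sweep and the closest-point heuristic can be sketched as follows. The sketch assumes that an example is predicted positive when its score is at or above the threshold, and it reports each point as (threshold, true positive rate, false positive rate), where the false positive rate is \( 1 - \text{specificity} \). The function names, labels, and scores are made up for illustration.

```python
import math


def roc_points(y_true, scores, positive=1):
    """Return (threshold, tpr, fpr) for each candidate threshold."""
    n_pos = sum(1 for y in y_true if y == positive)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    # Start above every score (nothing is predicted positive), then lower
    # the threshold through each distinct score.
    thresholds = [math.inf] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == positive)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y != positive)
        points.append((t, tp / n_pos, fp / n_neg))
    return points


def closest_to_ideal(points):
    """Pick the point nearest to (tpr, fpr) = (1, 0)."""
    return min(points, key=lambda p: math.hypot(1.0 - p[1], p[2]))


# Made-up labels and classifier scores.
y_true = [1, 1, 1, -1, 1, -1, -1, -1, -1]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

points = roc_points(y_true, scores)
for threshold, tpr, fpr in points:
    print(f"threshold={threshold} tpr={tpr:.2f} fpr={fpr:.2f}")

best = closest_to_ideal(points)
print("suggested threshold:", best[0])
```

The first point corresponds to classifying everything as negative and the last to classifying everything as positive. Euclidean distance to the ideal point is one simple way to weight recall and specificity equally; other selection criteria are possible.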

Area under the curve (AUC)

The ROC curve is comprehensive. To make it amenable to quantitative comparison, we need to condense it to a single number. The area under the curve (AUC) is simply the area under the ROC curve, a single value between 0 and 1. A random model scores about 0.5 and a perfect model scores 1. A higher value of AUC indicates a better classifier.
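One common way to approximate the area is the trapezoidal rule applied to the ROC points, with the false positive rate on the horizontal axis and the true positive rate on the vertical axis. The sketch below assumes the points are already sorted by increasing false positive rate; the function name and the example curves are made up.

```python
def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear curve through (fpr[i], tpr[i])."""
    if len(fpr) != len(tpr) or len(fpr) < 2:
        raise ValueError("need two equally long lists with at least 2 points")
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        if width < 0:
            raise ValueError("fpr must be sorted in non-decreasing order")
        # Trapezoid between two neighboring points on the curve.
        area += width * (tpr[i] + tpr[i - 1]) / 2
    return area


random_auc = auc_trapezoid([0.0, 1.0], [0.0, 1.0])             # diagonal
perfect_auc = auc_trapezoid([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])  # ideal corner
example_auc = auc_trapezoid([0.0, 0.0, 0.5, 1.0], [0.0, 0.5, 1.0, 1.0])

print(random_auc, perfect_auc, example_auc)
```

The diagonal of a random model and the curve of a perfect classifier give the two reference values, and a useful classifier falls between them.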
