Adaboost

Machine Learning

Introduction

Boosting is a strategy to learn an accurate predictive model by intelligently forming an ensemble of weak models. Boosting is a powerful general training strategy that works both for classification and regression problems. In this article, we will focus on the earliest of boosting strategies, the AdaBoost model.

Prerequisites

To understand the AdaBoost classifier, we recommend familiarity with the concepts in

  • Introduction to machine learning: An introduction to basic concepts in machine learning such as classification, training instances, features, and feature types.

Follow the above links to first get acquainted with the corresponding concepts.

Problem setting

AdaBoost was originally designed for classification tasks. Recent boosting approaches extend to regression tasks too. In this article, we will focus on the classification scenario, specifically, binary classification.

Consider an instance \( \vx \in \real^\ndim \), a vector consisting of \( \ndim \) features, \(\vx = [x_1, x_2, \ldots, x_\ndim] \).

In classification, the goal of the predictive model is to identify the class that generated a particular instance. The two classes in the binary classification scenario will be represented by the scalar variable \( y \in \set{-1,1} \) that takes on values \( 1 \) and \( -1 \), representing the two classes.

The model is inferred over a collection of labeled observations provided as tuples \( (\vx_i,y_i) \) containing the instance vector \( \vx_i \) and the true target variable \( y_i \). This collection of labeled observations is known as the training set \( \labeledset = \set{(\vx_1,y_1), \ldots (\vx_\nlabeled,y_\nlabeled)} \).

Weak classifier

A weak classifier is one whose prediction accuracy is only slightly better than random guessing. An example weak classifier, the one used in AdaBoost, is the decision stump — a decision tree with a single node and two leaves.

Consider such a classifier \( H \) that predicts the category of an instance \( \vx ) as

$$ \yhat_H = H(\vx) $$

where \( \yhat_H \in \set{-1,1} \).

With this predictive model, the error rate on the training set \( \labeledset \) is

$$ \text{err}_{H} = \frac{1}{\nlabeled} \sum_{i=1}^{\nlabeled} \indicator{y_i \ne H(\vx_i) } $$

where, the indicator function \( \indicator{y_i \ne H(\vx_i) } \) takes on the value \( 1 \) when the condition is matched — that is when the prediction is incorrect.

The purpose of boosting is to sequentially train such classifiers on modified versions of the training set to produce a sequence of classifiers \( H_m \) for \( m \in \set{1, \ldots, M} \).

Predictive model

The collective predictive model of this ensemble is a linear combination of the predictions of the individual classifiers in the ensemble.

\begin{equation} \yhat = \sign\left( \sum_{m=1}^M \alpha_m H_m(\vx) \right) \label{eqn:class-pred} \end{equation}

where \( \alpha_m \) is the weight assigned to the classifier \( H_m \) during the training process.

Training AdaBoost

Here's the algorithm for Adaboost for learning an ensemble classifier from weak learners \( \yhat = H(\vx) \) such that \( \yhat \in \set{-1,1} \).

Initialization

First, initialize observation weights to be equal for all training examples

$$ w_\nlabeledsmall^{(1)} = \frac{1}{\nlabeled}, \forall \nlabeledsmall \in \set{1,\ldots,\nlabeled} $$

Train sequence of weak classifiers

Repeat the following steps in that order for \( m = 1 \text{ to } M :\)

  1. Fit a classifier \( H_m(\vx) \) to the training data using the weights \(w_\nlabeledsmall^{(m)}\)
  2. Compute error of classifier \( H_m \) as $$ \text{err}_m = \frac{\sum_{\nlabeledsmall=1}^\nlabeled w_\nlabeledsmall^{(m)} \indicator{y_\nlabeledsmall \ne H_m(\vx_\nlabeledsmall)}}{\sum_{\nlabeledsmall=1}^\nlabeledsmall w_\nlabeledsmall^{(m)}} $$
  3. Compute the weight of the classifier to be inversely proportional to its error. $$ \alpha_m = \log \left( \frac{1 - \text{err}_m}{\text{err}_m} \right) $$
  4. Update observation weights to give higher weights to misclassified examples $$ w_\nlabeledsmall^{(m+1)} \leftarrow w_\nlabeledsmall^{(m)} \exp\left[ \alpha_m \indicator{y_\nlabeledsmall \ne H_m(\vx_\nlabeledsmall)} \right], \forall \nlabeledsmall = 1,\ldots,\nlabeled $$

Final classifier

Output final classifier

$$ H(\vx) = \sign\left(\sum_{m=1}^M \alpha_m H_m(\vx)\right) $$

Adaboost training demo

Let us understand the predictive model of the Adaboost classifier on some data. Choose the maximum number of boosting iterations (the number of weak classifiers in the ensemble), \( M \), by adjusting the slider. Note that if the classifier has zero error, it will terminate training and return, providing the simplest possible boundary.

Note that the overall predictive model boundaries are mostly axis aligned. To achieve a curvature in the separation boundary, we need to increase the number of boosting iterations.

Adaboost classifier fit to data

Why exponential term for weighing observations?

Note the nature of the observation weights in the training algorithm.

$$ w_\nlabeledsmall^{(m+1)} \leftarrow w_\nlabeledsmall^{(m)} \exp \left[ -\alpha_m y_\nlabeledsmall H_m(\vx_\nlabeledsmall) \right],~~~~ \forall \nlabeledsmall = 1,\ldots,\nlabeled $$

When this term was originally proposed, it's primary motivation was computational ease. Why?

Suppose the overall classifier after the first \( m \) weak classifiers have been trained is \( F_m \).

$$ F_m(\vx) = \sign\left( \sum_{t=1}^m \alpha_t H_t(\vx) \right) $$

The error of this classifier on an example \( \vx_\nlabeledsmall \) is

$$ \text{err}_{F_m}(\vx_\nlabeledsmall) = \indicator{y_\nlabeledsmall \ne F_m(\vx_\nlabeledsmall)} $$

Note that \( y_\nlabeledsmall \in \set{-1,1} \) and \( F_m(\vx_\nlabeledsmall) \in \set{-1,1} \). Whenever \( y_\nlabeledsmall \ne F_m(\vx_\nlabeledsmall) \), it is the case that \( y_\nlabeledsmall F_m(\vx_\nlabeledsmall) \le 0 \).

Thus, the error of the overall classifier \( F_m \) on an example \( \vx_\nlabeledsmall \) after \( m \) weak classifiers have been trained can be written as

$$ \text{err}_{F_m}(\vx_\nlabeledsmall) = \indicator{y_\nlabeledsmall F_m(\vx_\nlabeledsmall) \le 0} $$

Substituting \( F_m \) in terms of the constituent weak classifiers, we get

$$ \text{err}_{F_m}(\vx_\nlabeledsmall) = \indicator{y_\nlabeledsmall \sum_{t=1}^m \alpha_t H_t(\vx_\nlabeledsmall) \le 0} $$

Where, we have ignored the \( \sign \) function as the outcome will be independent of that.

Now, it is the case that for any \( p \in \real \)

$$ \indicator{p \le 0} \le \text{ exp}\left(-p \right) $$

Thus, the error of classifier \( F_m \) is upper bounded as follows.

$$ \text{err}_{F_m}(\vx_\nlabeledsmall) \le \text{exp} \left[-y_\nlabeledsmall \sum_{t=1}^m \alpha_t H_t(\vx_\nlabeledsmall) \right] $$

Being an exponential term, this is equivalent to

$$ \text{err}_{F_m}(\vx_\nlabeledsmall) \le \text{exp} \left[-y_\nlabeledsmall \sum_{t=1}^{m-1} \alpha_t H_t(\vx_\nlabeledsmall) \right] \text{exp} \left[-y_\nlabeledsmall \alpha_m H_m(\vx_\nlabeledsmall) \right] $$

This means,

$$ \text{err}_{F_m}(\vx_\nlabeledsmall) \le \text{err}_{F_{m-1}}(\vx_\nlabeledsmall) \text{exp} \left[-\alpha_m y_\nlabeledsmall H_m(\vx_\nlabeledsmall) \right] $$

If the weight of an instance \( \vx_\nlabeledsmall \) is to be proportional to the error of the current classifier, as intuitively desired in boosting, it would then be the case that

$$ w_\nlabeledsmall^{(m+1)} \leftarrow w_\nlabeledsmall^{(m)} \text{ exp}\left[ - \alpha_m y_\nlabeledsmall H_m(\vx_\nlabeledsmall) \right] $$

That is how you arrive at the exponential term for the observation weights. In other words, the exponential term helps in a modularized recursive weight updates that depend on the weights from the previous iteration. If not for the exponential term, some other score may require the calculation of the overall classifier error after each weak classifier has been trained, making it computationally expensive.

Choosing the classifier weights \( \alpha_m \)

The nature of the classifier weights may seem an odd choice at first. Why not just \( \frac{1}{\text{err}_m} \), if all we want is inversely proportional to the error?

Instead, in the training procedure, we defined them as

$$ \alpha_m = \log \left( \frac{1 - \text{err}_m}{\text{err}_m} \right) $$

This form of the weights is the result of choosing the weights that minimize the error of the exponential error function of AdaBoost.

$$ \alpha_m = \argmin_{\alpha} \sum_{\nlabeledsmall=1}^{\nlabeled} w_\nlabeledsmall \text{ exp}\left[ \alpha_m \indicator{y_\nlabeledsmall \ne H_m(\vx_\nlabeledsmall)} \right] $$

This minimization results in the weights.

$$ \alpha_m = \frac{1}{2} \log \left( \frac{1 - \text{err}_m}{\text{err}_m} \right) $$

The additional factor \( \frac{1}{2} \) is ignored because it is the same for all classifiers in the linear combination.

Higher \( \alpha_m \) implies a more accurate classifier \( H_m \) and vice versa.

This formulation of the classifier weights only works in the case of classifiers with discrete outputs, where \( H_m(\vx) \in \set{-1,1} \). Therefore, the variant studied so far is known as Discrete AdaBoost.

The variant with probabilistic real-valued output is known as Real AdaBoost and provides the probability of the positive class such that \( H_m(\vx) = \prob{y=1|\vx} \). In this case, the weights turn out to be

$$ \alpha_m = \frac{1}{2} \log \left( \frac{\text{err}_m}{1 - \text{err}_m} \right) $$

Please share

Let your friends, followers, and colleagues know about this resource you discovered.

Subscribe for article updates

Stay up to date with new material for free.