Introduction
AdaBoost works with weak classifiers to sequentially build a more accurate ensemble. A similar strategy can be used to build an ensemble of strong models, such as tree-based models for classification and regression.
To understand the boosting approach to trees, we recommend familiarity with the concepts of decision trees, tree-based regression, and AdaBoost. Get acquainted with these concepts first before proceeding.
Tree boosting can be used for both classification and regression tasks.
Consider an instance \( \vx \in \real^\ndim \), a vector consisting of \( \ndim \) features, \(\vx = [x_1, x_2, \ldots, x_\ndim] \).
In classification, the goal of the predictive model is to identify the class that generated a particular instance. We need to assign it to one of the \( K \) classes \( C_1, C_2, \ldots, C_K \) depending on the values of the \( \ndim \) features.
In regression, the goal of the predictive model is to predict a continuous valued output for a given multivariate instance. We need to predict a real-valued output \( \hat{y} \in \real \) that is as close as possible to the true target \( y \in \real \). The hat \( \hat{ } \) denotes that \( \hat{y} \) is an estimate, to distinguish it from the truth.
For both supervised learning settings, the model is inferred from a collection of labeled observations provided as tuples \( (\vx_i,y_i) \) containing the instance vector \( \vx_i \) and the true target variable \( y_i \). This collection of labeled observations is known as the training set \( \labeledset = \set{(\vx_1,y_1), \ldots, (\vx_\nlabeled,y_\nlabeled)} \).
Remember from our comprehensive article on decision trees and tree-based regression that trees partition the feature space into regions. The prediction in each region is a constant. Suppose for a region \( r \), the constant prediction is \( \gamma_r \).
This means, the predictive model of the tree can be summarized as
$$ T(\vx;\Theta) = \sum_{r=1}^R \gamma_r \indicator{\vx \in r} $$
where \( \Theta \) is the collection of all parameters of the tree — the \( R \) regions and the corresponding constants \( \gamma_r,~~~\forall r \in \set{1,\ldots,R} \).
Intuitively, this means, the prediction for an instance \( \vx \) is the constant \( \gamma_r \) assigned to the region \( r \) that the instance belongs to.
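To make this piecewise-constant form concrete, here is a minimal sketch in Python using NumPy. The region membership tests and the constants \( \gamma_r \) are hypothetical placeholders for illustration, not part of any particular library.

```python
import numpy as np

def tree_predict(x, regions, constants):
    """Evaluate T(x; Theta) = sum_r gamma_r * I(x in region r).

    `regions` is a list of membership tests (callables returning True when x
    falls in that region) and `constants` holds the corresponding gamma_r.
    The regions are assumed to be disjoint and to cover the feature space,
    so exactly one indicator is non-zero for any x.
    """
    return sum(g * float(in_region(x)) for in_region, g in zip(regions, constants))

# A stump that splits on the first feature at 0.5 (illustrative values only)
regions = [lambda x: x[0] <= 0.5, lambda x: x[0] > 0.5]
constants = [1.2, -0.7]
print(tree_predict(np.array([0.3, 2.0]), regions, constants))  # prints 1.2
```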
Tree learning involves inferring \( \Theta \) — the regions and their constants.
Thus, the parameters \( \Theta \) are fit to minimize a suitable loss \( \loss \), such as the Gini index (in classification) or the squared error (in regression), over the training set consisting of \( \nlabeled \) labeled examples.
$$ \hat{\Theta} = \argmin_{\Theta} \sum_{\nlabeledsmall=1}^{\nlabeled} \loss\left(y_\nlabeledsmall, T(\vx_\nlabeledsmall;\Theta)\right) $$
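In practice, this fit is carried out greedily by a library. Below is a minimal sketch, assuming a recent scikit-learn is available and using purely synthetic, illustrative data; the split criterion is the Gini index for classification and the squared error for regression.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # 200 instances, 3 features
y_reg = X[:, 0] + 0.1 * rng.normal(size=200)   # continuous target for regression
y_clf = (X[:, 0] > 0).astype(int)              # class labels for classification

# Greedy fitting of Theta: the regions come from recursive splits, and each
# region's constant is the training mean (regression) or majority class
# (classification) of the examples that fall into it.
reg_tree = DecisionTreeRegressor(criterion="squared_error", max_depth=3).fit(X, y_reg)
clf_tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y_clf)
```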
A boosted tree \( f_M \) is an ensemble of \( M \) such trees. Its predictive model is a sum of the predictions from each of the constituent trees.
\begin{align} \hat{y} &= f_M(\vx) \\\\ &= \sum_{m=1}^M T(\vx;\Theta_m) \label{eqn:class-pred} \end{align}
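Concretely, the ensemble prediction is just the element-wise sum of the individual tree predictions. A minimal sketch, assuming each fitted tree exposes a `predict` method (as the scikit-learn trees above do); `trees` is a hypothetical list of the \( M \) fitted trees.

```python
import numpy as np

def boosted_predict(trees, X):
    """f_M(X) = sum_m T(X; Theta_m), summed over all trees in the ensemble."""
    return np.sum([tree.predict(X) for tree in trees], axis=0)
```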
Note that the predictive model of the boosted tree is an additive model of the constituents. This means, to arrive at the optimal parameters \( \Theta_m \) of the \( m \)-th tree, we must incorporate the predictions from the previous \( m - 1 \) trees — the model \( f_{m-1} \). This ensures that the parameters \( \Theta_m \) are optimized for the residuals of the existing \( m - 1 \) trees. In other words, each new tree added to the ensemble is optimized to minimize the errors of the previous trees in the ensemble.
This leads to the optimization criterion
$$ \hat{\Theta}_m = \argmin_{\Theta} \sum_{\nlabeledsmall=1}^{\nlabeled} \loss\left(y_\nlabeledsmall, f_{m-1}(\vx_\nlabeledsmall) + T(\vx_\nlabeledsmall;\Theta)\right) $$
The regions for the \( m\)-th tree can then be found greedily as before.
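For the special case of squared-error loss, minimizing \( \loss\left(y_\nlabeledsmall, f_{m-1}(\vx_\nlabeledsmall) + T(\vx_\nlabeledsmall;\Theta)\right) \) amounts to fitting the new tree to the residuals \( y_\nlabeledsmall - f_{m-1}(\vx_\nlabeledsmall) \). Here is a minimal sketch of one such boosting round, reusing the hypothetical `boosted_predict` helper from above.

```python
from sklearn.tree import DecisionTreeRegressor

def boosting_round(trees, X, y, max_depth=3):
    """Fit the m-th tree to the residuals of the current ensemble (squared-error loss)."""
    residuals = y - boosted_predict(trees, X) if trees else y.copy()
    trees.append(DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals))
    return trees
```

Repeating this round \( M \) times yields the boosted ensemble \( f_M \).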
With the knowledge of such regions, the regional constants \( \gamma_{rm} \) for the \( r\)-th region of the \( m\)-th tree can be found by optimizing the following
$$ \gamma_{rm} = \argmin_{\gamma} \sum_{\vx_\nlabeledsmall \in r} \loss(y_\nlabeledsmall, f_{m-1}(\vx_\nlabeledsmall) + \gamma) $$
Note that the regional constants of the newly learnt \( m \)-th tree are modified from their usual values since we need to fix the errors of previous \( m - 1 \) trees in the ensemble.
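Under squared-error loss, this per-region optimization has a closed form: the minimizer is simply the mean residual of the training instances that fall into the region. A sketch of solving for the \( \gamma_{rm} \), using scikit-learn's `apply` method, which returns the leaf (region) index of each instance; for other losses, this closed form is replaced by the one-dimensional optimization shown above.

```python
import numpy as np

def leaf_constants(tree, X, residuals):
    """gamma_{rm}: mean residual of the training instances in each leaf region."""
    leaf_ids = tree.apply(X)
    return {leaf: residuals[leaf_ids == leaf].mean() for leaf in np.unique(leaf_ids)}
```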
The above optimization strategy to sequentially create an ensemble of trees is straightforward for regression trees.
In the case of classification trees, if we restrict the output of each tree to the classes \( \set{-1,1} \), then we effectively end up with an AdaBoost-like solution, where we fit the new tree to weighted examples
$$ \Theta_m = \argmin_{\Theta} \sum_{\nlabeledsmall=1}^{\nlabeled} w_{\nlabeledsmall}^{(m)} \indicator{y_\nlabeledsmall \ne T(\vx_\nlabeledsmall;\Theta)} $$
with the weights
$$ w_{\nlabeledsmall}^{(m)} = \exp \left[-y_\nlabeledsmall f_{m-1}(\vx_\nlabeledsmall) \right] $$
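A minimal sketch of these per-example weights, assuming labels \( y_\nlabeledsmall \in \set{-1, 1} \) and an array `f_prev` holding the current ensemble's predictions \( f_{m-1}(\vx_\nlabeledsmall) \) on the training set.

```python
import numpy as np

def adaboost_weights(y, f_prev):
    """w_l^(m) = exp(-y_l * f_{m-1}(x_l)); misclassified examples get larger weights."""
    return np.exp(-y * f_prev)
```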
If each tree is not restricted to outputs in \( \set{-1,1} \), then the optimization still simplifies to a weighted exponential criterion as
$$ \Theta_m = \argmin_{\Theta} \sum_{\nlabeledsmall=1}^{\nlabeled} w_\nlabeledsmall^{(m)} \exp \left[ -y_\nlabeledsmall T(\vx_\nlabeledsmall; \Theta) \right] $$
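A sketch of evaluating this weighted exponential criterion for a candidate tree, where `tree_outputs` stands for the candidate's predictions \( T(\vx_\nlabeledsmall;\Theta) \) on the training set.

```python
import numpy as np

def weighted_exponential_loss(weights, y, tree_outputs):
    """sum_l w_l^(m) * exp(-y_l * T(x_l; Theta))."""
    return np.sum(weights * np.exp(-y * tree_outputs))
```

Each boosting round then selects the tree that minimizes this criterion.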