Boosted trees

Introduction

AdaBoost works with weak-classifiers to sequentially build a more accurate ensemble. A similar strategy can be used to build an ensemble of strong models, for example, involving tree-based models for classification and regression.

Prerequisites

To understand the boosting approaches to trees, we recommend familiarity with the concepts in

Follow the above links to first get acquainted with the corresponding concepts.

Problem setting

Tree boosting can be used for both classification and regression tasks.

Consider an instance $\vx \in \real^\ndim$, a vector consisting of $\ndim$ features, $\vx = [x_1, x_2, \ldots, x_\ndim]$.

In classification, the goal of the predictive model is to identify the class that generated a particular instance. We need to assign it to one of the $M$ classes $C_1, C_2, \ldots, C_M$ depending on the values of the $N$ features .

In regression, the goal of the predictive model is to predict a continuous valued output for a given multivariate instance. We need to predict a real-valued output $\hat{y} \in \real$ that is as close as possible to the true target $y \in \real$. The hat $\hat{ }$ denotes that $\hat{y}$ is an estimate, to distinguish it from the truth.

For both supervised learning settings, the model is inferred over a collection of labeled observations provided as tuples $(\vx_i,y_i)$ containing the instance vector $\vx_i$ and the true target variable $y_i$. This collection of labeled observations is known as the training set $\labeledset = \set{(\vx_1,y_1), \ldots (\vx_\nlabeled,y_\nlabeled)}$.

A quick refresher on trees

Remember from our comprehensive article on decision trees and tree-based regression that trees partition the feature space into regions. The prediction in each region is a constant. Suppose for a region $r$, the constant prediction is $\gamma_r$.

This means, the predictive model of the tree can be summarized as

$$T(\vx;\Theta) = \sum_{r=1}^R \gamma_r \indicator{\vx \in r}$$

where, $\Theta$ is the collection of all parameters of the tree — the regions $\set{1, \ldots, R}$ and the corresponding constants $\gamma_r,~~~\forall r \in \set{1,\ldots,R}$.

Intuitively, this means, the prediction for an instance $\vx$ is the constant $\gamma_r$ assigned to the region $r$ that the instance belongs to.

Tree learning involves inferring $\Theta$ — the regions and their constants.

• Finding the regional constants $\gamma_r$ is easy. Typically, in the case of regression, $\gamma_r$ is the average of target variables of all training examples that belong to the region $r$. In the case of classification, it is the modal class — the class that has the most representative examples belonging in the region $r$.
• Finding the regions is the harder problem. This is typically handled greedily, with a top-down recursive partitioning strategy.

Thus, the parameters, $\Theta$, are fit to minimize the Gini index (in classification) or mean squared error (in regression) on the training set consisting of $\nlabeled$ labeled examples.

$$\hat{\Theta} = \argmin_{\Theta} \sum_{\nlabeledsmall=1}^{\nlabeled} \loss\left(y_\nlabeledsmall, T(\vx_\nlabeledsmall;\Theta)\right)$$

Predictive model

A boosted tree $f_M$ is an ensemble of $M$ such trees. It's predictive model is a sum of the predictions from each of the constituent trees.

\begin{align} \hat{y} &= f_M(\vx) \\\\ &= \sum_{m=1}^M T(\vx;\Theta_m) \label{eqn:class-pred} \end{align}

Training boosted trees

Note that the predictive model of the boosted tree is an additive model of the constituents. This means, to arrive at the optimal parameters $\Theta_m$ of the $m$-th tree, we must incorporate the predictions from the previous $m - 1$ trees — the model $f_{m-1}$. This ensures that the parameters $\Theta_m$ are optimized for the residuals of the existing $m - 1$ tree. In other words, new trees added to the ensemble are optimized to minimize the errors of the previous trees in the ensemble.

This leads to the optimization criteria

$$\hat{\Theta}_m = \argmin_{\Theta} \sum_{\nlabeledsmall=1}^{\nlabeled} \loss\left(y_\nlabeledsmall, f_{m-1}(\vx_\nlabeledsmall) + T(\vx_\nlabeledsmall;\Theta)\right)$$

The regions for the $m$-th tree can then be found greedily as before.

With the knowledge of such regions, the regional constants $\gamma_{rm}$ for the $r$-th region of the $m$-th tree can be found by optimizing the following

$$\gamma_{rm} = \argmin_{\gamma} \sum_{\vx_\nlabeledsmall \in r} \loss(y_\nlabeledsmall, f_{m-1}(\vx_\nlabeledsmall) + \gamma)$$

Note that the regional constants of the newly learnt $m$-th tree are modified from their usual values since we need to fix the errors of previous $m - 1$ trees in the ensemble.

Boosting regression trees

The above optimization strategy to sequentially create an ensemble of trees is straightforward for regression trees.

• Discovering regions for boosted trees is no different from the strategy adopted for single tree learning. The only difference being, instead of optimizing to predict the target variable directly, we instead optimize the regions to predict the residual as explained earlier. Thus, the regions of the $m$-th tree are inferred to optimize the residual $y_\nlabeledsmall - f_{m-1}(\vx_\nlabeledsmall)$.
• The calculation of regional constants, $\gamma_{rm}$, is also straightforward for regression trees. Once the regions have been discovered for the $m$-th tree in the sequence, the $\gamma_{rm}$ is the mean of the residuals $y_\nlabeledsmall - f_{m-1}(\vx_\nlabeledsmall)$ for training examples in that region; $\vx_\nlabeledsmall \in r$.

Boosting classification trees

In the case of classification trees, if we restrict the output of each tree to be classes $\set{-1,1}$, then we effectively end up with an AdaBoost-like solution, where we are fitting the new tree to weighted examples

$$\Theta_m = \argmin_{\Theta} \sum_{\nlabeledsmall=1}^{\nlabeled} w_{\nlabeledsmall}^{(m)} \indicator{y_\nlabeledsmall \ne T(\vx_\nlabeledsmall;\Theta)}$$

with the weights

$$w_{\nlabeledsmall}^{(m)} = \exp \left[-y_\nlabeledsmall f_{m-1}(\vx_\nlabeledsmall) \right]$$

If each tree is not restricted to the output $\set{-1,1}$, then the optimization still simplifies to a weighted exponential criteria as

$$\Theta_m = \argmin_{\Theta} \sum_{\nlabeledsmall=1}^{\nlabeled} w_\nlabeledsmall^{(m)} \exp \left[ -y_\nlabeledsmall T(\vx_\nlabeledsmall; \Theta) \right]$$