# Tree-based regression

## Introduction

Tree-based regression approaches are nonlinear models that work by partitioning the input space into cuboid regions. The model predicts with an aggregate statistic, such as the mean or the median, of the training examples in the region that matches the unlabeled example. Because of this, trained tree-based models can be understood without a machine learning background, which has led to their wide deployment in industrial applications. There are many variants of tree-based approaches, such as CART, ID3, and C4.5.

## Prerequisites

To understand tree-based regression, we recommend familiarity with the concepts in

• Probability and statistics: A sound understanding of basic summary statistics, such as the mean and the variance, is desirable.
• Introduction to machine learning: An introduction to basic concepts in machine learning such as classification, training instances, features, and feature types.
• Decision tree classifier: For motivation and alternative perspectives.

Follow the above links to first get acquainted with the corresponding concepts.

## Problem setting

In regression, the goal of the predictive model is to predict a continuous valued output for a given multivariate instance.

Consider such an instance $\vx \in \real^\ndim$, a vector consisting of $\ndim$ features, $\vx = [x_1, x_2, \ldots, x_\ndim]$.

We need to predict a real-valued output $\hat{y} \in \real$ that is as close as possible to the true target $y \in \real$. The hat $\hat{ }$ denotes that $\hat{y}$ is an estimate, to distinguish it from the truth.

## From decision rules to decision trees

The decision rules for fruit classification that we saw earlier were

• $\text{If color == red} \Rightarrow 2$
• $\text{If color == yellow and shape == oval} \Rightarrow 3$.
• $\text{If color == yellow and shape == round and bottom == depressed } \Rightarrow 2$.
• Else, $\text{ mango }$.

Intuitively, these rules form a binary tree, shown below

Decision trees are such binary trees. Each node in the tree compares a feature to a value.

## Regions of the feature space

Each node of the tree partitions the feature space into two regions — one that satisfies the left path and one that satisfies the right. Each region formed this way is cuboid shaped. Descendants of a node further split the region into smaller cuboids.

So, taken together, a tree achieves a recursive binary splitting of the feature space.

The final regions resulting from this recursive partitioning represent the leaf nodes of the tree. A tree with $M$ leaf nodes will result in $M$ such leaf regions.

Each region is assigned the average of the targets of the training examples whose features fall in that region of the feature space.

Consider one such region $r$. Suppose $\nlabeled_r$ examples from the training set belong to the region $r$.

$$\hat{c}_r = \frac{1}{\nlabeled_r} \sum_{\vx_i \in r} y_i \label{eqn:region-value}$$
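Equation \eqref{eqn:region-value} is just a per-region mean. A minimal sketch in Python (the names `region_value` and `in_region` are illustrative, not from any library):

```python
def region_value(examples, in_region):
    """Constant c_r for a region: the mean target over the training
    examples whose features fall in the region. `examples` is a list
    of (x, y) pairs; `in_region` is a predicate on the feature vector x."""
    ys = [y for x, y in examples if in_region(x)]
    return sum(ys) / len(ys)

# Hypothetical region r: first feature <= 2.0
train = [([1.0], 3.0), ([1.5], 5.0), ([3.0], 10.0)]
c_r = region_value(train, lambda x: x[0] <= 2.0)  # mean of 3.0 and 5.0 -> 4.0
```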

## Predictive model

In a tree-based regression model, prediction is straightforward.

For an unlabeled instance that we wish to predict for, we traverse the tree to find the leaf, or region, that matches the features of the instance. Then, the unlabeled instance is assigned the value associated with that region.

In a tree with $R$ regions, the prediction is $$\yhat = \sum_{r = 1}^R \hat{c}_r \indicator{\vx \in r} \label{eqn:yhat}$$

Here, $\indicator{\vx \in r}$ is the indicator function which takes on the value $1$ if the instance $\vx$ matches region $r$, else it is zero. The constant $\hat{c}_r$ is the target value assigned to the region $r$, as defined in Equation \eqref{eqn:region-value}.
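As a sketch, the traversal can be written as a loop from the root to a leaf. The `Node` class below and its field names are illustrative assumptions, not a standard API:

```python
class Node:
    """Minimal tree node: internal nodes hold a feature index and a
    threshold; leaves hold the region constant c_r in `value`."""
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, value=None):
        self.feature = feature      # index of the feature tested
        self.threshold = threshold  # split criterion: x[feature] <= threshold
        self.left = left            # subtree when the criterion holds
        self.right = right          # subtree when it does not
        self.value = value          # c_r, set only at leaves

def predict(node, x):
    """Walk from the root to the leaf whose region contains x and
    return that region's constant."""
    while node.value is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value

# Toy tree: x_0 <= 2.0 -> 4.0, else 10.0
tree = Node(feature=0, threshold=2.0,
            left=Node(value=4.0), right=Node(value=10.0))
predict(tree, [1.5])  # -> 4.0
```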

## Training a decision tree

To build a tree-based regression model, we need to, well, infer a tree. This involves determining the structure of the tree — the nodes and their relationships.

Tree learning uses a greedy strategy to recursively identify the best nodes to add.

1. Starting with the full training set, identify the best node to add to the tree.
2. The chosen node splits the feature space into two regions. Examples that match the node criterion will be used as the training set to grow the right child of this node. Examples that do not match will be used as the training set to grow the left child.

So, the key question is: Given a training set, what is the best node to create?

The various tree-based approaches to regression differ in how they address this challenge. We study the CART approach in this article.
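The greedy procedure above can be sketched in Python. This is a minimal illustration under simplifying assumptions, not the full CART algorithm: it handles only real-valued features, tries every observed value as a threshold, stops at a fixed depth, and uses made-up function names:

```python
def mean(ys):
    return sum(ys) / len(ys)

def sse(ys):
    """Squared error of a region's targets around their mean."""
    c = mean(ys)
    return sum((y - c) ** 2 for y in ys)

def grow(examples, depth=0, max_depth=3):
    """Greedily grow a tree over (x, y) pairs with real-valued features.
    Returns a nested dict for an internal node, or a float (the region
    constant c_r) for a leaf."""
    ys = [y for _, y in examples]
    if depth >= max_depth or len(set(ys)) == 1:
        return mean(ys)  # leaf: the region constant c_r
    best = None
    for f in range(len(examples[0][0])):          # every feature
        for t in sorted({x[f] for x, _ in examples}):  # every observed value
            left = [(x, y) for x, y in examples if x[f] <= t]
            right = [(x, y) for x, y in examples if x[f] > t]
            if not left or not right:
                continue
            err = sse([y for _, y in left]) + sse([y for _, y in right])
            if best is None or err < best[0]:
                best = (err, f, t, left, right)
    if best is None:
        return mean(ys)
    _, f, t, left, right = best
    return {"feature": f, "threshold": t,
            "left": grow(left, depth + 1, max_depth),
            "right": grow(right, depth + 1, max_depth)}

# A toy fit: targets jump from 2.0 to 8.0 after x_0 == 2.0
tree = grow([([1.0], 2.0), ([2.0], 2.0), ([3.0], 8.0), ([4.0], 8.0)])
# -> {'feature': 0, 'threshold': 2.0, 'left': 2.0, 'right': 8.0}
```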

## Selecting nodes for tree-building

What is a node? A feature and a criterion applied to that feature. For example, $\text{color == red}$.

What constitutes a good node in a tree? One that achieves low prediction error. Prediction error is lower when the constants $\hat{c}_r$ assigned to the regions are as close as possible to the true target values in those regions.

Thus, we select the node that leads to the lowest mean squared error in the regions it forms.

The error $e_r$ within a region $r$ can be calculated as

$$e_r = \frac{1}{\nlabeled_r} \sum_{\vx_i \in r} \left( y_i - \hat{c}_r \right)^2 \label{eqn:region-error}$$
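For concreteness, a hypothetical region with targets $3$, $5$, and $7$ has $\hat{c}_r = 5$ and $e_r = \left((3-5)^2 + (5-5)^2 + (7-5)^2\right)/3 = 8/3$:

```python
# Hypothetical targets of the training examples in region r
ys = [3.0, 5.0, 7.0]
c_r = sum(ys) / len(ys)                          # region constant: 5.0
e_r = sum((y - c_r) ** 2 for y in ys) / len(ys)  # mean squared error: 8/3
```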

## Dealing with feature types

Different feature types lead to corresponding node criteria.

| Feature type | Example criterion | Explanation |
| --- | --- | --- |
| Categorical | $x_c == \text{'value'}$ | $\text{'value'}$ is one of the values taken by the categorical variable $x_c$ |
| Binary | $x_b == 1$ | Binary variables take on the value $0$ or $1$ |
| Real-valued | $x_r \le t$ | $t$ is a threshold on the value of the real-valued variable $x_r$ |
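Each criterion above is a boolean test on one feature. In code, the three types might look like the following, where the feature indices and names are made up for illustration:

```python
# x is a feature vector; the indices and semantics are hypothetical.
def is_red(x):     # categorical: equality with one category value
    return x[0] == "red"

def is_ripe(x):    # binary: equality with 1
    return x[1] == 1

def is_small(x):   # real-valued: comparison against a threshold
    return x[2] <= 5.0

x = ["red", 1, 3.5]  # an instance matching all three criteria
```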

## Controlling model complexity

In parametric models with weights, such as support vector machines, complexity is typically controlled with some form of regularization: a penalty on the weights. In the case of tree-based models, there are no weights; only nodes in the tree.

Deeper trees mean more nodes. More nodes mean more recursive partitioning of the feature space into regions. More regions mean smaller neighborhoods. Smaller neighborhoods mean fewer training examples in most regions, limiting the generalizability of the tree as it becomes too specific to a few training examples: a classic case of overfitting.

Consequently, the model complexity of a tree is controlled by limiting the depth of the tree. Since the tree-building follows a greedy strategy based on mean squared error, the typical practice involves letting the tree grow big and then pruning it to limit its complexity.

## Pruning

After letting the tree grow as big as possible, a cost-complexity criterion is applied to prune or remove unnecessary nodes and their descendants from the tree.

If $R$ denotes the number of terminal nodes (leaves or regions) in the tree, and $e_r$ denotes the prediction error of the $r$-th region as defined in Equation \eqref{eqn:region-error}, then the cost complexity of the tree is defined as

$$C(R) = \sum_{r=1}^{R} \nlabeled_r\, e_r + \alpha R$$

where $\alpha$ is a trade-off factor between tree complexity and its performance. Being a cost, we prefer trees with a lower value. Larger values of $\alpha$ lead to fewer nodes in the tree and vice versa. Typically, the search for nodes to prune is greedy: we collapse the internal node that produces the smallest per-node increase in cost complexity. A suitable value of $\alpha$ is chosen by cross-validation.
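As a sketch, the cost-complexity criterion can be scored from per-leaf statistics. Following the standard CART formulation, each leaf contributes its example count times its mean squared error, plus a penalty of $\alpha$ per leaf; all names below are illustrative:

```python
def cost_complexity(leaf_stats, alpha):
    """Cost-complexity score of a tree from its leaves. `leaf_stats`
    is a list of (n_r, e_r) pairs: the number of training examples in
    each leaf region and that region's mean squared error. Lower is
    better."""
    error = sum(n_r * e_r for n_r, e_r in leaf_stats)
    return error + alpha * len(leaf_stats)

# Hypothetical leaf statistics for a 4-leaf tree and for the 2-leaf
# tree obtained by collapsing its internal nodes.
big = [(10, 0.1), (10, 0.1), (10, 0.2), (10, 0.2)]
small = [(20, 0.4), (20, 0.5)]

# alpha = 0 favors the bigger, lower-error tree; a large alpha
# favors the smaller tree despite its higher error.
cost_complexity(big, alpha=0.0)    # 6.0, vs 18.0 for small
cost_complexity(small, alpha=10.0) # 38.0, vs 46.0 for big
```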

## Challenges

Despite being easy to understand, tree-based regression poses several challenges

• Instability: The greedy search strategy for tree-building means that small changes in the training set lead to a different series of splits and nodes in the tree, limiting their stability across datasets. This also means that tree-based models typically have higher variance than other approaches. Bagging approaches, such as random forests, are designed to alleviate this challenge by creating an ensemble for prediction, where each tree in the forest is trained on a subset of the training set.
• Categorical features: While trees offer a straightforward way to work with categorical features, they become challenging when a categorical feature has many distinct values. A feature with $q$ distinct unordered values admits $2^{q-1} - 1$ possible binary partitions of its values, which is prohibitive for large values of $q$.
• Non axis-aligned predictions: Tree-based models partition the feature space into axis-aligned cuboid regions. This is a severe limitation for tasks in which the target varies along directions that are not axis-aligned, such as diagonal or cyclic trends. For example, to approximate a diagonal predictor, a tree-based model needs to be very deep.
• Lack of smoothness: The predictions from the tree-based model are constant within a region. They do not vary smoothly across regions either. For tasks that expect a smoother output over the feature space, this can be a severely limiting factor.