Introduction to CART decision trees

This learning module has many interactive demos. It is easier to work with them on a larger screen. Bookmark and revisit if you are currently on a small screen device.

\(\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\asterisk}{\ast} \newcommand{\sup}{\text{sup}} \newcommand{\inf}{\text{inf}} \newcommand{\min}{\text{min}\;} \newcommand{\max}{\text{max}\;} \newcommand{\maxunder}[1]{\underset{#1}{\max}} \newcommand{\minunder}[1]{\underset{#1}{\min}} \newcommand{\real}{\mathbb{R}} \newcommand{\natural}{\mathbb{N}} \newcommand{\integer}{\mathbb{Z}} \newcommand{\rational}{\mathbb{Q}} \newcommand{\irrational}{\mathbb{I}} \newcommand{\complex}{\mathbb{C}} \newcommand{\cardinality}[1]{|#1|} \newcommand{\vec}[1]{\mathbf{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\star}[1]{#1^*} \newcommand{\inv}[1]{#1^{-1}} \newcommand{\indicator}[1]{\mathcal{I}(#1)} \renewcommand{\BigO}[1]{\mathcal{O}(#1)} \renewcommand{\BigOsymbol}{\mathcal{O}} \renewcommand{\smallo}[1]{\mathcal{o}(#1)} \renewcommand{\smallosymbol}[1]{\mathcal{o}} \newcommand{\set}[1]{\mathbb{#1}} \newcommand{\complement}[1]{#1^c} \newcommand{\powerset}[1]{\mathcal{P}(#1)} \newcommand{\setdiff}{\setminus} \newcommand{\setsymmdiff}{\oplus} \newcommand{\dash}[1]{#1^{'}} \newcommand{\permutation}[2]{{}_{#1} \mathrm{ P }_{#2}} \newcommand{\combination}[2]{{}_{#1} \mathrm{ C }_{#2}} \newcommand{\prob}[1]{P(#1)} \newcommand{\pmf}[1]{P(#1)} \newcommand{\pdf}[1]{p(#1)} \newcommand{\cdf}[1]{F(#1)} \newcommand{\expect}[2]{E_{#1}\left[#2\right]} \newcommand{\entropy}[1]{\mathcal{H}\left[#1\right]} \newcommand{\expe}[1]{\mathrm{e}^{#1}} \newcommand{\textexp}[1]{\text{exp}\left(#1\right)} \def\independent{\perp\!\!\!\perp} \def\notindependent{\not\!\independent} \newcommand{\yhat}{\hat{y}} \newcommand{\vs}{\vec{s}} \newcommand{\vt}{\vec{t}} \newcommand{\vu}{\vec{u}} \newcommand{\vv}{\vec{v}} \newcommand{\vw}{\vec{w}} \newcommand{\vx}{\vec{x}} \newcommand{\vy}{\vec{y}} \newcommand{\vz}{\vec{z}} \newcommand{\va}{\vec{a}} \newcommand{\vb}{\vec{b}} \newcommand{\vc}{\vec{c}} \newcommand{\vd}{\vec{d}} \newcommand{\ve}{\vec{e}} \newcommand{\vg}{\vec{g}} \newcommand{\vh}{\vec{h}} \newcommand{\vi}{\vec{i}} \newcommand{\vk}{\vec{k}} \newcommand{\vo}{\vec{o}} \newcommand{\vp}{\vec{p}} \newcommand{\vq}{\vec{q}} \newcommand{\vr}{\vec{r}} \newcommand{\vs}{\vec{s}} \newcommand{\vmu}{\vec{\mu}} \newcommand{\vsigma}{\vec{\sigma}} \newcommand{\vphi}{\vec{\phi}} \newcommand{\vtau}{\vec{\tau}} \newcommand{\vtheta}{\vec{\theta}} \newcommand{\mA}{\mat{A}} \newcommand{\mB}{\mat{B}} \newcommand{\mC}{\mat{C}} \newcommand{\mD}{\mat{D}} \newcommand{\mE}{\mat{E}} \newcommand{\mH}{\mat{H}} \newcommand{\mK}{\mat{K}} \newcommand{\mP}{\mat{P}} \newcommand{\mQ}{\mat{Q}} \newcommand{\mR}{\mat{R}} \newcommand{\mS}{\mat{S}} \newcommand{\mU}{\mat{U}} \newcommand{\mV}{\mat{V}} \newcommand{\mW}{\mat{W}} \newcommand{\mX}{\mat{X}} \newcommand{\mY}{\mat{Y}} \newcommand{\mZ}{\mat{Z}} \newcommand{\mI}{\mat{I}} \newcommand{\mLambda}{\mat{\Lambda}} \newcommand{\mSigma}{\mat{\Sigma}} \newcommand{\mTheta}{\mat{\theta}} \newcommand{\setsymb}[1]{#1} \newcommand{\sA}{\setsymb{A}} \newcommand{\sB}{\setsymb{B}} \newcommand{\sC}{\setsymb{C}} \newcommand{\sO}{\setsymb{O}} \newcommand{\sP}{\setsymb{P}} \newcommand{\sQ}{\setsymb{Q}} \newcommand{\sH}{\setsymb{H}} \newcommand{\sX}{\setsymb{X}} \newcommand{\sY}{\setsymb{Y}} \newcommand{\norm}[2]{||{#1}||_{#2}} \newcommand{\infnorm}[1]{\norm{#1}{\infty}} \newcommand{\fillinblank}{\text{ }\underline{\text{ ? }}\text{ }} \newcommand{\lbrace}{\left\{} \newcommand{\rbrace}{\right\}} \newcommand{\set}[1]{\lbrace #1 \rbrace} \newcommand{\seq}[1]{\left( #1 \right)} \newcommand{\ndim}{N} \newcommand{\ndimsmall}{n} \newcommand{\dataset}{\mathbb{D}} \newcommand{\ndata}{D} \newcommand{\ndatasmall}{d} \newcommand{\labeledset}{\mathbb{L}} \newcommand{\nlabeled}{L} \newcommand{\nlabeledsmall}{l} \newcommand{\unlabeledset}{\mathbb{U}} \newcommand{\nunlabeled}{U} \newcommand{\nunlabeledsmall}{u} \newcommand{\nclass}{M} \newcommand{\nclasssmall}{m} \newcommand{\loss}{\mathcal{L}} \newcommand{\sign}{\text{sign}} \newcommand{\Gauss}{\mathcal{N}} \newcommand{\hadamard}{\circ} \newcommand{\doh}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\dox}[1]{\doh{#1}{x}} \newcommand{\doy}[1]{\doh{#1}{y}} \newcommand{\doxx}[1]{\doh{#1}{x^2}} \newcommand{\doyy}[1]{\doh{#1}{y^2}} \newcommand{\doxy}[1]{\frac{\partial #1}{\partial x \partial y}} \newcommand{\doyx}[1]{\frac{\partial #1}{\partial y \partial x}} \newcommand{\qed}{\tag*{$\blacksquare$}}\)

        Decision tree classifier
        Machine Learning
      

Introduction

Tree-based classification approaches are nonlinear models that work by partitioning the input space into cuboid regions. The prediction of the model is based on the most dominant class represented by training examples in the cuboid region that matches the unlabeled example. Owing to this, trained tree-based models are easily understood with no machine learning background leading to their wide deployment in industrial applications. There are many variants of tree-based approaches, such as the CART, ID3 and C4.5.

In this article, we focus on the CART-based decision tree classifier. Then, we comment on the extensions in ID3 and C4.5 in later sections.

Prerequisites

To understand the decision tree classifier, we recommend familiarity with the concepts in

Follow the above links to first get acquainted with the corresponding concepts.

Problem setting

In classification, the goal of the predictive model is to identify the class that generated a particular instance.

Consider such an instance $ \vx $, a vector consisting of $ \ndim $ features, $\vx = [x_1, x_2, \ldots, x_\ndim] $. The features may be real-valued, binary, categorical, or mix-type.

We need to assign it to one of the $ \nclass $ classes $ C_1, C_2, \ldots, C_\nclass $ depending on the values of the $ \ndim $ features .

The human classifier

Imagine how humans make simple classifications.

Consider a bowl containing a mix of ripened apples and mangoes. Your job is to separate them into their respective categories — apples and mangoes. You have several features of the fruit to make your decision — color, shape, size, and texture. How do you do it?

From your past observation of fruits, especially apples and mangoes, you come up with decision rules. * If color is red, then it is an apple. * If color is yellow, then it could be a mango or a golden apple. So, we need to look at additional attributes. * Among fruits with yellow color, if the shape is oval with a pointed bottom, then it is a mango. If it is round with a depressed bottom, then it is an apple.

Tree-based classification approaches follow the same strategy, which we will study next.

From decision rules to decision trees

The decision rule for fruit classification we saw earlier were

$ \text{If color == red} \Rightarrow \text{ apple } $
$ \text{If color == yellow and shape == oval} \Rightarrow \text{ mango } $.
$ \text{If color == yellow and shape == round and bottom == depressed } \Rightarrow \text{ apple } $.
Else, $ \text{ mango } $.

Intuitively, this is a binary tree shown below

Decision trees are such binary trees. Each node in the tree compares a feature to a value.

Regions of the feature space

Each node of the tree partitions the feature space into two regions — one that satisfies the left path and one that satisfies the right. Each region formed this way is cuboid shaped. Descendants of a node further split the region into smaller cuboids.

So, taken together, a tree achieves a recursive binary splitting of the feature space.

The final regions resulting from this recursive partitioning represent the leaf nodes of the tree. A tree with $ M $ leaf nodes will result in $ M $ such leaf regions.

The leaf gets the class label of the majority class among the training examples whose features match that particular leaf.

Predictive model

In a decision tree, prediction is straightforward.

For an unlabeled instance that we wish to classify, we traverse the tree finding the appropriate leaf that matches the features of the instance. Then, the unlabeled instance is classified to the label of that matching leaf node.

We defined earlier that leaves of the tree are regions of the feature space. Consider one such region $ r $. Suppose $ \nlabeled_r $ examples from the training set belong to the region $ r $. Among these, suppose the proportion of training examples belonging to class $ \nclasssmall $ is $ \hat{p}_{r\nclasssmall} $

\begin{equation} \hat{p}_{r\nclasssmall} = \frac{1}{\nlabeled_r} \sum_{\vx_i \in r} \indicator{y_i = \nclasssmall} \label{eqn:class-proportions} \end{equation}

where $ \indicator{.} $ is the indicator function — it returns 1 if $ y_i = \nclasssmall $ else it returns zero.

Using these proportions for all classes $ \nclasssmall \in \set{1, 2, \ldots, \nclass} $, the label for a region $ r $ is the most dominant class in that region — the class with the highest proportion of training examples in the region $ r $.

\begin{equation} \yhat(r) = \argmax_{\nclasssmall \in \set{1,\ldots,\nclass}} \hat{p}_{r\nclasssmall} \label{eqn:yhat} \end{equation}

Training a decision tree

To build a tree-based classifier, we need to, well, infer a tree. This involves determining the structure of the tree — the nodes and their relationships.

Tree learning uses a greedy strategy to recursively identify the best nodes to add.

Starting with the full training set, identify the best node to add to the tree.
The chosen node splits the feature space into two regions. Examples that match the node criteria will be used as the training set to grow right child of this node. Examples that do not match will be used as the training set to grow the left child.

So, the key question is: Given a training set what is the best node to create?

The various tree-based approaches to classification differ in how they address this challenge.

Selecting nodes for tree-building

What is a node? A feature and a criteria applied on that feature. For example, $ \text{color == red} $.

What constitutes a good node in a tree? One that achieves the good dichotomy — a good separation between classes.

Among two features, one that most distinguishes the categories is a better candidate for forming a node.
Among the various values of a feature, the value that most discriminates the categories is a better candidate for the node criteria.

For an ideal feature, , each of the buckets should be pure — each bucket should contain examples from a single class.

Intuitively, we want the decision tree to have nodes that will result in pure regions. A region is pure when it contains examples from a single class.

Thus, to infer nodes, first we need to develop a measure for impurity of the regions — an impurity score.

A simple impurity score : Misclassification error

As explained earlier, a node is desirable in a tree if it will result in more pure regions compared to undesirable nodes. A simple and easy to understand impurity score is the misclassification error of a region — what fraction of examples are misclassified in that region.

In the case of tree-based models, if all examples in a region are assigned to the majority class, only examples from the majority class will be correctly classified and others will be misclassified. Then, the misclassification error is equivalent to one minus the proportion of the majority class. Thus, we can represent misclassification error in terms of the majority class proportion in a region $ r $ as

$$ e_r = 1 - \hat{p}_{r\yhat(r)} $$

Impurity score: Gini index

We have explained in our comprehensive article on Gini index that the Gini index is a smooth and more responsive impurity score than the misclassification error. For a region $ r $, it is defined as

\begin{align} g_r &= \sum_{\nclasssmall \ne \nclasssmall'} \hat{p}_{r\nclasssmall} \hat{p}_{r\nclasssmall'} \\\\ &= \sum_{\nclasssmall = 1}^{\nclass} \hat{p}_{r\nclasssmall} \left(1 - \hat{p}_{r\nclasssmall} \right) \end{align}

Gini index is the preferred impurity score for CART trees owing to its responsiveness to small changes in class proportions.

Impurity score: Cross-entropy

An alternative smooth impurity score is the cross-entropy, also known as deviance. It is defined as

$$ c_r = - \sum_{\nclasssmall = 1}^{\nclass} \hat{p}_{r\nclasssmall} \log \hat{p}_{r\nclasssmall} $$

For its smoothness and responsiveness to small changes in class proportions, the cross-entropy is the preferred impurity score in several tree-learning algorithms such as the ID3, C4.5, and C5.0, the memory-efficient enhancement to C4.5. The score used in these approaches is known as the information gain, which is the reduction in cross-entropy of the training set at a node when it is partitioned as per the node criteria.

Dealing with feature types

Different feature types lead to corresponding node criteria.

Feature type	Example criteria	Explanation
Categorical	$ x_c == \text{'value'} $	$ \text{'value'} $ is one of the values taken by the categorical variable $ x_c $
Binary	$ x_b == 1 $	Binary variables take on the value $ 0 $ or $ 1 $
Real-valued	$ x_r \le t $	$ t $ is a threshold on the value of the real-valued variable $ x_r $

Controlling model complexity

In parametric models with weights such as support vector machines, the complexity is typically controlled by using some form of regularization — a penalty on the weights. In the case of tree-based classifiers, there are no weights; only nodes in the tree.

Deeper trees means more nodes. More nodes means more recursive partitioning of the tree into regions. More regions means smaller neighborhoods. Smaller neighborhoods means fewer training examples in most regions, leading to limited generalizability of the tree from being too specific to few training examples — a classic case of overfitting.

Consequently, the model complexity of a tree is controlled by limiting the depth of the tree. Since the tree-building follows a greedy strategy based on Gini-index or cross-entropy, the typical practice involves letting the tree grow big and then pruning it to limit its complexity.

Pruning

After letting the tree grow as big as possible, a cost-complexity criterion is applied to prune or remove unnecessary nodes and their descendants from the tree.

If $ R $ denotes the number of terminal nodes (leaves or regions) in the tree, and $ Q_r(R) $ denotes the impurity score of the $r$-th region of the tree, then, the cost complexity of the tree is defined as

$$ C(R) = \sum_{m=1}^{M} \frac{1}{\nlabeled_r} Q_r(R) + \alpha |R| $$

where, $ \alpha $ is a trade-off factor between tree complexity and its performance. Being a cost, we prefer trees with lower value of the cost. Larger values of $ \alpha $ lead to fewer nodes in the tree and vice versa. Typically, the search for nodes to prune is greedy — we collapse the internal node that produces the smallest per node increase in cost-complexity. A suitable value of $ \alpha $ is achieved by cross-validation.

Challenges

Despite being easy to understand, tree-based classifiers offer several challenges

Instability: The greedy search strategy for tree-building means that small changes in the training set leads to a different series of splits and nodes in the tree, limiting their stability across datasets. This also means that tree-based classifiers typically have higher variance than other approaches. Bagging approaches, such as random forests are designed to alleviate this challenge by creating an ensemble for prediction, because each tree in the forest has been trained on a subset of the training set.
Categorical features: While trees offer a straightforward way to work with categorical features, it also becomes challenging when dealing with categorical features that have too many distinct values. A feature with $ q $ distinct unordered values results in $ 2^{q-1} - 1 $ possible partitions of the feature space — quite prohibitive for large values of $ q $.
Diagonal separators: Tree-based classifiers partition the feature space into axis-aligned cuboid regions. This is a severe problem for tasks that involve a separate that is not axis-aligned. To arrive at a diagonal separator, the tree-based classifier will need to be very deep.

Please support us

Help us create more engaging and effective content and keep it free of paywalls and advertisements!

Please donate

Subscribe for article updates

Stay up to date with new material for free.

Feature type	Example criteria	Explanation
Categorical	\( x_c == \text{'value'} \)	\( \text{'value'} \) is one of the values taken by the categorical variable \( x_c \)
Binary	\( x_b == 1 \)	Binary variables take on the value \( 0 \) or \( 1 \)
Real-valued	\( x_r \le t \)	\( t \) is a threshold on the value of the real-valued variable \( x_r \)