Introduction to random forest


\(\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\asterisk}{\ast} \newcommand{\sup}{\text{sup}} \newcommand{\inf}{\text{inf}} \newcommand{\min}{\text{min}\;} \newcommand{\max}{\text{max}\;} \newcommand{\maxunder}[1]{\underset{#1}{\max}} \newcommand{\minunder}[1]{\underset{#1}{\min}} \newcommand{\real}{\mathbb{R}} \newcommand{\natural}{\mathbb{N}} \newcommand{\integer}{\mathbb{Z}} \newcommand{\rational}{\mathbb{Q}} \newcommand{\irrational}{\mathbb{I}} \newcommand{\complex}{\mathbb{C}} \newcommand{\cardinality}[1]{|#1|} \newcommand{\vec}[1]{\mathbf{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\star}[1]{#1^*} \newcommand{\inv}[1]{#1^{-1}} \newcommand{\indicator}[1]{\mathcal{I}(#1)} \renewcommand{\BigO}[1]{\mathcal{O}(#1)} \renewcommand{\BigOsymbol}{\mathcal{O}} \renewcommand{\smallo}[1]{\mathcal{o}(#1)} \renewcommand{\smallosymbol}[1]{\mathcal{o}} \newcommand{\set}[1]{\mathbb{#1}} \newcommand{\complement}[1]{#1^c} \newcommand{\powerset}[1]{\mathcal{P}(#1)} \newcommand{\setdiff}{\setminus} \newcommand{\setsymmdiff}{\oplus} \newcommand{\dash}[1]{#1^{'}} \newcommand{\permutation}[2]{{}_{#1} \mathrm{ P }_{#2}} \newcommand{\combination}[2]{{}_{#1} \mathrm{ C }_{#2}} \newcommand{\prob}[1]{P(#1)} \newcommand{\pmf}[1]{P(#1)} \newcommand{\pdf}[1]{p(#1)} \newcommand{\cdf}[1]{F(#1)} \newcommand{\expect}[2]{E_{#1}\left[#2\right]} \newcommand{\entropy}[1]{\mathcal{H}\left[#1\right]} \newcommand{\expe}[1]{\mathrm{e}^{#1}} \newcommand{\textexp}[1]{\text{exp}\left(#1\right)} \def\independent{\perp\!\!\!\perp} \def\notindependent{\not\!\independent} \newcommand{\yhat}{\hat{y}} \newcommand{\vs}{\vec{s}} \newcommand{\vt}{\vec{t}} \newcommand{\vu}{\vec{u}} \newcommand{\vv}{\vec{v}} \newcommand{\vw}{\vec{w}} \newcommand{\vx}{\vec{x}} \newcommand{\vy}{\vec{y}} \newcommand{\vz}{\vec{z}} \newcommand{\va}{\vec{a}} \newcommand{\vb}{\vec{b}} \newcommand{\vc}{\vec{c}} \newcommand{\vd}{\vec{d}} \newcommand{\ve}{\vec{e}} 
\newcommand{\vg}{\vec{g}} \newcommand{\vh}{\vec{h}} \newcommand{\vi}{\vec{i}} \newcommand{\vk}{\vec{k}} \newcommand{\vo}{\vec{o}} \newcommand{\vp}{\vec{p}} \newcommand{\vq}{\vec{q}} \newcommand{\vr}{\vec{r}} \newcommand{\vs}{\vec{s}} \newcommand{\vmu}{\vec{\mu}} \newcommand{\vsigma}{\vec{\sigma}} \newcommand{\vphi}{\vec{\phi}} \newcommand{\vtau}{\vec{\tau}} \newcommand{\vtheta}{\vec{\theta}} \newcommand{\mA}{\mat{A}} \newcommand{\mB}{\mat{B}} \newcommand{\mC}{\mat{C}} \newcommand{\mD}{\mat{D}} \newcommand{\mE}{\mat{E}} \newcommand{\mH}{\mat{H}} \newcommand{\mK}{\mat{K}} \newcommand{\mP}{\mat{P}} \newcommand{\mQ}{\mat{Q}} \newcommand{\mR}{\mat{R}} \newcommand{\mS}{\mat{S}} \newcommand{\mU}{\mat{U}} \newcommand{\mV}{\mat{V}} \newcommand{\mW}{\mat{W}} \newcommand{\mX}{\mat{X}} \newcommand{\mY}{\mat{Y}} \newcommand{\mZ}{\mat{Z}} \newcommand{\mI}{\mat{I}} \newcommand{\mLambda}{\mat{\Lambda}} \newcommand{\mSigma}{\mat{\Sigma}} \newcommand{\mTheta}{\mat{\theta}} \newcommand{\setsymb}[1]{#1} \newcommand{\sA}{\setsymb{A}} \newcommand{\sB}{\setsymb{B}} \newcommand{\sC}{\setsymb{C}} \newcommand{\sO}{\setsymb{O}} \newcommand{\sP}{\setsymb{P}} \newcommand{\sQ}{\setsymb{Q}} \newcommand{\sH}{\setsymb{H}} \newcommand{\sX}{\setsymb{X}} \newcommand{\sY}{\setsymb{Y}} \newcommand{\norm}[2]{||{#1}||_{#2}} \newcommand{\infnorm}[1]{\norm{#1}{\infty}} \newcommand{\fillinblank}{\text{ }\underline{\text{ ? 
}}\text{ }} \newcommand{\lbrace}{\left\{} \newcommand{\rbrace}{\right\}} \newcommand{\set}[1]{\lbrace #1 \rbrace} \newcommand{\seq}[1]{\left( #1 \right)} \newcommand{\ndim}{N} \newcommand{\ndimsmall}{n} \newcommand{\dataset}{\mathbb{D}} \newcommand{\ndata}{D} \newcommand{\ndatasmall}{d} \newcommand{\labeledset}{\mathbb{L}} \newcommand{\nlabeled}{L} \newcommand{\nlabeledsmall}{l} \newcommand{\unlabeledset}{\mathbb{U}} \newcommand{\nunlabeled}{U} \newcommand{\nunlabeledsmall}{u} \newcommand{\nclass}{M} \newcommand{\nclasssmall}{m} \newcommand{\loss}{\mathcal{L}} \newcommand{\sign}{\text{sign}} \newcommand{\Gauss}{\mathcal{N}} \newcommand{\hadamard}{\circ} \newcommand{\doh}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\dox}[1]{\doh{#1}{x}} \newcommand{\doy}[1]{\doh{#1}{y}} \newcommand{\doxx}[1]{\doh{#1}{x^2}} \newcommand{\doyy}[1]{\doh{#1}{y^2}} \newcommand{\doxy}[1]{\frac{\partial #1}{\partial x \partial y}} \newcommand{\doyx}[1]{\frac{\partial #1}{\partial y \partial x}} \newcommand{\qed}{\tag*{$\blacksquare$}}\)


Introduction

Random forest, as the name implies, is a collection of tree-based models, each trained on a random subset of the training data. As an ensemble model, its primary benefit is reduced variance compared to a single tree. Because each tree in the ensemble is trained on a different random subset of the overall training set, the ensemble as a whole is less likely to overfit the training set. Random forest classifiers are among the most accurate models in many classification challenges.

Prerequisites

To understand random forests, we recommend familiarity with the concepts in

Probability: A sound understanding of conditional and marginal probabilities and Bayes' theorem is desirable.
Introduction to machine learning: An introduction to basic concepts in machine learning such as classification, training instances, features, and feature types.
Familiarity with tree-based models such as decision tree classifier and tree-based regression.

Follow the above links to first get acquainted with the corresponding concepts.

Problem setting

Random forests can be used for both classification and regression tasks.

Consider an instance $ \vx \in \real^\ndim $, a vector consisting of $ \ndim $ features, $\vx = [x_1, x_2, \ldots, x_\ndim] $.

In classification, the goal of the predictive model is to identify the class that generated a particular instance. We need to assign the instance to one of the $ M $ classes $ C_1, C_2, \ldots, C_M $ depending on the values of its $ N $ features.

In regression, the goal of the predictive model is to predict a continuous valued output for a given multivariate instance. We need to predict a real-valued output $ \hat{y} \in \real $ that is as close as possible to the true target $ y \in \real $. The hat $ \hat{ } $ denotes that $ \hat{y} $ is an estimate, to distinguish it from the truth.

For both supervised learning settings, the model is learned from a collection of labeled observations provided as tuples $ (\vx_i,y_i) $ containing the instance vector $ \vx_i $ and the true target variable $ y_i $. This collection of labeled observations is known as the training set $ \labeledset = \set{(\vx_1,y_1), \ldots, (\vx_\nlabeled,y_\nlabeled)} $.

Predictive model

Random forest is an ensemble of trees. Therefore, the prediction of the random forest is based on the collective wisdom of the trees that make up the forest.

Classification

In the classification setting, the prediction of the random forest is the most frequent class among the predictions of the individual trees. If there are $ T $ trees in the forest, then the number of votes received by a class $ \nclasssmall $ is

$$ v_{\nclasssmall} = \sum_{t=1}^T \indicator{\yhat_t == \nclasssmall} $$

where $ \yhat_t $ is the prediction of the $ t$-th tree on a particular instance. The indicator function $ \indicator{\yhat_t == \nclasssmall} $ takes on the value $ 1 $ if the condition is met and $ 0 $ otherwise.

Given these votes, the final prediction of the random forest is the class with the most votes

\begin{equation} \yhat = \argmax_{\nclasssmall \in \set{1,\ldots,\nclass}} v_{\nclasssmall} \label{eqn:class-pred} \end{equation}
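
The voting rule above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `forest_classify` and the example votes are hypothetical, and the per-tree predictions are assumed to be already computed.

```python
from collections import Counter

def forest_classify(tree_predictions):
    """Return the majority class and the vote counts v_m for each class m.

    tree_predictions holds one predicted class label per tree (assumed given).
    """
    votes = Counter(tree_predictions)  # votes[m] = number of trees predicting class m
    # most_common(1) returns [(label, count)]; ties go to the label seen first.
    winner = votes.most_common(1)[0][0]
    return winner, votes

# Hypothetical predictions from T = 7 trees for a single instance.
tree_predictions = [2, 1, 2, 3, 2, 1, 2]
prediction, votes = forest_classify(tree_predictions)
```

Here class 2 receives four of the seven votes, so it is the forest's prediction. The tie-breaking rule (first label seen) is a choice made for this sketch; the text above does not specify one.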

Regression

In the regression setting, the prediction of the random forest is the average of the predictions made by the individual trees. If there are $ T $ trees in the forest, each making a prediction $ \yhat_t $, the final prediction $ \yhat $ is

\begin{equation} \yhat = \frac{1}{T} \sum_{t=1}^{T} \yhat_t \label{eqn:reg-pred} \end{equation}
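
As a small sketch of the averaging rule, assuming the per-tree predictions are already available (the function name `forest_regress` and the numbers are made up for illustration):

```python
def forest_regress(tree_predictions):
    """Average the real-valued predictions of the individual trees."""
    return sum(tree_predictions) / len(tree_predictions)

# Hypothetical predictions from T = 4 trees for a single instance.
tree_predictions = [2.0, 3.0, 2.5, 4.5]
y_hat = forest_regress(tree_predictions)  # (2.0 + 3.0 + 2.5 + 4.5) / 4 = 3.0
```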

Training a random forest

Training random forest models is based on the idea of bootstrap aggregating, also known as bagging. Bagging-based ensemble learning works as follows

1. Randomly sample (with replacement) $ \nlabeled $ training examples from the training set of size $ \nlabeled $.
2. Train a tree-based model on the sample collected in the first step.

The above steps are repeated for each of the $ T $ trees that form the random forest. The number of trees in the forest, $ T $, is a hyperparameter, typically in the hundreds or thousands, depending on the size of the training set. It can be tuned with cross-validation or out-of-bag error. The out-of-bag error is the error of a tree on the observations that were not part of the sampled training set used to train that particular tree.
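
The bagging loop and the per-tree out-of-bag error can be sketched as follows. This is a toy illustration under stated assumptions: the "tree" is a hypothetical one-split stump on a single feature (`fit_stump`), the data are made up, and the out-of-bag error is computed as described above, per tree and then averaged over trees. All function names are illustrative, not from any library.

```python
import random

def bootstrap_indices(n, rng):
    """Step 1: draw n indices from range(n) uniformly, with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def fit_stump(xs, ys):
    """Step 2 (toy stand-in for a tree): pick the threshold with the fewest
    training errors, predicting class 1 when x > threshold and class 0 otherwise."""
    best_threshold, best_errors = None, None
    for threshold in sorted(set(xs)):
        errors = sum((x > threshold) != bool(y) for x, y in zip(xs, ys))
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def train_bagged_stumps(xs, ys, n_trees, seed=0):
    """Repeat steps 1 and 2 for each tree, remembering its out-of-bag indices."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = bootstrap_indices(n, rng)
        threshold = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        oob = sorted(set(range(n)) - set(idx))  # observations this tree never saw
        forest.append((threshold, oob))
    return forest

def oob_error(forest, xs, ys):
    """Mean over trees of each tree's error rate on its own out-of-bag observations."""
    rates = []
    for threshold, oob in forest:
        if not oob:  # a bootstrap sample may, rarely, cover every observation
            continue
        wrong = sum((xs[i] > threshold) != bool(ys[i]) for i in oob)
        rates.append(wrong / len(oob))
    return sum(rates) / len(rates) if rates else None

# Made-up one-dimensional data: class 1 for the larger feature values.
xs = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 1.7, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_bagged_stumps(xs, ys, n_trees=25, seed=0)
err = oob_error(forest, xs, ys)
```

Because each bootstrap sample leaves out some observations, every tree gets a small held-out set for free, which is why out-of-bag error can be used for tuning without a separate validation split.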

Steps 1 and 2 above apply to any bagging-based ensemble of models. In the case of random forests, further randomization is introduced in the form of feature bagging.
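
One way to see why extra randomization can help is to look at the variance of the averaged prediction. As a simplifying assumption (not stated in the text above), suppose the $ T $ tree predictions $ \hat{y}_t $ are identically distributed with variance $ \sigma^2 $ and pairwise correlation $ \rho \ge 0 $. Then

$$ \operatorname{Var}\left( \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t \right) = \frac{1}{T^2} \left( T \sigma^2 + T (T - 1) \rho \sigma^2 \right) = \rho \sigma^2 + \frac{1 - \rho}{T} \sigma^2 $$

The second term shrinks as $ T $ grows, but the first term does not. Under this assumption, adding more trees only helps up to the floor $ \rho \sigma^2 $, so making the trees less correlated lowers the variance of the ensemble.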

Remember that tree-based models work by identifying the best feature to split on at each stage to form a node in the tree. With feature bagging, random forests randomize this step: at each split, only a random subset of the features is considered, and the best feature within that subset is chosen instead of the best feature overall.
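
In the standard random forest algorithm, feature bagging restricts each split to a random subset of the features and then picks the best-scoring feature within that subset. The sketch below illustrates only that selection step. The function name `choose_split_feature`, the `feature_scores` mapping, and its values are hypothetical; in a real tree, the scores would come from a split criterion such as impurity reduction.

```python
import random

def choose_split_feature(feature_scores, n_candidates, rng):
    """Pick the best feature among a random subset of candidate features.

    feature_scores maps feature index -> split quality (higher is better);
    the values are assumed to be precomputed for the current node.
    """
    candidates = rng.sample(sorted(feature_scores), n_candidates)
    best = max(candidates, key=lambda j: feature_scores[j])
    return best, candidates

rng = random.Random(0)
# Made-up split-quality scores for N = 6 features at one node.
feature_scores = {0: 0.10, 1: 0.42, 2: 0.05, 3: 0.31, 4: 0.27, 5: 0.18}
chosen, candidates = choose_split_feature(feature_scores, n_candidates=2, rng=rng)

# With every feature as a candidate, the choice reduces to the ordinary best feature.
plain_best, _ = choose_split_feature(feature_scores, n_candidates=6, rng=rng)
```

The size of the candidate subset is a hyperparameter. A value near $ \sqrt{N} $ is a common choice for classification, though that is a convention rather than something the text above prescribes.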
