# Random forest

## Introduction

Random forest, as the name implies, is an ensemble of tree-based models, each trained on a random subset of the training data. Being an ensemble model, the primary benefit of a random forest over a single tree is reduced variance: because each tree in the ensemble is trained on a random subset of the overall training set, the ensemble as a whole is less likely to overfit the training set. Random forest classifiers are among the most accurate models in many classification challenges.

## Prerequisites

To understand random forests, we recommend familiarity with the concepts in the linked prerequisite articles.

Follow those links to first get acquainted with the corresponding concepts.

## Problem setting

Random forests can be used for both classification and regression tasks.

Consider an instance $\vx \in \real^\ndim$, a vector consisting of $\ndim$ features, $\vx = [x_1, x_2, \ldots, x_\ndim]$.

In classification, the goal of the predictive model is to identify the class that generated a particular instance. We need to assign $\vx$ to one of the $\nclass$ classes $C_1, C_2, \ldots, C_\nclass$ depending on the values of its $\ndim$ features.

In regression, the goal of the predictive model is to predict a continuous valued output for a given multivariate instance. We need to predict a real-valued output $\hat{y} \in \real$ that is as close as possible to the true target $y \in \real$. The hat $\hat{ }$ denotes that $\hat{y}$ is an estimate, to distinguish it from the truth.

For both supervised learning settings, the model is inferred over a collection of labeled observations provided as tuples $(\vx_i,y_i)$ containing the instance vector $\vx_i$ and the true target variable $y_i$. This collection of labeled observations is known as the training set $\labeledset = \set{(\vx_1,y_1), \ldots (\vx_\nlabeled,y_\nlabeled)}$.

## Predictive model

Random forest is an ensemble of trees. Therefore, the prediction of the random forest is based on the collective wisdom of the trees that make up the forest.

#### Classification

In the classification setting, the prediction of the random forest is the majority class among the predictions of the individual trees. If there are $T$ trees in the forest, then the number of votes received by a class $\nclasssmall$ is

$$v_{\nclasssmall} = \sum_{t=1}^T \indicator{\yhat_t = \nclasssmall}$$

where $\yhat_t$ is the prediction of the $t$-th tree on a particular instance. The indicator function $\indicator{\yhat_t = \nclasssmall}$ takes the value $1$ if the condition is met and $0$ otherwise.

Given these votes, the final prediction of the random forest is the class with the most votes

\begin{equation} \yhat = \argmax_{\nclasssmall \in \set{1,\ldots,\nclass}} v_{\nclasssmall} \label{eqn:class-pred} \end{equation}
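The voting rule above can be sketched in a few lines. This is a minimal illustration assuming NumPy; `forest_classify` is a hypothetical helper, not a library function:

```python
import numpy as np

def forest_classify(tree_predictions):
    """Majority vote over per-tree class predictions.

    tree_predictions: integer array of shape (T,) holding the class
    index predicted by each of the T trees for a single instance.
    """
    votes = np.bincount(tree_predictions)  # v_m: votes for each class m
    return int(np.argmax(votes))           # class with the most votes

# Example: 5 trees voting over 3 classes; class 2 gets 3 of 5 votes
print(forest_classify(np.array([0, 2, 2, 1, 2])))  # -> 2
```

Note that `np.argmax` breaks ties by returning the lowest class index; real implementations may break ties differently.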

#### Regression

In the regression setting, the prediction of the random forest is the average of the predictions made by the individual trees. If there are $T$ trees in the forest, each making a prediction $\yhat_t$, the final prediction $\yhat$ is

\begin{equation} \yhat = \frac{1}{T} \sum_{t=1}^{T} \yhat_t \label{eqn:reg-pred} \end{equation}
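The averaging rule is equally direct. A minimal sketch assuming NumPy; `forest_regress` is a hypothetical helper:

```python
import numpy as np

def forest_regress(tree_predictions):
    """Average the per-tree regression outputs, as in the equation above."""
    return float(np.mean(tree_predictions))

# Example: 3 trees predicting for a single instance
print(forest_regress(np.array([2.0, 3.0, 4.0])))  # -> 3.0
```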

## Training a random forest

Training a random forest is based on the idea of bootstrap aggregating, also known as bagging. Bagging-based ensemble learning works as follows:

1. randomly sample (with replacement) $\nlabeled$ training examples from the training set of size $\nlabeled$.
2. train a tree-based model on the sample collected in the first step.

The above steps are repeated for each of the $T$ trees that form the random forest. The number of trees in the forest, $T$, is a hyperparameter, typically in the hundreds or thousands, depending on the size of the training set. It can be tuned with cross-validation or out-of-bag error. The out-of-bag error is the error of a tree on the observations that were not part of the sampled training set used to train that particular tree.
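The sampling in step 1 can be sketched as follows. This is a minimal illustration assuming NumPy; `bootstrap_sample` is a hypothetical helper, and the tree training of step 2 is left abstract:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_sample(X, y, rng):
    """Step 1: sample n examples with replacement from a training set
    of size n. Also return the out-of-bag (OOB) indices, i.e. the
    examples this tree never sees, usable for its OOB error."""
    n = len(y)
    idx = rng.integers(0, n, size=n)        # sampling with replacement
    oob = np.setdiff1d(np.arange(n), idx)   # examples left out of the sample
    return X[idx], y[idx], oob

# Toy data: 8 examples with 2 features (illustrative only)
X = rng.normal(size=(8, 2))
y = rng.integers(0, 2, size=8)

X_boot, y_boot, oob = bootstrap_sample(X, y, rng)
# Step 2 would train one tree on (X_boot, y_boot); repeating both
# steps T times yields the forest.
```

Because each draw is with replacement, on average about 37% of the examples end up out-of-bag for any given tree.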

Steps 1 and 2 above suffice to create any bagged ensemble of models. In the case of random forests, further randomization is introduced in the form of feature bagging.

Recall that tree-based models grow by selecting, at each node, the feature that best splits the data. With feature bagging, random forests randomize this step: at each split, only a random subset of the $\ndim$ features is considered, and the best feature within that subset is chosen, rather than the best feature overall.
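The per-split feature selection can be sketched as below. Assumptions: NumPy, and the common default of roughly $\sqrt{\ndim}$ candidate features per split for classification; `candidate_features` is a hypothetical helper:

```python
import numpy as np

rng = np.random.default_rng(42)

def candidate_features(n_features, rng):
    """At each split, draw a random subset of features to evaluate.
    sqrt(n_features) candidates is a common default for classification
    (an assumption here, not the only choice)."""
    k = max(1, int(np.sqrt(n_features)))
    return rng.choice(n_features, size=k, replace=False)

# With 16 features, each split evaluates only 4 random candidates;
# the tree then picks the best split among these 4 features only.
feats = candidate_features(16, rng)
print(sorted(feats.tolist()))
```

Drawing a fresh subset at every split, rather than once per tree, is what decorrelates the trees and gives the forest its variance reduction.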