# Seeing the Forest for the Trees: An Introduction to Random Forest

Sr. Data Science Content Engineer

The Forest Tool in Alteryx implements a random forest. Random forests are pretty neat. They leverage ensemble learning to use what are typically considered to be weak learners (Decision Trees) to create a stronger and more robust modeling method.

Random forest models are composed of decision trees, so it is important to make sure you understand the trees before taking on the forest. If you need to brush up on decision trees, please take a moment to check out Planting Seeds – An Introduction to Decision Trees.

As you may know, two major limitations of decision trees are that they are prone to overfitting, and that they tend to be unstable, meaning a small change in the training data can result in a very different tree. Random forest models overcome these two shortcomings by generating many decision trees, and then aggregating the predictions of the individual trees into a single model prediction.

Creating and then combining the results of a bunch of decision trees seems pretty basic. However, simply creating multiple trees out of the exact same training data wouldn’t be productive: it would result in a series of strongly correlated trees. All of these trees would sort data in the same way, so there would be no advantage to this method over a single decision tree.

This is where the fancy part of random forests starts to come into play. To decorrelate the trees that make up a random forest, a process called bootstrap aggregating (also known as bagging) is conducted. Bagging generates new training data sets from an original data set by sampling the original training data with replacement (bootstrapping). This is repeated once for each decision tree that will make up the random forest. Each individual bootstrapped data set is then used to construct a tree. This process effectively decreases the variance of the model (error introduced by sensitivity to random noise in the training data, i.e., overfitting) without substantially increasing the bias (underfitting). On its own, bagging the training data to generate multiple trees creates what is known as a bagged trees model.
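To make the bootstrapping step concrete, here is a minimal sketch in plain Python. The ten integer "records", the forest size of three, and the helper name `bootstrap_sample` are all made up for illustration; this is not how the Forest Tool is implemented internally.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) records from rows, with replacement."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(0)           # fixed seed so the sketch is repeatable
training_rows = list(range(10))  # stand-in for 10 training records
n_trees = 3                      # hypothetical forest size

# One bootstrapped data set per tree in the forest.
bootstrapped_sets = [bootstrap_sample(training_rows, rng) for _ in range(n_trees)]

for i, sample in enumerate(bootstrapped_sets):
    # Some records typically appear more than once, and others not at all.
    print(f"tree {i}: {sorted(sample)}")
```

Each bootstrapped set is the same size as the original, but because records are drawn with replacement, each tree ends up seeing a slightly different version of the training data.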

A similar process called the random subspace method (also called attribute bagging or feature bagging) is also implemented to create a random forest model. Rather than considering every predictor variable, only a random subset of the predictor variables is sampled as split candidates; in the standard random forest algorithm, this subset is redrawn at each split within each tree. This further decorrelates the trees by preventing dominant predictor variables from being the first or only variables selected to create splits in each of the individual decision trees. Without this step, there is a risk that one or two dominant predictor variables would consistently be selected as the first splitting variable for each decision tree, and the resulting trees would be highly correlated. The combination of bagging and the random subspace method results in a random forest model.
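The predictor sampling can be sketched the same way. The iris-style predictor names and the square-root rule for the subset size are assumptions for illustration (the square root of the number of predictors is a common default for classification, but the article does not specify one), and `sample_features` is a hypothetical helper.

```python
import math
import random

def sample_features(feature_names, n_features, rng):
    """Pick n_features distinct predictors at random (without replacement)."""
    return rng.sample(feature_names, n_features)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Assumed subset size: roughly the square root of the number of predictors.
n_candidates = max(1, int(math.sqrt(len(features))))

for draw in range(3):
    candidates = sample_features(features, n_candidates, rng)
    # Only these candidates would be considered when choosing the split.
    print(f"draw {draw}: candidate predictors = {candidates}")
```

Because a dominant predictor is left out of many of the draws, other predictors get a chance to define splits, which is what pushes the trees apart.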

The aggregating part of bootstrap aggregating comes from combining the predictions of each of these decision trees to determine an overall model prediction. For each record, the output of the overall model is the mode of the predicted classes (classification) or the mean of the predicted values (regression) across all of the individual trees.

In this case, the record plugged into the (simplified) random forest model was classified as Versicolor in the majority of the trees (2/3), so the Random Forest will classify the record as Versicolor.
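The aggregation step is small enough to write out directly. This is a rough sketch with hypothetical helper names; the three votes mirror the simplified three-tree example above, and the regression values are made up.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(tree_votes):
    """Return the most common class label among the trees' predictions.

    Ties go to the label seen first, which is a simplification in this sketch.
    """
    return Counter(tree_votes).most_common(1)[0][0]

def aggregate_regression(tree_predictions):
    """Return the mean of the trees' numeric predictions."""
    return mean(tree_predictions)

# Two of the three trees vote Versicolor, so the forest predicts Versicolor.
votes = ["Versicolor", "Virginica", "Versicolor"]
print(aggregate_classification(votes))        # Versicolor

# For a continuous target, the forest averages the tree outputs instead.
print(aggregate_regression([2.0, 3.0, 4.0]))  # 3.0
```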

There are a couple more components of random forest that are important to highlight.

Out of Bag Error

Because each bootstrapped data set is sampled with replacement, about 1/3 (roughly 37%) of the original training records are left out of the sample used to build each individual decision tree. These excluded records are referred to as the out-of-bag (OOB) observations. This effect has been leveraged to create out-of-bag (OOB) error estimates, which can be used in place of cross-validation metrics.
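If the "about 1/3" figure seems surprising, a quick simulation can check it. The chance that a given record is never drawn in n draws with replacement is (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. The record count and number of repetitions below are arbitrary choices for the sketch.

```python
import random

def oob_fraction(n_records, rng):
    """Fraction of records that never appear in one bootstrap sample."""
    drawn = {rng.randrange(n_records) for _ in range(n_records)}
    return 1 - len(drawn) / n_records

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 1000                # arbitrary training set size
n_repeats = 200         # arbitrary number of simulated bootstrap samples

simulated = sum(oob_fraction(n, rng) for _ in range(n_repeats)) / n_repeats
expected = (1 - 1 / n) ** n  # probability a given record is never drawn

print(f"simulated OOB fraction: {simulated:.3f}")
print(f"theoretical value:      {expected:.3f}")
```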

Out-of-bag error is calculated by running each record through only the decision trees whose training data it was not part of, and then aggregating those results into a single prediction. An OOB prediction can be determined for all training records, meaning that an overall OOB error (MSE for regression, misclassification rate for classification) can be calculated and used as a model error rate.
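Here is a rough sketch of the OOB bookkeeping for a regression target. To keep it short, each "tree" is a deliberately trivial stand-in that always predicts the mean target of its bootstrap sample; a real forest would fit an actual decision tree at that step. The function names and the toy data are assumptions for illustration.

```python
import random
from statistics import mean

def fit_stand_in_tree(sample):
    """Stand-in for a decision tree: always predicts the sample's mean target."""
    prediction = mean([y for _, y in sample])
    return lambda x: prediction

def oob_mse(records, n_trees, rng, fit=fit_stand_in_tree):
    """Out-of-bag mean squared error for a list of (x, y) records."""
    n = len(records)
    # For each record, collect predictions only from trees that never saw it.
    oob_predictions = [[] for _ in range(n)]
    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]
        tree = fit([records[i] for i in in_bag])
        in_bag_set = set(in_bag)
        for i in range(n):
            if i not in in_bag_set:
                oob_predictions[i].append(tree(records[i][0]))
    # Aggregate each record's OOB predictions, then score against the target.
    squared_errors = [
        (mean(preds) - records[i][1]) ** 2
        for i, preds in enumerate(oob_predictions)
        if preds  # skip any record that happened to be in every bag
    ]
    return mean(squared_errors)

rng = random.Random(1)                          # fixed seed for repeatability
records = [(x, 2.0 * x) for x in range(20)]     # toy data: y = 2x
print(oob_mse(records, n_trees=50, rng=rng))
```

The part that carries over to a real forest is the bookkeeping: every record is scored only by trees that did not train on it, so the resulting error behaves like a built-in validation estimate.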

Variable Importance

Another neat thing about random forests is that predictor variable importance is calculated as the model is built. Gini impurity, the measure used to choose the splits at the nodes of the individual decision trees, is leveraged to generate Mean Decrease in Impurity (MDI), also known as Gini importance.

MDI is a variable’s total decrease in node impurity, with each decrease weighted by the proportion of samples reaching that node, averaged over all of the individual decision trees in the random forest. Each predictor variable used to create the random forest model has a resulting MDI value, which is used to rank variable importance to the model. A higher MDI (also reported as Mean Decrease in Gini) indicates higher variable importance.
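The arithmetic behind MDI can be sketched in a few lines. This assumes Gini impurity for a categorical target; the class labels, per-tree totals, and function names are made up for illustration and are not output from a fitted model.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of samples at the node."""
    n = len(parent)
    child_impurity = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - child_impurity)

def mean_decrease_in_impurity(per_tree_totals):
    """Average each variable's per-tree total decrease over all trees."""
    variables = {v for tree in per_tree_totals for v in tree}
    n_trees = len(per_tree_totals)
    return {v: sum(t.get(v, 0.0) for t in per_tree_totals) / n_trees for v in variables}

# A root split that separates two balanced classes perfectly.
parent = ["Setosa"] * 5 + ["Versicolor"] * 5
left, right = parent[:5], parent[5:]
print(weighted_impurity_decrease(parent, left, right, n_total=10))  # 0.5

# Hypothetical per-variable totals from a two-tree forest.
per_tree_totals = [
    {"petal_length": 0.5},
    {"petal_length": 0.25, "sepal_width": 0.25},
]
mdi = mean_decrease_in_impurity(per_tree_totals)
print(sorted(mdi.items(), key=lambda kv: kv[1], reverse=True))
```

A variable that is never used in a given tree contributes zero for that tree, which is why variables that rarely produce useful splits end up near the bottom of the ranking.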

Limitations of random forest

What random forests gain over decision trees in model robustness is lost in interpretability and approachability. In a decision tree you can see the individual variable splits and thresholds used to sort the target variable. This is not possible in a random forest, where hundreds of trees are aggregated to create an estimate. Because they are less approachable and less easily interpreted, random forest models are often seen as "black boxes."

Strengths

Random forests share many of the same strengths as the decision trees they are made up of. They can be applied to categorical or continuous target variables, and they can handle unbalanced datasets, outliers, and non-linear relationships. In addition to these strengths, random forests tend to have much better predictive power than a single decision tree and are less prone to overfitting. Random forest models tend to perform very well in estimating categorical data.

In summary…

Random forest is an ensemble machine learning method that leverages the individual predictive power of decision trees by creating multiple decision trees and then combining the trees into a single model by aggregating the individual tree predictions. Random forests are more robust and tend to have better predictive power than a decision tree. However, they are also more opaque and can seem more intimidating. Hopefully, this article has allowed you to see the forest for the (decision) trees and has shown you how neat random forests are.