Decision trees are algorithms that sort data based on predictor variables. As a predictive model, a decision tree uses observations of predictor variables to draw conclusions about a target variable. Decision trees have a structure similar to that of a flowchart, where each internal node is a test on an attribute and each branch is an outcome of that test. A major benefit of decision trees is that they are relatively straightforward and easy to interpret.
Creating a decision tree involves selecting input variables (one target, and one or more predictors), and creating split points on the predictor variable(s) until an effective tree is constructed. Decision trees are made up of nodes (including a root and leaves, which are sometimes called terminal nodes) and branches.
All points in the decision tree are nodes. The topmost node is the “root” and the terminal nodes are the “leaves”. Each leaf holds the value of the output variable (y) that is used to make the prediction: a class label in a classification tree, or a numeric value in a regression tree. The internal (non-terminal) nodes represent decision points at which the data are split. The branches are the outcomes of each test (node).
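To make the vocabulary concrete, here is a minimal Python sketch of a tree as a data structure. The class names (`Node`, `Leaf`), the `predict` function, and the example predictors (`age`, `income`) are all hypothetical and chosen only for illustration; they are not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    prediction: str  # value of the target variable stored at a terminal node


@dataclass
class Node:
    feature: str                  # predictor tested at this decision point
    threshold: float              # split point for that predictor
    left: Union["Node", Leaf]     # branch followed when value <= threshold
    right: Union["Node", Leaf]    # branch followed when value > threshold


def predict(tree, record):
    """Walk from the root down the branches until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        if record[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.prediction


# Hypothetical hand-built tree: the root tests "age", one internal node tests "income".
tree = Node("age", 30.0,
            Leaf("no"),
            Node("income", 50000.0, Leaf("no"), Leaf("yes")))

print(predict(tree, {"age": 40, "income": 80000}))
```

Each call to `predict` follows exactly one path from the root to a leaf, which is why a single prediction is easy to explain: it is just the list of tests the record passed through.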
Decision trees can be applied to categorical or numeric data. When the target variable (what you are trying to predict) is categorical, a classification tree is constructed. When the target variable is continuous, a regression tree is constructed. Classification and regression trees are very similar, but do differ on a few points: most notably how splits (variable thresholds on which the data are divided) are determined.
For both classification and regression trees, variables and split points are chosen using a greedy algorithm: one that evaluates only a single split at a time and picks the locally optimal choice, rather than considering each split in the context of the whole model. The intent is to approximate a globally optimal model by making the best possible choice at each individual split.
Splits in a classification tree are determined either by minimizing a measure of misclassification (known as Gini Impurity) at each split or by creating splits that result in the "purest" daughter nodes (this method is often referred to as Information Gain).
Gini Impurity measures how often a randomly chosen record from a node's subset of the training data would be incorrectly labeled if it were labeled at random according to the distribution of labels in that subset. Gini Impurity reaches zero when all records in a node fall into a single category.
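As a rough illustration, Gini Impurity for a node is one minus the sum of the squared class proportions in that node. The sketch below is plain Python; the function names `gini_impurity` and `weighted_gini` are hypothetical, and weighting each daughter node by its share of the records is the usual convention rather than something specific to any one tool.

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum(p_k^2), where p_k is the proportion of class k in the node."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini(left, right):
    """Impurity of a candidate split: daughter impurities weighted by node size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


print(gini_impurity(["a", "a", "b", "b"]))   # evenly mixed two-class node
print(gini_impurity(["a", "a", "a", "a"]))   # pure node
print(weighted_gini(["a", "a"], ["b", "b"]))  # a split that separates the classes
```

A candidate split with a lower weighted impurity than the parent node is an improvement; the split with the lowest value is the one a Gini-based tree would pick.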
Information Gain is an entropy-based information index. In this context, entropy can be thought of as a measure of unpredictability – so a group where each record has the same value would have zero entropy. Information gain attempts to determine which predictor variables and split thresholds are most useful for determining the target variable by minimizing the entropy of the resulting groups. Information Gain can be thought of as the decrease in entropy after a dataset is split, and is calculated by subtracting the average entropy of the resulting daughter nodes after a split from the entropy of the parent node.
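The calculation described above can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes the "average" entropy of the daughter nodes is weighted by the number of records in each node, which is the common convention.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class labels in a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the daughter nodes."""
    n = len(parent)
    weighted = sum((len(child) / n) * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]
# A split that separates the classes perfectly removes all of the entropy.
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))
# A split that leaves both daughters as mixed as the parent gains nothing.
print(information_gain(parent, [["yes", "no"], ["yes", "no"]]))
```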
Gini Impurity and Information Gain are very similar in the context of constructing a classification tree. If you are interested in learning more, a paper on the theoretical comparison of Gini Impurity and Information Gain criteria can be found here. Typically, it does not make a significant difference whether you use Gini Impurity or Information Gain to determine splits.
Regression tree splits are most often constructed by minimizing the sum of the squared errors at each split, also known as the Least Squares Criterion. The sum of squared errors is calculated by taking the difference between the value predicted by the split and the actual known value of the training data for each record (this is known as the error or residual), squaring that value, and then summing the squared errors for all training records that pass through the node.
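A minimal sketch of a least-squares split search on a single numeric predictor might look like the following. It assumes each side of a split predicts the mean of its training targets and that candidate thresholds are the midpoints between adjacent distinct predictor values; both are common choices, but they are assumptions here, and the names `sse` and `best_split` are hypothetical.

```python
def sse(values):
    """Sum of squared errors around the mean, i.e. the node's prediction."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) for the split with the lowest combined SSE."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Toy data: the target jumps from 5 to 20 somewhere between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [5, 5, 5, 20, 20, 20]
print(best_split(xs, ys))
```

On this toy data the search settles on the threshold that separates the two groups of target values, because that is where the squared errors on both sides drop to zero.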
For both classification and regression trees, the tree stops “growing” (i.e. adding split points/nodes to sort the data) based on a stopping criterion, for example, a minimum number of training instances assigned to each leaf and node of the tree.
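To show how greedy splitting and a stopping criterion fit together, here is a small, self-contained Python sketch of a classification tree builder. It is an illustration under stated assumptions, not the algorithm used by any specific tool: the `min_leaf` parameter and the function names are hypothetical, splits are scored with Gini Impurity, thresholds are taken from observed predictor values, and the tree also stops when no split lowers the impurity.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, min_leaf):
    """Greedy search over every predictor and observed value for the lowest weighted Gini."""
    best, best_score = None, gini(labels)  # a split must beat the parent's impurity
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue  # stopping criterion: every leaf needs at least min_leaf records
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build(rows, labels, min_leaf=2):
    """Recursively grow the tree; return a label (leaf) or a dict (internal node)."""
    split = best_split(rows, labels, min_leaf)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf predicts the majority class
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build([rows[i] for i in left], [labels[i] for i in left], min_leaf),
        "right": build([rows[i] for i in right], [labels[i] for i in right], min_leaf),
    }


rows = [(1,), (2,), (3,), (10,), (11,), (12,)]
labels = ["a", "a", "a", "b", "b", "b"]
print(build(rows, labels, min_leaf=2))
```

Raising `min_leaf` makes the tree stop earlier; with six records and `min_leaf=4`, no split can satisfy the criterion, so the whole tree collapses to a single leaf.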
To improve the predictive power of the model, decision trees are often "pruned" after construction. In the context of decision trees, pruning refers to reducing the size of a tree by removing sections that do not contribute significant predictive power to the overall model. The goal of pruning is to remove branches from the decision tree without reducing predictive accuracy, measured against a cross-validation data set. Pruning reduces the overall complexity of the final decision tree model and combats overfitting, which tends to improve accuracy on data the model was not trained on.
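One simple way to illustrate the idea is reduced-error pruning, sketched below in Python. This is only one of several pruning strategies and is not necessarily the one any given tool uses. The sketch assumes a tree stored as nested dicts where each internal node also records the majority training label (`majority`), and it uses a single held-out validation set in place of full cross-validation; all names are hypothetical.

```python
def predict(tree, row):
    """Follow branches until a leaf (a bare label) is reached."""
    while isinstance(tree, dict):
        if row[tree["feature"]] <= tree["threshold"]:
            tree = tree["left"]
        else:
            tree = tree["right"]
    return tree


def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(rows)


def prune(tree, rows, labels):
    """Bottom-up: replace a subtree with a leaf when validation accuracy does not drop."""
    if not isinstance(tree, dict) or not rows:
        return tree  # already a leaf, or no validation records reach this subtree
    f, t = tree["feature"], tree["threshold"]
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    pruned = dict(tree)
    pruned["left"] = prune(tree["left"], [r for r, _ in left], [y for _, y in left])
    pruned["right"] = prune(tree["right"], [r for r, _ in right], [y for _, y in right])
    leaf = tree["majority"]
    if accuracy(leaf, rows, labels) >= accuracy(pruned, rows, labels):
        return leaf
    return pruned


# Hypothetical overfit tree: the inner split at 2 only fit noise in the training data.
tree = {
    "feature": 0, "threshold": 5, "majority": "a",
    "left": {"feature": 0, "threshold": 2, "majority": "a", "left": "a", "right": "b"},
    "right": "b",
}
val_rows = [(1,), (3,), (4,), (7,), (8,)]
val_labels = ["a", "a", "a", "b", "b"]
print(prune(tree, val_rows, val_labels))
```

In this toy example the inner split is collapsed into a single leaf because it hurts validation accuracy, while the root split is kept because removing it would make predictions worse.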
What are a few limitations of Decision Tree Models?
Decision trees tend to be less accurate than many other predictive modeling approaches. They also tend not to be very robust, meaning a small change in the training data can lead to a large change in the tree. Because the splits are chosen by a greedy algorithm, a globally optimal decision tree (the best possible overall model) cannot be guaranteed. Decision trees can also be prone to overfitting, meaning the model performs well on the data it was trained with, but poorly on other data sets.
Many limitations of decision trees have been addressed by ensemble learning, with methods like random forest or boosting. However, these models tend to be less approachable and are often more opaque.
What about strengths?
Decision trees are easy to illustrate, understand, and interpret. They can be applied to both categorical and continuous data, and they are resistant to outliers in a data set as well as to irrelevant predictor variables. Additionally, decision trees allow you to identify the specific variable thresholds used to sort your data by the target variable.
The Decision Tree Tool is one of the most popular predictive tools in Alteryx and the Data Science community as a whole. This is probably because it is a highly approachable and easily understood modeling method. However, there are limitations to this method, and, as with all methods, it is important to understand how the underlying algorithm works, and how modeling decisions might impact your outcome.