

Understanding the Outputs of the Decision Tree Tool

SydneyF
Alteryx Alumni (Retired)

This article reviews the outputs of the Decision Tree Tool. For a general description of how decision trees work, read Planting Seeds: An Introduction to Decision Trees; for a rundown of the configuration of the Decision Tree Tool, check out the Tool Mastery article; and for a really awesome and accessible overview of the Decision Tree Tool, read the Data Science blog post An Alteryx Newbie Takes on the Predictive Suite: Decision Tree.

Like the configuration, the outputs of the Decision Tree Tool change based on (1) your target variable, which determines whether a Classification Tree or Regression Tree is built, and (2) which algorithm you selected to build the model with (rpart or C5.0).

Your target variable determines whether the tool constructs a Classification Tree or a Regression Tree. Classification and regression trees are very similar, but they differ on a few points: most notably how splits (the variable thresholds on which the data are divided) are determined, but also how the resulting model and predictions are assessed. This is because categorical and continuous predictions cannot be assessed using the same metrics. Classification Trees are typically evaluated with confusion matrices and F1-Scores, whereas Regression Trees are assessed with values like R2 and Mean Squared Error (MSE).
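To make the distinction concrete, here is a minimal sketch in R (the language the Predictive Tools run under the hood), assuming the rpart package and the built-in iris data set; the Decision Tree Tool assembles an equivalent call from your configuration:

    # A minimal sketch, assuming the rpart package and the built-in iris data.
    library(rpart)

    # A categorical target yields a classification tree...
    class_tree <- rpart(Species ~ ., data = iris, method = "class")

    # ...while a continuous target yields a regression tree.
    reg_tree <- rpart(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length,
                      data = iris, method = "anova")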

Interpreting the Outputs

O (Output): A serialized model object. It is the actual Decision Tree Model that you have created with the Decision Tree Tool. It can be used as an input for other Predictive Tools, like the Score Tool, which will run your model to estimate the target variable, or the Model Comparison Tool (available in the Predictive District of the Alteryx Gallery) which compares the performance of different models on a validation data set.

R (Report): This is a static report that summarizes your Decision Tree Model. It will look different depending on which algorithm you selected to create your Decision Tree with in the tool’s configuration. The default (if you didn’t go into model customization) is rpart.

For the rpart Algorithm

DecisionTreeReport.png

The Call (2) is a printout of the core R code used to generate your model. This allows you to double-check the configuration of your Decision Tree Tool.
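As a hypothetical example (the formula, data name, and parameter values here are illustrative, not what your own report will show), a Call for a classification tree might look like:

    rpart(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
          data = the.data, method = "class", minsplit = 20, xval = 10, cp = 0.01)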

The Model Summary (3) lists the variables that were actually used to construct the model. We can see that for this tree, only half of the variables provided were used. Root node error is the percent of incorrectly classified records at the first (root) splitting node, before any splits are made. This value can be used to calculate two measures of predictive performance in combination with Rel Error and X Error, both of which are included in the Pruning Table. Root Node Error x Rel Error is the resubstitution error rate (the error rate computed on the training sample). Root Node Error x X Error is the cross-validated error rate, which is a more objective measure of predictive accuracy. n is the number of records used to construct the tree.
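As a worked example with made-up numbers: a Root Node Error of 0.44 combined with a Rel Error of 0.15 gives a resubstitution error rate of 0.44 x 0.15 ≈ 0.066 (6.6%), while an X Error of 0.30 gives a cross-validated error rate of 0.44 x 0.30 ≈ 0.132 (13.2%).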

The Pruning Table (4 & 5) depicts information about pruning from the rpart algorithm. Rel error (relative error) is the error for predictions made on the data that were used to estimate the model, scaled relative to the root node error (for a regression tree, it is equivalent to 1 - R2). The x error is the cross-validation error (generated by rpart's built-in cross-validation). Each level in the Pruning Table is the depth of the tree at which the corresponding values were calculated. This can help you decide where to prune the tree.
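If you want to inspect the same table outside of Alteryx, a minimal sketch in R (assuming the class_tree model fit above) is:

    # Print the complexity parameter table: CP, nsplit, rel error, xerror,
    # and xstd, with the root node error and n reported in the header.
    printcp(class_tree)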

Each row in this table represents a different height/depth of the tree. A tree with more levels has lower classification error on the training data, but an increased risk of overfitting. Cross-validation error typically increases as the tree grows beyond the optimal level. The rule of thumb is to select the lowest level where rel_error + xstd < xerror.
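A minimal sketch of acting on this in R (assuming the class_tree model from earlier; this uses the closely related one-standard-error heuristic to pick a cp value and prune):

    # Select the smallest tree whose cross-validated error is within one
    # standard error of the minimum, then prune to that complexity.
    cp_table  <- class_tree$cptable
    best      <- which.min(cp_table[, "xerror"])
    threshold <- cp_table[best, "xerror"] + cp_table[best, "xstd"]
    chosen    <- min(which(cp_table[, "xerror"] <= threshold))
    pruned    <- prune(class_tree, cp = cp_table[chosen, "CP"])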

Finally, the Leaf Summary lists the variables and split thresholds at each node, as well as how the target variable records were split, expressed as percentages.

If you chose to include a Tree Plot or a Pruning Plot (or both) under the Plot tab in Model Customization, you will also see an illustration of your decision tree (the Tree Plot) and/or a Pruning Plot.

The Tree Plot is an illustration of the nodes, branches and leaves of the decision tree created for your data by the tool. In the plot, the nodes include the thresholds and variables used to sort the data. For classification trees, the leaves (terminal nodes) include the fraction of records correctly sorted by the decision tree.

TreePlot.png

If you constructed a Regression Tree (your target variable is continuous), your Tree Plot will look slightly different.

rpart_RegressionTree.png

For the Tree Plot of a regression tree, the terminal nodes depict the predicted response at that node. These values are calculated as the average response (target variable value) of all the records from your training data that were sorted into each terminal node.
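A minimal sketch of that relationship in R (assuming the reg_tree model from earlier):

    # Each prediction is the mean target value of the training records
    # that fall into the same terminal node.
    leaves     <- reg_tree$where                 # terminal node of each training record
    node_means <- tapply(iris$Petal.Width, leaves, mean)
    # predict() on the training data returns exactly these per-leaf means:
    all.equal(unname(node_means[as.character(leaves)]), unname(predict(reg_tree)))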

The Pruning Plot depicts the cross-validated error summary. The Complexity Parameter (cp) values are plotted against the cross-validation error calculated by the rpart algorithm.

2018-02-15_11-13-30.png

The blue dashed line is drawn one standard error above the minimum cross-validated error. A reasonable choice of cp for pruning is often the leftmost value for which the mean cross-validated error lies below this line. In this case, we see that the optimal size of the tree is 3 terminal nodes.
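In R, the same figure comes from plotcp (a sketch, assuming the class_tree model from earlier):

    # Plot cross-validated error against cp; the dashed line sits one
    # standard error above the minimum cross-validated error.
    plotcp(class_tree)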

For the C5.0 Algorithm

The C5.0 Report (R) output is spread out over multiple pages. At the top of the report, there are buttons you can use to navigate between the pages.

The first page of the report, like the rpart report, includes the R code used to create the model under Call:

c50report1.png

It also specifies the version of C5.0 used, as well as the date and time the model was generated.

The next page in the report is a text write-out of the decision tree. The first line describes the data provided to the tool to generate the model; cases are equivalent to records. This model was built from 150 records with 5 variables.

c50report2.png
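A minimal sketch of producing a comparable write-out directly in R (assuming the C50 package and the iris data, which match the 150 records and 5 variables above):

    # Fit a C5.0 classification tree; summary() prints the text tree,
    # size, errors, confusion matrix, and attribute usage seen in the report.
    library(C50)
    c50_tree <- C5.0(Species ~ ., data = iris)
    summary(c50_tree)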

The next page is a continuation of the written-out tree.

c50report3.png

The Size and Errors data are split across two pages. For this tree, Size is 5 and Errors is 3 (2%). This means that three of the training records were incorrectly classified.

c50report4.png

This page also includes a confusion matrix that details how the training records were classified.

The next page describes attribute usage, which is how the predictor variables were used to sort the data.

Time is how long it took the C5.0 algorithm to build the decision tree.

c50report5.png

The very last page, if you chose to include a Tree Plot in your Report (R) output, is the plotted figure of the tree.

c50report6.png

If you selected the decompose tree into rule-based model option under Model Customization for the C5.0 algorithm, the report will not include a tree plot. Instead of a tree, the report will include a list of the rules used to sort the data. Those rules look like this:

c50rulesreport.png
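In R, the corresponding option is the rules argument to C5.0 (a sketch, assuming the C50 package and iris data):

    # Build a rule-based classifier instead of a tree; summary() then
    # prints the rule list rather than a tree write-out.
    c50_rules <- C5.0(Species ~ ., data = iris, rules = TRUE)
    summary(c50_rules)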

I (Interactive): This is an interactive dashboard. It will include different information depending on whether you built a classification or regression tree. For a classification tree, the interactive report includes a Summary Tab and a Misclassifications Tab, as well as a Tree Tab if you used the rpart algorithm. For a regression tree, the interactive dashboard consists of a Summary Tab, a Model Performance Tab, and a Variable Importance Tab.

The interactive output looks the same for trees built with rpart or C5.0, except that C5.0 will not include an interactive tree plot, which is included for rpart classification trees.

The classification tree Summary Tab includes model Accuracy, measured as the percent of correctly classified records, as well as the F1_Score, model Precision, and model Recall. Precision and Recall are combined to calculate the F1_Score.
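Specifically, the F1_Score is the harmonic mean of the two: F1 = 2 x (Precision x Recall) / (Precision + Recall). For example, a Precision of 0.90 and a Recall of 0.80 give an F1_Score of 2 x 0.72 / 1.70 ≈ 0.85.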

2018-02-15_11-16-43.png

The Misclassifications tab displays a confusion matrix (sometimes called a table of confusion), which gives a breakdown of the number of false positives, false negatives, true positives, and true negatives.

2018-02-15_11-16-58.png
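A minimal sketch of how such a matrix is tallied in R (assuming the class_tree model from earlier; as with the report output, this is computed on the training data):

    # Cross-tabulate predicted against actual classes for the training data.
    predicted <- predict(class_tree, type = "class")
    table(Predicted = predicted, Actual = iris$Species)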

The Tree tab depicts a plot of the tree that you can interactively zoom in and out of. This will only be available if you used the rpart algorithm.

2018-02-15_11-17-18.png

Clicking on individual branches will allow you to interactively examine the performance of each branch on the training data.

InteractiveBranch.png

For Regression Trees, there is a Summary Tab, a Model Performance Tab, and a Variable Importance Tab.

The Summary Tab includes a series of model performance and error measures: R-Squared (sometimes called the coefficient of determination), Adjusted R-Squared, Mean Absolute Error, Mean Absolute Percentage Error, Mean Squared Error (MSE), and Root Mean Square Error.

RegressiontreeIoutput.png
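A minimal sketch of how the measures above are computed in R (assuming the reg_tree model from earlier; these are the standard textbook definitions, which may differ in small details from the dashboard's implementation):

    # Standard error measures, computed on the training data.
    actual <- iris$Petal.Width
    fitted <- predict(reg_tree)
    resid  <- actual - fitted

    mae  <- mean(abs(resid))                  # Mean Absolute Error
    mape <- mean(abs(resid / actual)) * 100   # Mean Absolute Percentage Error
    mse  <- mean(resid^2)                     # Mean Squared Error
    rmse <- sqrt(mse)                         # Root Mean Square Error
    r2   <- 1 - sum(resid^2) / sum((actual - mean(actual))^2)  # R-Squared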

The Model Performance tab includes similar metrics to the Summary Tab: Mean Absolute Error, Mean Absolute Percent Error (MAPE), R2 Score (coefficient of determination), Relative Absolute Error, and Root Mean Square Error.

The Model Performance Tab also includes a histogram of the residuals, along with some summary statistics of the residuals. The breakdown of the residuals (the differences between predicted and actual values) can be used to check error variance (i.e., where in your data the model is performing poorly). We can see in this example histogram that the residuals are normally distributed. If the histogram indicates that the random error is not normally distributed, it suggests that the model's underlying assumptions may have been violated.

regressionmodelperformance.png
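A sketch of the same diagnostic in R (assuming the resid vector computed above):

    # A roughly symmetric, bell-shaped histogram centered on zero suggests
    # the normality assumption for the errors is reasonable.
    hist(resid, breaks = 20, main = "Residuals", xlab = "Actual - Predicted")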

The Variable Importance tab displays variable importance for each predictor variable in your decision tree. Variable importance is measured as the sum of the goodness-of-split measures for each split for which the variable was the primary splitting variable, plus the goodness x (adjusted agreement) for all splits in which it was a surrogate. The values are scaled to sum to 100 and rounded; any variable with a proportion less than 1% is omitted.

Importance is calculated for each variable individually as the sum of the decrease in impurity, counting both the splits in which the variable appears as the primary splitter and those in which it appears as a surrogate. The values are then transformed into a percentage score, proportional to each variable's contribution. You can read a better description of what variable importance means in the rpart R package vignette.
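A minimal sketch of retrieving and rescaling the raw importances in R (assuming the reg_tree model from earlier):

    # rpart stores raw importances; rescale them to a percentage score.
    vi <- reg_tree$variable.importance
    round(100 * vi / sum(vi), 1)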

regressionvariableimportance.png

That is the end of the Decision Tree Tool outputs. I hope you've enjoyed this rundown and have a better idea of how to interpret your Alteryx Decision Tree.

Attachments
Comments
Prashant_Iyer
5 - Atom

Hey! Thanks for the explanation. I just wanted to know whether the accuracy shown in the interactive dashboard (for the classification model) is for the test data set or the training data set?

SydneyF
Alteryx Alumni (Retired)

Hi @Prashant_Iyer,


All metrics calculated in the Report (R) and Interactive (I) outputs of the Decision Tree Tool are based on the training data. If you would like to assess your model(s) with test data, you may be interested in the Model Comparison Tool, which is available for download on the Alteryx Analytics Gallery.


Thanks!


Sydney

tanthiamhuat
5 - Atom

Do you mind uploading your model? I did a very simple Decision Tree with the iris dataset, but it is taking donkey's years to run. I suspect some input parameters were not entered correctly.


DT.png

SydneyF
Alteryx Alumni (Retired)

Hi @tanthiamhuat,


I went ahead and attached the packaged workflow, hope it helps!

DawnDuong
13 - Pulsar

Great article, thank you.

SaiKrishna2589
8 - Asteroid

Hi,

Can someone please provide more clarity on the below?


'The rule of thumb is to select the lowest level where rel_error + xstd < xerror'?