Test accuracy of predictive model on new data

Question

Hi,

I am experimenting with the Decision Tree model and am having some difficulty understanding how to test the accuracy of a model on new data. The interactive report produced by the Decision Tree tool is really informative, but as far as I understand, the performance scores (accuracy, F1, precision, recall) are not evaluated on a test set. Does that mean that once the model is built, they are calculated using the entire data set it was trained on? The report showed an accuracy of 72% over 5 classes.

Here is how I tested the accuracy with a Score tool: I split the data into train and test sets and trained the Decision Tree model on the train set only. Then I used the Score tool with the saved model (as a yxdb) and the test data as inputs. If the score column with the highest probability was the one corresponding to the correct label, I marked the record as correctly predicted. This way I only got 20% accuracy for 5 classes, which is far lower. It would make sense to me for the interactive report to be generated on new data, so that it reflects the predictive capability of the model, so I am confused about why the accuracy scores differ so much.

My questions are the following:

How can I get the interactive report while running a saved model on new data?

How is the accuracy calculated in the Decision Tree tool?

Is a decision tree built with C5.0 supported by the Score tool? I got an error message ending in "is not one of the allowed types".

I would really appreciate your comments on whether I am approaching this correctly, and your help with the questions.

Thanks,
Sophie

SydneyF · Accepted Answer

Hi @Sophie_,

How is the accuracy calculated in the Decision Tree tool?

Your understanding is correct. The metrics presented in both the Report (R) and Interactive (I) outputs are not based on an independent validation data set. The metrics are derived from how well the model classifies the data it was trained with. These values are lower than 100% for a few reasons, one of which is that a decision tree will stop splitting the data at a given point during training, or prune nodes, to mitigate overfitting. The 72% over 5 classes is how the model performs on its own training data. You can check this by using the Score Tool to run your training data through your decision tree model. The accuracy should match.
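To make that check concrete, here is a minimal Python sketch of the calculation described in the question: pick the score column with the highest probability for each record and compare it to the true label. The column names ("Label", "Score_A", and so on) and the sample records are made up for illustration, so adjust them to whatever your Score Tool output actually contains. This is not the code the Decision Tree Tool runs internally.

```python
def predict_label(record, score_prefix="Score_"):
    """Return the class whose score column holds the highest probability.

    Assumes (hypothetically) that each class has a column named
    score_prefix + class name.
    """
    scores = {
        name[len(score_prefix):]: value
        for name, value in record.items()
        if name.startswith(score_prefix)
    }
    return max(scores, key=scores.get)


def accuracy(records, label_column="Label", score_prefix="Score_"):
    """Fraction of records whose highest-scoring class equals the true label."""
    correct = sum(
        1 for r in records if predict_label(r, score_prefix) == r[label_column]
    )
    return correct / len(records)


# Made-up scored records standing in for the Score Tool output.
scored = [
    {"Label": "A", "Score_A": 0.7, "Score_B": 0.2, "Score_C": 0.1},
    {"Label": "B", "Score_A": 0.5, "Score_B": 0.3, "Score_C": 0.2},
    {"Label": "C", "Score_A": 0.1, "Score_B": 0.2, "Score_C": 0.7},
    {"Label": "B", "Score_A": 0.1, "Score_B": 0.8, "Score_C": 0.1},
]

print(accuracy(scored))
```

Running the same function once on the scored training records and once on the scored test records gives the two numbers to compare. If the score columns and the label values do not use exactly the same class names, every comparison fails, so that is worth ruling out when the result looks suspiciously low.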

In machine learning, Decision Trees are often considered to be "weak learners". The term weak learner is used to describe a predictive model or algorithm that performs poorly (it correctly estimates the target variable only slightly more often than random chance). Based on your description, it sounds like you conducted a reasonable validation test. Did you split your data 50/50? 70/30? Did the interactive report of the Decision Tree Tool still indicate close to 72% accuracy? I suspect it may have decreased, depending on the size of the reduced training data set. That being said, with a decision tree it is expected that the accuracy of estimates on your validation data will be worse than the accuracy reported for the model on its training data.

How can I get the interactive report while running a saved model on new data?

The interactive report is generated by R code in the Decision Tree Tool.  As stated, it is generated with the model's own metrics, and not validation data. There is not currently a way to get an interactive report based on validation data from an Alteryx Tool. It is a really interesting feature enhancement that could be suggested for the Score Tool on our Product Ideas forum. Please let me know if you do, I will be sure to star it. :)
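In the meantime, the headline numbers from the report (precision, recall, and F1 per class) can be recomputed on validation data outside the tool. The following is a rough Python sketch under the assumption that you have already exported the true labels and the predicted labels of your validation records as two lists; the function name and the sample data are hypothetical, and this is not how the Decision Tree Tool builds its report.

```python
from collections import Counter


def per_class_metrics(actual, predicted):
    """Compute precision, recall, and F1 for each class from label pairs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for a, p in zip(actual, predicted):
        if a == p:
            tp[a] += 1   # correctly predicted class a
        else:
            fp[p] += 1   # predicted p, but it was not p
            fn[a] += 1   # missed a true record of class a

    metrics = {}
    for c in sorted(set(actual) | set(predicted)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        metrics[c] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Made-up validation labels and predictions for illustration.
actual = ["A", "A", "B", "B", "C", "C"]
predicted = ["A", "B", "B", "B", "C", "A"]

metrics = per_class_metrics(actual, predicted)
for cls, m in metrics.items():
    print(cls, m)
```

The resulting dictionary can then be fed into whatever plotting or reporting step you prefer.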

If you have any familiarity with R, another option would be to use some of the visualization packages in R to create your own interactive plots. The R packages ggplot2 and plotly are two visualization packages that are installed with the Alteryx Predictive Tools. For reference on using these packages, I would recommend looking at the resources provided by the package maintainers or the R community.

This Data Science blog post might also be of interest to you: Custom Interactive Visualizations

Is decision tree using C5.0 supported by the Score tool?

The C5.0 decision tree is supported by the Score Tool. I am able to use the Score Tool on a C5.0 model generated with the Iris Dataset without issue. If you would like to post a copy of your workflow with sample data that reproduces the error, I would be happy to troubleshoot your workflow.

Hopefully this helps clear things up! Please let me know if you have any further questions I might be able to assist you with.

In case you are interested, there are a few Community Articles as well as a Data Science Blog Post that cover Decision Trees, and the Decision Tree Tool.

An Alteryx Newbie Works Through the Predictive Suite: Decision Tree

An Introduction to Decision Trees

Tool Mastery | Decision Tree

Understanding the Output of the Decision Tree Tools