
Re: Tool Mastery | Decision Tree

joacoachinelli
6 - Meteoroid

Hi!

I was wondering if someone could help me with my modelling:

I'm not able to understand why I get a "poor" value when I change the input data I use with the same Decision Tree model:

 

When I use my Decision Tree to predict the same input data I built it with, it gives a relatively "accurate" score (e.g., Score: 524,037 against an Actual Value of 514,996).

[Screenshot: joacoachinelli_3-1612822831834.png]

 

 

 

But when I change my input data and use the same Decision Tree model on new data I want to predict, I get an "inaccurate" score (Score: 421,566, while the Actual Value was in fact 270,820).

[Screenshot: joacoachinelli_2-1612822753099.png]

 

 

 

Why could that be? Is there something obvious about the modelling that I'm missing?

Thanks!

 

SydneyF
Alteryx Alumni (Retired)

Hi @joacoachinelli!

 

Thank you for posting to the Alteryx Community. What you are seeing is an artifact of how Decision Trees (and most machine learning algorithms) work, and is expected behavior. 

 

A decision tree creates rules for splitting data into groups. The algorithm "learns" these rules based on the training data. A decision tree model will perform well on the data it was trained with because it has effectively already "seen" this data, and created rules to sort this particular data set as correctly as possible.

 

Because of this, evaluating a model on its training data almost always returns overly optimistic results, so it is a best practice to evaluate your models on data that was not included in the training data. This set-aside subset of data is known as holdout or validation data.
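To make the idea concrete, here is a minimal sketch of a holdout split in plain Python. It is not Alteryx code and does not use a decision tree: the "model" is a deliberately simple stand-in that memorizes the training rows and predicts the target of the nearest one, and the data is synthetic. The function names (holdout_split, fit_memorizer, mean_abs_error) are made up for this illustration.

```python
import random

def holdout_split(rows, holdout_fraction=0.3, seed=0):
    """Shuffle a copy of rows and split it into (training, holdout) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

def fit_memorizer(training):
    """Stand-in model (an assumption, not a decision tree):
    predict the target of the training row whose x is closest."""
    def predict(x):
        nearest = min(training, key=lambda row: abs(row[0] - x))
        return nearest[1]
    return predict

def mean_abs_error(predict, rows):
    """Average absolute difference between predicted and actual values."""
    return sum(abs(predict(x) - y) for x, y in rows) / len(rows)

# Synthetic data, invented for illustration: linear signal plus noise.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    data.append((x, 3 * x + rng.gauss(0, 4)))

training, holdout = holdout_split(data)
model = fit_memorizer(training)

train_error = mean_abs_error(model, training)   # rows the model has "seen"
holdout_error = mean_abs_error(model, holdout)  # rows the model has not seen
print(f"training MAE: {train_error:.2f}")
print(f"holdout MAE:  {holdout_error:.2f}")
```

The training error here looks perfect only because the model is scored on rows it memorized; the holdout error is the more honest estimate of how it would do on new data.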

 

You can read more about this concept here:

 

https://community.alteryx.com/t5/Data-Science/Holdouts-and-Cross-Validation-Why-the-Data-Used-to-Eva...

 

It is possible for a model to focus too much on the specific details in your training data, which causes the model to perform poorly on data it has not seen before because it fails to make "generalized" rules - this is known as overfitting. Decision trees in particular are prone to overfitting.
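As a rough illustration of overfitting, the sketch below grows a small hand-written regression tree to two different depths and compares training and holdout error. This is plain Python on synthetic data, with a simplified greedy split rule that I am assuming for illustration; it is not the algorithm behind the Alteryx Decision Tree tool, and all names in it are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(rows, max_depth, depth=0):
    """Greedy one-feature regression tree; rows are (x, y) pairs.
    Returns a float (leaf prediction) or a (threshold, left, right) tuple."""
    ys = [y for _, y in rows]
    if depth >= max_depth or len(rows) < 2:
        return mean(ys)
    rows = sorted(rows)
    best = None
    for i in range(1, len(rows)):
        if rows[i][0] == rows[i - 1][0]:
            continue  # cannot split between identical x values
        cost = sse([y for _, y in rows[:i]]) + sse([y for _, y in rows[i:]])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:
        return mean(ys)
    i = best[1]
    threshold = (rows[i - 1][0] + rows[i][0]) / 2
    return (threshold,
            build_tree(rows[:i], max_depth, depth + 1),
            build_tree(rows[i:], max_depth, depth + 1))

def predict(tree, x):
    """Follow the split rules down to a leaf and return its value."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

def mean_abs_error(tree, rows):
    return mean([abs(predict(tree, x) - y) for x, y in rows])

# Synthetic data, invented for illustration: linear signal plus noise.
rng = random.Random(7)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    data.append((x, 3 * x + rng.gauss(0, 4)))
training, holdout = data[:140], data[140:]  # rows are already in random order

results = {}
for depth in (3, 20):  # a shallow tree versus a very deep one
    tree = build_tree(training, max_depth=depth)
    results[depth] = (mean_abs_error(tree, training),
                      mean_abs_error(tree, holdout))
    print(f"max_depth={depth:2d}  training MAE={results[depth][0]:.2f}  "
          f"holdout MAE={results[depth][1]:.2f}")
```

The deep tree should fit the training rows almost perfectly, because it ends up with rules tailored to individual noisy points, while its gap between training and holdout error is much larger than the shallow tree's. Exact numbers depend on the random seed.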

 

You can read more about this here:


https://community.alteryx.com/t5/Data-Science/Bias-Versus-Variance/ba-p/351862
