Alteryx Designer Desktop Discussions

SOLVED

BUG? Why do I get conflicting performance measures in my Random Forest and Boosted Model?

Atabarezz
13 - Pulsar

 

This time, in a sample workflow, we tested different models as follows:

  1. Loaded a sample dataset (German Credit, from the regression sample).
  2. Calculated the Yes/No ratio of defaults.
  3. Split the data into estimation and validation samples.
  4. Trained a random forest model and a boosted model on the estimation sample.
  5. Checked performance measures on the estimation sample with the Lift Chart Tool and the Model Comparison Tool, for both models.
  6. Checked performance measures on the validation sample, again with both the Lift Chart Tool and the Model Comparison Tool, for both models.

Here is the workflow: Picture6.png

 

 

As you can see, the random forest looks like a perfect model on the estimation sample (it is literally overtrained), yet the tool reports 0,74 ROC and 0,510 Gini, so the graph and the numbers do not agree with each other. This is the first confusing result.
Picture7.png

 

For the boosted model, the AUC on the estimation sample is 0,69.

 

Picture8.png

 

 

 

When we check these measures with the Model Comparison Tool, again on the estimation sample:

the AUC is 1,00 instead of 0,74 for the random forest

the AUC is 0,0972 instead of 0,694 for the boosted model. A huge difference. What's going on here?!

 

Picture9.png

 

Here are the results for the validation sample. For the random forest, the AUC is 0,639.

 

Picture10.png

 

For the boosted model, the AUC on the validation sample is 0,65.

 

Picture11.png

 

When we check these measures with the Model Comparison Tool, this time on the validation sample:

the AUC is 0,6835 instead of 0,6391 for the random forest

the AUC is 0,1887 instead of 0,65 for the boosted model. Again a huge difference, apples to oranges. What's going on here?!

Picture12.png

 

 

When we look at the Gains and ROC charts, the Boosted Model curves are inverted!

Picture13.png

 

Although the expected value is set to "Yes" in each tool, I suppose the Boosted Model learns the "No" target value instead.

 

 

#randomforest #boostedmodel #liftchart #modelcomparison #AUC #Gini

5 REPLIES
Atabarezz
13 - Pulsar

Here is the workflow with data, so that you can replicate it on your machine.

 

Atabarezz
13 - Pulsar

A correction: the first "0,74 ROC" mentioned above is not the ROC AUC but the area under the Gains chart, so it is not comparable to the ROC AUC.

Likewise, in "Boosted model, for estimation sample the AUC is; 0,69", the value is again not the ROC AUC but the area under the Gains chart.
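To illustrate why the two numbers should not be expected to match, here is a minimal Python sketch that computes both a ROC AUC and a trapezoidal area under a cumulative gains curve for the same scores. The data is a hypothetical toy sample, and the function names `roc_auc` and `gains_area` are made up for this example; this is not necessarily how the Lift Chart Tool or the Model Comparison Tool compute their figures.

```python
def roc_auc(labels, scores):
    """Share of (positive, negative) pairs where the positive scores higher; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gains_area(labels, scores):
    """Trapezoidal area under the cumulative gains curve.

    x axis: share of records targeted, highest score first.
    y axis: share of all positives captured so far.
    Assumes scores without ties, so the ranking is unambiguous.
    """
    ranked = [y for y, _ in sorted(zip(labels, scores), key=lambda t: -t[1])]
    total_pos = sum(ranked)
    n = len(ranked)
    area, captured = 0.0, 0
    for y in ranked:
        previous = captured
        captured += y
        # One trapezoid of width 1/n between two consecutive curve points.
        area += (previous + captured) / (2 * total_pos) / n
    return area


# Hypothetical toy sample: 3 positives ("Yes") out of 10 records, perfectly separated.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.9, 0.85, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

print("ROC AUC:   ", roc_auc(labels, scores))
print("Gains area:", gains_area(labels, scores))
```

In this toy case the perfectly separating scores give a ROC AUC of 1 while the gains area stays below 1, because a gains curve can only climb as fast as the share of positives allows. For tie-free scores the two are linked by gains area = p/2 + (1 - p) * AUC, where p is the share of positives in the sample.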

Atabarezz
13 - Pulsar

RF boosted.jpg

 

I should mention this as well: within the same workflow, with the same connections and the same target definition ("Yes"), the Boosted Model estimates the inverse of what the Random Forest estimates.

Where the Random Forest says "Yes", the Boosted Model says "No".

Where the Random Forest says "No", the Boosted Model says "Yes".

 

SophiaF
Alteryx

@Atabarezz - Regarding the difference between the AUC in the Lift Chart and the AUC in the Model Comparison Tool: they are calculated differently and therefore cannot be compared. The inverted predictions between the Random Forest and the Boosted Model are indeed a bug, which will be addressed in a future release.

 

Here are some more specifics:

 

This behavior is related to a known issue with the Boosted Model Tool when the Bernoulli Loss Function is specified in Model Customization. It comes from the gbm package, which is the R package used by the Alteryx Boosted Model. Essentially, between R 3.2.3 and R 3.3.2, the data structure of the gbm model output changed. One consequence is that when the Bernoulli loss function is specified, the results are scored against whichever outcome (Yes vs. No, 0 vs. 1) comes first after sorting. This causes the Score Tool or the Model Comparison Tool to flip the results for a Boosted Model with the Bernoulli Loss Function option enabled, where the outcome is binomial.
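The effect of scoring against the wrong outcome can be sketched in a few lines of Python. The labels and probabilities below are hypothetical toy values and `roc_auc` is a helper written for this example, not a function from Alteryx or gbm. The point is only that evaluating P("No") as if it were P("Yes") turns an AUC of a into 1 - a, which is why a usable model can show up with an AUC far below 0,5.

```python
def roc_auc(labels, scores):
    """Share of (positive, negative) pairs where the positive scores higher; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = "Yes", 0 = "No".
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
p_yes = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

# What a flipped scorer effectively evaluates: the probability of the other class.
p_no = [1 - p for p in p_yes]

auc_correct = roc_auc(labels, p_yes)
auc_flipped = roc_auc(labels, p_no)

print("AUC scored against 'Yes':", auc_correct)
print("AUC scored against 'No': ", auc_flipped)
print("Sum of the two:          ", auc_correct + auc_flipped)
```

If the flip is the only thing going on, the 0,0972 and 0,1887 reported for the boosted model in the original post would correspond to roughly 0,90 and 0,81 once the outcome is un-flipped. That is an inference from the arithmetic, not a figure confirmed in this thread.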

 

Current workarounds:

 

1. Rebuild the model without specifying the Bernoulli loss function. If your model is predicting a binomial outcome, this is the best option for you: with only two outcomes, what the tool does by default results in the same loss function.

 

2. Manually switch the field names using a Select Tool.
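For readers who post-process scored output outside Designer, the second workaround amounts to exchanging the two score columns. Here is a minimal Python sketch of that idea; the field names `Score_Yes` and `Score_No`, the `id` column, and the helper `swap_score_fields` are assumptions made for illustration, so check the actual field names in your scored data.

```python
def swap_score_fields(rows, yes_field="Score_Yes", no_field="Score_No"):
    """Return copies of the rows with the two score fields exchanged.

    The default field names are assumptions for this sketch.
    """
    swapped = []
    for row in rows:
        fixed = dict(row)  # copy, so the input rows stay untouched
        fixed[yes_field], fixed[no_field] = row[no_field], row[yes_field]
        swapped.append(fixed)
    return swapped


# Hypothetical scored records from a flipped boosted model.
scored = [
    {"id": 1, "Score_Yes": 0.2, "Score_No": 0.8},
    {"id": 2, "Score_Yes": 0.7, "Score_No": 0.3},
]

for row in swap_score_fields(scored):
    print(row)
```

Inside Designer, the equivalent is renaming the two fields to each other's names in the Select Tool, as described in the workaround.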

 

 

I will post an update once a release is out that contains this fix. Thanks for reporting!

Sophia Fraticelli
Senior Solutions Architect
Alteryx, Inc.
Atabarezz
13 - Pulsar

Looking forward to the fix in the 2018.3 release. Best
