BUG? Why do I get conflicting performance measures for my Random Forest and Boosted Model?

This time, in a sample workflow, we tested different models as follows:

Loaded a sample dataset (German credit, from the regression sample)
Calculated Yes/No and the ratio of defaults
Split the data into estimation and validation samples
Trained a random forest model and a boosted model on the estimation sample
Checked performance measures on the estimation sample by running the Lift Chart Tool and the Model Comparison Tool for both models
Then checked performance measures on the validation sample, again using both the Lift Chart Tool and the Model Comparison Tool for both models

Here is the workflow:

As you can see, the random forest shows a perfect model on the estimation sample (it is literally overtrained), yet it reports 0,74 ROC and 0,510 GINI, so the graph and the numbers do not agree with each other. This is the first confusing situation.

For the boosted model, the AUC on the estimation sample is 0,69.

When we check these measures with the Model Comparison Tool, again on the estimation sample:

the AUC is 1,00 instead of 0,74 for the random forest

the AUC is 0,0972 instead of 0,694 for the boosted model. A huge difference. What's going on here?!

Here are the results for the validation sample this time: the AUC is 0,639 for the random forest

the AUC is 0,65 for the boosted model on the validation sample

When we check these measures with the Model Comparison Tool, this time on the validation sample:

the AUC is 0,6835 instead of 0,6391 for the random forest

the AUC is 0,1887 instead of 0,65 for the boosted model. Again a huge difference, apples to oranges. What's going on here?!

When we look at the Gains and ROC charts, the Boosted Model curves are inverted!

Although we set the target value to "Yes" in the tool in every case, I suppose boosting learns the "No" target value instead.
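To make the suspicion concrete: if a model's scores are read against the wrong class, the ROC curve mirrors and the AUC becomes one minus the true AUC. Here is a minimal Python sketch of that effect. The labels and probabilities are made up for illustration and do not come from the workflow, and roc_auc is a hand-written helper, not anything the Alteryx tools expose.

```python
def roc_auc(labels, scores, positive):
    """ROC AUC as the probability that a random positive record
    outscores a random negative one (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up example data: actual outcome and the model's P("Yes").
labels = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "No"]
p_yes = [0.9, 0.2, 0.7, 0.4, 0.1, 0.3, 0.6, 0.5]

# Scores read correctly: P("Yes") is used to rank for target "Yes".
auc_correct = roc_auc(labels, p_yes, positive="Yes")

# Scores read against the wrong class: P("No") is used to rank for "Yes".
p_no = [1.0 - p for p in p_yes]
auc_flipped = roc_auc(labels, p_no, positive="Yes")

print(round(auc_correct, 3), round(auc_flipped, 3))
```

The two values always add up to 1, so an AUC far below 0,5 is a strong hint that the class labels are swapped rather than that the model is worse than random.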

#randomforest #boostedmodel #liftchart #modelcomparison #AUC #Gini

Reporting

Output

Custom Tools

Bug

Predictive Analysis

Accepted answers

SophiaF

@Atabarezz - Regarding the difference in AUC between the lift chart and the model comparison: the two are calculated differently and therefore cannot be compared. As for the inverted predictions between the Random Forest and the Boosted Model, this is indeed a bug that will be addressed in a future release.

Here are some more specifics:

This behavior is related to a known issue with the Boosted Model Tool when the Bernoulli loss function is specified in Model Customization. It comes from the gbm package, which is the R package used by the Alteryx Boosted Model. Essentially, between R 3.2.3 and R 3.3.2, the data structure of the gbm model output changed, and one consequence is that when the Bernoulli loss function is specified, the results are scored against whichever outcome (yes vs. no, 0 vs. 1) comes first after sorting. This causes the Score Tool or Model Comparison Tool to flip the results for a Boosted Model with the Bernoulli loss function option enabled when the outcome is binomial.

Current workarounds:

1. Rebuild the model without specifying the Bernoulli loss function. If your model is predicting a binomial outcome, this is the best option for you; what the tool does by default results in the same loss function when there are only two outcomes.

2. Manually switch the field names using a Select Tool
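For anyone post-processing scored output outside of Designer, the idea behind workaround 2 can be sketched in Python as below. This is only an illustration: the field names Score_No and Score_Yes and the helper swap_score_fields are assumptions made for the example, so check the actual field names in your own scored data.

```python
def swap_score_fields(rows, a="Score_No", b="Score_Yes"):
    """Return copies of the rows with the values of fields a and b exchanged.

    The default field names are hypothetical; pass your own if they differ.
    """
    swapped = []
    for row in rows:
        fixed = dict(row)                    # leave the input row untouched
        fixed[a], fixed[b] = row[b], row[a]  # exchange the two score values
        swapped.append(fixed)
    return swapped


# Made-up scored records where the two probabilities came out flipped.
scored = [
    {"id": 1, "Score_No": 0.8, "Score_Yes": 0.2},
    {"id": 2, "Score_No": 0.1, "Score_Yes": 0.9},
]

corrected = swap_score_fields(scored)
print(corrected)
```

Only apply the swap to the output of the affected boosted model; the random forest scores are already aligned with the target.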

I will post an update once a release is out that contains this fix. Thanks for reporting!

All comments

Atabarezz

Here is the workflow with data, so that you can replicate it on your machine.

Random Forest and Boosted Lift vs Comparison.yxmd

Atabarezz

Correction: the first "0,74 ROC" I mentioned is not ROC but the area under the Gains Chart, so it is not comparable to ROC.

Likewise, the 0,69 I reported as the AUC of the boosted model on the estimation sample is again not the ROC but the area under the Gains Chart.
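To show why these two areas are different numbers for the same model, here is a small Python sketch using textbook definitions. I don't know exactly how the Lift Chart Tool or the Model Comparison Tool compute their figures, so treat this as an assumption-laden illustration; the data is made up and the helper functions are written by hand.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positives (1) and negatives (0)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def gains_area(labels, scores):
    """Trapezoidal area under the cumulative gains curve.

    x = fraction of records targeted (highest scores first),
    y = fraction of all positives captured so far.
    """
    ranked = sorted(zip(scores, labels), reverse=True)
    n = len(ranked)
    total_pos = sum(labels)
    area = 0.0
    captured = 0
    for _, y in ranked:
        previous = captured
        captured += y
        # one trapezoid of width 1/n between consecutive curve points
        area += (previous + captured) / (2.0 * total_pos * n)
    return area


# Made-up example: 1 = "Yes", 0 = "No", with distinct scores (no ties).
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.1, 0.3, 0.6, 0.5]

auc = roc_auc(labels, scores)
gains = gains_area(labels, scores)
gini = 2.0 * auc - 1.0
rate = sum(labels) / len(labels)  # share of "Yes" records

print(round(auc, 4), round(gains, 4), round(gini, 4))
```

With no tied scores, the gains area equals rate / 2 + (1 - rate) * AUC, so it depends on the share of "Yes" records and only matches the ROC AUC in special cases.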

Atabarezz

RF boosted.jpg

I should mention this as well: through the same workflow, the same connection, and the same target definition = "Yes",

the boosted model estimates the inverse of what the RF estimates. Where Random Forest says "Yes", Boosted says "No".

Where Random Forest says "No", Boosted says "Yes".

