This time, in a sample workflow, I tested different models. Here is what I did (a rough R equivalent follows the list):
1. Loaded a sample dataset (German Credit, from the regression samples)
2. Calculated the Yes/No ratio of defaults
3. Split the data into estimation and validation samples
4. Trained a random forest model and a boosted model on the estimation sample
5. Checked performance measures on the estimation sample by running the Lift Chart Tool and the Model Comparison Tool for both models
6. Checked performance measures on the validation sample again, using both the Lift Chart Tool and the Model Comparison Tool for both models
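For anyone who wants to reproduce the same steps outside of Designer, here is a rough R sketch of what the workflow does. This is only a sketch under assumptions: the `german` data frame with a yes/no `Default` column stands in for the sample dataset, and the 70/30 split and tree counts are placeholders, not the tools' actual settings.

```r
library(randomForest)
library(gbm)

# `german` is a stand-in for the German Credit sample;
# a yes/no `Default` column is assumed
german$Default <- factor(german$Default, levels = c("No", "Yes"))

# Split into estimation and validation samples (70/30 assumed)
set.seed(1)
idx <- sample(nrow(german), floor(0.7 * nrow(german)))
est <- german[idx, ]
val <- german[-idx, ]

# Random forest on the estimation sample
rf <- randomForest(Default ~ ., data = est)

# Boosted model: gbm's bernoulli loss expects a numeric 0/1 response
est$y <- as.integer(est$Default == "Yes")
bst <- gbm(y ~ . - Default, data = est,
           distribution = "bernoulli", n.trees = 500)

# Probability-of-"Yes" scores on both samples
rf_est  <- predict(rf, est, type = "prob")[, "Yes"]
rf_val  <- predict(rf, val, type = "prob")[, "Yes"]
bst_est <- predict(bst, est, n.trees = 500, type = "response")
bst_val <- predict(bst, val, n.trees = 500, type = "response")
```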
Here is the workflow:
As you can see, the random forest results for the estimation sample show a perfect model (it is literally overtrained), yet the same tool reports 0.74 ROC and 0.510 Gini, so the graph and the numbers do not match each other... This is the first confusing situation.
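(Side note on the numbers themselves: if the Gini here is the usual Gini = 2 * AUC - 1, then an AUC of 0.74 implies a Gini of about 0.48, so 0.74 and 0.510 are at least roughly consistent with each other; what neither is consistent with is the perfect-looking chart, which would imply AUC = 1.0 and Gini = 1.0.)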
When it comes to the Boosted Model, the AUC for the estimation sample is 0.69.
When we check these measures with the Model Comparison Tool, again for the estimation sample:
the AUC is 1.00 instead of 0.74 for the random forest
the AUC is 0.0972 instead of 0.694 for the boosted model... A huge difference. "What's going on here?!!!"
Here are the results, this time for the validation sample: the AUC is 0.639 for the random forest
and 0.65 for the boosted model.
When we check these measures with the Model Comparison Tool, this time for the validation sample:
the AUC is 0.6835 instead of 0.6391 for the random forest
the AUC is 0.1887 instead of 0.65 for the boosted model... Again a huge difference, apples to oranges. "What's going on here?!!!"
When we look at the Gains and ROC charts, the Boosted Model curves are inverted!!!
Although we set the expected value to "Yes" in the tool, I suppose boosting learns the "No" target value instead...
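If the model really is being scored for the "No" class, that alone would explain the inverted curves: scoring the complement of the probability mirrors the ROC curve, and the AUC becomes 1 - AUC. (It does not reconcile the exact numbers above, since the two tools also compute AUC differently, as confirmed below.) Here is a small R sketch of the effect; the `auc()` helper is a hand-rolled rank-based AUC written for this illustration, not an Alteryx or package function:

```r
# Rank-based (Mann-Whitney) AUC, written for this illustration
auc <- function(labels, scores) {
  r  <- rank(scores)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

labels <- c(1, 0, 1, 1, 0, 0, 1, 0)
scores <- c(0.9, 0.2, 0.8, 0.6, 0.4, 0.3, 0.7, 0.5)

auc(labels, scores)      # 1.0 -- every "Yes" outscores every "No"
auc(labels, 1 - scores)  # 0.0 -- scoring the wrong class mirrors the ROC
```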
@Atabarezz - Regarding the difference in AUC between the Lift Chart and the Model Comparison Tool: they are calculated differently and therefore cannot be compared. As for the inverted predictions from the Boosted Model (relative to the Random Forest), this is indeed a bug that will be addressed in a future release.
Here are some more specifics:
This behavior is related to a known issue with the Boosted Model Tool when the Bernoulli loss function is specified in Model Customization. It stems from the gbm package, which is the package underlying the Alteryx Boosted Model. Essentially, between R 3.2.3 and R 3.3.2 the data structure of the gbm model output changed, and one consequence is that when the Bernoulli loss function is specified, the results are scored against whichever outcome (Yes vs. No, 0 vs. 1) comes first after sorting. This causes the Score Tool or Model Comparison Tool to flip the results for a Boosted Model that has the Bernoulli loss function enabled and a binomial outcome. There are two workarounds:
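For context, this is roughly the distinction at the gbm level. The calls below are simplified stand-ins for what the macro runs (the real call passes many more arguments, and the column names here are hypothetical):

```r
library(gbm)

# Hypothetical training frame with a two-level Default outcome
train$y <- as.integer(train$Default == "Yes")

# Explicitly requesting Bernoulli in Model Customization maps to this;
# this is the configuration that triggers the flipped scoring
fit_bern <- gbm(y ~ . - Default, data = train,
                distribution = "bernoulli", n.trees = 500)

# Workaround 1: leave the loss function unspecified; gbm then guesses
# the distribution from the response, and for a 0/1 outcome it picks
# bernoulli anyway, so nothing is lost
fit_auto <- gbm(y ~ . - Default, data = train, n.trees = 500)
```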
1. Rebuild the model without specifying the Bernoulli loss function. If your model is predicting a binomial outcome, this is the best option for you; what the tool does by default results in the same loss function when there are only two outcomes (see the sketch above).
2. Manually switch the field names using a Select Tool (a code equivalent is sketched below).
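Outside of Designer, the Select Tool rename in workaround 2 amounts to swapping the two score columns. A minimal sketch, assuming the usual `Score_<level>` field names from the Score Tool (verify them against your own output):

```r
# Swap the flipped score fields (equivalent to renaming in a Select Tool)
tmp <- scored$Score_Yes
scored$Score_Yes <- scored$Score_No
scored$Score_No  <- tmp
```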
I will post an update once a release is out that contains this fix. Thanks for reporting!