Hi,
I recently split a dataset (dataset 1) into training and validation sets to train a boosted model.
After training, I used the Score tool to score a new dataset (dataset 2) with the trained boosted model.
In dataset 2, some of the variables have missing values (blank/NA), but the Score tool still produces a score for every row.
So I would like to know how the boosted model and the Score tool deal with missing values and still produce a score.
With logistic regression, by contrast, if one of the variables is empty, the Score tool cannot produce a score for that row.
Thank you.
I'm not an expert, but I think this is a feature of those models in R more so than of the Alteryx implementation thereof. I've attached a workflow that I used to play around with it a bit; it uses the Kaggle Titanic data (since it's small and fits the bill in terms of generating NULL predictions). In its current state, everything is cleaned up so that missing values are either imputed or excluded as features of the model.
In particular, I saved a copy of the Score tool (which is just a macro; you can right-click it to open it and see the R code) and commented out several lines of R that explicitly generate log messages if/when NA values are removed. Scoring with either macro, the original or my modified copy, always gave exactly the same results, which, again, leads me to think it has more to do with R than with Alteryx. I also Googled it a bit in hopes of finding a definitive statement on the matter, but nothing jumped out immediately from that brief effort.
Anyway, hope that helps at least a little. Aside: it also helps to enable logging and look at the logs closely.
After looking at the Boosted Model macro, it seems JohnJPS is correct; the NA handling is done in R.
The R package used by the Boosted Model tool is 'gbm', so this seems highly relevant:
http://stackoverflow.com/questions/14718648/r-gbm-handling-of-missing-values
Essentially, at each split in a tree, gbm routes rows whose value for the split variable is missing into a separate node of their own, rather than sending them left or right. The score for those rows is the same as it was before that split.
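To make that a bit more concrete, here is a minimal toy sketch in Python of a single tree with a three-way split. It is not gbm's actual code: the names `Node`, `split`, and `score`, the variables, and all the numbers are made up for illustration, and it simply assumes (as described above) that the missing-value node carries the prediction the parent had before the split.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: float                      # prediction if scoring stops at this node
    feature: Optional[str] = None     # None means this node is a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None     # rows with feature < threshold
    right: Optional["Node"] = None    # rows with feature >= threshold
    missing: Optional["Node"] = None  # rows where the feature is missing


def is_missing(x) -> bool:
    """Treat None and NaN as missing."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def split(value, feature, threshold, left, right) -> Node:
    """Build a split whose missing-value child keeps the pre-split prediction (assumption)."""
    return Node(value, feature, threshold, left, right, missing=Node(value))


def score(node: Node, row: dict) -> float:
    """Walk the tree; a missing split variable sends the row to the missing child."""
    while node.feature is not None:
        x = row.get(node.feature)
        if is_missing(x):
            node = node.missing
        elif x < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value


# Hypothetical tree: split on "age", then on "fare" for the older branch.
tree = split(0.40, "age", 18.0,
             Node(0.70),
             split(0.30, "fare", 50.0, Node(0.20), Node(0.60)))
```

In this toy tree, a row with a missing "age" stops at the root's prediction, and a row with a known "age" of 30 but a missing "fare" stops at the prediction of the "fare" node, so every row still gets a score. A real boosted model sums many such trees, which is consistent with the Score tool returning a value for every row.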
Hi @Inactive User,
I was actually about to respond with essentially the same answer as @DylanB, but he beat me to the punch! It's also good to know that the Boosted Model tool isn't the only Predictive tool that can handle missing data. The Decision Tree and Naive Bayes Classifier tools also have built-in R procedures to handle missing data.
Best,
Bridget
Thank you, John. Your workflow is good.
Thank you, everyone. I learned something new today.