
SOLVED

Boosted Model and Score Tool - How to deal with NULL/NA

Inactive User
Not applicable

Hi,

 

I recently took a dataset (dataset 1) and split it into training and validation sets to train a Boosted Model.

After training the boosted model, I used the Score tool to score a new dataset (dataset 2) with it.

 

In dataset 2, some of the variables have no value (blank/NA) in some rows, but the Score tool still produces a score for every row.

So I would like to know how the Boosted Model and Score tools deal with the missing values and still produce a score.

 

As you know, with logistic regression, if one of the variables is empty, the Score tool cannot produce a score for that row.

 

Thank you.

5 REPLIES
JohnJPS
15 - Aurora

I'm not an expert, but I think this is a feature of those models in R more so than of the Alteryx implementation thereof.  I've attached a workflow that I used to play around with it a bit; it uses the Kaggle Titanic data (since it's small and fits the bill in terms of generating NULL predictions).  In its current state, everything is cleaned up so that missing values are either imputed or excluded as features of the model.

 

In particular, I saved a copy of the Score tool (which is just a macro - you can right-click it to open it and see the R code), and commented out several lines of R that explicitly generate log messages if/when NA values are removed.  When scoring with either macro, the results always came out exactly the same, which, again, leads me to think it has more to do with R than with Alteryx.  I also Googled it a bit in hopes of finding a definitive statement on the matter, but nothing jumped out immediately from that brief effort.

 

Anyway, hope that helps at least a little.  Aside: it also helps to enable logging and look at the logs closely.

 

DylanB
Alteryx Alumni (Retired)

After looking at the Boosted Model macro, it seems JohnJPS is correct; the NA handling is done in R.

 

The R package behind the Boosted Model tool is 'gbm', so this seems highly relevant:

http://stackoverflow.com/questions/14718648/r-gbm-handling-of-missing-values

 

Essentially, gbm sends rows with a missing value into a separate node at each split of the tree. The score in that node is the same as the score before that split.
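
To make that idea concrete, here is a minimal sketch in plain Python. It is not gbm's actual code and not what the Alteryx macro runs; it only illustrates a split with a third "missing" branch. The names `Split`, `predict`, `is_missing`, and the `age` feature and its numbers are all made up for illustration. The missing branch is given the pre-split value (0.4 here) to mirror the behaviour described above.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing (stand-ins for NULL/NA)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

class Split:
    """One tree split with three children: left, right, and missing."""
    def __init__(self, feature, threshold, left, right, missing):
        self.feature = feature      # hypothetical feature name
        self.threshold = threshold  # go left if value < threshold
        self.left = left
        self.right = right
        self.missing = missing      # where rows with no value go

def predict(node, row):
    """Walk the tree; leaves are plain floats."""
    while isinstance(node, Split):
        value = row.get(node.feature)
        if is_missing(value):
            node = node.missing
        elif value < node.threshold:
            node = node.left
        else:
            node = node.right
    return node

# Assumed numbers: the prediction before the split is 0.4,
# so the missing branch simply keeps 0.4.
tree = Split("age", 30, left=0.1, right=0.7, missing=0.4)

young = predict(tree, {"age": 25})       # takes the left branch
older = predict(tree, {"age": 42})       # takes the right branch
unknown = predict(tree, {"age": None})   # takes the missing branch
```

Because every row always lands in some leaf, a score comes out for every row, which would be consistent with what the original poster saw. A linear model has no such branch, so a missing input has nowhere to go.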

BridgetT
Alteryx Alumni (Retired)

Hi @Inactive User,

 

I was actually about to respond with essentially the same answer as @DylanB, but he beat me to the punch! It's also good to know that the Boosted Model tool isn't the only Predictive tool that can handle missing data. The Decision Tree and Naive Bayes classifier also have built-in R procedures to handle missing data.

 

Best,
Bridget

Bridget Toomey

Research Scientist, Analytic Products

Alteryx
Inactive User
Not applicable

Thank you, John. Your workflow is good.

Inactive User
Not applicable

Thank you, everyone.  I learned something new today.
