Hello!
Please be patient with me as I am somewhat new here, but thank you in advance for any help!
In this case I am using a decision tree, but I have tested a few different algorithms and had the same problem: the model scores only a small percentage of the records it is given, and the rest come back null.
I figure it could have to do with missing values in fields that are integral to the model. I summarized both the records that received scores and those that didn't (summary attached). Some fields, like Age, are definitely much better populated in the scored records, but many others are about equally populated in both groups.
If this is the problem, is there a hyperparameter I can change that would lessen the need for complete data?
Is it possible that this is not the problem? What else could it be?
It sounds like this is an issue with data quality and data volume.
In an ideal world every single column would always be complete, but it sounds like in a lot of cases there is at least one null value on a row.
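For anyone who wants to check this outside the workflow, here is a minimal Python sketch that compares how often each field is null in the scored and unscored records. It assumes the records have been exported as dictionaries with None standing in for null; the field names (Age, Region, Score) and the sample rows are hypothetical, not taken from the attached data.

```python
def null_rate_by_field(records, fields):
    """Return the fraction of records whose value is missing (None), per field."""
    total = len(records)
    return {
        f: (sum(1 for r in records if r.get(f) is None) / total if total else 0.0)
        for f in fields
    }


def rows_with_any_null(records, fields):
    """Count the records that have at least one missing value among the fields."""
    return sum(1 for r in records if any(r.get(f) is None for f in fields))


# Hypothetical export: Score is None where the model returned no score.
records = [
    {"Age": 34, "Region": "East", "Score": 0.81},
    {"Age": None, "Region": "West", "Score": None},
    {"Age": None, "Region": None, "Score": None},
    {"Age": 51, "Region": "East", "Score": 0.42},
    {"Age": 29, "Region": None, "Score": None},
]
predictors = ["Age", "Region"]

scored = [r for r in records if r["Score"] is not None]
unscored = [r for r in records if r["Score"] is None]

scored_rates = null_rate_by_field(scored, predictors)
unscored_rates = null_rate_by_field(unscored, predictors)
unscored_incomplete = rows_with_any_null(unscored, predictors)

print("null rate, scored:  ", scored_rates)
print("null rate, unscored:", unscored_rates)
print("unscored rows with at least one null:", unscored_incomplete, "of", len(unscored))
```

If nearly every unscored row has at least one null predictor while the scored rows have none, missing values are the likely cause. If the two groups look alike, the cause is probably elsewhere.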
It's also important that the model is built with enough records for each combination of field values it will later be asked to score.
If you are able to post the data (I appreciate you may not be), then that would allow us to get a deeper understanding.
Ben,
Thanks so much for the quick response. I have attached the scored accounts (I omitted only two columns with sensitive data, both of which were 100% populated).
Just to confirm: since the model "works" in the sense that it is assigning scores to some accounts (which are ~84% accurate), the missing values we are both referring to are in the hold-outs that are being scored, correct?
Is there a "best practice" for imputing, or assigning values to, the nulls on the records being scored? Is there a standard tool or method for this?
As a follow-up, I found the imputation tool (duh!) and replaced all the null values with the median... it still didn't score the majority of the accounts :-/
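For reference, this is roughly what median imputation does, as a small Python sketch rather than the imputation tool itself. It assumes records are dictionaries with None for null, and it takes the medians from the training rows and applies them to the rows being scored, which is a common convention and not something confirmed in this thread. The field names and values are made up.

```python
from statistics import median


def fit_medians(train_rows, numeric_fields):
    """Compute the median of each numeric field from its non-null training values."""
    medians = {}
    for field in numeric_fields:
        values = [r[field] for r in train_rows if r.get(field) is not None]
        if values:  # skip fields that are entirely null
            medians[field] = median(values)
    return medians


def impute(rows, medians):
    """Return copies of the rows with nulls replaced by the fitted medians."""
    return [
        {**r, **{f: m for f, m in medians.items() if r.get(f) is None}}
        for r in rows
    ]


# Hypothetical data, for illustration only.
train = [
    {"Age": 30, "Balance": 100.0},
    {"Age": 40, "Balance": None},
    {"Age": 50, "Balance": 300.0},
]
to_score = [
    {"Age": None, "Balance": 250.0},
    {"Age": 35, "Balance": None},
]

medians = fit_medians(train, ["Age", "Balance"])
imputed = impute(to_score, medians)

# Sanity check: how many nulls are left in the rows that go to scoring?
remaining_nulls = sum(v is None for r in imputed for v in r.values())
print(medians, imputed, remaining_nulls)
```

The last check is the useful part here. If nulls remain after imputation (for example in text fields, where a median does not apply), those rows may still fail to score. If no nulls remain and most rows are still unscored, missing values are probably not the whole story.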
What I would do is take one record that has not been scored.
Confirm that, for the combination of dimension values in that unscored record, there is a record with the same combination going into the model build. If there is, count how many such records there are; it is plausible that the sample for that combination is too small or nonexistent.
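The check described above can be sketched in a few lines of Python. This assumes both the model-build records and the unscored record are available as dictionaries; the dimension names (Region, Segment) and the rows are hypothetical.

```python
from collections import Counter


def combination_counts(train_rows, dims):
    """Count how many training rows exist for each combination of dimension values."""
    return Counter(tuple(r.get(d) for d in dims) for r in train_rows)


def training_support(row, counts, dims):
    """Return the number of training rows sharing this row's dimension values."""
    return counts.get(tuple(row.get(d) for d in dims), 0)


# Hypothetical model-build data.
train = [
    {"Region": "East", "Segment": "Retail"},
    {"Region": "East", "Segment": "Retail"},
    {"Region": "West", "Segment": "Retail"},
]
dims = ["Region", "Segment"]
counts = combination_counts(train, dims)

# One record that did not score, and one that did, for comparison.
unscored_row = {"Region": "West", "Segment": "Corporate"}
scored_row = {"Region": "East", "Segment": "Retail"}

support_unscored = training_support(unscored_row, counts, dims)
support_scored = training_support(scored_row, counts, dims)
print(support_unscored, support_scored)
```

A support of zero, or only a handful of rows, for the unscored record would point to a combination the model has not really seen during the build.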
Ben