Hi,
I am trying to use the Naive Bayes Classifier tool for a classification analysis (let's call it good customer/bad customer), but I keep getting different error messages. At first it was an out-of-bounds error when scoring, then it was this: "Naive Bayes Classification: Error in apply(log(vapply(seq_along(attribs), function(v) { :"
When I did some simple testing with the Titanic dataset, everything seemed fine, though:
I tried reducing the attributes, both numeric and text (categorical), but the error message stays the same.
Does anybody know what might be causing these error messages?
Thanks much!
Hello, @goutdelete.
I think it will be easier to debug if we're given a sample workflow. It's tough to say otherwise, as I don't see those keywords in the R code used within the Naive Bayes tool.
Thanks very much for the reply, @acarter881! Let me prep the data and maybe trim some elements so I don't accidentally put our customer data out there.
By the way, there is one post on Stack Overflow with the exact same message, although I wasn't sure whether it applies to my case:
You're welcome, @goutdelete. I saw that as well (it isn't a complete match to your error, but it is close), but I'm sure it's something I should be able to fix if I had the workflow. 🙂
@acarter881 upon further investigation of the data prep, I realized I had a step that cleanses null values and forgot to update its logic (from isnull() then bad to != 0 then bad). So the target accidentally ended up with only one result: all good. I suppose it is indeed the same problem as the Stack Overflow thread, with only one class; Naive Bayes needs two outcomes to perform the analysis. See the two pics below:
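For what it's worth, here is a minimal sketch of the trap I fell into, outside of Alteryx and with a made-up data frame: if the cleansing logic leaves the target with a single class, there is nothing left for the model to separate.

# Minimal sketch (not the macro's actual code): a made-up data frame where
# the broken cleansing logic left every record flagged as Good.
library(e1071)

customers <- data.frame(
  Spend  = c(120, 80, 300, 45, 60, 500),
  Region = factor(c("N", "S", "N", "S", "N", "S")),
  Status = factor(rep("Good", 6))   # bug: only one class survives the cleanse
)

# Guard against the one-class case before modelling.
if (nlevels(droplevels(customers$Status)) < 2) {
  stop("Target has only one class; Naive Bayes needs at least two.")
}

model <- naiveBayes(Status ~ ., data = customers)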
Nonetheless, after the correction I seem to have a new problem: the confusion matrix only shows half, and I'm pretty sure it's wrong.
I attached both in the sample workflow below.
Thanks!
Hello, @goutdelete.
I believe these are the only tools you need in your top-most example; the other tools have no effect on the data (see first screenshot).
I think you may want to look into oversampling; 23 records of Bad likely isn't enough. Even if you split that 50/50 by oversampling Bad, you would only have 46 records in total going into the model, which is unlikely to produce useful results. The Naive Bayes macro uses 500 records (see second screenshot), split 252 Yes and 248 No. That is likely the type of split you want (i.e., roughly 50/50) and shows that you need more records going into the model.
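If it helps, here's a rough sketch of what oversampling the minority class could look like in R, in case you end up doing it manually; the data frame and column values are made up purely to show the mechanics.

# Rough sketch of oversampling the minority class; made-up data frame.
set.seed(42)
customers <- data.frame(
  Spend  = rnorm(100, mean = 200, sd = 50),
  Status = factor(c(rep("Good", 90), rep("Bad", 10)))
)

bad  <- customers[customers$Status == "Bad", ]
good <- customers[customers$Status == "Good", ]

# Resample the Bad rows with replacement until they match the Good count.
bad_oversampled <- bad[sample(nrow(bad), nrow(good), replace = TRUE), ]

balanced <- rbind(good, bad_oversampled)
table(balanced$Status)   # 90 Bad vs. 90 Good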
Hi @acarter881, thanks for the input; let me look into it.
However, I don't think removing the other tools would make an impact. The Select tool is there only because my original dataset has quite a lot more fields, and the Imputation tool is necessary; it's the only way I could think of to replace NAs with something like the average value without having to go the Python route.
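(Just to illustrate what I mean: a tiny sketch of mean imputation in R with made-up values; the Imputation tool does the equivalent for me inside the workflow.)

# Hedged sketch of mean imputation for a numeric field; the data frame
# and the Spend column are made up for illustration.
customers <- data.frame(Spend = c(120, NA, 300, 45, NA, 500))
customers$Spend[is.na(customers$Spend)] <- mean(customers$Spend, na.rm = TRUE)
customers$Spend   # NAs replaced with the column average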
My original data (still only a subset) has over 2,000 records, so I feel it should be sufficient. The 23 might be on the small side because I used a random sample tool to trim the sample size. It is indeed disproportionate, though, since this is a good/bad customer type of analysis; any business would be in big trouble if it were 50/50. :)
On the other hand, I ran into another new error message today after I changed the name of the result field. The confusion matrix table now shows up (but is still wrong), and the Score error message actually rolled back to what I got before, some "subscript out of bounds" error.
In this original subset it should be 143 Yes and 2,014 No, so the table is unfortunately still wrong.
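For reference, here's a quick sketch of how I'd sanity-check the matrix outside the macro, with made-up labels; a correct confusion matrix should show all four cells (both actual classes crossed with both predictions).

# Sanity-check confusion matrix with made-up vectors; real output should
# have a row for every actual class and a column for every prediction.
actual    <- factor(c("Yes", "No", "No", "Yes", "No", "No"), levels = c("Yes", "No"))
predicted <- factor(c("Yes", "No", "Yes", "No", "No", "No"), levels = c("Yes", "No"))
table(Actual = actual, Predicted = predicted)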