Alteryx Designer Discussions


Text Classification via Naive Bayes tool

Alteryx Partner

Objective: I have about 100K records (V_string, size 10K+) and I need to classify them based on certain predictor variables.

Target variable (Test data-set): 10K records, datatype - V_string (size 10K+)

Predictor Variables (Training data-set):

1.) 100K records, datatype - V_string (size 10K+)

2.) 100K records, datatype - Binary

I want to use the Naive Bayes tool to classify these string variables (100K distinct records).

Challenge: I'm not able to pass more than 50 records in the training data-set. When I do, the following error is thrown:

Error: Naive Bayes Classifier (34): Naive Bayes Classification: Error: ngrid1=50 is less than the number of levels 98 in 'MatchKey'
Error: Naive Bayes Classifier (34): Naive Bayes Classification: Execution halted
Error: Naive Bayes Classifier (34): Naive Bayes Classification: The R.exe exit code (1) indicated an error.

 

So basically my training set is limited to 50 distinct records, which means the model will not be trained well and the results will be substandard.

What am I looking for?

Is there a workaround for the Naive Bayes tool, or another tool in Alteryx that can perform this classification? If so, how would I integrate it into my current workflow?

Thanks in advance.

Alteryx Certified Partner

@Raghu,

 

There is a post from Dr. Dan that addresses the limit: https://community.alteryx.com/t5/Advanced-Analytics/Allow-higher-number-of-distinct-values-for-categ...

Here is his response:

 

This is a hard limit in the naiveBayes function of the e1071 R package that is used to implement the model. A similar hard limit on the number of categories for variables exists in the Forest Model tool as a result of the underlying R package. The reason for this is that the combinatorics involved in these algorithms get out of hand when there are a lot of levels. In addition, when a categorical variable has a lot of levels, many of those levels often have only a small number of counts and become unreliable predictors. My advice is to consolidate the categories into a smaller number, making sure that there are a reasonable number of counts in each category.
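
One way to act on that advice, sketched in plain Python rather than as an Alteryx workflow, is to keep only the most frequent levels of the categorical field and fold the rest into a catch-all level. This is only an illustration and is not part of Dr. Dan's post: the function name `consolidate_levels`, its parameters, the "Other" label, and the sample values are all hypothetical, and the default cap of 50 is simply taken from the `ngrid1=50` value in the error message.

```python
from collections import Counter


def consolidate_levels(values, max_levels=50, min_count=2, other_label="Other"):
    """Keep the most frequent levels and map every other value to other_label.

    Assumptions (hypothetical, for illustration only):
    - max_levels is the largest number of distinct levels the model accepts
    - min_count is what counts as a "reasonable" number of records per level
    """
    counts = Counter(values)
    # Reserve one slot for the catch-all label so the result never has
    # more than max_levels distinct levels.
    keep = {
        level
        for level, n in counts.most_common(max_levels - 1)
        if n >= min_count
    }
    return [v if v in keep else other_label for v in values]


# Made-up sample data standing in for a high-cardinality field such as MatchKey.
match_keys = ["A"] * 5 + ["B"] * 3 + ["C", "D", "E"]
consolidated = consolidate_levels(match_keys, max_levels=3, min_count=2)
print(sorted(set(consolidated)))
```

The same idea could be applied to the field named in the error (MatchKey) before it reaches the Naive Bayes tool. What counts as a reasonable `min_count` depends on the data, so the value here is a placeholder.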
