Hi everyone,
I am currently working on a classification machine learning workflow, and some of the string variables need to be handled differently because each of them contains more than 54 categories. I found the attached workflow in a conversation from a few years ago, which shows how I can do the encoding. I work with sales data, and the variables that I need to encode (or handle with an alternative method) are End user ID, sales office, reseller, reseller parent, etc.
I would like to find out whether this method of encoding will bias the model, i.e. whether a larger number will be treated as a larger value. I understand that this type of encoding will not work for low, medium, high, but will it work for my use case?
If anyone has some expert advice on encoding variables, I would appreciate it!
Thank you for helping
Hi @Roche, the attached workflow only converts each string value to a corresponding number. In your model, if the field is set as a string, its different values will not lead to a biased model. In other words, if the field contains 1 and 2 and it is a string, the 2 does not mean a higher value than 1. That would only apply to numeric fields. As strings, 1 and 2 are therefore no different from letters.
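To make the point above concrete, here is a minimal sketch in plain Python (not the attached Alteryx workflow). The `label_encode` helper and the office names are hypothetical and only for illustration. It shows that the integer codes are arbitrary labels: kept as numbers they suggest an order and support arithmetic that has no meaning, while kept as strings they are just names.

```python
def label_encode(values):
    """Map each distinct string to an integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1  # next unused code: 1, 2, 3, ...
        encoded.append(codes[value])
    return encoded, codes


# Hypothetical sales offices, used only for illustration.
offices = ["North", "South", "North", "East"]
encoded, codes = label_encode(offices)

# As numbers, the codes imply East (3) > South (2) > North (1), which is
# an accident of the order in which the values appeared.
numeric_mean = sum(encoded) / len(encoded)  # computable, but meaningless

# As strings, the same codes carry no order; they are just labels.
as_strings = [str(code) for code in encoded]

print(codes)
print(encoded)
print(as_strings)
```

Whether a given modelling tool respects the string type in this way depends on the tool, so it is worth checking how the field is typed before training.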
Hi @gabrielvilella. Thank you for your advice. The few fields that I mentioned will need to be either encoded or handled with another method. Alteryx does not allow me to have more than 54 different categories in a field, so I might need to use a different method.
Hi @Roche,
If you ensure those variables are strings, you can then One-Hot Encode them into a binary grid. Each category then becomes its own 0/1 column, so the values are treated as having no intrinsic order.
I've attached an example workflow with the tool to help you.
NOTE: You must create the RecordID field as a string. I will fix this in a later release!
M.
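For readers without the attached workflow, the binary grid described above can be sketched in plain Python. This is an illustrative sketch, not the tool from the workflow; the `one_hot_encode` helper and the reseller names are hypothetical.

```python
def one_hot_encode(values):
    """Return the column names and a 0/1 grid with one column per distinct value."""
    categories = sorted(set(values))
    grid = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, grid


# Hypothetical resellers, used only for illustration.
resellers = ["Acme", "Bolt", "Acme", "Cyan"]
columns, grid = one_hot_encode(resellers)

print(columns)
for row in grid:
    print(row)
```

Each row contains exactly one 1, in the column of its own category. The grid gets one column per distinct value, so a field with around 1000 categories would produce around 1000 columns.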
I believe what you need is to create a new variable based on the one that has too many categories. You can combine, for example, 10 similar categories into one, but determining which ones can be combined is a job for a data scientist.
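As a rough illustration of building such a reduced variable, the sketch below uses a simple frequency rule: keep the most common categories and fold the rest into a single "Other" bucket. This is a swapped-in heuristic, not the similarity-based grouping described above, which still needs domain knowledge. The `group_rare_categories` helper, the "Other" label, and the sample IDs are all hypothetical.

```python
from collections import Counter


def group_rare_categories(values, max_categories, other_label="Other"):
    """Keep the most frequent categories and fold the rest into other_label.

    The result has at most max_categories distinct values, including other_label.
    """
    counts = Counter(values)
    if len(counts) <= max_categories:
        return list(values)  # already within the limit, nothing to group
    # Reserve one slot for the "Other" bucket.
    frequent = {category for category, _ in counts.most_common(max_categories - 1)}
    return [value if value in frequent else other_label for value in values]


# Hypothetical end user IDs, used only for illustration.
end_users = ["A", "A", "A", "B", "B", "C", "D"]
grouped = group_rare_categories(end_users, max_categories=3)

print(grouped)
```

With the limit mentioned in this thread, `max_categories=54` would keep the 53 most frequent values plus the "Other" bucket. Whether rare values can safely be pooled depends on the data.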
Hi @mceleavey, thank you for your advice. The 5 variables that I am trying to find a solution for each have over, or close to, 1000 categories. However, I think grouping them under higher-level categories will be the best option here. I like the idea of one-hot encoding for some other variables.
Hi @gabrielvilella, yes, I believe that would be the correct approach. Thank you.