Encoding categorical variables without introducing bias

Question

Hi everyone,

I am currently working on a classification machine learning workflow, and some of the string variables need to be handled differently because the number of categories in each exceeds 54. I found the attached workflow in a conversation from a few years ago, which shows how to do the encoding. I work with sales data, and the variables I need to encode (or handle with an alternative method) include end user ID, sales office, reseller, and reseller parent.

I would like to find out whether this method of encoding will yield biased data, i.e. whether the model will read meaning into the codes as the numbers increase. I understand that this type of encoding will not work for values such as low, medium, high, but will it work for my use case?

If anyone has some expert advice on encoding variables, I would appreciate it!

Thank you for helping

ChangingStrings_UniqueCategorical.yxmd

Roche · Answer

Hi @gabrielvilella , yes I believe that would be the correct approach.  Thank you

Roche · Answer

Hi @mceleavey, thank you for your advice. Each of the 5 variables I am trying to find a solution for has close to or over 1,000 categories. However, I think grouping them under higher-level categories will be the best option here. I like the idea of one-hot encoding for some other variables.
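
For anyone reading later, here is a minimal sketch of what one-hot encoding does, written in plain Python rather than as a workflow. The `channel` column and its values are made up for illustration, and `one_hot_encode` is a hypothetical helper, not something from the attached flow. Each category gets its own 0/1 column, so no category is treated as numerically larger than another.

```python
def one_hot_encode(values):
    """Return the sorted category names and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical low-cardinality column (assumed example data)
channel = ["direct", "partner", "online", "direct"]
categories, rows = one_hot_encode(channel)

print(categories)
for value, row in zip(channel, rows):
    print(value, row)
```

This only stays practical while the number of categories is small, since every category adds a column. That is why it suits the other variables and not the ones with around 1,000 categories.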

gabrielvilella · Answer

I believe what you need is to create a new variable derived from the one that has too many categories. For example, you can combine 10 similar categories into one, but determining which categories can be combined is a job for a data scientist.
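
As a rough sketch of that idea in plain Python, here are two ways the new variable could be built. Everything named here is an assumption for illustration: the office names, the `office_to_region` lookup, and the helpers `group_by_mapping` and `group_rare`. The first uses a hand-made lookup (the part that needs domain knowledge), and the second is a simple frequency rule that lumps rarely seen categories into "Other".

```python
from collections import Counter

def group_by_mapping(values, mapping, other_label="Other"):
    """Map each category to a higher-level group; unmapped values become other_label."""
    return [mapping.get(v, other_label) for v in values]

def group_rare(values, min_count, other_label="Other"):
    """Keep categories seen at least min_count times; replace the rest with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Hypothetical sales office column (assumed example data)
sales_office = ["Lyon", "Paris", "Berlin", "Paris", "Munich", "Oslo"]

# Hypothetical lookup that someone who knows the data would have to define
office_to_region = {
    "Lyon": "France",
    "Paris": "France",
    "Berlin": "Germany",
    "Munich": "Germany",
}

by_region = group_by_mapping(sales_office, office_to_region)
by_frequency = group_rare(sales_office, min_count=2)

print(by_region)
print(by_frequency)
```

The lookup approach keeps the groups meaningful, while the frequency rule needs no domain input but can merge unrelated categories, so the threshold is worth checking against the data.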