Hi everyone,
I am currently busy with a Classification Machine learning workflow and some of the string variables needs to be dealt with in a different manner since the number of categories contained in them exceed 54. I have found the attached flow from a conversation a few years ago which shows me how I can do the encoding. I work with sales data and the type of variables that I need to encode (or use an alternative method) are End user ID, sales office, reseller, reseller parent etc.
I would like to find out if this method of encoding will yield biased data, i.e. as the numbers increase etc.? I can understand that this type of encoding will not work for low, medium, high, but will it work for my use case?
If anyone have some expert advice on encoding variables, I would appreciate it!
Thank you for helping