After using the predictive tools in Alteryx over the years I became frustrated with how to correctly use non-binary categorical variables in predictive modelling, and wondered how these should be used, and if indeed they could be used in Alteryx.
Following a conversation with a Microsoft analyst who gave me an overview of the Azure ML platform (baby's first Alteryx), in which the response was less than clear and left me with more questions than answers ...
... I decided to do some digging. I got to the point where phrases such as "binarizing" and "segmentation" were used before in stepped that hero of all things Alteryx, Dr. Nick Jewell.
Dr. Nick mentioned something called "One-Hot Encoding." Now, this was a phrase I had never come across before and I was intrigued. A few hours later and with my brain effectively a marshmallow, I had found what I needed, and more importantly, I had an understanding of why I was doing what I was doing.
In short, predictive models are comfortable with numeric values, but categorical strings make their eyes twitch and ruin their appetites, because of reasons. Now, if the string is a binary, or boolean response, then all is good with the world and we can continue dancing in meadows and sharing sundaes, but when there are multiple values in a categorical field, then your model will simply fold its arms and steadfastly refuse to speak to you, like a petulant teenager.
With this in mind, I set about one-hot encoding my multi-value categorical variable fields, which essentially means I created a boolean response to each value in the field, which basically creates a yes/no grid system.
In the following example, I am trying to predict the miles per gallon (MPG) for any car, given the basic information such as number of cylinders, total displacement, horsepower etc. Now, all these fields are numeric aside from one, [Model].
Using OHE, I will create a field for each of the values in the model column and will give a y/n response for each value. This will convert the model column into multiple binary category columns, which can be handled nicely in most predictive models.
To do this, I carried out some pivoting and messing around with the data which gave me the following outcome:
You can see the column for each model has now been created, and 1 or 0 denotes a true/false value for each.
The data is now in the correct format to be used in our predictive model.
To make this process easier and re-usable, I built the process into a macro:
(shout out to @JohnJPS whose community post got me up and running. You can see the original post HERE.)
I uploaded this to the Alteryx Gallery, so you can download the tool HERE.
The workflow therefore is as follows:
This works by loading in the data, using the auto field to correctly amend the data type to numeric and string values, then assigning the RecordID. At this point, the data is fed into the OHE tool, joined back to the main stream, and the extra fields dropped accordingly. We are now ready to use our predictive models:
And as you can see, we now have each categorical value as a binary predictor variable. Simple.
So, there you go, One-Hot Encoding and how to use multi-value categorical variables in predictive models. This is a vital step in data preparation for predictive analytics and will allow you to use much more of your data as predictor variables, hopefully increasing accuracy along the way.