Data Science

Machine learning & data science for beginners and experts alike.
mceleavey
17 - Castor
17 - Castor

 

After using the predictive tools in Alteryx over the years I became frustrated with how to correctly use non-binary categorical variables in predictive modelling, and wondered how these should be used, and if indeed they could be used in Alteryx.

CristonS_0-1590678752743.png

Following a conversation with a Microsoft analyst who gave me an overview of the Azure ML platform (baby's first Alteryx), in which the response was less than clear and left me with more questions than answers ...

 



 

 

CristonS_1-1590678752757.png

 

... I decided to do some digging. I got to the point where phrases such as "binarizing" and "segmentation" were used before in stepped that hero of all things Alteryx, Dr. Nick Jewell.

Dr. Nick mentioned something called "One-Hot Encoding." Now, this was a phrase I had never come across before and I was intrigued. A few hours later and with my brain effectively a marshmallow, I had found what I needed, and more importantly, I had an understanding of why I was doing what I was doing.

In short, predictive models are comfortable with numeric values, but categorical strings make their eyes twitch and ruin their appetites, because of reasons. Now, if the string is a binary, or boolean response, then all is good with the world and we can continue dancing in meadows and sharing sundaes, but when there are multiple values in a categorical field, then your model will simply fold its arms and steadfastly refuse to speak to you, like a petulant teenager.

With this in mind, I set about one-hot encoding my multi-value categorical variable fields, which essentially means I created a boolean response to each value in the field, which basically creates a yes/no grid system.

In the following example, I am trying to predict the miles per gallon (MPG) for any car, given the basic information such as number of cylinders, total displacement, horsepower etc. Now, all these fields are numeric aside from one, [Model].


CristonS_2-1590678752761.png



Using OHE, I will create a field for each of the values in the model column and will give a y/n response for each value. This will convert the model column into multiple binary category columns, which can be handled nicely in most predictive models.

To do this, I carried out some pivoting and messing around with the data which gave me the following outcome:

CristonS_3-1590678752766.png



You can see the column for each model has now been created, and 1 or 0 denotes a true/false value for each.

The data is now in the correct format to be used in our predictive model.

 

To make this process easier and re-usable, I built the process into a macro:


CristonS_4-1590678752767.png



(shout out to @JohnJPS whose community post got me up and running. You can see the original post HERE.)

I uploaded this to the Alteryx Gallery, so you can download the tool HERE.

 

 

The workflow therefore is as follows:

CristonS_5-1590678752770.png



This works by loading in the data, using the auto field to correctly amend the data type to numeric and string values, then assigning the RecordID. At this point, the data is fed into the OHE tool, joined back to the main stream, and the extra fields dropped accordingly. We are now ready to use our predictive models:


CristonS_6-1590678752772.png



And as you can see, we now have each categorical value as a binary predictor variable. Simple.

 

So, there you go, One-Hot Encoding and how to use multi-value categorical variables in predictive models. This is a vital step in data preparation for predictive analytics and will allow you to use much more of your data as predictor variables, hopefully increasing accuracy along the way.

Comments
kendalld14
5 - Atom

One Hot Encoding can get messy and confusing. This is a great way to simplify the valuable pre-processing step!

mceleavey
17 - Castor
17 - Castor

Thanks @kendalld14 , that's great feedback as that was the intention. I want to write a few blogs as introductions to data science, and show that with Alteryx it's not a terrifying thing to be feared, like diets or Monday, but can be easily accessible even for normies like me.

 

M.

kendalld14
5 - Atom

@, the blog posts sound great! Looking forward to them. I'm a big believer in the importance of democratizing data science...the value of data is huge. Lowering the barriers to entry for newbies will exponentially increase the value we can gather from all the available data in the world.

RickPack
8 - Asteroid

When I use the One-Hot Encoder tool, I am losing the leading zeroes from RecordID. Am I doing something wrong?

RickPack_0-1595263930635.png   RickPack_2-1595263997832.png 

 

 

 

mceleavey
17 - Castor
17 - Castor

Hi @RickPack ,

 

Your RecordID must be a string.

 

mceleavey_0-1595336407866.png

 

Hjardine
8 - Asteroid

Hi Chris,

 

My record ID is a string and am still having the same issue as @RickPack .

 

 
 

1.PNG2.PNG

alphacharlie33
5 - Atom

I have run into the same issue as @rickpack and @Hjardine -- did anyone find a solution for this?

 

Confirmed my RecordID is a string and that I have removed the extra / overlapping strings when using Join.

 

Thanks!

spencerhong
Alteryx
Alteryx

I am running into the same issue.

  1. RecordID macro - add ID numbers (Type = String)
  2. Select macro - select RecordID & categorical column to use for OHE (Selected those two as I have dozens of columns and the OHE macro conducts on each column that slows the time to run the workflow)
  3. One-Hot Encoder macro - (Choose field = RecordID)
  4. Join macro - issue is that in the OHE macro step, the RecordID column gets changed to go from 00001 to 1, 00002 to 10, 00003 to 100, etc. until 00006 to 10001 and the whole renaming process occurs

 

When I go to then join, the recordIDs do not match because of step 4.

 

The workaround I have found is to "Join by Record Position"; however, this is a small static sample. How do I make it so that OHE macro does not change the recordIDs?

Erin
9 - Comet

Thank you for explaining this so clearly @mceleavey!! Mind blown and excited to amp up my predictor game with the OHE macro!  

mceleavey
17 - Castor
17 - Castor

Thanks @Erin , I hope it's helpful.

 

I didn't realise I'd made a mistake with the example workflow, so for @Hjardine , @RickPack @spencerhong  et al, you need to select the fields you want to OHE then join it back to the main stream:

 

mceleavey_0-1636731008329.png

 

In this example I have the following data:

 

mceleavey_1-1636731035641.png

I wish to OHE the Supplier and Country columns.

Therefore, I select the following fields in the select tool:

mceleavey_2-1636731095291.png

 

Then I simply join it back on the RecordID.

You drop the extra fields and should have:

 

mceleavey_3-1636731131381.png

 

Hopefully that clears it up. Apologies for the lateness of my reply!

 

M.