
When we start developing a machine learning model, we often find that the data needs careful preparation before our models can interpret it well. How much preparation is needed depends on the type of model, since some models accept values that others cannot handle.

 

For this reason, when we want to test multiple models and evaluate which one gives the best result, the ideal approach is to prepare the data so that it can be fed to any type of model. Some tasks are unavoidable: scaling numerical variables so that they all fall in similar ranges and none dominates simply because of its magnitude, or converting categorical variables into numerical ones that the models can understand.

 

On this occasion, I have prepared a macro for Alteryx, available at the link at the end of the article, that performs the latter task: converting categorical variables into numerical ones, a process called one-hot encoding. In this process, we create a new flag column for each distinct category in a non-numeric column, and in each row at most one of those flags is equal to 1, depending on the row's value. Visually, it would be something like this:

 

Figure: Conversion from categorical to numeric variable
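
For readers who think in code, the same idea can be sketched in Python with pandas. This is only an illustration of one-hot encoding in general, not the macro's internals; the toy data and the column name `color` are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical toy data; "color" is the categorical column we want to encode.
df = pd.DataFrame({"id": [1, 2, 3, 4], "color": ["red", "green", "blue", "green"]})

# One flag column per distinct category; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

print(encoded)
```

Each row of `encoded` has exactly one flag set to 1 here, because every category present in the data received its own column.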

 

An important aspect of carrying out this task when training a machine learning model is that it must be repeatable. For every categorical variable we convert, we must be able to repeat the process identically when we later make predictions on new cases, ending up with exactly the same columns. Therefore, we must save, together with the trained model, the category values behind the new columns, so that we do not create columns unknown to the model. We also do not want necessary columns to be missing, even if all their values are 0.
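
A minimal sketch of this requirement in Python with pandas, assuming we keep the list of training columns next to the model. The data and the variable names (`train_columns`, `new_encoded`) are hypothetical, and the macro may solve this differently.

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "green", "blue"]})
train_encoded = pd.get_dummies(train, columns=["color"], dtype=int)

# Remember the exact columns the model was trained on (to be saved with the model).
train_columns = list(train_encoded.columns)

# New data: "purple" was never seen in training, and "blue"/"green" do not appear.
new = pd.DataFrame({"color": ["red", "purple"]})
new_encoded = pd.get_dummies(new, columns=["color"], dtype=int)

# Align to the training layout: drop unknown columns, add missing ones filled with 0.
new_encoded = new_encoded.reindex(columns=train_columns, fill_value=0)

print(new_encoded)
```

After the alignment, the unknown `color_purple` column is gone and the missing `color_blue` and `color_green` columns are present with value 0, so the model receives the layout it was trained on.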

 

This new macro is very simple and lets us select whether we are using it in a predictive model training workflow or in a scoring one. In training mode, the one-hot values are stored in the selected folder; in scoring mode, they are loaded from that folder so that the process is repeated exactly as it was carried out in training.

 


 

In addition, the macro lets us select which categorical variables of our dataset should go through one-hot encoding, as well as the maximum number of categories to consider when generating the new variables. The latter is important because variables with high cardinality (i.e., many different values) can produce so many flag columns that training the model becomes harder than necessary. In that case, the macro creates columns only for the n most common values of each categorical variable.
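
To make the "n most common values" idea concrete, here is a small pandas sketch. The helper `one_hot_top_n` and the sample data are hypothetical and only illustrate the technique; how the macro implements the limit internally is not described here.

```python
import pandas as pd


def one_hot_top_n(series, n):
    """Create 0/1 flag columns only for the n most frequent values of `series`."""
    top = series.value_counts().nlargest(n).index
    return pd.DataFrame(
        {f"{series.name}_{value}": (series == value).astype(int) for value in top},
        index=series.index,
    )


# "A" appears 3 times, "B" twice, "C" and "D" once each.
products = pd.Series(["A", "A", "A", "B", "B", "C", "D"], name="product")
flags = one_hot_top_n(products, n=2)

print(flags)
```

Rows holding the rarer values "C" and "D" end up with 0 in every flag column, which is why at most one flag per row is equal to 1. When scoring, the list of top values would be the one saved during training rather than being recomputed on the new data.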

 

It is important to highlight that, just as when we train a machine learning model with code, the prediction stage must always use the same configuration chosen in the training stage, so that the flag columns are created identically in both phases and the models operate smoothly.

 

I've included some examples of how to use the macro at the bottom of this article. I hope you find it useful, and let's train models with Alteryx!

 

Note: scikit-learn version 1.2.1 or higher is needed to run this macro.