Data Science

Garabujo7 · ‎08-25-2022

In the first installment of this series, we talked about:

Getting the Data
Integration with Alteryx Designer
Prep Data
Data Health
Findings in the Data

In this second installment of three, we will talk about how to configure the platform to create predictive models.

If you missed Part 1, you can read it here.

Selection of the target variable

This is where we choose the variable we want to predict; this is the result we hope to get from our machine learning models.

In this example, the target variable will be the reservation status. It is a categorical variable since it has two values or categories we are interested in predicting: canceled and not canceled.

The objective of creating this predictive model will be to predict the reservations that would be canceled and those that would not be.

Beyond the prediction itself, understanding why some reservations are canceled while others are kept is very important for the health of the business.

The next option that we must select is the machine learning method that we will apply.

We have three possibilities:

Classification
Regression
Time Series Regression

Classification

It will be useful for us to assign a category to each reservation. Options can be two or more.

In our case, it will be the status of the reservation, which has two possibilities: Canceled and Not Canceled.

Another way of looking at it is to ask what the objective of our analysis is: knowing which characteristics the reservations that will be canceled have. In this way, we will be able to anticipate and take measures to reduce cancellations, reduce the cost of canceled reservations, and ensure that most reservations are kept over time.

Regression

When our objective is to predict a number or quantity, we apply a regression technique, which estimates the most likely value we will obtain and shows which variables most influence it.

In this example, it could be the cost of the ticket or the number of guests that the hotel will receive.

Time Series Regression

The third technique the platform offers lets us predict, for example, the number of people who will stay during the next six months. It projects the results over time so we can understand how they are likely to behave in the following periods.

Correlations

After selecting the target variable and the technique that we will use, the next step is to check the correlations between the variables.

This step is where we can eliminate variables that do not provide enough information to predict our target, or that are so similar in how they influence the result that it is very difficult to understand their individual effects.

Correlation matrix

Here the platform shows us the traditional correlation matrix. The drawback of this chart is that it is hard to read all the correlations when we have a large number of columns.

To simplify this, the platform gives us two options. The first is to select two variables individually and analyze them.

This way, we can review the correlation between the two variables in more detail.
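To make the idea of a pairwise correlation concrete, here is a minimal sketch of the Pearson coefficient between two variables in plain Python. The column names `lead_time` and `is_canceled` and the toy values are hypothetical, and we are assuming Pearson correlation for illustration; the platform may compute its correlations differently.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: days between booking and arrival, and a cancellation flag.
lead_time = [5, 40, 12, 90, 3, 120, 60, 8]
is_canceled = [0, 1, 0, 1, 0, 1, 1, 0]

r = pearson(lead_time, is_canceled)
print(f"correlation(lead_time, is_canceled) = {r:.2f}")
```

A value close to 1 or -1 means the two variables move together strongly, while a value near 0 means there is little linear relationship.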

Chord Diagram

The other way to visualize the correlations is through the chord diagram, which allows us to see the relationships easily, even if there are many variables.

Interestingly, we can adjust the correlation threshold to focus only on the variables with the highest correlation.

Thus, we can easily analyze the highly correlated variables, no matter how many variables there are.
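The effect of a correlation threshold can be sketched in a few lines: compute the correlation for every pair of columns and keep only the pairs whose absolute value reaches the threshold. This is only an illustration of the idea, not the platform's implementation; the column names, the toy values, and the helper `strong_pairs` are all hypothetical.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical reservation columns (toy values for illustration only).
columns = {
    "lead_time": [5, 40, 12, 90, 3, 120, 60, 8],
    "adults": [2, 2, 1, 2, 1, 2, 3, 2],
    "total_nights": [1, 3, 2, 7, 1, 10, 5, 2],
    "special_requests": [2, 0, 1, 0, 3, 0, 1, 2],
}

def strong_pairs(data, threshold):
    """Return (column_a, column_b, r) for pairs with |r| >= threshold, strongest first."""
    pairs = []
    for a, b in combinations(data, 2):
        r = pearson(data[a], data[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 2)))
    return sorted(pairs, key=lambda p: -abs(p[2]))

for a, b, r in strong_pairs(columns, threshold=0.8):
    print(f"{a} <-> {b}: {r}")
```

Raising the threshold shortens the list, which is the same effect we get in the chord diagram when we move the threshold up.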

Outliers

Out-of-range or outlier values can negatively influence our model results and typically require additional analysis to understand them.

The platform allows outliers to be removed. Alternatively, if the analyst decides to keep all outliers, the platform will handle them automatically.
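As an illustration of what flagging outliers can look like, here is a small sketch that uses the common interquartile range (IQR) rule. This is an assumption made for the example: the article does not say which detection method the platform uses, and the column name `nightly_rate` and its values are hypothetical.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Lower and upper limits of the IQR rule: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical nightly rates with one suspicious value.
nightly_rate = [80, 95, 100, 110, 105, 90, 98, 102, 950]

low, high = iqr_bounds(nightly_rate)
outliers = [v for v in nightly_rate if v < low or v > high]
kept = [v for v in nightly_rate if low <= v <= high]

print(f"limits: {low} to {high}")
print(f"outliers: {outliers}")
```

Whether to drop the flagged rows or keep them is still the analyst's decision; the rule only points to the values worth a closer look.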

Target Variable

Once we select the target variable, we can analyze its distribution to identify whether it is unbalanced or balanced.

In this case, our target variable has an acceptable distribution.

If the variable were unbalanced, with one category having many more values than the other, the platform would apply the appropriate techniques to make the most of the data when creating the models.

We will see those details later when the pipeline of each model is created.
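Checking the balance of the target by hand is straightforward, as the following sketch shows. It computes the share of each category and the ratio between the largest and smallest class. The labels and counts are made up for the example, and `class_shares` is a hypothetical helper, not a platform function.

```python
from collections import Counter

def class_shares(labels):
    """Fraction of rows that belong to each category."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical target column: 37 canceled and 63 not canceled reservations.
status = ["canceled"] * 37 + ["not canceled"] * 63

shares = class_shares(status)
imbalance_ratio = max(shares.values()) / min(shares.values())

for label, share in shares.items():
    print(f"{label}: {share:.0%}")
print(f"largest class is {imbalance_ratio:.1f}x the smallest")
```

A ratio close to 1 indicates a balanced target; the higher the ratio, the more attention the imbalance deserves.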

Model Training

This is where we select the parameters that will be used to train the machine learning models.

The first is to choose the metric we will use to evaluate the results.

Metrics for Model Evaluation

We have a variety of metrics available, so we can use the one that best suits the goal we are looking for. The selected metric influences which models the platform recommends: different metrics may lead to different recommended models.
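To see why the metric can change the recommended model, consider this small sketch with two hypothetical models scored on the same ten reservations. The four metrics used here (accuracy, precision, recall, and F1) are common classification metrics chosen for illustration; they are not necessarily the ones the platform offers, and the predictions are invented.

```python
def confusion(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, and true negatives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def scores(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for a binary prediction."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# 1 = canceled, 0 = not canceled (toy data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # cautious: predicts few cancellations
model_b = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # aggressive: predicts many cancellations

scores_a = scores(y_true, model_a)
scores_b = scores(y_true, model_b)
print("model A:", scores_a)
print("model B:", scores_b)
```

With these toy predictions, model A comes out ahead on accuracy while model B comes out ahead on recall, so the preferred model depends on which metric we optimize.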

The eight available metrics are:

I will share some resources that will help you select the right metric for your model. There are many metrics, so my recommendation is to try several of them and see which one yields the best results for your model.

You can evaluate your model with the selected metric as well as with the model evaluation that we will discuss in the third part of this series.

I say this because, as they say in machine learning, there is no free lunch in data science. Here is an excerpt on that notion: “Coming back to the lunch of it all, you can’t get good machine learning ‘for free.’ You must use knowledge about your data and the context of the world we live in (or the world your data lives in) to select an appropriate machine learning model. There is no such thing as a single, universally best machine learning algorithm, and there are no context or usage-independent (a priori) reasons to favor one algorithm over all others.”

Here are some resources to get more information on the evaluation metrics:

Model Search

To control the duration of model training, we can limit the time it takes to create new pipelines for each model we select.

To ensure that the models give the best results in production, we can select the number of K-Folds we will use for cross-validation. By default, the platform recommends 3.
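For readers new to cross-validation, here is a minimal sketch of how rows are divided into K folds, using K = 3 to match the default mentioned above. Each fold takes one turn as the validation set while the remaining rows are used for training. The function `k_fold_indices` is a hypothetical helper written for this example; it does not shuffle the rows, which we would normally do first.

```python
def k_fold_indices(n_rows, k=3):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_rows))
    # Spread the remainder so fold sizes differ by at most one row.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, k=3))
for number, (train, validation) in enumerate(folds, start=1):
    print(f"fold {number}: train={train} validation={validation}")
```

Every row is used for validation exactly once, so the averaged score depends less on one lucky or unlucky split.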

Ensemble Models

Ensembles are combinations of models that can create better results; for that, we can select the following option:

Holdout

The last setting is the percentage of data we will reserve to evaluate the final model.

The number recommended by the platform is 20%.
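The holdout idea can be sketched as follows: shuffle the rows, set aside 20% of them, and train only on the rest. The function name `holdout_split`, the fixed seed, and the list of row ids are assumptions made for this example; how the platform selects its holdout rows is not described here.

```python
import random

def holdout_split(rows, holdout_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and return (training_rows, holdout_rows)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Hypothetical dataset of 100 reservation ids.
reservations = list(range(100))
train_rows, holdout_rows = holdout_split(reservations)

print(f"training rows: {len(train_rows)}, holdout rows: {len(holdout_rows)}")
```

The holdout rows are never seen during training, so the score on them gives a fairer estimate of how the final model will behave on new reservations.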

Feature Engineering

When we develop machine learning models, another of the fundamental processes to obtain good results is Feature Engineering.

Feature engineering refers to creating new variables that did not exist in the original dataset. This process is trial and error to find the best features or variables that help us improve the results.

An example of creating new variables is calculating the age of customers when we only have their date of birth. The age did not exist, and we created it because it provides more information for the model.
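The age example can be written in a few lines of Python. This is a sketch of the manual calculation, with a hypothetical function name `age_on` and made-up dates; it is not how the platform derives its features.

```python
from datetime import date

def age_on(birth_date, reference_date):
    """Age in completed years on the reference date."""
    had_birthday = (reference_date.month, reference_date.day) >= (
        birth_date.month,
        birth_date.day,
    )
    return reference_date.year - birth_date.year - (0 if had_birthday else 1)

# Hypothetical customer born on 1990-08-26, evaluated on 2022-08-25.
customer_age = age_on(date(1990, 8, 26), date(2022, 8, 25))
print(customer_age)
```

The birthday check matters: subtracting the years alone would overstate the age of anyone whose birthday has not yet occurred in the reference year.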

Here the platform fully automates the creation of new variables through Primitives, which are formulas applied to existing variables. The resulting features are then used to train the models, keeping the ones that are useful for the goal and discarding the ones that are not.

There are 38 available Primitives; among them is, for example, the natural logarithm, which we can apply to transform numbers.
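As a rough illustration of what a natural logarithm transformation does to a numeric column, consider the sketch below. The column name `lead_time` and the helper `natural_log` are hypothetical, and returning `None` for non-positive values is a choice made for this example; we are not claiming this is how the platform's Primitive treats such values.

```python
import math

def natural_log(values):
    """Natural logarithm of each value; None where the logarithm is undefined."""
    return [math.log(v) if v > 0 else None for v in values]

# Hypothetical skewed column: most values are small, a few are very large.
lead_time = [1, 10, 100, 1000]
log_lead_time = natural_log(lead_time)

for original, transformed in zip(lead_time, log_lead_time):
    print(f"{original:>5} -> {transformed:.2f}")
```

The transformed values are much closer together than the originals, which is why the logarithm is a popular way to tame heavily skewed numeric variables.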

Conclusion

In this second part, we reviewed how to configure the platform to create predictive models:

Target Variable Selection
Machine Learning Methods
Correlations
Outliers
Target Variable
Model Training
Metrics for Model Evaluation
Feature Engineering

In the third and last part, we will review the results of the models:

Auto Modeling
Evaluation of the Models
Export and Predicting

Read part 3 here.