Data Science

Machine learning & data science for beginners and experts alike.
Alteryx

 

 

gif retrieved from giphy.com

 

 

If, like me, you have never built an analytical model, or you do not have enough time to learn statistics, data science, programming, databases, and SQL, but you know the business and have questions that you have not been able to answer without depending on other areas or professionals, this will interest you!

 

gif retrieved from giphy.com

 

 

 

The Citizen Data Scientist

 

First of all, we must talk about a role coined by Gartner: the citizen data scientist. This is a person who adds value to the analysis process and simplifies it by using analytical models for advanced diagnostics or with predictive and prescriptive capabilities, but who has no academic training in statistics, analytics, technology, or databases and whose job function lies outside those fields.

 

Assisted Modeling is therefore the platform par excellence for the citizen data scientist, since it lets them develop the analyses they need without formal training in data science or advanced statistics. It is oriented towards answering day-to-day business questions quickly, with the added value of learning more about the process along the way.

 

Garabujo7_2-1594929387491.png

 

 

Assisted Modeling explains each step it takes, so that it is clear what it is doing and why it made each decision. It even lets us make the selections manually if we disagree with its recommendations, further customizing the model.

 

 

Garabujo7_3-1594929387493.png

 

 

 

Below is an example of an explanation from the Assisted Modeling tool:

 

 

Garabujo7_4-1594929387494.png

 

Garabujo7_5-1594929387495.png

 

 

 

We see that it not only gives us recommendations; it also explains them and lets us decide whether to apply them, which makes the process more flexible.

 

 

CRISP-DM Methodology

 

For reference, the Assisted Modeling platform is based on the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, which defines six phases (business understanding, data understanding, data preparation, modeling, evaluation, and deployment) to follow in data analysis projects from any industry in order to create a systematic and repeatable process.

 

 

Garabujo7_6-1594929387522.png

 

 

 

Assisted Modeling vs. Auto Modeling

 

There are two philosophies for developing machine learning models: Assisted Modeling, the one proposed by Alteryx, and Auto-Modeling. Each has, like everything in life, pros and cons, and the choice depends on several factors. Here is a basic comparison of the two approaches:

 

  • Auto-Modeling: Fully automated construction of a predictive model. The user selects a data set, chooses the variable they want to predict, and the auto-model returns the best model it can find.

 

  • Assisted Modeling: A transparent modeling process that allows the user to control key decisions when building a predictive model. The user goes through predefined steps to build a model successfully and correctly, and the platform guides them along the way so that they understand each step and decision.

 

 

Garabujo7_7-1594929387534.png

 

 

This is a reference guide; in the end, the best way to decide is to use the platform to see if it is what you need. Download a trial version right now.

 

 

Enter Assisted Modeling

 

Assisted Modeling is part of Alteryx version 2020.2, the latest release, and introduces a new category of analytical tools for machine learning. It belongs to the Intelligence Suite option, which also includes a Text Mining category that I will talk about in an upcoming blog.

 

 

Garabujo7_8-1594929387538.png

 

 

 

How can I use it?

 

Because it is an additional component, a license is required to use it. If you download Alteryx version 2020.2 without that license, the machine learning and text mining tools will appear with a padlock next to them and will not be usable.

 

 

Garabujo7_9-1594929387539.png

 

 

If you already have your Intelligence Suite license, you can activate it to start using it. The good news is that the Intelligence Suite also has a trial version!

 

To start, you need data.

 

Garabujo7_10-1594929387540.png

 

 

For this article, I will use a sample set that includes customer data from a Telco. The next step is to place the Assisted Modeling tool on the canvas. 

 

 

Garabujo7_11-1594929387548.png

 

 

Garabujo7_12-1594929387551.png

 

 

 

To start Assisted Modeling, click on Run or use the shortcut Ctrl+R.

 

 

Garabujo7_13-1594929387552.png

 

 

 

Click Start Assisted Modeling.

 

 

Garabujo7_14-1594929387555.png

 

 

 

This displays the initial screen with an explanation of the process to create the model and a description of each stage.

 

 

Garabujo7_15-1594929387566.png

 

 

 

Step 1: Select the target variable

 

Select Start Building, which takes us to the screen for selecting the target variable (what we want to predict):

 

 

Garabujo7_16-1594929387576.png

 

 

 

The interesting thing about Assisted Modeling is that when selecting the target variable, it shows you an explanation of the type of variable and examples of what can be done with this kind of data.

 

To select the variable that we want to predict, we can ask ourselves what question we want the data to answer. Then we click Next.

 

When we select the target field, the tool automatically chooses the type of machine learning method and shows us use cases where it can be applied.

 

 

 

Garabujo7_17-1594929387579.png

 

 

 

In this case, what we want to predict is a classification: the model assigns each record to one of the available categories. Classification comes in two kinds: binary, with two categories (e.g., dog or cat), and multinomial, with more than two (e.g., high, medium, low).
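As a rough illustration of how a target column maps to a modeling task, here is a minimal Python sketch. It is not how Assisted Modeling decides internally; the function name `infer_task` and the `max_classes` threshold are assumptions made for this example.

```python
def infer_task(target_values, max_classes=20):
    """Guess the modeling task from the values of the target column.

    Assumption for this sketch: a numeric column with more than
    `max_classes` distinct values is treated as continuous.
    """
    values = [v for v in target_values if v is not None]
    distinct = set(values)
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if all_numeric and len(distinct) > max_classes:
        return "regression"
    if len(distinct) == 2:
        return "binary classification"
    return "multinomial classification"


# Toy target columns (made-up values, for illustration only)
pet = ["dog", "cat", "cat", "dog"]
risk = ["high", "low", "medium", "low"]
print(infer_task(pet))   # binary classification
print(infer_task(risk))  # multinomial classification
```

A numeric target with many distinct values, such as a monthly charge, would fall into the regression branch instead.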

 

 

Step 2: Configure data types

 

In this step, the correct data type is assigned to each field we will use for modeling. Based on the content, the Assisted Modeling tool recommends discarding some variables or changing their type. ID fields, for example, are discarded because they provide no information for the prediction.
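To see why ID fields are flagged, consider a simple heuristic: a column whose values are all different from each other cannot help group records. The sketch below is only an assumption about what such a check might look like, not the rule the tool uses, and the column names are hypothetical.

```python
def looks_like_id(values):
    """Flag a text or integer column whose observed values are all distinct.

    Float columns are skipped, because continuous measurements are often
    all distinct without being identifiers (an assumption of this sketch).
    """
    present = [v for v in values if v is not None]
    if len(present) < 2 or any(isinstance(v, float) for v in present):
        return False
    return len(set(present)) == len(present)


def recommend_drops(table):
    """Return the names of the columns that look like identifiers."""
    return [name for name, values in table.items() if looks_like_id(values)]


# Hypothetical Telco-style sample, made up for illustration
customers = {
    "customer_id": ["A1", "A2", "A3", "A4"],
    "contract": ["monthly", "yearly", "monthly", "monthly"],
    "monthly_charges": [29.5, 80.0, 29.5, 55.1],
}
print(recommend_drops(customers))  # ['customer_id']
```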

 

 

Garabujo7_18-1594929387590.png

 

 

 

The tool analyzes the content of the column.

 

 

Garabujo7_19-1594929387591.png

 

 

 

It recommends an action to take.

 

 

 

Garabujo7_20-1594929387593.png

 

 

 

It explains the recommended actions.

 

 

Garabujo7_21-1594929387594.png

 

 

 

Step 3: Clean up missing values

 

Fields with null or empty values create problems for analytical models. As part of the process, Assisted Modeling suggests imputation strategies to limit the impact of these gaps on the results of the model.

 

 

Garabujo7_22-1594929387602.png

 

 

 

Imputing means assigning a substitute value to an empty or null field. A variable can be discarded completely if it provides no information or has very few values; otherwise, its missing entries can be replaced with the median, mode, or mean of the rest of the values. In this way, we can take advantage of fields that have incomplete information.
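The median, mode, and mean strategies described above can be sketched in a few lines of plain Python. This is a minimal illustration using the standard `statistics` module; the function name `impute` and the sample values are made up for this example.

```python
from statistics import mean, median, mode


def impute(values, strategy="median"):
    """Replace missing entries (None) with a statistic of the observed ones."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot impute a column with no observed values")
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](present)
    return [fill if v is None else v for v in values]


# Toy columns, for illustration only
tenure_months = [12, None, 24, 36, None]
contract = ["monthly", None, "yearly", "monthly"]

print(impute(tenure_months, "median"))  # [12, 24, 24, 36, 24]
print(impute(contract, "mode"))         # ['monthly', 'monthly', 'yearly', 'monthly']
```

The mode is the usual choice for categorical fields, since a median or mean is not defined for text values.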

 

 

Garabujo7_23-1594929387607.png

 

 

Step 4: Select features

 

From the variables available to the model, we can choose those most strongly associated with what we seek to predict, so that the result is more accurate.

 

 

Garabujo7_24-1594929387619.png

 

 

 

In this case, it indicates that the variable is a good predictor according to the Gini and GKT (Goodman-Kruskal tau) analysis.

 

 

Garabujo7_25-1594929387621.png

 

 

This step also includes an explanation of the techniques used to evaluate the details of the predictor. Predictors are the independent variables that will help us predict the target.
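For readers curious about what a measure like GKT captures, the sketch below computes the textbook Goodman-Kruskal tau for two categorical columns: the share of the target's Gini impurity that disappears once the predictor is known. How Assisted Modeling computes its own scores is not shown in this article, so treat this as an illustration of the general idea; the function names and data are assumptions.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def goodman_kruskal_tau(predictor, target):
    """Proportional reduction in the target's Gini impurity given the predictor.

    Returns 0.0 when the predictor tells us nothing and 1.0 when it
    determines the target completely.
    """
    n = len(target)
    base = gini_impurity(target)
    if base == 0:
        return 0.0  # the target has a single class; nothing to explain
    groups = {}
    for x, y in zip(predictor, target):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(ys) / n * gini_impurity(ys) for ys in groups.values())
    return (base - conditional) / base


# Toy data, for illustration only
churn = ["yes", "yes", "no", "no"]
contract = ["monthly", "monthly", "yearly", "yearly"]  # matches churn exactly
gender = ["f", "m", "f", "m"]                          # unrelated to churn

print(goodman_kruskal_tau(contract, churn))  # 1.0
print(goodman_kruskal_tau(gender, churn))    # 0.0
```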

 

 

 

Garabujo7_26-1594929387623.png

 

Step 5: Select algorithms

 

The last step allows us to select the algorithms that we want to use for prediction, in keeping with the “no free lunch” theorem, which states that no algorithm is best for all cases: you have to try different ones to find the one that best fits the data and the specific situation.

 

 

Garabujo7_27-1594929387637.png

 

 

 

For categorical target variables, four algorithms are available:

  • Logistic regression
  • Decision tree
  • Random Forest
  • XGBoost

 

For a continuous (numerical) target variable, three algorithms are available:

  • Linear regression
  • Decision tree
  • Random Forest

 

Each one has its definition, pros, cons, and practical cases where it is applied.

 

 

Garabujo7_28-1594929387640.png

 

 

 

We click on Run the selected algorithms to train them.

 

 

Model Comparison

 

Once training of the selected models has finished, Assisted Modeling presents the overall and individual results, together with an explanation of the metrics and a recommendation of the best model based on its accuracy and processing time.

 

 

Garabujo7_29-1594929387655.png

 

 

 

In this case, the platform advises that the best model is XGBoost, with an accuracy of 80% and a processing time of 13 seconds.

We can also evaluate the confusion matrices that explain the model's ability to predict each option, which is important depending on the use case we are analyzing.
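A small sketch shows why the confusion matrix matters alongside accuracy. The labels below are made-up toy data, not the results from this article, and the helper names are assumptions for the example.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) pair of labels."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Share of records whose predicted label matches the actual one."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


def recall(actual, predicted, label):
    """Share of records that truly belong to `label` and were predicted as such."""
    matrix = confusion_matrix(actual, predicted)
    total = sum(count for (a, _), count in matrix.items() if a == label)
    return matrix[(label, label)] / total if total else 0.0


# Toy labels, for illustration only
actual = ["stay", "stay", "stay", "stay", "churn",
          "churn", "stay", "stay", "stay", "churn"]
predicted = ["stay", "stay", "stay", "stay", "stay",
             "churn", "stay", "stay", "stay", "stay"]

print(accuracy(actual, predicted))         # 0.8
print(recall(actual, predicted, "churn"))  # 0.333... (1 of 3 churners found)
```

In this toy case the model is right 80% of the time, yet it finds only one of the three customers who actually left, which may be the option the business cares about most.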

 

 

Garabujo7_30-1594929387658.png

 

 

 

The importance of variables is another characteristic that is presented.

 

 

Garabujo7_31-1594929387659.png

 

 

 

This tells us which variables each model considers most important for predicting the target variable. That helps us focus on the most relevant variables and direct our actions at those that may have the greatest impact.

 

Are you a developer who prefers to write code by hand because it gives you more control? No problem, Assisted Modeling is here to help you. You can create prototypes or drafts of the models you require and export them to Python to effortlessly create the base of your model with just a few clicks.

 

Select Export model to Python.

 

 

Garabujo7_32-1594929387662.png

 

 

 

And now you can see the model in Python code within Alteryx Designer to start using it immediately.

 

To finish the process, select the winning model by clicking on its check mark and then clicking on Add Models and Continue to the Workflow.

 

 

Garabujo7_33-1594929387669.png

 

 

 

That creates a complete workflow that you can use to score your data: batch scoring within Designer or Alteryx Server, integration with another system through the REST API of Alteryx Server, or even real-time scoring with Alteryx Promote.

 

 

Garabujo7_34-1594929387679.png

 

 

This shows the entire process of the model in Python, in the Jupyter notebook created by the Python tool in Designer, including the steps and explanations!

 

 

Garabujo7_35-1594929387681.png

 

 

Garabujo7_36-1594929387689.png

 

 

 

Scoring

 

To score new data after model training, we can connect the new dataset and use the Predict Values tool to assign each record a probability of dropping out (churning).

 

 

Garabujo7_37-1594929387691.png

 

 

 

Even after the model is finished, we can modify the parameters to further refine it, giving great flexibility to the process.

 

Garabujo7_38-1594929387697.png

 

 

 

And do not forget that it continues to explain each parameter you select.

 

 

Garabujo7_39-1594929387700.png

 

 

 

Justify decisions through self-documentation

 

You have now created your first analytical model, but you are not an expert in this. How can you justify the results or explain them to the data science experts?

 

gif retrieved from giphy.com

 

 

 

Do not worry, Assisted Modeling is here to help you with that part, too!

 

The assistant showed us what it was going to do at each stage, and at the end of the process it created the analytical flow (or analytical pipeline) with all the steps and decisions we made. Now you can show it to the experts to justify the work, as well as to quality assurance staff, auditors, and reviewers who need to verify how decisions are being made.

 

 

Garabujo7_41-1594929387814.png

 

 

The flow includes all the steps and we can review and even modify them if necessary.

 

 

Garabujo7_42-1594929387819.png

 

 

 

Additionally, if you want to discuss the results with more people or in another context, you can export the results reports in HTML and take them with you to that important meeting.

 

 

Garabujo7_43-1594929387821.png

 

 

 

This is true augmented intelligence: the ability to combine one's own experience with the potential of machine learning.

 

What really gives you power ...

 

 

Taken from Giphy

 

 

 

... is the thrill of solving with Alteryx!

Comments

Thanks for posting, very well explained!

 

There is one feature that would be immeasurably helpful for those (like myself!) who have no background in Data Science but otherwise have a good understanding of some of the concepts. 

When selecting a target variable, I note that when selecting from the list there doesn’t seem to be an explanation of why certain fields can’t be used as a target.

 

I can imagine that many will have scenarios where they know what they want the predictor variable to be but AM won’t allow it to be used. 

Being given some context on why would be incredibly helpful for both real world scenarios and the learning process.