
DiganP
Alteryx

DiganP_0-1614892618180.png

 

 

This post takes you on a guided tour of using EvalML to build and evaluate supervised machine learning pipelines. We'll revisit the BigMart dataset we worked on in my recent Featuretools post.

 

What is EvalML?

EvalML is an AutoML library that builds, optimizes and evaluates machine learning pipelines using domain-specific objective functions. Combined with Featuretools, EvalML can be used to create end-to-end supervised machine learning solutions.

 

Background on the data for this example:

We are going to be looking at the BigMart dataset, which covers 1,559 products across 10 stores. You can think of it as two tables in one: an Item table and an Outlet table.

 

Variable – Description

  • Item_Identifier – Unique product ID
  • Item_Weight – Weight of product
  • Item_Fat_Content – Whether the product is low fat or not
  • Item_Visibility – The % of total display area of all products in a store allocated to the particular product
  • Item_Type – The category to which the product belongs
  • Item_MRP – Maximum Retail Price (list price) of the product
  • Outlet_Identifier – Unique store ID
  • Outlet_Establishment_Year – The year in which the store was established
  • Outlet_Size – The size of the store in terms of ground area covered
  • Outlet_Location_Type – The type of city in which the store is located
  • Outlet_Type – Whether the outlet is just a grocery store or some sort of supermarket
  • Item_Outlet_Sales – Sales of the product in the particular store. This is the outcome variable to be predicted.

 

 

DiganP_1-1614892618184.png

 

 

We are going to drop the Item_Identifier and the Outlet_Identifier as we won’t be using them as predictor variables. Our target variable is still Item_Outlet_Sales, the sales of a product in a particular store.

 

We will be bringing in the BigMart data and splitting the columns. Dataframe X will be all our predictor variables, while dataframe y will be our target, Item_Outlet_Sales.
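As a rough illustration of that column split, here is a plain-Python sketch. The two sample rows and their values are invented for illustration and are not taken from the BigMart file, and the helper name split_predictors_and_target is hypothetical. The post itself does this step on a DataFrame.

```python
# Hypothetical sample rows; the values are invented for illustration only.
rows = [
    {"Item_Identifier": "A1", "Outlet_Identifier": "S1",
     "Item_Weight": 9.3, "Item_MRP": 249.8, "Item_Outlet_Sales": 3735.1},
    {"Item_Identifier": "B2", "Outlet_Identifier": "S2",
     "Item_Weight": 5.9, "Item_MRP": 48.3, "Item_Outlet_Sales": 443.4},
]

ID_COLUMNS = {"Item_Identifier", "Outlet_Identifier"}  # dropped, not used as predictors
TARGET = "Item_Outlet_Sales"                           # the value we want to predict


def split_predictors_and_target(records):
    """Return (X, y): predictor dicts without IDs or target, and the target values."""
    X = [
        {name: value for name, value in record.items()
         if name not in ID_COLUMNS and name != TARGET}
        for record in records
    ]
    y = [record[TARGET] for record in records]
    return X, y


X, y = split_predictors_and_target(rows)
print(X[0])  # predictor columns only
print(y)     # target values
```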

 

 

DiganP_2-1614892618189.png

 

 

The first step is to make sure that our physical, logical and semantic types are correct.

  • Physical Type – The actual data type of the incoming data.
  • Logical Type – How the DataFrame interprets the physical data type, for example whether a column of integers is treated as a number or as a category.
  • Semantic Tags – These are enhanced feature types that allow you to more thoroughly describe your data.

 

 

DiganP_3-1614892618193.png

 

 

Now we need to split the data into training and holdout sets so we can train the model and then gauge its performance on data it has not seen. We are going to do an 80/20 split: 80% of the dataset for the model to train on and 20% for it to test on. That gives us 6,818 records for training and 1,705 records for testing.
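To make the arithmetic concrete, here is a minimal standard-library sketch of a shuffled 80/20 split. It is not the call used in the post. The function name train_holdout_split, the seed, and the choice to round the 20% share up are assumptions made for this sketch.

```python
import math
import random


def train_holdout_split(n_records, holdout_fraction=0.2, seed=0):
    """Shuffle record indices and split them into training and holdout index lists."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Assumption: the holdout size is rounded up to a whole record.
    n_holdout = math.ceil(n_records * holdout_fraction)
    return indices[n_holdout:], indices[:n_holdout]


total_records = 6818 + 1705  # the two record counts quoted in the post
train_idx, holdout_idx = train_holdout_split(total_records)
print(len(train_idx), len(holdout_idx))
```

Shuffling before cutting matters because the rows in a file are often grouped (for example by store), and a plain head/tail cut would then give the model a biased view of the data.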

 

 

DiganP_4-1614892618194.png

 

 

EvalML has many options to configure the pipeline search. We designate the problem type (regression or classification) and optionally select an objective function. (If you don't select a specific objective function, the default for your chosen problem type will be used.) You can imagine a pipeline as nothing more than a sequence of operations to be applied to data, where each operation is either a transformation or a modeling algorithm. An objective function is nothing more than a metric that EvalML will seek to minimize or maximize. You can learn more about objective functions here.
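The idea of "a metric to minimize or maximize" can be sketched in a few lines. This is only an illustration of the concept, not EvalML's implementation. The Objective class, the candidate names, and the scores below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """A named metric plus the direction in which it improves (illustrative only)."""
    name: str
    greater_is_better: bool


def best_candidate(scores, objective):
    """Pick the candidate whose score is best under the objective's direction."""
    pick = max if objective.greater_is_better else min
    return pick(scores, key=scores.get)


# Hypothetical scores for three hypothetical candidate pipelines.
rmse_scores = {"candidate_a": 1250.0, "candidate_b": 1100.0, "candidate_c": 1400.0}
r2_scores = {"candidate_a": 0.52, "candidate_b": 0.58, "candidate_c": 0.45}

# An error metric is minimized; a goodness-of-fit metric is maximized.
print(best_candidate(rmse_scores, Objective("Root Mean Squared Error", False)))
print(best_candidate(r2_scores, Objective("R2", True)))
```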

 

EvalML has different objective functions available for regression and classification models.

 

Objective functions for regression include:

  • ExpVariance
  • MaxError
  • MedianAE
  • MSE
  • MAE
  • R2
  • Root Mean Squared Error
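For readers who want to see what several of these metrics measure, here is a small sketch using their textbook definitions with only the standard library. It is not EvalML's code, the function names are my own, and the sample values are made up. ExpVariance is left out to keep the sketch short.

```python
import math
from statistics import mean, median


def abs_errors(y_true, y_pred):
    """Absolute difference between each actual value and its prediction."""
    return [abs(t - p) for t, p in zip(y_true, y_pred)]


def mse(y_true, y_pred):
    """Mean squared error."""
    return mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean(abs_errors(y_true, y_pred))


def median_ae(y_true, y_pred):
    """Median absolute error, less sensitive to a few large misses."""
    return median(abs_errors(y_true, y_pred))


def max_error(y_true, y_pred):
    """Largest single absolute error."""
    return max(abs_errors(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up actual and predicted values.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
for metric in (mse, rmse, mae, median_ae, max_error, r2):
    print(metric.__name__, metric(actual, predicted))
```

Of these, only R2 improves as it gets larger; the others are errors, so lower is better.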

 

 

DiganP_5-1614892618196.png

 

 

Objective functions for classification include:

  • MCC Binary
  • Log Loss Binary
  • AUC
  • Precision
  • F1
  • Balanced Accuracy Binary
  • Accuracy Binary
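Most of the binary objectives above can be derived from the four confusion-matrix counts. The sketch below uses the textbook formulas with made-up labels; it is not EvalML's code and the function names are my own. AUC and Log Loss need predicted probabilities rather than hard labels, so they are omitted here.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true, y_pred):
    """Compute several binary classification metrics from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)        # true positive rate
    specificity = ratio(tn, tn + fp)   # true negative rate
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "balanced_accuracy": (recall + specificity) / 2,
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Made-up labels: 1 is the positive class.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
guesses = [1, 1, 0, 0, 0, 1, 0, 1]
print(confusion_counts(labels, guesses))
print(binary_metrics(labels, guesses))
```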

 

 

DiganP_6-1614892618197.png

 

 

For our regression problem, we are going to use Root Mean Squared Error as the objective function. The lower the score is, the better the pipeline.

 

 

DiganP_7-1614892618198.png

 

 

When we call search(), the search for the best pipeline will begin. There is no need to wrangle missing data or categorical variables as EvalML includes various preprocessing steps (like imputation, one-hot encoding and feature selection) to ensure you are getting the best results.
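To give a feel for what two of those preprocessing steps do, here is a standard-library sketch of mean imputation and one-hot encoding. It is a conceptual illustration under my own assumptions, not EvalML's components. The function names and the sample column values are hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def one_hot(values):
    """Encode a categorical column as one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


# Hypothetical column values, invented for illustration.
item_weight = [9.0, None, 12.0, 6.0]
outlet_size = ["Small", "High", "Medium", "Small"]

print(impute_mean(item_weight))
print(one_hot(outlet_size))
```

In a real workflow the fill value and the category list are learned from the training data only and then reused on the holdout data, so that nothing about the holdout set leaks into training.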

 

As long as your data is in a single table, EvalML can handle it. If not, you can reduce your data to a single table by using Featuretools and its EntitySets. You can find more information on pipeline components and how to integrate your own custom pipelines into EvalML here.

 

 

DiganP_8-1614892618199.png

 

 

DiganP_9-1614892618202.png

 

 

DiganP_10-1614892618207.png

 

 

After the search is finished, we can view all of the pipelines that were searched, ranked by score. Internally, EvalML performs cross-validation to score the pipelines. If it notices a high variance across cross-validation folds, it will warn you. EvalML also provides additional data checks that analyze your data to help you produce the best-performing pipeline. These data check utility functions help deal with problems such as overfitting, abnormal data and missing data.
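The cross-validation idea can be sketched as follows: cut the training records into folds, score on each fold in turn, and look at how much the fold scores disagree. This is a simplified illustration, not EvalML's internals. The fold count, the 0.2 threshold, and the fold scores are assumptions chosen for this sketch.

```python
from statistics import mean, pstdev


def k_fold_indices(n_records, k=3):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_records) if i < start or i >= stop]
        yield train, validation
        start = stop


def has_high_variance(fold_scores, threshold=0.2):
    """Flag fold scores whose spread relative to their mean exceeds the threshold."""
    average = mean(fold_scores)
    if average == 0:
        return False
    return pstdev(fold_scores) / abs(average) > threshold


folds = list(k_fold_indices(10, k=3))
print([len(validation) for _, validation in folds])

# Hypothetical per-fold RMSE values.
print(has_high_variance([1100.0, 1120.0, 1090.0]))  # folds agree closely
print(has_high_variance([500.0, 1100.0, 1700.0]))   # folds disagree a lot
```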

 

 

DiganP_11-1614892618211.png

 

 

If we are interested in getting more details about the pipeline, we can view a summary description using the id from the rankings table:

 

 

DiganP_12-1614892618214.png

 

 

 

DiganP_13-1614892618216.png

 

 

We can also view the pipeline parameters directly.

 

 

DiganP_14-1614892618218.png

 

 

EvalML pipelines support three main operations:

  • Fit – Fits each component on the provided training data, in order.
  • Predict – Computes the predictions of the component graph on the provided data.
  • Score – Computes the value of an objective on the provided data.
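The toy classes below show how fit, predict and score relate to one another when a pipeline is a sequence of transformers followed by an estimator. They are a conceptual sketch only. MeanImputer, LineRegressor and Pipeline are hypothetical names written for this illustration, not EvalML classes, and the data is made up.

```python
import math
from statistics import mean


class MeanImputer:
    """Transformer: learns the mean of observed values and fills gaps with it."""

    def fit(self, X, y=None):
        self.fill_ = mean([x for x in X if x is not None])
        return self

    def transform(self, X):
        return [self.fill_ if x is None else x for x in X]


class LineRegressor:
    """Estimator: least-squares line through a single feature."""

    def fit(self, X, y):
        x_bar, y_bar = mean(X), mean(y)
        variance = sum((x - x_bar) ** 2 for x in X)
        covariance = sum((x - x_bar) * (t - y_bar) for x, t in zip(X, y))
        self.slope_ = covariance / variance
        self.intercept_ = y_bar - self.slope_ * x_bar
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * x for x in X]


class Pipeline:
    """Applies each transformer in order, then the estimator."""

    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator

    def fit(self, X, y):
        for step in self.transformers:
            X = step.fit(X, y).transform(X)
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        for step in self.transformers:
            X = step.transform(X)  # reuse what was learned during fit
        return self.estimator.predict(X)

    def score(self, X, y, objective):
        return objective(y, self.predict(X))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


# Made-up single-feature data with one missing value in each set.
pipeline = Pipeline([MeanImputer()], LineRegressor()).fit([1.0, None, 3.0], [2.0, 4.0, 6.0])
print(pipeline.predict([4.0, None]))
print(pipeline.score([4.0, None], [9.0, 4.0], rmse))
```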

We can now select the best pipeline and score it on our holdout data:

 

 

DiganP_15-1614892618220.png

 

 

Using best_pipeline.graph() we can visualize the steps of this pipeline:

 

 

DiganP_16-1614892618222.png

 

 

We can also get the importance associated with each feature of the resulting pipeline:

 

 

DiganP_17-1614892618226.png

 

 

DiganP_18-1614892618231.png

 

Here are some extra links to help you explore what EvalML has to offer:

 

Comments
ArtApa
Alteryx

Thank you for this post Digan!