
DiganP
Alteryx Alumni (Retired)

 

DiganP_0-1614885961939.png

 

 

 

In this post, I'll show you an example of how you can use Featuretools to automatically generate new features for use in machine learning. This open-source package from Alteryx makes feature engineering easy and fast.

 

 

What is a Feature?

A feature can be described as a column or variable that represents a measurable piece of data that can be used for analysis. For example:

 

 

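| Customer ID | Name              | Income  | Spend  | Number of Items |
|-------------|-------------------|---------|--------|-----------------|
| 1           | Mr. John Doe      | 50,000  | 1,200  | 15              |
| 2           | Ms. Grace Desilva | 75,000  | 15,000 | 36              |
| 3           | Mrs. Alex Hall    | 65,000  | 7,500  | 7               |
| 4           | Mr. Lopez         | 125,000 | 20,000 | 5               |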

 

 

Here, each column is a feature.

 

What is Feature Engineering?

Feature engineering is the process of using domain knowledge to create new features from raw data in order to improve the performance of machine learning algorithms. Returning to the earlier example:

 

 

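| Customer ID | Name              | Income  | Spend  | Number of Items | Gender | Spend per Item |
|-------------|-------------------|---------|--------|-----------------|--------|----------------|
| 1           | Mr. John Doe      | 50,000  | 1,200  | 15              | Male   | 80             |
| 2           | Ms. Grace Desilva | 75,000  | 15,000 | 36              | Female | 416.67         |
| 3           | Mrs. Alex Hall    | 65,000  | 7,500  | 7               | Female | 1071.43        |
| 4           | Mr. Lopez         | 125,000 | 20,000 | 5               | Male   | 4000           |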

 

 

Here, we created two features based on the existing columns: Gender and Spend per Item. For Gender, we used the titles “Mr.,” “Ms.,” and “Mrs.” to infer the gender. For Spend per Item, we divided the spend by the number of items (for the last customer, 20,000 / 5 = 4,000). These are just simple examples. In practice, when we have a lot of features, creating new ones can become complex and cumbersome to manage.
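As a minimal sketch of how these two columns could be derived by hand, assuming the customer table is loaded into a pandas DataFrame. The `title_to_gender` lookup is a hypothetical helper introduced only for this illustration:

```python
import pandas as pd

customers = pd.DataFrame({
    "Customer ID": [1, 2, 3, 4],
    "Name": ["Mr. John Doe", "Ms. Grace Desilva", "Mrs. Alex Hall", "Mr. Lopez"],
    "Income": [50000, 75000, 65000, 125000],
    "Spend": [1200, 15000, 7500, 20000],
    "Number of Items": [15, 36, 7, 5],
})

# Hypothetical lookup from title to assumed gender (an assumption, not ground truth)
title_to_gender = {"Mr.": "Male", "Ms.": "Female", "Mrs.": "Female"}

# The title is the first whitespace-separated token of the name
customers["Gender"] = customers["Name"].str.split().str[0].map(title_to_gender)

# Spend divided by the number of items, rounded to two decimals
customers["Spend per Item"] = (
    customers["Spend"] / customers["Number of Items"]
).round(2)

print(customers[["Name", "Gender", "Spend per Item"]])
```

Even for two features, the logic is specific to this table. That is the part that gets harder to maintain as the number of columns grows.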

 

Why is Feature Engineering Required?

The performance of a predictive model depends heavily on the quality of the features in the dataset used to train it. If you can create new features that give the model more relevant information to work with, performance will often improve. Creating these features can take time: the user must explore the data, create visuals, brainstorm, and analyze the results. Some users also see this as an art, as it often requires an understanding of the data and an ability to view it from different perspectives. The bottom line is that if you are good at it, you have a major advantage over the competition.

 

Background on the data for this example:

We are going to look at the BigMart dataset, which covers 1,559 products across 10 stores. You can think of it as two tables in one: an Item table and an Outlet table.

 

| Variable                  | Description                                                                                  |
|---------------------------|----------------------------------------------------------------------------------------------|
| Item_Identifier           | Unique product ID                                                                            |
| Item_Weight               | Weight of product                                                                            |
| Item_Fat_Content          | Whether the product is low fat or not                                                        |
| Item_Visibility           | The % of total display area of all products in a store allocated to the particular product  |
| Item_Type                 | The category to which the product belongs                                                    |
| Item_MRP                  | Maximum Retail Price (list price) of the product                                             |
| Outlet_Identifier         | Unique store ID                                                                              |
| Outlet_Establishment_Year | The year in which store was established                                                      |
| Outlet_Size               | The size of the store in terms of ground area covered                                        |
| Outlet_Location_Type      | The type of city in which the store is located                                               |
| Outlet_Type               | Whether the outlet is just a grocery store or some sort of supermarket                       |
| Item_Outlet_Sales         | Sales of the product in the particular store. This is the outcome variable to be predicted.  |

 

 

 

DiganP_1-1614885961943.png

 

 

 

Automating Feature Engineering – What is Featuretools?

 

DiganP_2-1614885961946.jpeg

 

 

In alignment with Analytic Process Automation (APA), Featuretools is a framework to perform automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning.

 

There are three major components that we should be aware of: entity sets, Deep Feature Synthesis (DFS), and feature primitives.

 

An entity set is a collection of entities and the relationships among them. An entity can be viewed as a single table (dataset).

 

 

DiganP_3-1614885961947.png

 

 

Here, we see that the entity set is named 'sales' and the entity ID is 'bigmart'. It uses the 'data' dataframe as the dataset, with ID as the index column.

 

You can also see that there are no relationships set yet. Let’s set a relationship using normalize_entity.

 

 

DiganP_4-1614885961949.png

 

 

What we have done is split the larger table into two tables (BigMart and Outlet) and added a relationship between them. The BigMart table has 5,681 rows and 7 columns. The Outlet table has 10 rows and 5 columns.

 

The BIG difference is the relationship. We use the Outlet_Identifier column, which appears in both tables, to define it. This will play a key role in the generation of the new features.
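To make the split concrete, here is a rough pandas-only sketch of what this kind of normalization amounts to. It does not call Featuretools. The sample rows and the variable names (`data`, `outlet`, `bigmart`, `rebuilt`) are made up for illustration and are not taken from the actual BigMart data:

```python
import pandas as pd

# Tiny made-up sample in the shape of the flat table
data = pd.DataFrame({
    "ID": ["A1", "A2", "A3", "A4"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
    "Outlet_Identifier": ["OUT049", "OUT018", "OUT049", "OUT010"],
    "Outlet_Size": ["Medium", "Medium", "Medium", "Small"],
    "Outlet_Type": ["Supermarket Type1", "Supermarket Type2",
                    "Supermarket Type1", "Grocery Store"],
})

outlet_cols = ["Outlet_Identifier", "Outlet_Size", "Outlet_Type"]

# Parent table: one row per outlet
outlet = data[outlet_cols].drop_duplicates().reset_index(drop=True)

# Child table: outlet attributes removed, but the key column is kept
bigmart = data.drop(columns=["Outlet_Size", "Outlet_Type"])

# The shared key is the relationship: joining on it rebuilds the flat table
rebuilt = bigmart.merge(outlet, on="Outlet_Identifier", how="left")

print(outlet)
print(bigmart)
```

The key column stays in both tables, which is what lets outlet-level information flow back to each item row later on.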

 

 

DiganP_5-1614885961953.png

 

 

Deep Feature Synthesis (DFS) and Feature Primitives

DFS is an automated method for performing feature engineering on relational and temporal data. This helps us create new features from a single dataframe or multiple dataframes.

 

How does DFS work?

DFS creates features by applying feature primitives across the entity relationships in an EntitySet. Primitives are functions used to create new features in a dataset, such as the mean of a variable, or the average time between events in event log data, which can help predict fraudulent behavior or future customer engagement.
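As an illustration of the kind of calculation a primitive stands for, here is a hand-written pandas sketch of "average time between events" per customer. It does not use Featuretools, and the event log and the column names (`customer_id`, `timestamp`) are invented for the example:

```python
import pandas as pd

# Made-up event log: one row per event
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2021-01-01 10:00", "2021-01-01 10:10", "2021-01-01 10:30",
        "2021-01-02 09:00", "2021-01-02 10:00",
    ]),
})

# Order events within each customer before taking differences
events = events.sort_values(["customer_id", "timestamp"])

# Gap to the previous event of the same customer, in seconds
# (the first event of each customer has no previous event, so it is NaN)
gap_seconds = (
    events.groupby("customer_id")["timestamp"].diff().dt.total_seconds()
)

# Average gap per customer; NaN values are skipped by mean()
avg_gap_seconds = (
    gap_seconds.groupby(events["customer_id"])
    .mean()
    .rename("avg_seconds_between_events")
)

print(avg_gap_seconds)
```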

 

Typically, without automated feature engineering, a data scientist would write code to manually aggregate data and apply different statistical functions. We will use DFS to create new features automatically.
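For comparison, a manual version of that aggregation work might look like the following pandas sketch. The sample values are made up, and the feature names only imitate the style of Featuretools output; they are not produced by the library:

```python
import pandas as pd

# Made-up sample rows
sales = pd.DataFrame({
    "Outlet_Identifier": ["OUT049", "OUT049", "OUT018", "OUT018"],
    "Item_MRP": [100.0, 200.0, 50.0, 150.0],
    "Item_Weight": [9.0, 11.0, 5.0, 7.0],
})

numeric_cols = ["Item_MRP", "Item_Weight"]
aggregations = ["mean", "max", "min"]

# Every aggregation applied to every numeric column, per outlet
per_outlet = sales.groupby("Outlet_Identifier")[numeric_cols].agg(aggregations)

# Flatten the (column, aggregation) pairs into names such as "MEAN(Item_MRP)"
per_outlet.columns = [f"{agg.upper()}({col})" for col, agg in per_outlet.columns]

# Join the outlet-level statistics back onto each row
features = sales.merge(
    per_outlet, left_on="Outlet_Identifier", right_index=True, how="left"
)

print(features.columns.tolist())
```

Two columns and three aggregations already produce six new features, and the count grows quickly as columns, aggregations, and related tables are added.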

 

 

DiganP_6-1614885961954.png

 

 

Here, target_entity is the ID of the entity for which we want to create new features (BigMart). The max_depth parameter controls the complexity of the features generated by stacking primitives. The n_jobs parameter enables parallelism, using multiple cores to compute the workload.

 

To see the newly created 36 features, we are going to run the feature_matrix.columns command.

 

 

DiganP_7-1614885961958.png

 

 

This process ran in 0.03 seconds. Doing the same thing manually would take much longer!

 

 

Understanding Feature Output

You might be wondering … how are these features being created? How do I trust them?

 

Featuretools has built-in transparency. We can use featuretools.graph_feature() and featuretools.describe_feature() to help explain what each feature is and the steps Featuretools took to generate it.

 

Let's take a look at a generated feature, outlet.COUNT(bigmart).

 

 

DiganP_8-1614885961959.png

 

 

We can use ft.describe_feature(feature) to see a description of the feature. The description explains what the feature is, and it can be improved by adding custom definitions. See Generating Feature Descriptions for more detail.

 

Let's see a graphical representation of this feature by using the ft.graph_feature(feature) function.

 

 

DiganP_9-1614885961962.png

 

 

This is the same thing as grouping by Outlet_Identifier, counting the rows in each group, and then joining the counts back to the original dataset. This graph is a game changer, as it adds transparency to each feature created.
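A hand-written pandas equivalent of this feature might look like the sketch below. The rows are made up, and the column name is simply chosen to match the feature name shown above:

```python
import pandas as pd

# Made-up sample rows
bigmart = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Outlet_Identifier": ["OUT049", "OUT018", "OUT049", "OUT010", "OUT049"],
})

# Group by outlet and count the rows in each group
counts = (
    bigmart.groupby("Outlet_Identifier")
    .size()
    .rename("outlet.COUNT(bigmart)")
    .reset_index()
)

# Join the counts back onto the original rows
with_count = bigmart.merge(counts, on="Outlet_Identifier", how="left")

# The same result in one step, without an explicit join
shortcut = bigmart.groupby("Outlet_Identifier")["ID"].transform("count")

print(with_count)
```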

 

From here, you can go ahead and build your model and predict the Item_Outlet_Sales variable using various algorithms.

 

Want to try this process yourself? Here's what your code might look like in the Python Tool within Designer.

 

 

 

#################################
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)

from ayx import Package

#Package.installPackages(['graphviz'])

#################################
from ayx import Alteryx

import featuretools as ft
import numpy as np
import pandas as pd

data = Alteryx.read('#1')
data.head()
data.drop(['Item_Identifier'], axis=1, inplace=True)

#################################
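# creating an entity set 'es'
# (entity_from_dataframe is the pre-1.0 Featuretools API; 1.0+ renamed it add_dataframe)
es = ft.EntitySet(id = 'sales')

# adding a dataframe; 'ID' must be a column with a unique value per row
es.entity_from_dataframe(entity_id = 'bigmart', dataframe = data, index = 'ID')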

#################################
es.normalize_entity(base_entity_id='bigmart', new_entity_id='outlet', index = 'Outlet_Identifier', additional_variables = ['Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'])

#################################
es['bigmart'].variables

#################################
es['outlet'].variables

#################################
feature_matrix, feature_names = ft.dfs(entityset=es, 
     target_entity = 'bigmart',
     max_depth = 2,
     verbose = 1,
     n_jobs = 3)

#################################
feature_matrix.columns

#################################
feature = feature_names[11]
print(feature)

#################################
ft.describe_feature(feature)

#################################
feature_matrix.head()
Alteryx.write(feature_matrix,1)
ft.graph_feature(feature, to_file="D:\\Desktop Files\\autoML\\feature.png")

#################################
feature_matrix.head()