In this post, I'll show you an example of how you can use Featuretools to automatically generate new features for use in machine learning. This open-source package from Alteryx makes feature engineering easy and fast.
A feature can be described as a column or variable that represents a measurable piece of data that can be used for analysis. For example:
| Customer ID | Name | Income | Spend | Number of Items |
| --- | --- | --- | --- | --- |
| 1 | Mr. John Doe | 50,000 | 1,200 | 15 |
| 2 | Ms. Grace Desilva | 75,000 | 15,000 | 36 |
| 3 | Mrs. Alex Hall | 65,000 | 7,500 | 7 |
| 4 | Mr. Lopez | 125,000 | 20,000 | 5 |
Here, each column is a feature.
Feature engineering is the process of using domain knowledge to create new features from raw data to improve the performance of machine learning algorithms. Looking at the earlier example:
| Customer ID | Name | Income | Spend | Number of Items | Gender | Spend per Item |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Mr. John Doe | 50,000 | 1,200 | 15 | Male | 80 |
| 2 | Ms. Grace Desilva | 75,000 | 15,000 | 36 | Female | 416.67 |
| 3 | Mrs. Alex Hall | 65,000 | 7,500 | 7 | Female | 1,071.43 |
| 4 | Mr. Lopez | 125,000 | 20,000 | 5 | Male | 4,000 |
Here, we created two features based on the existing columns: Gender and Spend per Item. For Gender, we used the "Mr.," "Ms." and "Mrs." titles to infer the gender. For Spend per Item, we divided the spend by the number of items. These are just simple examples; in practice, when we have a lot of features, creating new ones can become complex and cumbersome to manage.
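As a point of reference, here is a minimal pandas sketch of how these two features might be built by hand. The dataframe and column names are simply the ones from the example table above.

import pandas as pd

# the example customer table from above
customers = pd.DataFrame({
    'Customer ID': [1, 2, 3, 4],
    'Name': ['Mr. John Doe', 'Ms. Grace Desilva', 'Mrs. Alex Hall', 'Mr. Lopez'],
    'Income': [50000, 75000, 65000, 125000],
    'Spend': [1200, 15000, 7500, 20000],
    'Number of Items': [15, 36, 7, 5],
})

# Gender: inferred from the title at the start of the name
title_to_gender = {'Mr.': 'Male', 'Ms.': 'Female', 'Mrs.': 'Female'}
customers['Gender'] = customers['Name'].str.split().str[0].map(title_to_gender)

# Spend per Item: spend divided by the number of items purchased
customers['Spend per Item'] = customers['Spend'] / customers['Number of Items']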
The performance of a predictive model is heavily dependent on the quality of the features in the dataset used to train it. If you can create new features that give the model more information to work with, performance will usually improve. Creating these features can take time: the user must explore the data, create visuals, brainstorm, and analyze the results. Some practitioners also see this as an art, as it often requires an understanding of the data and an ability to view it from different perspectives. The bottom line is that if you are good at it, you have a major advantage over the competition.
Background on the data for this example:
We are going to be looking at the BigMart dataset. There are 1,559 products and 10 stores. You can visualize it as two tables in one: an Item table and an Outlet table.
| Variable | Description |
| --- | --- |
| Item_Identifier | Unique product ID |
| Item_Weight | Weight of product |
| Item_Fat_Content | Whether the product is low fat or not |
| Item_Visibility | The % of total display area of all products in a store allocated to the particular product |
| Item_Type | The category to which the product belongs |
| Item_MRP | Maximum Retail Price (list price) of the product |
| Outlet_Identifier | Unique store ID |
| Outlet_Establishment_Year | The year in which the store was established |
| Outlet_Size | The size of the store in terms of ground area covered |
| Outlet_Location_Type | The type of city in which the store is located |
| Outlet_Type | Whether the outlet is just a grocery store or some sort of supermarket |
| Item_Outlet_Sales | Sales of the product in the particular store; this is the outcome variable to be predicted |
Featuretools is a framework to perform automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning.
There are three major components that we should be aware of: Entities and EntitySets, Deep Feature Synthesis (DFS), and Feature Primitives.
An EntitySet is a collection of entities and the relationships among them. An entity can be viewed as a single table or dataframe; in our case, the BigMart dataset.
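The calls below mirror the full script at the end of this post (it uses the pre-1.0 Featuretools API); data is the BigMart dataframe read into the Python Tool.

import featuretools as ft

# create an EntitySet named 'sales' and add the BigMart dataframe
# as an entity called 'bigmart', using the ID column as its index
es = ft.EntitySet(id='sales')
es.entity_from_dataframe(entity_id='bigmart', dataframe=data, index='ID')
es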
Here, we see that the EntitySet id is 'sales' and the entity id is 'bigmart.' It uses the data dataframe as the dataset, with ID as the index column.
You can also see that there are no relationships set yet. Let’s set a relationship using normalize_entity.
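Again mirroring the full script, the call below splits the outlet-level columns into a new entity and links the two entities on Outlet_Identifier.

# split the outlet-level columns into a new 'outlet' entity and
# create a relationship between 'bigmart' and 'outlet'
es.normalize_entity(base_entity_id='bigmart',
                    new_entity_id='outlet',
                    index='Outlet_Identifier',
                    additional_variables=['Outlet_Establishment_Year',
                                          'Outlet_Size',
                                          'Outlet_Location_Type',
                                          'Outlet_Type'])
es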
Ultimately what we have done is broken the bigger dataset table into two tables (BigMart and Outlet) and added a relationship. The BigMart table has 5,681 rows and 7 columns. The Outlet table has 10 rows and 5 columns.
The BIG difference is the relationship. We are using the Outlet_Identifier from both tables to add the relationship. This will play a key role in the generation of the new features.
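To confirm the split, you can list the variables that ended up in each entity (these lines also appear in the full script):

# inspect the columns assigned to each entity
es['bigmart'].variables
es['outlet'].variables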
DFS is an automated method for performing feature engineering on relational and temporal data. This helps us create new features from a single dataframe or multiple dataframes.
How does DFS work?
DFS creates features by applying Feature Primitives to the entity relationships in an EntitySet. These primitives are functions, typically aggregations and transformations, used to create new features from a dataset: for example, taking the mean of a variable, or computing the average time between events in event log data to predict fraudulent behavior or future customer engagement.
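If you want to browse the primitives that ship with Featuretools, ft.list_primitives() returns them as a dataframe. A quick, optional check (not part of the script below) might look like this:

import featuretools as ft

# list the built-in primitives, split into aggregation and transform types
primitives = ft.list_primitives()
print(primitives[primitives['type'] == 'aggregation'].head())
print(primitives[primitives['type'] == 'transform'].head())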
Typically, without automated feature engineering, a data scientist would write code to manually aggregate data and apply different statistical functions. We will use DFS to create new features automatically.
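The DFS call itself, as used in the full script, looks like this:

# run Deep Feature Synthesis against the 'bigmart' entity
feature_matrix, feature_names = ft.dfs(entityset=es,
                                       target_entity='bigmart',
                                       max_depth=2,
                                       verbose=1,
                                       n_jobs=3)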
Here, target_entity is the entity ID for which we want to create new features (bigmart). The max_depth parameter controls the complexity of the features being generated by stacking primitives. The n_jobs parameter helps with parallelism, leveraging multiple cores to compute the workload.
To see the newly created 36 features, we are going to run the feature_matrix.columns command.
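In the Python Tool, that is simply:

# list every feature DFS generated
feature_matrix.columns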
This process ran in 0.03 seconds. If a user were to do this manually, it would take much longer!
You might be wondering … how are these features being created? How do I trust them?
Featuretools has built-in transparency. We can use featuretools.graph_feature() and featuretools.describe_feature() to explain what each feature is and the steps Featuretools took to generate it.
Let's take a look at a generated feature, outlet.COUNT(bigmart).
We can use ft.describe_feature(feature) to see the description of a feature. The description explains what the feature is, and it can be improved by adding custom definitions. See Generating Feature Descriptions for more detail.
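In the full script, the feature is pulled out of feature_names by position (index 11 there; the exact index may vary between runs and versions) and then described:

# select the outlet.COUNT(bigmart) feature and describe it
feature = feature_names[11]
print(feature)
ft.describe_feature(feature)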
Let's see a graphical representation of this feature by using the ft.graph_feature(feature) function.
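As in the full script, the graph can be written out as an image; the output path here is just an example, and the graphviz package must be installed.

# render the lineage of the feature as a graph image
ft.graph_feature(feature, to_file="feature.png")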
This is the same thing as grouping by Outlet_Identifier, counting the rows in each group, and then joining that count back to the original dataset. This graph is a game changer, as it adds transparency to each feature created.
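For comparison, a rough manual equivalent of outlet.COUNT(bigmart) in pandas might look like the sketch below, assuming the original dataframe is named data:

# count the rows for each outlet, then join the count back onto every row
outlet_counts = data.groupby('Outlet_Identifier').size().rename('outlet.COUNT(bigmart)')
manual_feature = data.join(outlet_counts, on='Outlet_Identifier')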
From here, you can go ahead and build your model and predict the Item_Outlet_Sales variable using various algorithms.
Want to try this process yourself? Here's what your code might look like in the Python Tool within Designer.
#################################
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['graphviz'])
#################################
from ayx import Alteryx
import featuretools as ft
import numpy as np
import pandas as pd
data = Alteryx.read('#1')
data.head()
data.drop(['Item_Identifier'], axis=1, inplace=True)
#################################
# creating an entity set 'es'
es = ft.EntitySet(id = 'sales')
# adding a dataframe
es.entity_from_dataframe(entity_id = 'bigmart', dataframe = data, index = 'ID')
#################################
es.normalize_entity(base_entity_id='bigmart', new_entity_id='outlet', index = 'Outlet_Identifier', additional_variables = ['Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'])
#################################
es['bigmart'].variables
#################################
es['outlet'].variables
#################################
feature_matrix, feature_names = ft.dfs(entityset=es,
                                       target_entity='bigmart',
                                       max_depth=2,
                                       verbose=1,
                                       n_jobs=3)
#################################
feature_matrix.columns
#################################
feature = feature_names[11]
print(feature)
#################################
ft.describe_feature(feature)
#################################
feature_matrix.head()
Alteryx.write(feature_matrix,1)
ft.graph_feature(feature, to_file="D:\\Desktop Files\\autoML\\feature.png")
#################################
feature_matrix.head()