In this post, I'll show you an example of how you can use Featuretools to automatically generate new features for use in machine learning. This open-source package from Alteryx makes feature engineering easy and fast.
A feature can be described as a column or variable that represents a measurable piece of data that can be used for analysis. For example:
| Customer ID | Name | Income | Spend | Number of Items |
| --- | --- | --- | --- | --- |
| 1 | Mr. John Doe | 50,000 | 1,200 | 15 |
| 2 | Ms. Grace Desilva | 75,000 | 15,000 | 36 |
| 3 | Mrs. Alex Hall | 65,000 | 7,500 | 7 |
| 4 | Mr. Lopez | 125,000 | 20,000 | 5 |
Here, each column is a feature.
Feature engineering is the process of using domain knowledge to create new features from raw data to improve the performance of machine learning algorithms. Looking at the earlier example:
| Customer ID | Name | Income | Spend | Number of Items | Gender | Spend per Item |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Mr. John Doe | 50,000 | 1,200 | 15 | Male | 80 |
| 2 | Ms. Grace Desilva | 75,000 | 15,000 | 36 | Female | 416.67 |
| 3 | Mrs. Alex Hall | 65,000 | 7,500 | 7 | Female | 1,071.43 |
| 4 | Mr. Lopez | 125,000 | 20,000 | 5 | Male | 4,000 |
Here, we created two features based on the existing columns: Gender and Spend per Item. For Gender, we used the "Mr.," "Ms." and "Mrs." titles to infer the gender. For Spend per Item, we divided the spend by the number of items. These are just simple examples; in practice, when we have a lot of features, creating new ones can become complex and cumbersome to manage.
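As a point of reference, here is a minimal pandas sketch of how these two features might be built by hand. The dataframe and column names are simply the ones from the example table above.

import pandas as pd

# the example customer table from above
customers = pd.DataFrame({
    'Customer ID': [1, 2, 3, 4],
    'Name': ['Mr. John Doe', 'Ms. Grace Desilva', 'Mrs. Alex Hall', 'Mr. Lopez'],
    'Income': [50000, 75000, 65000, 125000],
    'Spend': [1200, 15000, 7500, 20000],
    'Number of Items': [15, 36, 7, 5],
})

# Gender: inferred from the title at the start of the name
title_to_gender = {'Mr.': 'Male', 'Ms.': 'Female', 'Mrs.': 'Female'}
customers['Gender'] = customers['Name'].str.split().str[0].map(title_to_gender)

# Spend per Item: spend divided by the number of items purchased
customers['Spend per Item'] = customers['Spend'] / customers['Number of Items']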
The performance of a predictive model is heavily dependent on the quality of the features in the dataset used to train it. If you can create new features that give the model more information to work with, performance will usually improve. Creating these features can take time: the user must explore the data, create visuals, brainstorm, and analyze the results. Some practitioners also see this as an art, as it often requires an understanding of the data and an ability to view it from different perspectives. The bottom line is that if you are good at it, you have a major advantage over the competition.
Background on the data for this example:
We are going to be looking at the BigMart dataset. There are 1,559 products and 10 stores. You can visualize it as two tables in one: an Item table and an Outlet table.
| Variable | Description |
| --- | --- |
| Item_Identifier | Unique product ID |
| Item_Weight | Weight of product |
| Item_Fat_Content | Whether the product is low fat or not |
| Item_Visibility | The % of total display area of all products in a store allocated to the particular product |
| Item_Type | The category to which the product belongs |
| Item_MRP | Maximum Retail Price (list price) of the product |
| Outlet_Identifier | Unique store ID |
| Outlet_Establishment_Year | The year in which the store was established |
| Outlet_Size | The size of the store in terms of ground area covered |
| Outlet_Location_Type | The type of city in which the store is located |
| Outlet_Type | Whether the outlet is just a grocery store or some sort of supermarket |
| Item_Outlet_Sales | Sales of the product in the particular store; this is the outcome variable to be predicted |
Featuretools is a framework to perform automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning.
There are three major components that we should be aware of: Entities and EntitySets, Deep Feature Synthesis (DFS), and Feature Primitives.
An EntitySet is a collection of entities and the relationships among them. An entity can be viewed as a single table or dataframe; in our case, the BigMart dataset.
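The calls below mirror the full script at the end of this post (it uses the pre-1.0 Featuretools API); data is the BigMart dataframe read into the Python Tool.

import featuretools as ft

# create an EntitySet named 'sales' and add the BigMart dataframe
# as an entity called 'bigmart', using the ID column as its index
es = ft.EntitySet(id='sales')
es.entity_from_dataframe(entity_id='bigmart', dataframe=data, index='ID')
es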
Here, we see that the EntitySet id is 'sales' and the entity id is 'bigmart.' It uses the data dataframe as the dataset, with ID as the index column.
You can also see that there are no relationships set yet. Let’s set a relationship using normalize_entity.
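Again mirroring the full script, the call below splits the outlet-level columns into a new entity and links the two entities on Outlet_Identifier.

# split the outlet-level columns into a new 'outlet' entity and
# create a relationship between 'bigmart' and 'outlet'
es.normalize_entity(base_entity_id='bigmart',
                    new_entity_id='outlet',
                    index='Outlet_Identifier',
                    additional_variables=['Outlet_Establishment_Year',
                                          'Outlet_Size',
                                          'Outlet_Location_Type',
                                          'Outlet_Type'])
es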
Ultimately what we have done is broken the bigger dataset table into two tables (BigMart and Outlet) and added a relationship. The BigMart table has 5,681 rows and 7 columns. The Outlet table has 10 rows and 5 columns.
The BIG difference is the relationship. We are using the Outlet_Identifier from both tables to add the relationship. This will play a key role in the generation of the new features.
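To confirm the split, you can list the variables that ended up in each entity (these lines also appear in the full script):

# inspect the columns assigned to each entity
es['bigmart'].variables
es['outlet'].variables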
DFS is an automated method for performing feature engineering on relational and temporal data. This helps us create new features from a single dataframe or multiple dataframes.
How does DFS work?
DFS creates features by applying Feature Primitives to the entity relationships in an EntitySet. These primitives are functions, typically aggregations and transformations, used to create new features from a dataset: for example, taking the mean of a variable, or computing the average time between events in event log data to predict fraudulent behavior or future customer engagement.
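If you want to browse the primitives that ship with Featuretools, ft.list_primitives() returns them as a dataframe. A quick, optional check (not part of the script below) might look like this:

import featuretools as ft

# list the built-in primitives, split into aggregation and transform types
primitives = ft.list_primitives()
print(primitives[primitives['type'] == 'aggregation'].head())
print(primitives[primitives['type'] == 'transform'].head())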
Typically, without automated feature engineering, a data scientist would write code to manually aggregate data and apply different statistical functions. We will use DFS to create new features automatically.
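The DFS call itself, as used in the full script, looks like this:

# run Deep Feature Synthesis against the 'bigmart' entity
feature_matrix, feature_names = ft.dfs(entityset=es,
                                       target_entity='bigmart',
                                       max_depth=2,
                                       verbose=1,
                                       n_jobs=3)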
Here, target_entity is the entity ID for which we want to create new features (bigmart). The max_depth parameter controls the complexity of the features being generated by stacking primitives. The n_jobs parameter helps with parallelism, leveraging multiple cores to compute the workload.
To see the newly created 36 features, we are going to run the feature_matrix.columns command.
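In the Python Tool, that is simply:

# list every feature DFS generated
feature_matrix.columns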
This process ran in 0.03 seconds. If a user were to do this manually, it would take much longer!
You might be wondering … how are these features being created? How do I trust them?
Featuretools has built-in transparency. We can use featuretools.graph_feature() and featuretools.describe_feature() to explain what each feature is and the steps Featuretools took to generate it.
Let's take a look at a generated feature, outlet.COUNT(bigmart).
We can use ft.describe_feature(feature) to see the description of a feature. The description explains what the feature is, and it can be improved by adding custom definitions. See Generating Feature Descriptions for more detail.
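In the full script, the feature is pulled out of feature_names by position (index 11 there; the exact index may vary between runs and versions) and then described:

# select the outlet.COUNT(bigmart) feature and describe it
feature = feature_names[11]
print(feature)
ft.describe_feature(feature)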
Let's see a graphical representation of this feature by using the ft.graph_feature(feature) function.
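As in the full script, the graph can be written out as an image; the output path here is just an example, and the graphviz package must be installed.

# render the lineage of the feature as a graph image
ft.graph_feature(feature, to_file="feature.png")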
This is the same thing as grouping by Outlet_Identifier, counting the rows in each group, and then joining that count back to the original dataset. This graph is a game changer, as it adds transparency to each feature created.
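For comparison, a rough manual equivalent of outlet.COUNT(bigmart) in pandas might look like the sketch below, assuming the original dataframe is named data:

# count the rows for each outlet, then join the count back onto every row
outlet_counts = data.groupby('Outlet_Identifier').size().rename('outlet.COUNT(bigmart)')
manual_feature = data.join(outlet_counts, on='Outlet_Identifier')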
From here, you can go ahead and build your model and predict the Item_Outlet_Sales variable using various algorithms.
Want to try this process yourself? Here's what your code might look like in the Python Tool within Designer.
#################################
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['graphviz'])
#################################
from ayx import Alteryx
import featuretools as ft
import numpy as np
import pandas as pd
data = Alteryx.read('#1')
data.head()
data.drop(['Item_Identifier'], axis=1, inplace=True)
#################################
# creating an entity set 'es'
es = ft.EntitySet(id = 'sales')
# adding a dataframe
es.entity_from_dataframe(entity_id = 'bigmart', dataframe = data, index = 'ID')
#################################
es.normalize_entity(base_entity_id='bigmart', new_entity_id='outlet', index = 'Outlet_Identifier', additional_variables = ['Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'])
#################################
es['bigmart'].variables
#################################
es['outlet'].variables
#################################
feature_matrix, feature_names = ft.dfs(entityset=es,
                                       target_entity='bigmart',
                                       max_depth=2,
                                       verbose=1,
                                       n_jobs=3)
#################################
feature_matrix.columns
#################################
feature = feature_names[11]
print(feature)
#################################
ft.describe_feature(feature)
#################################
feature_matrix.head()
Alteryx.write(feature_matrix,1)
ft.graph_feature(feature, to_file="D:\\Desktop Files\\autoML\\feature.png")
#################################
feature_matrix.head()