
Data Science Blog

Machine learning & data science for beginners and experts alike.
Sr. Data Science Content Engineer

Teamwork makes the dream work. Our results as analysts and data scientists aren’t worth much unless we are able to share them – whether it’s through a dashboard and interpretive story (and dance) or through the deployment of a model.

 

Deployment can mean a few different things, depending on the scale at which you need your model to be accessible. It can mean deploying a trained model with a container via Promote, or building an application from the ground up with custom code. It can mean publishing an Alteryx workflow with a model to Server and maybe scheduling that workflow to run periodically to create new predictions, or it can even be as simple as sharing a trained model with the person or team that needs to use it to make decisions in the future.

 

It's easy to embed a model developed and trained in Python or R into a Python tool or R tool in Alteryx so that it creates predictions for new data from within a workflow. In this blog post, I’m going to walk you through the process of embedding a trained model object into a Python tool.

 

 

Why Use Alteryx to Share Python or R Models?

 

If you’ve trained and serialized a Python model, you might be wondering: why not just send out a Python script that loads the model and creates predictions, and let the recipients run that?

 

This is a totally reasonable approach, but there are a few things you will need to consider. For example, does the machine you’re sending the script to have the right version of Python installed? And if the format of the incoming data changes, will the user be able to handle that themselves?

 

Embedding your custom model in an Alteryx workflow allows you to deliver the model and scoring process in a context that your peers or manager might be more familiar with. You can deliver it as a packaged workflow so that the users don’t need to modify any file paths in your script to reference the model object. These are a few of the benefits of “deploying” a model with an Alteryx workflow.

 

By packaging a model within an Alteryx workflow, you can share advanced data analytics with people in a language they are familiar with. They will be empowered to use the model and to modify the Alteryx workflow to accommodate changes they might need to make.

 

 

Training the Model

 

Using Python, I’ve developed a model that classifies mushrooms as poisonous or edible based on 22 attributes of each mushroom. This dataset is available for download from the UCI Machine Learning Repository.

 


 

 

I’ve attached an exported copy of the workflow used to develop the model. You can check out the preprocessing steps I used; the data is really clean coming in, so all I am doing with the Alteryx tools is adding column names from the data dictionary file.  You can also see the code used to develop a logistic regression that estimates whether a mushroom is poisonous based on how it looks. 
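As a rough sketch of what that training code looks like (using a toy stand-in for the mushroom data; the variable names mirror the scoring code shown later, but the data, features, and model settings here are purely illustrative):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for the mushroom data: categorical predictors, binary target
data = pd.DataFrame({
    'cap-color': ['red', 'white', 'red', 'green', 'white', 'red'] * 20,
    'odor':      ['foul', 'none', 'foul', 'none', 'none', 'foul'] * 20,
    'class':     [1, 0, 1, 0, 0, 1] * 20,  # 1 = poisonous, 0 = edible
})

# One-hot encode the categorical predictors
encoded_X = pd.get_dummies(data.drop(columns='class'))
y = data['class']

X_train, X_test, y_train, y_test = train_test_split(
    encoded_X, y, test_size=0.25, random_state=42)

# Fit a logistic regression to estimate whether a mushroom is poisonous
mushroomModel = LogisticRegression()
mushroomModel.fit(X_train, y_train)

# Prints the holdout accuracy
print(mushroomModel.score(X_test, y_test))
```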

 

 

 

 

[Screenshot: the model development workflow]

 

 

 

 

 

You can train custom Python and R models in Alteryx Designer. One reason to train a model outside of Alteryx is to leverage something specific to an environment, like GPUs. A common solution to the cost-prohibitive nature of GPUs is to use an AWS box with GPUs to train models instead of buying a dedicated machine. In this case, it is much easier to stand up Python and a Jupyter notebook (or other IDE) on the AWS box than to create and license a separate Alteryx installation just to use the Python tool.

 

Regardless of where I end up developing my code, I can easily save a model object as a file and embed it into a Python tool. To help future users (and myself) I tend to do a lot of data preprocessing in Alteryx. These Alteryx preprocessing steps can (and should) be included in the deployment workflow so that data fed into the model is processed in a consistent way.  

 

 

Saving a Model Object

 

In Python (or R), you’ll need to serialize a model to save it as a file. The pickle module in Python is designed for exactly that task. When you use the pickle functions, you convert your model to a byte format that can be written out of Python and loaded back in at a different time.

 

To create a pickle file, wrap a file path (as a string) in the open() function with ‘wb’ (write binary) as the mode argument, and pass the resulting file object to pickle.dump() along with the model object you want to save.

 

import pickle

modelFilename = "C:/filepathtoWorkflow/model.pkl"

pickle.dump(modelObject, open(modelFilename, 'wb'))

 

You’ll need to do this for any components of your process and code that need to be saved and included in a deployment. For this example, I needed to include both the pickled model object, as well as a pickled file of column names of the pandas data frame I used for the training data. This extra component is necessary because of the one-hot encoding preprocessing step I took to handle the categorical variables (i.e., all the predictor variables) for the mushroom data.
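For example, saving the training column names next to the model might look like this (encoded_X here is a toy stand-in for the one-hot encoded training frame; the file name matches the one used in the scoring code later on):

```python
import pickle
import pandas as pd

# Toy stand-in for the one-hot encoded training data
encoded_X = pd.get_dummies(pd.DataFrame({'cap-color': ['red', 'white', 'green']}))

# Save the training column names alongside the pickled model
pickle.dump(list(encoded_X.columns), open('encodedVar.pkl', 'wb'))

# Loading them back returns the same list of column names
trainCols = pickle.load(open('encodedVar.pkl', 'rb'))
```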

 

Going down a (hopefully) brief rabbit hole, if you haven’t heard of one hot encoding before, it is simply a process that converts a categorical variable (e.g., mushroom color: red, green, white) to binary (yes or no) representations of the value. This step is necessary for scikit-learn models to handle multi-class categorical variables correctly.  
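Using the mushroom-color example, a quick sketch of what one-hot encoding does:

```python
import pandas as pd

# A single categorical variable with three possible values...
df = pd.DataFrame({'cap-color': ['red', 'green', 'white', 'red']})

# ...becomes one binary (yes/no) column per value
encoded = pd.get_dummies(df)
print(list(encoded.columns))
# ['cap-color_green', 'cap-color_red', 'cap-color_white']
```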

 

 

 

[Image: one-hot encoding example]

 

 

 

To do this one-hot encoding, I used the pandas function get_dummies(), which creates a new column for each value in a provided variable. This works great, as long as all the possible values are present in the provided dataset. I feel good about assuming this for the training dataset, but I can’t guarantee it for any future data that needs to be scored. To make sure all the columns used in training are available at scoring time, I saved the column names of the training data and apply them to the data frame in the deployment workflow, setting the values of any missing columns to 0.
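That column-alignment step can be sketched with pandas’ reindex (the column names here are illustrative):

```python
import pandas as pd

# Column names saved from the training data (loaded from a pickle in practice)
trainCols = ['cap-color_green', 'cap-color_red', 'cap-color_white']

# New data to score happens to be missing the 'green' value entirely
new_data = pd.DataFrame({'cap-color': ['red', 'white']})
encoded = pd.get_dummies(new_data)

# Align to the training columns; any column absent from the new data is
# created and filled with 0, so the model sees the layout it was trained on
encoded = encoded.reindex(columns=trainCols, fill_value=0)
print(list(encoded.columns))
```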

 

… that was a longer detour than I wanted to take, but hopefully the example makes sense. Any data or information you need from your training script should be exported as a pickle object.

 

Important: save (or move) these pickle objects in the same directory your deployment workflow will be created in.

 

 

Embedding a Model in a Python Workflow to Share

 

Once the model object is trained and saved (along with any other Python objects you need for deployment) you can start a new workflow to share with all of your friends. I started this new workflow by copying everything over from the previous workflow, including the Python tool.

 

I then modified the code in the Python tool to create a process for scoring with a model, instead of training one. This typically means keeping pre-processing steps performed in Python but deleting the cells related to training or evaluating the model and replacing them with the function to create predictions with your model. It also means adding code to write your predictions back out to Alteryx.

 

# Code for scoring

# Predict values of new (one hot encoded) data using model
y_pred = mushroomModel.predict(encoded_X) 

# add a column to the original pandas data frame with predictions

data['predictions'] = y_pred
data['predictions'] = data['predictions'].map({0:"Edible", 1:"Poisonous"})

# Write back data frame to Alteryx

Alteryx.write(data, 1)

 

Once the skeleton of the scoring script is put together, you can work on bringing in the pickle files. 

 

We could hardcode the file path where the pickle files are saved as a string, and then use the pickle.load() function to read them in. However, a hardcoded file path isn’t going to work when the workflow runs on another machine. Instead, we are going to use an Engine Constant from Alteryx to find the file path of the workflow at run time, and use that file path to locate the pickle objects.

 

To do this:

 

1. Add a Text Input tool to your workflow. Create a column called “test” and put a placeholder value in the first row.

 

 

[Screenshot: Text Input tool configuration]

 

 

 

2.  Connect the Text Input tool to a Formula tool. In the Formula tool, create a new column called WorkflowPath_ (the name doesn't matter, it can be anything you want), and set the value to [Engine.WorkflowDirectory]. This will create a column with the file path of the workflow as the value in the first row.

 

 

[Screenshot: Formula tool configuration]

 

 

 

3. Connect the Formula tool to your Python tool. In the cell where you are reading in your input data stream, add a new line to read in your second input, which now contains the file path to your workflow. You can then create variables for the file path of each model object you need to read in (they should exist in the same directory as your workflow) by concatenating the read-in filepath with the file name. Using these file path strings, you can read in and deserialize the objects to use in your Python script. 

 

# Read in data
data = Alteryx.read("#1")

# Read in file path created with Formula tool
filepath = Alteryx.read("#2")

# Concatenate the file path with the file name
var_dir = filepath["WorkflowPath_"][0] + "encodedVar.pkl"
model_dir = filepath["WorkflowPath_"][0] + "mushroomLog.pkl"

# Read in and deserialize pickle objects
encodedVar = pickle.load(open(var_dir, 'rb'))
mushroomModel = pickle.load(open(model_dir, 'rb'))

 

With the pickle files read in and integrated, you should be able to run your workflow and have predictions returned from the Python tool. If you’re successful at this point, you’re ready to package up the workflow to share.

 

 

Packaging the Workflow for Sharing

 

We are going to do one last tricky thing to make sharing the workflow super easy: adding the pickle files to the workflow as workflow assets associated with the Python tool.

 

To do this:

 

1. Go to the Python tool’s configuration window.

 

2.  Click on the little box icon in the left-side toolbar in the configuration window.

 

[Screenshot: asset management icon in the Python tool configuration]

 

 

 

If you don’t see this icon, navigate to Options > User Settings > Edit User Settings and under the Advanced tab check the Display Asset Management in Properties Window Option. Click Apply.

 

 

[Screenshot: Asset Management setting in User Settings]

 

 

3. In the Assets window, select the Add File(s) option, and select any pickle files you want to be included with the workflow.

 

 

[Screenshot: Add File(s) option in the Assets window]

 

 

4. They should now appear under user-added assets. Ta-da!

 

 

[Screenshot: pickle files listed under user-added assets]

 

 

Now, when you export your workflow to share with another user, the pickle files will be automatically included in the exported file.

 

As a final step, go to Options > Export workflow to create an exported workflow (.yxzp file). In the Export Workflow modal window, you should see the file paths for your pickle file(s). Make sure they are selected and click save!

 

 

[Screenshot: Export Workflow window with pickle files selected]

 

 

 

Boom, you’re ready to share your exported workflow with anyone and everyone (running Alteryx version 2018.3 or later; otherwise, there will be no Python tool).

 


When they open the packaged workflow on their machine, the pickle files will automatically be extracted to the same directory the workflow is in. Because we are using the Engine Constant in the Formula tool, the Python tool knows exactly where to find them. It’s almost too easy... 🙂

Sydney Firmin

A geographer by training and a data geek at heart, Sydney joined the Alteryx team as a Customer Support Engineer in 2017. She strongly believes that data and knowledge are most valuable when they can be clearly communicated and understood. In her current role as a Sr. Data Science Content Engineer, she gets to spend her days doing what she loves best; transforming technical knowledge and research into engaging, creative, and fun content for the Alteryx Community.


Comments
Alteryx

Impressive!

Would definitely try this one out.. Thanks Sydney 

Alteryx Certified Partner

Absolutely fantastic. You are a gifted storyteller, making the complex easy in a very articulate way.

import pandas as pd

def load_dataset(filename, filetype='csv', header=True):
    '''
    Loads a dataset from file

    Parameters:
    -----------
    filename: str
        Name of data file
    filetype: str
        The type of data file (csv, tsv)

    Returns:
    --------
    DataFrame
        Dataset as pandas DataFrame
    '''
    data = []
    header_row = ''

    with open(filename) as in_file:
        # Read the file line by line into instance structure
        for line in in_file.readlines():

            # Skip comments
            if line.startswith("#"):
                continue

            # TSV file
            if filetype == 'tsv':
                sep = '\t'
            # CSV file
            elif filetype == 'csv':
                sep = ','
            # Neither = problem
            else:
                print('Invalid file type')
                exit()

            # The first line is the header (when header=True);
            # every later line is a data instance
            if header:
                header_row = line.strip().split(sep)
                header = False
            else:
                data.append(line.strip().split(sep))

    # Build a new dataframe of the data instance list of lists and return
    df = pd.DataFrame(data, columns=header_row)
    return df

 

A basic method for embedding a model in a workflow with Python. For additional information visit: https://bit.ly/2OUHlbu

ACE Emeritus

Well done! Quick question: when deployed, does the deployment process provide an isolated Python (or R) environment containing the required packages (including package versions)? Or would we need to adopt a best practice of ensuring we train on clients using the same packages (and versions) as those found on the target server?

 

Sr. Data Science Content Engineer

Hi @JohnJPS,

 

Currently, the Python tool and the R tool do not create isolated environments for individual workflows. There is a single virtual environment for the Python tool (and a single R instance for the R tool) that is used for the tools across all workflows on a given machine. You can add packages to the Python and R tool environments, but you will need to repeat the installation of the additional packages on every machine that you want to run your workflow on (e.g., the Server the workflow is deployed to).

 

There are some changes being made to the virtual environment system for the Python tool, so stay tuned.  

 

Thanks, 

 

Sydney