
Data Science Blog

Machine learning & data science for beginners and experts alike.
Alteryx

For years, data scientists have struggled to deploy their models quickly enough to keep them from becoming obsolete. Traditionally, models must be manually recoded for production, a time-intensive process that can take months, if not longer. Alteryx Promote solves this deployment challenge by allowing data scientists to quickly turn complex Machine Learning models into a RESTful API from the development environment of their choice.

 

In the example below, I am working with a generic marketing campaign dataset for a bank. The goal is to predict which customers are likely to respond to a marketing campaign.  Using Jupyter Notebook, now integrated into Alteryx Designer, I can quickly load and visualize my data.

 

import pandas as pd

data = pd.read_csv('bank.csv', sep=';')
data.head()

   age          job  marital  education default  balance  ...  duration  campaign  pdays  previous poutcome   y
0   30   unemployed  married    primary      no     1787  ...        79         1     -1         0  unknown  no
1   33     services  married  secondary      no     4789  ...       220         1    339         4  failure  no
2   35   management   single   tertiary      no     1350  ...       185         1    330         1  failure  no
3   30   management  married   tertiary      no     1476  ...       199         4     -1         0  unknown  no
4   59  blue-collar  married  secondary      no        0  ...       226         1     -1         0  unknown  no

 

Since scikit-learn, the popular Python machine learning package, does not accept character (string) variables, I will convert these variables to numeric by creating dummy variables. For the variables marital and education, the code below creates one binary variable for each level of the variable. For example, if the individual is divorced, then marital_divorced = 1; if the individual is not divorced, then marital_divorced = 0.

 

#Create Dummy Variables
data = pd.concat([data, pd.get_dummies(data[['marital','education']], dtype='int64')], axis=1)
data['y_num']=(data['y']=='yes').astype(dtype='int64')

new_cols = ['marital_divorced', 'marital_married',
'marital_single', 'education_primary',
'education_secondary','education_tertiary',
'education_unknown']

data[new_cols].head()

   marital_divorced  marital_married  marital_single  education_primary  education_secondary  education_tertiary  education_unknown
0                 0                1               0                  1                    0                   0                  0
1                 0                1               0                  0                    1                   0                  0
2                 0                0               1                  0                    0                   1                  0
3                 0                1               0                  0                    0                   1                  0
4                 0                1               0                  0                    1                   0                  0

 

With the preprocessing complete, I am ready to train my model. Using the train_test_split function in scikit-learn, I create a training set with 70% of my data and a test set with the remaining 30%. After training a gradient boosting model with XGBoost, I scored the test set and obtained a misclassification rate of 11.2%.

 

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

#Create Training and Validation Sample
x_col = ['age','duration','balance', 'marital_divorced', 'marital_married',
         'marital_single', 'education_primary', 'education_secondary',
         'education_tertiary', 'education_unknown']

y_col = ['y_num']

x_train, x_test, y_train, y_test = train_test_split(data[x_col], data[y_col], test_size=.3)

#Train Model
gb = XGBClassifier()
gb.fit(x_train, y_train)
pred = gb.predict(x_test)

#Misclassification rate on the test set
error = 1 - accuracy_score(y_test, pred)
print("XGBoost Error: ", round(error, 4))

XGBoost Error:  0.1127

 

After finalizing my model, I want to serialize its state to a binary file (a pickle). The binary file can be loaded in other Python sessions to score new data without having to re-train the model.

 

from sklearn.externals.joblib import dump

filename = 'objects/bank_xgboost.sav'
dump(gb, filename)

['objects/bank_xgboost.sav']
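The dump/load round trip can be sketched with a small stand-in model (a scikit-learn LogisticRegression here, to keep the demo free of the xgboost dependency; the modern import path is plain joblib, which older scikit-learn releases shipped as sklearn.externals.joblib):

```python
import os
import tempfile

import numpy as np
from joblib import dump, load  # older scikit-learn exposed this as sklearn.externals.joblib
from sklearn.linear_model import LogisticRegression

# Fit a tiny stand-in model on toy data
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# Round-trip: dump to disk, then load as a fresh Python session would
path = os.path.join(tempfile.mkdtemp(), "demo.sav")
dump(model, path)
restored = load(path)

# The restored model scores identically -- no retraining required
assert (restored.predict(X) == model.predict(X)).all()
```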

 

After creating the model, the real challenge is deploying it effectively at scale. Ideally, the business would like to integrate this model into a set of internal applications, including R Shiny apps, iOS applications, and web browsers. Alteryx Promote allows us to turn this predictive model into a RESTful API that can be called from any application that understands REST. The model no longer needs to be re-coded into multiple languages, and client applications can stay lightweight since Alteryx Promote handles the scoring.

 

Alteryx Promote uses Docker Swarm to create a cluster of machines that can handle large volumes of requests simultaneously. A load balancer routes each arriving request to one machine in the cluster, which scores the data and returns the result to the requesting application.

 

[Image: promote_crop.png — Promote cluster architecture]

To deploy this model in Alteryx Promote, I first create a Python script called promote_xgb.py that instructs Promote how to score my model. After loading the necessary libraries, I provide the username, API key, and environment values needed to access the Promote environment. Using the Promote function, I create a connection to my Alteryx Promote cluster, represented by the variable p.

 

import pandas as pd
import promote
from sklearn.externals.joblib import load

#Connection parameters
username = "username"
api_key = "api_key"
env = "env"

#Connection to Server
p = promote.Promote(username, api_key, env)

 

Next, the script loads the pickled file into memory. Note that the .sav file must be in a subdirectory called objects. From there, I create a dictionary containing test data to score the model. This data will be used to test the results when deploying the model to Promote.

 

filename = './objects/bank_xgboost.sav'
model = load(filename)

test_data = {
  "age": 30,
  "duration": 79,
  "balance": 1787,
  "marital": "married",
  "education": "primary"
}

Lastly, I create a scoring function called bank_xgb_score that defines how my model is scored. Every time Alteryx Promote receives a request, this function is used to return a result. It accomplishes four important tasks:

 

  1. Converts the incoming JSON into a pandas DataFrame for scoring
  2. Converts character variables into numeric dummy variables
  3. Adds missing dummy columns so that all model features are present
  4. Returns a prediction
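Step 3 is the subtle one: a single incoming record only produces dummy columns for the category levels it actually contains, so the scoring function reindexes the encoded frame against the full column list, filling the missing levels with 0. A minimal sketch of that trick on one record:

```python
import pandas as pd

# One incoming record: get_dummies alone would create only
# marital_married and education_primary
record = pd.DataFrame([{"marital": "married", "education": "primary"}])

# Every dummy column the model was trained on
dummy_cols = ['marital_divorced', 'marital_married', 'marital_single',
              'education_primary', 'education_secondary',
              'education_tertiary', 'education_unknown']

enc = pd.get_dummies(record, dtype='int64')
enc = enc.reindex(columns=dummy_cols, fill_value=0)

print(list(enc.columns))     # all seven columns, in a fixed order
print(enc.iloc[0].tolist())  # [0, 1, 0, 1, 0, 0, 0]
```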

 

#Scoring function: df is the parsed JSON request (a dict of raw inputs)
def bank_xgb_score(df):

    #Create a one-row DataFrame from the request
    data_keys = list(df.keys())
    data_values = list(df.values())
    data = pd.DataFrame(data=[data_values], columns=data_keys)

    #Independent variables that must be present to score
    x_col = ['age','duration','balance', 'marital_divorced', 'marital_married',
             'marital_single', 'education_primary', 'education_secondary',
             'education_tertiary', 'education_unknown']

    #Create dummy variables, then add any missing levels filled with 0
    data_enc = pd.get_dummies(data[['marital','education']], dtype='int64')
    data_enc = data_enc.reindex(columns=x_col[3:], fill_value=0)
    data_final = pd.concat([data, data_enc], axis=1)

    prediction = (model.predict_proba(data_final[x_col])[:,1]).astype('float64')
    return {"P_yes": round(prediction[0], 4)}

 

With the scoring function created, I can use the deploy function to publish my model to Alteryx Promote. This call creates a model named bankxgb in the Alteryx Promote UI.

 

p.deploy("bankxgb", bank_xgb_score, testdata=test_data)

 

Before executing this script, we need to create a file called requirements.txt listing the Python libraries needed to score the model. Promote installs these dependencies when it initializes the Docker containers.

 

promote
pandas
scikit-learn
xgboost

 

After this preparation work, make sure your project folder contains an objects directory with the pickled model file, the Python (.py) deployment script, and a requirements.txt file listing your Python dependencies:

 

bank-model/
├── objects
│   └── bank_xgboost.sav
├── promote_xgb.py 
└── requirements.txt

 

Upon executing promote_xgb.py, the log shows that the HTTP request was sent successfully. We can then open the Alteryx Promote UI to check the model's status.

 

[Image: model_Deployment.png — deployment log output]

 

After finding the model in the landing screen, we can see that the deployment was successful. We can also see useful metadata about the model, such as recent modifications, name, and replicas.

 

[Image: Login_Screen.png — model overview in the Promote UI]

 

We now need to test the model to ensure it returns sensible results. In the Execute tab, we can take the sample data from the promote_xgb.py file and test the model. Promote uses the scoring function to process the raw user input and return a prediction. Here, this individual has a 3.6% chance of responding to the marketing campaign, so they are probably not worth targeting.

 

[Image: Results.png — prediction returned in the Execute tab]

 

With a working model, we can now call this RESTful API endpoint from any environment that supports REST. Alteryx Promote provides working samples for popular options including Python, R, and Node.js, making it very easy to use this model in any enterprise or web application. Using the R code, we can score a different individual on the fly.

 

[Image: R_Code.png — sample R code for calling the endpoint]
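The same call can be sketched in Python with the requests library. The endpoint URL below is a hypothetical pattern (copy the real URL from your Promote UI), and the basic-auth scheme is an assumption; the POST is wrapped in a function so the sketch does not require a live server:

```python
import json

# Hypothetical endpoint and credentials -- copy the real URL from the Promote UI
PROMOTE_URL = "https://promote.example.com/username/models/bankxgb/predict"
USERNAME = "username"
API_KEY = "api_key"

# Raw record in the same shape the scoring function expects
record = {"age": 30, "duration": 79, "balance": 1787,
          "marital": "married", "education": "primary"}
payload = json.dumps(record)

def score(url=PROMOTE_URL):
    """POST the record to the model endpoint and return the JSON prediction."""
    import requests  # third-party: pip install requests
    resp = requests.post(url,
                         auth=(USERNAME, API_KEY),
                         headers={"Content-Type": "application/json"},
                         data=payload)
    resp.raise_for_status()
    return resp.json()  # e.g. {"P_yes": 0.036}
```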

 

In conclusion, Alteryx Promote turns complicated Machine Learning models into a simple RESTful API that virtually any system can call. While this example shows a custom Python model, Alteryx Promote allows you to deploy many types of models, including R, TensorFlow, and H2O. Enterprises can now spend more time building models that provide value to the organization.

Andrew Kramer

Andrew Kramer is a Solutions Architect at Alteryx focusing on Analytics, Machine Learning, and statistical programming. He works daily with Alteryx customers to help them do more with Analytics.
