AndrewKramer
Alteryx Alumni (Retired)

For years, data scientists have struggled to deploy their models in a timely manner before they become obsolete.  Traditionally, models must be manually recoded, a time-intensive process that can take months, if not longer, to complete. Alteryx Promote solves this model deployment challenge by allowing data scientists to quickly turn complex Machine Learning models into a RESTful API from the development environment of their choice.

 

In the example below, I am working with a generic marketing campaign dataset for a bank. The goal is to predict which customers are likely to respond to a marketing campaign.  Using Jupyter Notebook, now integrated into Alteryx Designer, I can quickly load and visualize my data.

 

import pandas as pd

data = pd.read_csv('bank.csv', sep=';')
data.head()

   age          job  marital  education  default  balance  ...  duration  campaign  pdays  previous  poutcome   y
0   30   unemployed  married    primary       no     1787  ...        79         1     -1         0   unknown  no
1   33     services  married  secondary       no     4789  ...       220         1    339         4   failure  no
2   35   management   single   tertiary       no     1350  ...       185         1    330         1   failure  no
3   30   management  married   tertiary       no     1476  ...       199         4     -1         0   unknown  no
4   59  blue-collar  married  secondary       no        0  ...       226         1     -1         0   unknown  no

 

Since Scikit-Learn, the popular Python Machine Learning package, does not accept character variables, I will cast these variables to numeric via the creation of dummy variables. For the variables marital and education, the code below creates one binary variable for each level of the variable. For example, if the individual is divorced, then marital_divorced = 1. If the individual is not divorced, then marital_divorced = 0.

 

#Create Dummy Variables
data = pd.concat([data, pd.get_dummies(data[['marital','education']], dtype='int64')], axis=1)
data['y_num']=(data['y']=='yes').astype(dtype='int64')

new_cols = ['marital_divorced', 'marital_married',
'marital_single', 'education_primary',
'education_secondary','education_tertiary',
'education_unknown']

data[new_cols].head()

   marital_divorced  marital_married  marital_single  education_primary  education_secondary  education_tertiary  education_unknown
0                 0                1               0                  1                    0                   0                  0
1                 0                1               0                  0                    1                   0                  0
2                 0                0               1                  0                    0                   1                  0
3                 0                1               0                  0                    0                   1                  0
4                 0                1               0                  0                    1                   0                  0

 

With the preprocessing complete, I am ready to train my model. Using the train_test_split function in Scikit-Learn, I create a training set with 70% of my data and a test set with the remaining 30%. After fitting a gradient boosting model with XGBoost, I score the test set and get a misclassification rate of about 11.3%.

 

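from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

#Create Training and Validation Sample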
x_col = ['age','duration','balance', 'marital_divorced', 'marital_married',
         'marital_single', 'education_primary', 'education_secondary',
         'education_tertiary', 'education_unknown']

y_col = ['y_num']

x_train, x_test, y_train, y_test = train_test_split(data[x_col], data[y_col], test_size=.3)

#Train Model
gb = XGBClassifier()
gb.fit(x_train,y_train)
pred = gb.predict(x_test)

#Misclassification
error = 1 - accuracy_score(y_test, pred)
print("XGBoost Error: ", round(error,4))

XGBoost Error:  0.1127
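
The error reported above is simply one minus accuracy: the share of test records whose predicted label differs from the true label. As a rough illustration, the sketch below computes the same quantity in plain Python. The labels are made up for the example and are not taken from bank.csv, and misclassification_rate is a hypothetical helper name.

```python
def misclassification_rate(y_true, y_pred):
    """Share of predictions that differ from the true labels (1 - accuracy)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

# Made-up labels for illustration only (not taken from bank.csv)
y_true = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
y_pred = [0, 0, 0, 0, 1, 0, 0, 1, 1, 0]

# Two of the ten predictions are wrong
error = misclassification_rate(y_true, y_pred)

# Error of a trivial model that always predicts the majority class (0)
baseline = misclassification_rate(y_true, [0] * len(y_true))

print("Error:", round(error, 4))
print("Baseline error:", round(baseline, 4))
```

When the target is imbalanced, as responses to marketing campaigns often are, comparing the model's error against this kind of always-predict-the-majority baseline helps show how much the model actually adds.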

 

After finalizing my model, I want to save its state in a binary form called Pickle. The binary file can be loaded in other Python sessions to score new data without having to re-train the model.

 

from joblib import dump

filename = 'objects/bank_xgboost.sav'
dump(gb, filename)

['objects/bank_xgboost.sav']
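
The save and reload round trip can be sketched with Python's built-in pickle module. This is an illustration only: a plain dictionary stands in for the fitted model, and the file is written to a temporary folder rather than the project's objects folder.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted model: any picklable Python object round-trips the same way
fitted = {"name": "bank_xgboost", "threshold": 0.5}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "objects" / "bank_xgboost.sav"
    path.parent.mkdir(parents=True)   # the target folder must exist before writing

    # Serialize the object to a binary file
    with path.open("wb") as fh:
        pickle.dump(fitted, fh)

    # Load it back, as a separate Python session would
    with path.open("rb") as fh:
        restored = pickle.load(fh)

print(restored == fitted)
```

Two practical points are worth keeping in mind: pickled files should only be loaded from sources you trust, and a pickled model is generally safest to load with the same library versions that were used to create it.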

 

After creating the model, the real challenge is effectively deploying it at scale. Ideally, the business would like to integrate this model into a set of internal applications, including R Shiny, iOS Applications, and Web Browsers. Alteryx Promote allows us to turn this predictive model into a RESTful API that can be called from any application that understands REST. The model no longer needs to be re-coded into multiple languages, and applications can be lightweight since Alteryx Promote handles the scoring.

 

Alteryx Promote uses Docker Swarm to create a cluster of machines that can handle large volumes of requests simultaneously. The load balancer routes each arriving request to one machine in the cluster, which scores the data and returns the result to the requesting application.

 

promote_crop.png

To deploy this model in Alteryx Promote, I first create a Python script called promote_xgb.py that I will use to instruct Promote how to score my model. After loading the necessary libraries, I provide the username, API Key, and Environment variables needed to access the Promote Environment. Using the Promote function, I can create a connection to my Alteryx Promote cluster, represented by the variable p.

 

import pandas as pd
import promote
from joblib import load

#Connections Parameters
username = "username"
api_key = "api_key"
env = "env"

#Connection to Server
p = promote.Promote(username, api_key, env)

 

Next, the script loads the pickled file into memory. Note that the .sav file must be in a subdirectory called objects. From there, I create a dictionary containing test data to score the model. This data will be used to test the results when deploying the model to Promote.

 

filename = './objects/bank_xgboost.sav'
model = load(filename)

#Sample record for testing: first row of bank.csv shown earlier
test_data = {
  "age": 30,
  "duration": 79,
  "balance": 1787,
  "marital": "married",
  "education": "primary"
}

Lastly, I create a scoring function, called bank_xgb_score, defining how my model is to be scored. Every time Alteryx Promote receives a request, this function is used to return a result. This function accomplishes four important tasks:

 

  1. Converts the JSON request into a pandas DataFrame for scoring
  2. Converts character variables into numeric dummy variables
  3. Adds missing columns to ensure all features are present
  4. Returns a prediction

 

#Scoring Function
#df is the parsed JSON request (a dict) to be scored
def bank_xgb_score(df):
    
    #Create DataFrame
    data_keys = list(df.keys())
    data_values = list(df.values())
    data = pd.DataFrame(data=[data_values], columns=data_keys) 
    
    #Independent variables that must be in the model to score
    x_col = ['age','duration','balance', 'marital_divorced', 'marital_married',
             'marital_single', 'education_primary', 'education_secondary',
             'education_tertiary', 'education_unknown']

    #Add required dummy variable levels
    data_enc = pd.get_dummies(data[['marital','education']], dtype='int64')
    data_enc = data_enc.reindex(columns = x_col[3:], fill_value=0)
    data_final = pd.concat([data, data_enc], axis=1)
    
    prediction = (model.predict_proba(data_final[x_col])[:,1]).astype('float64')
    return {"P_yes":round(prediction[0],4)}
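
Step 3 in the list above matters because a single request contains only one level of each categorical variable, so get_dummies on its own produces fewer columns than the model expects. A minimal sketch of that behavior, using a made-up record and the dummy column names from x_col:

```python
import pandas as pd

# Dummy columns the model was trained on (copied from x_col[3:] above)
dummy_cols = ['marital_divorced', 'marital_married', 'marital_single',
              'education_primary', 'education_secondary',
              'education_tertiary', 'education_unknown']

# A single made-up record, as it might arrive in a JSON request
record = pd.DataFrame([{"marital": "single", "education": "tertiary"}])

# Only the levels present in this one row get a column
raw = pd.get_dummies(record[['marital', 'education']], dtype='int64')
print(list(raw.columns))

# reindex adds the missing levels as 0 and fixes the column order
full = raw.reindex(columns=dummy_cols, fill_value=0)
print(full.iloc[0].tolist())
```

Without the reindex step, the model would be handed two dummy columns instead of seven, and selecting data_final[x_col] in the scoring function would fail on the missing names.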

 

With the scoring function created, I can use the deploy function to publish my model in Alteryx Promote. This function creates a model called bankxgb in the Alteryx Promote UI.

 

p.deploy("bankxgb", bank_xgb_score, testdata=test_data)

 

Before executing this script, we need to create a file called requirements.txt defining the Python libraries needed to score the model. This allows Promote to install these Python dependencies upon initialization of the Docker Containers using Docker Swarm.

 

promote
pandas
scikit-learn
xgboost

 

After all this preparation work, my project folder contains an objects folder holding the pickled model, a Python (.py) script to deploy the model, and a requirements.txt file listing the Python library dependencies. The file structure is as follows:

 

bank-model/
├── objects
│   └── bank_xgboost.sav
├── promote_xgb.py 
└── requirements.txt
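
Before running the deployment script, it can be handy to confirm that the folder matches this layout. The sketch below is one way to do that with the standard library. The helper name missing_files is hypothetical, and the demonstration builds a throwaway folder rather than touching a real project.

```python
import tempfile
from pathlib import Path

# Files the walkthrough expects inside the project folder
REQUIRED = ["objects/bank_xgboost.sav", "promote_xgb.py", "requirements.txt"]

def missing_files(root):
    """Return the required relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in REQUIRED if not (root / rel).is_file()]

# Demonstration on a throwaway folder
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "bank-model"
    (project / "objects").mkdir(parents=True)
    (project / "promote_xgb.py").touch()
    (project / "requirements.txt").touch()

    before = missing_files(project)   # the pickled model has not been saved yet
    (project / "objects" / "bank_xgboost.sav").touch()
    after = missing_files(project)    # everything is now in place

print(before)
print(after)
```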

 

Upon executing the promote_xgb.py file, we will see in the log that the HTTP request was sent successfully. We will need to load the Alteryx Promote UI to check the model’s status.

 

model_Deployment.png

 

After finding the model in the landing screen, we can see that the deployment was successful. We can also see useful metadata about the model, such as recent modifications, name, and replicas.

 

Login_Screen.png

 

We now need to test the model to ensure it returns sensible results. In the Execute tab, we can take the sample data from the promote_xgb.py file and test the model. Promote uses the scoring function to process the raw user input and return a prediction. Here, this individual has a 3.6% chance of responding to the marketing campaign, so he or she is probably not worth targeting.

 

Results.png

 

With a working model, we can now call this RESTful API endpoint from any environment that supports REST. Alteryx Promote provides working samples for popular options including Python, R, and Node.js, making it very easy to use this model in any enterprise or web application. Using the R code, we can score a different individual on-the-fly.

 

R_Code.png
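
For readers working in Python rather than R, a request to a deployed model might be assembled roughly as follows. This is a sketch under stated assumptions, not the sample code that Promote generates: the URL is a placeholder, the credentials are dummies, and the use of HTTP basic authentication is assumed. The working sample shown in the Promote UI for your model is the authoritative version.

```python
import base64
import json
import urllib.request

# Placeholder values: take the real URL and credentials from the Promote UI
URL = "https://promote.example.com/username/models/bankxgb/predict"  # hypothetical
USERNAME = "username"
API_KEY = "api_key"

def build_request(record):
    """Build (but do not send) a JSON POST request for one record."""
    body = json.dumps(record).encode("utf-8")
    token = base64.b64encode(f"{USERNAME}:{API_KEY}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",  # assumes HTTP basic auth
        },
    )

# Raw, un-encoded fields: the scoring function on the server creates the dummies
req = build_request({"age": 30, "duration": 79, "balance": 1787,
                     "marital": "married", "education": "primary"})
print(req.get_method(), req.full_url)

# Sending is left commented out because the URL above is a placeholder:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Note that the client sends the raw marital and education values. Because the scoring function handles the encoding, calling applications stay lightweight.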

 

In conclusion, Alteryx Promote turns complicated Machine Learning models into a simple RESTful API that virtually all systems can call. While this example shows a custom Python model, Alteryx Promote allows you to deploy many types of models, including R, TensorFlow, and H2O. Enterprises can now spend more time building models that provide value to the organization.

Andrew Kramer

Andrew Kramer is a Solutions Architect at Alteryx focusing on Analytics, Machine Learning, and statistical programming. He works daily with Alteryx customers to help them do more with Analytics.