Alteryx Promote vs. Concept Drift, Data Drift, Model Decay, Model Retrain

Question

"So you have constructed the most bad-**bleep** predictive model based on the painstakingly prepared data set.
Just like anything else in our lives, (unfortunately) nothing lasts forever.
The predictive ability of your model decays over time. How can you approach this problem and fix it with Alteryx Server & Promote?"

I would like to share this short write-up based on questions customers ask me relatively often when talking about Promote.

This time, briefly, on the topic of Alteryx Promote vs. Concept Drift, Data Drift, Model Decay, and Model Retrain.

Note to self: I love sharing stuff with the Alteryx Community. But I must also admit that over time I have grown to enjoy beating my team lead @ShaanM to the number of posts.

Predictive Modeling

Predictive modeling is about building models from historical datasets and then using those models to make predictions for new data.
This could, for instance, mean building a classification model to predict which customers are likely to respond to our marketing campaigns in the future, so we can increase the effectiveness of the marketing targeting, decrease expenses, and increase sales.

Data change over time

The data you work with and use for your predictive model can change over time.
As time goes on, this may lead to less and less optimal business decisions based on predictions from such a model.

Back to the marketing campaign example from above: by the time the marketing team has used the model to drive their decisions for quite a while, customers' purchasing patterns and marketing response behavior may have changed. The model may no longer capture the new reality.

Drift
Drift in data science refers to the fact that your model loses its predictive ability. Technically speaking, the relationships between input and output data change over time. There are actually two types of drift: data drift and concept drift. Generally, these are also the reasons for your model's decay.

With data drift, the collected data evolve over time, potentially introducing unseen patterns and variations in the data.
With concept drift, the interpretation of the data changes over time even though the distribution of the data does not.
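
One common way to watch for data drift specifically (this is not something the write-up itself prescribes) is to compare the distribution of an input feature in new data against the training data. The sketch below uses the Population Stability Index (PSI) with equal-width bins. The feature values are synthetic, and the bin count and the 0.25 cutoff are rule-of-thumb assumptions for illustration only.

```python
import math
import random


def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index of `current` against `baseline`.

    Bins are equal-width over the baseline range; values outside that
    range are counted in the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    if hi == lo:
        raise ValueError("baseline has no spread, cannot build bins")
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # Floor at eps so empty bins do not cause log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    expected = proportions(baseline)
    actual = proportions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


# Synthetic example: one numeric feature, e.g. a customer's monthly spend.
rng = random.Random(42)
train = [rng.gauss(50, 10) for _ in range(5000)]
same = [rng.gauss(50, 10) for _ in range(5000)]      # same distribution
shifted = [rng.gauss(60, 10) for _ in range(5000)]   # mean has moved

psi_same = psi(train, same)
psi_shifted = psi(train, shifted)

DRIFT_CUTOFF = 0.25  # assumed rule-of-thumb cutoff for a significant shift
print("same distribution drifted:", psi_same > DRIFT_CUTOFF)
print("shifted distribution drifted:", psi_shifted > DRIFT_CUTOFF)
```

Note that a check like this only looks at the inputs. It cannot see concept drift, where the inputs look the same but their relationship to the label has changed, which is why monitoring a performance metric on labeled data is still needed.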

"In most challenging data analysis applications, data evolve over time and must be analyzed in near real time. Patterns and relations in such data often evolve over time, thus, models built for analyzing such data quickly become obsolete over time. In machine learning and data mining, this phenomenon is referred to as concept drift.”
Source: Whitepaper on drift by dept of Computer Science, Finland, 2014

What to do about drift?
To fix data drift, new data needs to be labeled to introduce the new classes, and the model retrained.
To fix concept drift, the old data needs to be re-labeled and the model retrained.

Continuous re-labeling of old data and retraining of models can still be an expensive exercise, though.
You don't want to guess when the model goes stale. You want to be able to track and monitor concept drift and act on it as needed.

How to monitor (concept) drift and retrain your model?
Let's assume you built a model on labeled data and obtained its performance metrics, say the F1 score on test data. As part of your business decision, you defined your least acceptable F1 score to be 0.925.
Now you get another test set of labeled data to check how your model performs. The labels in the test set are compared with the predictions from the latest model at hand.
When the F1 score of the sample falls below the threshold (0.925 here), we trigger a re-label/re-train task.
The model needs to be re-trained using the updated labels so as to recover its predictive ability.
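
The check described above can be sketched in a few lines of plain Python. This is a minimal illustration for a binary classifier, not a definitive implementation: the function names `f1_score` and `needs_retrain` and the toy labels are made up for this example, and only the 0.925 threshold comes from the text.

```python
MIN_ACCEPTABLE_F1 = 0.925  # least acceptable F1 score from the example above


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def needs_retrain(y_true, y_pred, threshold=MIN_ACCEPTABLE_F1):
    """True when the F1 score on the labeled sample falls below the threshold."""
    return f1_score(y_true, y_pred) < threshold


# Toy labeled sample (hypothetical): actual responses vs. model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

score = f1_score(y_true, y_pred)
print("F1:", round(score, 3), "retrain needed:", needs_retrain(y_true, y_pred))
```

In this toy sample the model makes one false positive and one false negative out of five actual responders, so precision and recall are both 0.8 and the check flags the model for a re-label/re-train.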

How to approach this with Promote & Server
Let's say that you are planning to deploy your models using Alteryx Promote, and you want to be able to track the drift of your models.
Based on the previous paragraph, detecting drift would be all about having your model score a (random) test set consisting of labeled data and checking how your model performs.

Getting the performance metrics of your model
This could easily be done, for instance, using Alteryx Designer and sending, say, 1000 records of your test set against the Promote model using the Score tool.
Obviously, it is up to you to decide what your performance metrics are. This may be, for instance, F1 score, AUC, etc. You will need to calculate these metrics yourself, as every single customer and model will have different needs.

Storing the performance metrics over time
The calculated results can then be pushed to your database or some other type of persistence store.
Moreover, together with Alteryx Server, you could schedule your workflow to check the performance of your models daily (or at any other interval).
This all then lets you use the visual analytics tool of your choice (Alteryx, Tableau, PowerBI, …) to plot the performance metrics of your model over time.
You could then have your data science team monitor the model performance and drift/decay centrally from dashboards built on top of this data.

Simple example of getting the performance metrics
Obviously this is extremely oversimplified, but an indicative workflow would score the labeled test set against the Promote model, calculate the metrics, and write them to the persistence store.
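
Outside of Designer, the "store the metric on every scheduled run" step could look roughly like the following. This is only a sketch under assumptions: it uses an in-memory SQLite database as a stand-in for whatever persistence store you pick, and the table name `model_metrics`, the model name `campaign_response`, and the metric values are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Open (or create) the metrics store. ':memory:' is for the demo only."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS model_metrics ("
        " checked_at TEXT NOT NULL,"
        " model_name TEXT NOT NULL,"
        " metric TEXT NOT NULL,"
        " value REAL NOT NULL)"
    )
    return conn


def record_metric(conn, model_name, metric, value, checked_at=None):
    """Append one measurement; defaults to the current UTC time."""
    checked_at = checked_at or datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO model_metrics VALUES (?, ?, ?, ?)",
        (checked_at, model_name, metric, value),
    )
    conn.commit()


def metric_history(conn, model_name, metric):
    """Return (checked_at, value) rows in time order, ready for plotting."""
    rows = conn.execute(
        "SELECT checked_at, value FROM model_metrics"
        " WHERE model_name = ? AND metric = ? ORDER BY checked_at",
        (model_name, metric),
    )
    return rows.fetchall()


# Three daily checks with made-up F1 values.
conn = open_store()
record_metric(conn, "campaign_response", "f1", 0.951, "2020-01-01T00:00:00+00:00")
record_metric(conn, "campaign_response", "f1", 0.943, "2020-01-02T00:00:00+00:00")
record_metric(conn, "campaign_response", "f1", 0.918, "2020-01-03T00:00:00+00:00")

history = metric_history(conn, "campaign_response", "f1")
for checked_at, value in history:
    print(checked_at, value)
```

Keeping one row per check, rather than overwriting a single "latest" value, is what makes the over-time dashboard possible.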
Model rebuild and redeploy with Server
Also, in combination with Alteryx Server, if you get to the point where your least acceptable model performance criteria are not met, you could trigger a model rebuild and redeploy.
This would, of course, depend on whether your model is based in Alteryx, Python, or R, but for all these types you should be able to achieve this without too much effort.

Simple sample of rebuild/redeploy
Again, a very simple indicative workflow could be built using runner tools. The first workflow in the chain would just retrieve your model's performance results, compare them against your minimal acceptable values and, if they are lower than you want them to be, trigger a model retrain and model redeploy (which will differ based on the approach used to create the model: Designer vs. R vs. Python).
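
The decision logic of that first workflow can be sketched as a small function. This is an illustration only: `check_and_redeploy` and the two stub callables are hypothetical names, and the real retrain and redeploy steps depend on how the model was built (Designer, R, or Python), so they are passed in rather than implemented here.

```python
def check_and_redeploy(latest_score, threshold, retrain, redeploy):
    """Retrain and redeploy only when the latest score is below the threshold.

    `retrain` is a callable returning a new model; `redeploy` is a callable
    that takes that model. Returns the action taken as a string.
    """
    if latest_score >= threshold:
        return "keep"
    new_model = retrain()
    redeploy(new_model)
    return "redeployed"


# Stand-ins for the real steps, which are not shown in the write-up.
deployed = []


def retrain_stub():
    return {"version": 2}


def redeploy_stub(model):
    deployed.append(model)


action_ok = check_and_redeploy(0.951, 0.925, retrain_stub, redeploy_stub)
action_stale = check_and_redeploy(0.918, 0.925, retrain_stub, redeploy_stub)
print(action_ok, action_stale, deployed)
```

Passing the retrain and redeploy steps in as callables keeps the threshold check identical regardless of which tool produced the model.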

DavidM · Answer

@joshuaburkhow thanks for the very positive feedback. Much appreciated.

I think that typically the least acceptable score would be something required by the wider team, including the business side, who would agree on what that should look like.

It is surely possible that the model gets better and better, and at some point you may raise those thresholds.

joshuaburkhow · Answer

Thanks @DavidM I had this marked for a while to read and this is really slick. I especially like the last part where you can essentially automate that entire check, retrain, and redeploy WHILE also having visibility over time where it drifts and gets corrected 🙂

Question: is it likely or common that when a model gets retrained it gets better, to the point that we have to move the least acceptable score upward? How would we know?

Thanks again!

Joshua