Alteryx Data Science Design Patterns: Combining Models

Data-Science Design Patterns:  Combining Models

The last set of model-construction techniques we need to review, before embarking on our study of data-science design patterns proper, consists of techniques for combining several models into a single (more complex) model.  Many state-of-the-art models use model-combination techniques to improve substantially on the accuracy of their sub-models.  So it’s not surprising that model combination remains an area of active research, or that combined models often win data-science competitions.

Space vs. Speed

Fundamentally, there are two ways to improve accuracy in a given modeling exercise.  One way is to gather more data.  The other way is to build a better model.  Gathering more data requires more storage space.  Building a better model often means building a more complex model, which usually requires more CPU time for both model fitting and model execution.  So we can see the choice between data gathering and model building as an instance of the old computer-programming tradeoff between space and speed.  In particular, when we consider using a model-combination technique, it’s important to ask ourselves two questions:

* What level of accuracy is good enough?
* Are we likely to achieve our target accuracy more easily by gathering more data or by building a more complex model?

It is now a big-data truism that gathering more data often trumps building a more complex model.  But the two are not mutually exclusive.  You may need to do both!

There are two main classes of model-combination techniques.  Ensemble models combine several sub-models, all of which solve the same formal problem.  There are a few well-defined classes of ensemble techniques; these are the focus of this blog post.  Compound models combine several sub-models, each solving a different part of a given problem.  (For now you can think of a compound model as a pipeline or daisy chain of models.)  Many real-world models are compound models, so many of our future blog posts will describe compound-model design patterns.

Ensembles

We classify ensemble techniques according to how they create sub-model diversity.  Within a given ensemble technique, there are three common sources of sub-model variation:

* induction algorithm
* features
* training dataset.

This blog post doesn’t try to cover every possible combination of techniques across these areas.  Rather, we’ll use these distinctions to explain some of the most popular ensemble techniques.

Aggregation Rules

Ensembles can also differ in their aggregation rule, their method of combining results from sub-models.  Here are brief summaries of a few popular rules.

Classification Rules

The two most common aggregation rules for (supervised) classification are simple and weighted majority vote.  Note that both require a tie-breaking heuristic.

Majority Vote

One chooses the mode of the sub-models’ predictions.

Weighted Majority Vote

One weights the sub-models’ predictions by a model-quality or -relevance criterion, and selects the prediction having the most total weight.
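Both voting rules can be sketched in a few lines of Python.  This is a minimal illustration, not a reference implementation: the function name `weighted_majority_vote`, the example weights, and the tie-breaking heuristic (prefer the tied label that appears first in sub-model order) are all our own assumptions.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights=None):
    """Return the class label carrying the largest total weight.

    predictions: one class label per sub-model.
    weights: one non-negative weight per sub-model; when omitted, every
             sub-model gets weight 1, which is simple majority vote.
    Tie-break (an arbitrary choice for this sketch): the tied label that
    appears first in `predictions` wins.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per prediction")

    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight

    best = max(totals.values())
    for label in predictions:  # scan in sub-model order to break ties
        if totals[label] == best:
            return label


votes = ["spam", "ham", "spam"]
simple = weighted_majority_vote(votes)
# Hypothetical weights, e.g. each sub-model's validation accuracy.
weighted = weighted_majority_vote(votes, weights=[0.2, 0.9, 0.3])
```

With equal weights the two "spam" votes win; with the hypothetical weights above, the single heavily weighted "ham" vote outweighs them.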

Regression Rules

The two most common aggregation rules for regression are simple and weighted averaging of the sub-models’ predictions.  The weights can be chosen as in weighted majority vote, by a model-quality or -relevance criterion.  Note that we contrast prediction averaging with model averaging (see below).
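The regression rules are even simpler to sketch.  The function name `weighted_average` and the example numbers below are illustrative assumptions only.

```python
def weighted_average(predictions, weights=None):
    """Combine numeric sub-model predictions into one ensemble prediction.

    With no weights this is the simple average; otherwise each prediction
    counts in proportion to its (non-negative) weight.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per prediction")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(p * w for p, w in zip(predictions, weights)) / total


sub_model_predictions = [10.0, 12.0, 20.0]
simple = weighted_average(sub_model_predictions)               # (10 + 12 + 20) / 3
weighted = weighted_average(sub_model_predictions, [1, 1, 2])  # (10 + 12 + 40) / 4
```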

Popular Ensemble Techniques

A handful of ensemble techniques are in common use.  We review them briefly here.

Boosting

Boosting combines many weak sub-models having a common type of induction algorithm into a single strong ensemble (where a sub-model is weak if its predictions match the actual values only slightly better than random guessing would).  Typically a boosting algorithm constructs its ensemble model iteratively, adding one sub-model to the ensemble in progress at each iteration, and changing sub-model weights and data-point weights at each iteration in a way that focuses newly added sub-models on fitting data points that the ensemble has so far handled poorly.  AdaBoost is the most popular boosting meta-algorithm.

Boosting has some very desirable mathematical properties.  In particular, if the sub-models all perform better than random guessing, the boosting algorithm converges on a strong model.  However, boosting can be sensitive to noise in the training set’s dependent-variable values.
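To make the reweighting idea concrete, here is a minimal sketch of AdaBoost for binary labels in {-1, +1}, using one-dimensional decision stumps as the weak sub-models.  The helper names (`fit_stump`, `adaboost`, `predict`) and the toy dataset are our own assumptions; a production implementation would handle many features and edge cases that this sketch ignores.

```python
import math


def stump_predict(x, threshold, polarity):
    """A stump predicts `polarity` when x > threshold, else `-polarity`."""
    return polarity if x > threshold else -polarity


def fit_stump(xs, ys, ws):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    values = sorted(set(xs))
    # Candidate thresholds: below the minimum, and midway between neighbours.
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in thresholds:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, ws)
                      if stump_predict(x, t, polarity) != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best


def adaboost(xs, ys, rounds):
    """Return a list of (alpha, threshold, polarity) weighted stumps."""
    n = len(xs)
    ws = [1.0 / n] * n                      # start with uniform data-point weights
    ensemble = []
    for _ in range(rounds):
        err, t, polarity = fit_stump(xs, ys, ws)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # sub-model weight
        ensemble.append((alpha, t, polarity))
        # Raise the weight of misclassified points and lower the rest,
        # so the next stump focuses on what the ensemble gets wrong.
        ws = [w * math.exp(-alpha * y * stump_predict(x, t, polarity))
              for x, y, w in zip(xs, ys, ws)]
        total = sum(ws)
        ws = [w / total for w in ws]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single stump can classify perfectly.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
errors_1 = sum(predict(adaboost(xs, ys, 1), x) != y for x, y in zip(xs, ys))
errors_3 = sum(predict(adaboost(xs, ys, 3), x) != y for x, y in zip(xs, ys))
```

On this toy dataset a single stump misclassifies three points, while three boosted stumps together classify every training point correctly.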

Bagging

Bagging is short for bootstrap aggregation.  Like boosting, bagging varies the training set while holding the type of induction algorithm constant.  The technique generates a set of bootstrap samples from the overall training set, fits the same induction algorithm to each sample, and then combines the predictions of the fitted models.  (Different bagging algorithms may use different aggregation rules.)

Bagging can improve the performance of algorithms that are unstable in the sense that small changes in the training set can yield very different predictions.  But it can degrade the performance of stable algorithms.
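The bagging recipe can be sketched as follows.  We assume, purely for illustration, a 1-nearest-neighbour regressor as the (unstable) base learner and simple averaging as the aggregation rule; the function names and the toy data are hypothetical.

```python
import random
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) rows from data, sampling with replacement."""
    return [rng.choice(data) for _ in data]


def fit_1nn(sample):
    """Illustrative base learner: 1-nearest-neighbour regression on (x, y) rows."""
    def model(x):
        nearest = min(sample, key=lambda row: abs(row[0] - x))
        return nearest[1]
    return model


def bagging_fit(data, fit, n_models, seed=0):
    """Fit the same induction algorithm to n_models bootstrap samples."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Aggregate by simple averaging of the sub-models' predictions."""
    return mean(model(x) for model in models)


data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8), (4.0, 4.1)]
models = bagging_fit(data, fit_1nn, n_models=25)
prediction = bagging_predict(models, 2.4)
```

Each sub-model sees a slightly different training set, so the averaged prediction is smoother than any single 1-nearest-neighbour fit.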

Random Subspace

The random-subspace method (sometimes termed feature bagging) varies the features rather than the training set.  Here we use bootstrap sampling to create a collection of feature subsets, and we fit the same type of induction algorithm to each subset.  (The bootstrap sampling means the same feature can appear in several sub-models’ feature sets; the feature sampling is sampling with replacement, across the sub-models.  Of course, the same feature can appear at most once within any given sub-model’s feature set, making the feature sampling for a given sub-model sampling without replacement.)  A random-subspace method may fit the sub-models to the entire training dataset, or it may fit different sub-models to different subsets of the training set.  For example, the most popular version of the random-forest model (the version implemented in Alteryx’s Forest Model tool) combines feature bagging with ordinary (training set) bagging.
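A minimal sketch of the random-subspace idea follows.  We assume a nearest-centroid classifier as the common induction algorithm and simple majority vote as the aggregation rule; both choices, along with all function names and the toy data, are illustrative assumptions.  Every sub-model here is fitted to the entire training set.

```python
import random
from collections import Counter


def random_subspaces(n_features, subset_size, n_models, seed=0):
    """Draw one feature subset per sub-model.

    rng.sample draws without replacement, so no feature repeats within a
    subset; subsets are drawn independently, so a feature may recur across
    sub-models.
    """
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subset_size))
            for _ in range(n_models)]


def project(row, features):
    """Keep only the selected feature columns of one row."""
    return [row[j] for j in features]


def fit_centroids(rows, labels):
    """Illustrative base learner: predict the class with the nearest centroid."""
    counts = Counter(labels)
    sums = {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def model(row):
        return min(centroids, key=lambda lab: sum(
            (a - b) ** 2 for a, b in zip(row, centroids[lab])))
    return model


def fit_random_subspace(rows, labels, subset_size, n_models, seed=0):
    """Fit one sub-model per feature subset; return (subset, model) pairs."""
    subsets = random_subspaces(len(rows[0]), subset_size, n_models, seed)
    return [(fs, fit_centroids([project(r, fs) for r in rows], labels))
            for fs in subsets]


def predict_random_subspace(ensemble, row):
    """Simple majority vote over the sub-models."""
    votes = Counter(model(project(row, fs)) for fs, model in ensemble)
    return votes.most_common(1)[0][0]


rows = [[0, 0, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1],
        [9, 9, 9, 9], [8, 9, 8, 9], [9, 8, 9, 8]]
labels = ["a", "a", "a", "b", "b", "b"]
ensemble = fit_random_subspace(rows, labels, subset_size=2, n_models=7)
```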

Stacking

Stacking (sometimes called blending) trains several sub-models having arbitrary induction algorithms and feature sets on the entire training set, and then trains another algorithm (a meta-learner) to combine the sub-models’ predictions of a given data point’s dependent-variable value; in the simplest case, the meta-learner just decides which sub-model to use.  Stacking generally outperforms each of its sub-models, and is frequently used in winning entries of data-science competitions.
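Here is a deliberately small sketch of stacking for regression.  It assumes two sub-models (a straight-line fit and a constant-mean predictor) and a very simple second-level learner that learns a single blending weight between them.  A common practice, which we follow here, is to train the second-level learner on out-of-fold predictions so that it does not simply reward sub-models that memorize the training set.  All names and the toy data are hypothetical.

```python
def fit_mean(data):
    """Sub-model 1: always predict the training mean of y."""
    m = sum(y for _, y in data) / len(data)
    return lambda x: m


def fit_line(data):
    """Sub-model 2: ordinary least-squares line through (x, y) rows."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data)
    slope = sum((x - mx) * (y - my) for x, y in data) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def out_of_fold_predictions(data, fitters, k):
    """For each row, predict with sub-models that never saw that row."""
    preds = [None] * len(data)
    for fold in range(k):
        train = [row for i, row in enumerate(data) if i % k != fold]
        models = [fit(train) for fit in fitters]
        for i, (x, _) in enumerate(data):
            if i % k == fold:
                preds[i] = [m(x) for m in models]
    return preds


def fit_meta_weight(preds, ys):
    """Second-level learner: w minimizing sum((w*p1 + (1-w)*p2 - y)^2), clipped to [0, 1]."""
    num = sum((p1 - p2) * (y - p2) for (p1, p2), y in zip(preds, ys))
    den = sum((p1 - p2) ** 2 for p1, p2 in preds)
    w = num / den if den else 0.5
    return min(1.0, max(0.0, w))


def fit_stack(data, k=5):
    """Return (stacked_model, weight_on_the_line_sub_model)."""
    fitters = [fit_line, fit_mean]
    preds = out_of_fold_predictions(data, fitters, k)
    w = fit_meta_weight(preds, [y for _, y in data])
    line, const = (fit(data) for fit in fitters)  # refit on all the data
    return (lambda x: w * line(x) + (1 - w) * const(x)), w


data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
stacked, weight = fit_stack(data)
```

On this perfectly linear toy data the learned weight puts essentially all of the trust in the line sub-model, as one would hope.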

Bucket of Models

Bucket of models is like stacking, but it chooses a sub-model for an entire problem rather than for a single data point.  There are three ways to view bucket of models:

* The ensemble method defines a small, manually fixed set of sub-models having different feature sets and/or different induction algorithms, but a common measure of model fitness. For each new problem, the method trains each sub-model and chooses the sub-model having optimal fitness.

* When the bucket of models consists of all models having a fixed set of input variables (superset of possible features) and a fixed type of induction algorithm, the model selection reduces to a model-fitting problem. Basic multi-level cross-validation tuning (see our previous post) with grid search is an example.

* When the ensemble method defines an entire space of possible sub-models (including possible features and different induction algorithms), we arrive at the broad notion of model selection described in our previous blog post. Alteryx has a partnership with DataRobot, whose tool uses exhaustive search to perform model selection.  Genetic algorithms are another interesting approach to this type of model selection.
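The first view above (a small, fixed bucket with a common fitness measure) can be sketched as follows.  We assume cross-validated mean squared error as the fitness measure and two toy regressors as the bucket; the names and data are illustrative only.

```python
def fit_mean(data):
    """Candidate 1: always predict the training mean of y."""
    m = sum(y for _, y in data) / len(data)
    return lambda x: m


def fit_line(data):
    """Candidate 2: ordinary least-squares line through (x, y) rows."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data)
    slope = sum((x - mx) * (y - my) for x, y in data) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def cv_error(data, fit, k):
    """Common fitness measure: k-fold cross-validated mean squared error."""
    total = 0.0
    for fold in range(k):
        train = [row for i, row in enumerate(data) if i % k != fold]
        test = [row for i, row in enumerate(data) if i % k == fold]
        model = fit(train)
        total += sum((model(x) - y) ** 2 for x, y in test)
    return total / len(data)


def bucket_of_models(data, fitters, k=5):
    """Score every candidate, then refit the winner on all of the data."""
    scores = {name: cv_error(data, fit, k) for name, fit in fitters.items()}
    best = min(scores, key=scores.get)
    return best, fitters[best](data), scores


data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
bucket = {"mean": fit_mean, "line": fit_line}
best_name, best_model, scores = bucket_of_models(data, bucket)
```

Unlike stacking, the output is a single sub-model chosen for the whole problem; the losing candidates are discarded.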

How Many Sub-Models?

Whether and how to limit the number of sub-models in an ensemble is an important question, and an area of current research.  The most basic problem is data dredging.  If an ensemble method selects sub-models from a sufficiently large hypothesis space (space of possible sub-models), it is likely to find a combination that appears to have high out-of-sample fitness on the data used for selection, even if there is no true relationship between the ensemble’s predictions and the population that generated the data.  Such false relationships are termed spurious.  Exhaustive search and large model spaces both increase the risk of finding a spurious relationship.  So, as you consider using an ensemble technique, think about whether the ensemble’s hypothesis space and search algorithm will conduct a sufficiently narrow search to avoid data dredging.
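A small simulation makes the danger tangible.  In this sketch (our own construction, with hypothetical names and sizes), the labels and all of the features are independent coin flips, so no feature has any true relationship to the label.  Each candidate "model" simply predicts that the label equals one particular feature.  We pick the candidate that scores best on a small selection set, and then score that same candidate on fresh data.

```python
import random


def make_noise_data(n_rows, n_features, rng):
    """Features and labels are independent fair coin flips: pure noise."""
    rows = [[int(rng.random() < 0.5) for _ in range(n_features)]
            for _ in range(n_rows)]
    labels = [int(rng.random() < 0.5) for _ in range(n_rows)]
    return rows, labels


def accuracy(feature, rows, labels):
    """Accuracy of the one-feature 'model' that predicts label == row[feature]."""
    hits = sum(row[feature] == y for row, y in zip(rows, labels))
    return hits / len(rows)


def dredge(n_features=500, n_select=20, n_fresh=1000, seed=0):
    """Return (accuracy on the selection data, accuracy on fresh data)."""
    rng = random.Random(seed)
    sel_rows, sel_labels = make_noise_data(n_select, n_features, rng)
    fresh_rows, fresh_labels = make_noise_data(n_fresh, n_features, rng)
    # Search the whole hypothesis space for the best-looking candidate.
    best = max(range(n_features),
               key=lambda j: accuracy(j, sel_rows, sel_labels))
    return (accuracy(best, sel_rows, sel_labels),
            accuracy(best, fresh_rows, fresh_labels))


selected_accuracy, fresh_accuracy = dredge()
```

With hundreds of candidates and only twenty selection rows, the winning candidate typically looks far better than chance on the selection data, yet it falls back to roughly coin-flip accuracy on fresh data.  That gap is the spurious relationship at work.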

Onward to Actual Design Patterns!

If you’ve read the first seven Alteryx Data-Science Design Pattern blog posts, you’ve learned all of the essential concepts you need to understand

* how real-world data-science design patterns are constructed
* why they work well for their use cases
* when and how to apply them to new problems.

Starting with our first post in January of 2017, we’ll begin exploring real-world design patterns, and showing you how to implement them in Alteryx and R.

Happy holidays!

ToddM · Answer

Some of the algorithms DataRobot uses are ensembles (e.g. random forest).  I’m not clear whether DataRobot experiments with ensembling patterns on top of its algorithm library.

Todd Morley
Director of Analytics Products

AlbertP1 · Answer

Hi Todd,

Thanks for the post. Would you know if the Data Robot Tools from the Gallery also perform Ensemble Modeling?

Albert

Darroch · Answer

Thanks for writing this. I'm looking forward to your January post as I've not seen any examples within the Alteryx environment.

Darroch