The last set of model-construction techniques we need to review, before embarking on our study of data-science design patterns proper, consists of techniques for combining several models into a single (more complex) model. Many state-of-the-art models use model-combination techniques to improve substantially on the accuracy of their sub-models. So it’s not surprising that model combination remains an area of active research, or that combined models often win data-science competitions.
Space vs. Speed
Fundamentally, there are two ways to improve accuracy in a given modeling exercise. One way is to gather more data. The other way is to build a better model. Gathering more data requires more storage space. Building a better model, at least when that means building a more complex model, usually requires more CPU time (for both model fitting and model execution). So we can see the choice between data gathering and model building as an instance of the old computer-programming tradeoff between space and speed. In particular, when we consider using a model-combination technique, it’s important to ask ourselves two questions:
What level of accuracy is good enough?
Are we likely to achieve our target accuracy more easily by gathering more data or by building a more complex model?
There are two main classes of model-combination techniques. Ensemble models combine several sub-models, all of which solve the same formal problem. There are a few well-defined classes of ensemble techniques; these are the focus of this blog post. Compound models combine several sub-models, each solving a different part of a given problem. (For now you can think of a compound model as a pipeline or daisy chain of models.) Many real-world models are compound models, so many of our future blog posts will describe compound-model design patterns.
We classify ensemble techniques according to how they create sub-model diversity. There are three common sources of sub-model variation within a given ensemble technique: the training set, the feature set, and the type of induction algorithm.
This blog post doesn’t try to cover every possible combination of techniques across these areas. Rather, we’ll use these distinctions to explain some of the most popular ensemble techniques.
Ensembles can also differ in their aggregation rule, their method of combining results from sub-models. Here are brief summaries of a few popular rules.
The two most common aggregation rules for (supervised) classification are simple and weighted majority vote. Note that both require a tie-breaking heuristic.
Simple Majority Vote
One chooses the mode of the sub-models’ predictions.
Weighted Majority Vote
One weighs the sub-models’ predictions by a model-quality or -relevance criterion, and selects the prediction having the most weight.
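To make the two voting rules concrete, here is a minimal Python sketch. The function names and the tie-breaking heuristic (the label that appears first in the input wins) are our own assumptions for illustration, not a standard.

```python
from collections import Counter, defaultdict

def simple_majority_vote(predictions):
    """Return the mode of the predictions."""
    counts = Counter(predictions)
    best = max(counts.values())
    # Tie-breaking heuristic (an assumption): the first tied label in input order wins.
    return next(label for label in predictions if counts[label] == best)

def weighted_majority_vote(predictions, weights):
    """Return the label whose votes carry the most total weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    best = max(totals.values())
    # Same tie-breaking heuristic as above.
    return next(label for label in predictions if totals[label] == best)

votes = ["spam", "ham", "spam"]
print(simple_majority_vote(votes))                     # spam
print(weighted_majority_vote(votes, [0.2, 0.9, 0.3]))  # ham
```

Note how the weighted rule can overturn the simple majority when one sub-model is trusted much more than the others.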
The two most common aggregation rules for regression are simple and weighted averaging of the sub-models’ predictions. The approaches to weighted averaging are as above for classification. Note that we contrast prediction averaging with model averaging (see below).
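The regression rules are even simpler to sketch. The function names and the example weights below are hypothetical; in practice the weights would come from some model-quality or -relevance criterion.

```python
def simple_average(predictions):
    """Unweighted mean of the sub-models' predictions."""
    return sum(predictions) / len(predictions)

def weighted_average(predictions, weights):
    """Mean of the predictions, weighted by (hypothetical) model-quality scores."""
    total_weight = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total_weight

preds = [10.0, 12.0, 20.0]
print(simple_average(preds))                     # 14.0
print(weighted_average(preds, [1.0, 1.0, 2.0]))  # 15.5
```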
Popular Ensemble Techniques
A handful of ensemble techniques are in common use. We review them briefly here.
Boosting combines many weak sub-models having a common type of induction algorithm into a single strong ensemble (where a sub-model is weak if the correlation between predicted and actual value is only slightly better than that achieved by random guessing). Typically a boosting algorithm constructs its ensemble model iteratively, adding sub-models to the ensemble in progress and changing sub-model weights and data-point weights at each iteration in a way that focuses newly added sub-models on fitting data points that the ensemble has so far handled poorly. AdaBoost is the most popular boosting meta-algorithm.
Boosting has some very desirable mathematical properties. In particular, if the sub-models all perform better than random guessing, the boosting algorithm converges on a strong model. However, boosting can be sensitive to noise in the training set’s dependent-variable values.
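Here is a minimal sketch of AdaBoost for binary labels (+1/-1), using one-dimensional decision stumps as the weak sub-models. The stump learner, the toy dataset, and all function names are our own illustrative assumptions; a production implementation would differ in many details.

```python
import math

def stump_predict(threshold, polarity, x):
    """A decision stump: predicts `polarity` at or above the threshold, else the opposite."""
    return polarity if x >= threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) for the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for w, x, y in zip(weights, xs, ys)
                        if stump_predict(threshold, polarity, x) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Return a list of (sub-model weight, threshold, polarity) triples."""
    n = len(xs)
    weights = [1.0 / n] * n                           # data-point weights
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)     # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)   # sub-model weight
        ensemble.append((alpha, threshold, polarity))
        # Up-weight the points this stump got wrong, down-weight the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(threshold, polarity, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote of the stumps."""
    score = sum(alpha * stump_predict(t, p, x) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]   # no single stump fits this pattern
for rounds in (1, 3):
    ensemble = adaboost(xs, ys, rounds)
    correct = sum(predict(ensemble, x) == y for x, y in zip(xs, ys))
    print(rounds, "round(s):", correct, "of", len(xs), "correct")
```

On this toy dataset a single stump misclassifies three points, while three boosted stumps classify every training point correctly.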
Bagging is short for bootstrap aggregation. Like boosting, bagging varies the training set while holding the type of induction algorithm constant. The technique generates a set of bootstrap samples from the overall training set, fits the same induction algorithm to each sample, and then combines the predictions of the fitted models. (Different bagging algorithms may use different aggregation rules.)
Bagging can improve the performance of algorithms that are unstable in the sense that small changes in the training set can yield very different predictions. But it can degrade the performance of stable algorithms.
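The bagging recipe can be sketched in a few lines. We assume, purely for illustration, a one-nearest-neighbor regressor as the (unstable) induction algorithm and simple averaging as the aggregation rule; the data and function names are hypothetical.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) points from data, with replacement."""
    return [rng.choice(data) for _ in data]

def fit_1nn(sample):
    """'Fit' a one-nearest-neighbor regressor by remembering the sample."""
    def predict(x):
        nearest = min(sample, key=lambda point: abs(point[0] - x))
        return nearest[1]
    return predict

def bagged_regressor(data, n_models, seed=0):
    """Fit the same algorithm to each bootstrap sample; average the predictions."""
    rng = random.Random(seed)
    models = [fit_1nn(bootstrap_sample(data, rng)) for _ in range(n_models)]
    def predict(x):
        return sum(model(x) for model in models) / len(models)
    return predict

data = [(0, 0.1), (1, 2.3), (2, 3.8), (3, 6.2), (4, 7.9)]   # (x, y) pairs
bagged = bagged_regressor(data, n_models=25)
print(bagged(2.4))
```

Each sub-model sees a slightly different training set, so averaging smooths out the jumps that any single nearest-neighbor model would make.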
The random-subspace method (sometimes termed feature bagging) uses bootstrap sampling to create a collection of feature subsets, and fits the same type of induction algorithm to each subset. (The bootstrap sampling means the same feature can appear in several sub-models’ feature sets; the feature sampling is sampling with replacement, across the sub-models. Of course, the same feature can appear at most once within any given sub-model’s feature set, making the feature sampling for a given sub-model sampling without replacement.) A random-subspace method may fit the sub-models to the entire training dataset, or it may fit different sub-models to different subsets of the training set. For example, the most popular version of the random-forest model (the version implemented in Alteryx’s Forest Model tool) combines feature bagging with ordinary (training set) bagging.
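The feature-sampling step can be sketched as follows. The feature names, subset size, and helper functions are hypothetical; the sketch only shows how the subsets are drawn, not how the sub-models are fitted.

```python
import random

def random_subspaces(features, subset_size, n_models, seed=0):
    """Draw one feature subset per sub-model.

    Within a subset, rng.sample draws without replacement, so no feature
    repeats. Across subsets the draws are independent, so the same feature
    can appear in several sub-models' feature sets.
    """
    rng = random.Random(seed)
    return [sorted(rng.sample(features, subset_size)) for _ in range(n_models)]

def project(row, subset):
    """Keep only the chosen features of one data row (a dict)."""
    return {name: row[name] for name in subset}

features = ["age", "income", "tenure", "region", "plan"]
subsets = random_subspaces(features, subset_size=3, n_models=4)
row = {"age": 41, "income": 52000, "tenure": 7, "region": "west", "plan": "basic"}
for subset in subsets:
    print(subset, project(row, subset))
```

Each sub-model would then be fitted to the projected rows for its own subset.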
Stacking (sometimes called blending) trains several sub-models having arbitrary induction algorithms and feature sets on the entire training set, and then trains another algorithm to decide which sub-model to use in predicting a given data point’s dependent-variable value. Stacking generally outperforms each of its sub-models, and is frequently used in winning entries of data-science competitions.
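Here is a deliberately tiny sketch of the per-data-point selection idea. We assume two already-fitted sub-models (stand-in lambdas) and use a one-nearest-neighbor rule as the second-level algorithm; all of these choices are hypothetical. Real stacking implementations typically train the second-level model on held-out sub-model predictions to limit overfitting.

```python
def fit_selector(sub_models, train):
    """Learn which sub-model to trust near each training point."""
    choices = []
    for x, y in train:
        errors = [abs(model(x) - y) for model in sub_models]
        choices.append((x, errors.index(min(errors))))   # index of the best sub-model
    def predict(x):
        # Second-level model (an assumption): copy the choice of the nearest training point.
        _, best = min(choices, key=lambda item: abs(item[0] - x))
        return sub_models[best](x)
    return predict

# Stand-ins for fitted sub-models: each is right on only half of the input range.
sub_models = [lambda x: x, lambda x: -x]
train = [(x, abs(x)) for x in (-3, -2, -1, 1, 2, 3)]     # target is |x|
stacked = fit_selector(sub_models, train)
print(stacked(-2.5), stacked(1.4))                       # 2.5 1.4
```

Neither sub-model fits the target on its own, but the selector recovers it by routing each data point to the sub-model that works in that region.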
Bucket of Models
Bucket of models is like stacking, but it chooses a sub-model for an entire problem rather than for a single data point. There are three ways to view bucket of models:
The ensemble method defines a small, manually fixed set of sub-models having different feature sets and/or different induction algorithms, but a common measure of model fitness. For each new problem, the method trains each sub-model and chooses the sub-model having optimal fitness.
When the bucket of models only includes all models having a fixed set of input variables (superset of possible features) and a fixed type of induction algorithm, the model selection reduces to a model-fitting problem. Basic multi-level cross-validation tuning (see our previous post) with grid search is an example.
When the ensemble method defines an entire space of possible sub-models (including possible features and different induction algorithms), we arrive at the broad notion of model selection described in our previous blog post. Alteryx has a partnership with Data Robot, whose tool uses exhaustive search to perform model selection. Genetic algorithms are another interesting approach to this type of model selection.
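The first view above (a small, fixed bucket with a common fitness measure) can be sketched as follows. The two candidate algorithms, the mean-squared-error fitness measure, and the train/validation split are our own assumptions for illustration.

```python
def fit_mean(train):
    """Candidate 1: always predict the training mean."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Candidate 2: one-variable least-squares line."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def mean_squared_error(model, data):
    """The common fitness measure (lower is better)."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

def bucket_of_models(fitters, train, validation):
    """Train every candidate, then keep the one with the best validation fitness."""
    scored = []
    for name, fit in fitters.items():
        model = fit(train)
        scored.append((mean_squared_error(model, validation), name, model))
    _, best_name, best_model = min(scored, key=lambda item: item[0])
    return best_name, best_model

train = [(0, 1), (1, 3), (2, 5), (3, 7)]
validation = [(4, 9), (5, 11)]
name, model = bucket_of_models({"mean": fit_mean, "line": fit_line}, train, validation)
print(name, model(10))
```

Unlike stacking, the choice is made once for the whole problem, so every new data point is scored by the same winning sub-model.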
How Many Sub-Models?
Whether and how to limit the number of sub-models in an ensemble is an important question, and an area of current research. The most basic problem is data dredging. If an ensemble method selects sub-models from a sufficiently large hypothesis space (space of possible sub-models), it is likely to find a combination having high out-of-sample fitness, even if there is no true relationship between the ensemble and the data population. Such false relationships are termed spurious. Exhaustive search and large model spaces both increase the risk of finding a spurious relationship. So, as you consider using an ensemble technique, think about whether the ensemble’s hypothesis space and search algorithm will conduct a sufficiently narrow search to avoid data dredging.
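A small simulation illustrates the danger. Below, the labels are coin flips and every candidate "model" is a table of random guesses, so there is nothing to learn; yet searching enough candidates still turns up one that looks good on a small validation set. The sample sizes and function names are arbitrary assumptions.

```python
import random

def accuracy(model, labels, indices):
    """Fraction of the given data points on which the model matches the label."""
    return sum(model[i] == labels[i] for i in indices) / len(indices)

def dredge(n_models, n_validation=20, n_test=500, seed=0):
    """Pick the best of n_models random guessers by validation accuracy.

    Returns (validation accuracy, fresh-test accuracy) of the selected model.
    """
    rng = random.Random(seed)
    n = n_validation + n_test
    labels = [rng.random() < 0.5 for _ in range(n)]      # pure noise
    validation = range(n_validation)
    test = range(n_validation, n)
    best_model, best_score = None, -1.0
    for _ in range(n_models):
        model = [rng.random() < 0.5 for _ in range(n)]   # random guesses
        score = accuracy(model, labels, validation)
        if score > best_score:
            best_model, best_score = model, score
    return best_score, accuracy(best_model, labels, test)

for n_models in (1, 1000):
    validation_score, test_score = dredge(n_models)
    print(n_models, "candidates:", validation_score, "validation,", test_score, "fresh test")
```

With a thousand candidates the selected model's validation accuracy should be well above chance, while its accuracy on fresh data stays near 50%. The apparent fit is spurious: it comes from the breadth of the search, not from any real relationship.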
Onward to Actual Design Patterns!
If you’ve read the first seven Alteryx Data-Science Design Pattern blog posts, you’ve learned all of the essential concepts you need in order to understand
how real-world data-science design patterns are constructed
why they work well for their use cases
when and how to apply them to new problems.
Starting with our first post in January of 2017, we’ll begin exploring real-world design patterns, and showing you how to implement them in Alteryx and R.