Alteryx lets “citizen” as well as professional data scientists construct rich analytical business models. The same wealth of tooling that makes this richness possible also means that modelers face many model-design decisions. Citizen data scientists especially may be unaware of the alternatives, or of the subtleties that determine which alternatives best fit a given business problem. One way to tackle this complexity effectively is to apply data-science design patterns.
In general, a design pattern is a reusable solution to a given class of design problems that is recognized by experts as an effective approach to those problems. An antipattern is a common approach to a given class of design problems that experts recognize as risky, ineffective, or counterproductive. A collection of design patterns and antipatterns for a given domain is sometimes called a pattern language. (See the Wikipedia articles on design patterns and antipatterns to learn more.)
A data-science design pattern (DSDP) is a design pattern for a data-science design problem. This blog series will present a pattern language for practicing data science on the Alteryx platform. Learning this pattern language will help you build better models with less effort, especially on Alteryx.
The most general DSDP prescribes the parts of a predictive model. So let’s start by learning that pattern. A predictive model should have eight parts:
The first four parts together prepare the model’s input data. The second four parts construct the model using the prepared input data. The rest of this blog explains and illustrates the first four parts. We’ll devote a separate post to explaining and illustrating each of the second four.
Many disciplines contribute to data science. As a result, there are often several words for the same data-science idea. For example, a model’s raw input data may be called (among other things) input variables, source variables, independent variables, attributes, or dimensions. We’ll stick with input variables. This is the data you explore (and perhaps collect), before you change it in any way, and before you decide which parts of it matter for a specific modeling problem. Usually the input data appears in a table. Each row (other than the heading row, if any) represents instances of a group of things you want to model (often called the population). Each column (other than the columns identifying the population members) contains an input variable.
For example, you might be a medical researcher having access to an electronic medical record (EMR) database containing patient data collected by primary-care physicians. The available EMR input variables might look like this:
| Gender | Height (in) | Age | Weight (lbs) | BMI | % Body Fat | IQ |
|--------|-------------|-----|--------------|-----|------------|-----|
| 0 | 66.97 | 28 | 194.9 | 30.5 | 30.4 | 106 |
| 0 | 66.40 | 38 | 196.5 | 31.4 | 31.0 | 92 |
| 1 | 65.71 | 78 | 142.5 | 23.2 | 30.0 | 102 |
| 0 | 70.18 | 31 | 177.9 | 25.4 | 21.7 | 99 |
| 0 | 68.46 | 22 | 164.7 | 24.7 | 22.8 | 118 |
| 1 | 62.83 | 34 | 170.9 | 30.4 | 28.8 | 93 |
| 0 | 74.81 | 45 | 184.0 | 23.1 | 21.2 | 77 |
| 0 | 70.53 | 30 | 154.8 | 21.9 | 16.9 | 96 |
| 0 | 71.35 | 44 | 234.6 | 32.4 | 29.5 | 83 |
| 1 | 63.36 | 78 | 116.0 | 20.4 | 24.6 | 98 |
Table 1: Sample Input Variables
(Yes, it really would be odd to include IQ in an EMR. You'll see below that IQ is a good example of an irrelevant variable, which is why it's part of this example.) In future posts we'll explore design patterns that can help you decide which data sources and datasets are likely to contain legitimate input variables for a given problem.
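If you work with this kind of input table outside Alteryx, you might load and inspect it with pandas. Here is a sketch using a few rows of the EMR sample above (the column names and the in-memory construction are illustrative; in practice you'd read from a file or database):

```python
import pandas as pd

# A few rows of the hypothetical EMR sample; in practice you might use
# pd.read_csv("emr_sample.csv") or a database query instead.
emr = pd.DataFrame({
    "Gender":     [0, 0, 1, 0],
    "Height_in":  [66.97, 66.40, 65.71, 70.18],
    "Age":        [28, 38, 78, 31],
    "Weight_lbs": [194.9, 196.5, 142.5, 177.9],
    "BMI":        [30.5, 31.4, 23.2, 25.4],
    "PctBodyFat": [30.4, 31.0, 30.0, 21.7],
    "IQ":         [106, 92, 102, 99],
})

# A quick look at the input variables before transforming anything.
print(emr.describe())
```

Each row is one member of the population (a patient), and each column is one input variable, exactly as in the table.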
Quite often you have more input variables than are useful in making a single type of prediction. For example, a model predicting a person's percent body fat might only need height, weight, age, and gender; adding the other variables (IQ, say) to the model might not improve its predictions. Choosing a good set of input variables is surprisingly tricky, for several reasons. Here are four of the most important:
There are design patterns that can guide you in deciding when and how to address these issues, in part by choosing a good set of input variables. It’s important to recognize now that your decisions about input variables, variable transformations, and induction algorithms are related. Failure to make these decisions together often amounts to an antipattern we’ll review later.
You may have noticed that we did not include a part for choosing a model's induction algorithm, even though we mentioned above that variable selection is related to that choice. Until recently, the choice of induction algorithm has usually been made manually. Model-selection (and model-combination) algorithms are, however, an area of active research. If you know enough about data science to be familiar with these developments, you probably don't need to read this blog series.
A common approach to variable selection is to use a simple algorithm that has a built-in variable-selection method to choose the input variables for another algorithm lacking its own built-in variable-selection method. This approach is the filter variable selection pattern. Random-forest algorithms are often used as variable-selection filters. In Alteryx, the Forest Model predictive tool displays one of the two variable-importance plots generated by R’s randomForest package. Figure 1 below includes both plots, for a model that uses all of the other variables in the table above to predict percent body fat:
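Outside Alteryx, the same filter pattern can be sketched with scikit-learn's random forest. This is an illustrative sketch on synthetic data (the variable names mirror the table above, and the data-generating coefficients are invented so that IQ is irrelevant by construction):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for the EMR variables; percent body fat depends on
# gender, height, weight, and age, while IQ is deliberately irrelevant.
gender = rng.integers(0, 2, n)
height = rng.normal(68, 3, n)
weight = rng.normal(175, 25, n)
age = rng.integers(20, 80, n)
iq = rng.normal(100, 12, n)
pct_fat = (0.15 * weight - 0.3 * height + 0.1 * age + 5 * gender
           + rng.normal(0, 2, n))

X = np.column_stack([gender, height, weight, age, iq])
names = ["Gender", "Height", "Weight", "Age", "IQ"]

# Fit the filter model and rank variables by its importance scores.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, pct_fat)
ranked = sorted(zip(names, forest.feature_importances_),
                key=lambda pair: -pair[1])
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Variables scoring near zero (here, IQ) become candidates to drop before fitting a second algorithm that lacks built-in variable selection.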
Figure 1: R randomForest Variable-Importance Plots
Note the differing estimates of variable importance, in this case generated by a single algorithm. (Note too that IQ doesn't matter by either measure!)
In this blog series you’ll learn use cases for filter variable selection and its alternatives. You’ll also learn which algorithms make good variable-selection filters.
Many Alteryx users will be familiar with some common transformations. A transformation is a rule for changing a variable’s value. There are several very general transformation DSDPs:
There are many DSDPs within these general patterns. Future posts will explore some of the most useful transformation patterns.
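As a small concrete taste, here is a sketch of two transformations you'll see often: standardization (rescaling a variable to mean 0 and standard deviation 1) and a log transformation (compressing a right-skewed variable's range). The sample values are drawn from the weight column of Table 1; the choice of these particular transformations is illustrative, not a prescription:

```python
import math

weights_lbs = [194.9, 196.5, 142.5, 177.9, 164.7]  # values from Table 1

# Standardization: subtract the mean, divide by the standard deviation.
mean = sum(weights_lbs) / len(weights_lbs)
var = sum((w - mean) ** 2 for w in weights_lbs) / len(weights_lbs)
std = math.sqrt(var)
z_scores = [(w - mean) / std for w in weights_lbs]

# Log transformation: compress large values more than small ones.
log_weights = [math.log(w) for w in weights_lbs]

print(z_scores)
print(log_weights)
```

Each rule takes a variable's value and returns a changed value, which is all a transformation is.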
Transformations generally have one of the following motivations:
Of course, here as elsewhere, when we describe a DSDP, we’ll always tell you what motivates the pattern, so you know when the pattern applies.
We will call the variables that result from applying transformations to input variables derived variables. A model's features are the variables we actually input to the induction algorithm. The features can include (raw) input variables and derived variables.
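In fact, BMI in Table 1 is itself a derived variable: it can be computed from the height and weight columns. A sketch of the standard formula (weight in pounds, height in inches, with the conventional 703 unit-conversion factor):

```python
def bmi(weight_lbs: float, height_in: float) -> float:
    """Body mass index from pounds and inches (703 converts to kg/m^2)."""
    return 703 * weight_lbs / height_in ** 2

# First row of Table 1: height 66.97 in, weight 194.9 lbs, reported BMI 30.5.
print(round(bmi(194.9, 66.97), 1))  # → 30.5
```

A modeler who had only height and weight could derive BMI this way and feed it to the induction algorithm as a feature alongside (or instead of) the raw inputs.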
In our next post we’ll explain a model’s functional form. (Every model has one; some just hide it better than others!)