Alteryx lets “citizen” as well as professional data scientists construct rich analytical business models. The same wealth of tooling that makes this richness possible also means that modelers face many model-design decisions. Citizen data scientists especially may be unaware of the alternatives, or of the subtleties that determine which alternatives best fit a given business problem. One way to tackle this complexity effectively is to apply data-science design patterns.
In general, a design pattern is a reusable solution to a given class of design problems that is recognized by experts as an effective approach to those problems. An antipattern is a common approach to a given class of design problems that experts recognize as risky, ineffective, or counterproductive. A collection of design patterns and antipatterns for a given domain is sometimes called a pattern language. (See the Wikipedia articles on design patterns and antipatterns to learn more.)
A data-science design pattern (DSDP) is a design pattern for a data-science design problem. This blog series will present a pattern language for practicing data science on the Alteryx platform. Learning this pattern language will help you build better models with less effort, especially on Alteryx.
The most general DSDP prescribes the parts of a predictive model. So let’s start by learning that pattern. A predictive model should have eight parts:
The first four parts together prepare the model’s input data. The second four parts construct the model using the prepared input data. The rest of this blog explains and illustrates the first four parts. We’ll devote a separate post to explaining and illustrating each of the second four.
Many disciplines contribute to data science. As a result, there are often several words for the same data-science idea. For example, a model’s raw input data may be called (among other things) input variables, source variables, independent variables, attributes, or dimensions. We’ll stick with input variables. This is the data you explore (and perhaps collect), before you change it in any way, and before you decide which parts of it matter for a specific modeling problem. Usually the input data appears in a table. Each row (other than the heading row, if any) represents instances of a group of things you want to model (often called the population). Each column (other than the columns identifying the population members) contains an input variable.
For example, you might be a medical researcher having access to an electronic medical record (EMR) database containing patient data collected by primary-care physicians. The available EMR input variables might look like this:
| Gender | Height (in) | Age | Weight (lbs) | BMI | % Body Fat | IQ |
|--------|-------------|-----|--------------|-----|------------|-----|
| 0 | 66.97 | 28 | 194.9 | 30.5 | 30.4 | 106 |
| 0 | 66.40 | 38 | 196.5 | 31.4 | 31.0 | 92 |
| 1 | 65.71 | 78 | 142.5 | 23.2 | 30.0 | 102 |
| 0 | 70.18 | 31 | 177.9 | 25.4 | 21.7 | 99 |
| 0 | 68.46 | 22 | 164.7 | 24.7 | 22.8 | 118 |
| 1 | 62.83 | 34 | 170.9 | 30.4 | 28.8 | 93 |
| 0 | 74.81 | 45 | 184.0 | 23.1 | 21.2 | 77 |
| 0 | 70.53 | 30 | 154.8 | 21.9 | 16.9 | 96 |
| 0 | 71.35 | 44 | 234.6 | 32.4 | 29.5 | 83 |
| 1 | 63.36 | 78 | 116.0 | 20.4 | 24.6 | 98 |
Table 1: Sample Input Variables
(Yes, it really would be odd to include IQ in an EMR. You'll see below that IQ is a good example of an irrelevant variable, which is why it's part of this example.) In future posts we'll explore design patterns that can help you decide which data sources and datasets are likely to contain legitimate input variables for a given problem.
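If you work with this kind of input table outside Alteryx, you might load and inspect it with pandas. Here is a sketch using a few rows of the EMR sample above (the column names and the in-memory construction are illustrative; in practice you'd read from a file or database):

```python
import pandas as pd

# A few rows of the hypothetical EMR sample; in practice you might use
# pd.read_csv("emr_sample.csv") or a database query instead.
emr = pd.DataFrame({
    "Gender":     [0, 0, 1, 0],
    "Height_in":  [66.97, 66.40, 65.71, 70.18],
    "Age":        [28, 38, 78, 31],
    "Weight_lbs": [194.9, 196.5, 142.5, 177.9],
    "BMI":        [30.5, 31.4, 23.2, 25.4],
    "PctBodyFat": [30.4, 31.0, 30.0, 21.7],
    "IQ":         [106, 92, 102, 99],
})

# A quick look at the input variables before transforming anything.
print(emr.describe())
```

Each row is one member of the population (a patient), and each column is one input variable, exactly as in the table.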
Quite often you have more input variables than are useful in making a single type of prediction. For example, a model predicting a person's percent body fat might only need height, weight, age, and gender; adding the other variables (IQ, say) to the model might not improve its predictions. Choosing a good set of input variables is surprisingly tricky, for several reasons. Here are four of the most important:
There are design patterns that can guide you in deciding when and how to address these issues, in part by choosing a good set of input variables. It’s important to recognize now that your decisions about input variables, variable transformations, and induction algorithms are related. Failure to make these decisions together often amounts to an antipattern we’ll review later.
You may have noticed that we did not include a part for choosing a model's induction algorithm, even though we mentioned above that variable selection is related to that choice. Until recently, the choice of induction algorithm has usually been made manually. Model-selection (and model-combination) algorithms are, however, an area of active research. If you know enough about data science to be familiar with these developments, you probably don't need to read this blog series.
A common approach to variable selection is to use a simple algorithm that has a built-in variable-selection method to choose the input variables for another algorithm lacking its own built-in variable-selection method. This approach is the filter variable selection pattern. Random-forest algorithms are often used as variable-selection filters. In Alteryx, the Forest Model predictive tool displays one of the two variable-importance plots generated by R’s randomForest package. Figure 1 below includes both plots, for a model that uses all of the other variables in the table above to predict percent body fat:
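Outside Alteryx, the same filter pattern can be sketched with scikit-learn's random forest. This is an illustrative sketch on synthetic data (the variable names mirror the table above, and the data-generating coefficients are invented so that IQ is irrelevant by construction):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for the EMR variables; percent body fat depends on
# gender, height, weight, and age, while IQ is deliberately irrelevant.
gender = rng.integers(0, 2, n)
height = rng.normal(68, 3, n)
weight = rng.normal(175, 25, n)
age = rng.integers(20, 80, n)
iq = rng.normal(100, 12, n)
pct_fat = (0.15 * weight - 0.3 * height + 0.1 * age + 5 * gender
           + rng.normal(0, 2, n))

X = np.column_stack([gender, height, weight, age, iq])
names = ["Gender", "Height", "Weight", "Age", "IQ"]

# Fit the filter model and rank variables by its importance scores.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, pct_fat)
ranked = sorted(zip(names, forest.feature_importances_),
                key=lambda pair: -pair[1])
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Variables scoring near zero (here, IQ) become candidates to drop before fitting a second algorithm that lacks built-in variable selection.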
Figure 1: R randomForest Variable-Importance Plots
Note the differing estimates of variable importance, in this case generated by a single algorithm. (Note too that IQ doesn't matter by either measure!)
In this blog series you’ll learn use cases for filter variable selection and its alternatives. You’ll also learn which algorithms make good variable-selection filters.
Many Alteryx users will be familiar with some common transformations. A transformation is a rule for changing a variable’s value. There are several very general transformation DSDPs:
There are many DSDPs within these general patterns. Future posts will explore some of the most useful transformation patterns.
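As a small concrete taste, here is a sketch of two transformations you'll see often: standardization (rescaling a variable to mean 0 and standard deviation 1) and a log transformation (compressing a right-skewed variable's range). The sample values are drawn from the weight column of Table 1; the choice of these particular transformations is illustrative, not a prescription:

```python
import math

weights_lbs = [194.9, 196.5, 142.5, 177.9, 164.7]  # values from Table 1

# Standardization: subtract the mean, divide by the standard deviation.
mean = sum(weights_lbs) / len(weights_lbs)
var = sum((w - mean) ** 2 for w in weights_lbs) / len(weights_lbs)
std = math.sqrt(var)
z_scores = [(w - mean) / std for w in weights_lbs]

# Log transformation: compress large values more than small ones.
log_weights = [math.log(w) for w in weights_lbs]

print(z_scores)
print(log_weights)
```

Each rule takes a variable's value and returns a changed value, which is all a transformation is.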
Transformations generally have one of the following motivations:
Of course, here as elsewhere, when we describe a DSDP, we’ll always tell you what motivates the pattern, so you know when the pattern applies.
We will call the variables that result from applying transformations to input variables derived variables. A model's features are the variables we actually input to the induction algorithm. The features can include (raw) input variables and derived variables.
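In fact, BMI in Table 1 is itself a derived variable: it can be computed from the height and weight columns. A sketch of the standard formula (weight in pounds, height in inches, with the conventional 703 unit-conversion factor):

```python
def bmi(weight_lbs: float, height_in: float) -> float:
    """Body mass index from pounds and inches (703 converts to kg/m^2)."""
    return 703 * weight_lbs / height_in ** 2

# First row of Table 1: height 66.97 in, weight 194.9 lbs, reported BMI 30.5.
print(round(bmi(194.9, 66.97), 1))  # → 30.5
```

A modeler who had only height and weight could derive BMI this way and feed it to the induction algorithm as a feature alongside (or instead of) the raw inputs.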
In our next post we’ll explain a model’s functional form. (Every model has one; some just hide it better than others!)