Data Science

ToddM · ‎11-23-2016

Alteryx Data Science Design Patterns: Predictive Model Form, Part Four

In our third post we learned about induction algorithms, the machinery within an analytical model that one uses to calculate a prediction (or, during model development, a retrodiction). We discussed “estimating” the values of an induction algorithm’s parameters to fit a given dataset, but we didn’t dwell on what that means.

In this post we’ll explore what it means to fit an induction algorithm to a dataset. Again, it’s important to realize that when someone refers to fitting a model to a dataset, what they really mean is fitting the model’s induction algorithm to the dataset. This distinction is frequently glossed.

Sampling the Generating Process

To discuss fitting (and other subjects, later) intelligently, we need to define some terms. Stay with me for a few paragraphs. No way around this stuff!

Recall that a model’s generating process is an empirical process that produces data. The data is a collection of data points. Each data point is a vector of values for a fixed set of variables (including the model’s dependent variable), with one value for each input variable.

The set of theoretically possible data points is the model’s sample space. A point in the sample space is termed an outcome. (Thus in data science an outcome is a possible, not actual, data point.) An event is a set of outcomes. An event occurs when one of the outcomes it contains is realized as a data point.

The generating process includes a set of individuals, the generating process’ population, that generates data points. (The population members are sometimes termed units.) In some cases each member only generates one data point. In others, each member may generate several data points. (Time-series processes are a common case.) So there is not always a one-to-one correspondence between population members and the data points they produce.

Unfortunately we lack a generally accepted term for the set of data points that a generating process has produced or will produce. Let’s call this set the data population. The data population is often too large or expensive to measure entirely. (Sometimes it is infinite, so that gathering it is impossible even in principle.) Instead, we gather a subset of the data population, termed a sample. The points in the sample are, naturally, sample points or observations. The number of sampled data points is the sample size n. Likewise, the number of input variables defining the sample space, and appearing in the data population and any sample thereof, is the sample space’s dimension p. So a sample is just an n × p table of data. (Points in the data population that are not in the sample are out of sample.)

There are many ways to gather a sample systematically. We won’t go over them today. For now just know that it’s not enough to say, for example, that we want to collect a “random sample.”

When we fit an induction algorithm to a dataset, the dataset is necessarily part or all of a sample. We usually divide the sample into two or three parts: the training, validation, and test sets. We use the training set to fit the induction algorithm, and the test set is used to test the fitted algorithm. (When we fit and then compare several induction algorithms, we use the validation set to compare performance, and then use the test set to test the induction algorithm chosen during validation.) Cross-validation is a more complicated way to use a sample to fit, validate, and test induction algorithms. We’ll return to these subjects in later posts. For now we’ll focus on training and test sets.

Finally: a retrodiction is an induction algorithm’s estimate of a sample data point’s dependent-variable value. A prediction is the algorithm’s estimate of an out-of-sample data point’s dependent-variable value. A residual is the difference between a data point’s observed dependent-variable value and its estimated (by retrodiction or prediction) value. And an error is the difference between a data point’s observed dependent-variable value and its true value.

undefined

Figure 1: Error vs. Residual

Let’s illustrate some of the above concepts with an example. Consider the population of all humans born in the decade starting January 1, 2,000 GMT, who reach 18 years of age. Define the sample space to be all possible vectors of the form (height, weight, percent body fat, age, gender), so p = 5. Let the data population be the set of actual data points of the same form occurring on the first day of each month. We sample the data population by collecting data points monthly over the entire decade from 10,000 people around the world, chosen (in some sense) at random, so n = 12 × 10 × 10,000 = 1.2 × 10⁶. (While this sample is large, it is only about 0.1% of the data population.) Finally, we divide the sample into a training set containing 70% (0.84 × 10⁶) and a test set containing 30% (0.36 × 10⁶) of the sample.

What Models Want

Enough with the dictionary, and back to fitting induction algorithms!

The essence of fitting an induction algorithm to a dataset is choosing values for the algorithm’s parameters to minimize some function of the resulting residuals over the entire data population. In practice most models pursue this goal by balancing two other goals:

Avoid underfitting: Minimize some function of the training set’s residuals.
Avoid overfitting: Minimize generalization error, some function of the estimated out-of-sample residuals.

Notice that the second goal does not refer to the test set. We’ll return to this point.

A fitness function (also termed a loss or cost function) is a real-valued function that combines the two functions above. Let’s call them the underfitting penalty function and the overfitting penalty function. A model fits an induction algorithm to a sample by searching the space of possible induction-algorithm parameter values, to find a vector of parameter values that minimizes the value of the fitness function. (The search algorithm is the model’s fitting algorithm, which we’ll discuss next time.) Thus a model’s fitness function expresses the model’s approach to the underfitting/overfitting tradeoff.

Fitting the Data We Have

By far the most popular underfitting penalty function is ordinary least squares (OLS), sometimes termed the L2 penalty. It is the sum of the squared residuals (taken over the training set when fitting an induction algorithm, or over the test test when testing the algorithm). A common alternative is least absolute deviations (LAD), the sum of the residuals’ absolute values, or the L1 penalty. Each has advantages. There are many others. Knowing what they are, what their advantages are, and what model types use them helps us apply good judgment in choosing a model type. (Later we’ll also learn about model-quality metrics we can use to compare model quality across model types having different fitness functions.)

Let’s cook up a trivial example, to illustrate. Our training dataset is n = 10 data points from the EMR dataset we’ve used in previous examples. We use the Linear Regression tool to fit an OLS model for the R formula pbf ~ gender + height_inches + age + weight_lbs.

undefined

Figure 2: Example OLS Model

An R script to fit the same formula using LAD regression is

library(L1pack)

emr_df <- read.csv("c:\\temp\\simulated_emr.100.csv")

emr_df <- emr_df[1:10,]

emr_df <- emr_df[, c(1:4, 6)]

lad(pbf ~ ., emr_df)

Figure 3: R LAD Model

(For now we’ll forego wrapping this script in Alteryx’s R tool. But it’s easy to do so!)

Table 1 below compares the two models’ parameters (coefficients):

Model	Coefficients
Model	Intercept	gender	height_inches	age	weight_lbs
OLS	75.3	-1.43	-1.18	0.134	0.147
LAD	67.6	-1.81	-1.03	0.094	0.143

Table 1: Comparison of OLS and LAD Linear Models

All but the weight_lbs coefficients are substantially different. Table 2 below compares the two models’ estimates, pointwise penalties, and overall penalties (fitness-function values):

	OLS Linear Model			LAD Linear Model
pbf	Estimate	Residual	Penalty	Estimate	Residual	Penalty
30.36986226	28.53824	1.8316237	3.3548454	29.24661	1.12326	1.12326
31.01370197	30.80003	0.2136681	0.0456541	31.01371	0	0
29.97384641	27.60312	2.3707263	5.6203432	25.94489	4.02896	4.02896
21.68451052	22.6333	-0.9487877	0.9001981	23.78684	-2.102326	2.102326
22.82748668	21.52652	1.3009703	1.6925237	22.82749	0	0
28.82829863	29.28522	-0.4569181	0.2087742	28.8283	0	0
21.23275462	19.9498	1.2829523	1.6459666	21.23276	0	0
16.87325041	18.6938	-1.8205533	3.3144143	20.03399	-3.160737	3.160737
29.49844257	31.35832	-1.8598734	3.4591291	31.94303	-2.444587	2.444587
24.56796795	26.48178	-1.9138082	3.6626618	24.56797	0	0

		Fitness:	23.90451		Fitness:	12.85987

Table 2: Comparison of OLS and LAD Residuals and Penalties

It’s interesting to observe that the LAD model fits five of the data points perfectly. On the other hand, its worst-case residual is 4.03, compared to 2.34 for the OLS model.

Notice that neither OLS regression nor LAD regression has an overfitting penalty built into its fitness function. Both therefore illustrate an important antipattern: missing overfitting penalty. Beginning modelers often inadvertently use model types that fail to penalize overfitting, because such model types are usually easy to use and understand. Ironically, we’ll now see that overfitting penalties usually work by penalizing model complexity. So the right way to produce a model that’s as simple as it should be, but no simpler, is to use a model type whose fitness function includes an overfitting penalty.

Fitting the Data We Don’t Have

An induction algorithm that overfits its training data has a low underfitting penalty value for the training data, but a high underfitting penalty value for the rest of the data population (including the test set). The fitting process has overemphasized consistency with the training data at the expense of consistency elsewhere. It would be nice if we could somehow directly measure or estimate the undefitting penalty’s value on data outside the training set. That’s what the test set is for. But once we’ve testing the fitted induction algorithm, it’s too late to change the algorithm’s parameter values. We need some way to penalize overfitting while fitting (training) the algorithm, so it also fits the test set and out-of-sample data well.

By far the most common type of overfitting penalty is some function of model complexity. This strategy is termed regularization, and the overfitting penalty is called the regularization term of the fitness function. The intuition behind regularization is that there are two kinds of patterns in the training set: patterns that also occur outside the training set (global patterns), and patterns that don’t (local patterns). Hopefully the global patterns have a stronger effect on the value of the underfitting penalty than the local patterns. When that’s true, adding model complexity first captures the global patterns and later captures local patterns. The regularization term should be just strong enough to counterbalance the incremental improvement in the underfitting penalty, at the point where additional model complexity begins to capture local patterns. When that’s also true, we minimize the fitness function’s value by balancing underfitting and overfitting at the point where the fitted induction algorithm captures all and only global patterns. The underfitting penalty is minimal over the whole data population, at this point.

Unfortunately, there is no fully general concept of model complexity. In fact, many regularization penalties include an arbitrary parameter (a coefficient or bound) that must be chosen judgmentally. This fact highlights the degree to which a model’s concept of fitness is a subjective, judgmental decision. Happily there are widely accepted techniques to guide us in setting these parameters’ values.

One component of model complexity that’s available across all model types is the number of model features. For example, regularized linear-regression models reduce the number or contributions of model features by excluding or de-emphasizing variables that make little contribution to reducing underfitting. For example, lasso regression forces the sum of the absolute values of the regression coefficients not to exceed a constant (it is an L1 penalty on the coefficients). The effect is often to “shrink” some of the coefficients to zero, thereby dropping the corresponding model feature. For example, lasso regression on all five of the other example variables to predict pdf shrinks the coefficients of the gender and iq variables to zero, reducing the set of independent model features to height_inches, age, and weight_lbs. (The R code involves some technicalities beyond this post, so we’ll skip it for now. There are many examples online.) Ridge regression penalizes the sum of the squares of the coefficients (it is an L2 penalty on the coefficients), but without forcing that sum to be bounded, so it does not perform variable selection. Elastic-net regularization combines lasso and ridge penalties.

Other examples of model complexity include the depth and number of nodes in a decision-tree model, the number of trees in a random-forest model, the number of terms and size of the exponents in a polynomial function, and statistical learning theory’s VC dimension. The important thing for now is to recognize that different model types may include different notions of model complexity, or different penalties on the same concept of model complexity, even for the same type of induction algorithm. We implicitly choose among these when we choose a model type. This makes it important to learn about the practical consequences of specific types of fitness functions. Many of the data-science design patterns we’ll encounter relate a specific fitness function to a class of modeling problems.