Data Science

SydneyF · ‎02-08-2019

A guiding principle in scientific fields and general problem solving is Occam’s razor (also known as the law of parsimony). Credited to the 14th-century friar William of Ockham, Occam’s razor states that "simple solutions are more likely to be correct than complex ones." The “razor” refers to the process of distinguishing between two hypotheses by “shaving away” any unnecessary assumptions.

When there are multiple competing hypotheses (or models), the one that makes fewer assumptions will typically be the one that is selected. This is not a fundamental commandment of problem-solving or the scientific method. Rather, it is a general preference for simple explanations, in part because simpler theories are more easily tested and understood.

In the field of data science, Occam’s razor is often seen as an influence in model selection, particularly to combat overfitting. Overfitting describes modeling error caused by the model capturing the noise in a data set, instead of just describing the general pattern observed in the data. A model that is overfitted will not generalize to unseen data well because it has absorbed too much complexity from the training data, failing to distill the data down to its fundamental patterns and relationships.

Occam’s Razor and Model Overfitting

To combat overfitting, models are often simplified as a part of the training or model refinement process. Two common forms of this simplification are pruning (in decision trees) and regularization.

Pruning removes sections of a decision tree that do not add significant predictive power to the overall model. Eliminating branches that are not adding information to the model reduces the overall complexity of the model, and actually improves the ability of a model to generalize. To learn more about how pruning factors into a decision tree model, check out Planting Seeds: An Introduction to Decision Trees.
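As a rough illustration of the idea, here is a minimal sketch of reduced-error pruning, one of several pruning strategies and not necessarily the one any particular tool uses. The toy tree, the helper names (leaf, node, prune), and the validation rows are all made up for this example. A subtree is replaced by a leaf whenever the leaf does at least as well on the validation rows that reach it.

```python
# Toy decision tree built by hand; every name and value here is illustrative.

def leaf(label):
    return {"label": label}

def node(feature, threshold, left, right, majority):
    # 'majority' is the most common training label reaching this node
    # (assumed known from training).
    return {"feature": feature, "threshold": threshold,
            "left": left, "right": right, "majority": majority}

def predict(tree, row):
    while "label" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["label"]

def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(rows)

def count_leaves(tree):
    if "label" in tree:
        return 1
    return count_leaves(tree["left"]) + count_leaves(tree["right"])

def prune(tree, rows, labels):
    """Bottom-up reduced-error pruning against held-out rows."""
    if "label" in tree:
        return tree
    go_left = [r[tree["feature"]] <= tree["threshold"] for r in rows]
    left_rows = [r for r, g in zip(rows, go_left) if g]
    left_labels = [y for y, g in zip(labels, go_left) if g]
    right_rows = [r for r, g in zip(rows, go_left) if not g]
    right_labels = [y for y, g in zip(labels, go_left) if not g]
    pruned = dict(tree,
                  left=prune(tree["left"], left_rows, left_labels),
                  right=prune(tree["right"], right_rows, right_labels))
    subtree_hits = sum(predict(pruned, r) == y for r, y in zip(rows, labels))
    leaf_hits = sum(tree["majority"] == y for y in labels)
    # Collapse the branch if a single leaf is at least as accurate.
    return leaf(tree["majority"]) if leaf_hits >= subtree_hits else pruned

# The right-hand branch splits on feature 1, which is pure noise here.
tree = node(0, 5, leaf("no"),
            node(1, 3, leaf("yes"), leaf("no"), majority="yes"),
            majority="no")

val_rows = [(1, 1), (2, 4), (4, 2), (6, 1), (7, 4), (8, 2), (9, 5)]
val_labels = ["no", "no", "no", "yes", "yes", "yes", "yes"]

pruned_tree = prune(tree, val_rows, val_labels)
print(count_leaves(tree), accuracy(tree, val_rows, val_labels))
print(count_leaves(pruned_tree), accuracy(pruned_tree, val_rows, val_labels))
```

In this toy case the noisy split is removed and the smaller tree scores better on the validation rows. In practice the final accuracy should be checked on data that was not used to decide what to prune.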

Regularization applies a smoothing constraint to a model, either by fixing the number of parameters in the model or by adding a penalty term to the cost function that grows with the size of the predictor variables’ coefficients. This effectively penalizes large coefficients, which reduces the variance of a model while keeping the added bias small. Some regularization methods perform feature selection by reducing a predictor variable’s coefficient to zero. To learn more about applying regularization to a linear or logistic regression, read the Community article Regularization in Alteryx.
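To make the effect of the penalty concrete, here is a small sketch for the simplest possible case: one predictor and no intercept. The function names, the data, and the penalty values are assumptions chosen for illustration. With a squared (ridge) penalty the coefficient shrinks as the penalty grows; with an absolute-value (LASSO) penalty it can be driven all the way to zero.

```python
import math

def ridge_coef(x, y, lam):
    """Minimizer of sum((y - w*x)^2) + lam * w^2 for a single predictor."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

def lasso_coef(x, y, lam):
    """Minimizer of 0.5 * sum((y - w*x)^2) + lam * |w| (soft-thresholding)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    shrunk = max(abs(sxy) - lam, 0.0)
    return math.copysign(shrunk, sxy) / sxx

# Illustrative data: y is roughly 2 * x.
x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]

for lam in (0, 10, 30, 60):
    print(lam, round(ridge_coef(x, y, lam), 4), round(lasso_coef(x, y, lam), 4))
```

With a penalty of zero both functions return the ordinary least-squares coefficient. The multi-variable versions used in practice have no such one-line formulas, but the behavior is the same in spirit.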

Occam’s Razor and The Curse of Dimensionality

Another way to reduce the complexity of a model is through dimensionality reduction: systematically reducing the number of predictor variables included in a model, either by eliminating variables (feature selection) or by creating new, derived variables (feature extraction).

In addition to mitigating the risk of overfitting, a benefit of dimensionality reduction is that it combats the curse of dimensionality. The curse of dimensionality describes the combination of a variety of phenomena associated with high-dimensionality (i.e., a large number of predictor variables) that can ultimately cause models to perform poorly due to the model becoming overly complex.

Feature selection is based on the premise that some predictor variables are redundant or irrelevant and can be removed without losing much information. One method for feature selection is to manually reduce the predictor variables included in a model using correlation coefficients or background knowledge about what features may be relevant to the target variable (e.g., the number of bananas sold in Poughkeepsie is probably not relevant to estimating how many game consoles will sell in Des Moines).
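A correlation-based filter like the one described above might look like the following sketch. The feature names, the numbers, and the 0.5 cutoff are invented for illustration; a sensible threshold depends on the data set.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

def select_by_correlation(features, target, threshold=0.5):
    """Keep features whose absolute correlation with the target meets the threshold."""
    return [name for name, values in features.items()
            if abs(pearson(values, target)) >= threshold]

# Hypothetical data: console sales plus two candidate predictors.
console_sales = [10, 12, 15, 19, 24, 30]
features = {
    "ad_spend": [1, 2, 3, 4, 5, 6],
    "bananas_sold": [5, 1, 4, 2, 6, 3],
}

print(select_by_correlation(features, console_sales))
```

A filter like this only looks at one predictor at a time and only at linear relationships, so it is a first pass and not a substitute for background knowledge.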

More automated methods for feature selection include backward stepwise regression, which removes the least useful predictor variable from a model in each round and stops when a defined criterion is met; and the LASSO and elastic net methods of regularization, which eliminate redundant and irrelevant variables by shrinking their coefficients to zero.
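The backward-elimination loop can be sketched in plain Python as follows. This is a simplified illustration, not how any particular package implements it: the model has no intercept, the stopping rule is a fixed tolerance on the increase in the residual sum of squares (real implementations typically use criteria such as AIC or p-values), and the data and names are made up so that the target depends only on x1 and x2.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def rss(columns, y):
    """Residual sum of squares of a least-squares fit (no intercept)."""
    if not columns:
        return sum(v * v for v in y)
    gram = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    moment = [sum(p * q for p, q in zip(ci, y)) for ci in columns]
    w = solve(gram, moment)
    fitted = [sum(wi * ci[k] for wi, ci in zip(w, columns)) for k in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def backward_eliminate(features, y, tol=1e-6):
    """Drop the least useful predictor each round; stop when dropping hurts the fit."""
    kept = dict(features)
    while len(kept) > 1:
        current = rss(list(kept.values()), y)
        trial = {name: rss([c for n, c in kept.items() if n != name], y)
                 for name in kept}
        least_useful = min(trial, key=trial.get)
        if trial[least_useful] - current > tol:
            break
        del kept[least_useful]
    return list(kept)

# Illustrative data: y = 2 * x1 + 3 * x2, and x3 is irrelevant.
features = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 1, 4, 3, 6, 5],
    "x3": [5, 3, 1, 4, 2, 6],
}
y = [8, 7, 18, 17, 28, 27]

selected = backward_eliminate(features, y)
print(selected)
```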

Feature extraction is the process where predictor variables are transformed into a reduced set of features (sometimes called a feature vector), creating more informative and non-redundant predictor variables for a machine learning algorithm to use. A common feature extraction technique is principal component analysis (PCA). Feature extraction techniques such as PCA create combinations of provided predictor variables to sufficiently describe the given data set while reducing the number of predictor variables in the data set. Feature extraction techniques can make the predictor variables used in a model less interpretable, but they are often found to improve the predictive power of a model. The decision to use feature extraction will depend on the nature of your data set, and what is most important to your use case.
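For intuition, PCA on just two predictors can be written out by hand, because a 2x2 covariance matrix has a closed-form eigen-decomposition. The sketch below is an illustration under that assumption, with invented data and function names; for real data sets a library implementation would be the usual choice. It replaces two strongly related predictors with a single derived feature and reports how much of the total variance that feature retains.

```python
import math

def pca_2d(xs, ys):
    """Project two predictors onto their first principal component.

    Returns the one-dimensional scores and the share of variance they explain.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    # Entries of the sample covariance matrix [[a, b], [b, c]].
    a = sum(v * v for v in dx) / (n - 1)
    c = sum(v * v for v in dy) / (n - 1)
    b = sum(p * q for p, q in zip(dx, dy)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mid = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = mid + spread, mid - spread
    # Eigenvector belonging to the larger eigenvalue.
    if b != 0:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    scores = [p * vx + q * vy for p, q in zip(dx, dy)]
    return scores, lam1 / (lam1 + lam2)

# Two illustrative predictors that carry nearly the same information.
x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]

scores, explained = pca_2d(x1, x2)
print([round(s, 3) for s in scores], round(explained, 4))
```

The derived feature is a weighted mix of both inputs, which is exactly why it is harder to interpret than either original variable.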

When you are thinking about your predictor variables, take time to consider if they are adding value to understanding the target variable, or if they are only introducing noise. Ideally, you want to have just enough predictors to explain your target variable well, and no more.

AIC, BIC, and MDL

Model assessment is a critical step in the process of developing data science models. Measures such as the Akaike information criterion (AIC) and the related Bayesian information criterion (BIC) have been developed to help assess the relative quality of statistical models. Both reward a model for losing little information relative to the training data while penalizing the number of parameters it uses, which accounts for overfitting.
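In their general form, AIC is 2k - 2 ln(L) and BIC is k ln(n) - 2 ln(L), where k is the number of estimated parameters, n is the number of observations, and L is the maximized likelihood. For a least-squares model with normally distributed errors, both can be written in terms of the residual sum of squares (up to a constant that does not affect comparisons). The sketch below uses that form; the two models and their numbers are hypothetical.

```python
import math

def aic_least_squares(rss, n, k):
    """AIC for a Gaussian least-squares fit, dropping additive constants."""
    return n * math.log(rss / n) + 2 * k

def bic_least_squares(rss, n, k):
    """BIC for a Gaussian least-squares fit, dropping additive constants."""
    return n * math.log(rss / n) + k * math.log(n)

# Hypothetical comparison: the complex model fits slightly better
# but uses three times as many parameters.
n = 50
simple_model = {"rss": 20.0, "k": 2}
complex_model = {"rss": 19.5, "k": 6}

aic_simple = aic_least_squares(simple_model["rss"], n, simple_model["k"])
aic_complex = aic_least_squares(complex_model["rss"], n, complex_model["k"])
bic_simple = bic_least_squares(simple_model["rss"], n, simple_model["k"])
bic_complex = bic_least_squares(complex_model["rss"], n, complex_model["k"])

print(round(aic_simple, 2), round(aic_complex, 2))
print(round(bic_simple, 2), round(bic_complex, 2))
```

Lower values are better for both criteria. Here the small improvement in fit does not pay for the extra parameters, and BIC penalizes them more heavily than AIC because ln(50) is greater than 2.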

Considered by many researchers to be an equivalent of BIC, the minimum description length (MDL) principle is a formalization of Occam’s razor based on information theory and built off the concept of Kolmogorov complexity. MDL states that the best machine learning or statistical model is the one that provides the most compact description of the data and model itself, measured by the length of the code of the model and data.

The core concepts of MDL are that regularity in a data set can be leveraged to compress the data, and that machine learning or statistical models are finding regularities in the data. An efficient statistical model or machine learning algorithm will compress the data with a function that estimates the values in the data set. MDL views models as generated descriptions of observed data. This perspective can be leveraged to compare any two models, regardless of complexity, by code length. When two models fit a data set equally well, MDL will select the ‘simplest’ model, defined as the one that allows for a shorter description of the data.
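The link between regularity and compression is easy to see with an off-the-shelf compressor. The sketch below uses Python's zlib as a crude, practical stand-in for description length; this is an assumption made for illustration, since MDL itself is defined in terms of idealized codes and not any particular compression library. Data with a repeating pattern shrinks dramatically, while random data barely shrinks at all.

```python
import random
import zlib

random.seed(0)  # fixed seed so the example is repeatable
n = 10_000

# A highly regular sequence: the digits 0-9 repeated over and over.
regular = bytes(i % 10 for i in range(n))
# A sequence with no exploitable pattern.
noisy = bytes(random.randrange(256) for _ in range(n))

regular_len = len(zlib.compress(regular, 9))
noisy_len = len(zlib.compress(noisy, 9))

print("original length:  ", n)
print("regular compressed:", regular_len)
print("noisy compressed:  ", noisy_len)
```

A model plays the role of the compressor here: the more structure it captures, the fewer bytes are needed to describe the data given the model.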

MDL suggests an explanation for why overly complex representations tend to overfit data. When the encoding of a model is longer than the original data, or nearly the same length, nothing is gained in terms of description length. The specific data set is well described, but the model’s encoding does not simplify (compress) the data and is therefore not making any inductive conclusions about the pattern of the data. If you'd like to read more about MDL, check out Model Selection and the Principle of Minimum Description Length by Mark H. Hansen and Bin Yu and A Tutorial Introduction to the Minimum Description Length Principle by Peter Grünwald.

Occam’s Razor and You

Occam’s razor comes up in many different stages during the process of building a model. It is relevant to feature selection/feature engineering, model selection, and the way that the algorithms build and refine the models themselves. In each of these stages, the spirit of Occam’s razor is the same: “simple is better.” A simple model that fits a data set well is likely to capture the key features of that data, without assimilating too much noise. This is what makes parsimony desirable in the context of model building.

It is important to note that despite the wide popularity and acceptance of Occam’s razor, it is based on an assumption. There is little empirical evidence that demonstrates that the world is simple, or that simple explanations are generally more likely to be true than complex ones.

There is also a very real hazard of reducing complexity at the expense of accuracy. Only apply Occam’s razor when the predictive power of the two models is comparably good. In the words often attributed to Albert Einstein: “Things should be made as simple as possible, but no simpler.”

If you'd like to read further about Occam's razor and the world of data science, check out Chapter 28: Model Comparison and Occam’s Razor from Information Theory, Inference, and Learning Algorithms by David MacKay.

Simple is Best: Occam's Razor in Data Science