Data Science

KateP · ‎10-31-2019

Regressions are models that assess relationships between variables. Multiple regressions assess the potential power of multiple predictor variables (e.g. ball possession, corner kicks, pass completion) on a given dependent variable (e.g. goals in a soccer match, or whether the team will win or lose).

It can be tempting to add and add and add predictor variables to a linear or logistic regression in the spirit of trying to get a well-fitting model - what could go wrong?!

Well, if the predictor variables have not had their statistical significance or collinearity evaluated, there is a risk the model is going to be overfitted, confused by the complexity of so many variables, or suffer unnatural inflation of at least one estimated regression coefficient.

Enter stepwise regression. Stepwise regression helps select features (i.e. predictor variables) that optimize a regression model, while applying a penalty for including variables that cause undue model complexity. In Alteryx Designer, the Stepwise tool can be used for this process.

Feature Selection – Why?

Reducing the number of predictor variables through selection and extraction is one way to manage unnecessary model complexity. Having lots of predictor variables is not inherently bad, but you want to watch out for predictor variables that are not useful or statistically significant; meaning, variables that add complexity to your model without adding much predictive ROI. Merely adding predictor variables can inflate your R-squared value: in a linear regression, R-squared never goes down when a variable is added, so it can (incorrectly) make added complexity look like added value. Therefore, we do not want to add any features just for the sake of adding features. We want to select the most material and impactful features for our model.

Let’s talk about Thanksgiving. Specifically, let’s talk about a lot of people being in the kitchen during the meal preparation of Thanksgiving. Even more specifically, let’s talk about people that are not adding any culinary value for being in said kitchen. Do we love them? Yes. Do we wish they would get out of the way? Also yes.

Much like relative X, adding features/predictor variables to your model “just in case” can backfire on productivity or utility. The new variable can be irrelevant to the task or not super helpful (relative X is just in the kitchen to “see what’s up”), given that there is already another predictor variable pulling enough weight for the rest of them; i.e. they are redundant. “I want to feel useful, what can I do? Can I help you chop this last carrot?” They add confusion without adding much value.

Keep it Simple

According to Occam’s Razor, when there are multiple competing hypotheses (or models) that explain the data about equally well, the one that makes the fewest assumptions should generally be preferred.

How can we tell whether our R-squared reflects the actual strength of the predictor variables in a regression, rather than being inflated by a lot of added independent variables that do not produce value?

If you have used the Linear Regression tool in Alteryx Designer, you have already seen the warning signs: the resulting Report output shows both the R-squared and the adjusted R-squared. The adjusted R-squared takes into consideration the number of independent variables in the linear model, and it penalizes the model for adding independent variables that do not add predictive power.
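To make that penalty concrete, here is a minimal Python sketch of the standard adjusted R-squared formula for a linear model with an intercept. The function name and the numbers are made up for illustration; this is not code from Alteryx Designer.

```python
def adjusted_r_squared(r_squared: float, n_records: int, n_predictors: int) -> float:
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_records - n_predictors - 1 <= 0:
        raise ValueError("need more records than predictors + 1")
    return 1.0 - (1.0 - r_squared) * (n_records - 1) / (n_records - n_predictors - 1)


# Hypothetical numbers: ten extra predictors nudge R-squared up slightly,
# but the adjusted value goes down because each one is penalized.
small = adjusted_r_squared(0.800, n_records=50, n_predictors=5)
large = adjusted_r_squared(0.805, n_records=50, n_predictors=15)
print(round(small, 4), round(large, 4))
```

In this made-up case the plain R-squared favors the bigger model, while the adjusted value favors the smaller one.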

What about regressions whose fitted curve is not a straight line, like logistic regressions? For these models, “goodness of fit” measures are referred to as pseudo R-squareds. They are “pseudo” because R-squared is really only a goodness of fit measure for linear regressions, given that it relies on ordinary least squares to fit the straight line.

A logistic regression has a curve more akin to an “S” because it is a sigmoid function. Therefore, a straight trend line fit by ordinary least squares will not match that “S” shaped curve. There are multiple different pseudo R-squareds: Cox & Snell, McFadden, and McKelvey & Zavoina, just to name a few. McFadden R-squared (which also ranges from 0 to 1) is a common choice for logistic regressions because it works well with nested models.

Logistic regressions, like some other generalized linear models, use the method of maximum likelihood in place of ordinary least squares. McFadden R-squared is built on that likelihood: it compares the log-likelihood of the fitted model with the log-likelihood of an intercept-only (null) model, which is nested inside the fitted model.
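As a rough illustration, the unadjusted McFadden R-squared can be written as 1 minus the ratio of the fitted model's log-likelihood to the log-likelihood of an intercept-only model. The sketch below assumes you already have fitted probabilities from a logistic regression; the function names and the data are hypothetical.

```python
import math


def bernoulli_log_likelihood(outcomes, probabilities):
    """Log-likelihood of 0/1 outcomes given predicted probabilities of a 1."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for y, p in zip(outcomes, probabilities))


def mcfadden_r_squared(outcomes, fitted_probabilities):
    """Unadjusted McFadden pseudo R-squared: 1 - lnL(model) / lnL(null)."""
    # The intercept-only (null) model predicts the overall rate for every record.
    # Assumes the outcomes contain both 0s and 1s, otherwise log(0) is hit.
    base_rate = sum(outcomes) / len(outcomes)
    ll_null = bernoulli_log_likelihood(outcomes, [base_rate] * len(outcomes))
    ll_model = bernoulli_log_likelihood(outcomes, fitted_probabilities)
    return 1.0 - ll_model / ll_null


# Hypothetical match outcomes (1 = win) and fitted win probabilities
wins = [1, 0, 1, 1, 0, 0, 1, 0]
fitted = [0.9, 0.2, 0.7, 0.8, 0.3, 0.1, 0.6, 0.4]
pseudo_r2 = mcfadden_r_squared(wins, fitted)
print(round(pseudo_r2, 3))
```

A model that predicts no better than the overall win rate scores 0, and the value climbs toward 1 as the fitted probabilities get closer to the actual outcomes.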

Although Alteryx uses a non-adjusted McFadden R-squared in its report for logistic regression, you will notice that it does provide the penalty criteria on the same line, which can be used as an indicator for unnecessary model complexity - keep reading to find out how!

Voila! Now we know why our report says “McFadden” when you use logistic regressions in Alteryx.

Take the First Step!

The Stepwise tool takes two inputs: one is the model object coming out of your regression model and one is the raw (or pre-model) data. The Stepwise tool looks at the regression model with your selected variables, and looks at the data before it got to your regression. The output will help give perspective on which independent variables should be selected, and which can potentially be removed.

You will notice that the Stepwise tool input anchors are not labelled with indicators (e.g. O or D); it does not matter which anchor takes the model object and which takes the data. The R engine will recognize the model coming in from the O output of the prior regression as an R model object. Fun fact: Model objects are also part of the process for Alteryx Promote.

Stepping in the Right Direction

What are some examples in which a predictor variable might be removed from the regression? It might be removed if it has no statistically significant relationship with the target variable, or if it is collinear with another variable.

The Stepwise tool uses two approaches for performing the stepwise regression:

Backward:

The Stepwise tool starts with the full set of variables and removes them one by one (starting with the least statistically significant) until the adjusted fit of the model cannot be improved anymore. The variable that is removed does not have an opportunity to be re-considered for being added back in as the Stepwise tool moves through this process.

Backward and forward:

The Stepwise tool removes variables one by one, but it will also test adding back variables that were previously removed. Why? There might have been a variable that was initially removed but re-adding it with a different combination of predictor variables yields a more optimal model.

Remember that both backward elimination and forward selection are assessing the variables that you had in your model leading into the Stepwise tool, not all the potential variables in the data set.
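For the curious, here is a small, self-contained Python sketch of the backward approach on an ordinary least squares model. It is not the Stepwise tool's code: it assumes the AIC is computed as n·ln(RSS/n) + 2k (the least-squares form, up to a constant), and at each step it drops whichever variable lowers the AIC the most. All function names and the tiny soccer-flavored data set are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_aic(data, target, predictors):
    """Fit target ~ intercept + predictors; return n*ln(RSS/n) + 2k."""
    y = data[target]
    n = len(y)
    cols = [[1.0] * n] + [data[name] for name in predictors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)  # normal equations
    rss = sum((yi - sum(b * c[i] for b, c in zip(beta, cols))) ** 2
              for i, yi in enumerate(y))
    return n * math.log(rss / n) + 2 * len(cols)


def backward_eliminate(data, target, predictors):
    """Drop one predictor at a time for as long as that lowers the AIC."""
    current = list(predictors)
    best = ols_aic(data, target, current)
    while current:
        trials = [(ols_aic(data, target, [p for p in current if p != drop]), drop)
                  for drop in current]
        trial_aic, drop = min(trials)
        if trial_aic >= best:
            break  # no single removal improves the criterion
        best = trial_aic
        current.remove(drop)
    return current, best


# Hypothetical data: shots depend on possession and corner kicks,
# and only very weakly on kickoff hour.
matches = {
    "possession":   [55, 45, 55, 45, 55, 45, 55, 45],
    "corner_kicks": [8, 8, 4, 4, 8, 8, 4, 4],
    "kickoff_hour": [18, 12, 12, 18, 18, 12, 12, 18],
    "shots":        [16.1, 8.9, 11.9, 5.1, 14.1, 8.9, 9.9, 5.1],
}
full = ["possession", "corner_kicks", "kickoff_hour"]
kept, best_aic = backward_eliminate(matches, "shots", full)
print(kept)
```

A backward and forward search would add one more check inside the loop: after each removal, also try re-adding each previously dropped variable and keep that change if it lowers the criterion.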

Adjusted Fit Measure

The Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC) are two criteria for model selection. Both reward goodness of fit and penalize the number of estimated parameters. The AIC penalty does not depend on the number of records, whereas the BIC penalty grows with the sample size of the data, so the BIC applies a larger penalty for model complexity than the AIC on all but the smallest data sets. Whether you select the BIC or AIC, a lower score is better: it indicates a better balance between fit and complexity.
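The textbook definitions make the difference in penalties easy to see: AIC = 2k − 2·lnL and BIC = k·ln(n) − 2·lnL, where k is the number of estimated parameters, n is the number of records, and lnL is the model's log-likelihood. The Python sketch below uses made-up log-likelihoods for two hypothetical models to show the criteria disagreeing.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: penalty of 2 per parameter."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_records: int) -> float:
    """Bayesian Information Criterion: penalty of ln(n) per parameter."""
    return n_params * math.log(n_records) - 2 * log_likelihood


# Hypothetical models fit to the same 200 records
n = 200
simple = {"log_likelihood": -120.0, "n_params": 3}
bigger = {"log_likelihood": -116.0, "n_params": 6}

for name, model in (("simple", simple), ("bigger", bigger)):
    print(name,
          round(aic(**model), 2),
          round(bic(**model, n_records=n), 2))
```

With these numbers the AIC is lower for the bigger model, while the BIC is lower for the simple one, because ln(200) is a much stiffer per-parameter penalty than 2.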

Report and Outcomes

The Stepwise tool’s report gives us several pieces of information to look at. Below are some areas that are useful to get started with:

The call is the model formula as it was passed to R.
The residuals are akin to how far off points are from the regression line. The report displays the minimum, the quartiles (including the median), and the maximum of those deviations.
The coefficients reflect the estimated direction (positive or negative) and strength of the relationship between each predictor and the target. The standard error, the t-value (coefficient divided by standard error), and the p-value (used to judge whether that relationship is statistically significant) are also provided in this portion of the report.
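To show how the coefficient columns relate to each other, here is a short Python sketch. The t-value is simply the estimate divided by its standard error. For the p-value, the sketch swaps in the standard normal distribution as an approximation to the t distribution, which is an assumption that is only reasonable when there are many records; the estimate and standard error are invented.

```python
import math


def t_value(estimate: float, std_error: float) -> float:
    """Coefficient estimate divided by its standard error."""
    return estimate / std_error


def approx_two_sided_p(t: float) -> float:
    """Two-sided p-value using a normal approximation (not the exact t distribution)."""
    return math.erfc(abs(t) / math.sqrt(2))


# Hypothetical coefficient row: estimate 0.42, standard error 0.14
t = t_value(0.42, 0.14)
p = approx_two_sided_p(t)
print(round(t, 2), round(p, 4))
```

A large absolute t-value means the estimate is many standard errors away from zero, which is what drives the p-value down.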

Watch Your Step

Want to use data investigation techniques to try and select some strong predictor variables upfront, instead of or in tandem with using the Stepwise tool? Of course you do! No problem. One method is to use the Association Analysis tool to calculate an association measure and find the corresponding p-values. The association measure describes the strength of association between two variables, and whether they are positively correlated (if one goes up, so does the other) or negatively correlated (if one goes up, the other goes down). This is a useful way to see if a predictor variable has a strong association with the target variable. It is also useful for identifying whether multiple predictor variables are highly correlated with each other, which indicates collinearity between them. The Association Analysis tool also has an interactive chart with a thematic grid and scatter plot to visualize association.
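If you want to see the idea outside of Designer, the Python sketch below computes Pearson correlations (one common association measure) between every pair of predictors and flags pairs that look collinear. The 0.9 cutoff, the variable names, and the values are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical predictor values for six matches
predictors = {
    "possession":      [40, 45, 50, 55, 60, 65],
    "pass_completion": [70, 73, 75, 78, 80, 83],
    "corner_kicks":    [5, 3, 6, 2, 7, 4],
}

# Flag predictor pairs whose absolute correlation exceeds an arbitrary 0.9 cutoff
flagged = [(a, b) for a, b in combinations(predictors, 2)
           if abs(pearson(predictors[a], predictors[b])) > 0.9]
print(flagged)
```

Here possession and pass completion rise almost in lockstep, so they get flagged as candidates for redundancy, while corner kicks is nearly uncorrelated with both.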

Step Toward the Future

Stepwise regressions are easy to use in Alteryx Designer as you assess what features to use in your regression models. However, the Stepwise tool is not a complete substitute for basic data cleansing and investigation! For the best results, it is worthwhile to cross-reference your data investigation techniques with your stepwise regression to understand the feature selection and what is contributing to your regression’s predictive power.