Data Science

BridgetT · ‎06-27-2016

Note: This blog post is intended for audiences who have already learned the concepts discussed in Demystifying the R-based Tools, Part 1. If you have not yet read that post, please start there and return here once you feel comfortable with that material.

How do I prepare to evaluate my model’s accuracy?

Since my last blog post was so long, I actually left out an important step of the model-building process to simplify things a bit. But in order to accurately evaluate your model, you actually need to test it on data that you didn’t use to train it. Remember the issues with overfitting I mentioned in the last post? It turns out that training and testing your model on the same dataset often leads to overfitting. Frequently, the model learns your specific training data extremely well, and the algorithm working behind the scenes congratulates itself on a job well done. “Good job!” it thinks. “You obtained 100% accuracy on all of the data you’ve ever seen for this problem! My work is done here.” Unfortunately, real life modeling situations aren’t so easy. For instance, let’s say that we have a dataset that looks like this:

So this model fits this particular dataset almost perfectly:

Does that model truly describe the overall process generating the data? If we only test the model on the data we used to train it, it appears the model is almost completely accurate. But look what happens when we test the model on a few new data points:

That result is not nearly as encouraging.

At this point, you may be thinking, “Ok, I understand the importance of testing a model on a new dataset that was not used to train it. But I’m not as lucky as you were in that example! I just have one dataset; there’s no new data that will magically fall into my lap!” Luckily for you, Alteryx has a tool that is intended for this exact problem. It’s called the Create Samples tool. The trick is that you need to use this tool before building your model. So your workflow will look something like this:

(Note that the Create Samples tool is a different color in versions of Alteryx before 10.5. But the tool is the same in all versions.)

The Create Samples tool randomly selects a portion of your data to flow out of the E output and another portion to flow out of the V output. If the numbers you select for the Estimation sample percent and the Validation sample percent sum to less than 100, the remainder will flow out of the H output. (In this post, we will only use the E and V outputs, so the percentages we select for E and V will always add up to 100.)

The Random seed controls which records get sent to which outputs. So if you want to re-run the workflow with different records in each output, you should change the Random seed.

Here, 80% of the data will be in the E output, and the other 20% will be in the V output. The Random seed is currently at 1, but we can change it if we want to see the results with different records in each output.

How do I know I’m ready to use the validation sample?

Before we test our model on the validation sample, we need to do some basic checking using the “R” output from the model. I touched on the “R” output briefly in my previous post, but there’s a lot more to learn about interpreting the information in this output. (In fact, there’s so much to learn that I’ll still have to leave out a few things in this post. But if you’re interested in learning more, you should enroll in the Alteryx Udacity Business Analyst Nanodegree program.)

I’ll limit the discussion here to the Linear Regression tool, but you can engage in a similar process with other tools. From now on, I’ll limit this post to regression models. The steps for classification models are relatively similar, but feel free to request a future post on evaluating classification models if you’d like to learn more.

The Coefficients section of the “R” output gives valuable information about the coefficients of the linear regression model. (Recall that the model’s coefficients are the numbers that determine how much of an effect each variable has on the model.)

Before looking at this section, you should choose a significance level for your variables. A significance level is the upper bound for the probability that the coefficients obtained values as large as they did by chance. For instance, choosing a significance level of .05 means that we discard all variables that had a 5% or greater probability of having their modeled magnitudes by chance. The most typical significance level is .05, but you can choose a different level depending on your problem. In some situations, including extra variables is a bigger problem than excluding necessary variables; models for those cases generally rely on lower significance levels.

The Estimate column gives the estimated value of each coefficient. A glance at this column to see if the estimated coefficient values are sensible should suffice at this point. For example, if you see a large negative coefficient on a variable that you expect to have a positive effect, you should look over your data preparation steps and data collection process. But if your coefficients are all relatively reasonable, you can proceed.

The next column we’ll discuss is the Pr(>|t|) column. This is where we use the significance level we chose before. If Pr(>|t|) is less than the significance level, we can keep the variable in the model. Otherwise, we should discard the variable. There is one exception to this rule, which is the model’s intercept. Since the intercept is not a true variable, we generally keep it regardless of its significance level. Once we have found the significant variables, we go back to the Linear Regression model and uncheck the insignificant variables. For example, suppose we chose .05 as our significance level. Then this output would indicate that the variables LocationNortheast, LocationNorthwest, LocationSoutheast, and LocationSouthwest are all insignificant.

Thus, we return to the model configuration screen and uncheck these variables:

Note that these variables are all dummy variables created by all but one of the different values for Location. (Using categorical variables in a linear regression model requires the use of dummy variables. These variables take the value of 1 when a record’s value for that variable is true and 0 otherwise. For example, a record with location Northeast would have LocationNortheast = 1 and LocationNorthwest= LocationSoutheast = LocationSouthwest = 0.) Thus, we only need to uncheck Location. Then we re-run the workflow and return to the “R” output.

Now all of the variables are significant, so we can proceed. The next important portions of the report output are the multiple R-squared and the adjusted R-squared values. The multiple R-squared value describes the percentage of the variation in the target (dependent) variable that is explained by the predictor (independent) variables. In general, a higher R-squared value indicates a better model.

However, we have to take care not to obsess too much over the R-squared value because of our old foe overfitting. Adding more variables to a linear regression model will automatically raise the R-squared value even if the new variables aren’t significant! To account for this phenomenon, we can turn our attention to the adjusted R-squared value. This quantity penalizes models with additional variables, so it is a better measure of the model’s complete quality. Luckily, our example model has excellent values for both R-squared and adjusted R-squared.

How do I use the validation sample once I am ready?

At this point we have nearly all of the ingredients we need to test our model. We have a dataset the model hasn’t seen from the Create Samples tool and we have models that have passed a basic validation test. But how do we actually test these models on the data? Yet again, Alteryx saves the day with a tool designed to solve this problem! The Score tool takes two inputs: a valid Alteryx predictive model, and a dataset with the same fields as the data used to create this model. The tool then outputs scores, or predicted values, for the target variable in the new dataset.

The Score tool is generally used for two main purposes. The first is the problem we have here: we already know the values of the target variable, but we want to compare these values to the predicted values to evaluate the quality of the models. The second is when we want to use a validated model to predict unknown values of the target variable from a new dataset. Thus, we generally use the score tool twice within a modeling process. The first time is when we test the model, and the second time is to use the tested model to predict new data.

After dragging the Score tool onto the canvas, connect the “O” output from the model you wish to test to one of the Score tool’s inputs. Connect the validation data to the other input. (With the Score tool, it doesn’t matter which connection goes into the top input and which connection goes into the bottom one.) Your workflow should look something like this:

To configure the score tool, add the name you wish to give the scored field. There are certain restrictions on the name you can give this field. The name should start with a letter and contain only letters, numbers, and the underscore character.

In most cases, you can leave the other options as the defaults. However, there are special options you can choose with Linear Regression models. A natural log transformation is a way of preparing certain datasets so that they adhere to assumptions about Linear Regression models. But if this is the first time you’re hearing about natural log transformations, chances are your data has not undergone a natural log transformation, so you can ignore this option. The second option should have wider appeal; it allows you to include a confidence interval rather than simply a point forecast. Checking this box would give you three new output columns: the forecasted value, a lower bound and an upper bound. For instance, choosing a 95% confidence interval would mean that there is a 95% probability that your target variable falls between the lower and upper bounds.

At this point, if you have multiple models, you can use a Join Multiple tool to join them all together. Since you will have duplicated fields from the various models, you can also de-select them using the Join Multiple tool.

After running the workflow with the Score tool properly configured, we can manually look through the predictions and the actual values. But this process doesn’t really give us a cohesive picture of the model’s performance. We need two more tools to obtain this picture. First, we can use a Formula tool to obtain the squared error for each prediction. In this example where [Score_linear_regression] is our prediction and [Miles_flown] is our actual target variable, we use the following formula:

POW(([Score_linear_regression] - [Miles_flown]), 2)

If we have multiple models joined together, we create a similar formula for each model and then run the workflow to populate the metadata.

At this point, we are close to having a comprehensive picture of the performance of all of our models. But when we have situations with larger validation datasets (such as the attached example, which has 100 validation records), we can’t keep all of the percent errors in our head. Thus, we use the Summarize tool to take the average of the squared error for each model.

Finally, we use a Formula tool to take the square roots of all of these average squared errors. The resulting quantity is known as the Root Mean Squared Error (RMSE), and it is the most commonly used measure to compare regression models.

If I have multiple models that passed the validation stage, how do I choose the best one?

The first step of model selection is to compare the RMSE of each model. A lower RMSE indicates that the model’s predictions were closer to the actual values. However, a simpler model with the same RMSE as a more complex model is generally better, as simpler models are less likely to be overfit.

In this particular example, we essentially have a tie between the Linear Regression model and the Spline model. Though the Spline model has a lower RMSE, the Linear Regression model has fewer variables. Thus, these two models are both good candidates for a production modeling process. We can also combine the two of them in a model ensemble, which allows us to use the results of multiple models. Unfortunately this post has already gotten too long to describe ensembles, but feel free to ask for a future post about this topic if you’d like to learn more!

Anyway, that’s the end of the content of this post. I hope you enjoyed reading it, and feel free to provide any feedback or questions in the comments section! The attached file regression_workflow contains the entire workflow described here. A note on this workflow: the Model Customization options were left as the defaults for all three models. If you would like to learn more about these options in a future post, feel free to comment and request it! Additionally, I focused on regression models this time because my last post focused on classification models. But if you would like to know more about validating classification models, feel free to ask in the comments section. Depending on the scope of your question, I can either answer within the comments section or dedicate a new post to the topic. Finally, check out this set of Predictive Analytics Starter Kits on the Gallery. The kits will guide you through the process of data preparation, model building, and model validation for A/B testing, linear regression, and logistic regression.

Data Science

Demystifying the R-based Tools, Part 2 (Model Validation and Selection)