
Data Science

Machine learning & data science for beginners and experts alike.
BridgetT
Alteryx Alumni (Retired)

Note: This blog post is intended for audiences who have already learned the concepts discussed in Demystifying the R-based Tools, Part 1. If you have not yet read that post, please start there and return here once you feel comfortable with that material.

 

How do I prepare to evaluate my model’s accuracy?

Since my last blog post was so long, I left out an important step of the model-building process to simplify things a bit. But in order to accurately evaluate your model, you need to test it on data that you didn’t use to train it. Remember the issues with overfitting I mentioned in the last post? It turns out that training and testing your model on the same dataset often leads to overfitting. Frequently, the model learns your specific training data extremely well, and the algorithm working behind the scenes congratulates itself on a job well done. “Good job!” it thinks. “You obtained 100% accuracy on all of the data you’ve ever seen for this problem! My work is done here.” Unfortunately, real-life modeling situations aren’t so easy. For instance, let’s say that we have a dataset that looks like this:

data_only.png

So this model fits this particular dataset almost perfectly:

spline.png

Does that model truly describe the overall process generating the data? If we only test the model on the data we used to train it, it appears the model is almost completely accurate. But look what happens when we test the model on a few new data points:

 test_data.png

That result is not nearly as encouraging.
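If you'd like to reproduce this effect outside of Alteryx, below is a minimal R sketch on simulated data. Everything in it is made up purely for illustration (it is not the attached workflow); the point is just that an overly flexible model can ace its own training data and still stumble on new points drawn from the same process.

# A minimal sketch of the effect above, using simulated data rather than the attached workflow.
set.seed(1)
x_train <- runif(20, 0, 10)
y_train <- 3 * x_train + rnorm(20, sd = 4)        # the true process: linear plus noise

flexible_fit <- lm(y_train ~ poly(x_train, 15))   # wiggly model that chases the noise
simple_fit   <- lm(y_train ~ x_train)             # model that matches the true process

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# On the training data, the flexible model looks like the clear winner...
rmse(y_train, predict(flexible_fit))
rmse(y_train, predict(simple_fit))

# ...but on new data from the same process, it typically does far worse.
x_new <- runif(20, 0, 10)
y_new <- 3 * x_new + rnorm(20, sd = 4)
rmse(y_new, predict(flexible_fit, newdata = data.frame(x_train = x_new)))
rmse(y_new, predict(simple_fit, newdata = data.frame(x_train = x_new)))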

 

At this point, you may be thinking, “Ok, I understand the importance of testing a model on a new dataset that was not used to train it. But I’m not as lucky as you were in that example! I just have one dataset; there’s no new data that will magically fall into my lap!” Luckily for you, Alteryx has a tool that is intended for this exact problem. It’s called the Create Samples tool. The trick is that you need to use this tool before building your model. So your workflow will look something like this:

create_samples.png

(Note that the Create Samples tool is a different color in versions of Alteryx before 10.5, but the tool works the same way in all versions.)

 

The Create Samples tool randomly selects a portion of your data to flow out of the E output and another portion to flow out of the V output. If the numbers you select for the Estimation sample percent and the Validation sample percent sum to less than 100, the remainder will flow out of the H output. (In this post, we will only use the E and V outputs, so the percentages we select for E and V will always add up to 100.)

 

The Random seed controls which records get sent to which outputs. So if you want to re-run the workflow with different records in each output, you should change the Random seed.

create_samples_config.png

Here, 80% of the data will be in the E output, and the other 20% will be in the V output. The Random seed is currently at 1, but we can change it if we want to see the results with different records in each output.
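(If you're curious what this looks like outside of Alteryx, the split is conceptually just a seeded random sample. Here is a rough R equivalent of the 80/20 configuration above; the data frame name flights is hypothetical, and the Create Samples tool handles all of this for you.)

# Rough R equivalent of the Create Samples configuration above:
# 80% estimation (E), 20% validation (V), Random seed of 1.
set.seed(1)                                   # the Random seed; change it to reshuffle the split
n <- nrow(flights)                            # "flights" is a hypothetical data frame name
estimation_rows <- sample(n, size = floor(0.8 * n))

estimation <- flights[estimation_rows, ]      # roughly the E output (80% of the records)
validation <- flights[-estimation_rows, ]     # roughly the V output (the remaining 20%)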

 

How do I know I’m ready to use the validation sample?

 

Before we test our model on the validation sample, we need to do some basic checking using the “R” output from the model. I touched on the “R” output briefly in my previous post, but there’s a lot more to learn about interpreting the information in this output. (In fact, there’s so much to learn that I’ll still have to leave out a few things in this post. But if you’re interested in learning more, you should enroll in the Alteryx Udacity Business Analyst Nanodegree program.)

 

I’ll focus the discussion here on the Linear Regression tool, but you can follow a similar process with the other tools. From this point on, this post covers only regression models. The steps for classification models are broadly similar, but feel free to request a future post on evaluating classification models if you’d like to learn more.

 

The Coefficients section of the “R” output gives valuable information about the coefficients of the linear regression model. (Recall that the model’s coefficients are the numbers that determine how much of an effect each predictor variable has on the target.)

 

Before looking at this section, you should choose a significance level for your variables. A significance level is an upper bound on the probability that a coefficient estimate as large as the one observed could have arisen purely by chance. For instance, choosing a significance level of .05 means that we discard every variable whose estimated coefficient had a 5% or greater probability of arising by chance. The most common significance level is .05, but you can choose a different level depending on your problem. In some situations, including extra variables is a bigger problem than excluding necessary variables; models for those cases generally rely on lower significance levels.

 

The Estimate column gives the estimated value of each coefficient. At this point, a quick glance at this column to check that the estimated values are sensible should suffice. For example, if you see a large negative coefficient on a variable that you expect to have a positive effect, you should look over your data preparation steps and data collection process. But if your coefficients all look reasonable, you can proceed.

 

The next column we’ll discuss is the Pr(>|t|) column. This is where we use the significance level we chose before. If Pr(>|t|) is less than the significance level, we keep the variable in the model. Otherwise, we should discard the variable. There is one exception to this rule: the model’s intercept. Since the intercept is not a true variable, we generally keep it regardless of its significance. Once we have found the significant variables, we go back to the Linear Regression model and uncheck the insignificant ones. For example, suppose we chose .05 as our significance level. Then this output would indicate that the variables LocationNortheast, LocationNorthwest, LocationSoutheast, and LocationSouthwest are all insignificant:

initial_coefficients.png

Thus, we return to the model configuration screen and uncheck these variables:

linear_reg_config.png

Note that these variables are all dummy variables, one for each value of Location except a baseline value. (Using categorical variables in a linear regression model requires dummy variables. A dummy variable takes the value 1 when a record belongs to the corresponding category and 0 otherwise. For example, a record with location Northeast would have LocationNortheast = 1 and LocationNorthwest = LocationSoutheast = LocationSouthwest = 0.) Thus, we only need to uncheck Location. Then we re-run the workflow and return to the “R” output.
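(For the curious: this dummy-variable expansion is exactly what R does when you fit this kind of model directly. Here is a minimal sketch using the field names from this example; the data frame name flights is hypothetical.)

# How R expands a categorical field into dummy variables when fitting a linear model.
# Field names follow the example in this post; the data frame "flights" is hypothetical.
fit <- lm(Miles_flown ~ Location, data = flights)

# One dummy column for each Location value except the baseline level, e.g.
# LocationNortheast, LocationNorthwest, LocationSoutheast, LocationSouthwest:
head(model.matrix(~ Location, data = flights))

# The Coefficients table behind the "R" output, including the Estimate and Pr(>|t|) columns:
summary(fit)$coefficients

The re-run “R” output is shown below.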

final_coefficients.png

Now all of the variables are significant, so we can proceed. The next important portions of the report output are the multiple R-squared and the adjusted R-squared values. The multiple R-squared value describes the percentage of the variation in the target (dependent) variable that is explained by the predictor (independent) variables. In general, a higher R-squared value indicates a better model.

 

However, we have to take care not to obsess too much over the R-squared value because of our old foe, overfitting. Adding more variables to a linear regression model will never lower the R-squared value, and it usually raises it, even if the new variables aren’t significant! To account for this phenomenon, we can turn our attention to the adjusted R-squared value. This quantity penalizes models for each additional variable, so it is a better measure of the model’s overall quality. Luckily, our example model has excellent values for both R-squared and adjusted R-squared.

r_squared.png
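(If you were fitting the same model directly in R, both numbers would come straight out of the model summary. A quick sketch, continuing the hypothetical fit object from the earlier aside:)

# Both values come straight out of R's model summary ("fit" continues the earlier sketch).
s <- summary(fit)
s$r.squared        # multiple R-squared: share of the target's variation explained
s$adj.r.squared    # adjusted R-squared: the same quantity, penalized for each extra predictor

# The adjustment itself, with n records and p predictors:
# adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1)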

 

How do I use the validation sample once I am ready?

 

At this point we have nearly all of the ingredients we need to test our model. We have a dataset the model hasn’t seen from the Create Samples tool and we have models that have passed a basic validation test. But how do we actually test these models on the data? Yet again, Alteryx saves the day with a tool designed to solve this problem! The Score tool takes two inputs: a valid Alteryx predictive model, and a dataset with the same fields as the data used to create this model. The tool then outputs scores, or predicted values, for the target variable in the new dataset.

 

The Score tool is generally used for two main purposes. The first is the problem we have here: we already know the values of the target variable, but we want to compare these values to the predicted values to evaluate the quality of the models. The second is when we want to use a validated model to predict unknown values of the target variable from a new dataset. Thus, we generally use the Score tool twice within a modeling process: first to test the model, and then to use the tested model to predict new data.

 

After dragging the Score tool onto the canvas, connect the “O” output from the model you wish to test to one of the Score tool’s inputs. Connect the validation data to the other input. (With the Score tool, it doesn’t matter which connection goes into the top input and which connection goes into the bottom one.) Your workflow should look something like this:

scored_workflow.png

To configure the Score tool, enter the name you wish to give the scored field. There are certain restrictions on this name: it should start with a letter and contain only letters, numbers, and the underscore character.

score_config.png

In most cases, you can leave the other options at their defaults. However, there are a few options specific to Linear Regression models. A natural log transformation is a way of preparing certain datasets so that they better satisfy the assumptions of linear regression. But if this is the first time you’re hearing about natural log transformations, chances are your data has not undergone one, so you can ignore this option. The second option should have wider appeal; it allows you to include a confidence interval rather than simply a point forecast. Checking this box gives you three new output columns: the forecasted value, a lower bound, and an upper bound. For instance, choosing a 95% confidence interval would mean that there is a 95% probability that your target variable falls between the lower and upper bounds.
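(In R terms, the Score tool is doing roughly what the predict() function does for a fitted linear model, and the interval option corresponds to asking predict() for bounds as well as a point forecast. A sketch, continuing the hypothetical names from the earlier asides:)

# Roughly what the Score tool does for a linear model: predicted values for the validation records.
# ("fit" and "validation" continue the hypothetical sketches above.)
scores <- predict(fit, newdata = validation)

# Asking for a 95% interval returns three columns (fit, lwr, upr), matching the
# forecasted value, lower bound, and upper bound described above. For bounds on
# individual new records, R's "prediction" interval is the closer analogue;
# interval = "confidence" would instead bound the average response.
scored <- predict(fit, newdata = validation, interval = "prediction", level = 0.95)
head(scored)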

 

At this point, if you have multiple models, you can use a Join Multiple tool to join them all together. Since you will have duplicated fields from the various models, you can also de-select them using the Join Multiple tool.

 

After running the workflow with the Score tool properly configured, we can manually look through the predictions and the actual values. But this process doesn’t really give us a cohesive picture of the model’s performance. We need two more tools to obtain this picture. First, we use a Formula tool to obtain the squared error for each prediction. In this example, where [Score_linear_regression] is our prediction and [Miles_flown] is our actual target variable, we use the following formula:

POW(([Score_linear_regression] - [Miles_flown]), 2)

 

first_formula.png

 

If we have multiple models joined together, we create a similar formula for each model and then run the workflow to populate the metadata.

 

At this point, we are close to having a comprehensive picture of the performance of all of our models. But with larger validation datasets (such as the attached example, which has 100 validation records), we can’t keep all of the individual squared errors in our head. Thus, we use the Summarize tool to take the average of the squared errors for each model.

Finally, we use a Formula tool to take the square roots of all of these average squared errors. The resulting quantity is known as the Root Mean Squared Error (RMSE), and it is the most commonly used measure to compare regression models.
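Put together, that Formula, Summarize, and Formula sequence boils down to just a few lines. Here is a minimal R sketch using the field names from the example; the data frame name scored_data is hypothetical.

# The Formula -> Summarize -> Formula sequence above, as a single calculation.
# Field names follow the example; "scored_data" is a hypothetical data frame name.
squared_error <- (scored_data$Score_linear_regression - scored_data$Miles_flown)^2   # per-record Formula tool
mean_squared_error <- mean(squared_error)            # Summarize tool: average of the squared errors
rmse_linear_regression <- sqrt(mean_squared_error)   # final Formula tool: the square root, i.e. the RMSE
rmse_linear_regression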

 

If I have multiple models that passed the validation stage, how do I choose the best one?

 

The first step of model selection is to compare the RMSE of each model. A lower RMSE indicates that the model’s predictions were closer to the actual values. However, a simpler model with the same RMSE as a more complex model is generally better, as simpler models are less likely to be overfit.

 

RMSE.PNG

 

In this particular example, we essentially have a tie between the Linear Regression model and the Spline model. Though the Spline model has a lower RMSE, the Linear Regression model has fewer variables. Thus, these two models are both good candidates for a production modeling process. We can also combine the two of them in a model ensemble, which allows us to use the results of multiple models. Unfortunately this post has already gotten too long to describe ensembles, but feel free to ask for a future post about this topic if you’d like to learn more!

 

Anyway, that’s the end of the content of this post. I hope you enjoyed reading it, and feel free to provide any feedback or questions in the comments section! The attached file regression_workflow contains the entire workflow described here. A note on this workflow: the Model Customization options were left as the defaults for all three models. If you would like to learn more about these options in a future post, feel free to comment and request it! Additionally, I focused on regression models this time because my last post focused on classification models. But if you would like to know more about validating classification models, feel free to ask in the comments section. Depending on the scope of your question, I can either answer within the comments section or dedicate a new post to the topic. Finally, check out this set of Predictive Analytics Starter Kits on the Gallery. The kits will guide you through the process of data preparation, model building, and model validation for A/B testing, linear regression, and logistic regression.

Comments
Darroch
5 - Atom

Hi Bridget,

Thanks for this great article. I tried to follow your tips, but found that I could only Join Multiple after scoring. Model outputs only gave Name and Object fields, but that may reflect my inexperience.

I recently read the following article in the series on Ensemble models by Mike Bowles, which highlights the packages available a few years ago: http://blog.revolutionanalytics.com/2014/04/ensemble-packages-in-r.html

I'm eager to hear where you take this next.

Darroch

BridgetT
Alteryx Alumni (Retired)

Hi @Darroch,

 

Thanks for your feedback! I meant to say that you should use a Join Multiple tool after scoring, but I see now that my original wording was a bit unclear. My apologies! I might write another blog post giving hands-on examples of model ensembles in Alteryx in the future, but for now @ToddM has written a more theoretical post explaining them: https://community.alteryx.com/t5/Engine-Works-Blog/Alteryx-Data-Science-Design-Patterns-Combining-Mo.... Is there anything else you'd like to see in a future blog post?

 

Thanks,

Bridget

 

Edit: Just realized you were the first person to comment on Todd's post! Well, hopefully either he or I will write one with actual examples of using ensembles in Alteryx fairly soon.

Roel
6 - Meteoroid

Hello Bridget,

 

Thank you for the informative posts. I'm noticing that with larger sets of data the R tools can become really slow; in fact, much slower than working with the data directly in R (300 vs. 15 minutes to run one example). This is unfortunate since Alteryx is in many ways so much more flexible and user friendly.

 

Could you give some insight into where to expect this behaviour (which tools to avoid), or even how to avoid it altogether?

Darroch
5 - Atom

Hi @BridgetT,

 

Right now, I'm running my models through a Score tool and then manually creating a weighted voting (categorical variables) or average (numerical variables) through Formula tools. So that's hardly an efficient method. I'd love to see something that simplifies such a process.

 

From my background reading, I was inspired by this guide: http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/

This appears to be a suitable method in R, but requires Caret and CaretEnsemble packages: http://machinelearningmastery.com/machine-learning-ensembles-with-r/

Thanks

Darroch

BridgetT
Alteryx Alumni (Retired)

Hi @Roel,

 

Yes, there are some cases where the R-based tools run more slowly than what one might expect. However, I've never seen a situation with such a big discrepancy as the one you described. What model were you running? Anyway, my three suggestions to avoid this issue are:

 

1. Don't create batch macros using R-based tools. Alteryx takes about two seconds to start up each R process, so repeatedly starting up R will slow things down considerably.

2. Make sure that your data is clean and in an appropriate form for predictive modeling. For instance, make sure that fields are the appropriate data type. Numbers (if they actually represent numbers in real life) should be some sort of int, double, or float type. Don't use numbers to represent categories, however. (For instance, don't use 1 to represent low, 2 to represent medium, and 3 to represent high.) Make sure to deal with null values appropriately according to the tool you're using. 

3. If you know some R, you can go into the code in the macro yourself and disable plotting and interactive report generation (if you're using a tool that generates interactive reports). With 11.0, some of the Predictive tools will have the option to disable plot generation, so you won't even need to go into the tools and do it on your own. 

 

These suggestions won't necessarily work 100% of the time, but they're a start at least!

 

Best,

Bridget

BridgetT
Alteryx Alumni (Retired)

Hi @Darroch,

 

The procedure you describe is about as good as you can get without writing any R code yourself. If you know R, and the only thing preventing you from using functions from the packages you mention is not knowing how to get them to play nicely with your Alteryx R installation, you should check out the Install R Packages App on the Predictive District of the Gallery. If you know a little R, but you'd like more guidance on using those packages, I could maybe write a future blog post explaining how to use them. :) But I can't make any guarantees right now, since my team is pretty busy with other work.

 

Thanks,

Bridget