Data Science

SydneyF · ‎08-08-2019

After the data gathering, cleansing, investigation, and feature engineering, but before starting to train models, it is critical to have a strategy for evaluating your models that doesn’t involve the data you're using to train your model. Common strategies include creating a “holdout” dataset or performing cross-validation. The effect is that you only use a portion of your data to train your models, and you’ll have unbiased data with which to tune hyperparameters and evaluate your final model.

But isn’t more data for training always better? Will I be shooting myself in the foot if I leave data out of training? Can’t I just use the training data to evaluate the model? How does cross-validation work anyway, and when am I supposed to use it?

As you’ve embarked on your data science journey and come across these concepts, you might have had questions along those lines. Today, I’d like to make the virtues of creating a validation and test data set or performing cross-validation apparent. Hopefully, I can equip you with the knowledge to set up your model evaluation efforts for success.

Why Having a Test Dataset Matters

When you feed data into a machine learning algorithm, it uses the data to identify patterns and determine how to best predict the target variable. Many algorithms have performance metrics that you can use to assess how well the model “learned” the data. However, one of the best ways to assess performance is to run labeled data through the trained model and see how it performs in comparison to the known value of the target variable.

It’s tempting to use the same data you used to train the model, or even use metrics associated with the training of the model (like r-squared) to assess your model. But on the whole, both of these strategies are a bad idea.

Using your training data to evaluate your model results in overly optimistic metrics because your model has already seen the data and knows exactly how to handle it. Some algorithms (e.g., random forest) can predict the values of the data they were trained with almost perfectly. You also run a high risk of not noticing that your model has been overfitted to your training data. If you test with the data you trained with, you have no way of verifying that your model learned real patterns in the data, and not random noise that exists just in your training dataset.
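To make the point concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the data is synthetic, the labels are coin flips (so there is no real pattern to learn), and the "model" is a simple 1-nearest-neighbor lookup that memorizes its training data.

```python
import random

random.seed(0)

# Synthetic data: the label is pure coin-flip noise, so there is nothing real to learn.
points = [(random.random(), random.random()) for _ in range(200)]
labels = [random.randint(0, 1) for _ in range(200)]

train_X, test_X = points[:150], points[150:]
train_y, test_y = labels[:150], labels[150:]

def predict_1nn(x, X, y):
    """Return the label of the training point closest to x (1-nearest neighbor)."""
    nearest = min(range(len(X)),
                  key=lambda i: (X[i][0] - x[0]) ** 2 + (X[i][1] - x[1]) ** 2)
    return y[nearest]

def accuracy(X_eval, y_eval):
    """Share of records whose predicted label matches the known label."""
    hits = sum(predict_1nn(x, train_X, train_y) == y for x, y in zip(X_eval, y_eval))
    return hits / len(y_eval)

train_acc = accuracy(train_X, train_y)  # every point is its own nearest neighbor
test_acc = accuracy(test_X, test_y)     # unseen points: the model can only guess

print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
```

The training accuracy comes out perfect even though the model has learned nothing but noise, while the test accuracy should sit much closer to a coin flip. Only the second number says anything about how the model would behave on new data.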

Fit statistics like r-squared can also be inflated by artifacts of the data or model unrelated to actual performance, and do not protect against the risk of overfitting.

That is why it is a best practice to do a final evaluation of your model with a separate dataset not included in the training data fed into the model, called a test data set.

Typically, any metrics (like mean squared error) calculated on a test dataset will be worse than those calculated on a training dataset because the model did not see the points in the test dataset during training. Because the records in the test dataset are entirely new to the model, the performance of the model on the test records will give you an idea of how your model will "generalize," or perform on unseen data. All models are wrong in some sense, so having realistic expectations of what kind of mistakes your model might make, or what level of confidence you should have in its estimates, is critical before putting a model into production.

This is also why once you use a test data set to assess your model, you shouldn’t tune or adjust the model any further. The second you make any adjustments in the context of your test dataset, the dataset has been compromised. If you alter your model to perform better with your test dataset, you are effectively manually “training” your model with test data. Although the model will likely perform better on your test dataset, the metrics you derive from the test data are no longer unbiased, and the test data can’t function as a true test.

Hyperparameter Tuning and the Second Test Data Set

In addition to holding out a test data set, it is often necessary to also hold out a validation data set. This is because some decisions about a model are not learned by the algorithm and have to be made and adjusted by you. These are the hyperparameters.

Hyperparameters are settings that control how the algorithm learns from its training data and builds a model. They need to be determined by the user, so in order to adjust or tune them, you need to be able to assess how the model performs on data that wasn’t used for training, and keep adjusting until you find an optimal model. This process is called hyperparameter tuning.

Because you have adjusted your model using the validation dataset, the validation dataset can no longer be used to create an unbiased evaluation of performance.

This is why you also need to hold out a test dataset. Once you’ve run the model to score your test dataset, you’re done adjusting your model.
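The workflow described in this section (train, tune on validation data, then score the test data exactly once) can be sketched as follows. This is an illustration under stated assumptions, not a recipe: the data is synthetic, the model is a small hand-written k-nearest-neighbors regressor, and the hyperparameter being tuned is its k. The candidate values and split sizes are arbitrary choices.

```python
import math
import random

random.seed(1)

# Synthetic 1-D regression data: y = sin(x) + noise (illustrative only).
xs = [random.uniform(0, 6) for _ in range(300)]
data = [(x, math.sin(x) + random.gauss(0, 0.3)) for x in xs]

# Shuffle once, then carve out training, validation, and test sets.
random.shuffle(data)
train, validation, test = data[:200], data[200:250], data[250:]

def knn_predict(x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(rows, k):
    """Mean squared error of the k-NN model on the given rows."""
    return sum((knn_predict(x, k) - y) ** 2 for x, y in rows) / len(rows)

# Tune the hyperparameter k using the validation set only.
candidates = [1, 3, 5, 10, 20, 50]
best_k = min(candidates, key=lambda k: mse(validation, k))

# Touch the test set exactly once, after tuning is finished.
test_mse = mse(test, best_k)
print(f"best k: {best_k}, test MSE: {test_mse:.3f}")
```

The test rows never influence the choice of k, which is what keeps the final number an honest estimate.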

How Much Data Should I Hold Out?

There isn’t a hard and fast rule on how much data should be withheld from training for testing and validation. It will depend on the size of your labeled data. As a general starting point, you should use at least half of your data for training.

The “holdout” method for creating data for evaluation works well (certainly better than just using your training data to evaluate a model), but it has a few limitations, especially if you are working with an already limited labeled dataset. If you only have 100 records to start (not advised, but bear with me), and you split your data 75% for training, 15% for validation, and 10% for testing, your final evaluation metrics will be based on ten records.

With a sample that small, it is not reasonable to put too much weight into any of the metrics derived from it – what if the ten records are all extreme cases that are difficult for the model to accurately predict? What if the ten records don't include tricky edge cases? Not to mention that you’re training your model on 75 records.
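The arithmetic of that 100-record example is easy to check with a small split helper. The function name `three_way_split` and its parameters are hypothetical, written here only for illustration; the integers stand in for labeled rows.

```python
import random

def three_way_split(records, train_frac=0.75, validation_frac=0.15, seed=42):
    """Shuffle a copy of records and split it into train, validation, and test lists.

    Whatever is left after the training and validation shares becomes the test set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test

records = list(range(100))  # stand-in for 100 labeled rows
train, validation, test = three_way_split(records)
print(len(train), len(validation), len(test))  # 75 15 10
```

Changing the seed changes which ten records land in the test set, and with so few of them the resulting metrics can shift noticeably from one draw to the next.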

This is an inherent sticking point of hold-out methods for validating a model. Ideally, you’d like to maximize your training dataset, because you want to give your algorithm every single variation in the data you'd like it to learn; but you'd also like to maximize your test dataset, because you want to make sure that the metrics you calculate are representative of how your model will perform - not just how it will perform on a lucky (or unlucky) draw.

In addition to this data size limitation, simple holdout is sensitive to what data ends up in each bin. The individual records that land in each holdout bin matter, which isn't ideal if you're trying to create the most robust, best model possible.

Cross-validation

An elegant solution to the limitations of simple hold out is found in cross-validation.

Imagine that you split your dataset into two groups, half and half. You arbitrarily labeled one split training and the other split testing, trained and evaluated a model, and then switched the two groups. You now have two sets of evaluation metrics, which can be combined for a more realistic picture of how the algorithm and training data will perform on an unseen dataset.

This is exactly how k-fold cross-validation works, where k is the number of subsets (folds) you divide your data into; the half-and-half example above is k = 2. For each iteration, you pick out one of your subsets to be the test data, and the rest are used as training data. This process is repeated until each subset has taken a turn at being the test data, ensuring that each record in the data set takes a turn at being a test record. The evaluation metrics calculated for each iteration are combined to give an overall estimate of the model’s performance. This has the effect of reducing the variability of the metrics – instead of running the risk of accidentally getting a really optimistic test data set, or a test dataset full of outliers, you have a combined metric.
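The rotation described above can be sketched in a few lines of standard-library Python. The helper `k_fold_indices` is a hypothetical name, the targets are synthetic, and the "model" (predict the mean of the training targets) is deliberately trivial so the fold mechanics stay in focus.

```python
import random
from statistics import mean

def k_fold_indices(n_records, k, seed=0):
    """Yield (train_idx, test_idx) pairs so every record is tested exactly once."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal-sized folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Toy "model": predict the mean of the training targets (assumption for illustration).
random.seed(0)
targets = [random.gauss(10, 2) for _ in range(50)]

fold_mse = []
for train_idx, test_idx in k_fold_indices(len(targets), k=5):
    prediction = mean(targets[j] for j in train_idx)
    fold_mse.append(mean((targets[j] - prediction) ** 2 for j in test_idx))

cv_mse = mean(fold_mse)  # combined estimate across all five folds
print(f"per-fold MSE: {[round(m, 2) for m in fold_mse]}")
print(f"cross-validated MSE: {cv_mse:.2f}")
```

The per-fold values will differ from one another, which is exactly the split-to-split variability that a single holdout set would hide.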

Another common variation of cross-validation is called leave-one-out cross-validation, which effectively works the same way as k-fold (with k equal to the number of records), but instead of creating subsets you just keep one data point out at a time and train a model with the rest of the data. You then combine the errors from every held-out point into your metrics. The benefit of this strategy is that you are maximizing your training data at each iteration.
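Holding out one point at a time looks like this in miniature. The five target values are made up, and the "model" is again just the mean of the remaining training targets, an assumption chosen so the numbers are easy to verify by hand.

```python
from statistics import mean

targets = [3.0, 5.0, 4.0, 6.0, 7.0]  # made-up target values

squared_errors = []
for i, held_out in enumerate(targets):
    training = targets[:i] + targets[i + 1:]  # every record except the i-th
    prediction = mean(training)               # "train" on the remaining four
    squared_errors.append((held_out - prediction) ** 2)

loo_mse = mean(squared_errors)
print(len(squared_errors))  # 5 (one model per record)
print(loo_mse)              # 3.125
```

Note that the number of models trained equals the number of records, so this approach gets expensive quickly on larger datasets.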

There is a great article explaining the intuition of cross-validation published on KD Nuggets and an equally great blog post from Rob Hyndman called Why every statistician should know about cross-validation that discusses cross-validation in the context of the field of statistics.

When to Use a Holdout Dataset or Cross-Validation

Generally, cross-validation is preferred over holdout. It is considered to be more robust, and accounts for more variance between possible splits in training, test, and validation data.

Models can be sensitive to the data used to train them. A small change in the training dataset can result in a large difference in the resulting model. Cross-validation can account for this by running multiple iterations of data splits, and averaging the performance together. The final model uses all of the available training data, which is also a benefit.

A limitation of cross-validation is that it is more time consuming than the simple holdout method. Cross-validation effectively needs to repeat the process of training and testing a model for each iteration.

If you have a huge dataset where each value of the target variable is well represented, holding out a validation and test data set may work well and save you a ton of time in processing. However, cross-validation is widely considered to be a better, more robust approach to model evaluation as long as it is applied correctly.

One thing to be wary of when using cross-validation (or even the holdout method) is having duplicate records in your dataset, where multiple rows (observations) have the same predictor and target variable values as one another. This will make cross-validation (or the test dataset) ineffective because if copies of a duplicated record end up in both the training data and the test data, the model will be able to predict those records with unrealistic accuracy. This can happen when you perform oversampling on your dataset to correct for an imbalanced class in your target variable.
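A quick way to guard against this is to remove exact duplicates before splitting and then check that no test row also appears in the training data. The sketch below uses made-up rows, and `leaked_rows` is a hypothetical helper name.

```python
import random

# Toy rows of (predictor_1, predictor_2, target); two rows are exact duplicates.
rows = [(1, 0, "yes"), (2, 1, "no"), (1, 0, "yes"), (3, 1, "no"),
        (4, 0, "yes"), (2, 1, "no"), (5, 1, "no"), (6, 0, "yes")]

def leaked_rows(train, test):
    """Return the test rows that also appear verbatim in the training data."""
    seen = set(train)
    return [row for row in test if row in seen]

# Deduplicate before splitting; dict.fromkeys keeps the first occurrence in order.
unique_rows = list(dict.fromkeys(rows))

random.Random(0).shuffle(unique_rows)
cut = int(len(unique_rows) * 0.75)
train, test = unique_rows[:cut], unique_rows[cut:]

print(len(rows), len(unique_rows))  # 8 6
print(leaked_rows(train, test))     # []
```

If you need oversampling, a common way to avoid the same leak is to split first and oversample only the training portion, so that copies of a record never straddle the split.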

Holdout Data and Cross-Validation in Alteryx

To create validation and test datasets in Alteryx, you can use the Create Samples tool! It's super easy to configure - just provide percentages for each new split of data you'd like to create and feed in your data. There are three output anchors on the tool - training, validation, and test. The Create Samples tool can be used in combination with the Model Comparison tool (available for download from the Alteryx Gallery), an R-based macro that takes in one or more model objects (the "O" output of a Predictive tool) and a test data set, and calculates metrics (like F1 score or mean absolute error) for any of the model algorithms included in the Predictive tool palette.

Alteryx also has a tool for cross-validation written with the R programming language. The cross-validation tool is available for download from the Alteryx Gallery.

Holdouts and Cross Validation: Why the Data Used to Evaluate your Model Matters