If you are new to machine learning and you developed a classification model, then congrats! You might be thinking, “now what?”
That’s a great question.
With AutoML technology, model creation is more accessible than ever. The difficulty lies in determining whether that model is any good. In this article, I will explore how to judge whether your model is satisfactory for your business use case (spoiler: it’s not black and white).
Before I jump into how to evaluate your classification model, I want to clarify that while the examples I give in this article are all binary classification, there are also multi-class classification problems. The difference is that in binary classification, the target variable has only two values, and in multi-class, it has more than two values.
Many of the metric calculations I talk about later in the article will change slightly for a multi-class model, so make sure to look up the correct formula if that is the kind of model you are evaluating.
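To make that concrete, here is a minimal sketch, assuming scikit-learn (the article itself is tool-agnostic): for a multi-class model, metrics like precision need an averaging strategy, which is exactly the kind of slight change in the calculation to watch for.

```python
from sklearn.metrics import precision_score

# Binary example: precision has a single, unambiguous definition.
y_true_bin = [0, 1, 1, 0, 1]
y_pred_bin = [0, 1, 0, 0, 1]
print(precision_score(y_true_bin, y_pred_bin))  # TP / (TP + FP)

# Multi-class example: you must choose how to average across classes.
y_true_multi = [0, 1, 2, 2, 1, 0]
y_pred_multi = [0, 2, 2, 2, 1, 1]
print(precision_score(y_true_multi, y_pred_multi, average="macro"))  # unweighted mean of per-class precision
print(precision_score(y_true_multi, y_pred_multi, average="micro"))  # global TP / (TP + FP)
```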
“Machine learning model performance is relative and ideas of what score a good model can achieve only make sense and can only be interpreted in the context of the skill scores of other models also trained on the same data.” -Jason Brownlee, machinelearningmastery.com
Since every machine learning dataset is different, success is subjective.
The only way to make evaluating machine learning models truly objective is to compare different models on the same dataset. And, like a science experiment, we need a “control group”: the group that receives no intervention and whose results we measure for comparison. This is where the baseline model comes in.
You can think of a baseline model as little to no intervention. For a classification model, this would mean simply guessing whichever result occurs most often (i.e., the mode) for every observation. So… not much of a model. But it is a helpful baseline: when you evaluate your model, let’s say it’s detecting fraud, you can say, “hey, my logistic regression model performed 40% better than if I had labeled each transaction as fraudulent or not without any real skill!” (That kind of no-skill guessing is the “no skill” line on the ROC curve, which I will cover in more detail later.)
Another way to establish your baseline is to look at what your business is currently doing without machine learning. Whether it’s manually checking certain criteria, using formulas (like if/then statements), or something else—compare the success rate of that process to your model.
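If you want to build that no-skill baseline yourself, here is a minimal sketch, assuming a scikit-learn workflow (the built-in dataset is just a stand-in for your own data): DummyClassifier with strategy="most_frequent" predicts the mode for every observation, which gives you the control group to measure your real model against.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Any labeled dataset works; this built-in one is only for illustration.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: always predict the most frequent class (the mode).
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Candidate model: a simple logistic regression.
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

print("Baseline accuracy:", baseline.score(X_test, y_test))
print("Model accuracy:   ", model.score(X_test, y_test))
```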
Once you have a baseline model and other model options to compare against it, we can start to talk about success metrics. How will you score your model against the baseline? Before we review the performance metric options, there are a few considerations to keep in mind.
How comfortable are you with your model making a mistake? What would the real-world consequences be?
These are good questions to ask when thinking about how risk tolerant your use case is. And your answers can guide you on which metrics to use to evaluate the model and what thresholds to set for them.
For example, if your model is predicting whether someone has a disease or not, you are very risk averse. The consequences associated with a false negative—telling someone that they don’t have a disease when they actually do—are high.
When we talk about false negatives, true positives, and so on, it can get confusing. (The matrix of these values is even called a confusion matrix—talk about self-awareness.) So here’s a quick reference visual before we jump into calculating performance metrics, using the same example as the last paragraph:
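For those who prefer numbers to pictures, here is a minimal sketch (made-up labels, and scikit-learn’s confusion_matrix) that produces the same four values for the disease example:

```python
from sklearn.metrics import confusion_matrix

# 1 = has the disease, 0 = does not (made-up example labels)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() unpacks the 2x2 matrix into the four counts.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("True negatives: ", tn)  # predicted healthy, actually healthy
print("False positives:", fp)  # predicted sick, actually healthy
print("False negatives:", fn)  # predicted healthy, actually sick (the costly mistake here)
print("True positives: ", tp)  # predicted sick, actually sick
```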
Another thing to identify before selecting metrics to evaluate your model is class imbalance. A dataset with balanced classes would contain around the same number of observations for both positive and negative instances of the target variable.
Depending on your use case, it might not be feasible to have balanced classes. For example, if you want your model to detect spam emails, then a positive value of the target variable would mean that the email is spam. The majority of emails sent are not spam, however, so your dataset is bound to be naturally imbalanced. No need to panic! Just keep this in mind when you select a metric to evaluate your model—choose one that is less sensitive to class imbalance.
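As a quick sketch of that check, assuming made-up spam labels: plain accuracy can look great on an imbalanced dataset even when the model never catches the minority class, while metrics like balanced accuracy and the F1 score expose the problem.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Hypothetical labels: 1 = spam, 0 = not spam. Real email data would be far larger.
y_true = pd.Series([0] * 95 + [1] * 5)   # 95% of emails are not spam
y_pred = pd.Series([0] * 100)            # a "model" that never flags spam

print(y_true.value_counts(normalize=True))  # reveals the class imbalance

# Accuracy looks great even though the model catches zero spam.
print("Accuracy:         ", accuracy_score(y_true, y_pred))            # 0.95
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))   # 0.50
print("F1 score:         ", f1_score(y_true, y_pred, zero_division=0)) # 0.0
```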
Below are some metrics used to evaluate a classification model. This is not a comprehensive list, but it does cover the most common ones: accuracy (the share of all predictions that are correct), precision (of the observations predicted positive, how many actually are), recall (of the actual positives, how many the model caught), the F1 score (the harmonic mean of precision and recall), and AUC (the area under the ROC curve).
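Here is a minimal sketch of how those metrics are computed in practice, again assuming scikit-learn and made-up binary labels (note that AUC needs predicted probabilities rather than hard class labels):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]                   # hard class predictions
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # predicted probability of class 1

print("Accuracy: ", accuracy_score(y_true, y_pred))   # share of correct predictions
print("Precision:", precision_score(y_true, y_pred))  # of predicted positives, how many are right
print("Recall:   ", recall_score(y_true, y_pred))     # of actual positives, how many were caught
print("F1 score: ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print("ROC AUC:  ", roc_auc_score(y_true, y_prob))    # needs scores/probabilities, not labels
```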
After you have determined that your model performs well against the baseline model, you still aren’t finished! Now you need to evaluate the results on a test dataset. This is often done by holding out a percentage of your dataset from training so that you can use it to test your model. A more advanced method for testing is cross-validation: this technique runs multiple iterations of training and testing on different subsets of the data, which reduces some of the variability you get from testing only once.
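Here is one way that might look, a sketch assuming scikit-learn and a logistic regression model on a placeholder dataset; swap in whatever model and data you are actually evaluating.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

# Simple holdout: keep 25% of the data out of training to use for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy: ", model.score(X_test, y_test))

# Cross-validation: five different train/test splits reduce single-split variability.
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print("CV accuracy per fold:", scores)
print("Mean CV accuracy:    ", scores.mean())
```

If the train accuracy comes out much higher than the test accuracy, that gap is your first hint of the overfitting discussed next.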
If your model performs very well on the training dataset but not very well on the test dataset, then you have a case of overfitting: your model fits the training data so closely that it does not generalize to new data. Having a train, test, and validation split is considered best practice and helps you detect and prevent overfitting. To read more about the difference between test and validation datasets, check out this article.
If your model does not perform well on the training or test dataset, then you may have a case of underfitting. It is a good idea to look at other model options before discarding the use case.
It is my hope that now you have an idea of whether your machine learning model is a good one. If you’ve realized that your model isn’t up to snuff, look out for my next article, where I go over what to do if your model isn’t performing.
Special thanks to @minetpolsinelli, @MarkG, and @NeilR for their help with this article.
Hi, I'm Megan! I am a Sr. Content Manager at Alteryx. I work to make sure our blogs and podcast have high quality, helpful, and engaging content. As a data analyst turned writer, I am passionate about making analytics & data science accessible (and fun) for all. If there is content that you think the community is missing, feel free to message me--I would love to hear about it.