Imagine taking a 100-question multiple-choice test and giving the right answer to 85 questions. You get a score of 85%. You must have studied and learned the material!
But maybe the reality was a little different: You’d actually forgotten to study, so you just went down your answer sheet and picked answer A for every question. Your teacher had gotten tired of putting the right answer in different places, and just stuck 85 of the answers in option A. You lucked out!
There was probably a better way to measure your abilities than your score on this test. The same may be true of how you measure machine learning models’ prediction abilities.
The metrics (the quantitative measures of model performance) that you choose to evaluate your models matter, but with so many choices, which one should you select? You’ll definitely have to face this decision if you’re hand-coding your own machine learning models. However, you might also like to know more about your options even if you’re using our awesome new AutoML tool in Designer 21.1, or using Assisted Modeling to guide your model creation in either assisted or automatic mode. This knowledge is also helpful for using a package like EvalML.
One distinction to get straight first is the difference between a “metric” and an “objective function.” You'll need it if you want to delve into the optional “Advanced Parameters” of the AutoML tool.
While you don’t have to tweak the Advanced Parameters settings, you can select which objective function you want to prioritize as AutoML evaluates different algorithms and parameters. Think of “objective” here as a “goal” for the model; which measure do you want it to maximize or minimize?
With that goal in mind, the AutoML process will build and evaluate a variety of models for you. It will rank the models it creates based on the objective function, and will offer you its top choice as its output. But sometimes a particular measure of the model’s performance may be especially relevant to your use case. If so, you may want to select the best fit for your needs from the objective function list.
When the model is built, you’ll see not only how it performed with regard to the measure you chose, but also other metrics that reflect its performance in different ways. You may be interested in what those all mean, even if you didn’t choose them as your top priority for the model-building process.
So, to be a little more clear: all these quantitative measures of a model’s performance can be called metrics, but only one is used as the deciding factor for AutoML’s model selection in the objective function.
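To make the metric/objective distinction concrete, here is a minimal sketch of ranking candidate models by one chosen objective. It is not how the AutoML tool is implemented; the model names, the scores, and the `rank_models` helper are all hypothetical and made up for illustration.

```python
# Hypothetical metric scores for three candidate models (made-up numbers).
candidate_scores = {
    "model_a": {"accuracy": 0.91, "f1": 0.62, "log_loss": 0.48},
    "model_b": {"accuracy": 0.88, "f1": 0.71, "log_loss": 0.35},
    "model_c": {"accuracy": 0.86, "f1": 0.69, "log_loss": 0.31},
}


def rank_models(scores, objective, greater_is_better=True):
    """Return model names sorted from best to worst on a single objective."""
    return sorted(
        scores,
        key=lambda name: scores[name][objective],
        reverse=greater_is_better,
    )


# Every metric is reported, but only the objective decides the winner.
best_by_accuracy = rank_models(candidate_scores, "accuracy")[0]
best_by_f1 = rank_models(candidate_scores, "f1")[0]
# Log loss is minimized, so lower is better.
best_by_log_loss = rank_models(candidate_scores, "log_loss", greater_is_better=False)[0]

print(best_by_accuracy, best_by_f1, best_by_log_loss)
```

With these made-up scores, each objective picks a different "best" model, which is why the choice of objective deserves some thought.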
Ready to move on? Let's dive in.
So which metric should you choose? It depends. Don’t you love that answer?
The first step is to understand what your options are; you can then decide which one best fits your situation. First, there are different metrics for classification problems than for regression problems. (Classification problems are when you want to find the category that something best fits, out of two or more choices, like true/false or low/medium/high. Regression is when you want to find a numeric value for something, like predicting a score or a home’s value.)
We’ll check out metrics for classification here. In part 2 of this post, we’ll look at metrics for regression models.
This post relies on the idea of a confusion matrix and what it means to have true positives, false positives, true negatives and false negatives. I’ve included a confusion matrix with most of the metrics below as a reminder of the results that feed into their calculation. It’ll also be good to know what “balanced” and “imbalanced” datasets are.
Want a refresher on confusion matrices and balanced/imbalanced data? Read on. Otherwise, skip ahead and start reviewing your many metric options.
Here’s a confusion matrix for a binary (two-outcome) classification problem with possible outcomes of “Yes” or “No”:
| | Prediction: Yes | Prediction: No |
|---|---|---|
| Truth: Yes | True positive (TP): The model predicted “Yes,” and the reality was “Yes.” | False negative (FN): The model predicted “No,” and the reality was “Yes.” |
| Truth: No | False positive (FP): The model predicted “Yes,” and the reality was “No.” | True negative (TN): The model predicted “No,” and the reality was “No.” |
Of course, outcomes could be all sorts of things: “voter” or “nonvoter,” “default” or “no default,” “conversion” or “no conversion.” Multiclass problems could have multiple potential outcomes, like “high risk,” “medium risk” and “low risk.”
One more thing: When we discuss “balanced” datasets in the context of classification, we mean that your outcome variable is pretty evenly distributed between/among the potential options, not heavily skewed or “imbalanced” such that one or some outcomes dominate. It takes a little extra consideration to build models when your training dataset has 99 “yes” outcomes and 1 “no” outcome, for example. Be sure to do thorough exploratory data analysis so you understand the distribution of your data before you choose a model and evaluation metric(s).
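As a rough sketch of the ideas above, the four confusion matrix cells and the class balance can be counted directly from true and predicted labels. The labels and the `confusion_counts` helper below are made up for illustration.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive="Yes"):
    """Count TP, FN, FP and TN for a binary problem."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1  # predicted "Yes", reality "Yes"
        elif truth == positive:
            fn += 1  # predicted "No", reality "Yes"
        elif pred == positive:
            fp += 1  # predicted "Yes", reality "No"
        else:
            tn += 1  # predicted "No", reality "No"
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


# Made-up labels for illustration only.
y_true = ["Yes", "Yes", "Yes", "No", "No", "Yes"]
y_pred = ["Yes", "No", "Yes", "Yes", "No", "Yes"]

counts = confusion_counts(y_true, y_pred)
class_balance = Counter(y_true)  # a quick check of how skewed the outcome is

print(counts)
print(class_balance)
```

Checking `class_balance` before modeling is a small part of the exploratory data analysis recommended above.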
| | Prediction: Yes | Prediction: No |
|---|---|---|
| Truth: Yes | TP | FN |
| Truth: No | FP | TN |
Definition: Accuracy is the proportion of times your model predicted the right class out of all the predictions it made. Values range from 0 to 1, with higher values reflecting greater accuracy.
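Here is a minimal sketch of that calculation, applied to the “pick A for everything” test from the introduction. The answer key is an assumption: the 15 questions whose answer isn't A are simply labeled “B” for illustration.

```python
def accuracy(tp, fn, fp, tn):
    """Proportion of all predictions that were correct."""
    return (tp + tn) / (tp + fn + fp + tn)


# The test from the introduction: 85 of the 100 right answers are "A".
# Labeling the other 15 as "B" is a simplification for this sketch.
answer_key = ["A"] * 85 + ["B"] * 15
guesses = ["A"] * 100  # no studying, just answer A every time

correct = sum(guess == key for guess, key in zip(guesses, answer_key))
test_score = correct / len(answer_key)

print(test_score)
print(accuracy(tp=85, fn=0, fp=15, tn=0))
```

Both numbers come out to 0.85, even though the “student” learned nothing. That is the weakness of accuracy when one outcome dominates.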
| | Prediction: Yes | Prediction: No |
|---|---|---|
| Truth: Yes | TP | FN |
| Truth: No | FP | TN |
Definition: Balanced accuracy is the average of the accuracy calculated separately for each class (i.e., for each class, the proportion of items truly in that class that the model predicted correctly). In a multiclass problem, there are different ways of calculating balanced accuracy. Values range from 0 to 1, with higher values reflecting higher accuracy across all classes.
Important to know:
Let’s say you have a dataset for training your model with a sample size of 100, and two potential outcomes, Yes or No. The outcome variable is imbalanced, with 85 items labeled “Yes” and 15 labeled “No.” Your first model’s effort to classify the data gives you this confusion matrix:
| | Prediction: Yes | Prediction: No |
|---|---|---|
| Truth: Yes | 85 | 0 |
| Truth: No | 15 | 0 |

In this case, the model’s regular accuracy would be how many guesses it got right out of its 100 tries: 85%. You might see that metric and think, wow, awesome! However, the model was only good at predicting the “Yes” labels, and not great at predicting the “No” labels; in fact, it got all of those wrong.
Balanced accuracy takes that not-great performance into account, and in this case, is only 50% (the average of the accuracy within each true class in the rows above: 85 out of 85 for “Yes,” and 0 out of 15 for “No”). The model is suddenly looking a lot less awesome, but you’re awesome for checking on this metric and catching the problem.
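To check the arithmetic, here is a small sketch that assumes the common definition of binary balanced accuracy as the mean of the per-class recall (the definition scikit-learn's `balanced_accuracy_score` uses). It models a classifier that predicts “Yes” for all 100 items, 85 of which are truly “Yes” and 15 truly “No.” The helper name is made up for illustration.

```python
def balanced_accuracy(tp, fn, fp, tn):
    """Mean of the per-class accuracy (recall) for a binary problem."""
    recall_yes = tp / (tp + fn)  # share of true "Yes" items predicted "Yes"
    recall_no = tn / (tn + fp)   # share of true "No" items predicted "No"
    return (recall_yes + recall_no) / 2


# A model that predicts "Yes" for everything: 85 true "Yes", 15 true "No".
tp, fn, fp, tn = 85, 0, 15, 0

regular = (tp + tn) / (tp + fn + fp + tn)
balanced = balanced_accuracy(tp, fn, fp, tn)

print(regular)   # 0.85
print(balanced)  # 0.5
```

Under this definition, a model that always predicts the majority class lands at 0.5 on a binary problem no matter how lopsided the data are.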
| | Prediction: Yes | Prediction: No |
|---|---|---|
| Truth: Yes | TP | FN |
| Truth: No | FP | TN |
Definition: For a binary classification problem, precision is the proportion of times the model predicted outcome A correctly out of the total predictions of outcome A (whether correct or incorrect), or TP / (TP + FP). For a multiclass classification problem, precision is calculated with averaging techniques. For both binary and multiclass problems, values for precision range from 0 to 1, with higher values reflecting greater precision.
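A minimal sketch of binary precision from confusion matrix counts follows. Returning 0.0 when the model never predicts the positive class is a convention assumed here, not something the definition above specifies.

```python
def precision(tp, fp):
    """Share of positive predictions that were actually positive."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        # Assumed convention: no positive predictions means precision 0.0.
        return 0.0
    return tp / predicted_positive


# Illustrative counts: 30 correct "Yes" predictions, 10 incorrect ones.
careful_model = precision(tp=30, fp=10)
# A model that predicts "Yes" for everything (85 true "Yes", 15 true "No").
always_yes_model = precision(tp=85, fp=15)

print(careful_model)
print(always_yes_model)
```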
| | Prediction: Yes | Prediction: No |
|---|---|---|
| Truth: Yes | TP | FN |
| Truth: No | FP | TN |
Definition: The F1 score is the harmonic mean of precision and recall, and one of the most popular metrics for evaluating model performance. (Recall is the proportion of times a model predicted Outcome A when Outcome A was truly present. It can also be called “sensitivity” or “probability of detection,” both of which are more descriptive names than “recall.”) The F1 score is calculated by multiplying precision by recall, dividing that by their sum, and then multiplying by 2, or: 2 * [(precision * recall) / (precision + recall)]. This metric can also be used for multiclass problems by averaging the scores for each class. Values range from 0 to 1, with higher values reflecting a better combination of precision and recall.
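The formula can be sketched in a few lines of Python. The counts are made up for illustration, and returning 0.0 when precision and recall are both zero is an assumed convention.

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0  # assumed convention when both are zero
    return 2 * (precision * recall) / (precision + recall)


# Illustrative counts: precision = 3/4 = 0.75, recall = 3/5 = 0.6
example_f1 = f1_score(tp=3, fp=1, fn=2)
print(round(example_f1, 4))
```

Because the harmonic mean is pulled toward the smaller of the two values, a model can't earn a high F1 score by excelling at precision while neglecting recall, or vice versa.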
| | Prediction: Yes | Prediction: No |
|---|---|---|
| Truth: Yes | TP | FN |
| Truth: No | FP | TN |
Definition: The Matthews correlation coefficient (MCC) incorporates true and false positives and negatives, so it addresses all the cells of a confusion matrix, unlike some other metrics. Because it also accounts for the number of items in each class, it can be used on imbalanced datasets. The MCC can be used for binary and multiclass problems. For a binary problem, values range from -1 to 1; 1 represents perfect predictions, 0 represents predictions equivalent to random guesses, and -1 represents inverse predictions (i.e., the model is predicting the opposite outcome consistently). The value ranges change for multiclass problems depending on the data.
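For the binary case, the standard MCC formula can be sketched as follows. Returning 0.0 when the denominator is zero (for example, when the model only ever predicts one class) is a common convention assumed here.

```python
import math


def mcc(tp, fn, fp, tn):
    """Binary Matthews correlation coefficient from confusion matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention: undefined cases are reported as 0
    return numerator / denominator


perfect = mcc(tp=50, fn=0, fp=0, tn=50)     # every prediction correct
inverse = mcc(tp=0, fn=50, fp=50, tn=0)     # every prediction flipped
always_yes = mcc(tp=85, fn=0, fp=15, tn=0)  # predicts "Yes" for everything

print(perfect, inverse, always_yes)
```

Note how the always-“Yes” model, which scores 85% on plain accuracy for data with 85 “Yes” and 15 “No” items, comes out at 0 here: no better than guessing.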
Definition: AUC is an acronym that stands for “area under the [receiver operating characteristic, or ROC] curve.” This one requires an explanation of the ROC as well; we’ll just say here that this metric looks at how well your model’s predicted probabilities put outcomes in the correct rank order, that is, how likely the model is to score a randomly chosen positive case higher than a randomly chosen negative one. It doesn’t consider what threshold you might choose for accepting the model’s prediction of a particular class. Values range from 0 to 1. An AUC of 0 means the model ranks every case in exactly the wrong order, and an AUC of 1 means it ranks them all correctly. An AUC of 0.5 is what you’d expect from chance, so an AUC greater than 0.5 shows that the model performs better than simply guessing.
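The ranking idea can be sketched without drawing the ROC curve at all: compare every positive case with every negative case and count how often the positive one gets the higher score. This pairwise count is equivalent to the area under the ROC curve, though it is slow for large datasets. The labels and scores below are made up for illustration.

```python
def auc_from_scores(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0   # pair ranked in the right order
            elif pos_score == neg_score:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


y_true = [1, 1, 0, 0]
mostly_right = auc_from_scores(y_true, [0.9, 0.4, 0.6, 0.2])
all_right = auc_from_scores(y_true, [0.9, 0.8, 0.3, 0.1])
all_wrong = auc_from_scores(y_true, [0.1, 0.2, 0.8, 0.9])

print(mostly_right, all_right, all_wrong)
```

No classification threshold appears anywhere in this calculation, which is why AUC is described as threshold-independent.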
Definition: Log loss is a measure that penalizes the model for incorrect predictions, but also incorporates the model’s confidence about its predictions. This metric is used for binary and multiclass classification, and is suited for models that provide the probabilities for assigning each potential class. Lower scores are considered “better” with regard to model performance, but this value is not very informative if you’re looking at just one model; it is more useful for model comparison. Values can range from 0 (probabilities were perfectly predicted) to, well, infinity.
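Here is a minimal sketch of binary log loss (average negative log-likelihood), showing how confident wrong predictions are punished far more than hesitant ones. The probabilities are made up, and clipping them with `eps` to avoid taking the log of zero is an assumed implementation detail.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Average negative log-likelihood for binary labels (1 or 0).

    probs holds the predicted probability of class 1 for each item.
    """
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never happens
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 0]
confident_right = log_loss(y_true, [0.9, 0.1])
unsure = log_loss(y_true, [0.5, 0.5])
confident_wrong = log_loss(y_true, [0.1, 0.9])

print(confident_right, unsure, confident_wrong)
```

The three scores increase in that order, and the confident-but-wrong model gets by far the worst one, which is the "confidence" part of the definition in action.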
Remember, no one metric is right for every situation, so choose the option that makes the most sense for your particular goals and desired outcomes. It’s an important decision, but I hope this list has helped you evaluate your choices.
In the next post, we’ll talk about metrics for evaluating regression models, so stay “tuned” for that one! (Yes, that was a goofy machine learning pun. 😜 )
Blog teaser photo by Nathan Dumlao on Unsplash
Susan Currie Sivek, Ph.D., is the data science journalist for the Alteryx Community. She explores data science concepts with a global audience through blog posts and the Data Science Mixer podcast. Her background in academia and social science informs her approach to investigating data and communicating complex ideas — with a dash of creativity from her training in journalism. Susan also loves getting outdoors with her dog and relaxing with some good science fiction. Twitter: @susansivek