DrDan
Alteryx Alumni (Retired)

In my first post of this series, I presented the predicted probabilities that Oliver Wahner and I developed for teams advancing from the Group Round of this year's FIFA World Cup. In addition, we indicated which group-level match-ups our models suggested would be the ones to watch. We did point out the Denmark vs. Peru match, which lived up to the models' suggestion that it would be very closely contested. Another thing we pointed out was that the models indicated that Group F could be dramatic. However, we thought the drama would be between Mexico and Sweden, not between Mexico and Germany. We wish we could say we had predicted that result, but we were not alone in missing it (we were a smidge more bullish about Mexico's chances in the Group Round than some others before the start of the World Cup). It seems that even Mexico's home fans were surprised by the result: after Mexico went up 1-0 against Germany in the match, the Institute of Geologic and Atmospheric Investigations in Mexico reported seismic activity in the Mexico City area, which they indicated was likely due to Mexico's fans jumping up and down for joy. I must admit, I'm not sure what to say about Japan's win over Colombia, and my guess is that neither does Colombia, given that Japan had never won against any South American team in World Cup play, and had never previously beaten Colombia.

 

In this post, I'll go into the details of how we created the model. I start by discussing the nature of the prediction problem, and the modeling algorithms appropriate for that problem type. After this, I move on to introducing the predictive metrics we used for model creation, and then compare the performance of the different models and the key relationships between the predictive metrics and the predicted probability of a win, loss, or draw for a match.

 

At the point when this post is being released, the Group Round will be roughly half-way through, which is a good time to look at how many of the "surprises" that have been reported in the press are really a surprise in this year’s World Cup. A number of those "surprises" (such as Spain and Portugal playing to a draw) were not that surprising, and while Mexico beating Germany was a big upset, it was well within the realm of the probable.

 

Creating the Model

 

Since the goal of this project is to predict the probability of each of the possible win/lose/draw outcomes of a match for what we call a "focal" team (i.e., one of the two teams competing in the match, which is randomly selected), the modeling method used needs to be able to address the three possible outcomes of a match, which makes this a multi-class classification problem. The Boosted Model, Decision Tree, Forest Model, Naive Bayes, Neural Network, and Support Vector Machine tools in Alteryx directly handle multi-class classification problems, while the Spline Model tool handles them indirectly, which is much less desirable. For this project, we decided to look at the use of the Boosted Model, Forest Model, and Neural Network model since we wanted to limit the scope of our efforts, and we believed they were likely to produce good models with this data.
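To make the "focal team" framing concrete, here is a minimal sketch of how a match record could be turned into a three-class target. The field names (team_a, goals_a, and so on) and the 50/50 random choice are illustrative assumptions, not the layout of the data we actually used.

```python
import random

def label_for_focal_team(match, rng):
    """Pick one of the two teams at random as the focal team and label the result.

    `match` is a dict with hypothetical keys team_a, team_b, goals_a, goals_b.
    """
    if rng.random() < 0.5:
        focal, opponent = match["team_a"], match["team_b"]
        goals_for, goals_against = match["goals_a"], match["goals_b"]
    else:
        focal, opponent = match["team_b"], match["team_a"]
        goals_for, goals_against = match["goals_b"], match["goals_a"]

    # The target always describes the match from the focal team's side.
    if goals_for > goals_against:
        outcome = "Win"
    elif goals_for < goals_against:
        outcome = "Lose"
    else:
        outcome = "Draw"
    return {"focal": focal, "opponent": opponent, "outcome": outcome}

rng = random.Random(2018)
example = {"team_a": "Germany", "team_b": "Mexico", "goals_a": 0, "goals_b": 1}
labeled = label_for_focal_team(example, rng)
print(labeled)
```

Choosing the focal team at random keeps the target from being tied to how the two teams happen to be ordered in the source data.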

 

The selection of a modeling method (or methods) is important, but what is typically more important is the selection of predictor variables to use in the model(s). Ultimately, any model is only as good as the data it is based on. In many problem domains there are commonly used "go-to" metrics, and head-to-head sports and other competitions are one of those domains. The most common go-to metric for head-to-head competitions is the Elo rating. The metric is named for its inventor, Arpad Elo, and was first applied to rating chess players. Many of the sports-related predictions made on the popular FiveThirtyEight site make use of Elo ratings. The exact formula used to produce Elo ratings varies a bit from sport to sport, but the ratings are always based on a geometrically decaying average of past match performances, and they are intended to be a measure of a team's or competitor's underlying strength.

 

Rather than create our own Elo ratings for international association football teams, we made use of the World Football Elo Rating site, which rates national teams in international association football. The factors considered in the World Football Elo ratings are:

  • The type of match (e.g., a friendly match, a World Cup qualifier, a World Cup final, etc.)
  • The actual match outcome (win/lose/draw)
  • Whether the team is playing at home
  • A pre-match, calculated win expectancy

The exact formulas used for the calculations can be found here.
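For readers who have not seen an Elo update before, the sketch below shows the generic form such ratings take and how the four factors above enter it. The default values for k (the match-type weight) and home_bonus are illustrative placeholders that I am assuming for the example, and the sketch leaves out any further adjustments the site applies; treat the site's own formulas as the reference.

```python
def win_expectancy(rating, opp_rating, home_bonus=0.0):
    """Pre-match win expectancy from the rating difference (logistic curve, base 10)."""
    diff = rating + home_bonus - opp_rating
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def updated_rating(rating, opp_rating, result, k=40.0, home_bonus=0.0):
    """Generic Elo update.

    result: 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    k: weight for the type of match (assumed value, for illustration only).
    home_bonus: rating points credited to a team playing at home (assumed value).
    """
    expected = win_expectancy(rating, opp_rating, home_bonus)
    return rating + k * (result - expected)

# Two equally rated teams on neutral ground: the winner gains k / 2 points.
print(updated_rating(1800.0, 1800.0, result=1.0, k=40.0))  # 1820.0
```

Because each update moves the rating only part of the way toward the latest result, older matches carry progressively less weight, which is the decaying-average behavior described earlier.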

 

Others, such as Nate Silver in creating ESPN's Soccer Power Index, have found that factors beyond Elo ratings are needed to predict international association football match outcomes well. In particular, home advantage (both home country and home continent) seems to play more of a role than it does in most sports, as does the match type. Based on this, the models we investigated included the following predictor variables:

  • The difference in Elo ratings between the focal team and its opponent (Elo_Dif), which indicates that the focal team is weaker than its opponent if it is negative, and stronger than its opponent if it is positive
  • The focal team's Elo rating (Elo) which is expected to have a positive effect on the match outcome for the focal team
  • The opponent team's Elo rating (OElo), which is expected to have a negative effect on the match outcome for the focal team
  • The focal team's home advantage (Home), which indicates whether the focal team is playing in its home country (Country); in its opponent's home country (OCountry); on its home continent, when its opponent is not (CoAdv); on its opponent's home continent, when it is not (OCoAdv); or in a setting where neither team has a home advantage, since either both are playing on their home continent or neither is (NoCoAdv)
  • Whether a match is part of a major tournament (Major_Tournament) such as a continental championship tournament or the World Cup itself
  • The round of the tournament (Round) which indicates whether the match is a major tournament qualifier (Q), a group round-robin match (RR), a round of 16 match (R16), a quarter finals match (QF), a semifinals match (SF) or a tournament finals match (F)

The predictors Major_Tournament and Round are included more for their possible modifying effects on the other predictors (technically their "interactions") than for their direct effects on the outcome of a match. To a large extent this is also true for the predictors Elo and OElo, since the main effect associated with Elo ratings should be captured in the difference in those ratings.
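As a rough illustration of how these predictors could be assembled for one match, consider the sketch below. The predictor names and Home categories come from the list above; the input record layout, the function names, and the ratings in the example are made up for illustration.

```python
def home_category(venue_country, venue_continent, focal, opponent):
    """Return the Home category for the focal team.

    `focal` and `opponent` are dicts with hypothetical keys
    'country', 'continent', and 'elo'.
    """
    if venue_country == focal["country"]:
        return "Country"
    if venue_country == opponent["country"]:
        return "OCountry"
    focal_on_home_continent = venue_continent == focal["continent"]
    opp_on_home_continent = venue_continent == opponent["continent"]
    if focal_on_home_continent and not opp_on_home_continent:
        return "CoAdv"
    if opp_on_home_continent and not focal_on_home_continent:
        return "OCoAdv"
    return "NoCoAdv"  # both, or neither, are on their home continent

def build_predictors(focal, opponent, venue_country, venue_continent,
                     major_tournament, round_code):
    """Assemble one row of predictors from the focal team's point of view."""
    return {
        "Elo_Dif": focal["elo"] - opponent["elo"],
        "Elo": focal["elo"],
        "OElo": opponent["elo"],
        "Home": home_category(venue_country, venue_continent, focal, opponent),
        "Major_Tournament": major_tournament,
        "Round": round_code,
    }

# Made-up teams and ratings, purely to show the shape of the output.
team_a = {"country": "Team A", "continent": "Europe", "elo": 1900}
team_b = {"country": "Team B", "continent": "South America", "elo": 1850}
row = build_predictors(team_a, team_b, "Team C", "Europe", True, "RR")
print(row)
```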

 

We adopted a train/test approach to compare different models. In total we had data from 9020 international association football matches available to us, dating from February of 1998 through March of this year. There were other international matches played in this time period, but they were either "friendlies" or very minor tournaments, which many national teams treat in much the same way as pre-season games in other sports, so they were not used to create the models. The data was divided into a training set containing 6314 matches, and a test set containing 2706 matches. The match scores only include regulation and extra time goals, and do not include penalty kick shoot-out tie breakers.
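The training and test set sizes are consistent with a 70/30 split of the 9020 matches. A minimal sketch of such a split is shown below; that the split was random, and the seed used here, are assumptions made for the example.

```python
import random

def train_test_split_indices(n_matches, seed=1):
    """Shuffle match indices and split them 70/30 into training and test sets."""
    train_size = (n_matches * 7) // 10  # integer arithmetic avoids rounding surprises
    indices = list(range(n_matches))
    random.Random(seed).shuffle(indices)
    return indices[:train_size], indices[train_size:]

train_idx, test_idx = train_test_split_indices(9020)
print(len(train_idx), len(test_idx))  # 6314 2706
```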

 

In creating the models, we examined the effect of varying the depth of the interactions used in the Boosted Model (ultimately, a model with five-way interactions worked best for this method) and the number of nodes used in a single hidden layer of the Neural Network model (the final model contains 18 nodes in the hidden layer). The Neural Network model also required more than the default number of iterations to converge; we found that setting the iteration limit to 2000 (as opposed to a default of 100) was sufficient. No "hyperparameter tuning" was needed for the Forest Model.

 

Comparing the Predictive Efficacy of the Different Models

 

Table 1 shows the test sample goodness of fit metrics for the final Boosted Model, Forest Model, and Neural Network configurations, while Tables 2 to 4 present the "confusion matrix" (really a cross tabulation of predicted versus actual win/lose/draw classes) for each of the three models. What is apparent is that all three models do a fairly good job of predicting actual wins and losses, but have a very hard time predicting draws. While not ideal, this is not surprising, since a win occurs for the focal team whether it beats its opponent 1-0 or 12-0, while a draw only occurs if both teams score exactly the same number of goals. The two summary measures indicate that the Boosted Model has the best overall accuracy, while the F1 score (which places a greater weight on the difficulty of predicting draws) is best for the Forest Model, with the Boosted Model being second. The Neural Network model is second best for accuracy, but third for the F1 score. It is important to note that the differences in the values of these metrics are very small, so all three models are able to predict match outcomes roughly equally well. Ultimately, we decided to take a simple average of the Boosted Model and Neural Network predictions, since they had the best overall accuracy and because we noticed some subtle differences in the two models’ predictions that we wanted to combine. We also re-trained the models (the Boosted Model with five-way interactions and the Neural Network with 18 nodes in the hidden layer) using the data from all 9020 matches.
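The simple average of two models' predictions mentioned above amounts to averaging the predicted probability of each class. A minimal sketch follows; the probabilities in the example are made-up numbers, not output from our models.

```python
CLASSES = ("Win", "Lose", "Draw")

def average_probabilities(*model_probs):
    """Average the class probabilities from any number of models, class by class."""
    n = len(model_probs)
    return {c: sum(p[c] for p in model_probs) / n for c in CLASSES}

# Made-up probabilities for one match, for illustration only.
boosted = {"Win": 0.20, "Lose": 0.60, "Draw": 0.20}
nnet = {"Win": 0.10, "Lose": 0.70, "Draw": 0.20}
combined = average_probabilities(boosted, nnet)
print(combined)
```

If each model's probabilities sum to one, the averaged probabilities do as well, so no re-normalization is needed.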

 

Table 1. Test Sample Model Fit Measures

| Model | Accuracy | F1 | Accuracy Draw | Accuracy Lose | Accuracy Win |
| --- | --- | --- | --- | --- | --- |
| Boosted_5Way | 0.6338 | 0.5449 | 0.0184 | 0.7988 | 0.8175 |
| Forest | 0.6279 | 0.5554 | 0.1254 | 0.7679 | 0.7728 |
| Nnet18 | 0.6301 | 0.5394 | 0.0017 | 0.8046 | 0.8119 |

 

Table 2. Boosted Model Confusion Matrix

 

|  | Actual Draw | Actual Lose | Actual Win |
| --- | --- | --- | --- |
| Predicted Draw | 11 | 14 | 10 |
| Predicted Lose | 295 | 826 | 186 |
| Predicted Win | 292 | 194 | 878 |

 

Table 3. Forest Model Confusion Matrix

 

|  | Actual Draw | Actual Lose | Actual Win |
| --- | --- | --- | --- |
| Predicted Draw | 75 | 55 | 54 |
| Predicted Lose | 263 | 794 | 190 |
| Predicted Win | 260 | 185 | 830 |

 

Table 4. Neural Network Confusion Matrix

 

|  | Actual Draw | Actual Lose | Actual Win |
| --- | --- | --- | --- |
| Predicted Draw | 1 | 0 | 8 |
| Predicted Lose | 307 | 832 | 194 |
| Predicted Win | 290 | 202 | 872 |
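The accuracy figures in Table 1 can be reproduced directly from the confusion matrices. The sketch below does this for the Boosted Model counts in Table 2, reading the per-class accuracies as the share of actual draws, losses, and wins that were predicted correctly. One observation that is an inference from the numbers rather than something stated above: the unweighted mean of the three per-class accuracies matches the value in the F1 column of Table 1 for all three models.

```python
CLASSES = ("Draw", "Lose", "Win")

# Rows are predicted classes, columns are actual classes (the Table 2 layout).
BOOSTED = [
    [11, 14, 10],     # Predicted Draw
    [295, 826, 186],  # Predicted Lose
    [292, 194, 878],  # Predicted Win
]

def summarize(matrix):
    """Overall accuracy, per-class accuracy, and the unweighted mean of the latter."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    per_class = {}
    for j, name in enumerate(CLASSES):
        actual_total = sum(row[j] for row in matrix)  # column sum = actual count
        per_class[name] = matrix[j][j] / actual_total
    mean_per_class = sum(per_class.values()) / len(per_class)
    return {"accuracy": correct / total,
            "per_class": per_class,
            "mean_per_class": mean_per_class}

summary = summarize(BOOSTED)
print(round(summary["accuracy"], 4))        # 0.6338
print(round(summary["mean_per_class"], 4))  # 0.5449
```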

 

Examining How the Predictors Influence Match Outcomes

 

All three of the modeling methods we are using are black box in nature. However, methods have been developed to “peer inside” the black box. The most widely used of these is the partial dependence plot. We recently added a Partial Dependency tool to the new Laboratory District on the public Alteryx Analytics Gallery. This tool provides a consistent way, identical across all predictive model types, of examining the relative impact of different predictors within a model, and it graphically traces out the relationship between a predictor variable and the predicted value of the target. We created these plots for all three models estimated using the training data, and for the final Boosted Model estimated using all of the available 9020 matches.
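The idea behind a partial dependence plot is simple enough to sketch in a few lines: for each value of the predictor of interest, set that predictor to the value in every row of the data, score all the rows, and average the predictions. The code below follows that standard definition; it is not the implementation of the Partial Dependency tool, and the toy scoring function is a hypothetical stand-in for a fitted model.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over the data with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original data is untouched
            modified[feature] = value  # override only the predictor of interest
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve

def toy_win_probability(row):
    """Hypothetical stand-in for a fitted model's predicted win probability."""
    p = 0.5 + row["Elo_Dif"] / 1000.0
    if row["Home"] == "Country":
        p += 0.1
    return min(1.0, max(0.0, p))

rows = [
    {"Elo_Dif": -100, "Home": "Country"},
    {"Elo_Dif": 50, "Home": "NoCoAdv"},
]
curve = partial_dependence(toy_win_probability, rows, "Elo_Dif", [-200, 0, 200])
for value, avg in curve:
    print(value, round(avg, 3))
```

Because the procedure only needs a way to score rows, it applies unchanged to any model type, which is what makes the resulting plots comparable across models.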

 

The impact plots (which indicate the relative range in the ability of a predictor to move the target) for all three models indicate that the predictor with the greatest impact is the difference in Elo ratings between the two teams. The Elo rating of the focal team, the Elo rating of the opposing team, the relative home field advantage of the two teams, and the nature of the round of play have roughly equivalent effects in their ability to influence the target variable, with some changes in order across the different modeling methods. The variable with the least impact is the indicator of whether the match is part of a major tournament. The impact plot for the final Boosted Model used to predict probabilities is shown in Figure 1.

 

Figure 1

The partial dependence plots for the predictors have very similar shapes across the different models estimated, which is reassuring since each model type indicates comparable effects for the predictors. If the nature of a predictor's effect changed drastically across model types, it might suggest either a problem with the predictor’s construction, or the possibility that some model types are overfitting the training sample data.

 

The effect of the difference in the Elo rating between the focal team and its opponent is consistent with expectations. Specifically, for all models, the probability of the focal team winning increases as the difference in Elo ratings moves in its favor, the probability of losing increases as the difference in Elo ratings moves in its opponent’s favor, and the probability of a draw peaks when the difference between the two teams’ Elo ratings approaches zero. The partial dependency plot for the Boosted Model trained using the data from all matches is shown in Figure 2.

 

Figure 2

The effect of relative home field advantage runs as one would expect with the highest average probability of winning occurring when the match occurs in the focal team’s own country, and the lowest probability of winning occurring when the focal team is playing in its opponent’s country. Consistent with this, the focal team’s highest probability of losing occurs when it is playing in its opponent’s own country, and the lowest occurs when it is playing in its own country. Figure 3 shows the partial dependence plot for relative home field advantage for the Boosted Model trained on all matches.

 

Figure 3

The effects of the focal team’s own Elo rating and its opponent’s Elo rating are similar (higher Elo ratings for the focal team enhance the chance of winning, while higher Elo ratings for its opponent increase the chance of losing), but are not completely consistent across different models. This is not very surprising, since we believe that much of the effect of these variables is in modifying the effect of the difference in Elo ratings and relative home advantage. Similarly, we believe the tournament round (round robin, semifinals, and so on) and whether the match was played as part of a major tournament are mainly of value for their modifying effects on other predictors.

 

Looking at the Surprises in More Detail

 

As indicated at the start of this post, Mexico’s win was a major surprise. Our combined models gave Mexico a 14.4% chance of beating Germany, and an 18.4% chance of playing Germany to a draw. While a 14.4% probability of winning the match is low, it is not on the order of a lottery ticket being the winner in the next Powerball drawing, or someone being hit by lightning. In fact, over a ten-game stretch, we would expect Mexico to beat Germany between one and two times, and to play them to a draw between one and two times. Japan's victory over Colombia was a bit more of an upset, with a predicted chance of 9.3% of occurring, and a predicted chance of 29.0% of the match ending in a draw. Japan's probability of winning the match was low, but counter to press reports, it was actually conceivable.
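The ten-game arithmetic is easy to check. The short sketch below uses the Mexico vs. Germany probabilities quoted above, and assumes (purely for the illustration) that the ten matches are independent and played with the same probabilities each time.

```python
p_win, p_draw = 0.144, 0.184  # Mexico vs. Germany, from the combined models
games = 10

expected_wins = games * p_win    # about 1.44 wins
expected_draws = games * p_draw  # about 1.84 draws

# Chance that Mexico wins at least one of the ten matches.
p_at_least_one_win = 1 - (1 - p_win) ** games

print(round(expected_wins, 2), round(expected_draws, 2))
print(round(p_at_least_one_win, 3))
```

Under those assumptions, Mexico would be more likely than not to win at least once over the ten matches, with a probability of roughly 79%.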

 

The other matches that were considered to be surprises included Portugal playing Spain to a draw (which we predicted had a 31.8% chance of occurring, along with a 27.2% chance of Portugal winning the match), Iceland playing Argentina to a draw (our prediction was that this outcome had a 25.1% chance, along with a 22.1% chance that Iceland would win the match), Switzerland playing Brazil to a draw (our prediction gave this a 25.1% chance of occurring, along with an 18.7% chance that Switzerland would win the match), and, finally, England’s close (2-1) match with Tunisia (where our models gave England a 67.0% chance of winning, which means Tunisia had a 33.0% chance of winning or playing to a draw, which seems perfectly consistent with a 2-1 victory for England).

 

The upshot of all this is that while Mexico’s win over Germany was something of an upset, and Japan's win over Colombia was definitely an upset, both were well within the range of what can be expected in a World Cup tournament. Every other “surprise” in this tournament has pretty much been business as usual for international association football. Why so many things have been considered a “surprise” so far in this World Cup can likely be chalked up to our human desire for simple certainties in a complex, uncertain world. Put another way, life is a probability distribution, get over it.

 

What’s Next?

 

In this post we have discussed how we came up with our probabilities of a win, loss, or draw for each match. In the final post of this series, I will discuss how one goes from individual match probabilities of win/lose/draw to the probabilities that an individual team will advance out of the Group Round of the World Cup to the knockout rounds. To do this, we make use of custom R code within an Alteryx workflow to simulate the Group Round 100,000 times. In addition, we will continue to look at the actual results of the Group Round within the context of what was predicted prior to the start of the World Cup.

Dan Putler
Chief Scientist

Dr. Dan Putler is the Chief Scientist at Alteryx, where he is responsible for developing and implementing the product road map for predictive analytics. He has over 30 years of experience in developing predictive analytics models for companies and organizations that cover a large number of industry verticals, ranging from the performing arts to B2B financial services. He is co-author of the book, “Customer and Business Analytics: Applied Data Mining for Business Decision Making Using R”, which is published by Chapman and Hall/CRC Press. Prior to joining Alteryx, Dan was a professor of marketing and marketing research at the University of British Columbia's Sauder School of Business and Purdue University’s Krannert School of Management.

Comments
ADerbak

Great article, @DrDan!

 

 I was fortunate enough to hear you talk about this at Inspire 2018, but it’s great to digest this information a second time. Not to mention seeing the predicted vs actual results and your explanations surrounding them.

 

Looking forward to the next article!

-AD