Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.

With the Rugby World Cup upon us, I took on the challenge of predicting the results of all the matches. We got our hands on over 10 years’ worth of international test data from our friends over at Opta and set about building a predictive model. Following the methodology that is nicely outlined in the How to Become a Citizen Data Scientist series, a simple workflow leveraging a linear regression model was born.


For those curious, these are the general steps I took (if you’re looking for the exact features we used for the predictions, you’ll have to wait until we see if we’re right):

  1. Take nearly 1000 rugby match XML files and read them in using a wildcard
    • Parse these out using the XML Parse tool and split each match into two rows, one for each team involved.
    • Prep the data to create relevant fields around points difference and match events (e.g. tries and penalties scored).
    • Infer home, away or neutral ground, based on the match location.

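For readers who prefer code to canvas, step 1 can be sketched outside Alteryx roughly as follows. This is a minimal Python illustration, not the workflow itself: the XML layout, the attribute names (`venue_country`, `points` and so on) and the team names are all invented for the example, since the real Opta schema isn't shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical match file; the real Opta XML schema is not shown in this post.
SAMPLE_XML = """
<match id="m1" venue_country="Redland">
  <team name="Reds" country="Redland" points="35" tries="4" penalties="3"/>
  <team name="Blues" country="Blueland" points="11" tries="1" penalties="2"/>
</match>
""".strip()


def match_to_rows(xml_text):
    """Return two rows for one match, one from each team's point of view."""
    match = ET.fromstring(xml_text)
    teams = match.findall("team")
    if len(teams) != 2:
        raise ValueError("expected exactly two teams per match")
    venue = match.get("venue_country")

    rows = []
    for team, opponent in ((teams[0], teams[1]), (teams[1], teams[0])):
        # Infer home, away or neutral ground from the match location.
        if team.get("country") == venue:
            ground = "home"
        elif opponent.get("country") == venue:
            ground = "away"
        else:
            ground = "neutral"
        rows.append({
            "match_id": match.get("id"),
            "team": team.get("name"),
            "opponent": opponent.get("name"),
            "points": int(team.get("points")),
            "points_diff": int(team.get("points")) - int(opponent.get("points")),
            "tries": int(team.get("tries")),
            "penalties": int(team.get("penalties")),
            "ground": ground,
        })
    return rows


rows = match_to_rows(SAMPLE_XML)
```

The same function could be applied to every file matched by a wildcard, for example with `glob.glob("matches/*.xml")` (a hypothetical path).
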

  2. Blend this data with historic ranking data scraped from the World Rugby website
    • Enrich the match data set with rankings information to determine the difference in rank between each side.


  3. Generate features or variables that may be used
    • Create many new fields that may be effective in predicting the result.
    • Use Multi-Row Formula tools to calculate, for example, win % or points scored in each team’s last 5 or 10 games.
    • The same features are created for the World Cup fixtures that we are going to score.

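The "last 5 or 10 games" idea behind the Multi-Row Formula step can be sketched in Python along these lines. It is only an illustration under my own assumptions: the field names (`team`, `points`, `points_diff`) and the choice of win percentage and average points as the rolling measures are hypothetical, and they are not the features actually used in the model.

```python
from collections import defaultdict, deque


def add_form_features(rows, window=5):
    """Add win % and average points over each team's previous `window` games.

    `rows` must already be sorted by match date. Only earlier games are used,
    so a row never sees its own result.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    enriched_rows = []
    for row in rows:
        past = history[row["team"]]
        enriched = dict(row)
        if past:
            wins = sum(1 for game in past if game["points_diff"] > 0)
            enriched["win_pct"] = 100.0 * wins / len(past)
            enriched["avg_points"] = sum(g["points"] for g in past) / len(past)
        else:
            # No earlier games for this team, so no form is available yet.
            enriched["win_pct"] = None
            enriched["avg_points"] = None
        enriched_rows.append(enriched)
        past.append(row)  # the current game becomes history for the next row
    return enriched_rows


# Made-up results for one team, oldest first.
games = [
    {"team": "Reds", "points": 20, "points_diff": 5},
    {"team": "Reds", "points": 10, "points_diff": -3},
    {"team": "Reds", "points": 30, "points_diff": 12},
    {"team": "Reds", "points": 15, "points_diff": -1},
]
featured = add_form_features(games, window=2)
```

Because the current game is appended only after its features are computed, upcoming fixtures (which have no result yet) can be passed through the same function to pick up each team's latest form.
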

  4. Leverage the Data Investigation toolset to understand the variables’ effectiveness
    • As covered in this series and the Citizen Data Scientist webinars, build an understanding of the variables and which of them may be effective for prediction.
    • As a result, the features that were less effective were removed.
    • Here is an example of a scatterplot showing the relationship between the ranking difference and the points difference in the match:


  5. Run the data through multiple models and select the most appropriate
    • I tried Linear Regression, Forest Model and Decision Tree and used the Model Comparison tool to evaluate them.

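As a rough idea of what comparing models involves, one common approach is to score each candidate on held-out matches and keep the one with the lowest error. The sketch below uses root mean squared error (RMSE) in plain Python; the numbers are invented purely for illustration and are not results from this workflow or from the Model Comparison tool.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted points differences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pick_best(actual, predictions_by_model):
    """Return the model name with the lowest RMSE, plus every model's score."""
    scores = {name: rmse(actual, preds) for name, preds in predictions_by_model.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Invented held-out points differences and invented predictions.
actual = [10, -5, 3, 20]
candidates = {
    "linear_regression": [8, -4, 5, 18],
    "forest_model": [9, -8, 1, 16],
    "decision_tree": [12, 0, 0, 12],
}
best, scores = pick_best(actual, candidates)
```
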

  6. Having picked Linear Regression, I built the model and scored the pool games
    • The output is a predicted points difference for each side of the game; the two can then be averaged to give the predicted result.
    • I built a graphic of these predicted results using the reporting toolset.

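To make the averaging in step 6 concrete, here is a minimal sketch. It assumes a single hypothetical feature, `rank_advantage` (how far a team sits above its opponent in the rankings), and fits an ordinary least-squares line on made-up data; the real model uses different, undisclosed features. Since each game is scored once per side, the second side's prediction has its sign flipped before the two are averaged, which is my reading of "averaged" rather than a confirmed detail of the workflow.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_margin(model, feature_a, feature_b):
    """Combine the two per-side predictions into one margin for team A."""
    intercept, slope = model
    pred_a = intercept + slope * feature_a  # A's predicted points difference
    pred_b = intercept + slope * feature_b  # B's predicted points difference
    # B's points difference is the negative of A's, so flip it before averaging.
    return (pred_a - pred_b) / 2


# Made-up training data: rank_advantage -> points difference.
rank_advantage = [-10, -5, 0, 5, 10]
points_diff = [-19, -9, 1, 11, 21]
model = fit_line(rank_advantage, points_diff)

# Team A has a rank advantage of 4, so team B's is -4.
margin_for_a = predict_margin(model, 4, -4)
```

A positive margin means the first team is predicted to win by that many points; a negative one means the second team is.
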


The reporting output showing all 40 Pool games and the predicted winner and winning margin is shown below. Note: kick-off times are shown in UK time (GMT+1).



Stay tuned for when the knockout stages come around as I will be running the model again to predict the final stages and the eventual winner of the tournament! 


Have your own predictive model?  Share it on this thread and we can test our predictive powers together!


Interesting! I'm pretty sure we will see a few upsets that your modelling didn't predict??


Georgia to beat Fiji........ 


What do the group tables look like if all of the above plays out?


Did you include any bookmakers' odds as a variable?


Interestingly @Joe_Lipski, Jim Hamilton said outright at the event last week that he reckons there will be no upsets (a la Japan against SA) this time round.


However it's tough to imagine that there won't be any results that go against what the historic trend says. You look at Wales and Ireland and they're just not in the shape that their results over the last year or so would suggest - Wales could easily slip up against Fiji.


Didn't reference bookies' odds in the model - just used World Rugby's historic rankings. The actual rankings used as variables had little effect on the model's output though.


Who's your money on?