We're excited to announce that we'll be partnering with Credly starting October 19th - see what this means and read the announcement blog here!

Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
wdavis
Alteryx
Alteryx

Hello and welcome to the first blog post to follow on from my new video series, Will's Weekly Wonders, where I look to take data from my own personal life and interests and explore what is possible to do with it in Alteryx. I am a big football/soccer fan and five weeks ago I took on the challenge to see if I could predict whether the team I support, Reading FC, would make it into the playoffs to hopefully ease my nerves for the remaining league games.

 

I was devastated, the model successfully predicted that they would finish outside the playoffs in 7th position and end the season early for me and all other Reading fans. For those who are not familiar, Reading currently play in the EFL Championship which is the division beneath the Premier League. In this league, the top 2 teams will go up automatically to the Premier League, whilst those that finish 3rd – 6th go into the playoffs to compete for the 3rd spot in promotion. At the time of creating the model, Reading sat in 6th place and then proceeded to fall out.

 

The purpose of this was really to see what was possible to achieve in Alteryx when I had half a day spare, a hypothesis and a real passion that I could quantify in data. Overall, the model was a success, it correctly predicted the teams who would finish 1st and 2nd and get promoted automatically, as well as the 4 teams who are now going to compete in the playoffs over the coming weeks. That being said, when diving into all the predicted results, the accuracy does start to fall somewhat. However, the big success for me was how quickly I was able to build a Data Science framework within Alteryx and a platform to help me progress in any future sporting prediction endeavors.

 

So, onto the workflow itself. As you can see from the screenshot beneath, this was split up into 6 sections/macros, with each macro representing one of the teams pushing for the playoff positions and the trophy at the end. The data was sourced from a GitHub repository which contained all the results for the current season.

wdavis_0-1620809705063.png

 

Data Prep & Cleanse – Bournemouth (7th)

 

The first part of the workflow involved cleaning up the original data set with all information stored in one field with comma separated data. The key part to this was to get each result on to 2 separate lines, so that you could analyse the performance of each team individually when this would be fed into the model.

 

Tools used: Text to Columns, Dynamic Rename, Select, Record ID, Transpose, Filter

 

wdavis_1-1620809705066.png

Cleaned output data.

 

Feature Engineering – Reading (6th)

 

Once the dataset had been cleaned and prepared it was time to start the process of feature engineering which is creating new variables from the raw data, based on my (self-professed) footballing knowledge.

 

The features that were created:

  • Form – a numeric score of how many points that team had accumulated in their 5 previous results. In soccer, you are awarded 3 points for a win, 1 for a draw and 0 for a loss. So this metric ranged from 0 – 15
  • Points Per Game (PPG) – Their average points tally prior to each fixture
  • Goals Per Game (GPG) – Their average goals scored in a game prior to each fixture
  • Conceded Per Game (CPG) – Their average goals conceded in a game prior to each fixture

As you can see, not an extensive list and there is definitely scope to add more features in! The dataset was also further prepared, by providing information on the result of each fixture across each individual line item.

 

Tools Used: Formulas, Multi-Row Formula, Select

 

wdavis_2-1620809705068.png

Output from Feature Engineering and further Cleansing.

 

Data Blending & Creating Samples – Barnsley (5th)

 

I now wanted to supplement my original dataset with some additional data which I would be able to add in as features into the model. In recent years, Expected Goals (xG) has become an increasingly popular metric with which to analyse football games, so I have sourced this information from a dedicated website to append on.

 

The next step was to join the dataset to itself, so that for each fixture you could compare the main team’s performance, with that of the opposition. So that for instance, we could model how much a factor your own Form was vs the Opposition Form for each game.

 

I then also split the data into Results (the games that had already been played) and Fixtures (those which I wanted to predict), so that I had my sampled datasets ready to feed into the model and those which to feed into the prediction.

 

Tools Used: Input Data, Select, Find Replace, Join, Filter

 

wdavis_3-1620809705069.png

Output from Data Blending and Sample Creation

 

Model Building with Assisted Modeling – Swansea (4th)

 

The next part was to feed this into the Assisted Modeling tool and let the Machine Learning perform its magic. For clarification here, I have no Data Scientist background and have learnt everything I know since being at Alteryx and using the software in the last 2 years. So, for me, the Assisted Modeling comes in very handy!

 

For those who are new to the Intelligence Suite and Assisted Modeling, you go through 7 steps, which I will outline below...

 

  1. Select Target Variable and Machine Learning Method
    1. In my case the Target = Result
    2. ML Method = Classification, with 3 possible options; Win, Lose, Draw
  2. Select the Automation Level
    1. I opted to go through ‘Step-by-Step’ to understand more about my model. There is also the option for ‘Automatic’ where you are not required to make any changes or decisions
  3. Set Data Types
    1. AM made recommendations on all my variables, adjusting them to either Numeric or Categorical. You have the option to adjust them, if necessary
    2. Some variables, ‘Game Played’, was dropped as it was a unary value (all the same value)
  4. Clean Up Missing Values
    1. There were no missing values in the dataset, so no changes required
  5. Select Features
    1. This is where I was able to start iterating on my models. From the outset, there were 23 ‘features’ being fed into the model. Some were easy to remove, such as ‘RecordID’ which I know has no relevance to the outcome of a football match
  6. Select Algorithms
    1. In each instance, I selected all 3 models to be scored against each other; Decision Tree, Random Forest and XGBoost.
    2. The benefit of Alteryx’s speed to process meant that including all 3 had no performance issues and allowed me to get a much greater comparison
  7. The Results
    1. You are presented with a leaderboard which provides you all the information on the models you have run: Accuracy, Balanced Accuracy, Log Loss and M-statistic
    2. I can also go and check the configuration and overview of each individual model
    3. It’s a great way to quickly compare and contrast when iterating
    4. After the first run I am able to ‘create new model’, which took me back to step 2 and I was able to follow through the process, but in each instance I was just amending the features that were used in the model

 

I didn’t spend as long as I would traditionally like in the preparation steps here and could certainly have benefitted from further prototyping and iterating to progress to a more accurate model, however the focus was on producing a framework and generating some results for the latest fixtures!

 

Tools Used: Assisted Modeling, Transformation, Classification, Fit

 

Model Results & Cleanse – Brentford 3rd

 

I was now able to feed into the prediction the next fixtures in order to get a % likelihood of the 3 possible outcomes: Win, Draw or Loss. The model would then automatically select the value that had the highest % likelihood.

 

Tools Used: Predict, Select, Sort, Summarize, Transpose, Join, Filter

 

wdavis_4-1620809705071.png

Output of the results that had now been predicted.

 

Final Standings

 

Blending the historical data with the predicted values then enabled me to produce a final table, with all predicted games taken into account alongside a view of the next round of fixtures for each team.

 

I used the Reporting tools within Alteryx in order to produce a final PDF table, to provide a much more visual representation of the outcomes in the model itself and ultimately produce something that is nice and easy to interpret for the final users. The ability to combine multiple reporting elements into the PDF report is crucial for this.

 

Tools Used: Formula, Multi-Row Formula, Join, Union, Summarize, Sort, Record ID, Filter, Table, Image, Chart, Layout, Render

 

wdavis_5-1620809705089.png

 

Summary

 

This was a really fun project for me that allowed me to explore completely new datasets, build a model and test hypotheses in only a short amount of time! Whilst the model hasn’t proved to be 100% accurate, I never anticipated this to be the case and wanted to use it as a means to creating a framework for anyone else out there who would like to make predictions on sporting events.

 

I’m sure many people reading this could go on and create/develop this work substantially more and who knows, maybe for next season we might have the perfect model to predict championship football matches!

 

I’ve attached the workflow alongside each of the individual macros so that you can see the workings behind the scenes at every step of the way. Of course, the macros themselves are there purely for show in this instance and to create a nice visual workflow (this did all start as one long and messy workflow!). Note that you'll need Alteryx Intelligence Suite installed in order to run the workflow.

Comments
Thableaus
17 - Castor
17 - Castor

Awesome post @wdavis! Loved the way you did, so organized and well thought.

 

Will get some of these insights to give it a try in the Brazilian League. Although football is so unpredictable I would say.

Joe_Lipski
13 - Pulsar
13 - Pulsar

@Thableaus that’s what makes it the best sport in the world!