Weekly Challenges

Solve the challenge, share your solution and summit the ranks of our Community!




Challenge #18: Predicting Baseball Wins

ACE Emeritus

I will be at Inspire! Let's definitely meet up!! I'll send you a message :)


17 - Castor
Sadly I'm missing Vegas this year but will be in London for Inspire EU. Already making sneaky squirrel plans tho' for Vegas Inspire 2018
8 - Asteroid

This was a difficult but fun challenge. My solution is similar to others, but it could probably be condensed.


15 - Aurora


11 - Bolide

I liked this one a lot! Solution attached.

14 - Magnetar

Here's my solution. I spotted an anomaly in the dataset: CHC has 161 games historically, but we're asked to assume 162 games for the prediction. I corrected it using a Formula tool, but it makes no difference at all to the scores (even when you look to multiple decimal places). Interesting!


**EDIT: Thinking about this, Games isn't included in the model so why would it make a difference anyway? Eh Jamie? What were you thinking? Anyway, I'll do the right thing and leave this here as a memorial to that one time I got it wrong.**




17 - Castor
8 - Asteroid

Tough, but a useful push into learning some of the core tools of Alteryx. The Score tool is the one to remember, by the looks of things!

7 - Meteor

This was a challenge that I burned too many brain cells over, for sure.


UPDATE: I challenged Alteryx and myself to find a better solution than a multivariate linear regression model, so I've included tests of five separate supervised algorithms and used the Model Comparison tool to select the best model. The winner was the Neural Net using the three principal components. There are advantages to using PCA results with neural nets, as can be found in an answer here: https://stats.stackexchange.com/questions/67986/does-neural-networks-based-classification-need-a-dim.... The neural net model has a mean absolute error of less than 2, which is fantastic but also makes me suspect that it's overfitting or even memorizing the data. Its root mean squared error is about 2.5 times lower than the Random Forest model's and almost 4 times lower than any other model's.
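For anyone comparing models by hand, here is a minimal Python sketch of the two error measures mentioned above. The win totals are made up for illustration and are not from the challenge data; the function names `mae` and `rmse` are my own, not anything from Alteryx.

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error: average size of the misses, in wins."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(a - p)))

def rmse(actual, predicted):
    """Root mean squared error: like MAE, but penalizes large misses more."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((a - p) ** 2)))

# Hypothetical win totals, invented for this example only.
actual_wins = [90, 81, 75, 68]
predicted_wins = [88, 83, 75, 72]

print("MAE :", mae(actual_wins, predicted_wins))
print("RMSE:", rmse(actual_wins, predicted_wins))
```

To check for overfitting, one option is to compute the same two numbers on teams or seasons the model never saw during training; a large jump between the training and holdout errors is the usual warning sign.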


The team statistics are normally distributed, a plus. But there are a lot of highly correlated attributes, because baseball statistics are a mix of raw totals and computed statistics; for example, OPS is On-base Percentage (OBP) + Slugging Percentage (Slug). Highly correlated variables create collinearity, which makes regression models freak out by making some variables look more important or less important than they should be. There are a number of ways to handle highly correlated predictor variables. To follow this challenge's directions I selected the top 10 variables (which, as a data scientist, I would not normally do), and then noted that those variables created collinearity; for example, Runs and Runs per Game are really the same thing and highly correlated with one another. I handled this problem by creating principal components, and found that 2 PCs accounted for 93+% of the variance within the data, an incredibly high coverage.
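A rough sketch of the principal-components idea in Python with NumPy, for readers who want to see it outside Alteryx. Everything here is an assumption for illustration: the data is synthetic (not the challenge dataset), the five columns and their coefficients are invented, and the sketch does not try to reproduce the 93% figure. It only shows how exact relationships like OPS = OBP + Slug and Runs = Runs per Game x 162 collapse into fewer uncorrelated components.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teams = 30  # assumed number of teams

# Synthetic team stats, invented for this example only.
obp = rng.normal(0.320, 0.015, n_teams)
slug = rng.normal(0.410, 0.030, n_teams)
ops = obp + slug                                  # computed stat: exactly OBP + Slug
runs_per_game = (4.5 + 20 * (obp - 0.320) + 8 * (slug - 0.410)
                 + rng.normal(0, 0.1, n_teams))
runs = runs_per_game * 162                        # same information as runs_per_game

X = np.column_stack([obp, slug, ops, runs_per_game, runs])

# Standardize each column so no stat dominates because of its units.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via eigen-decomposition of the correlation matrix.
corr = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]                 # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()               # share of variance per component
scores = Z @ eigvecs[:, :2]                       # first two PCs as new predictors

print("Explained variance ratio:", np.round(explained, 4))
```

The component scores are uncorrelated with each other by construction, which is why feeding them to a regression sidesteps the collinearity among the original columns.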


My wins predictions are somewhat different from most other solutions. I believe that's because my model has less of the collinearity created by the tightly correlated top 10 variables: its R-squared is 0.3726 and its adjusted R-squared is 0.3002. Other solutions reported an R-squared of 0.5571 and an adjusted R-squared of 0.3241, and I read that large drop after the adjustment as a sign of an internal collinearity issue. The model is confused as to which variables are the real indicators of predicted wins.
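For reference, the standard adjusted R-squared formula is 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below plugs in the R-squared values quoted above. The thread does not state n or p, so n = 30 teams, p = 10 for the top-10-variable models, and p = 3 for the principal-components model are my assumptions.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Assumed, not stated in the thread: 30 observations (one per team),
# 10 predictors for the top-10-variable model, 3 for the PCA-based model.
ten_variable_model = adjusted_r2(0.5571, n=30, p=10)
three_pc_model = adjusted_r2(0.3726, n=30, p=3)

print("10 predictors:", round(ten_variable_model, 4))
print(" 3 predictors:", round(three_pc_model, 4))
```

Under those assumptions, both results land close to the adjusted values quoted above, which suggests that much of the gap in the 10-variable models may come from the penalty for the number of predictors relative to 30 observations. Treat that as one reading of the formula rather than a confirmed diagnosis.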


8 - Asteroid

This was new for me! Had a look at the solution to help me along the way.