Do you have the skills to make it to the top? Subscribe to our weekly challenges. Try your best to solve the problem, share your solution, and see how others tackled the same problem. We share our answer too.
Weekly Challenge

Challenge #18: Predicting Baseball Wins

I will be at Inspire! Let's definitely meet up!! I'll send you a message :)

NJ

Aurora
Sadly I'm missing Vegas this year but will be in London for Inspire EU. Already making sneaky squirrel plans tho' for Vegas Inspire 2018
Asteroid

This was a difficult but fun challenge. My solution is similar to others, but it could probably be condensed.

Week18Challenge.png

Alteryx Certified Partner

Workflow

Asteroid

I liked this one a lot! Solution attached.

Alteryx Certified Partner

Here's my solution. I spotted an anomaly in the dataset: CHC has 161 games historically, but we're asked to assume 162 games for the prediction. I corrected it using a Formula tool, but it actually makes no difference at all to the scores (even when you look to multiple decimal places). Interesting!

 

**EDIT: Thinking about this, Games isn't included in the model so why would it make a difference anyway? Eh Jamie? What were you thinking? Anyway, I'll do the right thing and leave this here as a memorial to that one time I got it wrong.**

 

Spoiler
challenge_18.png

 

Spoiler
Capture.PNG
Alteryx Certified Partner

Tough but a useful push into learning some of the core tools of Alteryx. Score is the one to remember by the looks of things!

Meteor

This was a challenge that I burned too many brain cells over, for sure.

 

UPDATE: I challenged Alteryx and myself to find a better solution than a multivariate linear regression model, so I've included tests of five separate supervised algorithms and used the Model Comparison tool to select the best model. The winner was the Neural Net using the three principal components. There are advantages to using PCA results with neural nets, as can be found in an answer here: https://stats.stackexchange.com/questions/67986/does-neural-networks-based-classification-need-a-dim.... The neural net model has a mean absolute error of less than 2, which is fantastic but also makes me skeptical: it may be overfitting or even memorizing the data. Its root mean squared error is about 2.5 times lower than the Random Forest model's and almost 4 times lower than that of any other model.
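For anyone unfamiliar with the two error measures compared above, here is a minimal sketch of how they are calculated. The win totals are hypothetical and only for illustration; they are not from the challenge data, and this is not how the Model Comparison tool is implemented internally.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, ignoring their sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the average squared error; large misses count for more."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

# Hypothetical win totals, for illustration only (not from the challenge data).
actual_wins = [95, 88, 81, 74, 68]
predicted_wins = [93, 89, 84, 73, 70]

mae = mean_absolute_error(actual_wins, predicted_wins)
rmse = root_mean_squared_error(actual_wins, predicted_wins)
print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```

Both measures look flattering when a model is scored on the same rows it was trained on, so scoring a held-out set of teams or seasons is one way to check the overfitting concern.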

 

The team statistics are normally distributed, which is a plus. However, many of the attributes are highly correlated, because baseball statistics are a mix of raw compilations and computed statistics; for example, OPS is On-base Percentage (OBP) + Slugging Percentage (Slug). Highly correlated variables create collinearity, which makes regression models freak out by making some variables look more important or less important than they should be. There are a number of ways to handle highly correlated predictor variables, but to follow this challenge's directions I highlighted the top 10 variables (which as a data scientist I would not do), and then I noted that those variables created collinearity; for example, Runs and Runs per Game are really the same thing and highly correlated with one another. I handled this problem by creating principal components, and found that 2 PCs accounted for 93+% of the variance within the data, an incredibly high coverage.
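To make the Runs versus Runs per Game point concrete, here is a small sketch using only the Python standard library. The run totals are made up, and the helper names `pearson` and `first_pc_share` are hypothetical. It relies on one known fact: for two standardized variables with correlation r, the correlation matrix has eigenvalues 1 + |r| and 1 - |r|, so the first principal component captures (1 + |r|) / 2 of the variance.

```python
import math
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def first_pc_share(xs, ys):
    """Share of total variance captured by the first principal component
    of two standardized variables: (1 + |r|) / 2."""
    return (1 + abs(pearson(xs, ys))) / 2

# Hypothetical season run totals, for illustration only.
runs = [650, 700, 720, 800, 845]
runs_per_game = [r / 162 for r in runs]  # assumes a 162-game season

print(f"correlation:    {pearson(runs, runs_per_game):.4f}")
print(f"first PC share: {first_pc_share(runs, runs_per_game):.4f}")
```

When every team plays the same number of games, the second variable is just a rescaled copy of the first, so one component carries essentially all of the information and the regression gains nothing from having both columns.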

 

My win predictions are somewhat different from most other solutions. I believe that is because my regression model, with an R-squared of 0.3726 and an adjusted R-squared of 0.3002, reflects the reduced collinearity from replacing the tightly correlated top 10 variables with principal components. Other solutions reported an R-squared of 0.5571 and an adjusted R-squared of 0.3241; that large drop after the adjustment suggests to me an internal collinearity issue, where the model is confused as to which variables are the real indicators of predicted wins.
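As background for the two pairs of figures above, here is a sketch of the standard adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1). The sample size and predictor counts are assumptions rather than anything stated in this thread: n = 30 (one row per MLB team), p = 3 predictors for the principal-components model, and p = 10 for the top-10-variables model.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Standard adjustment of R-squared for the number of predictors."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Assumed values: 30 observations; 3 vs. 10 predictors (not stated in the thread).
pca_model = adjusted_r_squared(0.3726, n_obs=30, n_predictors=3)
top10_model = adjusted_r_squared(0.5571, n_obs=30, n_predictors=10)

print(f"PCA model:    {pca_model:.4f}")
print(f"Top-10 model: {top10_model:.4f}")
```

Under those assumptions the formula lands within rounding of both reported adjusted values, which is worth keeping in mind: the size of the drop from R-squared to adjusted R-squared depends on how many predictors there are relative to observations, whether or not those predictors are collinear.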

 

Asteroid

This was new for me! Had a look at the solution to help me along the way.