
NickJ
Alteryx Alumni (Retired)

If you’re a grizzled veteran on the analytics conference circuit, you get to see a lot of presentations from vendors, thought leaders and customers.

 

The “cult of the new” is alive and well, I can assure you, and analyst buzzwords sprout in spring along with the release of magic quadrants, waves and other two-by-two matrices.

 

That sounds suitably ‘grizzled,’ but honestly, it’s always fascinating to see new takes on data-driven success, and it’s equally interesting to pay attention to the stories that survive from year to year.

 

Remember how everyone still holds up Minority Report as a prescient view of the future? That film was released in 2002. Nearly two decades ago. There are kids entering college soon who WEREN’T EVEN BORN THEN. Yet, it still holds court in the public’s imagination as containing the kinds of innovation and ‘future-tech’ that we’ve demanded/clamoured for ever since.


 

Minority Report: The future…as seen from the past.

 

 

 

In analytics circles, other memes have survived for even longer. The great-granddaddy is probably the (mostly?) apocryphal story of the correlation between beer and diapers (I’ll leave the historical evolution of that tale for others to explore). Instead, I want to explore some events that happened the year Minority Report was released, and that were turned into an international bestseller the following year: Michael Lewis’ Moneyball.

 

 

Michael Lewis’ Moneyball.

 

 

Moneyball: Revisited

 

If you’re one of the few who haven’t followed the story, let me give you a recap. For those of you who don’t like baseball? Take a deep breath...

Once upon a time, the Oakland Athletics were a great ballclub: in the 1970s, the 1980s and the early 1990s.


Oakland A’s legends: Jim ‘Catfish’ Hunter, Jose Canseco and Dave Stewart

 

 

Skip forward to 2002, and the team was struggling to compete in an era of big salaries and a ‘juiced’ playing culture (the impact of steroids still having repercussions in the sport today, with legendary contemporary players still excluded from the baseball Hall of Fame in Cooperstown).

 

Key players had left the team, headed for big-market clubs such as the New York Yankees. Billy Beane, the General Manager of the A’s, realised that he couldn’t hope to match the spending power of the Yankees, the Dodgers, the Giants and others. Instead, he turned to shrewd analysts (sabermetricians - from SABR: the Society for American Baseball Research - to give them a more accurate and niche nickname) who were taking their analysis of the sport to a whole new level.

 

While other teams were still content with decades-old performance measurements (such as RBI - Runs Batted In and ERA - Earned Run Average), Beane’s analysts tapped into the latest baseball research to apply newer metrics to better understand nuances in the game.

 

The state of the art had boiled the game down to its statistical fundamentals: what factors lead to a win? Turns out it’s really simple: score more runs than the other team. PhD-level insight, right there.

 

 

Introducing Sabermetrics

 

Sabermetrics guru Bill James (who releases a statistical baseball handbook annually) explored the relationship between a player’s performance and runs to create a statistic called ‘Runs Created’ (RC), which assesses how much each player’s effort at the plate leads to a run being generated.
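For the curious, here’s a minimal sketch of the basic, earliest published form of Runs Created: hits plus walks, multiplied by total bases, divided by at-bats plus walks. James went on to publish more elaborate versions, and nothing here claims to be what any front office actually ran. The function name and the sample stat line are made up purely for illustration.

```python
def runs_created(hits, walks, total_bases, at_bats):
    """Basic Runs Created: (H + BB) * TB / (AB + BB)."""
    opportunities = at_bats + walks
    if opportunities == 0:
        return 0.0  # no counted opportunities, so nothing was created
    return (hits + walks) * total_bases / opportunities


# Hypothetical season line (not a real player).
sample_rc = runs_created(hits=150, walks=80, total_bases=250, at_bats=500)
print(f"Runs created: {sample_rc:.1f}")
```

Notice that walks appear in the numerator: getting on base, by whatever means, is half of the recipe.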

 

After deeper analysis, James (and others) looked into the contributing factors that cause a run to be created - to cut a long story short (well, short-ish), there’s a derived statistic called ‘On-Base Percentage’ that’s hugely influential in generating a run.

 

On-Base Percentage (or OBP) is a measure of a batter’s success in reaching base safely - either by hitting the ball and reaching the base before the ball does, or through a walk (four balls in a plate appearance) from the pitcher; being hit by a pitch counts too. OBP says nothing about a player’s power to hit home runs or other fancy things; it’s simply a measure of getting ‘on-base,’ so by inference it also includes a measure of a player’s patience at the plate (a strikeout or put-out would lower a player’s OBP).



On-base Percentage, with more acronyms than a 1990s BI Vendor
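To unpack those acronyms, here’s a small sketch of the standard OBP calculation: hits (H), walks (BB) and hit-by-pitch (HBP) on top; at-bats (AB), walks, hit-by-pitch and sacrifice flies (SF) underneath. The two stat lines are hypothetical players invented for this example, not anyone on the A’s roster.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    denominator = at_bats + walks + hit_by_pitch + sacrifice_flies
    if denominator == 0:
        return 0.0  # avoid dividing by zero for a player with no appearances
    return (hits + walks + hit_by_pitch) / denominator


# Hypothetical hitters: one patient, one free-swinging.
patient = on_base_percentage(hits=140, walks=100, hit_by_pitch=5,
                             at_bats=480, sacrifice_flies=5)
free_swinger = on_base_percentage(hits=160, walks=25, hit_by_pitch=2,
                                  at_bats=560, sacrifice_flies=3)

print(f"Patient hitter OBP: {patient:.3f}")
print(f"Free swinger OBP:   {free_swinger:.3f}")
```

The free swinger has more hits, but the patient hitter reaches base far more often. That gap is the kind of thing a traditional stat line can hide.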

 

 

 

With this simple insight (that OBP drives RC), Beane’s general manager skills jumped into action. He began scouring the major leagues for players who were traditionally overlooked for lacking power or star quality. Because they were overlooked, Beane found them substantially undervalued in the market, and he gradually built a team that exploited the marginal value in this somewhat esoteric statistic.

 

The performance of this new-look Oakland A’s?

 

Playoff berths in both 2002 and 2003, with a team payroll that was a fraction of their postseason peers’ (Oakland’s $44 million against the Yankees’ $125 million in 2002).



Talk of the baseball world: the 2003 Oakland A’s

 

 

 

And this little story grew into an internationally best-selling book and a well-crafted film starring Brad Pitt and Jonah Hill.


Moneyball – Brad Pitt’s best film (according to Rotten Tomatoes)

 

 

Beyond the Popular Meme: of Pirates and Pitching

 

This is where most analytics memes finish up: the happy ending. But what happened next? Did David continue to slay Goliath in professional baseball?

 

No. Of course not. That’s the snag with marginal value - the market is efficient, and it adjusts. Some markets are more situationally aware than others, but in the small world of baseball, other general managers quickly took notice and adapted their front offices to include a similar arsenal of talent and tools, bringing teams back to ‘analytic parity.’

 

Between 2006 (when the A’s topped the American League West) and 2007 (when they collapsed to 3rd in a four-team division, and remained there for several years), the market had moved and new advantages needed to be found.

 

This is where the story changes focus and we move to Pittsburgh, to the Pirates.

 

The Pirates in 2011 were in more of a mess than Oakland - as a team, they’d not had sustained success since the 1970s.

 

With baseball blazing a trail in data and analytics innovation, both amateur and professional sports analysts had access to more advanced statistics than ever before. Added to that was a near-firehose of data from sources such as PITCHf/x and MLB Advanced Media.

 

It was now possible not only to measure a baseball’s speed with a radar gun, but also to track the ball’s entire journey from leaving the pitcher’s fingers until it reached the plate: in three dimensions and in near real-time. Machine learning classification then labels each pitch (correctly) as anything from a curveball to a cut fastball, and everything in between.


PITCHf/x (and other data vendors) offer an incredible variety of new data for baseball analysts.

 

 

In this era of ‘Big Data Baseball’ (as written by Travis Sawchik in the book of the same name), the Pirates developed their analytics strategy around some new insights: it’s not just about runs created, it’s also about runs prevented. Although a run created is inherently more valuable, there’s still marginal value in mining the depths of preventing a run from being created.



Big Data Baseball – Travis Sawchik

 

 

 

And so the Pittsburgh management went to work, looking explicitly for defensive catchers with a knack for ‘framing’ pitches: that is, receiving the ball around the edges of the strike zone in such a way that the umpire calls marginal balls as strikes more often than not. Every strike gets you closer to an out, and every out reduces the chance of a run being created. It’s not cheating. It definitely IS bending the truth, but it’s all part of the game...



 

 

Other analytic insights meant that the Pirates also targeted pitchers who could induce a batter to hit the ball on the ground - preferably at one of the defensive field players. Some of this requires raw talent, and the Pirates prioritised great defence when signing new players, but marginal impact also comes from how you position players on the field.

 

Again, traditional baseball ‘knowledge’ tells you that players pretty much occupy the same ‘zones’ on the field as they have done for over a century. However, ‘defensive shifting’ moves players around, clustering them to the left or to the right depending on the historical patterns of the batter. Data leading to insight (hitting pattern analysis in the form of ‘spray charts’ and hiring ground-ball pitchers), leading to action in the form of defensive shifts and leading to better outcomes (more outs through ground balls). It’s the analytic journey, folks!
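As a toy illustration of that data-to-insight-to-action chain, here’s a sketch that tallies where a batter’s ground balls have gone and suggests a shift when one side dominates. Everything in it is an assumption made for the example: the three-way left/center/right bucketing, the 60% threshold, the function name and the sample data. Real spray-chart analysis works with far finer detail than this.

```python
from collections import Counter


def recommend_shift(ground_ball_directions, threshold=0.6):
    """Suggest a shift if enough ground balls go to one side.

    ground_ball_directions: iterable of "left", "center" or "right".
    threshold: hypothetical share of balls to one side that triggers a shift.
    """
    counts = Counter(ground_ball_directions)
    total = sum(counts.values())
    if total == 0:
        return "standard"  # no history, so no reason to move anyone
    for side in ("left", "right"):
        if counts[side] / total >= threshold:
            return f"shift {side}"
    return "standard"


# Made-up batted-ball histories for two imaginary hitters.
pull_hitter = ["right"] * 14 + ["center"] * 4 + ["left"] * 2
spray_hitter = ["left"] * 7 + ["center"] * 6 + ["right"] * 7

print(recommend_shift(pull_hitter))
print(recommend_shift(spray_hitter))
```

The threshold is the interesting design choice: set it too low and you leave holes for balanced hitters to exploit, set it too high and you never shift at all.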


Spray Charts used in MLB Scouting

 




A fielding shift in action during an MLB game

 

 

 

Again, Pittsburgh made a real impact with this change in strategy: from a winning percentage of 0.488 in 2012 (i.e. they won just under 49% of their games) to several seasons with a winning percentage of 0.540 or much higher. A real movement of the performance needle.
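Percentages can hide how big that jump is, so here’s a quick back-of-the-envelope sketch that turns a winning percentage into wins over a 162-game regular season. The function name is mine, and the result is an approximation from the rounded percentages quoted above rather than the team’s official record.

```python
SEASON_GAMES = 162  # length of a standard MLB regular season


def expected_wins(winning_pct, games=SEASON_GAMES):
    """Convert a winning percentage into an approximate number of wins."""
    return winning_pct * games


before = expected_wins(0.488)
after = expected_wins(0.540)
print(f"{before:.0f} wins -> {after:.0f} wins (+{after - before:.1f})")
```

That works out to roughly eight extra wins a season, which is often the difference between watching the postseason and playing in it.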

 

Exploiting Marginal Value through Analytics: 2019 Edition

 

Fast forward a few more years, and how are baseball analysts exploiting marginal value to earn their team the win? There are two major trends in the game that are genuinely reshaping baseball: strikeouts and pitching staff.

 

One trend that’s been running for well over a hundred years now is the endless rise in strikeouts. According to James’ 2019 handbook, there were 2.31 strikeouts per game in the majors in 1898; in 1918 it was 2.89.

 

By 1938, it’s 3.41.

 

1958?  It’s 4.95.

 

1978 shows a small downturn at 4.77, but by 1998 we’re at 6.56 and in 2017 we’re at 8.48.
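If you’d rather see the trend than read it, here’s a short sketch that walks through those figures, prints the change between each pair of data points and flags any period where the rate actually fell. The numbers are the ones quoted above; the variable names are just mine.

```python
# Strikeouts per game in the majors, as quoted above.
strikeouts_per_game = {
    1898: 2.31, 1918: 2.89, 1938: 3.41, 1958: 4.95,
    1978: 4.77, 1998: 6.56, 2017: 8.48,
}

years = sorted(strikeouts_per_game)
changes = []
for start, end in zip(years, years[1:]):
    delta = strikeouts_per_game[end] - strikeouts_per_game[start]
    changes.append((start, end, delta))
    print(f"{start}-{end}: {delta:+.2f} strikeouts per game")

# Periods where the rate went down rather than up.
declines = [(start, end) for start, end, delta in changes if delta < 0]
print("Declines:", declines)

# How many times higher the latest figure is than the earliest.
overall_ratio = strikeouts_per_game[years[-1]] / strikeouts_per_game[years[0]]
print(f"Overall: {overall_ratio:.2f}x the 1898 rate")
```

Only one of the six intervals shows a decline, and the latest rate is more than three and a half times the earliest.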



Bill James – the original sabermetrician with his latest annual review.

 

 

The reason behind this is long-term in nature: high-strikeout pitchers have always been highly prized, but high-strikeout batters don’t necessarily suffer from a matching stigma. As James says, home runs are so precious that they must always be accompanied by a ‘bodyguard of strikeouts.’ Teams, generally, are not looking to avoid hiring the next Mickey Mantle; they’re looking to find him.



 

 

At some point though, strikeout rates will get so high that batter effectiveness and strikeouts will have to diverge, and we might be reaching that point soon. In 2018, the ten teams that struck out the most had a collective record of 92 games BELOW 0.500 (i.e. bad). The ten teams that struck out the least were 103 games OVER 0.500 (i.e. good).
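‘Games below 0.500’ is simply wins minus losses, so a combined figure for ten teams can be turned back into an average record. Here’s a rough sketch, assuming every team played exactly 162 games (a simplification); the function name is hypothetical.

```python
def average_record(net_games, teams, season_games=162):
    """Average (wins, losses) per team from a group's combined wins-minus-losses.

    Assumes every team played exactly `season_games` games.
    """
    net_per_team = net_games / teams
    wins = (season_games + net_per_team) / 2
    return wins, season_games - wins


high_strikeout = average_record(net_games=-92, teams=10)
low_strikeout = average_record(net_games=103, teams=10)

print(f"Most strikeouts:  about {high_strikeout[0]:.1f} wins, {high_strikeout[1]:.1f} losses")
print(f"Fewest strikeouts: about {low_strikeout[0]:.1f} wins, {low_strikeout[1]:.1f} losses")
```

Put that way, the gap between the two groups is close to ten wins per team per season.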

 

This trend could mean that analysts start to prioritise hitters who don’t strike out as a new source of marginal value.

 

In terms of pitching staff, managers (driven by data) are now turning the ball over to their relief staff far earlier than in previous years - either to exploit pitcher/batter matchups, to offer ‘predictive maintenance’ (i.e. to reduce ‘parts failure’ in their star fireballer) or to keep the opponents guessing. Again, the long-term trend is clear: in 1918, 63% of games were complete games for a single pitcher. In 2018? Only 1%...the complete game pitching matchup is now something to be truly savoured.


June 26th 2019 – Mike Minor throws a complete game for the Rangers over the Tigers: a rarity in the modern game.

 

 

Competitive Advantage in the Margins: An Analytics Tale

 

So, I’ve told my tale of Moneyball (and more). It’s an analytic story of finding marginal value and competitive advantage, made possible through the volume and variety of data; through exploration, and through allowing amateur baseball sleuths to help the professionals by asking disruptive and innovative questions; and through creating new models to explain behaviours and deliver a literal game-changing performance.

 

And it’s the cautionary tale of the market shifting to adjust - a continual battlefield, where a competitive advantage can be fleeting. However, with data literacy, endless curiosity and an off-the-field strategy that pushes its outcomes through analytics, the next great discovery will turn one spring-training dream into a post-season reality!