Inspired by our podcast with @ouij, and yesterday’s follow-up regarding the Pythagorean Win Expectancy, math-lover Jared Kobe @SCviaDC offers the following deeper look at the math of Pythag. I’ve interjected my notes in red just to make sure we are all following along.
The basic question Jared asks is: how good has the Pythagorean Win Expectancy been over the course of Baseball History? My look at just 2012 leaves plenty of room for error. By looking at more seasons (all of them) and at some better statistical indicators, we can get a better picture of just how good Pythag is/isn’t.
The Pythagorean Win Expectancy formula produces the probability that a team will win a baseball game. It does so by essentially dividing 1 by 1 plus the square of the rate at which you give up runs for each run you score: 1 / (1 + (Runs Allowed / Runs Scored)²). To illustrate what this rate means, take the 2012 Nationals, who gave up 594 runs and scored 731. This gives us a rate of .813 runs allowed per run scored, meaning the Nats allowed about 13 runs for every 16 they scored. That’s the number/fraction that gets squared, added to 1, and finally divided into 1 to give you the probability of winning.
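As a quick sketch of the arithmetic above (the helper name is mine, and the numbers are the 2012 Nationals’ from the example):

```python
def pythag_classic(runs_scored, runs_allowed):
    """Classic Pythagorean win probability: 1 / (1 + (RA/RS)^2)."""
    rate = runs_allowed / runs_scored  # runs allowed per run scored
    return 1 / (1 + rate ** 2)

# 2012 Nationals: 731 runs scored, 594 runs allowed
rate = 594 / 731
p = pythag_classic(731, 594)
print(round(rate, 3))  # 0.813 -- roughly 13 runs allowed per 16 scored
print(round(p, 3))     # 0.602 -- about a 60% chance of winning any given game
```

A team that scores exactly as many runs as it allows gets a rate of 1 and therefore a probability of exactly .500, which is a nice sanity check on the formula.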
Baseball Reference has since revised the original formula to better fit the historical data. Instead of squaring the runs-allowed-per-runs-scored rate, they raise it to the power of 1.83 (approximately the square root of 3.35). I included the revised formula in my calculations for comparison’s sake.
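Making the exponent a parameter lets us compare the two versions side by side (again a sketch of my own; the function name is made up). For a winning team like the 2012 Nats, the smaller exponent pulls the estimate slightly back toward .500:

```python
def pythag(runs_scored, runs_allowed, exponent=1.83):
    """Pythagorean win probability with a configurable exponent."""
    rate = runs_allowed / runs_scored
    return 1 / (1 + rate ** exponent)

# 2012 Nationals: 731 runs scored, 594 allowed
print(round(pythag(731, 594, exponent=2), 3))  # 0.602 -- classic squared version
print(round(pythag(731, 594), 3))              # 0.594 -- Baseball Reference's 1.83
```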
I looked into how Bill James came to derive the formula from Frank’s link, and he based this win probability on modeling runs scored with a Weibull distribution. From there, I tried to understand how the rest of the actual formula was derived… but let’s just say that Weibull distribution is much more fun to say than it is to try to calculate or understand. Note: So we won’t be trying that.
So, without a function that calculates the probability to compare the data points against, we are stuck with looking at how far off the calculated expected wins were from the actual wins to judge how well the Pythagorean Win formula works. I used Baseball Reference to get the Wins, Losses, Runs Scored, and Runs Allowed for each team, ever. I ignored games listed as “Ties” (mainly because it seems those games were essentially rain-outs made up later in the season). I then calculated the Pythagorean Wins by applying the formula and multiplying the resulting probability by the total of wins and losses (games played). So just like yesterday’s post, but you know…with every team in every season ever.
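The whole per-team calculation boils down to a few lines (a sketch under the same setup; the helper name is mine, and the sign convention for the difference matches the one described below):

```python
def expected_wins(wins, losses, runs_scored, runs_allowed, exponent=2):
    """Pythagorean expected wins: win probability times games played (ties ignored)."""
    games = wins + losses
    p = 1 / (1 + (runs_allowed / runs_scored) ** exponent)
    return p * games

# 2012 Nationals: actual record 98-64, with 731 RS and 594 RA
exp_w = expected_wins(98, 64, 731, 594)
diff = 98 - exp_w  # positive = overperforming the formula
print(round(exp_w, 1))  # 97.6
print(round(diff, 1))   # 0.4 -- the Nats barely outplayed their Pythag
```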
Now, determining the number of expected wins can be a little tricky from a comparison standpoint. Baseball teams win games in discrete numbers; multiplying the number of games in a season by a probability will rarely, if ever, give you a whole number. Therefore, there is a problem with rounding the expected win total, because having .2 of a win is a little weird. Frank ignored this issue in his post, which is fine because simple is usually best. But since I didn’t want to make the wrong choice, I tried rounding up to the next integer, rounding down to the previous integer, and simply rounding to the nearest integer. From this, I would guess that when James (and later Baseball Reference) modeled the formula, they rounded to the nearest integer.
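For the curious, the three rounding choices above are just (using the Nats’ expected-win total from earlier as the example value):

```python
import math

exp_wins = 97.57  # e.g., the 2012 Nationals' Pythagorean expected wins

print(math.ceil(exp_wins))   # round up:         98
print(math.floor(exp_wins))  # round down:       97
print(round(exp_wins))       # nearest integer:  98
```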
The difference between the Actual Wins and the Expected Wins was calculated by subtracting the Expected Wins from the Actual Wins, so a negative number means a team “underperformed” and a positive number means it “overperformed” relative to its Expected Wins. Note: Again, same as I did yesterday.
Here are the stats for all the seasons: