12 Jul 2020

Chasing .400

The odds of someone hitting .400 this season may be better than you think...

With baseball starting back up in a couple of weeks, the shortened 60-game season brings up an interesting question: what are the odds that someone hits for a .400 batting average, a feat last accomplished in 1941 by Ted Williams? Pitching has greatly improved since then, so the probability in a regular 162-game season is astronomically low, but you would have to think that the shortened season improves those odds significantly. It’s much easier to believe someone could have a flukishly hot 60-game stretch compared to an entire 162-game season. I am obviously not the first person to think about this question, but I thought it would be fun/helpful to break things down in a game theory mindset. To simplify things a bit, let’s look at this in three steps:

Given a player’s career batting average and assuming four at bats per game, what are the odds that player ends up with a season batting average of .400?
Given the distribution of career batting averages among current MLB players, what are the odds of at least one player hitting .400?
Does accounting for the variance in the number of at bats per game change these probabilities at all?

Starting with point #1, let’s assume that a player gets \(N_{ab}\) at bats over the course of the season and that their career batting average, \(p\), is representative of the underlying probability of them getting a hit in each at bat. With these assumptions, we can approximate the distribution of the total number of hits over these at bats, which we’ll call \(N\), using a binomial distribution:

\[P_h(N,N_{ab},p) = \frac{1}{Q} \frac{N_{ab}!}{N!(N_{ab} - N)!} p^{N} (1 - p)^{(N_{ab} - N)}\] \[Q = \sum_{n=0}^{N_{ab}} \frac{N_{ab}!}{n!(N_{ab} - n)!} p^{n} (1 - p)^{(N_{ab} - n)}\]

This is easily converted to a distribution of season batting averages by dividing the input number of hits by the number of total at bats (240 in the case of a 60-game season). The graph below illustrates the probability of a player getting a certain season batting average given their career batting average (different color circles), and as you can see, these distributions can get pretty wide.

It’s entirely realistic for a player to hit 50 points above their career batting average this year, but this means that even with the shortened season, somebody has to be in at least the low .300’s to even have an iota of a chance of hitting .400 for the season. In fact, if we sum up the probability of season averages above .400, a player with a career average of .305 (that of Mike Trout) only has a 0.11% chance of accomplishing the feat. Keep in mind that this is a significant improvement over the 0.000013% chance he would have in a full 162-game season… In case you have a particular player in mind, the interactive graph below plots the probability of hitting .400 as a function of a player’s career batting average (blue dots) and even allows you to change the number of games in the season via the slidebar at the bottom.

These odds may seem negligible, but when you think about it, with ~270 players in the entire league, it seems reasonable to think that at least one of them might get hot and come close, right? This brings us to point #2: let’s try to calculate the probability of at least one MLB player hitting .400. To start, we can formalize the probability of a player hitting .400 using the same definitions laid out above:

\[P_{.400}(N_{ab},p) = \sum_{n=\lceil 0.4 N_{ab} \rceil}^{N_{ab}} P_{h}(n,N_{ab},p)\]

From there, the probability of an individual player not hitting .400 would be \(1 - P_{.400}\), the probability of everyone not hitting .400 would be the product of those values for every single player, and finally, the probability of at least one player hitting .400 would be the complement of that product:

\[P_{tot}(N_{ab}) = 1 - \prod_{p} [1 - P_{.400}(N_{ab},p)]^{N_p}\]

where \(N_p\) is the number of players with a career batting average of \(p\). The last piece of the puzzle is deciding what subset of players we should be looking at to produce a realistic view of this season’s collection of MLB talent. We definitely don’t want to include any young players that had a fluky cup of coffee in the majors to cover for an injured teammate. To account for this, we’re only going to consider players that started at least half of the 2019 season (according to stats pulled from Baseball Reference). For the 195 players who meet this criteria, after all of this complicated math, the probability that at least one of them hits for .400 is a whopping… 3.4%. Probably not going to happen, but again, an astronomical improvement over the 0.002% chance in a regular 162-game season (second column of the first table below). These numbers are pretty close to those of Tom Tango, so it sounds like we’re doing something right!

The one final wrinkle that might affect these odds is removing the assumption of exactly four at bats per game, as brought up in point #3. Someone very well could get five at bats one day and three at bats the next, so how do we account for this variation accumulating over the course of the entire season? We can start by creating an empirical distribution of the number of at bats per game from the 195 “typical” players mentioned above. Since the distribution of the sum of two random variables is the convolution of their respective distributions, a season-long distribution would simply be the single-game distribution convolved with itself 59 times. If we call this season-long distribution \(S(n)\), the probability of at least one player hitting .400 is just a weighted sum over all possible values for the number of at bats:

\[P_{rand} = \sum_{n=0}^{\inf} S(n) P_{tot}(n)\]

Ultimately, accounting for this only increases the probability to 4.4% when applied to the same 195 players as above (third column of the first table above), but it does affect things! I think the logic is that some players will wind up with fewer than 240 at bats over the course of the season, requiring fewer hits to reach .400 and making it easier to join the ranks of Rogers Hornsby, Shoeless Joe Jackson, and Ty Cobb. The red circles in the second figure illustrate a generic player’s probability of hitting .400 when accounting for the variance in at bats over a season, and as you can see, the difference is noticeable but miniscule. Even if we look up each player’s respective at bats per game distribution and use that to produce individualized \(S(n)\) functions, the probability of at least one player hitting .400 still sticks around 3.8%. The second table above lists the individual probabilities for every player in the league when we account for their respective at bats per game distribution.

Honestly, I hope it happens! I know some people are going to roll their eyes at it and say it doesn’t count, but look at the odds we just calculated. Even with the shortened season, even if someone barely misses the mark by a few points, it’s still a pretty tremendous feat. Maybe it will be forgotten sooner than most records, but it’ll be a fun piece of baseball ephemera, like Charles Radbourn’s 59 wins in a season, Red Barrett’s 58-pitch complete game, and Fernando Tatis’ two grand slams in the same inning. Baseball is full of weird anomalies, so I hope someone meets the mark, even if it does have an asterisk next to it.

The python code and data used to generate the values and figures above are available on my GitHub page.