by: Jeff Adams – November 12th, 2018

From the cramped confines of Fenway Park to the spacious interior of Kauffman Stadium, MLB ballparks vary greatly. With no official limitation on outfield dimensions, fence depth is open to determination by the teams who build and inhabit the stadiums. In a sport so wrapped up in rules, both written and unwritten, this is a novelty. Each park is the product of a couple of different factors: teams have to negotiate between available space and specific offensive and defensive goals. It is common knowledge that the size of fair territory has a great impact on the game. Terms such as “short porch” label certain areas of a given field as increasing the probability of certain offensive outcomes based on the park dimensions. For example, right-center field at AT&T Park in San Francisco is 421 feet away from home plate and carries the title of “triples alley.” The amount of open ground out there gives the batter-runner an extra second or two to advance to the next base on a ball that plugs the gap. Any fan or player is keenly aware of the effect that dimensions can have on a ball in play.

Major League Baseball, at its core, exists for two reasons: entertainment and profit. Each team and ballpark carries a duty to both fans and executives to provide exciting games, while winning enough of them to keep fans buying tickets. Over the entire course of baseball’s history, the home run has been touted as the single part of the game that satisfies both duties. Without a doubt, the home run is an integral part of the game and may have even saved the game from falling into obscurity with the rise of sluggers like Sosa, McGwire, and Bonds. Today, we are seeing more home runs (and strikeouts) than ever before. The conventional wisdom is that smaller parks amplify and larger parks weaken a team’s offensive production. The driving factor behind this thought is that more home runs occur in smaller parks, and analyses of park effects have consistently borne this out.

Since the home run is usually the focal point for research into ballpark dimensions, fence depth and fence height are the only factors that have been extensively analyzed. There has been little research conducted into the tangible effect that the outfield itself has on offensive production. Today, I want to look into the effect of outfield square footage on offensive production. To quantify this effect, I will look at game-level data to find the total bases achieved by singles, doubles, and triples. By excluding home runs, I can derive the effect of outfield square footage on the batted balls that fall in the outfield, rather than beyond it.
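As a minimal sketch of the outcome measure described above, the function below computes total bases from singles, doubles, and triples for one team's line in a game log. It assumes the log reports hits, doubles, triples, and home runs as separate counts, so singles have to be backed out; the function name and the sample line are hypothetical, not taken from the data set.

```python
def non_hr_total_bases(hits, doubles, triples, home_runs):
    """Total bases from singles, doubles, and triples only (home runs excluded)."""
    # Singles are not listed separately in this sketch, so derive them.
    singles = hits - doubles - triples - home_runs
    return singles + 2 * doubles + 3 * triples


# Hypothetical line: 9 hits, including 2 doubles, 1 triple, and 1 home run.
example = non_hr_total_bases(hits=9, doubles=2, triples=1, home_runs=1)
print(example)  # 5 singles + 4 + 3 = 12
```

Whether the per-game figure sums both teams or treats each team separately is not stated here, so the sketch stops at a single batting line.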

## Assumptions

The primary assumption of this paper is that there is a theoretically ideal size for an MLB outfield that does not assist either the pitcher or the hitter. In the interest of maintaining a balanced game, a mean park size must have evolved to accommodate the talent in the Major Leagues. Therefore, based on this assumption, any ballpark that dramatically varies from this ideal size should have an effect on the games played in that ballpark. Thus, I expect to find a greater magnitude effect of outfield size on total bases when the given ballpark skews from this theoretically ideal park size. Additionally, the smallest parks will have a negative effect on the number of singles, doubles, and triples hit, while the largest parks will have a positive effect. This is because there is less (more) room in the smaller (larger) park for any batted ball to fall in for a hit. However, for the parks that sit near the mean, I expect the effect to be either negligible or statistically indistinguishable from zero.

Game data was obtained from Retrosheet^{1} and contained game logs from all regular season games played from 2010 to 2017. The dimension data was derived from Clem’s Baseball, a phenomenal resource that has diagrams of current and historical MLB parks. I assumed a common infield depth (from home plate to the edge of the infield dirt) of 150 feet in each park. Games played in Citi Field prior to 2015 were not included in this sample, as the dimension data I gathered was calculated after the fences were moved. The final data set contained 19,023 games.
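The text does not spell out how the infield was separated from the outfield, so the following is only one plausible reading, sketched under stated assumptions: fair territory is treated as a 90-degree wedge from home plate, and the infield as a quarter circle of radius 150 feet centered on home plate. The function names and the fair-territory figure in the example are hypothetical.

```python
import math

INFIELD_DEPTH_FT = 150.0  # common infield depth assumed in the article


def infield_area_sqft(depth_ft=INFIELD_DEPTH_FT):
    """Area of a quarter circle centered on home plate (assumption of this sketch)."""
    return math.pi * depth_ft ** 2 / 4


def outfield_area_sqft(fair_territory_sqft, depth_ft=INFIELD_DEPTH_FT):
    """Fair territory minus the assumed quarter-circle infield."""
    return fair_territory_sqft - infield_area_sqft(depth_ft)


# Hypothetical park with 110,000 square feet of fair territory.
print(round(infield_area_sqft()))          # about 17,671 square feet
print(round(outfield_area_sqft(110_000)))  # about 92,329 square feet
```

Under this reading, every park loses the same fixed infield area, so differences in outfield square footage come entirely from differences in fair territory.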

## Visualizing the data

Based on outfield size, each game was placed into one of three categories: small, average, or large. These categories were derived from the distribution of outfield square footage. The population mean was 92,179 square feet, with a standard deviation of 3,582 square feet. The large and small categories are the outfields that lie more than one standard deviation above and below the population mean, respectively. The average category captures the outfields that lie within one standard deviation of the population mean. The plot below sets total bases from non-home run offensive production against outfield square footage:
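The grouping rule can be sketched in a few lines using the mean and standard deviation quoted above. The function name is hypothetical, and the handling of an outfield that falls exactly on a cutoff (placed in the average group here) is an assumption, since the text does not say.

```python
MEAN_SQFT = 92_179  # population mean from the article
SD_SQFT = 3_582     # population standard deviation from the article


def size_group(outfield_sqft, mean=MEAN_SQFT, sd=SD_SQFT):
    """Classify an outfield as small, average, or large by distance from the mean."""
    if outfield_sqft < mean - sd:
        return "small"
    if outfield_sqft > mean + sd:
        return "large"
    return "average"  # within one standard deviation, cutoffs included


# Hypothetical outfield sizes, not taken from the dimension data.
for sqft in (88_000, 92_000, 96_000):
    print(sqft, size_group(sqft))
```

With these figures the cutoffs land at 88,597 and 95,761 square feet.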

The data appears, at first glance, to back up my assumptions. There is a slight downward trend in the small group and a noticeable uptick in offense in the large group. The average group is relatively flat, but appears to actually drag the overall average total bases down. Additionally, it appears to support my idea of an ideal park that has no effect on the game, with the smoothing curve intersecting the average line twice and coming close a third time. This ideal outfield might be able to exist at a couple of different sizes.

## Digging deeper

Now that we’ve seen some indication that my assumptions will hold, we need to go a little deeper on the causal effect of outfield size. To do this, a simple Ordinary Least Squares regression was run using the observations from the aforementioned groups. The small and average groups showed a negative correlation, while the large group correlated positively. In order to improve the fit and validity of the model, I sought to root out possible Omitted Variable Bias by bringing in game-level data that might be correlated with both total bases and outfield area. Omitted Variable Bias occurs when factors that correlate with both the dependent variable and at least one independent variable are left out of the model, so that their influence is wrongly attributed to the variables that remain. I ran tests to determine which of the game-level variables from the Retrosheet data were significantly correlated with both total bases from non-home runs and outfield size. This resulted in adding 15 independent variables to my models for each group.
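To illustrate why the controls matter, here is a small self-contained sketch of Omitted Variable Bias on toy numbers. It is not the model used in this analysis: the data are made up so that `y` depends on both `x` (standing in for outfield size) and `z` (standing in for an omitted game-level factor), and the `ols` helper is a hypothetical hand-rolled solver that uses only the standard library.

```python
def ols(rows, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y.

    Each row of `rows` must start with 1.0 so the first coefficient is the intercept.
    """
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Toy data: z is correlated with x, and y = 1 + 2x + 3z exactly.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
z = [1.0, 0.0, 2.0, 1.0, 3.0, 2.0]
y = [1.0 + 2.0 * xi + 3.0 * zi for xi, zi in zip(x, z)]

short_fit = ols([[1.0, xi] for xi in x], y)                # z omitted
long_fit = ols([[1.0, xi, zi] for xi, zi in zip(x, z)], y)  # z included

print("slope on x without z:", round(short_fit[1], 2))
print("slope on x with z:   ", round(long_fit[1], 2))
```

Leaving `z` out inflates the slope on `x` to roughly 3.11, while including it recovers the true value of 2. The same logic motivates adding the game-level controls: a variable tied to both park size and total bases would otherwise distort the coefficient on outfield size.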

### The Cramped Confines – Small Group

The small group of fields offers the expected result. Outfield size survives the regression and remains significant. In absolute magnitude, this group shows the greatest association between total bases and outfield size. With 3,892 observations and an adjusted R-squared value of 0.93, the model fits the data closely. Since the coefficient of outfield size (ofsqt) is negative, for every thousand-square-foot increase in outfield size we can expect total bases from non-home runs to decrease by 0.15 bases per game.
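For a sense of scale, the short sketch below applies the small-group coefficient to a hypothetical change in outfield size. It assumes the relationship is linear and holds only within the small group's range; the 2,000-square-foot change, the 81-game home schedule used for scaling, and the function name are illustrative choices rather than results from the model.

```python
SMALL_GROUP_COEF = -0.15  # change in non-HR total bases per 1,000 sq ft, per game


def expected_change(delta_sqft, coef_per_1000=SMALL_GROUP_COEF):
    """Expected per-game change in non-HR total bases for a change in outfield area."""
    return coef_per_1000 * delta_sqft / 1000


per_game = expected_change(2_000)  # hypothetical 2,000 sq ft larger outfield
per_home_season = per_game * 81    # assumes an 81-game home schedule

print(round(per_game, 2))          # about -0.3 bases per game
print(round(per_home_season, 1))   # about -24.3 bases over the home schedule
```

A per-game effect that looks tiny adds up over a full home schedule, which is why even a small coefficient can matter to a team.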

### The Typical Fields – Average Group

This group has by far the most parks and therefore the most games observed, at 12,060. Outfield square footage is significant in this group at the 90% confidence level. The coefficient is negative, but nearly zero. For every thousand-square-foot increase in size, we should expect a drop in total bases of 0.02. With an adjusted R-squared of 0.94, we can be pretty confident in saying that parks near the mean have little meaningful effect on a given game.

### The Spacious Outfields – Large Group

This group offers the first real challenge to my hypothesis. Over the 3,071 games measured, the coefficient on outfield square footage was positive, but insignificant. This goes against my original hypothesis and says that changes in size within this group lead to no detectable effect on game play. This is actually pretty enlightening, as it raises the question of whether there are important factors that exist only in large parks that counteract the effect of increasing the outfield size.

## So what does it all mean?

Based on the data from the 2010-2017 MLB regular seasons, total bases tend to show a positive relationship with outfield size. However, the regressions found a significant relationship only in the small and average groups. The large group ran counter to my original hypothesis and did not show any clear indication of a causal relationship.

Overall, it is clear that outfield size has an effect on the game, but it may be different and a little harder to tease out than I originally anticipated. It’s possible that my groupings were a little off, due to some abnormalities in the data. Despite challenging some of my group-specific hypotheses, this analysis has provided heavy support for the existence of a theoretically ideal size for a Major League ballpark, where the field has no impact on a given game.

As mentioned before, the large group’s failure to demonstrate a relationship might be indicative of necessary factors that I did not include in this model. I included no player data, as this was meant to be a simple exploration into game-level effects, but outfielder talent would definitely affect the number of hits that fall in the outfield. For example, the 2015 Royals outfield put up a cumulative UZR of 36.8, which was dramatically better than the 2015 Red Sox outfield, which put up a UZR of -7.9. Obviously, the Red Sox outfield has become the best in the league in recent years, but 2015 is a good example of a team in a large outfield employing more defensively talented players.

This model has a few limitations. The first is its lack of control for weather effects that could impact player performance. For example, adding temperature, cloud cover, and wind data to the study may demonstrate the effect of weather on the game of baseball. Second, altitude has been shown to have a dramatic effect on the game, but primarily in one park. Coors Field was in the large group and may have skewed some of the results. Third, the dimension data calculates the outfield area in a cumulative manner, so the effect of the three fields (center, left, and right) individually is lost.

In the end, this was a relatively simple approach to a complex relationship between batted balls in the outfield and the size of the outfield itself. By incorporating more granular data from Statcast and controlling for player-specific tendencies, we might be able to get a better understanding of how the field impacts the game.

- The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at “www.retrosheet.org”.
- Photo via Vividseats.com