Statcast Skills to Team Wins

by Matthew de Marte – February 2nd, 2018

Since its introduction in 2015, Statcast has changed the way we consume Major League Baseball. Every home run now comes with an exit velocity, launch angle, and projected distance. For every pitch thrown, you can look up the velocity and spin rate. Many teams have used Statcast to identify players with underrated skill sets, and some players have used these numbers to change their careers. While Statcast metrics are thrown around all the time, one thing I have wondered is which skill correlates best with winning games. The best teams may not necessarily be the best in terms of Statcast data, so I was curious to see which skills at the team level correlate best with team wins.

To answer my question, the first thing I had to do was decide which Statcast metrics to test for their correlation with team wins. Unfortunately, there are no metrics that pertain to defense at the team level; there are Outs Above Average and catch probability, but those only apply to outfielders. I decided on the following metrics:

  • Average exit velocity
  • Average launch angle
  • Barrel % of total PA
  • Average fastball velocity
  • Average velocity of all pitches
  • Average velocity of just 2 and 4 seam fastballs
  • Average fastball spin-rate
  • Average spin-rate of all pitches
  • Average spin-rate of off-speed pitches

Note that I chose both average fastball velocity and the average of only 2- and 4-seamers. Statcast considers 4-seam fastballs, 2-seam fastballs, cutters, sinkers, and splitters all to be fastballs. I wanted to isolate the 2- and 4-seam fastball velocities because, in my opinion, those are the truer fastballs. One last metric I wanted to use in this study was BB%. While there is no Statcast metric for plate discipline, I think BB% represents a real skill, so I included it as well.

The goal of this model is to see which of the metrics I chose correlate best with a team's win total in a season. My sample was the 2017 Major League Baseball season. I decided the best way to answer this question was to run a linear regression model; all of the models were built in RStudio. For readers who may not know what a linear regression model is, the following paragraph outlines it and some of the key terms, so you can better understand this analysis.

This paragraph is an intro to linear regression, so if you already understand regression modeling, feel free to skip it. In statistics, linear regression is a linear approach to modeling the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. The next term you must understand is R-squared, a statistical measure of how close the data are to the fitted regression: 0% indicates that the model explains none of the variability in the response data, while 100% indicates that it explains all of it. The last term I will introduce is the p-value, which helps you determine the significance of the results. Anything at or below .05 is conventionally deemed statistically significant; anything above that means the variable (or model) is treated as insignificant. These definitions should help you grasp the meaning of the model I produced.
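To make these ideas concrete, here is a minimal sketch of fitting a one-variable regression and computing its R-squared. The data are made up for illustration and are not real team numbers; the R code at the bottom of this piece is what actually produced the results.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = my - slope * mx
    return intercept, slope

def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    my = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical team BB% (x) versus wins (y):
x = [7.0, 7.5, 8.0, 8.5, 9.0, 9.5]
y = [70, 74, 80, 83, 88, 95]
a, b = fit_line(x, y)
print(round(r_squared(x, y, a, b), 3))  # → 0.989
```

A p-value would additionally test whether the slope could plausibly be zero; statistics packages like RStudio report it alongside the fit.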

To begin, I ran a simple regression for each skill chosen above and recorded the R-squared and p-value. With only one variable in each model, R-squared values will inevitably be low, but I wanted to see which skills on their own correlate best with winning.

Metric                        R^2 (%)   P-value
Exit velocity                   14.6      .065
Launch angle                     0.38     .775
Barrel % of PA                   7.7      .191
BB%                             33.4      .003
Avg fastball velocity            2.2      .494
Avg velocity (all pitches)       1.8      .532
Avg 2/4-seam velocity            1.7      .541
Avg fastball spin               10.2      .129
Avg spin (all pitches)          13.4      .078
Avg off-speed spin              24.31     .054

Looking at the p-values, anything above .05 is deemed insignificant, and BB% was the only metric significantly correlated with wins on its own. Of course, these single-variable models do not tell us much, because no team is built on only one skill: every team has hitters, and every team has pitchers. Still, it is fun to see which metrics correlate best on their own. Judging by this table, if you could pick one skill for your team in 2017, it would be walking in the highest percentage of plate appearances.

Now that we understand linear regression a little, it is time to put all of the variables together into a model including every skill and see how they correlate with wins. If anyone is curious to see what that spreadsheet looks like, or would like it for their own work, you can access it HERE.

To begin, I had to decide which variables to use. While it would be great to use every variable, not all of them help build a better model; some, if left in, actually weaken it. To figure out which variables to keep, I used backward selection in RStudio. Backward selection starts from the full model and repeatedly removes the variable that contributes least, until only the variables that build the strongest model remain. The following metrics were deemed insignificant to the model: Barrels, vFB, vAP, vFB(2&4), ASFB, and ASAP. The final model therefore uses only exit velocity, launch angle, BB%, and average spin on off-speed pitches. The following table shows the coefficient and p-value for each metric; the coefficient is the constant each variable is multiplied by in the final equation.
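To illustrate the idea (not the exact step() procedure RStudio runs, which uses AIC), here is a toy sketch of backward selection: start from the full model and repeatedly drop the predictor whose removal barely hurts the fit. The data, column indices, and tolerance below are all made up.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_rss(rows, y, cols):
    """OLS with intercept on the chosen columns; return residual sum of squares."""
    X = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * xi for b, xi in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

def backward_select(rows, y, cols, tol=0.5):
    """Drop predictors one at a time while the fit barely worsens."""
    cols = list(cols)
    while len(cols) > 1:
        base = fit_rss(rows, y, cols)
        best_rss, victim = min(
            (fit_rss(rows, y, [c for c in cols if c != d]), d) for d in cols)
        if best_rss - base <= tol:   # removing `victim` barely hurts the fit
            cols.remove(victim)
        else:
            break
    return cols

# Column 0 drives the response; column 1 is noise and should be eliminated.
rows = [[1, 5.0], [2, 3.0], [3, 8.0], [4, 2.0], [5, 7.0], [6, 4.0]]
y = [2.1, 3.9, 5.9, 8.1, 10.1, 11.9]
selected = backward_select(rows, y, [0, 1])
print(selected)  # → [0]
```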

Metric                       Coefficient   P-value
Exit Velocity                    6.32938     .0154
BB%                            728.57346     .0004
LA                              -2.46765     .0414
Average Spin on Off-Speed        0.04386     .1147

The next table displays the final equation and the R-squared value.

R-squared: 62.3%
Linear Regression Equation: Wins = -600.67824 + 6.32938*EV + 728.57346*BB% + 0.04386*ASOS - 2.46765*LA
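To show the units each term expects, here is the equation as a small function evaluated on hypothetical inputs (EV in mph, BB% as a fraction, off-speed spin in rpm, LA in degrees; the numbers are illustrative, not a real team).

```python
def predicted_wins(ev, bb_pct, asos, la):
    """Final fitted model; note BB% enters as a fraction (0.09, not 9)."""
    return (-600.67824
            + 6.32938 * ev       # average exit velocity (mph)
            + 728.57346 * bb_pct # walk rate as a fraction of PA
            + 0.04386 * asos     # average spin on off-speed pitches (rpm)
            - 2.46765 * la)      # average launch angle (degrees)

print(round(predicted_wins(87.5, 0.09, 2400, 12), 1))  # → 94.4
```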

The R-squared value our model produced was 62.3%! This is much better than I anticipated. It means that how a team performs in exit velocity, launch angle, BB%, and average spin on off-speed pitches explains 62.3% of the variability in team wins during the 2017 season. Note that average spin on off-speed pitches had a p-value above the .05 threshold; while RStudio's backward selection kept this variable in the model, on its own it is not statistically significant. There are a few things to remember here. While this model deemed these metrics more important than others for team success, that does not mean the metrics deemed insignificant are useless for evaluating players and teams. In addition, there are many other numbers, skills, and advanced metrics not used in this model at all that are still valuable. What this model says is that if a team hits the ball hard, elevates the ball, walks a lot, and gets great spin on its pitchers' off-speed pitches, there is a good chance that team is successful. The following table shows how many wins the model would predict for each team, their actual win total, and the difference between them.

Team (AL)    Pred  Actual  Diff    Team (NL)      Pred  Actual  Diff
Indians        95     102    -7    Dodgers         102     104    -3
Twins          83      85    -2    Cardinals        80      83    -3
Yankees        92      91     1    Diamondbacks     92      93    -1
Athletics      79      75     4    Pirates          77      75     2
Mariners       75      78    -3    Nationals        85      97   -12
Angels         84      80     4    Reds             63      68    -5
Astros         88     101   -13    Giants           73      64     9
Red Sox        88      93    -5    Mets             73      70     3
Blue Jays      82      76     6    Rockies          83      87    -4
Tigers         80      64    16    Marlins          74      77    -3
Rangers        82      78     4    Padres           64      71    -7
Rays           84      80     4    Phillies         80      66   -14
White Sox      63      67    -4    Brewers          84      86    -2
Orioles        73      75    -2    Cubs             91      92    -1
Royals         64      80   -16    Braves           75      72     3

Looking at this table and comparing predicted versus actual wins, you can see the model's weaknesses: it missed by at least ten wins for five teams. A big reason is that the model does not account for defense, base running, or park effects, and luck can also play a factor. There are still a ton of variables this model does not account for that play a huge role in a team's performance.

Lastly, before evaluating the overall performance of this model and discussing how to potentially improve it, I wanted to display a scatter plot with a line of best fit comparing predicted versus actual wins. The plot was produced in Minitab.


There is definitely evidence of a linear relationship, but as the scatter of the points shows, it is not as strong as the R-squared from the regression model might have indicated.

After viewing this, the final indicator of how the model performed is the benchmark error rate: the error rate of a model with only the response variable and no independent variables, i.e., one that always predicts the mean. A good model produces an error rate lower, ideally significantly lower, than the benchmark. Here the benchmark error rate was .507, and the actual error rate produced by the model was also .507, exactly the same. So we can conclude that although this model produced a high R-squared value and a scatterplot with evidence of a linear relationship, it is not a strong model.
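The benchmark comparison described above can be sketched as follows: the benchmark "model" predicts every team wins the test-set mean, and the error rate is MAPE (mean absolute percentage error), matching the mape calculations in the R code below. The win totals and predictions here are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

actual = [102, 85, 91, 75, 78, 80]       # hypothetical test-set win totals
model_preds = [95, 83, 92, 79, 75, 84]   # hypothetical model predictions

mean_wins = sum(actual) / len(actual)
benchmark = mape(actual, [mean_wins] * len(actual))  # predict the mean everywhere
model_err = mape(actual, model_preds)
print(model_err < benchmark)  # → True: a useful model beats the naive benchmark
```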

One of the biggest causes, I believe, is the small sample size. When working in R, you typically partition the data; I made 80% of the data the training subset and the other 20% the test subset. Here there were only 30 teams. Larger datasets allow a much larger test subset against which to compare the model built on the training data. With this small a sample, the 6 teams chosen for the test subset could easily be outliers or far from the mean, throwing off the evaluation. Even if I used the entire Statcast era, that would still be only about 90 teams, which is not a large sample.
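The 80/20 partition described above works the same way in Python's random.sample as in the R code's sample() call; a quick sketch, where the 30 "rows" are stand-ins for the 30 team rows:

```python
import random

random.seed(1)                       # fixed seed so the split is reproducible
rows = list(range(30))               # stand-ins for the 30 team rows
train_idx = random.sample(rows, round(0.8 * len(rows)))
test_idx = [i for i in rows if i not in train_idx]
print(len(train_idx), len(test_idx))  # → 24 6
```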

Another weakness is that I only ran one model. For the purposes of this piece I wanted to keep things to a single model, but a better approach would have been to repeat the partition-and-fit process many times and average the results.

I used team wins as my response variable, but I also could have run a model against a team's Pythagorean win total. While team wins are obviously the outcome that happened, they are not always a true representation of a team's performance; luck plays a role in a team's win total, and Pythagorean wins do a better job accounting for this, so they would have offered a different lens on the problem.
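For readers unfamiliar with it, Pythagorean expectation estimates winning percentage from runs scored and allowed. A quick sketch using the classic exponent of 2 (1.83 is also commonly used), with made-up run totals:

```python
def pythagorean_wins(runs_scored, runs_allowed, games=162, exp=2):
    """Expected wins: W% = RS^exp / (RS^exp + RA^exp), scaled to a season."""
    pct = runs_scored ** exp / (runs_scored ** exp + runs_allowed ** exp)
    return pct * games

# A hypothetical team scoring 800 runs and allowing 700:
print(round(pythagorean_wins(800, 700)))  # → 92
```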

The last weakness of the model is that no metrics accounting for defense or baserunning were used. These are two highly valuable parts of the game that could make a large difference in accounting for a team's overall play. While Statcast does have sprint speed, there is currently no way on Baseball Savant to see a team's average sprint speed.

Overall, the conclusion I came to is that during the 2017 season the skills that correlated most with team wins were exit velocity, launch angle, BB%, and average spin on off-speed pitches. While the relationship between the variables may not be strong, there is enough to say that a team performing well in all four metrics has a reasonable chance to be successful. I hope you enjoyed this piece, and that for fans who have not thought about the game this way, it sparked some questions and curiosity about how to evaluate performance. If anybody would like to see or use the code I used to produce this model, it is attached just below this paragraph. If anybody has further questions, comments, or suggestions for how I could improve this project or future projects, do not hesitate to reach out! I hope this is the beginning of my writing pieces using analytical tools like this, and any suggestions on how I can improve, or subjects to explore, would be much appreciated.


# Read the data

df <- read.csv("C:/Users/mdemarte1/Documents/Statcast.csv")

# Manage the data

df$Team.ID = NULL

df$X = NULL



N = nrow(df)

trainingSize = round(0.8*N)

trainingCases = sample(N, trainingSize)

train = df[trainingCases, ]

test = df[-trainingCases, ]


#Build the model

model1 = lm(Wins ~ 1, data=train)

model3 = lm(Wins ~ ., data=train)

step(model3, scope = list(lower = model1),
     direction = "backward")


# Manage the data: drop the variables backward selection deemed insignificant

df$Barrel = NULL

df$vFB = NULL

df$vAP = NULL

df$vFB.1 = NULL

df$ASFB = NULL

df$ASAP = NULL





N = nrow(df)

trainingSize = round(0.8*N)

trainingCases = sample(N, trainingSize)

train = df[trainingCases, ]

test = df[-trainingCases, ]


#Build the model

model = lm(Wins ~ ., data=train)


# Make predictions

predictions = predict(model, test)



observations = test$Wins

errors = observations- predictions


mape = mean(abs(errors)/observations)

rmse = sqrt(mean((errors)^2))


rmse_benchmark = sqrt(mean((observations - mean(observations))^2))

mape_benchmark = mean(abs((observations - mean(observations))/observations))


