by: Matthew de Marte – November 21st
Ever since I launched SOTG, I have dreamed of writing the perfect article about Shohei Ohtani. Subjectively, he has been one of my favorite players since I first heard of the supposed “Japanese Babe Ruth,” and objectively he is one of the most talented and captivating players I have ever seen. I also share a special bond over Japanese ball players with my aunt, who was born and raised in Hiroshima, which only heightened my fandom for Ohtani. One thing we pride ourselves on at SOTG is that an article is never just a list of statistics from our favorite websites to back up an opinion. It is actual statistical analysis involving programming languages, data visualizations, and rigorous methods, meant to further our personal development as quants and writers and to provide the best possible content for you, our readers. Most of the topics I brainstormed for Ohtani fell into the former category: really cool stats that backed up my opinion that he is a transcendent talent.
On Monday, November 12th, Shohei Ohtani made history, becoming just the 4th Japanese-born player to win the Rookie of the Year award. His main competition was the slugging third baseman of the New York Yankees, Miguel Andújar. Andújar had a fine season, but a quick survey of the advanced metrics, plus the fact that Ohtani also provided ~50 good innings as a pitcher, makes it easy to deduce that Ohtani was the rightful recipient of the award. What Shohei Ohtani accomplished in 2018 was nothing short of historic, and I would go as far as saying it was revolutionary. Still, this article is not going to focus on his fantastic season itself, but rather look at it through the lens of predictive modeling in R. As a New Yorker, I have heard the gripes of many Yankee fans, whether on the radio, in person, or on the dreaded #YankeesTwitter. So I have decided to take it upon myself to use machine learning to decide, once and for all, who was the better (more valuable) hitter in 2018!
So let’s set the parameters. First, let’s quickly look at how Ohtani and Andújar performed in some key areas as hitters in 2018.
| Player | PA | wOBA (xwOBA) | Batting Runs | fWAR |
Speaking objectively, Shohei Ohtani was a better, more valuable hitter than Miguel Andújar in 2018. We do not really care about that, though! What I care about is who would be projected to be the better hitter based on the underlying inputs of what they did: batted ball profile, strikeouts, and walks.
The Data and Model Setup
The model I set up analyzes every batted ball produced in MLB in 2018. Using a random forest, it takes various characteristics of each batted ball and predicts its result. A random forest is better suited than a logistic regression model to this situation for a few reasons. First, a random forest grows as many decision trees as you’d like (in this case 100), each on a bootstrap sample of the data, and combines their predictions (a majority vote, for a categorical response) into one stronger model. This makes the model much less prone to overfitting than a single tree. Second, it naturally handles a response variable with five levels (out, single, double, triple, home run). After fitting the random forest, I make predictions and subset the projected result of every batted ball for Ohtani and Andújar in 2018. Then, from their projected statistics, I calculate their projected wOBA, Batting Runs, and the offensive component of fWAR to see who is projected to be the better hitter based on expected output.
Time to take a look at the model! Here is the output from the random forest model. Note that the confusion matrix has five levels, labeled by the wOBA weight of each outcome: 0 = out, .9 = single, 1.25 = double, 1.6 = triple, and 2 = home run.
The confusion matrix at the bottom of the output is what we are focusing on. Each row represents observed outcomes, and each column displays predicted outcomes. For example, the cell (0,0) has 60408 batted balls in it, meaning those were actual outs that the model also predicted to be outs. The cell (0,2) has 271 batted balls, meaning 271 batted balls that were predicted to be home runs were actually outs. The class.error column gives the error rate for each row, that is, for each observed outcome. The OOB estimate of error rate, 12.3 %, is the model’s overall error rate, meaning it correctly predicted 87.7 % of batted balls. That’s pretty good! By leaving some room for error, the model probably credits some hitters for unlucky outs and docks others for “lucky” hits, giving us a better indication of their true performance on batted balls. Our benchmark error rate for this situation is 29.6 % (league-wide BABIP), because simply predicting every ball in play to be an out would be right on 70.4 % of batted balls. The random forest’s error rate of 12.3 % is much lower than that benchmark, which shows the model is useful and can be applied to the Ohtani and Andújar debate!
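To make the "many models, one combined prediction" idea concrete, here is a minimal Python sketch of bootstrap aggregation with a majority vote, which is the core idea behind a random forest. It is not the R code behind this article: the base learner is a toy one-split "stump" on a single feature, the exit-velocity data is made up, and all function names are hypothetical. A real random forest also grows full trees and samples a random subset of features at each split.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(sample):
    """Toy one-split 'tree': pick the cutoff that misclassifies the fewest rows."""
    best = None
    for cutoff in sorted({x for x, _ in sample}):
        left = [y for x, y in sample if x <= cutoff]
        right = [y for x, y in sample if x > cutoff]
        left_label = majority(left)
        right_label = majority(right) if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, cutoff, left_label, right_label)
    return best[1:]  # (cutoff, label at or below cutoff, label above cutoff)

def fit_forest(rows, n_trees=100, seed=0):
    """Fit each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump(rng.choices(rows, k=len(rows))) for _ in range(n_trees)]

def predict(forest, x):
    """Every stump votes; the most common vote is the prediction."""
    votes = [left if x <= cutoff else right for cutoff, left, right in forest]
    return majority(votes)

# Made-up toy data: (exit velocity in mph, outcome). Not real Statcast data.
rows = ([(v, "out") for v in range(70, 90, 2)]
        + [(v, "home_run") for v in range(100, 120, 2)])
forest = fit_forest(rows)
```

Calling `predict(forest, 119)` asks all 100 stumps for a vote and returns the majority answer, so no single oddly drawn sample decides the outcome.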
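The arithmetic behind those error rates can be sketched in a few lines of Python. This is only an illustration on a made-up three-class matrix, not the article's actual output, and the function name is hypothetical. Rows are observed outcomes and columns are predicted outcomes, matching the layout described above.

```python
def error_rates(confusion):
    """confusion[i][j] = balls with observed class i that were predicted as class j."""
    total = sum(sum(row) for row in confusion)
    correct = sum(row[i] for i, row in enumerate(confusion))
    overall = 1 - correct / total
    # Per-class error: share of each observed class (row) that was mispredicted.
    per_class = [1 - row[i] / sum(row) for i, row in enumerate(confusion)]
    # Benchmark: always predict the most common observed class.
    baseline = 1 - max(sum(row) for row in confusion) / total
    return overall, per_class, baseline

# Made-up counts for illustration only.
toy = [
    [90, 8, 2],    # observed outs
    [10, 18, 2],   # observed singles
    [3, 2, 15],    # observed home runs
]
overall, per_class, baseline = error_rates(toy)
```

A model is only worth using when `overall` comes in clearly below `baseline`, which is the same comparison made above between 12.3 % and 29.6 %.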
Applying the Model to Ohtani and Andújar
Now that we have a working model, let’s complete the task this article set out to accomplish: projecting Ohtani and Andújar’s batted balls to see who was the better, more valuable hitter in 2018! With the random forest fitted, getting the predictions for Ohtani’s and Andújar’s batted balls is pretty simple. Predictions are made, and then the results for each of the sensational rookies can be exported. Note that sacrifice flies are not predicted by this model and are therefore treated as outs; to keep things consistent, I add sacrifice flies to each player’s observed outs from the 2018 season. Let’s take a look at how the predicted data compared to the actual data.
Not much actually changed from the observed results to the projected ones! Ohtani was projected to make just one more out and hit for slightly more power, and Andújar’s projection was similar. Time to calculate their new wOBA, Batting Runs, and Wins Above Replacement. The first step is to recalculate their weighted on-base averages. After that, each player’s wOBA is converted to wRAA (weighted Runs Above Average), which becomes Batting Runs, which in turn becomes WAR: WAR is calculated as Batting Runs / Runs per Win. Note that this WAR considers only their offensive production. The following table compares Ohtani and Andújar’s projected offensive metrics based on the random forest model:
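The predict-then-subset step might look roughly like the following Python sketch. It assumes the model's output has already been reduced to (batter, predicted outcome) pairs; the function name is hypothetical and the rows are invented for illustration, not real projections.

```python
from collections import Counter, defaultdict

def tally_by_batter(predictions, batters):
    """Count predicted outcomes for the chosen batters only."""
    wanted = set(batters)
    tallies = defaultdict(Counter)
    for batter, outcome in predictions:
        if batter in wanted:
            tallies[batter][outcome] += 1
    return {b: dict(tallies[b]) for b in batters}

# Invented rows for illustration; one row per batted ball.
predictions = [
    ("Ohtani", "out"), ("Ohtani", "home_run"), ("Andujar", "double"),
    ("Other", "single"), ("Andujar", "out"), ("Ohtani", "out"),
]
projected = tally_by_batter(predictions, ["Ohtani", "Andujar"])
```

Because sacrifice flies are treated as outs here, a comparable observed line would need them folded into the out total as well.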
| Metric | wOBA | wRAA | Batting Runs | Offensive WAR |
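For readers who want to follow the wOBA to wRAA to WAR chain themselves, here is a hedged Python sketch. The hit weights are the rounded values quoted earlier in this article. The walk and hit-by-pitch weights, league wOBA, wOBA scale, and runs-per-win figures are placeholder assumptions rather than official 2018 constants, and the stat line is invented. The sketch also skips the park and league adjustments that separate wRAA from Batting Runs.

```python
# Hit weights come from this article; the "ubb" and "hbp" weights are
# placeholder assumptions, not official constants.
WEIGHTS = {"ubb": 0.7, "hbp": 0.7, "single": 0.9,
           "double": 1.25, "triple": 1.6, "home_run": 2.0}

def woba(counts, ab, sf, weights=WEIGHTS):
    """Weighted on-base average; denominator is AB + unintentional BB + SF + HBP."""
    numerator = sum(weights[k] * counts.get(k, 0) for k in weights)
    denominator = ab + counts.get("ubb", 0) + sf + counts.get("hbp", 0)
    return numerator / denominator

def wraa(player_woba, league_woba, woba_scale, pa):
    """Weighted Runs Above Average: wOBA gap over the scale, times plate appearances."""
    return (player_woba - league_woba) / woba_scale * pa

def offensive_war(batting_runs, runs_per_win):
    """Offense-only WAR: Batting Runs / Runs per Win."""
    return batting_runs / runs_per_win

# Invented stat line and placeholder league constants, for illustration only.
counts = {"ubb": 30, "hbp": 2, "single": 50, "double": 20, "triple": 2, "home_run": 20}
player_woba = woba(counts, ab=300, sf=3)
runs = wraa(player_woba, league_woba=0.320, woba_scale=1.2, pa=335)
war = offensive_war(runs, runs_per_win=10.0)
```

Swapping in the published league constants for the season and each player's projected counts would reproduce the kind of comparison shown in the table above.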
Evaluating the Data and Conclusions
It appears that our model agrees with the voters and picked Shohei Ohtani as the rightful Rookie of the Year. We projected him to have a sizable advantage in wOBA over Andújar, and when translated to Offensive WAR, Ohtani is again the more valuable player. This doesn’t even factor in Ohtani’s performance on the mound, and Andújar’s less than stellar defense at third. Miguel Andújar had a fantastic season, but whether the results are projected or observed, Shohei Ohtani was the better hitter, and player, and the rightful ROY. I hope you all enjoyed this article! For anyone who is interested in furthering this discussion, wants to take a look at the code, or wants to use the dataset for their own work, please reach out to email@example.com ! For your viewing pleasure, please enjoy Shohei Ohtani absolutely obliterating a baseball into the stratosphere!
— MLB Pipeline (@MLBPipeline) April 7, 2018
Enjoy his filthy pitch arsenal as well!
Shohei Ohtani is pretty good at this pitching thing. pic.twitter.com/UWygQGtkqz
— MLB (@MLB) May 21, 2018
Photo Credit to Kyodo News