By: Jamie Weiss

Sports betting is an extremely popular hobby these days, yet betting on baseball is relatively rare. The reason is simple: compared with other sports, baseball is the hardest one in which to find trends and predict winners with high accuracy. If we can use data science to engineer a framework that analyzes Vegas lines and predicts MLB winners, we can both make money and learn a lot about the game itself.

First, it is important to understand the difficulty of the task at hand. If it were easy to use outside tools to help us make money, it would have already been done and Vegas would not exist. The first step in tackling this daunting problem is to set our goals and lay out a plan for success. As a data science engineer, I specialize in Artificial Intelligence, specifically the Machine Learning subset of AI. If you’re new to this kind of stuff, a simple Google search will give you a basic idea of how everything works. Nevertheless, here are a few sentences to summarize what you need to know for this project.

AI, in general, is a tool that is becoming increasingly popular and is the technology behind innovations such as self-driving cars, iPhone facial recognition, automated chatbots, and much more. AI trains computers to think like humans and analyze millions of data points at one time, making connections between the data and eventually predicting an outcome. A basic supervised model is used to predict the outcome of a certain event. It does this by looking at previous data with known correct outcomes (*usually* the more ‘previous data’ the better) and then predicting future outcomes given similar data. To make this a little easier to understand, think of a basic math function with input *x* values that yield output *y* values. Sometimes there are many *x* values (let’s use *x₁, x₂, x₃, …, xₙ*) that yield a single *y* outcome.

If we are trying to solve a simple problem like classifying flowers, our *x* values will be different important metrics about the flower (like stem length, number of petals, size of leaves, etc.), and our *y* value will simply be the correct answer: the type of flower. If we *train* our model on tons of correct, historical data (*x*’s and true *y* values), then when we have an unknown flower and we input its stem length, number of petals, and so on, we can have our model predict, or *classify*, the type of flower it is. Hopefully this example helps with understanding Artificial Intelligence as a whole. If you’re still in the dark, think of a student studying for a test by reading a textbook that includes the solutions to its problems. During the test, the student gets problems without solutions and, based on that studying, does their best to solve them. We solve problems like this every day; the benefit of AI is its ability to crunch, in an instant, more data than the human mind can process, and then make connections that we, as humans, can’t comprehend.
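To make the flower example concrete, here is a minimal sketch of supervised classification in plain Python. The measurements, the flower names, and the nearest-average rule are all invented for illustration; a real project would use far more data and a proper library such as scikit-learn.

```python
from math import dist

# Hypothetical training data: (stem length in cm, number of petals) -> flower type.
# Every number here is made up purely for illustration.
training_data = [
    ((30.0, 5), "rose"),
    ((28.0, 5), "rose"),
    ((45.0, 6), "tulip"),
    ((50.0, 6), "tulip"),
    ((10.0, 21), "daisy"),
    ((12.0, 34), "daisy"),
]


def train(examples):
    """'Training' here just averages the x values seen for each y label."""
    sums = {}
    for features, label in examples:
        total, count = sums.get(label, ([0.0] * len(features), 0))
        total = [t + f for t, f in zip(total, features)]
        sums[label] = (total, count + 1)
    return {label: [t / n for t in total] for label, (total, n) in sums.items()}


def classify(model, features):
    """Predict the label whose average training example is closest."""
    return min(model, key=lambda label: dist(model[label], features))


model = train(training_data)
# An unknown flower with a 29 cm stem and 5 petals.
print(classify(model, (29.0, 5)))
```

The *x* values go in as a tuple of numbers and the *y* value comes out as a label, which is exactly the shape of problem we need for predicting games.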

We must frame our problem so that it takes in *x*’s and outputs *y*’s before we can write an Artificial Intelligence model. Clearly, if we are trying to predict the outcomes of MLB games, our *x* values need to be metrics that affect the outcome of a game. At a basic level, these would include which teams are playing, who the starting pitchers are, each team’s current winning or losing streak, which key players are injured, and so on. Even from this short list you can see there are a ton of metrics we could use. We do, however, have a single metric that encompasses all of these values, one that has been refined over a long period of time: the Vegas line. A Vegas line implies a probability of each team winning that already accounts for all the factors mentioned above. If we use this probability as one of our *x* values, we take into account all of those metrics, as well as others that Vegas uses and we don’t know about.

So we have one really good *x* value, but just one is not going to cut it. In fact, if we used only this metric to predict outcomes, we would be picking the favorite every time. If we pick the favorite every time, we will pick the correct winner at a high rate; *however*, we will lose money in the long run, because the payouts for favorites are much lower than those for underdogs. Thus, we need other metrics to help us pinpoint when the underdog has a higher chance to win than Vegas gives them. The same metrics can also reveal when the favorite has a higher chance to win than Vegas predicts. As you can see, the solution is beginning to come together. We need to pick metrics and calculate our own probabilities, or our own ‘Vegas lines’, then compare them to the ones Vegas gives out and bet on whichever team our numbers give a better chance than Vegas does. Let’s simplify this with an example: if the Mets are playing the Yankees and the lines are Mets +150 and Yankees -150, that is the same as saying the Mets have a 40% chance of winning and the Yankees 60%. Now, if our model puts the Mets at 45% and the Yankees at 55%, we would bet on the Mets, because our probability for them is higher than Vegas’s, while our probability for the Yankees is lower. The Yankees will still win more of these games than the Mets, but the Mets are the better value: their payout is higher than their true chances warrant, so in the long run you will end up winning money. To calculate our own probabilities, we create a model that predicts the winner using metrics about the lines and other ideas highlighted later. If we run this simulation thousands of times, we can estimate the probability of each team winning from the share of simulations in which it wins the same game.
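The arithmetic behind the Mets and Yankees example can be sketched in a few lines of Python. The function names below are my own, and the conversion ignores the bookmaker’s margin (real lines are usually not perfectly symmetric like +150/-150), so treat this as an illustration rather than the exact calculation used in this project.

```python
def implied_prob(moneyline):
    """Convert an American moneyline (e.g. +150 or -150) to an implied win probability."""
    if moneyline > 0:
        return 100 / (moneyline + 100)
    return -moneyline / (-moneyline + 100)


def profit_per_dollar(moneyline):
    """Profit on a winning $1 bet at the given American moneyline."""
    return moneyline / 100 if moneyline > 0 else 100 / -moneyline


def expected_value(model_prob, moneyline):
    """Expected profit per $1 staked, if the model's probability is the true one."""
    return model_prob * profit_per_dollar(moneyline) - (1 - model_prob)


mets_line, yankees_line = 150, -150

# Vegas: Mets 40%, Yankees 60%.
print(implied_prob(mets_line), implied_prob(yankees_line))

# Our model: Mets 45%, Yankees 55%.
print(expected_value(0.45, mets_line))      # positive, so the Mets are the value bet
print(expected_value(0.55, yankees_line))   # negative, so we pass on the Yankees
```

At 45%, a $1 bet on the Mets is worth about 12.5 cents on average, while a $1 bet on the Yankees at 55% loses about 8 cents on average, even though the Yankees win more often.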

So we still need more metrics for our data. I’ll spare you my thought process for each of the metrics (it took a while to decide on them and find the historical data) and just tell you what the dataset looks like.

Most of these metrics are pretty self-explanatory, but to clarify:

- game_id is just a reference for me to make sure I don’t accidentally lose any data.
- H_team_id and A_team_id are 3-letter codes for the home and away teams.
- H_ML_open and A_ML_open are the opening Vegas moneylines for each team in each game.
- H_prob_open and A_prob_open are those lines converted to implied probabilities.
- H_ML_close, A_ML_close, H_prob_close and A_prob_close are the same things for the closing line.
- H_diff and A_diff are the differences between the opening and closing probabilities.
- H_streak and A_streak are the current winning or losing streak each team is on.
- win is a binary variable representing which team won the game: 0 = away team win, 1 = home team win.
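As a rough sketch of how the derived columns relate to the raw ones, the snippet below fills in the probability and diff fields for a single game. The record itself is invented, and two details are assumptions on my part: that the diff is closing minus opening probability, and that a losing streak is stored as a negative number.

```python
def implied_prob(moneyline):
    """Convert an American moneyline to an implied win probability."""
    if moneyline > 0:
        return 100 / (moneyline + 100)
    return -moneyline / (-moneyline + 100)


def add_derived_columns(game):
    """Return a copy of a raw game record with the prob and diff columns filled in."""
    row = dict(game)
    for side in ("H", "A"):
        row[f"{side}_prob_open"] = implied_prob(row[f"{side}_ML_open"])
        row[f"{side}_prob_close"] = implied_prob(row[f"{side}_ML_close"])
        # Assumption: diff is closing minus opening, so a positive value
        # means the line moved toward that team.
        row[f"{side}_diff"] = row[f"{side}_prob_close"] - row[f"{side}_prob_open"]
    return row


# Hypothetical game record; every value is invented for illustration.
raw_game = {
    "game_id": 1,
    "H_team_id": "NYY", "A_team_id": "NYM",
    "H_ML_open": -150, "A_ML_open": 150,
    "H_ML_close": -200, "A_ML_close": 200,
    "H_streak": 3, "A_streak": -2,   # assumption: negative means a losing streak
    "win": 1,                        # 1 = home team win
}

row = add_derived_columns(raw_game)
for name in ("H_prob_open", "H_prob_close", "H_diff", "A_diff"):
    print(name, round(row[name], 3))
```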

We are almost done! Once again I’ll spare you the coding talk about how I actually built the model to calculate probabilities, but for those who are interested: I used a Deep Neural Network built with the scikit-learn framework, with a hidden layer size of 5. I had the model output probabilities with a simulation size of N = 100,000. In other words, I simulated each game 100,000 times and derived the probabilities from those simulations. The dataset contained just under 10,000 records, and it was split so that testing data made up 2% of the total (which leaves 98% for training). The split used a fixed random state, so I could reproduce and archive specific results as needed. Before looking at games in real time, I can simulate betting based on past results. I used some basic mathematical maximization techniques to find thresholds that determine which team to bet on, given our calculated line versus the Vegas line. Based on the same math, I also determine which games not to bet on because our line is too close to the Vegas line. By the end of the calculations, we end up with bets for about 75% of the games on a given day, on average.
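The threshold idea can be illustrated with a small backtest. This is only a sketch under my own assumptions: flat $1 bets, a single threshold on the gap between the model’s probability and the Vegas implied probability, and a plain grid search standing in for the maximization step. The three games in `history` are invented, so the resulting numbers mean nothing beyond showing the mechanics.

```python
def implied_prob(moneyline):
    """American moneyline -> implied win probability."""
    if moneyline > 0:
        return 100 / (moneyline + 100)
    return -moneyline / (-moneyline + 100)


def profit_per_dollar(moneyline):
    """Profit on a winning $1 bet at an American moneyline."""
    return moneyline / 100 if moneyline > 0 else 100 / -moneyline


def backtest(games, threshold):
    """Bet $1 on any side whose model probability beats the Vegas implied
    probability by more than `threshold`; skip the game otherwise.
    Returns (number of bets, ROI per dollar staked)."""
    bets, profit = 0, 0.0
    for home_ml, away_ml, model_home_prob, win in games:
        sides = (
            (home_ml, model_home_prob, win == 1),
            (away_ml, 1 - model_home_prob, win == 0),
        )
        for moneyline, model_prob, won in sides:
            if model_prob - implied_prob(moneyline) > threshold:
                bets += 1
                profit += profit_per_dollar(moneyline) if won else -1.0
    return bets, (profit / bets if bets else 0.0)


def best_threshold(games, candidates):
    """Grid search: the candidate threshold with the highest backtested ROI."""
    return max(candidates, key=lambda t: backtest(games, t)[1])


# Toy history: (H_ML, A_ML, model's home win probability, win). Values are invented.
history = [
    (-150, 150, 0.55, 0),
    (-120, 120, 0.56, 0),
    (130, -130, 0.50, 1),
]
candidates = (0.0, 0.02, 0.06, 0.10)
for t in candidates:
    print(t, backtest(history, t))
print("best threshold:", best_threshold(history, candidates))
```

Raising the threshold trades volume for selectivity: at 0.10 no game in this toy history qualifies, which is the same mechanism that leaves some real games without a bet.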

As the 2019 MLB season begins, I am curious how the picks will perform over the course of a full season. Thus far, the model’s hypothetical ROI (return on investment) would be about 20%. I will be releasing a more detailed analysis of the results, and how I plan to improve the model, in my next post. Here is a screenshot of what the AI looks like when it runs.

Hopefully, based on what I’ve written here, you can try to solve a similar problem on your own. Through this process I’ve learned a lot about Vegas and what goes into its lines, which has helped me predict games with better accuracy than before. I am really interested in feedback, so if you have any questions, feel free to send me an email at studentsofbaseball@gmail.com.