Joseph Bae

No, I'm not a millionaire yet.

Well, it's been over a year since my last blog post, but that isn't entirely unexpected. It's been a busy first year of the PhD with grant submissions, publications, and courses, and I haven't found a subject worth writing about in that time. I also have not beaten the streak and therefore am not a multi-millionaire, but that probably isn't a surprise either. Nonetheless, I'm now back with an update on my progress towards that goal, as well as some next steps and code for you to play around with if you are so inclined.

In my introduction post I hinted at my general approach of training a machine learning (ML) model to predict which MLB batters might be most likely to get a hit on a given day. I settled on the following variables, or "features", for the model to use in this prediction:

Batting Average (AVG)
On Base Percentage (OBP)
Slugging Percentage (SLG)
Contact Percentage (Contact %)
Batter Playing at Home or Away (Home)
Batter vs. Pitcher Matchup Average (MatchupAverage)
Batter Batting Average in Most Recent 5 Games
Ballpark Hit Probabilities (BallparkNumber)
Pitcher Earned Run Average (era)
Pitcher Hits per 9 Innings (h9)
Pitcher Strikeouts per 9 Innings (k9)
Batting Average Against Pitcher (avg)

Next came one of the most difficult parts of any machine learning project: data acquisition. In my own research and in my work on this project, gathering data has been the most painstaking and lengthy process, but it's also often the most important. I could, and maybe will, write an entire post on the issues I came across while obtaining the data for this project, but I'll hold off for now. Suffice it to say that I am currently pulling data from multiple official and fan-supported baseball statistics sources, including Baseball Savant and Fangraphs. There does exist a Python package for interacting with some of these sources, though I found it a bit unwieldy for my purposes. As a result, I mostly use the Python "requests" library to pull data from these websites as well as from the MLB API (also unwieldy, but at least it's stuff that I wrote myself).

This single function required ~140 lines of code to work consistently.

But let's back up a little bit. I've talked about what variables I want to look at to predict which batters will get a hit, but how exactly am I training a machine learning model? Well, my initial approach was to do the following:

Collect these 12 pieces of information for top batters during each season from 2013 through 2021.
Input each of these variables into a logistic regression model to learn parameters for each variable to predict whether a batter got a hit on a given day.
Use the trained model to choose batters that are likely to get a hit for each day of the 2022 season.
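
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the "past seasons" are randomly generated, only two stand-in features are used (a standardized batting average and a home/away flag) instead of all 12, and the names `fit_logistic_regression` and `pick_best_batter` are hypothetical. The logistic regression is written out by hand with plain gradient descent so the sketch has no dependencies beyond the standard library.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, row):
    """Estimated probability that the batter described by `row` gets a hit."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))


def fit_logistic_regression(rows, labels, lr=0.5, epochs=300):
    """Learn one weight per feature by full-batch gradient descent on log loss."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, y in zip(rows, labels):
            err = predict_proba(weights, bias, row) - y
            for j, x in enumerate(row):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def pick_best_batter(candidates, weights, bias):
    """Return the name of the candidate with the highest predicted hit probability."""
    return max(candidates, key=lambda name: predict_proba(weights, bias, candidates[name]))


# Step 1 (stand-in): synthetic batter-days instead of real 2013-2021 data.
# ASSUMPTION: feature 0 = standardized batting average, feature 1 = 1.0 if home.
rng = random.Random(0)
TRUE_WEIGHTS, TRUE_BIAS = [1.5, 0.0], 0.5  # made-up values used only to generate labels
train_rows, train_labels = [], []
for _ in range(400):
    row = [rng.gauss(0.0, 1.0), float(rng.random() < 0.5)]
    p_hit = predict_proba(TRUE_WEIGHTS, TRUE_BIAS, row)
    train_rows.append(row)
    train_labels.append(1 if rng.random() < p_hit else 0)

# Step 2: learn a parameter for each feature.
weights, bias = fit_logistic_regression(train_rows, train_labels)

# Step 3: score one day's (made-up) candidates and choose the most likely hitter.
todays_candidates = {"Batter A": [1.5, 1.0], "Batter B": [-1.0, 0.0]}
best = pick_best_batter(todays_candidates, weights, bias)
print(best, round(predict_proba(weights, bias, todays_candidates[best]), 3))
```

With real data, each row would hold the 12 features listed earlier for one batter on one day, and the label would be 1 if that batter recorded a hit and 0 otherwise.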

The devoted baseball fan or the critical informatician might read the above and find several bones to pick with me. Some of these critiques I have probably addressed and some I have definitely not; this is an early approach and there's absolutely room for improvement (otherwise I'd be swimming in my 5.6 million dollar cash pool by now). Jokes aside, the above framework is a very naive but still generic way to approach the problem of predicting hits. At the moment, my model has achieved an all-time high streak of just 18, 39 short of what I need. And while that sounds pretty sub-par, it actually outperforms most other approaches: here, here, and here. (Big caveat: I need to perform more testing to rigorously make this claim, but I'm putting off further testing of the model until I've fully settled on its design.)
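
For reference, the streak figures above reduce to a small calculation over daily pick outcomes. The sketch below is an illustration only: the helper names `longest_streak` and `games_short` are hypothetical, the target of 57 is simply what the post's numbers imply (18 + 39), and the sequence of outcomes is made up to reproduce a best run of 18.

```python
TARGET_STREAK = 57  # implied by the post: an 18-game streak is 39 short


def longest_streak(outcomes):
    """Length of the longest run of consecutive True values (days the pick got a hit)."""
    best = current = 0
    for got_hit in outcomes:
        current = current + 1 if got_hit else 0
        best = max(best, current)
    return best


def games_short(streak, target=TARGET_STREAK):
    """How many more consecutive hits the streak would have needed to reach the target."""
    return max(0, target - streak)


# Made-up daily results: True means the chosen batter got a hit that day.
example_outcomes = [True] * 3 + [False] + [True] * 18 + [False] + [True] * 2

streak = longest_streak(example_outcomes)
print(streak, games_short(streak))
```

A single miss resets the running count to zero, which is why even a model that picks well on most days can sit far below the target.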

This post is a short one, mostly because it started as an intro to the next post (coming soon), which covers a subject I'm a tad more interested in writing about. But first, there's one more topic I want to briefly discuss. If you go through my code, you'll see that my initial experiments trialled multiple different machine learning models. You can play around with each and see which you like most, and I will continue to do so myself when I really make a concerted push for the prize. However, the model I've currently settled on makes use of "logistic regression". This is a very simple approach, especially compared with something like a neural network, but there's a key reason why it might be more interesting to use than more sophisticated models. That reason is the subject of my next blog post, on whether interpretability in artificial intelligence matters.

Finally, I promised some code and you shall now have it. Here is my GitHub repository with the functions and notebooks I've created for this project. This project is still far from complete or fully polished, but I think what I have can still be useful for people looking to model MLB data.