There's an argument to be made that hitting a baseball is the hardest thing to do in sports. Over the past 10 years, the major league batting average has hovered around .250, meaning that only about a quarter of at bats end in a hit. A really good batter might average somewhere in the .300s, and no batter has finished a season with an average over .400 since Ted Williams hit .406 in 1941.
There are plenty of reasons why hitting a baseball in the MLB is difficult. Take a fastball coming at you at 90 mph. That's about 132 feet per second. With the pitcher's rubber 60 feet and 6 inches from home plate, the fastball crosses the plate in roughly 460 milliseconds. Blink, and you might miss it. There's also the variety of pitches a batter has to contend with, the fact that not all pitches will be strikes, the small contact area between an extremely fast-moving ball and a swinging bat, and on and on. Suffice it to say that hitting a baseball with a bat is difficult, and even the pros fail in the majority of their at bats.
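For anyone who wants to check the fastball arithmetic, here is a minimal Python sketch. It assumes the ball holds a constant speed and travels the full 60 feet 6 inches; a real pitch is released a few feet closer to the plate and slows down in flight, so treat the result as a ballpark figure. The function names are my own.

```python
FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600
RUBBER_TO_PLATE_FT = 60.5  # 60 feet 6 inches


def mph_to_fps(mph: float) -> float:
    """Convert miles per hour to feet per second."""
    return mph * FEET_PER_MILE / SECONDS_PER_HOUR


def time_to_plate_ms(mph: float, distance_ft: float = RUBBER_TO_PLATE_FT) -> float:
    """Travel time in milliseconds, assuming constant speed over the full distance."""
    return distance_ft / mph_to_fps(mph) * 1000


print(f"90 mph is {mph_to_fps(90):.0f} feet per second")
print(f"Time to the plate: {time_to_plate_ms(90):.0f} ms")
```

Bumping the speed to 100 mph in the same function shows how much less time a batter gets against the hardest throwers.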
Now consider this record. In 1941, Joe DiMaggio got a hit in 56 straight games, a streak that has not been broken since. In fact, the record is considered unbreakable (see an interesting discussion here), due to several factors in the game's evolution over the last 80 years. Regardless, DiMaggio's streak is certainly impressive when one considers that even with an excellent batting average of .400 and an optimistic 4 at bats per game, the probability of getting at least one hit in each of 56 consecutive games is well under a tenth of a percent (about 0.04 percent if every at bat is treated as independent). The odds of achieving the goal I'm about to outline are not enormously better.
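The streak probability is easy to reproduce. The sketch below assumes every at bat is an independent coin flip with the same chance of a hit, and that the batter gets exactly 4 at bats in every game. Neither is true of real baseball, so this is a back-of-the-envelope number and nothing more. The function names are my own.

```python
def p_hit_in_game(batting_avg: float, at_bats: int) -> float:
    """Chance of at least one hit in a game: 1 minus the chance of going hitless."""
    return 1 - (1 - batting_avg) ** at_bats


def p_streak(batting_avg: float, at_bats: int, games: int) -> float:
    """Chance of at least one hit in each of `games` consecutive games."""
    return p_hit_in_game(batting_avg, at_bats) ** games


per_game = p_hit_in_game(0.400, 4)
streak = p_streak(0.400, 4, 56)
print(f"Chance of a hit in any one game: {per_game:.1%}")
print(f"Chance of a 56-game streak: {streak:.4%} (about 1 in {1 / streak:,.0f})")
```

Even a batter who gets a hit in roughly 87 percent of his games ends up with a streak probability of a few hundredths of a percent, because that 87 percent has to be multiplied by itself 56 times.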
In the spirit of increasing engagement in baseball and having a little fun, the MLB has hosted a "Beat the Streak" competition for the last 20 years in which fans pick, each day of the regular MLB season, a batter they believe will get a hit that day. Anyone who strings together 57 correct picks in a row can claim a cash prize of 5.6 million dollars. The longest streaks anyone has put together over those 20 years have topped out at 51 games, impressive but not quite within range of the coveted prize. Multiple calculations have been performed to determine just how hard it might be to Beat the Streak, and the estimates vary between 1 in 4.3 million and 1 in 17 billion (there are a lot of factors at play, some that I hope to discuss in future posts). Ultimately, it's a nearly impossible task with odds comparable to winning the lottery.
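One way to get a feel for those two estimates is to ask what per-pick accuracy each one implies. The sketch below assumes every daily pick succeeds with the same probability, independently of the others, so that the chance of 57 straight correct picks is that probability raised to the 57th power. The published estimates may well rest on different assumptions, so this is only a rough translation. The function name is my own.

```python
def implied_pick_accuracy(one_in_n: float, picks: int = 57) -> float:
    """Per-pick success rate p such that p ** picks equals 1 / one_in_n."""
    return (1 / one_in_n) ** (1 / picks)


for odds in (4.3e6, 17e9):
    p = implied_pick_accuracy(odds)
    print(f"1 in {odds:,.0f} corresponds to about {p:.1%} accuracy per pick")
```

Under this simplification, the gap between the optimistic and pessimistic estimates comes down to roughly ten percentage points of daily accuracy, which is the kind of edge a prediction model would have to find.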
I've been trying to Beat the Streak for something like the last 5 years now. It's entertaining and fun to try my luck, but I think the longest streak I've managed over that time is something like 16. Like many others who have made the attempt, I've tried several strategies, including trusting my gut, leveraging the matchup and analytics data the Beat the Streak app provides to make educated guesses, going with players I personally like, and just randomly choosing players who seemed like they might get lucky. Obviously, I'm not yet a millionaire, but let's move on to my new plan.
I've decided that my next move is to use some machine learning approaches to see if I can create a model that predicts who is most likely to get a hit on any given game day. Though I'm hopeful this approach will improve my chances at the jackpot, realistically I see this mostly as a good project for building my experience in working with the MLB API, creating time-series machine learning models, and tackling non-medical problems.
I'll end with just a flavor of how I've already begun to approach this task and my plans moving forward. First, as always, getting data is critical to any machine learning task. To that end, I've created some functions to pull baseball statistics and historical records from multiple sources, including the MLB and third-party statistics websites. I've also begun creating methods to parse this data to find statistics that I think will actually matter for my purposes. For instance, baseball fans might know that some ballparks are more hitter-friendly than others (just consider the high elevation at Coors Field and the way balls fly farther and higher there). Giving some preference to batters about to play away against the Rockies is just one variable that might ultimately help me produce a game-winning approach. I also will explore both machine learning and deep learning models, using both single-timepoint and time-series modelling. Hopefully, some time in the near future all of this will result in a 5.6 million dollar paycheck.
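To make the feature idea a bit more concrete, here is a rough Python sketch of turning a batter's recent game logs and the next ballpark into model inputs. Everything in it is hypothetical: the `GameLog` fields, the function names, and especially the park factor value, which is a made-up placeholder and not a real statistic. It is not the code I'm actually running, just an illustration of the shape such features might take.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GameLog:
    """One game for one batter (hypothetical schema)."""
    batter: str
    park: str
    at_bats: int
    hits: int


# Placeholder value for illustration only, NOT a real park factor.
PARK_FACTOR: Dict[str, float] = {"Coors Field": 1.10}
DEFAULT_PARK_FACTOR = 1.0


def build_features(logs: List[GameLog], next_park: str, window: int = 10) -> Dict[str, float]:
    """Summarize the most recent `window` games and attach the next park's factor."""
    recent = logs[-window:]
    games = len(recent)
    at_bats = sum(g.at_bats for g in recent)
    hits = sum(g.hits for g in recent)
    games_with_hit = sum(1 for g in recent if g.hits > 0)
    return {
        # Share of recent games with at least one hit (what Beat the Streak rewards)
        "recent_hit_game_rate": games_with_hit / games if games else 0.0,
        # Batting average over the same window
        "recent_batting_avg": hits / at_bats if at_bats else 0.0,
        # Falls back to a neutral factor for parks not in the table
        "next_park_factor": PARK_FACTOR.get(next_park, DEFAULT_PARK_FACTOR),
    }


# Toy logs for an imaginary batter
logs = [
    GameLog("Example Batter", "Home Park", 4, 1),
    GameLog("Example Batter", "Home Park", 3, 0),
    GameLog("Example Batter", "Home Park", 4, 2),
    GameLog("Example Batter", "Home Park", 5, 1),
]
print(build_features(logs, next_park="Coors Field"))
```

The rolling window is what would make this a time-series feature: the same batter gets a different row for every game day, built only from games played before that day.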