
xBABIP: Creating Expected Batting Average on Balls In Play Using Multiple Linear Regression

By Kevin Antonevich
In baseball, batting average on balls in play, or BABIP, can be used to validate an
individual hitter's performance. When a hitter's BABIP is high relative to his career
trends, his season is often labeled lucky, and the opposite is true for a hitter with an
unusually low BABIP. Using an expected BABIP model, or xBABIP, one can determine
whether a player's abnormal BABIP was due to chance or to a legitimate change in skill.
With assistance from a math teacher at my high school, I created a reliable, statistically
significant xBABIP model using multiple linear regression. It can be used to validate hitter
performance after 300 plate appearances at the Major League level. The data used to construct
the model are from 2007-2013 and were collected from FanGraphs and Baseballheatmaps.com.
During development, statistics (variables) were added or removed according to their
statistical significance, as determined by t-value analysis. After creating many models, I
finalized a complete formula. By entering data from a single season into the formula, I
generated each qualified player's xBABIP for that season. I could then compare each player's
xBABIP and BABIP and determine whether his BABIP from that season was supported by his
peripheral statistics. These comparisons give insight into expectations for future
performance. A key detail is that the data used to create the formula were not the data used
to identify the outliers: outliers were identified from a single season's data.
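For reference, BABIP is conventionally computed as (H - HR) / (AB - K - HR + SF). The short Python sketch below shows that calculation; the function name and the sample stat line are illustrative assumptions and are not taken from the data behind this article.

```python
def babip(hits, home_runs, at_bats, strikeouts, sac_flies):
    """Batting average on balls in play: (H - HR) / (AB - K - HR + SF)."""
    balls_in_play = at_bats - strikeouts - home_runs + sac_flies
    if balls_in_play <= 0:
        raise ValueError("no balls in play")
    return (hits - home_runs) / balls_in_play

# Hypothetical stat line (not a real player's season).
example = babip(hits=150, home_runs=20, at_bats=550, strikeouts=100, sac_flies=5)
print(round(example, 3))
```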
From my model, I determined that the statistics that are significant in determining
xBABIP are:

GB/FB; ratio of ground balls to fly balls
LD%; line drive rate
IFFB%; infield fly ball percentage
HR/FB; ratio of home runs to fly balls
IFH%; infield hit percentage
Spd; speed score
Average fly ball distance
Oppo%; percentage of balls hit to the opposite field

All of these statistics are significant at the 99.9% level by t-value analysis, and their
relationships with BABIP all make logical sense. The R-squared for my model is .693; while
this is not a perfect correlation, it indicates that my results are legitimate from a
statistical perspective. Additionally, my xBABIP model has a confidence interval of .00183,
and variation this small has a negligible impact on a player's xBABIP as a whole.
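The kind of fit described above (multiple linear regression, with t-values used to judge each variable and R-squared to judge the whole model) can be sketched with NumPy. This is only an illustration on synthetic data: the predictor names, the "true" coefficients, and the `fit_ols` helper are assumptions made up for the example, and this is not the article's actual formula or data.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept.

    Returns the coefficients, their t-values, and R-squared.
    """
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])       # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]                       # residual degrees of freedom
    sigma2 = resid @ resid / dof               # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)      # coefficient covariance matrix
    t_values = beta / np.sqrt(np.diag(cov))
    r_squared = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, t_values, r_squared

# Synthetic stand-in data (assumed values, not real player seasons):
# two predictors that drive the response and one that is pure noise.
rng = np.random.default_rng(0)
n = 500
ld_pct = rng.normal(0.20, 0.03, n)
iffb_pct = rng.normal(0.10, 0.03, n)
noise_stat = rng.normal(0.0, 1.0, n)           # unrelated to the response
babip_obs = 0.20 + 0.6 * ld_pct - 0.3 * iffb_pct + rng.normal(0, 0.01, n)

X = np.column_stack([ld_pct, iffb_pct, noise_stat])
beta, t_values, r_squared = fit_ols(X, babip_obs)

for name, b, t in zip(["intercept", "LD%", "IFFB%", "noise"], beta, t_values):
    print(f"{name:>9}: coef={b:+.4f}  t={t:+.2f}")
print(f"R-squared: {r_squared:.3f}")
```

With a large sample, a two-sided 99.9% significance level corresponds to an absolute t-value of roughly 3.3, so a variable whose t-value stays well below that threshold would be a candidate for removal.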
Before I started making predictions, I tested the model's accuracy by entering data from
the 2012 season to see whether the expected change took place for the regression candidates
in 2013. (Regression candidates were identified as players whose difference between xBABIP
and BABIP for one season was greater than two standard deviations.) Of the 22 players I
identified as regression candidates in 2012, all but one saw his BABIP change significantly
in the expected direction in the following season. The same process was then applied to
players in each individual season between 2007 and 2011, and regression was again predicted
with high accuracy. These tests indicate that my xBABIP model is a reliable method for
predicting BABIP regression.
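The two-standard-deviation rule can be sketched as follows. The text does not say which standard deviation is meant, so this sketch assumes it is the sample standard deviation of the BABIP minus xBABIP differences across that season's qualified players. The player names and numbers are hypothetical.

```python
import statistics

def regression_candidates(players, threshold_sd=2.0):
    """Flag players whose BABIP - xBABIP gap exceeds threshold_sd standard deviations.

    `players` maps a name to a (BABIP, xBABIP) pair for one season.
    Assumption: the standard deviation is taken over all players' gaps.
    """
    diffs = {name: b - xb for name, (b, xb) in players.items()}
    sd = statistics.stdev(list(diffs.values()))
    return [(name, d) for name, d in diffs.items() if abs(d) > threshold_sd * sd]

# Hypothetical season: (BABIP, xBABIP) per player.
season = {
    "Player A": (0.305, 0.300),
    "Player B": (0.295, 0.300),
    "Player C": (0.320, 0.310),
    "Player D": (0.290, 0.300),
    "Player E": (0.302, 0.300),
    "Player F": (0.298, 0.300),
    "Player G": (0.318, 0.310),
    "Player H": (0.292, 0.300),
    "Player I": (0.300, 0.300),
    "Player J": (0.360, 0.300),   # BABIP far above xBABIP
}

flagged = regression_candidates(season)
for name, diff in flagged:
    direction = "decline" if diff > 0 else "improve"
    print(f"{name}: BABIP - xBABIP = {diff:+.3f}, expected to {direction}")
```

A positive gap marks a hitter whose BABIP outran his peripherals and who would be expected to decline; a negative gap marks one expected to improve.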
During spring training, I repeated the process using data from the 2013 season. I
identified 21 players whom I expected to regress in the 2014 season. However, Jose Iglesias
and Yuniesky Betancourt were identified as outliers but are not playing this season, so my
sample has dropped to 19 players. As of August 15, 16 of these players have regressed in
their expected direction (Fig. 1). Additionally, two of the three players who have not
regressed in their expected direction (Will Middlebrooks and J.P. Arencibia) have had
limited playing time this season. As they get more plate appearances, their 2014 BABIP
should move toward their 2013 xBABIP. This test provides further evidence for my model's
predictive legitimacy and demonstrates that it can be used as an effective method for
identifying regression candidates. (One important point is that my model is not just
selecting the players with the highest and lowest BABIP in the league each season. For each
player that my model identifies as an outlier, there is another player with a similar BABIP
that my model has shown to be supported by his peripheral statistics.)
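The back-test logic can be sketched as a simple direction check plus the success rates implied by the counts reported above (21 of 22 for 2012 candidates, 16 of 19 for 2013 candidates as of August 15). The function name is hypothetical, and the check covers direction only; the text does not state how "significantly" was judged, so no size threshold is applied here.

```python
def moved_in_expected_direction(babip_prev, xbabip_prev, babip_next):
    """True if next season's BABIP moved from last season's BABIP toward xBABIP.

    Direction only: the size of the move is not tested (an assumption of
    this sketch, since the article does not give a significance criterion).
    """
    expected_change = xbabip_prev - babip_prev
    actual_change = babip_next - babip_prev
    return expected_change * actual_change > 0

# Hypothetical players, not real data.
print(moved_in_expected_direction(0.360, 0.300, 0.320))  # high BABIP came down
print(moved_in_expected_direction(0.250, 0.300, 0.240))  # low BABIP fell further

# Success rates from the counts reported in the text.
rate_2012 = 21 / 22
rate_2013 = 16 / 19
print(f"2012 candidates: {rate_2012:.1%}")
print(f"2013 candidates: {rate_2013:.1%}")
```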
In addition to predicting regression over the course of a whole season, my xBABIP
model can be used to provide insight into a player's trade value around the trade deadline.
For players who had 300 or more plate appearances around July 31, I generated each player's
xBABIP using the same process as described above. From this, 15 players were identified as
outliers, and two of these players were affected by moves around the trade deadline (Fig. 2).

Allen Craig, who was traded from the Cardinals to the Red Sox, has been underperforming
according to his xBABIP. Despite his disappointing 2014, he should rebound in 2015 and be
a solid contributor for the Red Sox.
When the Yankees acquired Stephen Drew to play second base, Brian Roberts lost his
starting job, was sent to the minors, and was eventually released from the team. However,
my model indicates that he was underperforming. Had the Yankees given him more
time, he might have finished the season strong, making the acquisition of Drew
unnecessary.

Overall, my xBABIP model can be used to identify regression candidates in baseball
with high accuracy. Identifying outliers is important in adjusting predictions and
expectations for players, and can be used to gain insight into a player's value near the
trade deadline.

2014. Kevin Antonevich. All Rights Reserved.

Figure 1: Pre-Season Regression Candidates for the 2014 Season, With Updated 2014 BABIPs

*Yuniesky Betancourt has retired, and Jose Iglesias will miss the season with shin fractures.

Figure 2: Regression Candidates at the 2014 Trade Deadline
