Probability Formulas and Statistical Analysis in Tennis

Journal of Quantitative Analysis in Sports
Volume 4, Issue 2, 2008, Article 15
A. James O'Malley, Harvard Medical School
Recommended Citation:
O'Malley, A. James (2008) "Probability Formulas and Statistical Analysis in Tennis," Journal of
Quantitative Analysis in Sports: Vol. 4: Iss. 2, Article 15.
Available at: http://www.bepress.com/jqas/vol4/iss2/15
DOI: 10.2202/1559-0410.1100
© 2008 American Statistical Association. All rights reserved. Published by Berkeley Electronic Press.
Abstract
In this paper an expression for the probability of winning a game in a tennis match is derived
under the assumption that the outcome of each point is identically and independently distributed.
Important properties of the formula are evaluated and presented pictorially. The accuracy of this
formula is tested by comparing observed proportions against predicted values using data from the
2007 Wimbledon Tennis Championships. We also derive expressions for the probability of several
other milestones in a tennis match including winning a tiebreaker, winning a set, winning a match,
and recovering from a break of serve down to win a set. The resulting "tennis formulas" are used
to evaluate the implications of possible rule changes, to demonstrate how broadcasts of tennis
matches could be made more interesting and informative, and to potentially improve a player's
chance of winning a match.
KEYWORDS: probability, sport, statistics, tennis, tennis formula, winning
Author Notes: Appreciation is extended to Mike Beaser of the Sloan School at the Massachusetts
Institute of Technology for helpful comments on an early draft of the manuscript.
1. Introduction
Tennis lends itself to the construction of probability formulas because the score
has a fixed number of hierarchically structured states; points are nested within
service games, which are nested within sets, which are nested within a match (see
Section 2). Thus, a natural approach to modeling a tennis match is to first develop
methods for predicting the outcome of a game given probabilities for individual
points, then predict the outcome of a set given probabilities for the outcome of
each game, and finally the outcome of the match given probabilities for the
outcome of each set. We derive expressions for these events and also evaluate
formulas for the probabilities of other events such as a player making a comeback
to win the match from various deficits.
The assumption at the heart of this paper is that the event that a server wins a
point is independent and identically distributed (iid) over the entire match. This
assumption requires that the result of any one point has no bearing on the result of
subsequent points. It also requires that the probability a player wins a point on
serve is fixed throughout the match; the probability can differ between the players
but must be constant throughout the match for any given player. Although iid is a
severe assumption, past work suggests that it is adequate for analyses that
aggregate over many matches (Newton and Keller, 2005).
Historically, work on the derivation of tennis formulas has been sporadic. Of
particular note, Newton and Keller (2005) unified several earlier works on tennis
probabilities using hierarchical recurrence relations to derive the probability of
winning a game, set, and match (also see references therein). Liu (2001) used
finite Markov chains and results for random walks to derive equivalent formulas.
Notable earlier works include Riddle (1988) on probability models for various
alternative tiebreaker scoring systems, and Morris (1977) on determining which
point is the most important in a tennis match.
The potential usefulness of tennis formulas is evident in the various
applications to which they might be applied. These include: making coverage of
tennis matches more informative and interesting to viewers, evaluating the effect
of potential rule changes, optimizing a player's training methods, and studying the
performance of players at key stages of a match. The potential users of tennis
formulas include broadcasters, administrators, coaches, players, and of course
bookies or gamblers; online betting agencies such as Betfair allow bets to be
placed on a professional tennis match before and during the contest (Betfair,
2008). Thus, there is a growing number of potential users and opportunity for
financial gain from successful application of tennis formulas.
The original work presented in this paper includes studying and interpreting
mathematical properties of tennis formulas, developing new tennis formulas such
as the probabilities of a player recovering from various deficits to win a set, and
discussing applications of the formulas. The intention of this paper is to make
readers aware of how probability and statistics can be used to model events in a
tennis match and of the potential applications of such models. Although key
results will be of interest to researchers interested in probability and statistics, the
focus of the paper is on illustration and interpretation rather than technical detail.
Likewise, the analysis of data from actual tennis matches illustrates the methods
and motivates further empirical analysis.
In the next section, the scoring of a tennis match is described. The derivations
of the tennis formulas and their properties are presented in Sections 3-4. Data
from actual tennis matches are analyzed in Section 5 and applications of the tennis
formulas are discussed in Section 6. The paper concludes in Section 7.
2. Scoring a Tennis Match
A tennis match consists of a number of milestones and repeated events. At the
professional level, the most common format is the best of three-set contest, where
the first player to win two sets is the winner. Likewise, in a best of five-set match,
the first player to win three sets is the winner. Each set is a competition to win six
games, except that if the score reaches five-games all, the set continues until a
player wins seven games. If the score reaches six-games all, a tiebreaker is played
to decide the set.
The server alternates between games in a set: one player serves the first
game, the other player the second game, and so on. In a regular game, one player
serves for the entirety. However, in a tiebreaker one player serves the first point,
the other player serves the next two points, the serve changes again for the next
two points, and so on until a player has won at least seven points and has a lead of
at least two points.
Prior to the advent of the tiebreaker in the 1970s, the advantage system was
used where instead of playing a tiebreaker a player had to lead by two games in
order to win the set (some very lengthy sets resulted). Today, only the final set in
singles matches at three of the four Grand Slam tournaments (the Australian
Open, the French Open, and Wimbledon) and in matches in Davis Cup ties are
contested under the advantage system. For simplicity, in this paper we assume the
scoring system used at the US Open where each set ends with a tiebreaker
if the score is tied at six-games all.
Whereas the scoring of games within a set, sets within a match, and
tiebreakers is purely numeric, the scoring within a service-game features non-
numerical terms that originated in France as far back as the 13th century (Riddle,
1988). A player who has won one, two, and three points has score 15, 30, and 40
respectively (the numbers are thought to be based on the face value of medieval
French coins (Liu, 2001)). A player who has not won any points is at "love"; e.g.,
2 points to 0 is read as "thirty-love". If both players have won three points the
score is "deuce" and from this point the game continues indefinitely until one player
has a two-point lead; a player who is one point ahead in the game after the game
has reached deuce has the "advantage". If the serving player is one point away
from winning a game, the point is referred to as a "game-point"; if the non-serving
player is one point away from winning a game, the point is referred to as a
"break-point".
3. Probability of Winning a Game
Because a tennis match consists of repeated contests, tennis is highly amenable to
stochastic modeling. We first derive the probability that a player wins a game
when the binary event that they win a point is a Bernoulli random variable with
probability p , the outcome of each point has no bearing on the outcome of any
other point, and p is constant throughout the game. There are four distinct scores
from which the game can be won: winning without losing any points, losing a
single point, losing two points, and winning the game after the score reaches
deuce (three or more points are lost). In the fourth scenario, the number of pairs of
points played until the server wins the game has a geometric distribution and the
associated probabilities follow a geometric sequence. The probability of winning
a game is the sum of the probabilities of winning the game while losing a
given number of points:
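$$
\begin{aligned}
G(p) &= \mathrm{pr}(\text{Win Game}) \\
&= \sum_{i=0}^{\infty} \mathrm{pr}(\text{Win game while losing } i \text{ points}) \\
&= p^4 + 4p^4(1-p) + 10p^4(1-p)^2 + 20p^3(1-p)^3 \cdot p^2 \sum_{i=3}^{\infty} \{2p(1-p)\}^{i-3} \\
&= p^4 + 4p^4(1-p) + 10p^4(1-p)^2 + \frac{20p^5(1-p)^3}{1-2p(1-p)} \\
&= p^4\left(15 - 4p - \frac{10p^2}{1-2p(1-p)}\right)
\end{aligned}
\tag{1}
$$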
The expression for $G(p)$ is notable for its dependence on the sum of the
geometric sequence beginning at deuce. The probability being summed,
$2p(1-p)$, is the probability that the score transitions from deuce back to deuce
after two further points are played. The final line of (1) reduces the number of
operations required for evaluation and so is convenient for quick calculations.
Equivalent expressions for this formula appear in Equation (5) of Newton and
Keller (2005) and in Section 2.5 of Liu (2001).
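As a numerical check on Equation (1), the following Python sketch evaluates the closed form in its final line and compares it with a direct term-by-term sum. The function names are illustrative and do not come from the paper, and the cutoff of 200 deuce cycles is an arbitrary assumption made so that the infinite series can be summed numerically.

```python
def game_win_prob(p: float) -> float:
    """Closed form from the final line of Equation (1)."""
    d = 1.0 - 2.0 * p * (1.0 - p)  # 1 - P(deuce returns to deuce)
    return p**4 * (15.0 - 4.0 * p - 10.0 * p**2 / d)


def game_win_prob_series(p: float, max_cycles: int = 200) -> float:
    """Term-by-term sum: win to love, to 15, to 30, then via deuce."""
    q = 1.0 - p
    direct = p**4 + 4 * p**4 * q + 10 * p**4 * q**2
    reach_deuce = 20 * p**3 * q**3
    # From deuce: k returns to deuce, then two points won in a row.
    from_deuce = sum((2 * p * q) ** k * p**2 for k in range(max_cycles))
    return direct + reach_deuce * from_deuce


for p in (0.5, 0.6, 0.7):
    print(p, round(game_win_prob(p), 4), round(game_win_prob_series(p), 4))
```

The two functions should agree to within floating-point error for any p in [0, 1], which is a quick way to catch a dropped sign when transcribing the formula.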

3.1. Properties of G ( p )
Although G ( p ) has been derived previously, its properties have not been
thoroughly examined. To explore its properties we plot G ( p ) , its derivative
function, and its integral function against p (Figure 1). After differentiating (1)
and simplifying the resulting expression, the derivative function is obtained as
$$
G'(p) = 20p^3\left\{3 - p + \frac{5p^3 - 3p^2 - 4p^4}{\{1-2p(1-p)\}^2}\right\} \tag{2}
$$
and after integrating (1) and simplifying the resulting expression, the integral
function is obtained as
$$
\mathcal{G}(p) = \int_0^p G(x)\,dx = -\frac{2}{3}p^6 + 2p^5 - \frac{5}{4}p^4 - \frac{5}{6}p^3 + \frac{5}{4}p + \frac{5}{8}\log\{1-2p(1-p)\}. \tag{3}
$$
The plots of $G(p)$, $G'(p)$, and $\mathcal{G}(p)$ against $p$ give the probability of a player
winning a game as a function of the probability of winning a point, the rate of
change in $G(p)$ as $p$ increases, and the area under $G$ over the interval
$[0, p]$ (which, divided by $p$, is the average of $G$ over that interval).
It is clear from Figure 1 that $G(p)$ is a monotone increasing function that is
point-symmetric about the point of inflection at $p = 0.5$, i.e., $G(1-p) = 1 - G(p)$. The maximal value of $G'(p)$
occurs at 0.5, implying that the greatest change in the probability of winning the
game occurs at p = 0.5 ; e.g., the benefit of a 0.01 increase in the probability of
winning a point is greater if the baseline probability is 0.5 than if it is 0.7. A
player with 0, 1, and 0.5 chance of winning a point has G (0) = 0 , G (1) = 1 , and
G (0.5) = 0.5 probability of winning the game; 0, 1, and 0.5 are the only self-
evaluating values of p . The fact that a game offers more insurance against
random bad luck than a single point is illustrated by the inequalities G ( p ) < p for
0 < p < 0.5 and G ( p ) > p for 0.5 < p < 1 . For example, G (0.6) = 0.7357 and
G (0.7) = 0.9008 indicating that players with point winning probabilities of 0.6
and 0.7 have 73.57% and 90.08% chances of winning a game respectively.
The quantity $\mathcal{G}(1) = \int_0^1 G(x)\,dx$ is the probability a player wins a game when their
point-winning probability is chosen at random, i.e., when $p$ is uniformly distributed on
$[0, 1]$. The fact that $\mathcal{G}(1) = 0.5$ confirms that tennis is a fair game. Under the iid
assumption, $\mathcal{G}(0.5) = 0.0616$, implying that if a player's probability of winning a
point is uniformly distributed on $[0.5, 1]$ they are expected to win
$\{\mathcal{G}(1) - \mathcal{G}(0.5)\}/0.5 = (0.5 - 0.0616)/0.5 = 87.68\%$ of the
games they play.
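The derivative and integral quoted above can be checked numerically. The sketch below, with illustrative function names that are not the paper's, codes Equations (1) to (3) directly and compares them with a central finite difference and a midpoint-rule integral; the step size and the number of integration panels are arbitrary assumptions.

```python
import math


def G(p: float) -> float:
    """Game-winning probability, final line of Equation (1)."""
    d = 1.0 - 2.0 * p * (1.0 - p)
    return p**4 * (15.0 - 4.0 * p - 10.0 * p**2 / d)


def G_prime(p: float) -> float:
    """Derivative of G, Equation (2)."""
    d = 1.0 - 2.0 * p * (1.0 - p)
    return 20.0 * p**3 * (3.0 - p + (5.0 * p**3 - 3.0 * p**2 - 4.0 * p**4) / d**2)


def G_integral(p: float) -> float:
    """Integral of G over [0, p], Equation (3)."""
    d = 1.0 - 2.0 * p * (1.0 - p)
    return (-2.0 / 3.0 * p**6 + 2.0 * p**5 - 1.25 * p**4
            - 5.0 / 6.0 * p**3 + 1.25 * p + 0.625 * math.log(d))


def midpoint_integral(f, a: float, b: float, n: int = 20000) -> float:
    """Plain midpoint rule, used only as an independent check."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))


h = 1e-6
for p in (0.3, 0.5, 0.7):
    finite_diff = (G(p + h) - G(p - h)) / (2 * h)
    print(p, round(G_prime(p), 4), round(finite_diff, 4),
          round(G_integral(p), 4), round(midpoint_integral(G, 0.0, p), 4))

# Expected share of games won when p is uniform on [0.5, 1].
print((G_integral(1.0) - G_integral(0.5)) / 0.5)
```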
Figure 1: Probability of Winning a Game
4. Other Tennis Formulae
We now consider the probabilities of winning a tiebreaker, a set, a match, and the
probability of recovering from a deficit to win a set. These differ from the
probability of winning a game in that they involve the probability of winning a
point when receiving serve. We use q to denote the probability of winning a
point when receiving serve (recall that p denotes the probability of winning a
point when serving) and denote the probability of winning a tiebreaker, a set, a
three-set match, and a five-set match by $TB(p, q)$, $S(p, q)$, $M_3(p, q)$ and
$M_5(p, q)$ respectively. Due to their length, detailed formulas for $TB(p, q)$ and
S ( p, q ) appear in the Appendix.
4.1. Probability of Winning a Tiebreaker
The fact that the server alternates after every odd-numbered point is the major difference between
a tiebreaker and a regular (service) game. However, a tiebreaker emulates a game
in that play continues past the target number of points (seven for a tiebreaker
versus four for a regular game) if the score does not differ by at least two points
when the target is reached. The algebraic expression for TB( p, q ) in Equation
(A1) has more terms than Equation (1) due to the increased number of distinct
scores that are possible in a tiebreaker compared to a regular game. However, the
procedure for deriving the formula is similar. One approach is to sum the
probabilities for the events that a player wins the tiebreaker with the loss of 0, 1,
2, 3, 4, 5, and 6 or more points. The probability of winning if the score reaches
six-points all equals the sum of the geometric sequence whose argument,
$p(1-q) + (1-p)q$, is the probability the score transitions from $n$ points-all to
$n+1$ points-all, $n \ge 6$; hence, the appearance of $p(1-q) + (1-p)q$ in the
denominator of (A1). An alternative way of deriving TB ( p, q ) and other tennis
formulas is to use recurrence relations or transition matrices.
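The recurrence-relation route can be sketched in a few lines of Python. This is not the closed form (A1) from the Appendix: it is a memoized recursion over tiebreaker scores, with the geometric-series result used once the score reaches six-points all. The function name is illustrative, and the sketch assumes the player of interest serves the first point.

```python
from functools import lru_cache


def tiebreak_win_prob(p: float, q: float) -> float:
    """P(win tiebreaker); p = P(win point on serve), q = P(win point receiving)."""
    both = p * q                   # player wins the serve point and the return point
    neither = (1 - p) * (1 - q)    # player loses both points
    if both + neither == 0:
        raise ValueError("tiebreaker never ends for these p and q")
    # From n points-all (n >= 6): W = pq + {p(1-q) + (1-p)q} W.
    level = both / (both + neither)

    @lru_cache(maxsize=None)
    def win(a: int, b: int) -> float:
        if a == 6 and b == 6:
            return level
        if a == 7:
            return 1.0
        if b == 7:
            return 0.0
        n = a + b  # points already played
        # Serving order: player, then opponent twice, then player twice, ...
        player_serves = ((n + 1) // 2) % 2 == 0
        w = p if player_serves else q
        return w * win(a + 1, b) + (1 - w) * win(a, b + 1)

    return win(0, 0)


for p in (0.4, 0.5, 0.6, 0.7):
    print(p, round(tiebreak_win_prob(p, 0.5), 4), round(tiebreak_win_prob(p, p), 4))
```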
Figure 2 displays plots of $TB(p, 0.5)$, $TB(p, p)$, and $TB(p, 1-p+0.02)$
versus $p$. The top plot is the probability of winning the tiebreaker as a function of
the probability of winning a point as the server when the probability of winning a
point as the receiver is fixed at 0.5. As expected $TB(p, 0.5) < 0.5$ when $p < 0.5$,
$TB(p, 0.5) > 0.5$ when $p > 0.5$, and $TB(p, 0.5) = p$ when $p \in \{0, 0.5, 1\}$.
The second plot depicts the case when a player has the same probability of
winning a point returning serve as they do when serving; in this case the
tiebreaker is a sequence of iid events. Because a tiebreaker is a longer contest than
a regular game, TB ( p, p ) < G ( p ) when p < 0.5 and TB ( p, p ) > G ( p ) when
p > 0.5 .
The bottom plot depicts the case where a player has a 0.02 higher point-
winning probability than their opponent. The plot has a bathtub appearance
because the relative advantage of the better player is greatest when p = 0.02 or
p = 1 (it is impossible for the better player to lose in these cases) and least when
p = 0.51 .
4.2. Probabilities of Winning a Set and a Match
The probability of winning a set is derived from the probability of winning a
game, $G(p)$, the probability of winning a return game, $G(q)$, and the probability
of winning a tiebreaker, $TB(p, q)$. The expression displayed in Equation (A2) is
an explicit function of these probabilities, which in turn depend on a player's
point-winning probabilities, $p$ and $q$. Because the tiebreaker concludes the set
whenever the score reaches six-all, the probability of winning a set does not
involve a geometric sequence of the game winning probabilities.
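One way to see how the set probability is assembled from $G(p)$, $G(q)$, and $TB(p, q)$ is a recursion over game scores. The sketch below is not Equation (A2). It assumes the player serves the first game of the set, and it takes the tiebreaker probability as a plain number `tb` supplied by the caller (for example from a tiebreaker formula or recursion). All names are illustrative.

```python
from functools import lru_cache


def game_win_prob(p: float) -> float:
    """Final line of Equation (1)."""
    d = 1.0 - 2.0 * p * (1.0 - p)
    return p**4 * (15.0 - 4.0 * p - 10.0 * p**2 / d)


def set_win_prob(p: float, q: float, tb: float) -> float:
    """P(win set) for a tiebreaker set, given tb = P(win tiebreaker)."""
    hold = game_win_prob(p)   # player wins own service game
    brk = game_win_prob(q)    # player wins opponent's service game

    @lru_cache(maxsize=None)
    def win(a: int, b: int) -> float:
        if a == 6 and b == 6:
            return tb
        if a >= 6 and a - b >= 2:
            return 1.0
        if b >= 6 and b - a >= 2:
            return 0.0
        # Assumption: the player serves the odd-numbered games (1st, 3rd, ...).
        g = hold if (a + b) % 2 == 0 else brk
        return g * win(a + 1, b) + (1.0 - g) * win(a, b + 1)

    return win(0, 0)


print(round(set_win_prob(0.5, 0.5, 0.5), 4))
print(round(set_win_prob(0.65, 0.40, 0.55), 4))
```

No geometric series appears because the recursion terminates at 7-5 or at six-games all, in line with the remark above about the tiebreaker.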
Figure 2: Probability of Winning a Tiebreaker
The probability of winning a match is in turn determined from the probability
of winning a set, $S(p, q)$. The probabilities of winning a best of three-set match
and a best of five-set match are given by:
$$M_3(p, q) = S(p, q)^2\,[1 + 2\{1 - S(p, q)\}]$$
and
$$M_5(p, q) = S(p, q)^3\,[1 + 3\{1 - S(p, q)\} + 6\{1 - S(p, q)\}^2]$$
respectively. We plot the probabilities $M_3(p, 0.5)$, $M_3(p, p)$, and
$M_3(p, 1 - p + 0.02)$ against $p$ (Figure 3).
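The two match formulas depend on $p$ and $q$ only through the set-winning probability, so they can be illustrated with a function of a single number. In the sketch below, `s` stands for $S(p, q)$ and is supplied directly; the function names are illustrative.

```python
def match_win_prob_best_of_3(s: float) -> float:
    """M3: win two sets before losing two; s = P(win a set)."""
    return s**2 * (1 + 2 * (1 - s))


def match_win_prob_best_of_5(s: float) -> float:
    """M5: win three sets before losing three."""
    return s**3 * (1 + 3 * (1 - s) + 6 * (1 - s) ** 2)


for s in (0.4, 0.5, 0.6, 0.7):
    m3 = match_win_prob_best_of_3(s)
    m5 = match_win_prob_best_of_5(s)
    print(s, round(m3, 4), round(m5, 4), round(m5 - m3, 4))
```

The last column is the gain (or loss) from playing best of five rather than best of three at a given set-winning probability.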
The first two plots in Figure 3 are similar to their Figure 2 counterparts; the
only difference being the steepness of the s-curve. A favorable point-winning
probability translates to a high probability of winning the match; a player who
wins only 45% of serving and receiving points has almost 0 chance of winning the
match whereas a player winning 55% is almost certain to win. This illustrates how
a contest that involves winning a large number of mini-contests favors the better
player. The corresponding plots for M 5 ( p, q) are even more dramatic.

Figure 3: Probability of Winning a Best of Three-Set Tennis Match
The third plot shows that a player with a 0.02 point-winning advantage has at
least a 60% chance of winning a best of three-set match. An interesting feature of
this plot is the presence of local minima near 0.15 and 0.85 and a local maximum
at 0.51. The local optimum at 0.51 occurs because the player has equal chance of
winning both service and receiving points (and thus games). As the difference in
the probabilities increases, the likelihood that an event such as an isolated poorly
played service game decides the match increases, which causes the probability of
winning the match to drop. The trend continues until the probabilities get so close
to 0 or 1 that the relative difference in the point winning probabilities skyrockets
along with the probability of winning the match.
4.3. Recovering from a Deficit to Win a Set
An important property of a tennis match is that it is a contest that is not over until
the final point has been won. In fact, the only truly important point in a tennis
match is the final point since knowing which player won that point determines the
winner.
The probability a player wins a set conditional on the current score can be
evaluated using the same approach as for S ( p, q ) . To illustrate, we consider the
probability that a player recovers from a break-down (i.e., the player has lost their
serve one time more than their opponent) in a set to win. Table 1 displays the
probabilities of a player winning the set when down a break of serve.
Table 1: Expressions for Probabilities of Winning a Set from Various Positions

Initial
Score     Probability of Winning Set
5-5       $S(p,q \mid 5\text{-}5) = G(p)G(q) + \{G(p)(1-G(q)) + (1-G(p))G(q)\}\,TB(p,q)$
4-5^r *   $S(p,q \mid 4\text{-}5) = G(q)\,S(p,q \mid 5\text{-}5)$
3-5 *     $S(p,q \mid 3\text{-}5) = G(p)G(q)\,S(p,q \mid 5\text{-}5)$
2-5^s *   $S(p,q \mid 2\text{-}5) = G(p)\,S(p,q \mid 3\text{-}5)$
4-4       $S(p,q \mid 4\text{-}4) = G(p)G(q) + \{G(p)(1-G(q)) + (1-G(p))G(q)\}\,S(p,q \mid 5\text{-}5)$
3-4^r *   $S(p,q \mid 3\text{-}4) = G(q)\,S(p,q \mid 4\text{-}4) + (1-G(q))\,S(p,q \mid 3\text{-}5)$
2-4 *     $S(p,q \mid 2\text{-}4) = G(p)G(q)\,S(p,q \mid 4\text{-}4) + \{G(p)(1-G(q)) + (1-G(p))G(q)\}\,S(p,q \mid 3\text{-}5)$
1-4^s *   $S(p,q \mid 1\text{-}4) = G(p)\,S(p,q \mid 2\text{-}4) + (1-G(p))G(q)\,S(p,q \mid 2\text{-}5)$
5-3       $S(p,q \mid 5\text{-}3) = G(p) + G(q) - G(p)G(q) + (1-G(p))(1-G(q))\,S(p,q \mid 5\text{-}5)$
3-3       $S(p,q \mid 3\text{-}3) = G(p)G(q)\,S(p,q \mid 5\text{-}3) + \{G(p)(1-G(q)) + (1-G(p))G(q)\}\,S(p,q \mid 4\text{-}4) + (1-G(p))(1-G(q))\,S(p,q \mid 3\text{-}5)$
2-3^r *   $S(p,q \mid 2\text{-}3) = G(q)\,S(p,q \mid 3\text{-}3) + (1-G(q))\,S(p,q \mid 2\text{-}4)$
1-3 *     $S(p,q \mid 1\text{-}3) = G(p)G(q)\,S(p,q \mid 3\text{-}3) + \{G(p)(1-G(q)) + (1-G(p))G(q)\}\,S(p,q \mid 2\text{-}4) + (1-G(p))(1-G(q))G(p)G(q)\,S(p,q \mid 3\text{-}5)$
0-3^s *   $S(p,q \mid 0\text{-}3) = G(p)\,S(p,q \mid 1\text{-}3)$

Note: Superscript r indicates player receives serve in the next game while
superscript s indicates player serves in the next game. An asterisk
indicates that the player is a break down. $S(p,q \mid a\text{-}b)$ denotes the probability of
winning the set given that the current score is $a$-$b$.

The fact that several of the probabilities in Table 1 can be expressed as
simple functions of other probabilities illustrates how recursive relations link
tennis formulas of winning from different situations. In Section 6.1, we evaluate
these expressions at hypothetical values of p and q to illustrate how tennis
formulas could be used to make broadcasts of tennis matches more informative
and interesting.
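The recursive structure of Table 1 translates directly into code. The following sketch takes the three building blocks as plain numbers: `gp` for $G(p)$, `gq` for $G(q)$, and `tb` for $TB(p, q)$. It then fills in the table row by row. The function name and the dictionary keys are illustrative choices, not notation from the paper.

```python
def set_win_from_score(gp: float, gq: float, tb: float) -> dict:
    """Evaluate the Table 1 expressions; keys are 'games-games' plus r/s."""
    both = gp * gq                            # hold and break
    split = gp * (1 - gq) + (1 - gp) * gq     # win exactly one of two games
    neither = (1 - gp) * (1 - gq)             # lose both games
    s = {}
    s["5-5"] = both + split * tb
    s["4-5r"] = gq * s["5-5"]
    s["3-5"] = both * s["5-5"]
    s["2-5s"] = gp * s["3-5"]
    s["4-4"] = both + split * s["5-5"]
    s["3-4r"] = gq * s["4-4"] + (1 - gq) * s["3-5"]
    s["2-4"] = both * s["4-4"] + split * s["3-5"]
    s["1-4s"] = gp * s["2-4"] + (1 - gp) * gq * s["2-5s"]
    s["5-3"] = gp + gq - both + neither * s["5-5"]
    s["3-3"] = both * s["5-3"] + split * s["4-4"] + neither * s["3-5"]
    s["2-3r"] = gq * s["3-3"] + (1 - gq) * s["2-4"]
    s["1-3"] = both * s["3-3"] + split * s["2-4"] + neither * both * s["3-5"]
    s["0-3s"] = gp * s["1-3"]
    return s


# Coin-toss case: every game and the tiebreaker are 50/50.
for score, prob in set_win_from_score(0.5, 0.5, 0.5).items():
    print(score, round(prob, 3))
```

With all three inputs equal to 0.5, the break-down rows can be compared against the (0.50, 0.50) column of Table 3.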
5. Analysis of Empirical Data: Testing Validity of Underlying Assumptions
Previous work has indicated that although the iid assumption may not hold
exactly, analyses that aggregate data over multiple matches may still be fairly
accurate (Newton and Keller, 2005). To illustrate how the formulas may be
applied to data from tennis matches, we test this assumption on new data. Data
were obtained from the final 14 Men's singles matches played at the 2007
Wimbledon tennis tournament. This included 7 fourth round (or last sixteen)
matches (the eighth match was defaulted so no points were played), 4 quarter-finals,
2 semi-finals, and the final. Summary data from these matches are displayed
in Table 2.
Table 2: Data from Men's Singles Championships, Wimbledon 2007

                    Points on Serve    Games on Serve    Break Points Against Serve
Match Player        Won    Total       Won    Total      Saved    Total
1 Federer 107 156 20 24 5 8
1 Nadal 109 167 21 24 7 11
2 Federer 67 90 16 16 1 4
2 Gasquet 57 89 12 15 3 3
3 Nadal 38 55 9 10 6 10
3 Djokovic 44 77 7 11 1 2
4 Federer 80 113 17 19 3 8
4 Ferrero 70 120 13 18 5 7
5 Gasquet 131 179 26 29 4 6
5 Roddick 119 170 27 29 7 10
6 Djokovic 125 188 24 29 12 17
6 Baghdatis 140 225 24 29 3 8
7 Nadal 67 93 14 15 5 9
7 Berdych 53 91 11 15 4 5
8 Ferrero 72 99 16 17 1 4
8 Tipsarevic 60 87 13 16 3 4
9 Roddick 66 90 15 16 4 8
9 Mathieu 59 90 12 16 0 1
10 Gasquet 56 75 13 15 1 6
10 Tsonga 51 92 9 14 2 4
11 Baghdatis 62 100 12 16 4 10
11 Davydenko 65 113 11 17 4 8
12 Djokovic 114 174 19 23 5 8
12 Hewitt 100 157 20 23 8 12
13 Berdych 73 101 17 17 11 18
13 Bjorkman 68 119 10 17 3 3
14 Nadal 83 115 19 21 4 10
14 Youzhny 83 139 15 21 2 4
Total 2219 3364 442 532 118 208
The predicted proportions of games won by a player in a match were
evaluated by substituting the proportion of service points they won into the
expression in (1). The resulting data are plotted in Figure 4.
The scatterplot in Figure 4 indicates that the observed and predicted
proportions of games won by the server are highly correlated. Furthermore, the
overall proportion of games won by the server was 83.09% while the predicted
proportion was 82.36%, suggesting that in aggregate G ( p ) yields accurate
predictions. Furthermore, there was no discernible pattern in the discrepancies
between the predicted and actual proportions of games won.
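The comparison of observed and predicted proportions can be sketched from the Table 2 data as follows. The per-player prediction substitutes each player's proportion of service points won into Equation (1), as described above. How the per-player predictions are combined into one overall figure is an assumption here (a weighting by service games played), since the text does not spell it out, so the printed aggregate need not match the reported value exactly.

```python
def game_win_prob(p: float) -> float:
    """Final line of Equation (1)."""
    d = 1.0 - 2.0 * p * (1.0 - p)
    return p**4 * (15.0 - 4.0 * p - 10.0 * p**2 / d)


# Transcribed from Table 2, one match per line, two players per match:
# (service points won, service points played, service games won, service games played)
ROWS = [
    (107, 156, 20, 24), (109, 167, 21, 24),   # match 1
    (67, 90, 16, 16), (57, 89, 12, 15),       # match 2
    (38, 55, 9, 10), (44, 77, 7, 11),         # match 3
    (80, 113, 17, 19), (70, 120, 13, 18),     # match 4
    (131, 179, 26, 29), (119, 170, 27, 29),   # match 5
    (125, 188, 24, 29), (140, 225, 24, 29),   # match 6
    (67, 93, 14, 15), (53, 91, 11, 15),       # match 7
    (72, 99, 16, 17), (60, 87, 13, 16),       # match 8
    (66, 90, 15, 16), (59, 90, 12, 16),       # match 9
    (56, 75, 13, 15), (51, 92, 9, 14),        # match 10
    (62, 100, 12, 16), (65, 113, 11, 17),     # match 11
    (114, 174, 19, 23), (100, 157, 20, 23),   # match 12
    (73, 101, 17, 17), (68, 119, 10, 17),     # match 13
    (83, 115, 19, 21), (83, 139, 15, 21),     # match 14
]

games_won = sum(r[2] for r in ROWS)
games_played = sum(r[3] for r in ROWS)
observed = games_won / games_played

# Assumption: weight each player's predicted proportion by games served.
predicted = sum(game_win_prob(r[0] / r[1]) * r[3] for r in ROWS) / games_played

print("observed proportion of service games won:", round(observed, 4))
print("games-weighted predicted proportion:", round(predicted, 4))
```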
Quantifying the closeness of predicted and observed proportions of games
won is not a rigorous way of determining whether the iid assumption is valid for a
tennis match. To study this more closely, we evaluate the players' performance on
break-points. Under iid, the proportion of break-points saved should equal the
proportion of points won by the server at other stages of the game. The aggregate
results in Table 2 cast doubt on this assumption: a total of 118 out of 208 break-points
(57%) were saved by the server whereas 2,101 out of 3,156 non-break
points (67%) were won by the server. We evaluated the significance of the
difference in these proportions by fitting a logistic regression model using the
generalized estimating equations method to account for repeated observations
made on the players. The associated p-value of 0.007 implies that the difference is
significant.
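The two proportions being compared follow from the totals in Table 2, as the sketch below shows. The significance calculation in it is a naive pooled two-proportion z-test and is only a rough stand-in: it ignores the repeated observations on players that the generalized estimating equations analysis accounts for, so its p-value is not expected to reproduce the reported 0.007.

```python
import math

# Totals from the last row of Table 2.
bp_saved, bp_total = 118, 208
points_won, points_total = 2219, 3364

other_won = points_won - bp_saved          # service points won, excluding break points
other_total = points_total - bp_total      # service points played, excluding break points

rate_bp = bp_saved / bp_total
rate_other = other_won / other_total

# Naive pooled z-test (assumes independent points; NOT the GEE model).
pooled = points_won / points_total
se = math.sqrt(pooled * (1 - pooled) * (1 / bp_total + 1 / other_total))
z = (rate_bp - rate_other) / se
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print("break points saved:", round(rate_bp, 3))
print("other service points won:", round(rate_other, 3))
print("z:", round(z, 2), "two-sided p:", round(p_two_sided, 4))
```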

Figure 4: Scatterplot of Predicted Versus Observed Proportions of Games Won by Server
The above result suggests that a server's performance suffers when under the
pressure of having to win a point in order to avoid losing their service game. To
accommodate such an effect, the formulas could be extended to allow different
probabilities on break points. Game, set, and match points might be other
occasions when the probability of winning a point deviates from baseline.
Another modification would be to allow the probability of winning a point to
depend on the side of the court from which the serve is delivered.
The independence assumption may also be violated. For instance, the
outcomes of successive points could be serially correlated, especially during
crucial stages of a match. If the complete sequence of points within a sample of
tennis matches was available, the independence assumption could be tested
against alternative models (see Jackson and Mosurski, 1997, for an examination
of the assumption that successive tennis points are independent). One family of
alternate models is the class of Markovian models. These posit that the outcome
of a point depends only on the current state of the match. A special case of the
class of Markovian models is the AR(1) model.
Statistical tests that confirmed the presence of heterogeneous probabilities
over the course of a match or correlation between points would be of interest to
researchers, coaches, and players interested in the effects of stress on player
performance.
6. Applications of Tennis Formulas
In this Section, applications of the tennis formulas that have the potential to
benefit commentators, administrators or rule makers, and coaches or players are
discussed.
6.1. Match Predictions
Tennis commentators typically make a lot of predictions about the future course
of a match when significant events happen or are about to happen. Statements such as
"if the player breaks serve now the match is effectively over" or "this break-point
is essentially a match-point" are often made. Such statements give the sense that it is
impossible for a player to come back and win the match. However, as mentioned
earlier, one of the gripping features of tennis is that the match is never over until
the final point is played and so a player always has the chance to come back.
To illustrate the point, we evaluated the formulas for the probability that a
player recovers from a deficit of one service break to win a set. Probabilities were
evaluated under four different specifications of p and q for each of the nine
situations in which a player can be behind a service break (Table 3).
The first two columns of Table 3 correspond to the case where a player has a
better/worse point-winning differential than their opponent, while the final two
columns assume that players are evenly matched (i.e., have equal point-winning
probabilities). The fourth column predicts what would happen if the players
stopped playing tennis and instead tossed a coin to decide the outcome of each
subsequent point (i.e., as if the probability of winning a point is 0.5 for all
subsequent points).
Comparing the first two columns, a player with a point-winning advantage is
more likely to make a successful comeback than a player with a point-winning
disadvantage. Comparing the third and fourth columns, it is better to have a high
service winning probability when a player must win more games as the server
than the receiver in order to complete a comeback (e.g. when the score is 2-5 with
the player to serve next), whereas it is better to have point-winning probabilities
close to 0.5 if an equal number of serving and receiving games must be won (e.g.,
from 3-5).

Table 3: Probability of Recovering from a Breakdown to Win a Set

Initial            Point Winning Probabilities, (p, q)
Score      (0.67, 0.38)   (0.62, 0.33)   (0.645, 0.355)   (0.50, 0.50)
4-5^r         0.135          0.056           0.089           0.250
3-5           0.116          0.043           0.073           0.125
2-5^s         0.100          0.034           0.060           0.063
3-4^r         0.227          0.091           0.150           0.313
2-4           0.199          0.072           0.125           0.188
1-4^s         0.175          0.057           0.105           0.109
2-3^r         0.287          0.112           0.188           0.344
1-3           0.255          0.090           0.160           0.227
0-3^s         0.220          0.070           0.131           0.113

Note: r indicates player receives serve in the next game, s indicates player serves
in the next game.
A commentator could use probabilities such as those in Table 3 to give
viewers a sense of how likely it is that a player will win a match given the current
score, or the score if they won/lost the next point. Predictions could be made
under various scenarios: e.g., if the players' point-winning probabilities are the
same as in the match to date, if the players' point-winning probabilities equal the
proportions observed in previous encounters between the players, or if play stops
and a coin is tossed to determine the outcome of points. This last prediction might
be a useful mechanism for conveying the likelihood of a comeback to viewers not
used to thinking about probabilities. If commentators choose to make different
predictions they would be compelled to explain why their prediction deviates
from the model, enriching the commentary.
Networks covering tennis might consider including the probabilities for each
player winning the match (under various assumptions) alongside the score of the
match. This would enhance the experience of watching tennis on television,
prevent commentators from making outlandish predictions, and increase
awareness and interest in formulas and statistics about tennis.
6.2. Evaluating Changes in Scoring Tennis
Recently, people involved in the marketing of tennis on television have expressed
a desire to reduce the variability of the length of a match. Proposals have been put
forward to replace best of five-set matches with best of three-set matches and to
replace deuce-advantage scoring with a sudden death point if the score reaches
deuce. Tennis formulas may be used to evaluate the implications of these
proposed changes.
We first evaluate the difference between a player's chance of winning a best
of five-set match and a best of three-set match. Figure 5 shows a plot of the function
$M_5(p, p) - M_3(p, p)$ against $p$.
Figure 5: Difference in Probabilities of Winning Best of Five- and Best of Three-Set Match
Figure 5 reveals that a player expected to win around 51-53% of serving and
receiving points has a 0.04-0.05 increase in the probability of winning the match
when they play best of five-sets versus best of three-sets. If the point-winning
probability is only 47-49% a disadvantage of the same magnitude is incurred.
When the point winning probability is outside of [0.4, 0.6], the probability of
winning the match under both formats is so close to 0 or 1 that the difference is
essentially 0.
In 1999, a proposal to replace the deuce-advantage system in a game with the
sudden death rule that the winner of the first point after deuce wins the game was
made public. The proposal drew comment from several quarters, including
players. Opinions varied. For instance, Pete Sampras and Andre Agassi were
reported as being against and for the change, respectively. Although it was
recognized that sudden death would shorten the average length of a game and thus
of a match itself, there was no analysis of how the change might affect the results
from tennis matches. For example, would more upsets be expected under the
sudden-death system?
The probability of the server winning a game under the sudden-death scoring
system is:
$$\tilde{G}(p) = p^4 + 4p^4(1-p) + 10p^4(1-p)^2 + 20p^4(1-p)^3.$$
This is simpler in appearance than $G(p)$ because the geometric series at
deuce is replaced with a single probability. The plot of $G(p) - \tilde{G}(p)$ against $p$
(Figure 6) shows that $G(p) < \tilde{G}(p)$ when $0 < p < 0.5$ and $G(p) > \tilde{G}(p)$ when
$0.5 < p < 1$, with equality at 0, 0.5, and 1. For example, a player who won 70.9%
of service points and 37.1% of receiving points would see their best of three-set
match winning probability fall from 83.31% to 82.10% (a drop of 0.0121 on the
probability scale). Similarly, a player with 65.7% and 41.8% point-winning
probabilities for serving and receiving would drop from 82.96% to 80.92%, a
drop of 0.0205 on the probability scale. This indicates that a better player is more
likely to lose a match under sudden death scoring. Although sudden death will
place a lot of importance on the point played at deuce, viewers might miss the
crescendo of excitement that builds up when a game continues well past deuce.
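The comparison between the two scoring systems can be checked numerically with a minimal sketch. The deuce-advantage formula used here is the standard closed form for a game with independent, identically distributed points and is assumed to agree with G(p) from Section 3. The function names are hypothetical.

```python
def game(p):
    """Server's probability of winning a game under deuce-advantage scoring."""
    miss = 1 - p
    # From deuce the server must win two points in a row before the receiver does;
    # summing the geometric series gives p^2 / (1 - 2p(1-p)).
    win_from_deuce = p**2 / (1 - 2 * p * miss)
    return p**4 * (1 + 4 * miss + 10 * miss**2) + 20 * p**3 * miss**3 * win_from_deuce


def game_sudden_death(p):
    """Server's probability of winning a game when one point decides deuce."""
    miss = 1 - p
    return p**4 * (1 + 4 * miss + 10 * miss**2 + 20 * miss**3)


for p in (0.4, 0.5, 0.6, 0.7):
    diff = game(p) - game_sudden_death(p)
    print(f"p = {p:.1f}  G - G_sudden_death = {diff:+.4f}")
```

The sign of the difference flips at p = 0.5: sudden death helps the player who is less likely to win the point and hurts the player who is more likely to win it.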
6.3. Training and Match Strategies
To use the tennis formulas a player must acquire results on a large sample of
points they've played. This would enable a player (or their coach) to estimate
their probability of winning a match against a group of opponents and also
specific opponents. More importantly, a player could evaluate the amount that
their odds of winning a match would improve if they were able to increase their
point-winning probabilities by certain amounts. Such analyses could be used to
help decide how to prioritize training time and resources.
Figure 6: Effect of Sudden Death Scoring System on Probability of Server Winning Game
For example, a player who currently wins 65% of service points and 37% of
receiving points has a 59.85% chance of winning a best-of-three-set match. If
focused training on either the serve or the return of serve would enable the player to
improve the corresponding point-winning probability by 0.01 then, since
$M_3(0.66, 0.37) = 0.645$ and $M_3(0.65, 0.38) = 0.647$, they would be (slightly)
better off focusing on their return game (because the baseline receiving
probability is closer to 0.5, increasing the receiving probability results in a bigger
increase in the probability of winning the match than does increasing the serving
probability). However, because serving performance is less reliant on the opponent's
performance, it might be the case that the player could make a bigger
improvement on their serve than on their return of serve. If the respective
improvements were 0.011 and 0.01, the new match-win probabilities would be
0.650 and 0.647 respectively, indicating that in this case the better strategy is to
focus on their serving game. To assist with preparations to play a certain player,
similar calculations could be made using data from previous matches against that
player.
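The comparison of training options reduces to simple arithmetic on the match-winning probabilities quoted above. The sketch below only tabulates those quoted values; it does not recompute them, and the option labels are hypothetical.

```python
# Match-winning probabilities quoted in the text (not recomputed here).
baseline = 0.5985                      # M3(0.65, 0.37)
options = {
    "serve +0.010": 0.645,             # M3(0.66, 0.37)
    "return +0.010": 0.647,            # M3(0.65, 0.38)
    "serve +0.011": 0.650,             # M3(0.661, 0.37)
}

# Gain in match-winning probability for each training option.
gains = {name: round(value - baseline, 4) for name, value in options.items()}
best = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: gain of {gain:.4f}")
print("largest gain:", best)
```

With equal improvements of 0.01 the return game gives the larger gain, but a serve improvement of 0.011 overtakes it, as the text describes.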

7. Conclusion
This paper has highlighted the wide range of opportunities for using probability
models and statistical analysis in tennis. Like baseball, the scoring system in
tennis consists of repeated contests involving a fixed number of states. This
makes tennis a prime candidate for detailed modeling and statistical analysis.
The methods and analyses in this paper can be extended in several ways. Our
analysis of the server's performance on break points versus non-break points
suggests that a more appropriate model would allow for different probabilities at
different stages of a match or even sides of the court. In subsequent work,
analyses could be undertaken to determine if there are other stages of a match
where point-winning probabilities are liable to differ, and to evaluate whether the
outcomes of successive points are correlated, especially during crucial stages of a
match.
The use of tennis formulas has the potential to yield several benefits (besides
making profits for savvy gamblers). As long as appropriate explanations are
provided, probabilistically based predictions of the outcome of a tennis match
may enhance television coverage of tennis the way baseball statistics and player
ratings have extended interest in that sport. Administrators and other concerned
parties could use tennis formulas to evaluate the implications of any proposed
scoring changes on the results of tennis matches and, in particular, on the
prevalence of upsets. Finally, tennis formulas provide players and coaches with
another tool to use in training and in developing match strategies.
The development of tennis formulas has the potential to increase the public's
interest in tennis. It is hoped that this paper raises the profile of probabilistic
modeling of tennis and leads to further development and application of models of
tennis and associated formulas.
Appendix
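The formula for computing $TB(p, q)$ is given by

$$TB(p, q) = \sum_{i=1}^{28} A(i,1)\, p^{A(i,2)} (1-p)^{A(i,3)} q^{A(i,4)} (1-q)^{A(i,5)} d(p, q)^{A(i,6)}, \tag{A1}$$

where $d(p, q) = pq\,[1 - \{p(1-q) + (1-p)q\}]^{-1}$ and $A(i, j)$ is the $ij$th element of
the 28 by 6 matrix

$$A = \begin{pmatrix}
1 & 3 & 0 & 4 & 0 & 0 \\
3 & 3 & 1 & 4 & 0 & 0 \\
4 & 4 & 0 & 3 & 1 & 0 \\
6 & 3 & 2 & 4 & 0 & 0 \\
16 & 4 & 1 & 3 & 1 & 0 \\
6 & 5 & 0 & 2 & 2 & 0 \\
10 & 2 & 3 & 5 & 0 & 0 \\
40 & 3 & 2 & 4 & 1 & 0 \\
30 & 4 & 1 & 3 & 2 & 0 \\
4 & 5 & 0 & 2 & 3 & 0 \\
5 & 1 & 4 & 6 & 0 & 0 \\
50 & 2 & 3 & 5 & 1 & 0 \\
100 & 3 & 2 & 4 & 2 & 0 \\
50 & 4 & 1 & 3 & 3 & 0 \\
5 & 5 & 0 & 2 & 4 & 0 \\
1 & 1 & 5 & 6 & 0 & 0 \\
30 & 2 & 4 & 5 & 1 & 0 \\
150 & 3 & 3 & 4 & 2 & 0 \\
200 & 4 & 2 & 3 & 3 & 0 \\
75 & 5 & 1 & 2 & 4 & 0 \\
6 & 6 & 0 & 1 & 5 & 0 \\
1 & 0 & 6 & 6 & 0 & 1 \\
36 & 1 & 5 & 5 & 1 & 1 \\
225 & 2 & 4 & 4 & 2 & 1 \\
400 & 3 & 3 & 3 & 3 & 1 \\
225 & 4 & 2 & 2 & 4 & 1 \\
36 & 5 & 1 & 1 & 5 & 1 \\
1 & 6 & 0 & 0 & 6 & 1
\end{pmatrix}.$$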
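To show how (A1) is used in practice, the sketch below evaluates it directly from the matrix A as transcribed from this appendix. Each row is read as (coefficient, exponent of p, exponent of 1-p, exponent of q, exponent of 1-q, exponent of d). The function name `tiebreak` is hypothetical, and p and q are assumed to lie strictly between 0 and 1 so that d(p, q) is defined.

```python
# Rows of the 28 x 6 matrix A, transcribed from the appendix.
A = [
    (1, 3, 0, 4, 0, 0), (3, 3, 1, 4, 0, 0), (4, 4, 0, 3, 1, 0),
    (6, 3, 2, 4, 0, 0), (16, 4, 1, 3, 1, 0), (6, 5, 0, 2, 2, 0),
    (10, 2, 3, 5, 0, 0), (40, 3, 2, 4, 1, 0), (30, 4, 1, 3, 2, 0),
    (4, 5, 0, 2, 3, 0), (5, 1, 4, 6, 0, 0), (50, 2, 3, 5, 1, 0),
    (100, 3, 2, 4, 2, 0), (50, 4, 1, 3, 3, 0), (5, 5, 0, 2, 4, 0),
    (1, 1, 5, 6, 0, 0), (30, 2, 4, 5, 1, 0), (150, 3, 3, 4, 2, 0),
    (200, 4, 2, 3, 3, 0), (75, 5, 1, 2, 4, 0), (6, 6, 0, 1, 5, 0),
    (1, 0, 6, 6, 0, 1), (36, 1, 5, 5, 1, 1), (225, 2, 4, 4, 2, 1),
    (400, 3, 3, 3, 3, 1), (225, 4, 2, 2, 4, 1), (36, 5, 1, 1, 5, 1),
    (1, 6, 0, 0, 6, 1),
]


def tiebreak(p, q):
    """Evaluate (A1): p and q are the probabilities of winning a point
    on serve and on return, respectively (both strictly inside (0, 1))."""
    # Probability of winning from 6-6: win the next two points before losing two.
    d = p * q / (1 - (p * (1 - q) + (1 - p) * q))
    return sum(c * p**a * (1 - p)**b * q**e * (1 - q)**f * d**g
               for c, a, b, e, f, g in A)


print(tiebreak(0.5, 0.5))  # evenly matched players
```

A quick sanity check on the transcription is that evenly matched players (p = q = 0.5) should each win the tiebreaker half of the time.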
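The formula for computing $S(p, q)$ is given by

$$S(p, q) = \sum_{i=1}^{21} B(i,1)\, G(p)^{B(i,2)} (1-G(p))^{B(i,3)} G(q)^{B(i,4)} (1-G(q))^{B(i,5)} \Big( G(p)G(q) + \{G(p)(1-G(q)) + (1-G(p))G(q)\}\, TB(p, q) \Big)^{B(i,6)}, \tag{A2}$$

where $B(i, j)$ is the $ij$th element of the 21 by 6 matrix

$$B = \begin{pmatrix}
1 & 3 & 0 & 3 & 0 & 0 \\
3 & 3 & 1 & 3 & 0 & 0 \\
3 & 4 & 0 & 2 & 1 & 0 \\
6 & 2 & 2 & 4 & 0 & 0 \\
12 & 3 & 1 & 3 & 1 & 0 \\
3 & 4 & 0 & 2 & 2 & 0 \\
4 & 2 & 3 & 4 & 0 & 0 \\
24 & 3 & 2 & 3 & 1 & 0 \\
24 & 4 & 1 & 2 & 2 & 0 \\
4 & 5 & 0 & 1 & 3 & 0 \\
5 & 1 & 4 & 5 & 0 & 0 \\
40 & 2 & 3 & 4 & 1 & 0 \\
60 & 3 & 2 & 3 & 2 & 0 \\
20 & 4 & 1 & 2 & 3 & 0 \\
1 & 5 & 0 & 1 & 4 & 0 \\
1 & 0 & 5 & 5 & 0 & 1 \\
25 & 1 & 4 & 4 & 1 & 1 \\
100 & 2 & 3 & 3 & 2 & 1 \\
100 & 3 & 2 & 2 & 3 & 1 \\
25 & 4 & 1 & 1 & 4 & 1 \\
1 & 5 & 0 & 0 & 5 & 1
\end{pmatrix}.$$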
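The entries of A and B can also be regenerated by counting, which gives a way to check the matrices and to chain (A1) and (A2) into a match probability. The sketch below rests on several assumptions that are not stated in this appendix: the player under study serves first; service in a tiebreaker follows the usual one-then-two-each pattern; G(p) is the standard deuce-advantage game formula; and a best-of-three match is won by taking two independent sets. All function names are hypothetical.

```python
from math import comb


def coefficient_rows(target, max_lost, serves):
    """Rows (coef, won on serve, lost on serve, won on return, lost on return, tied)
    for a race to `target`, seen by the player who serves first.
    `serves(i)` tells whether that player serves contest i (1-indexed)."""
    rows = []
    for lost in range(max_lost + 1):           # final scores target-0 .. target-max_lost
        n = target + lost                      # the player wins contest number n
        n_serve = sum(serves(i) for i in range(1, n))
        n_return = n - 1 - n_serve
        last = int(serves(n))                  # 1 if the deciding contest is on serve
        for j in range(min(lost, n_serve), -1, -1):    # j = contests lost on serve
            if lost - j <= n_return:
                rows.append((comb(n_serve, j) * comb(n_return, lost - j),
                             n_serve - j + last, j,
                             n_return - (lost - j) + 1 - last, lost - j, 0))
    t = target - 1                             # tied at t-all: each side served t times
    for j in range(t, -1, -1):
        rows.append((comb(t, j) * comb(t, t - j), t - j, j, j, t - j, 1))
    return rows


A = coefficient_rows(7, 5, lambda i: (i // 2) % 2 == 0)   # tiebreaker points
B = coefficient_rows(6, 4, lambda i: i % 2 == 1)          # games in a set


def evaluate(rows, x, y, tail):
    """Common form of (A1) and (A2)."""
    return sum(c * x**a * (1 - x)**b * y**e * (1 - y)**f * tail**g
               for c, a, b, e, f, g in rows)


def game(p):
    """Standard deuce-advantage game formula (assumed to match G(p) in Section 3)."""
    deuce = p**2 / (1 - 2 * p * (1 - p))
    return p**4 * (1 + 4 * (1 - p) + 10 * (1 - p)**2) + 20 * p**3 * (1 - p)**3 * deuce


def tiebreak(p, q):
    """(A1); requires 0 < p, q < 1 so that d is defined."""
    d = p * q / (1 - (p * (1 - q) + (1 - p) * q))
    return evaluate(A, p, q, d)


def set_win(p, q):
    """(A2): win the set outright, or reach 5-5 and win 7-5 or the tiebreaker."""
    gp, gq = game(p), game(q)
    tail = gp * gq + (gp * (1 - gq) + (1 - gp) * gq) * tiebreak(p, q)
    return evaluate(B, gp, gq, tail)


def match3(p, q):
    """Best of three sets, assuming independent sets with a common probability."""
    s = set_win(p, q)
    return s**2 + 2 * s**2 * (1 - s)


print(len(A), len(B))
print(round(match3(0.65, 0.37), 4))
```

If these assumptions match the paper's, the generated rows should coincide with A and B above, and the last line should land close to the 59.85% quoted in Section 6.3.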

References
Newton, P. K. and J. B. Keller. (2005). Probability of Winning at Tennis I. Theory and Data. Studies in Applied Mathematics, 114, 241-269.
Liu, Y. (2001). Random Walks in Tennis. http://www.math-cs.cmsu.edu/~mjms/2001.3/Yliuten.pdf
Riddle, L. H. (1988). Probability Models for Tennis Scoring Systems. Applied Statistics, 37, 63-75.
Morris, C. N. (1977). The most important points in tennis. In: Optimal Strategies in Sport (S. P. Ladany and R. E. Nichol, Eds.), pp. 131-140. Amsterdam: North Holland.
Betfair Tennis. (2008). http://form.tennis.betfair.com/tennis
Jackson, D. and K. Mosurski. (1997). Heavy defeats in tennis: Psychological momentum or random effects. Chance, 10, 27-34.

Published by Berkeley Electronic Press, 2008
