CRICKET CRAZE: METHODOLOGY DOCUMENT

By Team Brahmashiras (Aniruddha Ghosh & Nishant Dey Purkayastha), IIM Kozhikode

Methodology for the 1st innings


The WASP model for the 1st innings provides the expected final score at each ball from the start of
the 21st over onwards. The expected final score at each ball depends on two factors: the number of
balls bowled so far (which in turn gives the number of balls remaining) and the number of wickets
lost so far (which in turn gives the number of wickets in hand).
At any particular ball, the expected additional runs for the remainder of the innings is given by the
equation,
V(b,w) = r(b,w) + p(b,w)*V(b+1,w+1) + (1-p(b,w))*V(b+1,w)
Where, V(b,w) = expected additional runs for the remainder of the innings when b balls have been
bowled and w wickets have been lost
r(b,w) = estimated expected runs on the next ball in that situation
p(b,w) = probability of a wicket on the next ball in that situation

Preparation of Data: In the given dataset, several innings exceeded the standard quota of 300
balls because of extras. All such extra deliveries were merged with the previous legitimate ball
so that every innings has a maximum of 300 balls. Secondly, in a few matches the first innings
ended even though the batting team had not lost all its wickets and the 50 overs had not been
completed. Such an innings could have ended for two reasons: the match was abandoned, or the
innings was shortened because of a delayed start or a rain interruption. In the second case, the
batting team's approach would differ from its approach in a full 50-over match. Since the data does
not reveal the cause of a shortened innings, we exclude all such matches in order to maintain
uniformity. Thus, only innings that reached their logical conclusion (either the full 50 overs
bowled or the batting side bowled out) were considered.
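
The two preparation steps above can be sketched in a few lines of Python. This is only an
illustration, not the team's actual code: the record layout (the keys 'runs', 'wickets' and
'legal') and the function names are hypothetical, and the handling of an extra bowled before any
legitimate ball (carried forward to the first legitimate ball) is an assumption, since the text
does not cover that case.

```python
def merge_extras(deliveries):
    """Fold every extra delivery into the previous legitimate ball.

    `deliveries` is a list of dicts with hypothetical keys:
    'runs' (int), 'wickets' (int) and 'legal' (bool).
    """
    merged = []
    pending_runs, pending_wickets = 0, 0  # extras seen before any legal ball
    for d in deliveries:
        if d["legal"]:
            merged.append({"runs": d["runs"] + pending_runs,
                           "wickets": d["wickets"] + pending_wickets})
            pending_runs, pending_wickets = 0, 0
        elif merged:
            # Extra delivery: add its runs and wickets to the previous legal ball.
            merged[-1]["runs"] += d["runs"]
            merged[-1]["wickets"] += d["wickets"]
        else:
            # Assumption: an extra with no previous legal ball is carried forward.
            pending_runs += d["runs"]
            pending_wickets += d["wickets"]
    return merged


def is_complete(balls, max_balls=300, max_wickets=10):
    """Keep an innings only if 50 overs were bowled or the side was bowled out."""
    wickets = sum(ball["wickets"] for ball in balls)
    return len(balls) == max_balls or wickets >= max_wickets
```

After merging, an innings never has more than 300 entries, so the position of an entry in the list
can serve directly as the ball index b.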

Calculation of r(b,w): There are, in total, 1800 possible values for (b,w), as b varies from 121
to 300 (the problem statement requires us to predict the score only from the 21st over onwards)
and w varies from 0 to 9. For each of these situations, we calculate the average runs scored on
the next ball over all the available matches. Upon doing so, we notice that many of the (b,w)
pairs occur only rarely. For example, an instance where 299 balls have been bowled and no wickets
have fallen understandably has a low frequency. Hence, to avoid averages based on very few
observations, we only consider those values of r(b,w) which have a frequency higher than 10. We
observe that the full set of values for r(b,w) survives this filter when w=5. So we plot a scatter
graph of these values against b and observe that a 4th degree polynomial fits them well. Thus, to
obtain r(b,w) values for the entire set, we resort to a polynomial regression (degree 4) with runs
scored as the dependent variable and balls bowled (b) and wickets lost (w) as the independent
variables. This regression supplies the values of r(b,w) for all (b,w) combinations.
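
The averaging and frequency filter described above might look as follows. This is a minimal
sketch under stated assumptions: the input format (tuples of b, w and runs on the next ball) and
the function name are hypothetical, and the degree-4 regression step is left out because the text
does not give its exact specification.

```python
from collections import defaultdict

MIN_FREQUENCY = 10  # a (b, w) cell is kept only if it occurs more often than this


def empirical_r(observations, min_frequency=MIN_FREQUENCY):
    """Average runs on the next ball for each (b, w) cell with enough data.

    `observations` is an iterable of (b, w, runs_on_next_ball) tuples
    (a hypothetical layout, one tuple per ball in the training data).
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for b, w, runs in observations:
        totals[(b, w)] += runs
        counts[(b, w)] += 1
    # Drop sparsely observed cells; the regression is meant to fill them in later.
    return {cell: totals[cell] / counts[cell]
            for cell in counts if counts[cell] > min_frequency}
```

The cells that survive the filter are the data points to which the polynomial would then be
fitted.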

Calculation of p(b,w): Once again, there are a total of 1800 possible combinations of (b,w). For
each of these situations, the probability of a wicket falling can be calculated empirically from
the given data by dividing the number of instances when a wicket falls by the total number of
instances. However, such a system yields a large number of zero values: at most 10 wickets fall in
the 300 balls of an innings, so most (b,w) combinations record no wicket at all and receive an
empirical probability of zero. At this juncture, we apply a pinch of cricketing common sense to
conclude that whether a
wicket falls on the first or second ball of an over (or, for that matter, on any other ball of the
over) makes little difference. Hence, we decide to calculate the probability of a wicket falling
in a particular over instead of on each and every ball, and then apply this probability to each
ball of the over.
This gives us 300 data points (30 overs * 10 wicket counts). The number of instances when a
wicket has fallen in an over divided by the total number of such overs gives us the empirical
probability of a wicket falling in each over-wicket combination. We then distribute this
probability equally over the balls of the over. This gives us our p(b,w) values.
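
A rough sketch of this over-level calculation is given below. Several details are assumptions
rather than statements from the text: the record layout and function names are hypothetical,
"equally distribute" is read as dividing the over-level probability by 6, and b is taken to mean
the number of balls already bowled, so that b = 120 corresponds to the first ball of over 21.

```python
from collections import defaultdict

BALLS_PER_OVER = 6


def over_wicket_probability(over_records):
    """Empirical probability of a wicket falling in each (over, wickets) cell.

    `over_records` is an iterable of (over, wickets_lost, wicket_fell) tuples,
    one per over observed in the training data (hypothetical layout).
    """
    fell = defaultdict(int)
    seen = defaultdict(int)
    for over, w, wicket_fell in over_records:
        seen[(over, w)] += 1
        if wicket_fell:
            fell[(over, w)] += 1
    return {key: fell[key] / seen[key] for key in seen}


def ball_probability(b, w, per_over):
    """p(b, w) for the next ball when b balls have already been bowled."""
    over = b // BALLS_PER_OVER + 1  # assumption: b = 120 starts over 21
    # Assumption: the over-level probability is split evenly across 6 balls.
    return per_over.get((over, w), 0.0) / BALLS_PER_OVER
```

Cells that never occur in the data default to a probability of zero here; a real implementation
would need to decide how to treat them.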

Computation of V(b,w): Now that we have an array of values for r(b,w) and p(b,w), we proceed to
compute the values of V(b,w). Since V(b,w) gives the expected runs to be scored in the remainder of
the innings, V(300,w) = 0 for every w. Hence, given the recursive nature of the equation, we can
solve the model by working backwards from b=300. This gives us the entire array of V(b,w) values.
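
The backward computation can be sketched as below, assuming r and p are available as functions of
(b, w). The function name and defaults are hypothetical, and setting V to zero once 10 wickets
have fallen is an assumption added so that V(b+1, w+1) is defined when w = 9; the text only states
the boundary condition at b = 300.

```python
def solve_v(r, p, first_ball=120, last_ball=300, max_wickets=10):
    """Fill V(b, w) by backward recursion from the end of the innings.

    `r(b, w)` and `p(b, w)` are callables returning the expected runs and
    the wicket probability on the next ball.
    """
    V = {}
    # Boundary from the text: no balls left means no further runs.
    for w in range(max_wickets + 1):
        V[(last_ball, w)] = 0.0
    for b in range(last_ball - 1, first_ball - 1, -1):
        # Assumption: an innings with all wickets down scores no further runs.
        V[(b, max_wickets)] = 0.0
        for w in range(max_wickets):
            V[(b, w)] = (r(b, w)
                         + p(b, w) * V[(b + 1, w + 1)]
                         + (1 - p(b, w)) * V[(b + 1, w)])
    return V
```

As a sanity check, with one run expected per ball and no chance of a wicket, the sketch returns
V(b, w) = 300 - b, which is what the equation implies.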

Methodology for the 2nd innings


For the 2nd innings, the WASP model gives the probability of winning at each ball, which depends on
the following factors: the number of balls left in the innings, the number of wickets in hand and
the number of runs left to be scored in order to reach the target.
Preparation of Data: Before preparing the model, a couple of modifications were made to the given
Train dataset. First and foremost, as in the 1st innings model, all extra deliveries were merged
with the legitimate ones so that every innings has a maximum of 300 balls.
Next, all matches which ended prematurely were investigated. There could have been two possible
reasons: the match being abandoned, or the Duckworth-Lewis method coming into play and thereby
reducing the target score and the overs to play. The database of www.espncricinfo.com was used to
understand the cause of each premature innings. As was done in the 1st innings model, all
abandoned matches were excluded in order to maintain uniformity of data. For the matches where
the Duckworth-Lewis method had been applied, the data pertaining to the target score and the
number of balls left to be bowled was modified from the point at which the revised target score
and overs had come into play in the match. There was also an instance when only 10 players from a
team batted, as one batsman was absent hurt. In that match, it was assumed that a wicket was down
from the very first ball onwards, which seems fair as the batting team knew that they had only 9
wickets to play with. This ensured that the data used to develop the model mirrored the real-life
scenario.
Developing the model: In order to develop the 2nd innings model, we turn to a form of regression
known as the Probit model. The Probit model is a type of regression for a binary dependent
variable; its fitted output is a number between 0 and 1, which is read as the probability of the
outcome occurring. It is especially useful in situations where the outcome is of a binary nature
and the probability of each outcome is desired to be known. In our case, there are two possible
outcomes to a match: win or lose. Hence, we apply the Probit model to the empirical data in the
Train dataset to obtain the required probability. In the process, we take the result of the match
as the dependent variable and the balls left to be bowled, the wickets in hand and the runs left
to score as the independent variables. Fitting the model to the various match scenarios and their
corresponding final results gave us a regression equation.
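
To show how a fitted Probit equation turns the three inputs into a probability, here is a minimal
sketch. It evaluates the standard normal CDF of a linear combination of the inputs, which is the
general form of a Probit model. The coefficient values and names are placeholders invented for
illustration; they are not the values estimated from the Train dataset, which the text does not
report.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def win_probability(balls_left, wickets_in_hand, runs_left, coef):
    """Probit prediction: normal CDF of the linear index."""
    z = (coef["intercept"]
         + coef["balls_left"] * balls_left
         + coef["wickets_in_hand"] * wickets_in_hand
         + coef["runs_left"] * runs_left)
    return normal_cdf(z)


# Placeholder coefficients for illustration only, NOT fitted values.
EXAMPLE_COEF = {
    "intercept": 0.0,
    "balls_left": 0.01,
    "wickets_in_hand": 0.2,
    "runs_left": -0.02,
}
```

With coefficients of these signs, the predicted probability rises with balls left and wickets in
hand and falls as the runs still required increase.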

Testing the model


We now have models for both innings. In the first innings, at any given stage, we take the number
of balls bowled and the number of wickets lost and find the corresponding value of V(b,w), which is
the number of runs the team is expected to score in the rest of the innings. To this we add the runs
scored so far in the innings to obtain the predicted score at that delivery. This predicted score is
calculated after every delivery, starting from the first delivery of the 21st over.
In the second innings, the model uses the Probit equation fitted to the available historical data.
It takes as input the number of deliveries left, the number of wickets in hand and the number of
runs left to score. The result is the probability that the team batting second will win the match.
This probability is also calculated after every delivery, starting from the first delivery of the
21st over.
We now turn to the Test dataset and apply the model to all the match scenarios. The output of the
model for 4 matches from the Test dataset (match numbers 4, 6, 11 and 15) is illustrated
graphically in the infographic.
