By Team Brahmashiras (Aniruddha Ghosh & Nishant Dey Purkayastha), IIM Kozhikode
Preparation of Data: In the given dataset, several innings exceeded the standard quota of 300
balls because of extras. Each such extra ball was merged with the previous legitimate ball so that
every innings has a maximum of 300 balls. Secondly, in a few matches the first innings ended even
though the batting team had not lost all its wickets and the 50 overs had not been completed. This
could happen for two reasons: the match was abandoned, or the innings was shortened because of a
delayed start or a rain interruption. In the second case, the approach of the batting team would
differ from its approach in a full 50-over match. Since the cause of a shortened innings cannot be
determined from the data, we exclude all such matches to maintain uniformity. Thus, only innings
which reached their logical conclusion, with either the full 50 overs bowled or the batting side
bowled out, were considered.
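The two preparation steps above can be sketched as follows. This is a minimal illustration, not the team's actual code: the record layout (the keys 'runs', 'wickets' and 'legal') is hypothetical, and the handling of extras bowled before the first legitimate ball (carried forward to that ball) is an assumption, since the text does not cover that case.

```python
from typing import Dict, List


def merge_extras(deliveries: List[Dict]) -> List[Dict]:
    """Fold each extra delivery into the previous legitimate ball.

    Each delivery is a dict with hypothetical keys:
      'runs'    - runs added to the total on that delivery
      'wickets' - number of wickets that fell on it (0 or 1 in practice)
      'legal'   - False for a delivery that had to be re-bowled
    """
    merged: List[Dict] = []
    carry_runs, carry_wickets = 0, 0  # extras seen before any legal ball
    for d in deliveries:
        if d["legal"]:
            merged.append({
                "runs": d["runs"] + carry_runs,
                "wickets": d["wickets"] + carry_wickets,
            })
            carry_runs, carry_wickets = 0, 0
        elif merged:
            # Attach the extra to the most recent legitimate ball.
            merged[-1]["runs"] += d["runs"]
            merged[-1]["wickets"] += d["wickets"]
        else:
            # Assumption: no earlier legal ball exists, so carry forward.
            carry_runs += d["runs"]
            carry_wickets += d["wickets"]
    return merged


def is_complete_innings(merged: List[Dict]) -> bool:
    """True if all 300 balls were bowled or the side was bowled out."""
    wickets = sum(d["wickets"] for d in merged)
    return len(merged) == 300 or wickets >= 10
```

Innings for which `is_complete_innings` returns False would be the ones excluded from the analysis.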
Calculation of r(b,w): There are, in total, 1800 possible values of (b,w), as b varies from 121 to
300 (the problem statement requires us to predict the score only from the 21st over onwards) and
w varies from 0 to 9. For each of these situations, we calculate the average runs scored over all
the available matches. Upon doing so, we notice that many (b,w) pairs occur very rarely. For
example, an instance where 299 balls have been bowled and no wickets have fallen understandably
has a low frequency. Hence, to avoid estimates based on too few observations, we only consider
those values of r(b,w) which have a frequency higher than 10. We observe that the entire set of
values for r(b,w) is available when w=5. We plot a scatter graph of these values and observe that
a 4th degree polynomial fits them well. Thus, to obtain r(b,w) values for the entire set, we resort
to a polynomial regression (degree 4) with runs scored as the dependent variable and balls bowled
(b) and wickets lost (w) as the independent variables. The fitted regression gives the values of
r(b,w) for all (b,w) combinations.
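The empirical averaging and the frequency filter can be sketched as below. The input layout (a flat stream of (b, w, runs) triples) and the function name are hypothetical. The sketch covers only the averaging step; the degree-4 regression that fills in the remaining cells is not shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

MIN_COUNT = 10  # the text keeps cells with a frequency higher than 10


def empirical_r(observations: Iterable[Tuple[int, int, float]],
                min_count: int = MIN_COUNT) -> Dict[Tuple[int, int], float]:
    """Average runs per (b, w) cell, dropping sparsely observed cells.

    observations: (b, w, runs) triples, a hypothetical flattened view of
    the ball-by-ball data with b in 121..300 and w in 0..9.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for b, w, runs in observations:
        total[(b, w)] += runs
        count[(b, w)] += 1
    # Keep a cell only if it was observed more than min_count times.
    return {cell: total[cell] / n for cell, n in count.items() if n > min_count}
```

The cells that survive the filter would then serve as the training points for the regression on b and w.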
Calculation of p(b,w): Once again, there are a total of 1800 possible combinations of (b,w). For
each of these situations, the probability of a wicket falling can be calculated empirically from
the given data by dividing the number of instances in which a wicket falls by the total number of
instances. However, such a system results in a large number of null values, which is explained by
the fact that at most 10 wickets fall in the 300 balls of an innings, so most deliveries have an
empirical probability of zero. At this juncture, we apply a pinch of cricketing common sense to
conclude that whether a wicket falls on the first or second ball of an over (or, for that matter,
on any other ball of the over) makes little difference. Hence, we calculate the probability of a
wicket falling in a particular over instead of on each individual ball, and then apply this
probability to each ball of the over.
This gives us a set of 300 data points (30 overs * 10 wicket states). The number of instances in
which a wicket has fallen in an over, divided by the total number of such overs, gives us the
empirical probability of a wicket falling for each over-wicket combination. We then distribute this
probability equally over the balls of the over. This gives us our p(b,w) values.
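A minimal sketch of the over-level estimate follows. The record layout is hypothetical, as is the choice to index each over by the wickets lost at its start. Dividing the over-level probability by six is one reading of "distribute equally"; the text could also be read as applying the over-level value unchanged to each ball.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

BALLS_PER_OVER = 6


def over_of(b: int) -> int:
    """1-based over number containing ball b (b counted from 1)."""
    return (b - 1) // BALLS_PER_OVER + 1


def wicket_probability(
    overs: Iterable[Tuple[int, int, bool]]
) -> Dict[Tuple[int, int], float]:
    """Per-ball wicket probability for each (over, w) cell.

    overs: (over, w, wicket_fell) records, one per observed over, where w
    is assumed to be the wickets lost at the start of that over.
    """
    seen = defaultdict(int)
    fell = defaultdict(int)
    for over, w, wicket_fell in overs:
        seen[(over, w)] += 1
        fell[(over, w)] += int(wicket_fell)
    # Over-level empirical probability, spread evenly over six balls
    # (an assumption about what "distribute equally" means).
    return {c: fell[c] / seen[c] / BALLS_PER_OVER for c in seen}


def p(b: int, w: int, table: Dict[Tuple[int, int], float]) -> float:
    """Look up p(b, w); cells never observed default to 0.0."""
    return table.get((over_of(b), w), 0.0)
```

With this indexing, balls 121 to 126 all map to over 21 and share one probability.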
Computation of V(b,w): Now that we have an array of values for r(b,w) and p(b,w), we proceed to
compute the values of V(b,w). Since V(b,w) gives the expected runs to be scored in the remainder of
the innings, V(b,w) = 0 when b=300. Hence, given the recursive nature of the equation, we can
solve the model by proceeding backwards from b=300. This gives us the entire array of V(b,w) values.
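The backward pass can be sketched as below. The recurrence itself does not appear in this text, so the form used here, V(b,w) = r(b,w) + p(b,w) V(b+1,w+1) + (1 - p(b,w)) V(b+1,w), is an assumption consistent with the descriptions of r, p and V above. The boundary V = 0 once ten wickets have fallen is likewise assumed, and the function and parameter names are hypothetical.

```python
from typing import Callable, List

LAST_BALL = 300
MAX_WICKETS = 10


def expected_remaining_runs(
    r: Callable[[int, int], float],
    p: Callable[[int, int], float],
    first_ball: int = 121,
) -> List[List[float]]:
    """Return V with V[b][w] filled in for b = first_ball..300, w = 0..10.

    Boundary conditions: V[300][w] = 0 (from the text) and V[b][10] = 0
    (assumed: no runs remain once the side is bowled out).
    """
    V = [[0.0] * (MAX_WICKETS + 1) for _ in range(LAST_BALL + 1)]
    # Walk backwards so that V[b + 1] is already known when V[b] is computed.
    for b in range(LAST_BALL - 1, first_ball - 1, -1):
        for w in range(MAX_WICKETS):
            pw = p(b, w)
            V[b][w] = r(b, w) + pw * V[b + 1][w + 1] + (1.0 - pw) * V[b + 1][w]
    return V
```

Given a filled table, the predicted first-innings total at a delivery would be the runs scored so far plus `V[b][w]`.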
For the first innings, V(b,w) gives the number of runs the team is expected to score in the rest of
the innings. To this we add the runs scored so far in the match to obtain the predicted score at
that delivery. This predicted score is calculated after every delivery starting from the first
delivery of the 21st over.
In the second innings, the model uses a Probit equation estimated from the available historical
data. It takes as input the number of deliveries left, the number of wickets in hand and the number
of runs left to score. The result is the probability that the team batting second will win the
match. This probability is also calculated after every delivery starting from the first delivery of
the 21st over.
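How a Probit equation turns the three inputs into a win probability can be sketched as follows. The fitted equation is not reproduced in this text, so the linear index in the three inputs is an assumption, and the coefficients are supplied by the caller as placeholders rather than being the team's estimates.

```python
import math
from typing import Tuple


def normal_cdf(z: float) -> float:
    """Standard normal CDF, the link function of a Probit model."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def win_probability(
    balls_left: int,
    wickets_in_hand: int,
    runs_needed: int,
    coef: Tuple[float, float, float, float],
) -> float:
    """P(chasing side wins) under an assumed linear Probit index.

    coef = (intercept, beta_balls, beta_wickets, beta_runs); these are
    placeholders to be replaced by values estimated from historical data.
    """
    b0, b_balls, b_wkts, b_runs = coef
    z = b0 + b_balls * balls_left + b_wkts * wickets_in_hand + b_runs * runs_needed
    return normal_cdf(z)
```

With sensible estimates one would expect positive coefficients on deliveries left and wickets in hand and a negative coefficient on runs left to score.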
We now turn to the Test dataset and apply the model to all the match scenarios. The output of the
model for 4 matches from the Test dataset (match numbers 4, 6, 11 and 15) is illustrated
graphically in the infographic.