
Module 7: Probability and Statistics

Lecture 4: Goodness of fit tests


1. Introduction
In the previous two lectures, the concepts, steps and applications of hypothesis testing were
discussed. Hypothesis testing may be used to check the validity of a hypothesis about a
population parameter from an observed sample. Similarly, it may be necessary to check
whether an observed dataset follows a particular probability distribution.
The first part of this lecture deals with the empirical determination of the probability
distribution of a random variable (RV). The use of probability paper in this context, along
with its method of construction, is discussed. Normal and Lognormal probability papers and
probability plots are described. There are also statistical tests, known as goodness-of-fit
tests, that check which probability distribution a dataset (sample) plausibly follows. Common
examples are testing whether a sample of a discrete variable follows a Poisson distribution,
whether a sample of a continuous variable follows a Normal distribution, or whether two
samples are drawn from identical distributions. The most commonly used tests are the
Chi-square ($\chi^2$) test, the Kolmogorov-Smirnov (K-S) test and the Anderson-Darling test.
These are discussed in the later part of this lecture.
2. Empirical Determination of probability distribution of a RV
In many real life scenarios, the actual probability distribution of a random process is
unknown. On the basis of frequency distribution determined from observed data, some
probability distribution may be assumed empirically. Probability papers are useful to check
the assumption of a particular probability distribution of a random variable.
3. Probability Paper
A probability paper is a specially constructed plotting paper in which one of the axes (on
which the RV is plotted) is an arithmetic axis and the probability axis is distorted in such a
way that the cumulative distribution function (CDF) of the RV plots as a straight line.
It may be noted that separate probability papers are needed to plot the CDF of different
probability distributions as a straight line. For a particular probability distribution with
different parameters, one probability paper may be sufficient as in the case of exponential
distribution. For some probability distributions such as Gamma distribution, separate
probability papers are needed for each set of parameters.
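The exponential case can be made concrete with a short derivation (a sketch; the rate symbol $\lambda$ is notation assumed here and is not taken from the lecture). For $F_X(x) = 1 - e^{-\lambda x}$, $x \ge 0$:

$$ -\ln\left[1 - F_X(x)\right] = \lambda x $$

If the probability axis is scaled as $-\ln(1 - F)$, the CDF plots as a straight line through the origin with slope $\lambda$, whatever the value of $\lambda$. A single paper therefore serves every exponential distribution.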
4. Utilities of a probability paper
When a RV is expected to fit a certain probability distribution, its observed CDF is plotted on
the corresponding probability paper. The resulting plot has the following utilities:
- If the plot is a straight line, it can be concluded that the RV follows the hypothesised
distribution.
- If the plot deviates from a straight line, the location of the deviation indicates the region
(e.g., the tail or the mode) where the fit is poor.
- If the plot is not at all close to a straight line, the hypothesised distribution has to be
rejected and some other distribution tested for fit.
- If the plot follows a straight line over part of the range and then deviates, this indicates
a change in distribution beyond a certain range of the RV.
- The slope and intercept of the straight line plotted on the probability paper can be used to
estimate the parameters of the distribution.
5. Normal Probability Paper
The normal probability paper is constructed on the basis of the standard normal distribution
function. The random variable $X$ is represented on the horizontal (or vertical) axis in
arithmetic scale. The vertical (or horizontal) axis carries two scales: the standard normal
variate $Z = (X - \mu)/\sigma$ and the cumulative probability values $F_X(x)$, ranging from 0 to 1.
The experimental data points are plotted using Gumbel's plotting position, $m/(N+1)$, where
$N$ is the total number of observations and $m$ is the rank of the data point when the
observed values are arranged in ascending order. Because the probability scale is compressed
near the median and expanded towards the tails, a normal variate $X$ with mean $\mu$ and
standard deviation $\sigma$ plots as a straight line on this paper. The straight line passes
through the point $X = \mu$, $F_X(x) = 0.5$ and has a slope of $Z/(X - \mu) = 1/\sigma$. Hence
the parameters of the distribution can be easily obtained from the plot.
Problem on Normal Probability Paper
Q. The observed strengths of 30 concrete cubes are given below. Check whether the strength
of concrete cubes follows normal distribution or not by plotting on normal probability paper.
Determine the mean strength and the standard deviation.
Strength in kN/m^2:
Sl no  Strength   Sl no  Strength   Sl no  Strength   Sl no  Strength   Sl no  Strength   Sl no  Strength
1      25.14      6      27.48      11     21.08      16     24.22      21     23.39      26     28.85
2      24.55      7      19.62      12     24.67      17     24.38      22     23.10      27     14.70
3      23.27      8      18.61      13     20.23      18     25.09      23     20.76      28     21.72
4      19.10      9      23.49      14     27.59      19     25.31      24     18.85      29     31.77
5      24.24      10     16.76      15     26.87      20     25.82      25     23.78      30     21.62
Soln.
The observed strengths of the concrete cubes are first arranged in ascending order. Their
plotting positions are then determined by $m/(N+1)$, where $N = 30$ and $m$ is the rank of the
observed data point in ascending order.
The normal probability plot is prepared and the data are found to plot almost as a straight
line. Thus, the strength of the concrete cubes may be taken to follow a normal distribution.
[Figure: Normal probability plot. Horizontal axis: Strength (kN/m^2); vertical axis: Probability]


From the plot, the value of strength corresponding to a cumulative probability of 0.5 is
23 kN/m^2. Thus the mean strength is 23 kN/m^2.
The standard deviation is given by the inverse of the slope of the straight line. It is obtained
roughly as 8 kN/m^2.
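The graphical procedure can be mimicked numerically. The following is a minimal sketch, not the lecture's own method: it assumes that a least-squares line through the plotted points is an acceptable stand-in for a line drawn by eye, and the function names `probability_plot_points` and `fit_straight_line` are hypothetical. It uses the plotting position $m/(N+1)$ and converts each position to a standard normal variate with Python's `statistics.NormalDist`.

```python
from statistics import NormalDist, mean, stdev

# Observed strengths of the 30 concrete cubes (kN/m^2), in serial-number order.
strengths = [
    25.14, 24.55, 23.27, 19.10, 24.24, 27.48, 19.62, 18.61, 23.49, 16.76,
    21.08, 24.67, 20.23, 27.59, 26.87, 24.22, 24.38, 25.09, 25.31, 25.82,
    23.39, 23.10, 20.76, 18.85, 23.78, 28.85, 14.70, 21.72, 31.77, 21.62,
]


def probability_plot_points(data):
    """Return (x, z) pairs: sorted values and the standard normal variates
    at the plotting positions m / (N + 1)."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        (x, std_normal.inv_cdf(m / (n + 1)))
        for m, x in enumerate(ordered, start=1)
    ]


def fit_straight_line(points):
    """Least-squares fit of x = mu + sigma * z; returns (mu, sigma)."""
    xs = [x for x, _ in points]
    zs = [z for _, z in points]
    x_bar, z_bar = mean(xs), mean(zs)
    s_zz = sum((z - z_bar) ** 2 for z in zs)
    s_zx = sum((z - z_bar) * (x - x_bar) for x, z in points)
    sigma = s_zx / s_zz
    mu = x_bar - sigma * z_bar
    return mu, sigma


points = probability_plot_points(strengths)
mu_fit, sigma_fit = fit_straight_line(points)

print(f"fitted line:       mu = {mu_fit:.2f}, sigma = {sigma_fit:.2f}")
print(f"sample statistics: mean = {mean(strengths):.2f}, s = {stdev(strengths):.2f}")
```

For these data the sample mean is about 23.2 kN/m^2, in line with the graphical reading, while the sample standard deviation is about 3.7 kN/m^2. A spread read roughly off the plot is worth cross-checking against such a direct estimate.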
6. Log Normal Probability Paper
The log normal probability paper differs from a normal probability paper as follows: the
horizontal axis for the random variable $X$ is in logarithmic scale instead of arithmetic
scale. The standard normal variate $Z$ is given by $Z = \ln(X/x_m)/\zeta$, where $x_m$ is the
median of $X$ and $\zeta$ is the standard deviation of $\ln X$.
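Why the logarithmic axis straightens the plot can be seen by writing the standard variate in terms of $\ln X$ (a short sketch; $x_m$ denotes the median of $X$ and $\zeta$ is assumed to denote the standard deviation of $\ln X$):

$$ Z = \frac{\ln X - \ln x_m}{\zeta} $$

$Z$ is linear in $\ln X$, so on a logarithmic $X$ axis a lognormal variate plots as a straight line that passes through $X = x_m$ at $Z = 0$ (cumulative probability 0.5) and has slope $1/\zeta$.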


7. General Probability Paper and Probability plot
For any probability distribution, a probability paper may be constructed by identifying a
suitable standard variate $Z$. The use of the standard variate ensures that the constructed
probability paper is independent of the parameters of the distribution. The standard variate
$Z$ is represented on the vertical axis in arithmetic scale, and the corresponding cumulative
probability values are marked on the same axis. The random variable $X$ is represented on the
horizontal axis in arithmetic scale. The plotting position of the experimental data points can
be obtained by a number of methods (e.g., Gumbel, Hazen). As per the Gumbel method, the
plotting position is $m/(N+1)$, where $N$ is the total number of observations and $m$ is the
rank of the data point when the observed values are arranged in ascending order. If the plotted
data points fall on a straight line on the paper, the data may be taken to follow the
particular probability distribution for which the paper is constructed. Depending on the
standard variate selected, the parameters of the distribution can be obtained from the slope
and intercept of the plot.
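The two plotting-position methods named above can be compared with a small sketch. The Gumbel form $m/(N+1)$ is the one given in the lecture; the Hazen form $(m - 0.5)/N$ is not stated in the lecture and is assumed here in its commonly quoted form. The function names are hypothetical.

```python
def gumbel_position(m, n):
    """Plotting position m / (N + 1), as used in this lecture."""
    return m / (n + 1)


def hazen_position(m, n):
    """Hazen plotting position (m - 0.5) / N (assumed, commonly quoted form)."""
    return (m - 0.5) / n


n = 5  # illustrative sample size
for m in range(1, n + 1):
    print(m, round(gumbel_position(m, n), 4), round(hazen_position(m, n), 4))
```

Both rules keep the smallest and largest observations away from cumulative probabilities of exactly 0 and 1, which cannot be located on a probability axis whose tails extend indefinitely.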
8. Goodness-of-fit tests
As mentioned in the introduction, statistical tests known as goodness-of-fit tests are used to
check the probability distribution that a dataset (sample) possibly follows. The most commonly
used tests are the Chi-square ($\chi^2$) test, the Kolmogorov-Smirnov (K-S) test and the
Anderson-Darling test. These are discussed in this section one after another.
8.1 Chi-square test
The chi-square distribution is used for testing the goodness of fit of a set of data to a
specific probability distribution. For this, the observed frequencies are compared with the
hypothetical frequencies that follow the specific probability distribution. The test can be
used for both discrete and continuous random variables.
Let us consider a sample containing $n$ observed values of a random variable. Let
$O_1, O_2, O_3, \ldots, O_k$ be the $k$ observed frequencies of the variate and let the
corresponding frequencies from an assumed theoretical distribution be
$E_1, E_2, E_3, \ldots, E_k$. It is required to test whether the differences between the
observed and expected frequencies are significant. The goodness of fit is measured by

$$ X^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} $$

As $n$ approaches infinity, the sampling distribution of $X^2$ tends to a $\chi^2_\nu$
distribution, where $\nu = k - 1$ is the degrees of freedom.
The $\chi^2$ test for goodness of fit is generally reliable if $k > 5$ and $E_i > 5$. It may be
noted that in most cases the parameters of the theoretical distribution are not known. Hence,
the parameters must be estimated from the data itself, and the statistic remains valid if the
degrees of freedom are reduced by one for every estimated parameter.
If

$$ \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} < C_{1-\alpha,\,\nu} $$

where $C_{1-\alpha,\,\nu}$ is the value of the appropriate $\chi^2_\nu$ distribution at
cumulative probability $1 - \alpha$, then the assumed theoretical distribution is an
acceptable model at the significance level $\alpha$.
Problem on chi squared distribution
Q. Consider a given station in a watershed where the severe rainstorms are recorded over a
period of 70 years. Out of these 70 years, 22 years were without severe rainstorms and 25, 14,
6, 3 years with 1, 2, 3 and 4 rainstorms annually. Test whether the data can be assumed to
follow Poisson distribution at 5% significance level.
Sol.
Average occurrence rate of rainstorms:

$$ \lambda = \frac{1}{70}\left(25 \times 1 + 14 \times 2 + 6 \times 3 + 3 \times 4\right) = 1.1857 \ \text{rainstorms/year} $$

To check the goodness of fit, the chi-square test is used at the $\alpha = 5\%$ significance
level.
As the dataset is quite small, the data in the class of four storms/year are combined into the
class of three storms/year.
Null hypothesis $H_0$: The random variable has a Poisson distribution with rate $\lambda = 1.1857$.
Alternate hypothesis $H_1$: The random variable does not follow the distribution specified in
the null hypothesis.
Level of significance: $\alpha = 0.05$
Here $k = 4$; degrees of freedom, $\nu = k - 2 = 2$ (one degree of freedom is lost because the
rate is estimated from the data).
Critical region: $\chi^2 > \chi^2_{0.05,\,2}$
From the chi-square table, $\chi^2_{0.05,\,2} = 5.99$
No. of storms/year   Observed frequency O_i   Theoretical frequency E_i   (O_i - E_i)^2 / E_i
0                    22                       21.3019                     0.0229
1                    25                       25.3428                     0.0046
2                    14                       15.0752                     0.0767
3                    9                        8.2801                      0.0626
Total                70                       70                          0.1668

Thus, we get

$$ \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = 0.1668 < 5.99 = \chi^2_{0.05,\,2} $$

Hence, the Poisson distribution is a valid model at the 5% significance level.
Decision: The null hypothesis cannot be rejected at the 5% significance level.
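The steps of this worked example can be retraced with a short script. This is a sketch under stated assumptions: the rate is taken as exactly 83/70, the last class is treated as "3 or more" storms so that the expected frequencies sum to 70, and the critical value uses the fact that a chi-square variate with 2 degrees of freedom has CDF $1 - e^{-x/2}$. The variable names are hypothetical.

```python
import math

# Number of years with 0, 1, 2, 3 and 4 severe rainstorms (problem statement).
years_by_count = {0: 22, 1: 25, 2: 14, 3: 6, 4: 3}
n_years = sum(years_by_count.values())

# Estimated Poisson rate: total number of storms / total number of years.
rate = sum(count * years for count, years in years_by_count.items()) / n_years


def poisson_pmf(x, lam):
    """Poisson probability of exactly x occurrences for rate lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


# Classes 0, 1, 2 and "3 or more" (the 4-storm class merged into the last one).
observed = [22, 25, 14, 6 + 3]
expected = [n_years * poisson_pmf(x, rate) for x in range(3)]
expected.append(n_years - sum(expected))  # remaining probability mass

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# k classes, minus 1, minus 1 more for the rate estimated from the data.
dof = len(observed) - 1 - 1
alpha = 0.05
# Valid for dof == 2 only: the upper-alpha point of chi-square(2) is -2 ln(alpha).
critical = -2.0 * math.log(alpha)
reject = chi_sq > critical

print(f"rate = {rate:.4f}, chi-square = {chi_sq:.4f}, critical = {critical:.2f}")
print("reject H0" if reject else "cannot reject H0")
```

The expected frequencies computed this way differ from the tabulated ones in the second decimal place (the table appears to correspond to a rate of 1.1897), but the statistic comes out near 0.17 either way and the decision is unchanged.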
8.2 Kolmogorov-Smirnov (KS) Test
This test is commonly used to check the validity of an assumed model for continuous random
variables. It is based on the CDF rather than the pdf of the variable. It compares the observed
(data-based) cumulative frequency with the assumed theoretical cumulative distribution.
Let the continuous variable be $X$ and let $x_1, x_2, \ldots, x_n$ represent the ordered sample
of size $n$, with the values arranged in increasing order. From this ordered set the empirical
(sample) distribution function $S_n(x)$ is developed. It is a step function, defined as

$$ S_n(x) = \begin{cases} 0, & x < x_1 \\ k/n, & x_k \le x < x_{k+1}, \quad k = 1, 2, \ldots, n-1 \\ 1, & x \ge x_n \end{cases} $$

Here $S_n(x)$ is the step function and $F(x)$ is the proposed theoretical distribution.
The discrepancy between the theoretical model and the observed data is computed, and the
maximum difference $D_n$ between $S_n(x)$ and $F(x)$ over the entire range of $x$ is obtained:

$$ D_n = \max_x \left| F(x) - S_n(x) \right| $$

Fig 1. $S_n(x)$ and $F(x)$

For a specified significance level $\alpha$, the K-S test compares the maximum difference $D_n$
with the critical value $D_n^{\alpha}$, which is defined by

$$ P\left(D_n \le D_n^{\alpha}\right) = 1 - \alpha $$

If the observed value is less than the critical value, the proposed distribution is acceptable
at the significance level $\alpha$.
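The step function and the maximum difference can be sketched in a few lines. This is an illustrative implementation rather than the lecture's procedure: the function names are hypothetical, the three-point sample and the uniform model are made up for the example, and the statistic checks both sides of every jump of the step function, because the largest gap between $F(x)$ and $S_n(x)$ can occur just before a jump as well as at it.

```python
import bisect


def empirical_cdf(sample):
    """Return S_n as a function: the fraction of observations <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def s_n(x):
        return bisect.bisect_right(ordered, x) / n

    return s_n


def ks_statistic(sample, cdf):
    """Maximum absolute difference between S_n(x) and the model CDF F(x)."""
    ordered = sorted(sample)
    n = len(ordered)
    d_n = 0.0
    for k, x in enumerate(ordered, start=1):
        f = cdf(x)
        # S_n equals k/n at x and (k - 1)/n just below x.
        d_n = max(d_n, abs(k / n - f), abs(f - (k - 1) / n))
    return d_n


def uniform_cdf(x):
    """CDF of a uniform variate on [0, 1], used only as a toy model."""
    return min(max(x, 0.0), 1.0)


sample = [0.7, 0.1, 0.4]  # hypothetical data
s_n = empirical_cdf(sample)
d_n = ks_statistic(sample, uniform_cdf)
print(s_n(0.4), d_n)
```

The decision rule is then a single comparison of `d_n` with the tabulated critical value for the sample size and significance level in use.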
8.2.1 Advantage of KS Test
Unlike in the chi-square test, division of the data into intervals is not necessary. The test
statistic is distribution free, unlike that of the chi-square test. The K-S test also works for
non-normal data. However, the test may fail if the data are too far from normal.
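If the sample size $n$ is large, Smirnov has given the limiting distribution of
$\sqrt{n}\,D_n$ as

$$ \lim_{n \to \infty} P\left(\sqrt{n}\,D_n \le z\right) = \frac{\sqrt{2\pi}}{z} \sum_{k=1}^{\infty} \exp\left[-\frac{(2k-1)^2 \pi^2}{8 z^2}\right] $$

For $n > 50$, the critical values for $\alpha = 0.05$ and $\alpha = 0.10$ are
$D_n^{\alpha} = 1.36/\sqrt{n}$ and $D_n^{\alpha} = 1.22/\sqrt{n}$, respectively.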
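The link between the limiting distribution and the coefficients 1.36 and 1.22 can be checked numerically. The sketch below sums the series directly; truncating it at 50 terms is an assumption that is more than sufficient for the values of $z$ of practical interest, and the function names are hypothetical.

```python
import math


def kolmogorov_limit_cdf(z, terms=50):
    """Limiting value of P(sqrt(n) * D_n <= z), summing the series term by term."""
    if z <= 0:
        return 0.0
    total = sum(
        math.exp(-((2 * k - 1) ** 2) * math.pi ** 2 / (8.0 * z ** 2))
        for k in range(1, terms + 1)
    )
    return math.sqrt(2.0 * math.pi) / z * total


def large_sample_critical_value(n, coefficient):
    """Critical D_n for large n: coefficient / sqrt(n), e.g. 1.36 for alpha = 0.05."""
    return coefficient / math.sqrt(n)


print(round(kolmogorov_limit_cdf(1.36), 4))  # close to 1 - 0.05
print(round(kolmogorov_limit_cdf(1.22), 4))  # close to 1 - 0.10
print(round(large_sample_critical_value(100, 1.36), 3))
```

Evaluating the series at $z = 1.36$ and $z = 1.22$ returns cumulative probabilities close to 0.95 and 0.90, which is where the two coefficients come from.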

Problem on KS test
The fracture toughness data of plain concrete specimens made with burnt brick aggregates are
shown in the table below. The data appear to fall approximately on a straight line on normal
probability paper, corresponding to $N(0.540, 0.051)$. Perform the Kolmogorov-Smirnov test at
the 5% significance level to statistically justify this assumption for the given data.
Fracture toughness K_IC (MPa√m) of plain concrete specimens (in increasing order)
m    K_IC     m    K_IC     m    K_IC
1    0.451    10   0.508    19   0.557
2    0.481    11   0.531    20   0.59
3    0.484    12   0.532    21   0.591
4    0.484    13   0.538    22   0.602
5    0.489    14   0.538    23   0.605
6    0.494    15   0.544    24   0.611
7    0.494    16   0.548    25   0.658
8    0.494    17   0.548

Sol.
Null hypothesis $H_0$: The random variable has a Normal distribution.
Alternate hypothesis $H_1$: The random variable does not have the specified distribution.
Level of significance: $\alpha = 0.05$
Critical region (from table): $D_n^{\alpha} = D_{25}^{0.05} = 0.264$
The cumulative frequency of the given data, computed from the step function $S_n(x)$ of the
K-S test, is plotted in the following figure. The theoretical distribution function of the
normal model is also shown.

Fig 2. $S_n(x)$ and $F(x)$
From the figure, the maximum discrepancy between the two functions is $D_{\max} = 0.1348$,
occurring at $K_{IC} = 0.5080$ MPa√m. The maximum discrepancy 0.1348 is less than the critical
value 0.264. Therefore the null hypothesis cannot be rejected at the 5% significance level.
Thus, $N(0.540, 0.051)$ is a valid model at the 5% significance level.
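The quoted maximum discrepancy can be reproduced at the single point where it occurs. The sketch assumes that the second parameter of $N(0.540, 0.051)$ is the standard deviation (the result below is consistent with that reading) and uses only the tabulated fact that the 10th of the 25 ordered values is 0.508; the variable names are hypothetical.

```python
from statistics import NormalDist

# Assumed reading of N(0.540, 0.051): mean 0.540, standard deviation 0.051.
model = NormalDist(mu=0.540, sigma=0.051)

n = 25      # sample size
rank = 10   # rank of the observation K_IC = 0.508 in the ordered sample
x = 0.508

s_n = rank / n                   # empirical step function at x
f = model.cdf(x)                 # theoretical CDF at x
discrepancy = abs(s_n - f)

critical = 0.264                 # tabulated D_25 at alpha = 0.05 (from the text)
print(round(discrepancy, 4), discrepancy < critical)
```

A full test would scan every ordered observation in the same way and keep the largest discrepancy; this check only confirms the value read from the figure.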
8.2.2 Kolmogorov-Smirnov two-sample test
The test used in the one-sample case can also be used to evaluate whether two samples come
from the same distribution.
Let the two empirical distribution functions be represented as step functions $G_m(x)$ and
$S_n(x)$, based on two samples of sizes $m$ and $n$, respectively, and let the maximum
absolute difference between them be $D_{m,n}$:

$$ D_{m,n} = \max_x \left| G_m(x) - S_n(x) \right| $$

Fig 3. Kolmogorov-Smirnov two-sample goodness-of-fit test

If the sample sizes $m$ and $n$ are large, Smirnov has given the limiting distribution as

$$ \lim_{m,n \to \infty} P\left(\sqrt{\frac{mn}{m+n}}\,D_{m,n} \le z\right) = \frac{\sqrt{2\pi}}{z} \sum_{k=1}^{\infty} \exp\left[-\frac{(2k-1)^2 \pi^2}{8 z^2}\right] $$

(that is, the one-sample result with $n$ replaced by $mn/(m+n)$).

Problem on two-sample test
Q. The modulus of rupture data for two different groups of timber are given in the tables
below. A supplier delivers the timber in two lots. The first lot consists of 50 samples and the
second lot consists of 30 samples. Both lots were supplied by the same supplier, who claims
the second lot to be superior to the first. Apply the Kolmogorov-Smirnov two-sample test
to verify whether the two samples are of the same type (from the same population).



LOT A (Modulus of Rupture in N/mm^2)
35.3 33.18 30.05 32.68 26.63
36.85 36.81 36.38 34.44 23.25
27.9 38.81 37.78 35.88 28.46
24.55 29.9 35.03 37.51 30.33
28.71 17.83 34.63 33.47 38.05
31.33 23.15 33.06 32.48 34.56
23.37 27.93 36.47 34.12 36
23.56 30.02 38.64 35.58 37.65
28 33.71 28.98 36.92 28.83
25.39 28.76 32.02 33.61 32.4

LOT B (Modulus of Rupture in N/mm^2)
33.19 34.4 28.97 35.89
28.69 36.53 35.17 39.33
37.69 31.6 38.71 29.11
25.88 22.87 32.76 34.49
27.11 36.88 25.19 38
29.93 32.03 25.84 35.67
33.92 38.16 28.13 30.53
33.14 39.2

Sol.
Null hypothesis $H_0$: The random variable sampled by the first 50 values and the random
variable sampled by the next 30 values have the same distribution.
Alternate hypothesis $H_1$: The random variables have different distributions.
Level of significance: $\alpha = 0.05$.
Calculations: The data in each sample are sorted in increasing order and ranked separately,
and the step functions $G_m(x)$ and $S_n(x)$ are determined as shown in the following tables.
The step functions of both samples are then plotted, and the maximum absolute difference
between the two empirical distributions is determined.
Lot A (m = 50); MR = modulus of rupture
Rank   G_m(x)     MR at A    Rank   G_m(x)     MR at A
1      0.02       17.83      26     0.52       33.06
2      0.04       23.15      27     0.54       33.18
3      0.06       23.25      28     0.56       33.47
4      0.08       23.37      29     0.58       33.61
...
22     0.44       32.02      47     0.94       37.78
23     0.46       32.4       48     0.96       38.05
24     0.48       32.48      49     0.98       38.64
25     0.5        32.68      50     1          38.81

Lot B (n = 30)
Rank   S_n(x)     MR at B    Rank   S_n(x)     MR at B
1      0.033333   22.87      16     0.533333   33.19
2      0.066667   25.19      17     0.566667   33.92
3      0.1        25.84      18     0.6        34.4
...
12     0.4        31.6       27     0.9        38.16
13     0.433333   32.03      28     0.933333   38.71
14     0.466667   32.76      29     0.966667   39.2
15     0.5        33.14      30     1          39.33

The maximum difference is $D = 0.12$, and $\frac{mn}{m+n} = \frac{50 \times 30}{50 + 30} = 18.75$.
Fig 4. S(x) and G(x)
The critical value is 0.301 (from table).
As 0.12 is less than 0.301, the null hypothesis cannot be rejected at 5% significance level.
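The maximum difference can be recomputed from the two lots as tabulated. This is a minimal sketch with hypothetical function names; it relies on the fact that both empirical functions are right-continuous step functions, so their largest difference is attained at one of the observed values. The large-sample coefficient 1.36 is applied with $mn/(m+n)$ purely for comparison, since the samples here are not large.

```python
import bisect
import math

# Modulus of rupture (N/mm^2), as tabulated for the two lots.
lot_a = [
    35.3, 33.18, 30.05, 32.68, 26.63, 36.85, 36.81, 36.38, 34.44, 23.25,
    27.9, 38.81, 37.78, 35.88, 28.46, 24.55, 29.9, 35.03, 37.51, 30.33,
    28.71, 17.83, 34.63, 33.47, 38.05, 31.33, 23.15, 33.06, 32.48, 34.56,
    23.37, 27.93, 36.47, 34.12, 36, 23.56, 30.02, 38.64, 35.58, 37.65,
    28, 33.71, 28.98, 36.92, 28.83, 25.39, 28.76, 32.02, 33.61, 32.4,
]
lot_b = [
    33.19, 34.4, 28.97, 35.89, 28.69, 36.53, 35.17, 39.33,
    37.69, 31.6, 38.71, 29.11, 25.88, 22.87, 32.76, 34.49,
    27.11, 36.88, 25.19, 38, 29.93, 32.03, 25.84, 35.67,
    33.92, 38.16, 28.13, 30.53, 33.14, 39.2,
]


def ecdf_value(ordered, x):
    """Fraction of the (already sorted) sample that is <= x."""
    return bisect.bisect_right(ordered, x) / len(ordered)


def ks_two_sample(sample_1, sample_2):
    """Maximum absolute difference between the two empirical step functions."""
    s1, s2 = sorted(sample_1), sorted(sample_2)
    return max(abs(ecdf_value(s1, x) - ecdf_value(s2, x)) for x in s1 + s2)


m, n = len(lot_a), len(lot_b)
d_mn = ks_two_sample(lot_a, lot_b)
effective_n = m * n / (m + n)
approx_critical = 1.36 / math.sqrt(effective_n)  # large-sample approximation

print(round(d_mn, 4), effective_n, round(approx_critical, 3))
```

The computed difference agrees with the value of 0.12 used above. The large-sample approximation gives a critical value of about 0.31, close to the tabulated 0.301, and the decision is the same with either.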
10. Concluding Remarks
In this lecture, commonly used goodness-of-fit tests such as the Chi-square ($\chi^2$) test,
the Kolmogorov-Smirnov (K-S) test and the Anderson-Darling test are discussed. Example
problems using these tests are also presented. The next lecture introduces regression
analysis and correlation.
