Let X be the number of successes in n independent trials with constant probability p of success, so that E(X) = np and V(X) = npq, where q = 1 - p. For large n,

$$\frac{X - E(X)}{\sqrt{V(X)}} = \frac{X - np}{\sqrt{npq}} \sim N(0,1) \quad \text{approximately.}$$

For the sample proportion $P = X/n$:

$$E(P) = E\left(\frac{X}{n}\right) = \frac{1}{n}E(X) = \frac{np}{n} = p$$

$$V(P) = V\left(\frac{X}{n}\right) = \frac{1}{n^{2}}V(X) = \frac{npq}{n^{2}} = \frac{pq}{n}$$

$$S.E.(P) = \sqrt{\frac{pq}{n}}$$

Since X, and consequently $P = X/n$, is asymptotically normal for large n, the normal test for the proportion of successes becomes

$$z = \frac{P - E(P)}{\sqrt{V(P)}} = \frac{P - p}{\sqrt{pq/n}} \sim N(0,1).$$
From this result, for a given level of significance $\alpha$,

$$\Pr\left[-z_{\alpha/2} \le \frac{P - p}{\sqrt{pq/n}} \le z_{\alpha/2}\right] = 1 - \alpha$$

$$\Pr\left[P - z_{\alpha/2}\sqrt{\frac{pq}{n}} \le p \le P + z_{\alpha/2}\sqrt{\frac{pq}{n}}\right] = 1 - \alpha$$

Hence the $(1 - \alpha)100\%$ confidence interval for p, based on $P = X/n$ as an estimator of p, is given by

$$\frac{X}{n} \pm z_{\alpha/2}\sqrt{\frac{pq}{n}}$$
Maximum error of the estimate:

$$E = z_{\alpha/2}\sqrt{\frac{pq}{n}}$$
If no information is available for p, we can take p = 1/2, so that the required sample size is

$$n = \frac{1}{4}\left(\frac{z_{\alpha/2}}{E}\right)^{2}$$
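As a quick illustration of these formulas, here is a small Python sketch; the success count, sample size, confidence level and target error below are made-up numbers, not values from the text:

```python
import math
from scipy.stats import norm  # for the z critical value

# Hypothetical data: x successes out of n trials (illustrative numbers only)
x, n = 136, 400
p_hat = x / n                      # sample proportion P = X/n
q_hat = 1 - p_hat
z = norm.ppf(1 - 0.05 / 2)         # z_{alpha/2} for a 95% interval (about 1.96)

se = math.sqrt(p_hat * q_hat / n)  # S.E.(P) = sqrt(pq/n)
ci = (p_hat - z * se, p_hat + z * se)
print("95% CI for p:", ci)

# Maximum error of the estimate E = z_{alpha/2} * sqrt(pq/n)
E = z * se
print("maximum error:", E)

# Sample size needed for a target maximum error when nothing is known about p
# (take p = 1/2, so n = (1/4)(z/E)^2, rounded up)
target_E = 0.03
n_required = math.ceil(0.25 * (z / target_E) ** 2)
print("required sample size:", n_required)
```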
Test of Significance for a Single Proportion

Null hypothesis H0: P = P0
Alternative hypothesis H1: P ≠ P0
Since n is large, the sampling distribution of P is approximately normal.
If H0 is true, the test statistic

$$z = \frac{P - P_{0}}{\sqrt{P_{0}Q_{0}/n}}$$

has approximately a standard normal distribution, where P = X/n is the sample proportion and Q0 = 1 - P0. The alternative hypothesis is P ≠ P0, i.e. (P > P0 or P < P0).
The critical region for H1: P > P0 and H1: P < P0 is given below:

Level of significance        1%           5%            10%
Critical region (P > P0)     Z > 2.33     Z > 1.645     Z > 1.28
Critical region (P < P0)     Z < -2.33    Z < -1.645    Z < -1.28
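A minimal Python sketch of this one-sample proportion test, assuming hypothetical values for x, n and P0 (they are not from the text) and a right-tailed alternative at the 5% level:

```python
import math
from scipy.stats import norm

# Hypothetical example: x successes in n trials, testing H0: P = P0
x, n, P0 = 230, 400, 0.5
p_sample = x / n                      # sample proportion
Q0 = 1 - P0

# Test statistic z = (P - P0) / sqrt(P0*Q0/n)
z = (p_sample - P0) / math.sqrt(P0 * Q0 / n)

# Right-tailed test (H1: P > P0) at the 5% level: critical value about 1.645
z_crit = norm.ppf(0.95)
print("z =", round(z, 3), " critical value =", round(z_crit, 3))
if z > z_crit:
    print("Reject H0: the proportion is significantly greater than P0.")
else:
    print("Do not reject H0.")
```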
For testing the difference between two proportions, the null hypothesis is H0: P1 = P2.

Estimated overall proportion of success in the two populations (pooled estimate):

$$\hat{p} = \frac{n_{1}p_{1} + n_{2}p_{2}}{n_{1} + n_{2}}, \qquad \hat{q} = 1 - \hat{p}$$

Standard error of $(p_{1} - p_{2})$:

$$S.E.(p_{1} - p_{2}) = \sqrt{\hat{p}\hat{q}\left(\frac{1}{n_{1}} + \frac{1}{n_{2}}\right)}$$

Test statistic:

$$z = \frac{p_{1} - p_{2}}{S.E.(p_{1} - p_{2})} = \frac{p_{1} - p_{2}}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{n_{1}} + \dfrac{1}{n_{2}}\right)}}$$

where $p_{1}$ and $p_{2}$ are the sample proportions.
The critical region for the alternative hypotheses H1: P1 > P2 and H1: P1 < P2 is given below:

Level of significance         1%           5%            10%
Critical region (P1 > P2)     Z > 2.33     Z > 1.645     Z > 1.28
Critical region (P1 < P2)     Z < -2.33    Z < -1.645    Z < -1.28
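A similar Python sketch for the two-proportion test, again with made-up sample figures for illustration:

```python
import math
from scipy.stats import norm

# Hypothetical samples: x1 successes out of n1 trials, x2 out of n2
x1, n1 = 120, 400
x2, n2 = 240, 900
p1, p2 = x1 / n1, x2 / n2

# Pooled estimate of the common proportion and its complement
p_hat = (n1 * p1 + n2 * p2) / (n1 + n2)   # same as (x1 + x2)/(n1 + n2)
q_hat = 1 - p_hat

# Standard error of (p1 - p2) and the test statistic
se = math.sqrt(p_hat * q_hat * (1 / n1 + 1 / n2))
z = (p1 - p2) / se

# Two-tailed test at the 5% level
z_crit = norm.ppf(1 - 0.05 / 2)
print("z =", round(z, 3), " |z| critical =", round(z_crit, 3))
print("Reject H0: P1 = P2" if abs(z) > z_crit else "Do not reject H0")
```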
Chi-Square Test
In the last chapter, we learned how to test hypotheses using data from
either one or two samples.
We used one-sample tests to determine whether a mean or a proportion
was significantly different from a hypothesized value.
In the two sample tests, we examined the difference between either two
means or two proportions and tried to learn whether this difference was
significant or not.
Suppose we have proportions from four populations instead of only two. In this case, the method for comparing proportions described earlier does not apply, and we need the chi-square test.
Definition: if $x_{1}, x_{2}, x_{3}, \ldots, x_{n}$ are n independent normal variates with means $\mu_{1}, \mu_{2}, \mu_{3}, \ldots, \mu_{n}$ and standard deviations $\sigma_{1}, \sigma_{2}, \ldots, \sigma_{n}$, then

$$\chi^{2} = \sum_{i=1}^{n}\left(\frac{x_{i} - \mu_{i}}{\sigma_{i}}\right)^{2}$$

is a chi-square variate with n degrees of freedom.

If $O_{1}, O_{2}, O_{3}, \ldots, O_{n}$ are the frequencies with which the events $A_{1}, A_{2}, A_{3}, \ldots, A_{n}$ actually occur, and $E_{1}, E_{2}, E_{3}, \ldots, E_{n}$ are the frequencies with which they are expected to occur, then $O_{1}, O_{2}, \ldots, O_{n}$ are called the observed frequencies and $E_{1}, E_{2}, \ldots, E_{n}$ are called the expected frequencies, and

$$\chi^{2} = \frac{(O_{1} - E_{1})^{2}}{E_{1}} + \frac{(O_{2} - E_{2})^{2}}{E_{2}} + \cdots + \frac{(O_{n} - E_{n})^{2}}{E_{n}} = \sum_{i=1}^{n}\frac{(O_{i} - E_{i})^{2}}{E_{i}}$$

follows approximately the chi-square distribution with (n - 1) degrees of freedom.
Example: A sample of 10,000 digits taken from a directory gave the frequencies shown in the table below for the digits 0, 1, 2, ..., 9. Test whether the digits occur equally frequently in the directory.

Solution: Here we set up the null hypothesis that the digits occur equally frequently in the directory.

Under the null hypothesis, the expected frequency for each of the digits 0, 1, 2, ..., 9 is 10000/10 = 1000. The value of chi-square is computed as follows.
Calculation for χ²

Digit    Observed frequency (O)    Expected frequency (E)    (O-E)²     (O-E)²/E
0        1026                      1000                      676        0.676
1        1107                      1000                      11449      11.449
2        997                       1000                      9          0.009
3        966                       1000                      1156       1.156
4        1075                      1000                      5625       5.625
5        933                       1000                      4489       4.489
6        1107                      1000                      11449      11.449
7        972                       1000                      784        0.784
8        964                       1000                      1296       1.296
9        853                       1000                      21609      21.609
Total    10000                     10000                                58.542

$$\chi^{2} = \sum_{i=1}^{10}\frac{(O_{i} - E_{i})^{2}}{E_{i}} = 58.542$$
Here df = 10 - 1 = 9 and the tabulated value is $\chi^{2}_{9}(0.05) = 16.919$. Since the calculated χ² (58.542) is much greater than the tabulated value, we reject the null hypothesis. Thus we conclude that the digits are not uniformly distributed in the directory.
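The same calculation can be verified in Python; scipy.stats.chisquare computes exactly Σ(O - E)²/E from the observed and expected frequencies of the table above:

```python
from scipy.stats import chisquare

# Observed digit frequencies from the table above; expected 1000 for each digit
observed = [1026, 1107, 997, 966, 1075, 933, 1107, 972, 964, 853]
expected = [1000] * 10

stat, p_value = chisquare(observed, f_exp=expected)  # chi2 = sum((O-E)^2/E)
print("chi-square =", round(stat, 3))   # about 58.542
print("p-value =", p_value)             # far below 0.05, so reject H0
```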
Example: The Theory predicts the proportion of beans in the four groups A, B,
C and D should be 9:3:3:1. In an experiment among 1600 beans, the
numbers in the four groups were 882, 313, 287, and 118. Does the
experimental result support the theory? (Given $\chi^{2}_{3}(0.05) = 7.815$.)
Solution: Null Hypothesis: Theory fits well into the experiment, i.e. the
experimental result supports the theory.
Under the null hypothesis, the expected frequencies can be computed as follows:
Total number of beans = 882+313+287+118 = 1600
These are to be divided in the ratio 9:3:3:1
$$E(882) = N p_{1} = 1600 \times \frac{9}{16} = 900, \quad E(313) = 1600 \times \frac{3}{16} = 300, \quad E(287) = 1600 \times \frac{3}{16} = 300, \quad E(118) = 1600 \times \frac{1}{16} = 100$$
$$\chi^{2} = \frac{(O_{1} - E_{1})^{2}}{E_{1}} + \frac{(O_{2} - E_{2})^{2}}{E_{2}} + \frac{(O_{3} - E_{3})^{2}}{E_{3}} + \frac{(O_{4} - E_{4})^{2}}{E_{4}} = \frac{(882 - 900)^{2}}{900} + \frac{(313 - 300)^{2}}{300} + \frac{(287 - 300)^{2}}{300} + \frac{(118 - 100)^{2}}{100} = 4.7266$$
df = 4 - 1 = 3 and tabulated $\chi^{2}_{3}(0.05) = 7.815$. Since the calculated value of χ² is less than the tabulated value, the null hypothesis is accepted at the 5% level of significance and we may conclude that the experimental results support the theory.
Example: A survey of 320 families with 5 children each revealed the following
distribution.
Number of boys        5     4     3      2     1     0
Number of girls       0     1     2      3     4     5
Number of families    14    56    110    88    40    12
Is the result consistent with the hypothesis that male and female births are equally probable? (Or: fit a binomial distribution and test the goodness of fit.) Given $\chi^{2}_{5}(0.05) = 11.07$.
Solution: Let the null hypothesis be that male and female births are equally probable, i.e. p = q = 1/2, where p is the probability of a male birth and q the probability of a female birth.
Under the null hypothesis, the probability of r male births in a family of 5 children is

$$P(r) = \binom{5}{r} p^{r} q^{5-r} = \binom{5}{r}\left(\frac{1}{2}\right)^{5} = \frac{1}{32}\binom{5}{r}$$

and the expected frequency of r male births is N × P(r). For r = 0,

$$f(0) = N\binom{5}{0} p^{0} q^{5} = 320 \times 1 \times 1 \times \frac{1}{32} = 10$$
Similarly,
The expected frequency of 1 male birth is f(1) = 50
The expected frequency of 2 male births is f(2) = 100
The expected frequency of 3 male births is f(3) = 100
The expected frequency of 4 male births is f(4) = 50
The expected frequency of 5 male births is f(5) = 10
Calculation for χ²

No. of boys    Observed frequency (O)    Expected frequency (E)    (O-E)²    (O-E)²/E
5              14                        10                        16        1.60
4              56                        50                        36        0.72
3              110                       100                       100       1.00
2              88                        100                       144       1.44
1              40                        50                        100       2.00
0              12                        10                        4         0.40
Total          320                       320                                 7.16

$$\chi^{2} = \sum_{i}\frac{(O_{i} - E_{i})^{2}}{E_{i}} = 7.16$$
df = 6 - 1 = 5 and tabulated $\chi^{2}_{5}(0.05) = 11.07$.
Calculated value of chi-square is less than tabulated value. Hence we accept null
hypothesis. Thus, we may conclude that the male and female births are equally
probable.
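A Python sketch of this goodness-of-fit calculation, building the binomial expected frequencies with p = 1/2 and then the chi-square statistic from the data of the example:

```python
from math import comb
from scipy.stats import chisquare, chi2

# Observed number of families by number of boys (5, 4, 3, 2, 1, 0 boys)
boys = [5, 4, 3, 2, 1, 0]
observed = [14, 56, 110, 88, 40, 12]
N = sum(observed)            # 320 families

# Expected frequencies under p = q = 1/2: N * C(5, r) * (1/2)^5
expected = [N * comb(5, r) * 0.5 ** 5 for r in boys]   # [10, 50, 100, 100, 50, 10]

stat, _ = chisquare(observed, f_exp=expected)
critical = chi2.ppf(0.95, df=len(observed) - 1)        # chi2_5(0.05) = 11.07
print("chi-square =", round(stat, 2), " critical value =", round(critical, 2))
print("Accept H0" if stat < critical else "Reject H0")
```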
Example: A die is thrown 132 times with the following results. Test the hypothesis that the die is unbiased.

No. turned up             1       2       3       4       5       6       Total
Observed frequency (O)    16      20      25      14      29      28      132
Expected frequency (E)    22      22      22      22      22      22      132
(O-E)²                    36      4       9       64      49      36
(O-E)²/E                  1.64    0.18    0.41    2.91    2.23    1.64    9.01

$$\chi^{2} = \sum_{i}\frac{(O_{i} - E_{i})^{2}}{E_{i}} = 9.01$$

df = 6 - 1 = 5 and tabulated $\chi^{2}_{5}(0.05) = 11.07$. Since the calculated chi-square is less than the tabulated value, we accept the null hypothesis and conclude that the die is unbiased.
The chi-square statistic may also be written

$$\chi^{2} = \sum_{i=1}^{k}\frac{(n_{i} - np_{i})^{2}}{np_{i}}, \qquad (i = 1, 2, \ldots, k), \qquad \sum_{i=1}^{k} n_{i} = n,$$

where $n_{i}$ is the observed frequency in the ith class and $np_{i}$ the corresponding expected frequency.

Test of several proportions: suppose $X_{i}$ denotes the number of successes in the ith of k independent samples of sizes $n_{1}, n_{2}, \ldots, n_{k}$, so that $X_{i}$ has mean $n_{i}p_{i}$ and variance $n_{i}p_{i}(1 - p_{i})$. In practice we substitute estimates for the $p_{i}$, which under the null hypothesis are all equal; the pooled estimate is

$$\hat{p} = \frac{x_{1} + x_{2} + \cdots + x_{k}}{n_{1} + n_{2} + \cdots + n_{k}}$$

and

$$\sum_{i=1}^{k}\frac{(x_{i} - n_{i}\hat{p})^{2}}{n_{i}\hat{p}(1 - \hat{p})} \sim \chi^{2} \quad \text{with } (k - 1) \text{ degrees of freedom.}$$
The data can be arranged in the following table:

Sample     1          2          ...    k          Total
Success    x1         x2         ...    xk         x
Failure    n1 - x1    n2 - x2    ...    nk - xk    n - x
Total      n1         n2         ...    nk         n
Here x and n, respectively, represent the total number of successes and the total number of trials for all samples combined. The entry in the cell belonging to the ith row and jth column is called the observed cell frequency $O_{ij}$, with i = 1, 2 and j = 1, 2, ..., k.

Under the null hypothesis $H_{0}: p_{1} = p_{2} = \cdots = p_{k} = p$, we estimate $\hat{p} = x/n$. Hence the expected numbers of successes and failures for the jth sample are estimated by

$$e_{1j} = n_{j}\hat{p} = \frac{n_{j}\,x}{n}, \qquad e_{2j} = n_{j}(1 - \hat{p}) = \frac{n_{j}(n - x)}{n}$$

The quantities $e_{1j}$ and $e_{2j}$ are the expected cell frequencies, and

$$\chi^{2} = \sum_{i=1}^{2}\sum_{j=1}^{k}\frac{(O_{ij} - e_{ij})^{2}}{e_{ij}} \sim \chi^{2} \quad \text{with } (k - 1) \text{ degrees of freedom.}$$
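A short Python sketch of this statistic for three hypothetical samples (the success counts and sample sizes are invented for illustration); it also checks the equivalent cell-by-cell form from the 2 × k table:

```python
# Hypothetical data: successes x_j out of n_j trials in k = 3 independent samples
x = [45, 60, 52]
n = [100, 120, 110]
k = len(x)

# Pooled estimate of the common proportion under H0: p1 = p2 = ... = pk
p_hat = sum(x) / sum(n)
q_hat = 1 - p_hat

# chi-square = sum (x_j - n_j*p_hat)^2 / (n_j*p_hat*(1 - p_hat)), with k - 1 df
chi_sq = sum((xj - nj * p_hat) ** 2 / (nj * p_hat * q_hat) for xj, nj in zip(x, n))
print("chi-square =", round(chi_sq, 4), "with", k - 1, "degrees of freedom")

# Equivalent computation from the 2 x k table of observed vs expected cell counts
chi_sq_cells = 0.0
for xj, nj in zip(x, n):
    e1, e2 = nj * p_hat, nj * q_hat          # expected successes and failures
    chi_sq_cells += (xj - e1) ** 2 / e1 + ((nj - xj) - e2) ** 2 / e2
print("from the 2 x k table:", round(chi_sq_cells, 4))   # same value
```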
Contingency Table
Let the data be classified into classes $A_{1}, A_{2}, \ldots, A_{s}$ according to one attribute and into classes $B_{1}, B_{2}, \ldots, B_{t}$ according to another. The observed frequencies can be arranged in an s × t contingency table:

Classes    B1      B2      ...    Bj      ...    Bt      Total
A1         O11     O12     ...    O1j     ...    O1t     (A1)
A2         O21     O22     ...    O2j     ...    O2t     (A2)
:          :       :              :              :       :
Ai         Oi1     Oi2     ...    Oij     ...    Oit     (Ai)
:          :       :              :              :       :
As         Os1     Os2     ...    Osj     ...    Ost     (As)
Total      (B1)    (B2)    ...    (Bj)    ...    (Bt)    N

Here $O_{ij}$ is the observed frequency in the cell belonging to the ith row and jth column, $(A_{i})$ is the total of the ith row, $(B_{j})$ the total of the jth column, and N the total number of observations.
Example: Suppose that in four regions, the National Health Care Company samples its hospital employees' attitudes toward job-performance reviews. Respondents are given a choice between the present method and a proposed new method. Here the data are classified into two classes (preference for the present method or the proposed new method) and into four classes according to geographical region (Northeast, Southeast, Central, West Coast). The sample responses are given as follows:
Observed Frequency Table

Region        Prefer present method    Prefer new method    Total
Northeast     68                       32                   100
Southeast     75                       45                   120
Central       57                       33                   90
West Coast    79                       31                   110
Total         279                      141                  420
The expected frequency for each cell is

Expected frequency = (Row total for that cell × Column total for that cell) / Total number of observations

Expected Frequency Table

Region        Prefer present method    Prefer new method    Total
Northeast     66.43                    33.57                100
Southeast     79.72                    40.28                120
Central       59.79                    30.21                90
West Coast    73.07                    36.93                110
Total         279                      141                  420
Degrees of Freedom
Degrees of freedom is the total number of observations minus the number of independent constraints (restrictions) imposed on the observations.
In the above table (Contingency Table) there are in all (s * t) cells but since the
marginal totals are fixed there are (s + t) constraints. These constraints are,
however, not independent since sum of the border column frequencies must be
equal to that of the border row frequencies and thus there are only (s + t -1)
independent linear constraints. Hence the number of degrees of freedom,
associated with a (s * t) contingency table is
Degrees of freedom = (s * t)- (s + t -1) = (s - 1) (t - 1)
= (Number of rows - 1) (Number of columns - 1)
For example, a contingency table with 2 rows and 4 columns has the following layout of row and column totals:

         Column 1    Column 2    Column 3    Column 4    TOTAL
Row 1                                                    RT1
Row 2                                                    RT2
TOTAL    CT1         CT2         CT3         CT4
The critical value $\chi^{2}_{\alpha}(n)$ of the chi-square distribution with n degrees of freedom is defined by $P[\chi^{2} \ge \chi^{2}_{\alpha}(n)] = \alpha$, as shown in the following figure.
Step 1. Calculate the expected frequencies. In general, the expected frequency for any cell can be calculated as follows:

e = (Row total for that cell × Column total for that cell) / Total number of observations

Step 2. Obtain the difference between the observed and expected frequencies and find the square of these differences, i.e. find (o - e)².

Step 3. Divide the quantity (o - e)² by the corresponding expected frequency to get (o - e)²/e.

Step 4. Find the sum of the (o - e)²/e values, i.e. Σ(o - e)²/e; this is the χ² value.
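For the National Health Care data above, the whole procedure can be carried out in Python; scipy.stats.chi2_contingency returns the chi-square value, the degrees of freedom and the table of expected frequencies:

```python
from scipy.stats import chi2, chi2_contingency

# Observed frequencies: rows = regions (Northeast, Southeast, Central, West Coast),
# columns = (prefer present method, prefer new method)
observed = [[68, 32],
            [75, 45],
            [57, 33],
            [79, 31]]

chi_sq, p_value, dof, expected = chi2_contingency(observed)
print("chi-square =", round(chi_sq, 3))   # roughly 2.76 for these data
print("degrees of freedom =", dof)        # (4 - 1)(2 - 1) = 3
print("expected frequencies:")
print(expected.round(2))                  # matches the expected frequency table above

# Compare with the tabulated chi-square value at the 5% level of significance
critical = chi2.ppf(0.95, df=dof)
print("critical value =", round(critical, 3))
print("Accept H0 (attitude independent of region)" if chi_sq < critical else "Reject H0")
```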
**Note**
It may be noted that the χ²-test depends only on the set of observed and expected frequencies and on the degrees of freedom. It does not make any assumptions regarding the parent population from which the observations are taken. Since χ² does not involve any population parameters, the test is known as a non-parametric test.
Test of Homogeneity
Chi-Square Test for a Population Variance

If s² is the variance of a random sample of size n drawn from a normal population with variance σ², then

$$\chi^{2} = \frac{(n - 1)s^{2}}{\sigma^{2}}$$

follows the chi-square distribution with (n - 1) degrees of freedom.

For example, with a sample of size n = 20, sample variance s² = 25 and hypothesized variance σ0² = 64,

$$\chi^{2} = \frac{(n - 1)s^{2}}{\sigma_{0}^{2}} = \frac{19 \times 25}{64} = \frac{475}{64} = 7.42$$

df = 20 - 1 = 19 and tabulated $\chi^{2}_{19}(0.05) = 30.14$; since the calculated value is less than the tabulated value, H0: σ² = 64 is not rejected.

The same distribution gives confidence limits for the population variance. With n = 20 and s² = 12.2, the 90% confidence limits are

Lower confidence limit L = (n - 1)s² / χ²₀.₀₅(19) = (20 - 1) × 12.2 / 30.144 = 7.69

Upper confidence limit U = (n - 1)s² / χ²₀.₉₅(19) = (20 - 1) × 12.2 / 10.117 = 22.91
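A Python sketch reproducing these numbers, with scipy.stats.chi2 supplying the percentage points:

```python
from scipy.stats import chi2

n, s2 = 20, 12.2          # sample size and sample variance from the example
df = n - 1

# 90% confidence limits for the population variance:
# L = (n-1)s^2 / chi2_{0.05,19},  U = (n-1)s^2 / chi2_{0.95,19}
chi_upper = chi2.ppf(0.95, df)   # about 30.144
chi_lower = chi2.ppf(0.05, df)   # about 10.117
L = df * s2 / chi_upper
U = df * s2 / chi_lower
print("90% CI for sigma^2: (", round(L, 2), ",", round(U, 2), ")")   # about (7.69, 22.91)

# Chi-square test of H0: sigma^2 = 64 with s^2 = 25, n = 20
stat = (20 - 1) * 25 / 64
print("chi-square =", round(stat, 2),
      " tabulated chi2_19(0.05) =", round(chi2.ppf(0.95, 19), 2))
```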
The difference between two sample means can be studied through the standard error of the difference of the means of the two samples, or through Student's t-test, but the difficulty arises when we have to examine the significance of the difference between more than two sample means at once. Analysis of variance helps us to test whether more than two population means can be considered to be equal.
Analysis of variance will enable us to test for the significance of the
differences between more than two sample means.
Using analysis of variance, we will be able to make inferences about whether our samples are drawn from populations having the same mean.
Sir R. A. Fisher originated the technique of analysis of variance.
The analysis of variance is essentially a technique for testing the difference
between groups of data for homogeneity. It is a method of analyzing the
variance to which a response is subject into its various components
corresponding to the various sources of variation. There may be variation
between the samples or there may be variation within the sample items. Thus,
the technique of analysis of variance consists in splitting the variance for
analytical purposes into its various components. Normally the variance (or
what can be called as the Total variance) is divided into two parts:
1. Variance between samples,
2. Variance within samples; such that
Total variance = Variance between samples + Variance within samples
Three steps in analysis of variance
Analysis of variance consists of three different steps.
1. Determine first estimate of the population variance from the variance
among (between) the sample means.
2. Determine second estimate of the population variance from the
variance within the sample.
3. Compare these two estimates. If they are approximately equal in value, accept the null hypothesis.
Assumption
In order to use analysis of variance, we must assume that each of the samples is drawn from a normal population and that each of these populations has the same variance.
The mean of each sample is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{i} = \frac{x_{1} + x_{2} + x_{3} + \cdots + x_{n}}{n}$$

For the three samples (training methods), the sample means are

$$\bar{x}_{1} = 85/5 = 17, \qquad \bar{x}_{2} = 105/5 = 21, \qquad \bar{x}_{3} = 114/6 = 19$$

and the grand mean is $\bar{\bar{x}} = (85 + 105 + 114)/(5 + 5 + 6) = 304/16 = 19$.
The first step in the analysis of variance indicates that we must obtain one estimate of the population variance from the variance among the three sample means. This estimate is called the between-column variance.

As we know, the sample variance is given by

$$s^{2} = \frac{\sum (x - \bar{x})^{2}}{n - 1}$$

Now, because we are working with three sample means and a grand mean, let us substitute $\bar{x}$ for x, $\bar{\bar{x}}$ (the grand mean) for $\bar{x}$, and k (the number of samples) for n, to get a formula for the variance among the sample means:

$$s_{\bar{x}}^{2} = \frac{\sum (\bar{x} - \bar{\bar{x}})^{2}}{k - 1}$$

From the sampling distribution of the mean we know that $\sigma_{\bar{x}}^{2} = \sigma^{2}/n$, that is, $\sigma^{2} = n\,\sigma_{\bar{x}}^{2}$, where $\sigma_{\bar{x}}^{2}$ is the variance among the sample means. We do not know $\sigma_{\bar{x}}^{2}$, but we can calculate the variance among the three sample means, $s_{\bar{x}}^{2}$. So, substituting $s_{\bar{x}}^{2}$ for $\sigma_{\bar{x}}^{2}$, the estimated population variance becomes $\hat{\sigma}^{2} = n \times s_{\bar{x}}^{2}$, which with unequal sample sizes is written

$$\hat{\sigma}_{b}^{2} = \frac{\sum n_{j}(\bar{x}_{j} - \bar{\bar{x}})^{2}}{k - 1}$$

where $n_{j}$ is the size of the jth sample.
Calculation of the between-column variance:

n    x̄     x̿     x̄ - x̿          n(x̄ - x̿)²
5    17    19    17 - 19 = -2    5 × (-2)² = 20
5    21    19    21 - 19 = 2     5 × (2)² = 20
6    19    19    19 - 19 = 0     6 × (0)² = 0
                                 Σ n(x̄ - x̿)² = 40

$$\hat{\sigma}_{b}^{2} = \frac{\sum n_{j}(\bar{x}_{j} - \bar{\bar{x}})^{2}}{k - 1} = \frac{40}{3 - 1} = \frac{40}{2} = 20$$
As we have assumed that the variances of our three populations are the same, we could use any one of the three sample variances ($s_{1}^{2}$, $s_{2}^{2}$ and $s_{3}^{2}$) as the second estimate of the population variance. However, we can get a better estimate of the population variance by using a weighted average of all three sample variances.
Second estimate of the population variance (the within-column variance):

$$\hat{\sigma}_{w}^{2} = \sum\left(\frac{n_{j} - 1}{n_{T} - k}\right)s_{j}^{2}$$

where
σ̂w² = within-column variance
nj = size of the jth sample
sj² = sample variance of the jth sample
k = number of samples
nT = total sample size (nT = Σ nj)

For training method 1,

$$s_{1}^{2} = \frac{\sum (x - \bar{x})^{2}}{n - 1} = \frac{70}{4} = 17.5$$
Training Method 2
x - x̄           (x - x̄)²
22 - 21 = 1      1
27 - 21 = 6      36
18 - 21 = -3     9
21 - 21 = 0      0
17 - 21 = -4     16
                 Σ(x - x̄)² = 62

$$s_{2}^{2} = \frac{\sum (x - \bar{x})^{2}}{n - 1} = \frac{62}{5 - 1} = 15.5$$

Training Method 3
x - x̄           (x - x̄)²
18 - 19 = -1     1
24 - 19 = 5      25
19 - 19 = 0      0
16 - 19 = -3     9
22 - 19 = 3      9
15 - 19 = -4     16
                 Σ(x - x̄)² = 60

$$s_{3}^{2} = \frac{\sum (x - \bar{x})^{2}}{n - 1} = \frac{60}{6 - 1} = 12.0$$
$$\hat{\sigma}_{w}^{2} = \sum\left(\frac{n_{j} - 1}{n_{T} - k}\right)s_{j}^{2} = \frac{5 - 1}{16 - 3}s_{1}^{2} + \frac{5 - 1}{16 - 3}s_{2}^{2} + \frac{6 - 1}{16 - 3}s_{3}^{2} = \frac{4}{13}\times 17.5 + \frac{4}{13}\times 15.5 + \frac{5}{13}\times 12.0 = \frac{192}{13} = 14.769$$
$$F = \frac{\hat{\sigma}_{b}^{2}}{\hat{\sigma}_{w}^{2}} = \frac{\text{first estimate of the population variance, based on the variance among the sample means}}{\text{second estimate of the population variance, based on the variance within the samples}}$$

$$F = \frac{\hat{\sigma}_{b}^{2}}{\hat{\sigma}_{w}^{2}} = \frac{20}{14.769} = 1.354$$
The nearer the F- ratio comes to 1, then the more we are inclined to accept the
null hypothesis. Conversely, as the F-ratio becomes larger, we will be more
inclined to reject null hypothesis and accept the alternative hypothesis.
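The whole calculation can be reproduced in Python from the summary figures used above (sample sizes, sample means and within-sample sums of squared deviations); the F critical value from scipy is only a check against the F-table discussed below:

```python
from scipy.stats import f

# Summary figures from the three training methods above
n = [5, 5, 6]                     # sample sizes
means = [17.0, 21.0, 19.0]        # sample means
ss_within = [70.0, 62.0, 60.0]    # sum of (x - xbar)^2 within each sample

k = len(n)
nT = sum(n)
grand_mean = sum(nj * m for nj, m in zip(n, means)) / nT   # 19

# Between-column variance: sum n_j (xbar_j - grand mean)^2 / (k - 1)
var_between = sum(nj * (m - grand_mean) ** 2 for nj, m in zip(n, means)) / (k - 1)   # 20

# Within-column variance: sum (n_j - 1) s_j^2 / (nT - k) = sum SS_j / (nT - k)
var_within = sum(ss_within) / (nT - k)     # 192/13 = 14.769

F = var_between / var_within               # about 1.354
F_crit = f.ppf(0.95, k - 1, nT - k)        # F(2, 13) at the 5% level, about 3.81
print("F =", round(F, 3), " critical value =", round(F_crit, 2))
print("Accept H0" if F < F_crit else "Reject H0")
```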
The F-Distribution
The F-distribution is a skewed distribution. Generally it is skewed to the right and tends to become more symmetrical as the number of degrees of freedom in the numerator and denominator increases. The F-distribution has a single mode. The shape of the distribution depends on the number of degrees of freedom in both the numerator and the denominator of the F-ratio. The first number is the number of degrees of freedom in the numerator of the F-ratio; the second is the number of degrees of freedom in the denominator.
Fig 11.8 Pg 597 Rubin
Degrees of Freedom
As we have mentioned each F-distribution has a pair of degrees of
freedom, one for the numerator of the F-ratio and the other for the
denominator.
While calculating the variance between the sample means, we used three different values of x̄ - x̿, one for each sample, to calculate Σ nj(x̄j - x̿)². Once we knew two of these x̄ - x̿ values, the third was automatically determined and could not be freely specified. Thus, one degree of freedom is lost when we calculate the variance between samples. Hence, the number of degrees of freedom for the numerator of the F-ratio is always one fewer than the number of samples:
Number of degrees of freedom in the numerator of the F-ratio = (k - 1)
For the denominator we calculated the variance within the samples, and we used all three samples. For the jth sample, we used nj values of (x - x̄j) to calculate Σ(x - x̄j)². Once we knew all but one of these (x - x̄j) values, the last was automatically determined and could not be freely specified. Thus, we lost 1 degree of freedom in the calculations for each sample, leaving us with 4, 4 and 5 degrees of freedom in the samples. Because we had three samples, we were left with 4 + 4 + 5 = 13 degrees of freedom, which could also be calculated as 5 + 5 + 6 - 3 = 13. Thus,
Number of degrees of freedom in the denominator of the F-ratio = (nT - k)
The F-Table
For analysis of variance, we shall use an F-table in which the columns represent the number of degrees of freedom for the numerator and the rows represent the degrees of freedom for the denominator. Suppose we are testing a hypothesis at the 0.05 level of significance using the F-distribution, and our degrees of freedom are 2 for the numerator and 13 for the denominator. The value we find in the F-table is 3.81 (first look down the column, then across the row).
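If a printed F-table is not at hand, the same critical value can be obtained with scipy (a one-line check of the value quoted above):

```python
from scipy.stats import f

# Upper 5% point of the F-distribution with 2 and 13 degrees of freedom
print(round(f.ppf(0.95, 2, 13), 2))   # 3.81, as read from the F-table
```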
Critical Value of the F-Distribution
Usually F-tables give the critical value of F for the right-tailed test; the right-tail area is α and determines the critical region. Thus, the significant value $F_{\alpha}(n_{1}, n_{2})$, at level of significance α and degrees of freedom $(n_{1}, n_{2})$, where $n_{1}$ is the number of degrees of freedom in the numerator and $n_{2}$ the number of degrees of freedom in the denominator, is defined by $P[F \ge F_{\alpha}(n_{1}, n_{2})] = \alpha$, as shown in the figure on p. 877 of Gupta and Kapoor.
The hypotheses for the analysis of variance are
H0: μ1 = μ2 = μ3
H1: μ1, μ2, μ3 are not all equal
The calculations can be summarized in an ANOVA table:

Source of variation    Sum of squares                            df         Mean square          Test statistic
Between samples        SSB = Σ nj(x̄j - x̿)²,  j = 1, 2, ..., k    (k - 1)    MSB = SSB/(k - 1)    F = MSB/MSW
Within samples         SSW = ΣΣ(x - x̄j)²                         (N - k)    MSW = SSW/(N - k)
Total                  SST = ΣΣ(x - x̿)²                          (N - 1)
In summary, for a single proportion the hypotheses are H0: P = P0 against H1: P ≠ P0 (i.e. P > P0 or P < P0). For a single mean based on a sample (x1, x2, x3, ..., xn), the sample mean is x̄ = (x1 + x2 + ... + xn)/n and the large-sample test statistic is

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0,1).$$
1. A company manufactures gold weighing balances. It maintains strict quality control over its products and does not release a balance for sale unless the balance shows variability significantly below one microgram (at alpha = 0.01) when weighing quantities of about 500 grams. A new balance has just been delivered to the quality control division from the production line. The new balance is tested by using it to weigh the same 500-gram standard weight 30 different times; the standard deviation turns out to be 0.73 microgram. Should this balance be released for sale? (2001, Q-4)

In this question we have to test the variability of the balance, so
Given,
Sample size (n) = 30
Sample SD (s) = 0.73 microgram
Alpha = 0.01
Here H0: σ = 1 microgram and H1: σ < 1 microgram (the balance is released only if its variability is significantly below one microgram).

$$\chi^{2} = \frac{(n - 1)s^{2}}{\sigma_{0}^{2}} = \frac{(30 - 1)\times(0.73)^{2}}{1^{2}} = 29 \times 0.5329 = 15.4541$$

df = 30 - 1 = 29 and tabulated $\chi^{2}_{29}(0.01) = 14.256$ (the lower 1% point, since the test is left-tailed). Since the calculated value 15.4541 is greater than 14.256, it does not fall in the critical region, so we cannot conclude that the variability is significantly below one microgram; the balance should not be released for sale.
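A Python check of this left-tailed variance test, using the numbers given in the problem:

```python
from scipy.stats import chi2

n, s, sigma0 = 30, 0.73, 1.0
stat = (n - 1) * s ** 2 / sigma0 ** 2          # 29 * 0.5329 = 15.4541

# H1: sigma < 1, so the critical region is chi-square < lower 1% point of chi2_29
critical = chi2.ppf(0.01, n - 1)               # about 14.256
print("chi-square =", round(stat, 4), " critical value =", round(critical, 3))
if stat < critical:
    print("Reject H0: variability is significantly below one microgram; release the balance.")
else:
    print("Do not reject H0: do not release the balance for sale.")
```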
2. Describe what you understand by goodness of fit. Explain how you can test the unbiasedness of a die using the χ² distribution.
Goodness of fit
An important problem of statistical inference is to test the hypothesis that the given data has been
obtained by random sampling from a specified population with definite values for its parameter. The
data usually given can be arranged in the form of a frequency distribution, wherein we are given the observed frequencies. The corresponding theoretical frequencies are obtained from the knowledge of the population, and our problem is to test the compatibility of the observed and theoretical frequencies, or to determine whether the deviations of the observed frequencies from the theoretical frequencies are small enough to be regarded as due to fluctuations of sampling, or whether they indicate that the data could not have possibly come from the population giving rise to the theoretical frequencies.
In other words, when some theoretical distribution is fitted to the given data, we are always interested in knowing how well this distribution fits the observed data. The chi-square test can give an answer to this. If the calculated value of chi-square (χ²) is less than the table value at a certain level of significance, the fit is considered to be a good one, which means that the divergence between the observed and expected frequencies is attributable to fluctuations of sampling. But if the calculated value of chi-square (χ²) is greater than its table value, the fit is not considered to be a good one.
We use this test to decide whether the difference between the theoretical and observed values can be attributed to chance or not.
Explain how you can test the unbiasedness of a die using the χ² distribution.
By using the chi-square distribution we can test the unbiasedness of a die. Here we set up the null hypothesis (H0) that the die is unbiased and the alternative hypothesis (H1) that the die is biased.
Once we perform the experiment, the data obtained can be arranged in the form of a frequency distribution, wherein we are given the observed frequencies. The corresponding theoretical frequencies are obtained from the knowledge of the population (the number of times we throw the die and the probability of any face turning up; for example, if we throw a die 132 times, the expected frequency for each face is 132 × 1/6 = 22, since 1/6 is the probability of any face turning up). Once we have the observed and expected frequencies, we can use chi-square as a test statistic. If the calculated value of chi-square (χ²) is less than the table value at a certain level of significance, we accept the null hypothesis and the fit is considered to be a good one, which means that the divergence between the observed and expected frequencies is attributable to fluctuations of sampling. But if the calculated value of chi-square (χ²) is greater than its table value, the fit is not considered to be a good one and we conclude that the die is biased.
For example, suppose a die is thrown 132 times with the following results:

No. turned up             1       2       3       4       5       6       Total
Observed frequency (O)    16      20      25      14      29      28      132
Expected frequency (E)    22      22      22      22      22      22      132
(O-E)²                    36      4       9       64      49      36
(O-E)²/E                  1.64    0.18    0.41    2.91    2.23    1.64    9.01

$$\chi^{2} = \sum_{i}\frac{(O_{i} - E_{i})^{2}}{E_{i}} = 9.01$$

df = 6 - 1 = 5 and tabulated $\chi^{2}_{5}(0.05) = 11.07$. Since the calculated chi-square is less than the tabulated value, we accept the null hypothesis and conclude that the die is unbiased.
(For another data set, of accidents recorded over the week, the computed value was $\chi^{2} = \sum_{i}(O_{i} - E_{i})^{2}/E_{i} = 1.8889$ with df = 6 - 1 = 5 and tabulated $\chi^{2}_{5}(0.05) = 11.07$. Since the calculated chi-square is less than the tabulated value, we accept the null hypothesis: the accidents are uniformly distributed over the week.)