
Calculating Interval Estimates of the Proportion from a Large Sample

If X is the number of successes in n independent trials with constant probability p of success for each trial, then

E(X) = np and V(X) = npq, where q = 1 - p is the probability of failure.

It has been proved that for large n the binomial distribution tends to the normal distribution. Hence for large n, X ~ N(np, npq), i.e.

Z = (X - E(X))/√V(X) = (X - np)/√(npq) ~ N(0, 1)

Let X be the number of persons possessing the given attribute in a sample of size n. Then the observed proportion of successes is

P = X/n

E(P) = E(X/n) = (1/n) E(X) = np/n = p

Thus, the sample proportion is an unbiased estimate of the population proportion. Also,

V(P) = V(X/n) = (1/n²) V(X) = npq/n² = pq/n

S.E.(P) = √(pq/n)

Since X, and consequently X/n, is asymptotically normal for large n, the normal test for the proportion of successes becomes

Z = (P - E(P))/√V(P) = (P - p)/√(pq/n) ~ N(0, 1)

Hence

Pr[-z_{α/2} ≤ (P - p)/√(pq/n) ≤ z_{α/2}] = 1 - α

Pr[-z_{α/2} √(pq/n) ≤ P - p ≤ z_{α/2} √(pq/n)] = 1 - α

Pr[P - z_{α/2} √(pq/n) ≤ p ≤ P + z_{α/2} √(pq/n)] = 1 - α

i.e. Pr[X/n - z_{α/2} √(pq/n) ≤ p ≤ X/n + z_{α/2} √(pq/n)] = 1 - α

The magnitude of the error made when we use X/n as an estimator of p is |X/n - p|, and

Pr[|X/n - p| ≤ z_{α/2} √(pq/n)] = 1 - α

Maximum error of the estimate: E = z_{α/2} √(pq/n)

The sample size needed to attain a desired degree of precision: n = p(1 - p)(z_{α/2}/E)²

If no information is available for p, we can take p = 1/2, so

Sample size n = (1/4)(z_{α/2}/E)²
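As a quick check of these formulas, the interval and the sample-size rule can be sketched in Python (the counts 120/400 and the error bound E = 0.05 are illustrative, not from the text; z = 1.96 is the two-sided 5% critical value):

```python
import math

def proportion_ci(x, n, z=1.96):
    """Large-sample confidence interval for a proportion: P +/- z*sqrt(pq/n)."""
    p = x / n
    se = math.sqrt(p * (1 - p) / n)      # S.E.(P) = sqrt(pq/n)
    return p - z * se, p + z * se

def sample_size(E, z=1.96, p=0.5):
    """n = p(1-p)(z/E)^2; with no prior information on p, take p = 1/2."""
    return math.ceil(p * (1 - p) * (z / E) ** 2)

low, high = proportion_ci(120, 400)      # observed proportion 0.30
n_needed = sample_size(0.05)             # maximum error 0.05 at 95% confidence
```

Taking p = 1/2 makes the sample-size rule conservative: any other value of p gives a smaller required n.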

Test for a Specified Proportion (Large sample)



Suppose a random sample of size n (greater than 30) has a sample proportion p of members possessing a certain attribute (i.e. proportion of successes). To test the hypothesis that the proportion P in the population has a specified value P₀:

The null hypothesis is H₀: P = P₀
Then the alternative hypothesis could be
1. H₁: P ≠ P₀
2. H₁: P > P₀
3. H₁: P < P₀

Since n is large, the sampling distribution of P is approximately normal. If H₀ is true, the test statistic

z = (p - P₀)/√(P₀Q₀/n), where Q₀ = 1 - P₀,

has an approximately standard normal distribution.

The critical region for Z, depending on the nature of the alternative hypothesis and the level of significance, is given in the following table.

The rejection rule for H₀: P = P₀ is: reject H₀ if

Alternative hypothesis                Critical region (1%)   (5%)         (10%)
H₁: P ≠ P₀ (i.e. P > P₀ or P < P₀)    |Z| > 2.58             |Z| > 1.96   |Z| > 1.645
H₁: P > P₀                            Z > 2.33               Z > 1.645    Z > 1.28
H₁: P < P₀                            Z < -2.33              Z < -1.645   Z < -1.28
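A minimal sketch of this test in Python (the counts 60/100 and P₀ = 0.5 are hypothetical; 1.96 is the two-sided 5% critical value from the table above):

```python
import math

def z_test_proportion(x, n, P0):
    """z = (p - P0)/sqrt(P0*Q0/n) for H0: P = P0, with large n."""
    p = x / n
    return (p - P0) / math.sqrt(P0 * (1 - P0) / n)

z = z_test_proportion(60, 100, 0.5)   # 60 successes in 100 trials
reject_5pct = abs(z) > 1.96           # two-sided test at the 5% level
```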

To Test Whether The Two Population Proportions P1, P2 Are Equal

The null hypothesis is H₀: P₁ = P₂
Then the alternative hypothesis could be
1. H₁: P₁ ≠ P₂
2. H₁: P₁ > P₂
3. H₁: P₁ < P₂

If we hypothesize that there is no difference between the two population proportions, then our best estimate of the overall population proportion of success is the combined proportion of success in both samples, that is:

Estimated overall proportion of success in the two populations
p̂ = (number of successes in sample 1 + number of successes in sample 2)/(total size of both samples)
  = (n₁p₁ + n₂p₂)/(n₁ + n₂), with q̂ = 1 - p̂

If H₀ is true, we have P₁ = P₂ = P (say), and the sampling distribution of p₁ - p₂ is approximately normal with mean 0 and estimated standard error (using the combined estimate from both samples)

S.E.(p₁ - p₂) = √(p̂q̂(1/n₁ + 1/n₂))

Since the sample sizes n₁ and n₂ are large, the test statistic

z = (p₁ - p₂)/S.E.(p₁ - p₂) = (p₁ - p₂)/√(p̂q̂(1/n₁ + 1/n₂))

is approximately normally distributed with mean 0 and standard deviation 1.

The rejection rule for H₀: P₁ = P₂ is: reject H₀ if

Alternative hypothesis   Critical region (1%)   (5%)         (10%)
H₁: P₁ ≠ P₂              |Z| > 2.58             |Z| > 1.96   |Z| > 1.645
H₁: P₁ > P₂              Z > 2.33               Z > 1.645    Z > 1.28
H₁: P₁ < P₂              Z < -2.33              Z < -1.645   Z < -1.28
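The pooled-proportion test can be sketched as follows (the sample figures 50/200 and 30/200 are hypothetical):

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """z for H0: P1 = P2, using the pooled estimate p = (x1+x2)/(n1+n2)."""
    p = (x1 + x2) / (n1 + n2)                        # combined proportion of success
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # S.E.(p1 - p2)
    return (x1 / n1 - x2 / n2) / se

z = two_prop_z(50, 200, 30, 200)
```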

Chi-Square Test

In the last chapter, we learned how to test hypotheses using data from
either one or two samples.
We used one-sample tests to determine whether a mean or a proportion
was significantly different from a hypothesized value.
In the two sample tests, we examined the difference between either two
means or two proportions and tried to learn whether this difference was
significant or not.
Suppose we have proportions from four populations instead of only two. In this case, the method for comparing proportions described earlier does not apply. In such a situation we must use the Chi-Square test. Chi-Square tests enable us to test whether more than two population proportions can be considered equal.
Chi-Square Test
The Chi-square test is derived from the properties of the Chi- square
distribution.
The chi-square distribution is a continuous probability distribution, and it is used in both large and small sample tests.
The Chi-Square test provides a technique whereby it is possible to:
i. Test the goodness of fit.
ii. Compare a number of frequency distributions.
iii. Find out the association and relationship between attributes.
iv. Test the population variance.
It is a test of independence, homogeneity and goodness of fit.
With the help of this test, it is possible to assess the significance of the
difference between the observed frequencies and the frequencies expected
if the data conformed to some theoretical distribution.
It is, therefore, possible to test the goodness of fit, i.e. to see how well the distribution of observed data fits the assumed theoretical distribution. If the distribution of observed data does, in fact, approximate the assumed distribution, then we would expect no significant difference between the expected frequencies and the actual frequencies.
Chi-square Variate
If x is normally distributed with mean μ and standard deviation σ, then ((x - μ)/σ)² is a chi-square (χ²) variate with 1 degree of freedom. If x₁, x₂, x₃, ..., xₙ are n independent normal variates with means μ₁, μ₂, μ₃, ..., μₙ and standard deviations σ₁, σ₂, σ₃, ..., σₙ respectively, then

χ² = ((x₁ - μ₁)/σ₁)² + ((x₂ - μ₂)/σ₂)² + ((x₃ - μ₃)/σ₃)² + ... + ((xₙ - μₙ)/σₙ)²

is a chi-square variate with n degrees of freedom.

Observed and Expected Frequencies

If a set of events A₁, A₂, A₃, ..., Aₙ are observed to occur with frequencies O₁, O₂, O₃, ..., Oₙ, and according to the probability rules are expected to occur with frequencies E₁, E₂, E₃, ..., Eₙ, then O₁, O₂, O₃, ..., Oₙ are called observed frequencies and E₁, E₂, E₃, ..., Eₙ are called expected frequencies.

If O₁, O₂, O₃, ..., Oₙ is a set of observed frequencies and E₁, E₂, E₃, ..., Eₙ is the corresponding set of expected (theoretical or hypothetical) frequencies, then Karl Pearson's chi-square is given by

χ² = (O₁ - E₁)²/E₁ + (O₂ - E₂)²/E₂ + ... + (Oₙ - Eₙ)²/Eₙ = Σᵢ₌₁ⁿ (Oᵢ - Eᵢ)²/Eᵢ

which follows the chi-square distribution with (n - 1) df.
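Karl Pearson's formula translates directly into Python; the three-class counts below are illustrative only:

```python
def pearson_chi_square(observed, expected):
    """Karl Pearson's chi-square: sum of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# three classes with equal expected frequencies (hypothetical counts)
chi2 = pearson_chi_square([30, 50, 40], [40, 40, 40])
```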


Areas of Application of the Chi-Square Test
Test the goodness of fit
An important problem of statistical inference is to test the hypothesis that the given data have been obtained by random sampling from a specified population with definite values for its parameters. The data can usually be arranged in the form of a frequency distribution, wherein we are given the observed frequencies. The corresponding theoretical frequencies are obtained from knowledge of the population, and our problem is to test the compatibility of the observed and theoretical frequencies, i.e. to determine whether the deviations of the observed frequencies from the theoretical frequencies are small enough to be regarded as due to fluctuations of sampling, or whether they indicate that the data could not have come from a population giving rise to the theoretical frequencies.

The chi-square (χ²) test enables us to see how well the distribution of observed data fits an assumed theoretical distribution such as the Binomial, Poisson, or normal distribution. In other words, when some theoretical distribution is fitted to the given data, we are always interested in knowing how well this distribution fits the observed data. The chi-square test can give an answer to this. If the calculated value of chi-square (χ²) is less than the table value at a certain level of significance, the fit is considered to be a good one, which means that the divergence between the observed and expected frequencies is attributable to fluctuations of sampling. But if the calculated value of chi-square (χ²) is greater than its table value, the fit is not considered to be a good one.
We use this test to see whether the difference between the theoretical and observed values can be attributed to chance or not.

Example: The following figures show the distribution of digits in numbers chosen at random from a telephone directory:

Digits:      0     1     2    3    4     5    6     7    8    9    Total
Frequency:   1026  1107  997  966  1075  933  1107  972  964  853  10,000

Test whether the digits may be taken to occur equally frequently in the directory. [χ²₉(0.05) = 16.919]

Solution: Here we set up the null hypothesis that the digits occur equally frequently in the directory.
Under the null hypothesis, the expected frequency for each of the digits 0, 1, 2, ..., 9 is 10000/10 = 1000. The value of chi-square is computed as follows:

Calculation for χ²
Digit   Observed frequency (O)   Expected frequency (E)   (O-E)²   (O-E)²/E
0       1026                     1000                     676      0.676
1       1107                     1000                     11449    11.449
2       997                      1000                     9        0.009
3       966                      1000                     1156     1.156
4       1075                     1000                     5625     5.625
5       933                      1000                     4489     4.489
6       1107                     1000                     11449    11.449
7       972                      1000                     784      0.784
8       964                      1000                     1296     1.296
9       853                      1000                     21609    21.609
Total   10000                    10000                             58.542

χ² = Σᵢ (Oᵢ - Eᵢ)²/Eᵢ = 58.542

Since the calculated value (58.542) is much greater than the tabulated value (16.919), we reject the null hypothesis. Thus we should conclude that the digits are not uniformly distributed in the directory.
Example: The theory predicts that the proportions of beans in the four groups A, B, C and D should be 9:3:3:1. In an experiment among 1600 beans, the numbers in the four groups were 882, 313, 287 and 118. Does the experimental result support the theory? [χ²₃(0.05) = 7.815]

Solution: Null hypothesis: the theory fits the experiment well, i.e. the experimental result supports the theory.
Under the null hypothesis, the expected frequencies can be computed as follows:
Total number of beans = 882 + 313 + 287 + 118 = 1600
These are to be divided in the ratio 9:3:3:1:

E₁ = 1600 × 9/16 = 900, E₂ = 1600 × 3/16 = 300, E₃ = 1600 × 3/16 = 300, E₄ = 1600 × 1/16 = 100

χ² = (O₁ - E₁)²/E₁ + (O₂ - E₂)²/E₂ + (O₃ - E₃)²/E₃ + (O₄ - E₄)²/E₄
   = (882 - 900)²/900 + (313 - 300)²/300 + (287 - 300)²/300 + (118 - 100)²/100 = 4.7266

df = 4 - 1 = 3 and tabulated χ²₃(0.05) = 7.815
Since the calculated value of χ² is less than the tabulated value, the null hypothesis is accepted at the 5% level of significance and we may conclude that the experimental results support the theory.
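The arithmetic of this example can be checked with a short script using the observed counts and the 9:3:3:1 ratio from the text:

```python
observed = [882, 313, 287, 118]                  # groups A, B, C, D
ratio = [9, 3, 3, 1]
N = sum(observed)                                # 1600 beans in total
expected = [N * r / sum(ratio) for r in ratio]   # [900, 300, 300, 100]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

The exact value is 4.72667, which the text truncates to 4.7266; either way it is below the 7.815 cut-off.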

Example: A survey of 320 families with 5 children each revealed the following distribution:

Number of boys:      5   4   3    2   1   0
Number of girls:     0   1   2    3   4   5
Number of families:  14  56  110  88  40  12

Is the result consistent with the hypothesis that male and female births are equally probable? (Or: fit a binomial distribution and test the goodness of fit.) [χ²₅(0.05) = 11.07]

Solution: Let the null hypothesis be that male and female births are equally probable, i.e. p = q = 1/2, where p = probability of a male birth.

p(r) = probability of r male births in a family of n = C(n, r) pʳ qⁿ⁻ʳ

The expected frequency of r male births is given by f(r) = N·C(n, r) pʳ qⁿ⁻ʳ

p(0) = probability of 0 male births in a family of 5 = C(5, 0)(1/2)⁰(1/2)⁵ = 1 × 1 × 1/32 = 1/32

The expected frequency of 0 male births is f(0) = N·C(5, 0) p⁰ q⁵ = 320 × 1/32 = 10

Similarly,
f(1) = 50, f(2) = 100, f(3) = 100, f(4) = 50, f(5) = 10

Calculation for χ²
Number of boys  Observed frequency (O)  Expected frequency (E)  (O-E)²  (O-E)²/E
0               12                      10                      4       0.40
1               40                      50                      100     2.00
2               88                      100                     144     1.44
3               110                     100                     100     1.00
4               56                      50                      36      0.72
5               14                      10                      16      1.60
Total           320                     320                             7.16

χ² = Σᵢ (Oᵢ - Eᵢ)²/Eᵢ = 7.16

df = 6 - 1 = 5 and tabulated χ²₅(0.05) = 11.07
The calculated value of chi-square is less than the tabulated value. Hence we accept the null hypothesis. Thus, we may conclude that male and female births are equally probable.
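The binomial expected frequencies and the χ² value of this example can be reproduced in Python:

```python
from math import comb

N, n, p = 320, 5, 0.5
observed = {0: 12, 1: 40, 2: 88, 3: 110, 4: 56, 5: 14}   # families by number of boys
# f(r) = N * C(n, r) * p^r * q^(n-r)
expected = {r: N * comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(n + 1)}
chi2 = sum((observed[r] - expected[r]) ** 2 / expected[r] for r in range(n + 1))
```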

Example: A die is thrown 132 times with the following result:

Number turned up:  1   2   3   4   5   6
Frequency:         16  20  25  14  29  28

Test the hypothesis that the die is unbiased.
Solution: Let us take the hypothesis that the die is unbiased. If that is true, the probability of obtaining any one of the six faces is 1/6, and as such the expected frequency of any one face coming up is 132 × 1/6 = 22.

Calculation for χ²
Number turned up  Observed frequency (O)  Expected frequency (E)  (O-E)²  (O-E)²/E
1                 16                      22                      36      1.64
2                 20                      22                      4       0.18
3                 25                      22                      9       0.41
4                 14                      22                      64      2.91
5                 29                      22                      49      2.23
6                 28                      22                      36      1.64
Total             132                     132                             9.01

χ² = Σᵢ (Oᵢ - Eᵢ)²/Eᵢ = 9.01

df = 6 - 1 = 5 and tabulated χ²₅(0.05) = 11.07
Since the calculated value of chi-square is less than the tabulated value, we accept the null hypothesis. Thus, we may conclude that the die is unbiased.
Test of independence
If we classify a population into several categories with respect to two attributes, we can then use the chi-square test to determine whether the two attributes are independent of each other.
The chi-square (χ²) test enables us to explain whether or not two attributes are associated. For instance, we may be interested in knowing whether a new medicine is effective in controlling fever or not, and the chi-square (χ²) test will help us in deciding this issue. In such a situation we proceed on the null hypothesis that the two attributes are independent, which means that the new medicine is not effective in controlling fever. On this basis we first calculate the expected frequencies and then work out the value of χ². If the calculated value of χ² is less than the table value at a certain level of significance for a given degree of freedom, then we conclude that our hypothesis stands, which means the two attributes are independent or not associated (i.e. the new medicine is not effective in controlling the fever). But if the calculated value of χ² is greater than its table value, then our inference would be that the hypothesis does not hold good, which means the two attributes are associated and the association is not because of some chance factor but exists in reality.
Theorem: In a random and large sample,

χ² = Σᵢ₌₁ᵏ (nᵢ - npᵢ)²/(npᵢ)

follows the chi-square distribution approximately with (k - 1) degrees of freedom, where nᵢ is the observed frequency and npᵢ is the corresponding expected frequency of the ith class (i = 1, 2, ..., k), with Σᵢ₌₁ᵏ nᵢ = n.

Hypothesis Concerning Several Proportions

Objective: to test whether more than two binomial populations have the same parameter.

H₀: p₁ = p₂ = ... = pₖ = p, against the alternative hypothesis that these population proportions are not all equal. To perform a suitable large-sample test of this hypothesis, we require independent random samples of sizes n₁, n₂, ..., nₖ from the k populations; the corresponding numbers of successes are X₁, X₂, ..., Xₖ.
As the test is based on large samples,

Zᵢ = (Xᵢ - nᵢpᵢ)/√(nᵢpᵢ(1 - pᵢ))

is approximately standard normal. The square of a random variable having the standard normal distribution is a random variable having the chi-square distribution with 1 df, and the sum of k independent such random variables has the chi-square distribution with k df:

χ² = Σᵢ₌₁ᵏ (Xᵢ - nᵢpᵢ)²/(nᵢpᵢ(1 - pᵢ))

In practice we substitute an estimate for the pᵢ, which under the null hypothesis are all equal; the pooled estimate is

p̂ = (x₁ + x₂ + ... + xₖ)/(n₁ + n₂ + ... + nₖ)

Then

χ² = Σᵢ₌₁ᵏ (xᵢ - nᵢp̂)²/(nᵢp̂(1 - p̂)) ~ χ² with (k - 1) df

In actual practice, when we compare more than two sample proportions it is convenient to determine the value of the chi-square statistic from the data arranged in the following table:

          Sample 1   Sample 2   ...   Sample k   Total
Success   x₁         x₂         ...   xₖ         x
Failure   n₁ - x₁    n₂ - x₂    ...   nₖ - xₖ    n - x
Total     n₁         n₂         ...   nₖ         n

Here x and n, respectively, represent the total number of successes and the total number of trials for all samples combined. The entry in the cell belonging to the ith row and jth column is called the observed cell frequency Oᵢⱼ, with i = 1, 2 and j = 1, 2, ..., k.

Under the null hypothesis H₀: p₁ = p₂ = ... = pₖ = p, we estimate p̂ = x/n. Hence the expected numbers of successes and failures for the jth sample are estimated by

e₁ⱼ = nⱼ·p̂ = nⱼx/n  and  e₂ⱼ = nⱼ·(1 - p̂) = nⱼ(n - x)/n

The quantities e₁ⱼ and e₂ⱼ are called expected frequencies, for j = 1, 2, ..., k.

In this notation, the chi-square statistic with p̂ substituted for the pᵢ can be written in the form

χ² = Σᵢ₌₁² Σⱼ₌₁ᵏ (Oᵢⱼ - eᵢⱼ)²/eᵢⱼ ~ χ² with (k - 1) df

Contingency Table
Let the data be classified into s classes A₁, A₂, A₃, ..., Aₛ according to attribute A and into t classes B₁, B₂, B₃, ..., Bₜ according to attribute B. Let Oᵢⱼ denote the observed frequency of the cell belonging to both the classes Aᵢ and Bⱼ [i = 1, 2, ..., s; j = 1, 2, ..., t]. Let (Aᵢ) and (Bⱼ) denote the totals of all the frequencies belonging to classes Aᵢ and Bⱼ respectively. Then the data can be set into an (s × t) contingency table of s rows and t columns as follows:

Classes  B₁    B₂    ...  Bⱼ    ...  Bₜ    Total
A₁       O₁₁   O₁₂   ...  O₁ⱼ   ...  O₁ₜ   (A₁)
A₂       O₂₁   O₂₂   ...  O₂ⱼ   ...  O₂ₜ   (A₂)
:        :     :          :          :     :
Aᵢ       Oᵢ₁   Oᵢ₂   ...  Oᵢⱼ   ...  Oᵢₜ   (Aᵢ)
:        :     :          :          :     :
Aₛ       Oₛ₁   Oₛ₂   ...  Oₛⱼ   ...  Oₛₜ   (Aₛ)
Total    (B₁)  (B₂)  ...  (Bⱼ)  ...  (Bₜ)  N
Example: Suppose that in four regions, the National Health Care Company samples its hospital employees' attitudes toward job performance reviews. Respondents are given a choice between the present method and a proposed new method. Here the data are classified into two classes (choice between the present method and the proposed new method) and into four classes according to geographical region (Northeast, Southeast, Central, West Coast). The sample responses are as follows:

Observed Frequency Table
                                        Northeast  Southeast  Central  West Coast  Total
Number who prefer present method        68         75         57       79          279
Number who prefer new method            32         45         33       31          141
Total employees sampled in each region  100        120        90       110         420

Expected frequency for a given cell
  = (Row total for that cell × Column total for that cell)/Total number of observations

Expected frequency for the first cell = 279 × 100/420 = 27900/420 = 66.43

Expected Frequency Table
                                          Northeast  Southeast  Central  West Coast
Number expected to prefer present method  66.43      79.71      59.79    73.07
Number expected to prefer new method      33.57      40.29      30.21    36.93
Total                                     100        120        90       110
Degrees of Freedom
Degrees of freedom is the total number of observations minus the number of independent constraints (restrictions) imposed on the observations.
In the above (contingency) table there are in all (s × t) cells, but since the marginal totals are fixed there are (s + t) constraints. These constraints are, however, not independent, since the sum of the border column frequencies must equal the sum of the border row frequencies, and thus there are only (s + t - 1) independent linear constraints. Hence the number of degrees of freedom associated with an (s × t) contingency table is

Degrees of freedom = (s × t) - (s + t - 1) = (s - 1)(t - 1)
                   = (Number of rows - 1)(Number of columns - 1)

For a 2 × 4 table with fixed row totals RT1, RT2 and column totals CT1, CT2, CT3, CT4: once three cells of the first row are freely specified, the remaining cells are determined by the marginal totals and cannot be freely specified.

Here the total number of cells is 2 × 4 = 8.
There are 2 + 4 = 6 restrictions imposed on the observations, but 1 is a dependent restriction, so there are 6 - 1 = 5 independent linear restrictions.
Degrees of freedom = (s × t) - (s + t - 1) = 8 - 5 = 3
or Degrees of freedom = (s - 1)(t - 1) = (2 - 1)(4 - 1) = 1 × 3 = 3
Critical Value
From the chi-square tables it may be observed that the critical (tabulated) value of χ² increases as n (the df) increases and as the level of significance decreases.
Let χ²ₙ(α) denote the value of chi-square for n df such that the area to the right of this point is α, i.e.

P[χ² > χ²ₙ(α)] = α

We reject the null hypothesis at level of significance α if the calculated value of χ² is greater than the tabulated value χ²ₙ(α).

Conditions for the Application of the χ² Test

The following conditions should be satisfied before the test can be applied:
i. The sample observations should be independent.
ii. Constraints on the cell frequencies, if any, should be linear.
iii. N, the total frequency, should be large (say, greater than 50).
iv. No theoretical cell frequency should be less than 5. If any theoretical cell frequency is less than 5, then for the application of the chi-square (χ²) test it is pooled with the preceding or succeeding frequency so that the pooled frequency is more than 5, and the df lost in pooling is adjusted for.

Steps Involved in Finding the Value of Chi-square

Step 1. Calculate the expected frequencies. In general, the expected frequency for any cell can be calculated as follows:

e = (Row total for that cell × Column total for that cell)/Total number of observations

Step 2. Obtain the differences between the observed and expected frequencies and find the squares of these differences, i.e. find (O - E)².
Step 3. Divide each quantity (O - E)² obtained in Step 2 by the corresponding expected frequency to get (O - E)²/E.
Step 4. Then find the sum of the (O - E)²/E values, i.e. Σ(O - E)²/E. This is the required χ² value.

The value obtained should be compared with the table value of χ² at a certain level of significance for the given degrees of freedom, and the inference drawn: if the calculated χ² value is greater than the table value χ²ₙ, reject the null hypothesis; otherwise accept it.
**Note**
It may be noted that the χ² test depends only on the set of observed and expected frequencies and on the degrees of freedom. It makes no assumptions regarding the parent population from which the observations are taken. Since χ² does not involve any population parameters, the test is known as a non-parametric test.

Test of Homogeneity
The chi-square (χ²) test helps us in stating whether different samples come from the same universe. Through this test, we can also explain whether the results worked out on the basis of samples are in conformity with a well-defined hypothesis or fail to support the given hypothesis.

Test for a Specified Population Variance

Let a random sample x₁, x₂, x₃, ..., xₙ of size n be drawn from a normal population with mean μ and variance σ². To test the hypothesis that the population variance has a specified value σ₀²:
Let the null hypothesis be H₀: σ² = σ₀²
Then the alternative hypothesis is H₁: σ² ≠ σ₀²
Assuming that H₀ is true, the test statistic is

χ² = (n - 1)s²/σ₀², where s² is the sample variance.

The test statistic follows the chi-square distribution with (n - 1) df. If the calculated value is greater than the table value χ²ₙ₋₁, reject the null hypothesis.

Confidence Intervals for the Population Variance

Suppose we want a 95% confidence interval for the variance, and for instance let the degrees of freedom be 8. We locate two points on the chi-square distribution with the given degrees of freedom: χ²_U cuts off 0.025 of the area in the upper tail of the distribution, and χ²_L cuts off 0.025 of the area in the lower tail. The values χ²_U = 17.535 and χ²_L = 2.180 can be found from the table.
The following expressions give the confidence limits for σ²:

Lower confidence limit σ²_L = (n - 1)s²/χ²_U
Upper confidence limit σ²_U = (n - 1)s²/χ²_L
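The interval can be computed directly; the sample values below (n = 9, s² = 10) are hypothetical, while the table values 17.535 and 2.180 are the df = 8 figures quoted above:

```python
def variance_ci(n, s2, chi2_upper, chi2_lower):
    """CI for sigma^2: ((n-1)s^2/chi2_U, (n-1)s^2/chi2_L)."""
    return (n - 1) * s2 / chi2_upper, (n - 1) * s2 / chi2_lower

low, high = variance_ci(9, 10.0, 17.535, 2.180)   # df = n - 1 = 8, 95% interval
```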

Example: A random sample of size 20 from a normal population gives a mean of 42 and a variance of 25. Test the hypothesis that the population variance is 64, at the 5% level of significance.
Solution:
Let the null hypothesis be H₀: σ² = σ₀² = 64
Then the alternative hypothesis is H₁: σ² ≠ 64
Assuming that H₀ is true, the test statistic is

χ² = (n - 1)s²/σ₀² = 19 × 25/64 = 475/64 = 7.42

df = 20 - 1 = 19 and tabulated χ²₁₉(0.05) = 30.14
Since the calculated value of chi-square is less than the tabulated value, we accept the null hypothesis. Thus, we may conclude that the population variance is 64.
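The test statistic of this example in Python:

```python
def chi2_variance_stat(n, s2, sigma0_sq):
    """(n - 1)s^2 / sigma0^2 for H0: sigma^2 = sigma0^2."""
    return (n - 1) * s2 / sigma0_sq

stat = chi2_variance_stat(20, 25, 64)   # 19 * 25 / 64
accept = stat <= 30.14                  # tabulated chi-square_19(0.05)
```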
Example: A sample of 20 observations from a normal distribution has mean 37 and variance 12.2. Construct a 90 percent confidence interval for the true population variance.

Lower confidence limit = (n - 1)s²/χ²_U = (20 - 1) × 12.2/30.144 = 7.69
Upper confidence limit = (n - 1)s²/χ²_L = (20 - 1) × 12.2/10.117 = 22.91

Analysis of variance (ANOVA)

The difference between two sample means can be studied through the standard error of the difference of the means of the two samples, or through Student's t-test, but difficulty arises when we need to examine the significance of the difference between more than two sample means at once.
Analysis of variance helps us test whether more than two population means can be considered equal.
Analysis of variance enables us to test for the significance of the differences between more than two sample means.
Using analysis of variance, we will be able to make inferences about whether our samples are drawn from populations having the same mean.
Sir R. A. Fisher originated the technique of analysis of variance.
The analysis of variance is essentially a technique for testing the difference between groups of data for homogeneity. It is a method of analyzing the variance to which a response is subject into its various components, corresponding to the various sources of variation. There may be variation between the samples, or there may be variation within the sample items. Thus, the technique of analysis of variance consists in splitting the variance, for analytical purposes, into its various components. Normally the variance (what can be called the total variance) is divided into two parts:
1. Variance between samples,
2. Variance within samples; such that
Total variance = Variance between samples + Variance within samples

Three steps in analysis of variance
Analysis of variance consists of three different steps:
1. Determine a first estimate of the population variance from the variance among (between) the sample means.
2. Determine a second estimate of the population variance from the variance within the samples.
3. Compare these two estimates. If they are approximately equal in value, accept the null hypothesis.

Assumption
In order to use analysis of variance, we must assume that each of the samples is drawn from a normal population and that each of these populations has the same variance.

Example: The training director of a company is trying to evaluate three different methods of training new employees. The first method assigns each new employee to an experienced employee for individual help in the factory. The second method puts all new employees in a training room separate from the factory, and the third method uses training films and programmed learning materials. The training director chooses 16 new employees assigned at random to the three training methods and records their daily production after they complete the programs:

Method 1: 15 18 19 22 11
Method 2: 22 27 18 21 17
Method 3: 18 24 19 16 22 15

The director wonders whether there are differences in effectiveness among the methods.
Analysis of variance is based on a comparison of two different estimates of the variance, σ², of the overall population.
The first estimate of the population variance can be calculated by examining the variance among (between) the three sample means. In this case, the three sample means are 17, 21 and 19:

Sample mean of Method 1 = (1/n)Σxᵢ = (x₁ + x₂ + x₃ + ... + xₙ)/n = 85/5 = 17
Sample mean of Method 2 = 105/5 = 21
Sample mean of Method 3 = 114/6 = 19

The other estimate of the population variance is determined by the variation within the three samples themselves, that is (15, 18, 19, 22, 11), (22, 27, 18, 21, 17) and (18, 24, 19, 16, 22, 15).
Then we compare these two estimates of the population variance. Because both (the first and the second) are estimates of the overall population σ², they should be approximately equal in value when the null hypothesis is true.

Calculating the variance among (between) the sample means
(Determine the first estimate of the population variance from the variance among the sample means)

The first step in analysis of variance indicates that we must obtain one estimate of the population variance from the variance among the three sample means. This estimate is called the between-column variance.
As we know, the sample variance is given by

s² = Σ(x - x̄)²/(n - 1)

Now, because we are working with three sample means and a grand mean, let us substitute x̄ for x, the grand mean x̿ for x̄, and k (the number of samples) for n, to get a formula for the variance among the sample means:

s_x̄² = Σ(x̄ - x̿)²/(k - 1)

As we know, the standard error of the sample mean is σ_x̄ = σ/√n, so the population variance is σ² = n·σ_x̄², where σ_x̄² is the variance of the sample mean. We do not know σ_x̄², but we can calculate the variance among the three sample means, s_x̄². So, substituting s_x̄² for σ_x̄², the estimated population variance is

n·s_x̄² = Σ n(x̄ - x̿)²/(k - 1)

Since different samples have different sample sizes, the first estimate of the population variance is

σ̂_b² = Σ nⱼ(x̄ⱼ - x̿)²/(k - 1)

Calculation of the between-column variance:

n   x̄    x̿    x̄ - x̿        n(x̄ - x̿)²
5   17   19   17 - 19 = -2   5 × (-2)² = 20
5   21   19   21 - 19 = 2    5 × (2)² = 20
6   19   19   19 - 19 = 0    6 × (0)² = 0
                             Σ n(x̄ - x̿)² = 40

σ̂_b² = Σ nⱼ(x̄ⱼ - x̿)²/(k - 1) = 40/(3 - 1) = 40/2 = 20

Calculating the variance within the samples
(Determine the second estimate of the population variance from the variance within the samples)

The second step in analysis of variance requires a second estimate of the population variance based on the variance within the samples. This variance is called the within-column variance.
As we know, the variance within a sample is

s² = Σ(x - x̄)²/(n - 1)

As we have assumed that the variances of our three populations are the same, we could use any one of the three sample variances (s₁², s₂², s₃²) as the second estimate of the population variance. We can get a better estimate of the population variance by using a weighted average of all three sample variances. The general formula for the second estimate of the population variance is

σ̂_w² = Σ ((nⱼ - 1)/(n_T - k))·sⱼ²

where σ̂_w² = within-column variance, nⱼ = size of the jth sample, sⱼ² = sample variance of the jth sample, k = number of samples, and n_T = total sample size.

Training Method 1
(x - x̄)       (x - x̄)²
15 - 17 = -2   4
18 - 17 = 1    1
19 - 17 = 2    4
22 - 17 = 5    25
11 - 17 = -6   36
Σ(x - x̄)² = 70

s₁² = Σ(x - x̄)²/(n - 1) = 70/4 = 17.5

Training Method 2
(x - x̄)       (x - x̄)²
22 - 21 = 1    1
27 - 21 = 6    36
18 - 21 = -3   9
21 - 21 = 0    0
17 - 21 = -4   16
Σ(x - x̄)² = 62

s₂² = 62/4 = 15.5

Training Method 3
(x - x̄)       (x - x̄)²
18 - 19 = -1   1
24 - 19 = 5    25
19 - 19 = 0    0
16 - 19 = -3   9
22 - 19 = 3    9
15 - 19 = -4   16
Σ(x - x̄)² = 60

s₃² = 60/5 = 12.0

σ̂_w² = Σ ((nⱼ - 1)/(n_T - k))·sⱼ² = (4/13)s₁² + (4/13)s₂² + (5/13)s₃²
     = (4/13)(17.5) + (4/13)(15.5) + (5/13)(12.0) = 192/13 = 14.769

Comparison of the two estimates

Step 3 in ANOVA compares these two estimates of the population variance
by computing their ratio, called the F-ratio:

F = (first estimate of the population variance, based on the variance among the sample means) / (second estimate of the population variance, based on the variance within the samples)

F = σ̂b² / σ̂w² = 20 / 14.769 = 1.354

The nearer the F-ratio comes to 1, the more we are inclined to accept the
null hypothesis. Conversely, as the F-ratio becomes larger, we are more
inclined to reject the null hypothesis and accept the alternative hypothesis.
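The steps above can be collected into a short script. This is a minimal sketch using the three training-method samples from the worked example; the variable names are illustrative, not part of any library.

```python
# One-way ANOVA by hand, following the steps in the text above.
m1 = [15, 18, 19, 22, 11]       # mean 17
m2 = [22, 27, 18, 21, 17]       # mean 21
m3 = [18, 24, 19, 16, 22, 15]   # mean 19

samples = [m1, m2, m3]
k = len(samples)                                  # number of samples
n_T = sum(len(s) for s in samples)                # total sample size
grand_mean = sum(sum(s) for s in samples) / n_T   # 19.0

def mean(s):
    return sum(s) / len(s)

# First estimate: between-column variance (variance among the sample means)
ssb = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)  # 40.0
msb = ssb / (k - 1)                                               # 20.0

# Second estimate: within-column variance (weighted average of sample variances)
ssw = sum(sum((x - mean(s)) ** 2 for x in s) for s in samples)    # 70 + 62 + 60 = 192.0
msw = ssw / (n_T - k)                                             # 192 / 13

f_ratio = msb / msw
print(round(msb, 3), round(msw, 3), round(f_ratio, 3))  # 20.0 14.769 1.354
```

Running this reproduces the between-column variance (20), the within-column variance (14.769), and the F-ratio (1.354) computed above.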
The F-Distribution
The F-distribution is a skewed distribution. It is generally skewed to the right
and tends to become more symmetrical as the numbers of degrees of freedom in the
numerator and denominator increase. The F-distribution has a single mode. The
shape of the distribution depends on the number of degrees of freedom in both
the numerator and the denominator of the F-ratio. The first number is the
degrees of freedom in the numerator of the F-ratio; the second is the degrees of
freedom in the denominator.
Fig 11.8 Pg 597 Rubin


Degrees of Freedom
As we have mentioned, each F-distribution has a pair of degrees of
freedom: one for the numerator of the F-ratio and the other for the
denominator.
While calculating the variance between the sample means we used k
different values of (x̄j − x̄), one for each sample, to calculate
Σ nj(x̄j − x̄)². In the example above, once we knew two of these
(x̄j − x̄) values, the third was automatically determined and could not be
freely specified. Thus, one degree of freedom is lost when we calculate the
variance between samples. Hence, the number of degrees of freedom for the
numerator of the F-ratio is always one fewer than the number of samples:
Number of degrees of freedom in the numerator of the F-ratio = (k − 1)

For the denominator, we calculated the variance within the samples, using
all three samples. For the jth sample, we used nj values of (x − x̄j) to
calculate Σ(x − x̄j)² for that sample. Once we knew all but one of these
(x − x̄j) values, the last was automatically determined and could not be
freely specified. Thus, we lost one degree of freedom in the calculations
for each sample, leaving us with 4, 4 and 5 degrees of freedom in the
samples. Because we had three samples, we were left with 4 + 4 + 5 = 13
degrees of freedom, which could also be calculated as 5 + 5 + 6 − 3 = 13.
Thus,
Number of degrees of freedom in the denominator of the F-ratio = (nT − k)
The F-Table

For analysis of variance, we use an F-table in which the columns represent
the number of degrees of freedom for the numerator and the rows represent the
degrees of freedom for the denominator. Suppose we are testing a hypothesis at
the 0.05 level of significance using the F-distribution, with 2 degrees of
freedom for the numerator and 13 for the denominator. The value we find in the
F-table is 3.81 (first look in the column, then in the row).
Critical Value of the F-distribution
F-tables usually give the critical value of F for a right-tailed test; the
right-tail area determines the critical region. Thus, the significant value
Fα(n1, n2) at the level of significance α, where n1 is the number of degrees of
freedom in the numerator and n2 the number of degrees of freedom in the
denominator, satisfies P[F > Fα(n1, n2)] = α, as shown in the figure.
Pg. 877 Gupta and Kapoor

If the calculated F-ratio is greater than the table value Fα(n1, n2), we reject
the null hypothesis; otherwise we accept it.
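Instead of reading the F-table, the same critical value can be obtained programmatically. This sketch assumes SciPy is available; it reproduces the table lookup for 2 and 13 degrees of freedom at the 0.05 level.

```python
# Critical value of F at the 5% level with (2, 13) degrees of freedom.
from scipy.stats import f

alpha = 0.05
dfn, dfd = 2, 13                       # numerator and denominator df
critical = f.ppf(1 - alpha, dfn, dfd)  # upper-tail critical value
print(round(critical, 2))  # 3.81, matching the F-table

# Decision rule: reject H0 only if the calculated F-ratio exceeds this value.
f_ratio = 20 / 14.769                  # the ratio computed in the example
print(f_ratio > critical)  # False: do not reject H0
```

Here `f.ppf(1 - alpha, dfn, dfd)` returns the point with right-tail area alpha, which is exactly the Fα(n1, n2) defined above.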
Statement of Hypotheses
Null Hypothesis: There is no significant difference between the population means.
In our example above, suppose the director of training wants to test at the 0.05
level the hypothesis that there are no differences among the three training
methods.
We set the null hypothesis as H0: μ1 = μ2 = μ3
and the alternative hypothesis as H1: μ1, μ2 and μ3 are not all equal.

Analysis of Variance Table

Source of variation   Sum of squares                                              df        Mean square         Test statistic
Between               SSB = Σ nj(x̄j − x̄)²,  j = 1, 2, …, k                       (k − 1)   MSB = SSB/(k − 1)   F = MSB/MSW
Within                SSW = Σ(xi1 − x̄1)² + Σ(xi2 − x̄2)² + … + Σ(xik − x̄k)²      (N − k)   MSW = SSW/(N − k)   −
Total                 SST = Σ(xi − x̄)²                                            (N − 1)

1. A company manufactures gold-weighing balances. It maintains strict quality control over its
products and does not release a balance for sale unless the balance shows variability significantly
below one microgram (at alpha = 0.01) when weighing quantities of about 500 grams. A new
balance has just been delivered to the quality control division from the production line. This new
balance is tested by using it to weigh the same 500-gram standard weight 30 different times; the
standard deviation turns out to be 0.73 microgram. Should this balance be released for sale?
2001 q-4
In this question we have to test the variability of the balance.
Given,
Sample size (n) = 30
Sample SD (s) = 0.73 microgram
Alpha = 0.01

Let the null hypothesis be H0: σ² = σ0² = 1

Then the alternative hypothesis is H1: σ² < 1

Assuming that H0 is true, the test statistic is

χ² = (n − 1)s² / σ0² = (30 − 1) × (0.73)² / 1 = 29 × 0.5329 = 15.4541

df = 30 − 1 = 29, and the tabulated lower-tail value (since H1 is left-tailed) is χ²29(0.01) = 14.256

Because the alternative hypothesis is H1: σ² < 1, the rejection region is the lower
tail: we reject H0 only if the calculated chi-square falls below the tabulated value.
Here the calculated value of chi-square is greater than the tabulated value,

χ²Cal > χ²Tab i.e. 15.4541 > 14.256

so we cannot reject the null hypothesis. Thus, we cannot conclude that the
population variance is significantly below 1, and the balance should not be
released for sale.
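As a quick check, the test can be reproduced in Python. This is a sketch assuming SciPy is available; note that for the left-tailed alternative H1: σ² < 1 the rejection region is the lower tail of the chi-square distribution.

```python
# Left-tailed chi-square test for a population variance.
from scipy.stats import chi2

n, s, sigma0_sq, alpha = 30, 0.73, 1.0, 0.01
chi_stat = (n - 1) * s ** 2 / sigma0_sq   # 29 * 0.5329 = 15.4541
critical = chi2.ppf(alpha, n - 1)         # lower-tail critical value, about 14.256
print(round(chi_stat, 4))   # 15.4541
print(chi_stat < critical)  # False: the statistic is not in the rejection region
```

Since the statistic does not fall in the lower-tail rejection region, H0 is not rejected.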

2. Describe what you understand by goodness of fit. Explain how you can test the unbiasedness of
a die using the χ² distribution.
Goodness of fit
An important problem of statistical inference is to test the hypothesis that the given data have been
obtained by random sampling from a specified population with definite values for its parameters. The
data can usually be arranged in the form of a frequency distribution, in which we are given the
observed frequencies. The corresponding theoretical frequencies are obtained from knowledge
of the population, and our problem is to test the compatibility of the observed and theoretical
frequencies, i.e. to determine whether the deviations of the observed frequencies from the theoretical
frequencies are small enough to be regarded as due to fluctuations of sampling, or whether they
indicate that the data could not possibly have come from a population giving rise to the theoretical
frequencies.
In other words, when some theoretical distribution is fitted to the given data, we are always
interested in knowing how well this distribution fits the observed data. The chi-square test can
give an answer to this. If the calculated value of chi-square (χ²) is less than the table value at a
certain level of significance, the fit is considered to be a good one, which means that the divergence
between the observed and expected frequencies is attributable to fluctuations of sampling. But if the
calculated value of chi-square (χ²) is greater than its table value, the fit is not considered to be a
good one.
We use this test to determine whether the difference between the theoretical and observed values can be
attributed to chance or not.

Explain how you can test the unbiasedness of a die using the χ² distribution.
Using the chi-square distribution we can test the unbiasedness of a die.
Here we set the null hypothesis (H0) as "the die is unbiased" and the alternative hypothesis (H1) as
"the die is biased".
Once we perform the experiment, the data obtained can be arranged in the form of a frequency
distribution, in which we are given the observed frequencies. The corresponding theoretical frequencies are
obtained from knowledge of the population: the number of times we throw the die and the probability
of any face turning up. For example, if we throw a die 132 times, the expected frequency is obtained by
multiplying 132 by 1/6, since 1/6 is the probability of any face turning up when we throw a die. Once we
have the observed and expected frequencies, we can use chi-square as a test statistic. If the calculated
value of chi-square (χ²) is less than the table value at a certain level of significance, we accept the null
hypothesis and the fit is considered to be a good one, which means that the divergence between the
observed and expected frequencies is attributable to fluctuations of sampling. But if the calculated value of
chi-square (χ²) is greater than its table value, the fit is not considered to be a good one.

Example: A die is thrown 132 times with the following results:

Number turned up   1    2    3    4    5    6
Frequency          16   20   25   14   29   28

Test the hypothesis that the die is unbiased.

Solution: Let us take the null hypothesis that the die is unbiased. If that is true,
the probability of obtaining any one of the six faces is 1/6, and as such
the expected frequency of any one face coming upward is 132 × 1/6 = 22.
Calculation for χ²

No. turned up   Observed frequency (O)   Expected frequency (E)   (O − E)²   (O − E)²/E
1               16                       22                       36         1.64
2               20                       22                       4          0.18
3               25                       22                       9          0.41
4               14                       22                       64         2.91
5               29                       22                       49         2.23
6               28                       22                       36         1.64
Total           132                      132                                 9.01

χ² = Σ (Oi − Ei)² / Ei = 9.01
2
df 6 1 5 and tabulated 5 (0.05) 11.07

Since calculated value of chi-square is less than tabulated value. Hence


we accept null hypothesis. Thus, we may conclude that the die is
unbiased.
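The goodness-of-fit calculation above can be checked with a few lines of Python. Computed exactly, the statistic is 9.0; the worked table arrives at 9.01 because each term was rounded to two decimals before summing.

```python
# Chi-square goodness-of-fit statistic for the die example.
observed = [16, 20, 25, 14, 29, 28]
n = sum(observed)                 # 132 throws
expected = n / 6                  # 22 per face under H0: the die is unbiased
chi_stat = sum((o - expected) ** 2 / expected for o in observed)
print(round(chi_stat, 2))  # 9.0
# Compare with the tabulated value 11.07 for 5 df at the 5% level:
print(chi_stat < 11.07)    # True: accept H0, the die is unbiased
```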
3. The following table gives the number of motorcycle accidents that occurred during the various days
of the week.

Days               Sunday   Monday   Tuesday   Wednesday   Thursday   Friday
No. of accidents   14       18       12        11          15         14

Test whether the accidents are uniformly distributed over the week. Take the 0.05 level of significance.
2001-q-9
Solution:
Let us take the null hypothesis that the accidents are uniformly distributed over the week. If that is
true, the probability that a motorcycle accident occurs on any given day of the week is 1/6.

The expected frequency of motorcycle accidents on each day of the week is
84 × 1/6 = 14.
Calculation for χ²

Day     Observed frequency (O)   Expected frequency (E)   (O − E)²   (O − E)²/E
1       14                       14                       0          0
2       18                       14                       16         1.1429
3       12                       14                       4          0.2857
4       11                       14                       9          0.6429
5       15                       14                       1          0.0714
6       14                       14                       0          0
Total   84                       84                                  2.1429

χ² = Σ (Oi − Ei)² / Ei = 2.1429

df = 6 − 1 = 5, and the tabulated value χ²5(0.05) = 11.07
Since the calculated chi-square is less than the tabulated value,

χ²Cal < χ²Tab i.e. 2.1429 < 11.070

we accept the null hypothesis. So, the accidents are uniformly distributed over the week.
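The accident example can be checked the same way; computed exactly, the statistic is 30/14 ≈ 2.1429, well below the 5% critical value for 5 degrees of freedom.

```python
# Chi-square test for uniformity of accidents over six days.
observed = [14, 18, 12, 11, 15, 14]
expected = sum(observed) / len(observed)   # 84 / 6 = 14
chi_stat = sum((o - expected) ** 2 / expected for o in observed)
print(round(chi_stat, 4))  # 2.1429
print(chi_stat < 11.07)    # True: accept H0, accidents are uniform over the week
```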
