
Calculating Interval Estimates of the Proportion from a Large Sample

If X is the number of successes in n independent trials with constant probability p of success for each trial, then

E(X) = np and V(X) = npq, where q = 1 - p is the probability of failure.

It has been proved that for large n the binomial distribution tends to the normal distribution. Hence for large n, X ~ N(np, npq), i.e.

Z = (X - E(X))/√V(X) = (X - np)/√(npq) ~ N(0, 1)

Let X be the number of persons possessing the given attribute in a sample of size n. Then the observed proportion of successes is

P = X/n

E(P) = E(X/n) = (1/n) E(X) = np/n = p

Thus, the sample proportion is an unbiased estimate of the population proportion. Also,

V(P) = V(X/n) = (1/n²) V(X) = npq/n² = pq/n

S.E.(P) = √(pq/n)

Since X, and consequently X/n, is asymptotically normal for large n, the normal test for the proportion of successes becomes

Z = (P - E(P))/√V(P) = (P - p)/√(pq/n) ~ N(0, 1)

Hence

Pr[-z_{α/2} ≤ (P - p)/√(pq/n) ≤ z_{α/2}] = 1 - α

Pr[-z_{α/2} √(pq/n) ≤ P - p ≤ z_{α/2} √(pq/n)] = 1 - α

Pr[P - z_{α/2} √(pq/n) ≤ p ≤ P + z_{α/2} √(pq/n)] = 1 - α

i.e. Pr[X/n - z_{α/2} √(pq/n) ≤ p ≤ X/n + z_{α/2} √(pq/n)] = 1 - α

The magnitude of the error made when we use X/n as an estimator of p is |X/n - p|, and

Pr[|X/n - p| ≤ z_{α/2} √(pq/n)] = 1 - α

Maximum error of the estimate: E = z_{α/2} √(pq/n)

The sample size needed to attain a desired degree of precision: n = p(1 - p)(z_{α/2}/E)²

If no information is available for p, we can take p = 1/2, so

Sample size n = (1/4)(z_{α/2}/E)²
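As a quick check of these formulas, the interval and the sample-size rule can be sketched in Python (the counts 120/400 and the error bound E = 0.05 are illustrative, not from the text; z = 1.96 is the two-sided 5% critical value):

```python
import math

def proportion_ci(x, n, z=1.96):
    """Large-sample confidence interval for a proportion: P +/- z*sqrt(pq/n)."""
    p = x / n
    se = math.sqrt(p * (1 - p) / n)      # S.E.(P) = sqrt(pq/n)
    return p - z * se, p + z * se

def sample_size(E, z=1.96, p=0.5):
    """n = p(1-p)(z/E)^2; with no prior information on p, take p = 1/2."""
    return math.ceil(p * (1 - p) * (z / E) ** 2)

low, high = proportion_ci(120, 400)      # observed proportion 0.30
n_needed = sample_size(0.05)             # maximum error 0.05 at 95% confidence
```

Taking p = 1/2 makes the sample-size rule conservative: any other value of p gives a smaller required n.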

Test for a Specified Proportion (Large sample)



Suppose a random sample of size n (greater than 30) has a sample proportion p of members possessing a certain attribute (i.e. proportion of successes). To test the hypothesis that the proportion P in the population has a specified value P₀:

The null hypothesis is H₀: P = P₀
Then the alternative hypothesis could be
1. H₁: P ≠ P₀
2. H₁: P > P₀
3. H₁: P < P₀

Since n is large, the sampling distribution of P is approximately normal. If H₀ is true, the test statistic

z = (p - P₀)/√(P₀Q₀/n), where Q₀ = 1 - P₀,

has an approximately standard normal distribution.

The critical region for Z, depending on the nature of the alternative hypothesis and the level of significance, is given in the following table.

The rejection rule for H₀: P = P₀ is: reject H₀ if

Alternative hypothesis                Critical region (1%)   (5%)         (10%)
H₁: P ≠ P₀ (i.e. P > P₀ or P < P₀)    |Z| > 2.58             |Z| > 1.96   |Z| > 1.645
H₁: P > P₀                            Z > 2.33               Z > 1.645    Z > 1.28
H₁: P < P₀                            Z < -2.33              Z < -1.645   Z < -1.28
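A minimal sketch of this test in Python (the counts 60/100 and P₀ = 0.5 are hypothetical; 1.96 is the two-sided 5% critical value from the table above):

```python
import math

def z_test_proportion(x, n, P0):
    """z = (p - P0)/sqrt(P0*Q0/n) for H0: P = P0, with large n."""
    p = x / n
    return (p - P0) / math.sqrt(P0 * (1 - P0) / n)

z = z_test_proportion(60, 100, 0.5)   # 60 successes in 100 trials
reject_5pct = abs(z) > 1.96           # two-sided test at the 5% level
```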

To Test Whether The Two Population Proportions P1, P2 Are Equal

The null hypothesis is H₀: P₁ = P₂
Then the alternative hypothesis could be
1. H₁: P₁ ≠ P₂
2. H₁: P₁ > P₂
3. H₁: P₁ < P₂

If we hypothesize that there is no difference between the two population proportions, then our best estimate of the overall population proportion of success is the combined proportion of success in both samples, that is:

Estimated overall proportion of success in the two populations
p̂ = (number of successes in sample 1 + number of successes in sample 2)/(total size of both samples)
  = (n₁p₁ + n₂p₂)/(n₁ + n₂), with q̂ = 1 - p̂

If H₀ is true, we have P₁ = P₂ = P (say), and the sampling distribution of p₁ - p₂ is approximately normal with mean 0 and estimated standard error (using the combined estimate from both samples)

S.E.(p₁ - p₂) = √(p̂q̂(1/n₁ + 1/n₂))

Since the sample sizes n₁ and n₂ are large, the test statistic

z = (p₁ - p₂)/S.E.(p₁ - p₂) = (p₁ - p₂)/√(p̂q̂(1/n₁ + 1/n₂))

is approximately normally distributed with mean 0 and standard deviation 1.

The rejection rule for H₀: P₁ = P₂ is: reject H₀ if

Alternative hypothesis   Critical region (1%)   (5%)         (10%)
H₁: P₁ ≠ P₂              |Z| > 2.58             |Z| > 1.96   |Z| > 1.645
H₁: P₁ > P₂              Z > 2.33               Z > 1.645    Z > 1.28
H₁: P₁ < P₂              Z < -2.33              Z < -1.645   Z < -1.28
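The pooled-proportion test can be sketched as follows (the sample figures 50/200 and 30/200 are hypothetical):

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """z for H0: P1 = P2, using the pooled estimate p = (x1+x2)/(n1+n2)."""
    p = (x1 + x2) / (n1 + n2)                        # combined proportion of success
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # S.E.(p1 - p2)
    return (x1 / n1 - x2 / n2) / se

z = two_prop_z(50, 200, 30, 200)
```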

Chi-Square Test

In the last chapter, we learned how to test hypotheses using data from
either one or two samples.
We used one-sample tests to determine whether a mean or a proportion
was significantly different from a hypothesized value.
In the two sample tests, we examined the difference between either two
means or two proportions and tried to learn whether this difference was
significant or not.
Suppose we have proportions from four populations instead of only two. In this case, the method for comparing proportions described earlier does not apply. In such a situation we must use the Chi-Square test. Chi-Square tests enable us to test whether more than two population proportions can be considered equal.
Chi-Square Test
The Chi-square test is derived from the properties of the Chi- square
distribution.
The chi-square distribution is a continuous probability distribution, and it is used in both large and small sample tests.
The Chi-Square test provides a technique whereby it is possible to:
i. Test the goodness of fit.
ii. Compare a number of frequency distributions.
iii. Find out the association and relationship between attributes.
iv. Test the population variance.
It is a test of independence, homogeneity and goodness of fit.
With the help of this test, it is possible to assess the significance of the
difference between the observed frequencies and the frequencies expected
if the data conformed to some theoretical distribution.
It is, therefore, possible to test the goodness of fit, i.e. to see how well the distribution of observed data fits the assumed theoretical distribution. If the distribution of observed data does, in fact, approximate the assumed distribution, then we would expect no significant difference between the expected frequencies and the actual frequencies.
Chi-square Variate
If x is normally distributed with mean μ and standard deviation σ, then ((x - μ)/σ)² is a chi-square (χ²) variate with 1 degree of freedom. If x₁, x₂, x₃, ..., xₙ are n independent normal variates with means μ₁, μ₂, μ₃, ..., μₙ and standard deviations σ₁, σ₂, σ₃, ..., σₙ respectively, then

χ² = ((x₁ - μ₁)/σ₁)² + ((x₂ - μ₂)/σ₂)² + ((x₃ - μ₃)/σ₃)² + ... + ((xₙ - μₙ)/σₙ)²

is a chi-square variate with n degrees of freedom.

Observed and Expected Frequencies

If a set of events A₁, A₂, A₃, ..., Aₙ are observed to occur with frequencies O₁, O₂, O₃, ..., Oₙ, and according to the probability rules are expected to occur with frequencies E₁, E₂, E₃, ..., Eₙ, then O₁, O₂, O₃, ..., Oₙ are called observed frequencies and E₁, E₂, E₃, ..., Eₙ are called expected frequencies.

If O₁, O₂, O₃, ..., Oₙ is a set of observed frequencies and E₁, E₂, E₃, ..., Eₙ is the corresponding set of expected (theoretical or hypothetical) frequencies, then Karl Pearson's chi-square is given by

χ² = (O₁ - E₁)²/E₁ + (O₂ - E₂)²/E₂ + ... + (Oₙ - Eₙ)²/Eₙ = Σᵢ₌₁ⁿ (Oᵢ - Eᵢ)²/Eᵢ

which follows the chi-square distribution with (n - 1) df.
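Karl Pearson's formula translates directly into Python; the three-class counts below are illustrative only:

```python
def pearson_chi_square(observed, expected):
    """Karl Pearson's chi-square: sum of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# three classes with equal expected frequencies (hypothetical counts)
chi2 = pearson_chi_square([30, 50, 40], [40, 40, 40])
```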


Areas of Application of the Chi-Square Test
Test the goodness of fit
An important problem of statistical inference is to test the hypothesis that the given data have been obtained by random sampling from a specified population with definite values for its parameters. The data can usually be arranged in the form of a frequency distribution, wherein we are given the observed frequencies. The corresponding theoretical frequencies are obtained from knowledge of the population, and our problem is to test the compatibility of the observed and theoretical frequencies, i.e. to determine whether the deviations of the observed frequencies from the theoretical frequencies are small enough to be regarded as due to fluctuations of sampling, or whether they indicate that the data could not have come from a population giving rise to the theoretical frequencies.

The chi-square (χ²) test enables us to see how well the distribution of observed data fits an assumed theoretical distribution such as the Binomial, Poisson, or normal distribution. In other words, when some theoretical distribution is fitted to the given data, we are always interested in knowing how well this distribution fits the observed data. The chi-square test can give an answer to this. If the calculated value of chi-square (χ²) is less than the table value at a certain level of significance, the fit is considered to be a good one, which means that the divergence between the observed and expected frequencies is attributable to fluctuations of sampling. But if the calculated value of chi-square (χ²) is greater than its table value, the fit is not considered to be a good one.
We use this test to see whether the difference between the theoretical and observed values can be attributed to chance or not.

Example: The following figures show the distribution of digits in numbers chosen at random from a telephone directory:

Digits:      0     1     2    3    4     5    6     7    8    9    Total
Frequency:   1026  1107  997  966  1075  933  1107  972  964  853  10,000

Test whether the digits may be taken to occur equally frequently in the directory. [χ²₉(0.05) = 16.919]

Solution: Here we set up the null hypothesis that the digits occur equally frequently in the directory.
Under the null hypothesis, the expected frequency for each of the digits 0, 1, 2, ..., 9 is 10000/10 = 1000. The value of chi-square is computed as follows:

Calculation for χ²
Digit   Observed frequency (O)   Expected frequency (E)   (O-E)²   (O-E)²/E
0       1026                     1000                     676      0.676
1       1107                     1000                     11449    11.449
2       997                      1000                     9        0.009
3       966                      1000                     1156     1.156
4       1075                     1000                     5625     5.625
5       933                      1000                     4489     4.489
6       1107                     1000                     11449    11.449
7       972                      1000                     784      0.784
8       964                      1000                     1296     1.296
9       853                      1000                     21609    21.609
Total   10000                    10000                             58.542

χ² = Σᵢ (Oᵢ - Eᵢ)²/Eᵢ = 58.542

Since the calculated value (58.542) is much greater than the tabulated value (16.919), we reject the null hypothesis. Thus we should conclude that the digits are not uniformly distributed in the directory.
Example: The theory predicts that the proportions of beans in the four groups A, B, C and D should be 9:3:3:1. In an experiment among 1600 beans, the numbers in the four groups were 882, 313, 287 and 118. Does the experimental result support the theory? [χ²₃(0.05) = 7.815]

Solution: Null hypothesis: the theory fits the experiment well, i.e. the experimental result supports the theory.
Under the null hypothesis, the expected frequencies can be computed as follows:
Total number of beans = 882 + 313 + 287 + 118 = 1600
These are to be divided in the ratio 9:3:3:1:

E₁ = 1600 × 9/16 = 900, E₂ = 1600 × 3/16 = 300, E₃ = 1600 × 3/16 = 300, E₄ = 1600 × 1/16 = 100

χ² = (O₁ - E₁)²/E₁ + (O₂ - E₂)²/E₂ + (O₃ - E₃)²/E₃ + (O₄ - E₄)²/E₄
   = (882 - 900)²/900 + (313 - 300)²/300 + (287 - 300)²/300 + (118 - 100)²/100 = 4.7266

df = 4 - 1 = 3 and tabulated χ²₃(0.05) = 7.815
Since the calculated value of χ² is less than the tabulated value, the null hypothesis is accepted at the 5% level of significance and we may conclude that the experimental results support the theory.
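The arithmetic of this example can be checked with a short script using the observed counts and the 9:3:3:1 ratio from the text:

```python
observed = [882, 313, 287, 118]                  # groups A, B, C, D
ratio = [9, 3, 3, 1]
N = sum(observed)                                # 1600 beans in total
expected = [N * r / sum(ratio) for r in ratio]   # [900, 300, 300, 100]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

The exact value is 4.72667, which the text truncates to 4.7266; either way it is below the 7.815 cut-off.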

Example: A survey of 320 families with 5 children each revealed the following distribution:

Number of boys:      5   4   3    2   1   0
Number of girls:     0   1   2    3   4   5
Number of families:  14  56  110  88  40  12

Is the result consistent with the hypothesis that male and female births are equally probable? (Or: fit a binomial distribution and test the goodness of fit.) [χ²₅(0.05) = 11.07]

Solution: Let the null hypothesis be that male and female births are equally probable, i.e. p = q = 1/2, where p = probability of a male birth.

p(r) = probability of r male births in a family of n = C(n, r) pʳ qⁿ⁻ʳ

The expected frequency of r male births is given by f(r) = N·C(n, r) pʳ qⁿ⁻ʳ

p(0) = probability of 0 male births in a family of 5 = C(5, 0)(1/2)⁰(1/2)⁵ = 1 × 1 × 1/32 = 1/32

The expected frequency of 0 male births is f(0) = N·C(5, 0) p⁰ q⁵ = 320 × 1/32 = 10

Similarly,
f(1) = 50, f(2) = 100, f(3) = 100, f(4) = 50, f(5) = 10

Calculation for χ²
Number of boys  Observed frequency (O)  Expected frequency (E)  (O-E)²  (O-E)²/E
0               12                      10                      4       0.40
1               40                      50                      100     2.00
2               88                      100                     144     1.44
3               110                     100                     100     1.00
4               56                      50                      36      0.72
5               14                      10                      16      1.60
Total           320                     320                             7.16

χ² = Σᵢ (Oᵢ - Eᵢ)²/Eᵢ = 7.16

df = 6 - 1 = 5 and tabulated χ²₅(0.05) = 11.07
The calculated value of chi-square is less than the tabulated value. Hence we accept the null hypothesis. Thus, we may conclude that male and female births are equally probable.
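The binomial expected frequencies and the χ² value of this example can be reproduced in Python:

```python
from math import comb

N, n, p = 320, 5, 0.5
observed = {0: 12, 1: 40, 2: 88, 3: 110, 4: 56, 5: 14}   # families by number of boys
# f(r) = N * C(n, r) * p^r * q^(n-r)
expected = {r: N * comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(n + 1)}
chi2 = sum((observed[r] - expected[r]) ** 2 / expected[r] for r in range(n + 1))
```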

Example: A die is thrown 132 times with the following result:

Number turned up:  1   2   3   4   5   6
Frequency:         16  20  25  14  29  28

Test the hypothesis that the die is unbiased.
Solution: Let us take the hypothesis that the die is unbiased. If that is true, the probability of obtaining any one of the six faces is 1/6, and as such the expected frequency of any one face coming up is 132 × 1/6 = 22.

Calculation for χ²
Number turned up  Observed frequency (O)  Expected frequency (E)  (O-E)²  (O-E)²/E
1                 16                      22                      36      1.64
2                 20                      22                      4       0.18
3                 25                      22                      9       0.41
4                 14                      22                      64      2.91
5                 29                      22                      49      2.23
6                 28                      22                      36      1.64
Total             132                     132                             9.01

χ² = Σᵢ (Oᵢ - Eᵢ)²/Eᵢ = 9.01

df = 6 - 1 = 5 and tabulated χ²₅(0.05) = 11.07
Since the calculated value of chi-square is less than the tabulated value, we accept the null hypothesis. Thus, we may conclude that the die is unbiased.
Test of independence
If we classify a population into several categories with respect to two attributes, we can then use the chi-square test to determine whether the two attributes are independent of each other.
The chi-square (χ²) test enables us to explain whether or not two attributes are associated. For instance, we may be interested in knowing whether a new medicine is effective in controlling fever or not, and the chi-square (χ²) test will help us in deciding this issue. In such a situation we proceed on the null hypothesis that the two attributes are independent, which means that the new medicine is not effective in controlling fever. On this basis we first calculate the expected frequencies and then work out the value of χ². If the calculated value of χ² is less than the table value at a certain level of significance for a given degree of freedom, then we conclude that our hypothesis stands, which means the two attributes are independent or not associated (i.e. the new medicine is not effective in controlling the fever). But if the calculated value of χ² is greater than its table value, then our inference would be that the hypothesis does not hold good, which means the two attributes are associated and the association is not because of some chance factor but exists in reality.
Theorem: In a random and large sample,

χ² = Σᵢ₌₁ᵏ (nᵢ - npᵢ)²/(npᵢ)

follows the chi-square distribution approximately with (k - 1) degrees of freedom, where nᵢ is the observed frequency and npᵢ is the corresponding expected frequency of the ith class (i = 1, 2, ..., k), with Σᵢ₌₁ᵏ nᵢ = n.

Hypothesis Concerning Several Proportions

Objective: to test whether more than two binomial populations have the same parameter.

H₀: p₁ = p₂ = ... = pₖ = p, against the alternative hypothesis that these population proportions are not all equal. To perform a suitable large-sample test of this hypothesis, we require independent random samples of sizes n₁, n₂, ..., nₖ from the k populations; the corresponding numbers of successes are X₁, X₂, ..., Xₖ.
As the test is based on large samples,

Zᵢ = (Xᵢ - nᵢpᵢ)/√(nᵢpᵢ(1 - pᵢ))

is approximately standard normal. The square of a random variable having the standard normal distribution is a random variable having the chi-square distribution with 1 df, and the sum of k independent such random variables has the chi-square distribution with k df:

χ² = Σᵢ₌₁ᵏ (Xᵢ - nᵢpᵢ)²/(nᵢpᵢ(1 - pᵢ))

In practice we substitute an estimate for the pᵢ, which under the null hypothesis are all equal; the pooled estimate is

p̂ = (x₁ + x₂ + ... + xₖ)/(n₁ + n₂ + ... + nₖ)

Then

χ² = Σᵢ₌₁ᵏ (xᵢ - nᵢp̂)²/(nᵢp̂(1 - p̂)) ~ χ² with (k - 1) df

In actual practice, when we compare more than two sample proportions it is convenient to determine the value of the chi-square statistic from the data arranged in the following table:

          Sample 1   Sample 2   ...   Sample k   Total
Success   x₁         x₂         ...   xₖ         x
Failure   n₁ - x₁    n₂ - x₂    ...   nₖ - xₖ    n - x
Total     n₁         n₂         ...   nₖ         n

Here x and n, respectively, represent the total number of successes and the total number of trials for all samples combined. The entry in the cell belonging to the ith row and jth column is called the observed cell frequency Oᵢⱼ, with i = 1, 2 and j = 1, 2, ..., k.

Under the null hypothesis H₀: p₁ = p₂ = ... = pₖ = p, we estimate p̂ = x/n. Hence the expected numbers of successes and failures for the jth sample are estimated by

e₁ⱼ = nⱼ·p̂ = nⱼx/n  and  e₂ⱼ = nⱼ·(1 - p̂) = nⱼ(n - x)/n

The quantities e₁ⱼ and e₂ⱼ are called expected frequencies, for j = 1, 2, ..., k.

In this notation, the chi-square statistic with p̂ substituted for the pᵢ can be written in the form

χ² = Σᵢ₌₁² Σⱼ₌₁ᵏ (Oᵢⱼ - eᵢⱼ)²/eᵢⱼ ~ χ² with (k - 1) df

Contingency Table
Let the data be classified into s classes A₁, A₂, A₃, ..., Aₛ according to attribute A and into t classes B₁, B₂, B₃, ..., Bₜ according to attribute B. Let Oᵢⱼ denote the observed frequency of the cell belonging to both the classes Aᵢ and Bⱼ [i = 1, 2, ..., s; j = 1, 2, ..., t]. Let (Aᵢ) and (Bⱼ) denote the totals of all the frequencies belonging to classes Aᵢ and Bⱼ respectively. Then the data can be set into an (s × t) contingency table of s rows and t columns as follows:

Classes  B₁    B₂    ...  Bⱼ    ...  Bₜ    Total
A₁       O₁₁   O₁₂   ...  O₁ⱼ   ...  O₁ₜ   (A₁)
A₂       O₂₁   O₂₂   ...  O₂ⱼ   ...  O₂ₜ   (A₂)
:        :     :          :          :     :
Aᵢ       Oᵢ₁   Oᵢ₂   ...  Oᵢⱼ   ...  Oᵢₜ   (Aᵢ)
:        :     :          :          :     :
Aₛ       Oₛ₁   Oₛ₂   ...  Oₛⱼ   ...  Oₛₜ   (Aₛ)
Total    (B₁)  (B₂)  ...  (Bⱼ)  ...  (Bₜ)  N
Example: Suppose that in four regions, the National Health Care Company samples its hospital employees' attitudes toward job performance reviews. Respondents are given a choice between the present method and a proposed new method. Here the data are classified into two classes (choice between the present method and the proposed new method) and into four classes according to geographical region (Northeast, Southeast, Central, West Coast). The sample responses are as follows:

Observed Frequency Table
                                        Northeast  Southeast  Central  West Coast  Total
Number who prefer present method        68         75         57       79          279
Number who prefer new method            32         45         33       31          141
Total employees sampled in each region  100        120        90       110         420

Expected frequency for a given cell
  = (Row total for that cell × Column total for that cell)/Total number of observations

Expected frequency for the first cell = 279 × 100/420 = 27900/420 = 66.43

Expected Frequency Table
                                          Northeast  Southeast  Central  West Coast
Number expected to prefer present method  66.43      79.71      59.79    73.07
Number expected to prefer new method      33.57      40.29      30.21    36.93
Total                                     100        120        90       110
Degrees of Freedom
Degrees of freedom is the total number of observations minus the number of independent constraints (restrictions) imposed on the observations.
In the above (contingency) table there are in all (s × t) cells, but since the marginal totals are fixed there are (s + t) constraints. These constraints are, however, not independent, since the sum of the border column frequencies must equal the sum of the border row frequencies, and thus there are only (s + t - 1) independent linear constraints. Hence the number of degrees of freedom associated with an (s × t) contingency table is

Degrees of freedom = (s × t) - (s + t - 1) = (s - 1)(t - 1)
                   = (Number of rows - 1)(Number of columns - 1)

For a 2 × 4 table with fixed row totals RT1, RT2 and column totals CT1, CT2, CT3, CT4: once three cells of the first row are freely specified, the remaining cells are determined by the marginal totals and cannot be freely specified.

Here the total number of cells is 2 × 4 = 8.
There are 2 + 4 = 6 restrictions imposed on the observations, but 1 is a dependent restriction, so there are 6 - 1 = 5 independent linear restrictions.
Degrees of freedom = (s × t) - (s + t - 1) = 8 - 5 = 3
or Degrees of freedom = (s - 1)(t - 1) = (2 - 1)(4 - 1) = 1 × 3 = 3
Critical Value
From the chi-square tables it may be observed that the critical (tabulated) value of χ² increases as n (the df) increases and as the level of significance decreases.
Let χ²ₙ(α) denote the value of chi-square for n df such that the area to the right of this point is α, i.e.

P[χ² > χ²ₙ(α)] = α

We reject the null hypothesis at level of significance α if the calculated value of χ² is greater than the tabulated value χ²ₙ(α).

Conditions for the Application of the χ² Test

The following conditions should be satisfied before the test can be applied:
i. The sample observations should be independent.
ii. Constraints on the cell frequencies, if any, should be linear.
iii. N, the total frequency, should be large (say, greater than 50).
iv. No theoretical cell frequency should be less than 5. If any theoretical cell frequency is less than 5, then for the application of the chi-square (χ²) test it is pooled with the preceding or succeeding frequency so that the pooled frequency is more than 5, and the df lost in pooling is adjusted for.

Steps Involved in Finding the Value of Chi-square

Step 1. Calculate the expected frequencies. In general, the expected frequency for any cell can be calculated as follows:

e = (Row total for that cell × Column total for that cell)/Total number of observations

Step 2. Obtain the differences between the observed and expected frequencies and find the squares of these differences, i.e. find (O - E)².
Step 3. Divide each quantity (O - E)² obtained in Step 2 by the corresponding expected frequency to get (O - E)²/E.
Step 4. Then find the sum of the (O - E)²/E values, i.e. Σ(O - E)²/E. This is the required χ² value.

The value obtained should be compared with the table value of χ² at a certain level of significance for the given degrees of freedom, and the inference drawn: if the calculated χ² value is greater than the table value χ²ₙ, reject the null hypothesis; otherwise accept it.
**Note**
It may be noted that the χ² test depends only on the set of observed and expected frequencies and on the degrees of freedom. It makes no assumptions regarding the parent population from which the observations are taken. Since χ² does not involve any population parameters, the test is known as a non-parametric test.

Test of Homogeneity
The chi-square (χ²) test helps us in stating whether different samples come from the same universe. Through this test, we can also explain whether the results worked out on the basis of samples are in conformity with a well-defined hypothesis or fail to support the given hypothesis.

Test for a Specified Population Variance

Let a random sample x₁, x₂, x₃, ..., xₙ of size n be drawn from a normal population with mean μ and variance σ². To test the hypothesis that the population variance has a specified value σ₀²:
Let the null hypothesis be H₀: σ² = σ₀²
Then the alternative hypothesis is H₁: σ² ≠ σ₀²
Assuming that H₀ is true, the test statistic is

χ² = (n - 1)s²/σ₀², where s² is the sample variance.

The test statistic follows the chi-square distribution with (n - 1) df. If the calculated value is greater than the table value χ²ₙ₋₁, reject the null hypothesis.

Confidence Intervals for the Population Variance

Suppose we want a 95% confidence interval for the variance, and for instance let the degrees of freedom be 8. We locate two points on the chi-square distribution with the given degrees of freedom: χ²_U cuts off 0.025 of the area in the upper tail of the distribution, and χ²_L cuts off 0.025 of the area in the lower tail. The values χ²_U = 17.535 and χ²_L = 2.180 can be found from the table.
The following expressions give the confidence limits for σ²:

Lower confidence limit σ²_L = (n - 1)s²/χ²_U
Upper confidence limit σ²_U = (n - 1)s²/χ²_L
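The interval can be computed directly; the sample values below (n = 9, s² = 10) are hypothetical, while the table values 17.535 and 2.180 are the df = 8 figures quoted above:

```python
def variance_ci(n, s2, chi2_upper, chi2_lower):
    """CI for sigma^2: ((n-1)s^2/chi2_U, (n-1)s^2/chi2_L)."""
    return (n - 1) * s2 / chi2_upper, (n - 1) * s2 / chi2_lower

low, high = variance_ci(9, 10.0, 17.535, 2.180)   # df = n - 1 = 8, 95% interval
```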

Example: A random sample of size 20 from a normal population gives a mean of 42 and a variance of 25. Test the hypothesis that the population variance is 64, at the 5% level of significance.
Solution:
Let the null hypothesis be H₀: σ² = σ₀² = 64
Then the alternative hypothesis is H₁: σ² ≠ 64
Assuming that H₀ is true, the test statistic is

χ² = (n - 1)s²/σ₀² = 19 × 25/64 = 475/64 = 7.42

df = 20 - 1 = 19 and tabulated χ²₁₉(0.05) = 30.14
Since the calculated value of chi-square is less than the tabulated value, we accept the null hypothesis. Thus, we may conclude that the population variance is 64.
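The test statistic of this example in Python:

```python
def chi2_variance_stat(n, s2, sigma0_sq):
    """(n - 1)s^2 / sigma0^2 for H0: sigma^2 = sigma0^2."""
    return (n - 1) * s2 / sigma0_sq

stat = chi2_variance_stat(20, 25, 64)   # 19 * 25 / 64
accept = stat <= 30.14                  # tabulated chi-square_19(0.05)
```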
Example: A sample of 20 observations from a normal distribution has mean 37 and variance 12.2. Construct a 90 percent confidence interval for the true population variance.

Lower confidence limit = (n - 1)s²/χ²_U = (20 - 1) × 12.2/30.144 = 7.69
Upper confidence limit = (n - 1)s²/χ²_L = (20 - 1) × 12.2/10.117 = 22.91

Analysis of variance (ANOVA)

The difference between two sample means can be studied through the standard error of the difference of the means of the two samples, or through Student's t-test, but difficulty arises when we need to examine the significance of the difference between more than two sample means at once.
Analysis of variance helps us test whether more than two population means can be considered equal.
Analysis of variance enables us to test for the significance of the differences between more than two sample means.
Using analysis of variance, we will be able to make inferences about whether our samples are drawn from populations having the same mean.
Sir R. A. Fisher originated the technique of analysis of variance.
The analysis of variance is essentially a technique for testing the difference between groups of data for homogeneity. It is a method of analyzing the variance to which a response is subject into its various components, corresponding to the various sources of variation. There may be variation between the samples, or there may be variation within the sample items. Thus, the technique of analysis of variance consists in splitting the variance, for analytical purposes, into its various components. Normally the variance (what can be called the total variance) is divided into two parts:
1. Variance between samples,
2. Variance within samples; such that
Total variance = Variance between samples + Variance within samples

Three steps in analysis of variance
Analysis of variance consists of three different steps:
1. Determine a first estimate of the population variance from the variance among (between) the sample means.
2. Determine a second estimate of the population variance from the variance within the samples.
3. Compare these two estimates. If they are approximately equal in value, accept the null hypothesis.

Assumption
In order to use analysis of variance, we must assume that each of the samples is drawn from a normal population and that each of these populations has the same variance.

Example: The training director of a company is trying to evaluate three different methods of training new employees. The first method assigns each new employee to an experienced employee for individual help in the factory. The second method puts all new employees in a training room separate from the factory, and the third method uses training films and programmed learning materials. The training director chooses 16 new employees assigned at random to the three training methods and records their daily production after they complete the programs:

Method 1: 15 18 19 22 11
Method 2: 22 27 18 21 17
Method 3: 18 24 19 16 22 15

The director wonders whether there are differences in effectiveness among the methods.
Analysis of variance is based on a comparison of two different estimates of the variance, σ², of the overall population.
The first estimate of the population variance can be calculated by examining the variance among (between) the three sample means. In this case, the three sample means are 17, 21 and 19:

Sample mean of Method 1 = (1/n)Σxᵢ = (x₁ + x₂ + x₃ + ... + xₙ)/n = 85/5 = 17
Sample mean of Method 2 = 105/5 = 21
Sample mean of Method 3 = 114/6 = 19

The other estimate of the population variance is determined by the variation within the three samples themselves, that is (15, 18, 19, 22, 11), (22, 27, 18, 21, 17) and (18, 24, 19, 16, 22, 15).
Then we compare these two estimates of the population variance. Because both (the first and the second) are estimates of the overall population σ², they should be approximately equal in value when the null hypothesis is true.

Calculating the variance among (between) the sample means
(Determine the first estimate of the population variance from the variance among the sample means)

The first step in analysis of variance indicates that we must obtain one estimate of the population variance from the variance among the three sample means. This estimate is called the between-column variance.
As we know, the sample variance is given by

s² = Σ(x - x̄)²/(n - 1)

Now, because we are working with three sample means and a grand mean, let us substitute x̄ for x, the grand mean x̿ for x̄, and k (the number of samples) for n, to get a formula for the variance among the sample means:

s_x̄² = Σ(x̄ - x̿)²/(k - 1)

As we know, the standard error of the sample mean is σ_x̄ = σ/√n, so the population variance is σ² = n·σ_x̄², where σ_x̄² is the variance of the sample mean. We do not know σ_x̄², but we can calculate the variance among the three sample means, s_x̄². So, substituting s_x̄² for σ_x̄², the estimated population variance is

n·s_x̄² = Σ n(x̄ - x̿)²/(k - 1)

Since different samples have different sample sizes, the first estimate of the population variance is

σ̂_b² = Σ nⱼ(x̄ⱼ - x̿)²/(k - 1)

Calculation of the between-column variance:

n   x̄    x̿    x̄ - x̿        n(x̄ - x̿)²
5   17   19   17 - 19 = -2   5 × (-2)² = 20
5   21   19   21 - 19 = 2    5 × (2)² = 20
6   19   19   19 - 19 = 0    6 × (0)² = 0
                             Σ n(x̄ - x̿)² = 40

σ̂_b² = Σ nⱼ(x̄ⱼ - x̿)²/(k - 1) = 40/(3 - 1) = 40/2 = 20

Calculating the variance within the samples
(Determine the second estimate of the population variance from the variance within the samples)

The second step in analysis of variance requires a second estimate of the population variance based on the variance within the samples. This variance is called the within-column variance.
As we know, the variance within a sample is

s² = Σ(x - x̄)²/(n - 1)

As we have assumed that the variances of our three populations are the same, we could use any one of the three sample variances (s₁², s₂², s₃²) as the second estimate of the population variance. We can get a better estimate of the population variance by using a weighted average of all three sample variances. The general formula for the second estimate of the population variance is

σ̂_w² = Σ ((nⱼ - 1)/(n_T - k))·sⱼ²

where σ̂_w² = within-column variance, nⱼ = size of the jth sample, sⱼ² = sample variance of the jth sample, k = number of samples, and n_T = total sample size.

Training Method 1
(x - x̄)       (x - x̄)²
15 - 17 = -2   4
18 - 17 = 1    1
19 - 17 = 2    4
22 - 17 = 5    25
11 - 17 = -6   36
Σ(x - x̄)² = 70

s₁² = Σ(x - x̄)²/(n - 1) = 70/4 = 17.5

Training Method 2
(x - x̄)       (x - x̄)²
22 - 21 = 1    1
27 - 21 = 6    36
18 - 21 = -3   9
21 - 21 = 0    0
17 - 21 = -4   16
Σ(x - x̄)² = 62

s₂² = 62/4 = 15.5

Training Method 3
(x - x̄)       (x - x̄)²
18 - 19 = -1   1
24 - 19 = 5    25
19 - 19 = 0    0
16 - 19 = -3   9
22 - 19 = 3    9
15 - 19 = -4   16
Σ(x - x̄)² = 60

s₃² = 60/5 = 12.0

σ̂_w² = Σ ((nⱼ - 1)/(n_T - k))·sⱼ² = (4/13)s₁² + (4/13)s₂² + (5/13)s₃²
     = (4/13)(17.5) + (4/13)(15.5) + (5/13)(12.0) = 192/13 = 14.769

Comparison of the two estimates

Step 3 in ANOVA compares these two estimates of the population variance
by computing their ratio, called the F-ratio:

F = (first estimate of the population variance, based on the variance among the sample means) / (second estimate of the population variance, based on the variance within the samples)

F = σ̂b² / σ̂w² = 20 / 14.769 = 1.354

The nearer the F-ratio comes to 1, the more we are inclined to accept the
null hypothesis. Conversely, as the F-ratio becomes larger, we are more
inclined to reject the null hypothesis and accept the alternative hypothesis.
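The steps above can be collected into a short script. This is a minimal sketch using the three training-method samples from the worked example; the variable names are illustrative, not part of any library.

```python
# One-way ANOVA by hand, following the steps in the text above.
m1 = [15, 18, 19, 22, 11]       # mean 17
m2 = [22, 27, 18, 21, 17]       # mean 21
m3 = [18, 24, 19, 16, 22, 15]   # mean 19

samples = [m1, m2, m3]
k = len(samples)                                  # number of samples
n_T = sum(len(s) for s in samples)                # total sample size
grand_mean = sum(sum(s) for s in samples) / n_T   # 19.0

def mean(s):
    return sum(s) / len(s)

# First estimate: between-column variance (variance among the sample means)
ssb = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)  # 40.0
msb = ssb / (k - 1)                                               # 20.0

# Second estimate: within-column variance (weighted average of sample variances)
ssw = sum(sum((x - mean(s)) ** 2 for x in s) for s in samples)    # 70 + 62 + 60 = 192.0
msw = ssw / (n_T - k)                                             # 192 / 13

f_ratio = msb / msw
print(round(msb, 3), round(msw, 3), round(f_ratio, 3))  # 20.0 14.769 1.354
```

Running this reproduces the between-column variance (20), the within-column variance (14.769), and the F-ratio (1.354) computed above.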
The F-Distribution
The F-distribution is a skewed distribution. It is generally skewed to the right
and tends to become more symmetrical as the numbers of degrees of freedom in the
numerator and denominator increase. The F-distribution has a single mode. The
shape of the distribution depends on the number of degrees of freedom in both
the numerator and the denominator of the F-ratio. The first number is the
degrees of freedom in the numerator of the F-ratio; the second is the degrees of
freedom in the denominator.
Fig 11.8 Pg 597 Rubin


Degrees of Freedom
As we have mentioned, each F-distribution has a pair of degrees of
freedom: one for the numerator of the F-ratio and the other for the
denominator.
While calculating the variance between the sample means we used k
different values of (x̄j − x̄), one for each sample, to calculate
Σ nj(x̄j − x̄)². In the example above, once we knew two of these
(x̄j − x̄) values, the third was automatically determined and could not be
freely specified. Thus, one degree of freedom is lost when we calculate the
variance between samples. Hence, the number of degrees of freedom for the
numerator of the F-ratio is always one fewer than the number of samples:
Number of degrees of freedom in the numerator of the F-ratio = (k − 1)

For the denominator, we calculated the variance within the samples, using
all three samples. For the jth sample, we used nj values of (x − x̄j) to
calculate Σ(x − x̄j)² for that sample. Once we knew all but one of these
(x − x̄j) values, the last was automatically determined and could not be
freely specified. Thus, we lost one degree of freedom in the calculations
for each sample, leaving us with 4, 4 and 5 degrees of freedom in the
samples. Because we had three samples, we were left with 4 + 4 + 5 = 13
degrees of freedom, which could also be calculated as 5 + 5 + 6 − 3 = 13.
Thus,
Number of degrees of freedom in the denominator of the F-ratio = (nT − k)
The F-Table

For analysis of variance, we use an F-table in which the columns represent
the number of degrees of freedom for the numerator and the rows represent the
degrees of freedom for the denominator. Suppose we are testing a hypothesis at
the 0.05 level of significance using the F-distribution, with 2 degrees of
freedom for the numerator and 13 for the denominator. The value we find in the
F-table is 3.81 (first look in the column, then in the row).
Critical Value of the F-distribution
F-tables usually give the critical value of F for a right-tailed test; the
right-tail area determines the critical region. Thus, the significant value
Fα(n1, n2) at the level of significance α, where n1 is the number of degrees of
freedom in the numerator and n2 the number of degrees of freedom in the
denominator, satisfies P[F > Fα(n1, n2)] = α, as shown in the figure.
Pg. 877 Gupta and Kapoor

If the calculated F-ratio is greater than the table value Fα(n1, n2), we reject
the null hypothesis; otherwise we accept it.
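Instead of reading the F-table, the same critical value can be obtained programmatically. This sketch assumes SciPy is available; it reproduces the table lookup for 2 and 13 degrees of freedom at the 0.05 level.

```python
# Critical value of F at the 5% level with (2, 13) degrees of freedom.
from scipy.stats import f

alpha = 0.05
dfn, dfd = 2, 13                       # numerator and denominator df
critical = f.ppf(1 - alpha, dfn, dfd)  # upper-tail critical value
print(round(critical, 2))  # 3.81, matching the F-table

# Decision rule: reject H0 only if the calculated F-ratio exceeds this value.
f_ratio = 20 / 14.769                  # the ratio computed in the example
print(f_ratio > critical)  # False: do not reject H0
```

Here `f.ppf(1 - alpha, dfn, dfd)` returns the point with right-tail area alpha, which is exactly the Fα(n1, n2) defined above.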
Statement of Hypotheses
Null Hypothesis: There is no significant difference between the population means.
In our example above, suppose the director of training wants to test at the 0.05
level the hypothesis that there are no differences among the three training
methods.
We set the null hypothesis as H0: μ1 = μ2 = μ3
and the alternative hypothesis as H1: μ1, μ2 and μ3 are not all equal.

Analysis of Variance Table

Source of variation   Sum of squares                                              df        Mean square         Test statistic
Between               SSB = Σ nj(x̄j − x̄)²,  j = 1, 2, …, k                       (k − 1)   MSB = SSB/(k − 1)   F = MSB/MSW
Within                SSW = Σ(xi1 − x̄1)² + Σ(xi2 − x̄2)² + … + Σ(xik − x̄k)²      (N − k)   MSW = SSW/(N − k)   −
Total                 SST = Σ(xi − x̄)²                                            (N − 1)

1. A company manufactures gold-weighing balances. It maintains strict quality control over its
products and does not release a balance for sale unless the balance shows variability significantly
below one microgram (at alpha = 0.01) when weighing quantities of about 500 grams. A new
balance has just been delivered to the quality control division from the production line. This new
balance is tested by using it to weigh the same 500-gram standard weight 30 different times; the
standard deviation turns out to be 0.73 microgram. Should this balance be released for sale?
2001 q-4
In this question we have to test the variability of the balance.
Given,
Sample size (n) = 30
Sample SD (s) = 0.73 microgram
Alpha = 0.01

Let the null hypothesis be H0: σ² = σ0² = 1

Then the alternative hypothesis is H1: σ² < 1

Assuming that H0 is true, the test statistic is

χ² = (n − 1)s² / σ0² = (30 − 1) × (0.73)² / 1 = 29 × 0.5329 = 15.4541

df = 30 − 1 = 29, and the tabulated lower-tail value (since H1 is left-tailed) is χ²29(0.01) = 14.256

Because the alternative hypothesis is H1: σ² < 1, the rejection region is the lower
tail: we reject H0 only if the calculated chi-square falls below the tabulated value.
Here the calculated value of chi-square is greater than the tabulated value,

χ²Cal > χ²Tab i.e. 15.4541 > 14.256

so we cannot reject the null hypothesis. Thus, we cannot conclude that the
population variance is significantly below 1, and the balance should not be
released for sale.
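As a quick check, the test can be reproduced in Python. This is a sketch assuming SciPy is available; note that for the left-tailed alternative H1: σ² < 1 the rejection region is the lower tail of the chi-square distribution.

```python
# Left-tailed chi-square test for a population variance.
from scipy.stats import chi2

n, s, sigma0_sq, alpha = 30, 0.73, 1.0, 0.01
chi_stat = (n - 1) * s ** 2 / sigma0_sq   # 29 * 0.5329 = 15.4541
critical = chi2.ppf(alpha, n - 1)         # lower-tail critical value, about 14.256
print(round(chi_stat, 4))   # 15.4541
print(chi_stat < critical)  # False: the statistic is not in the rejection region
```

Since the statistic does not fall in the lower-tail rejection region, H0 is not rejected.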

2. Describe what you understand by goodness of fit. Explain how you can test the unbiasedness of
a die using the χ² distribution.
Goodness of fit
An important problem of statistical inference is to test the hypothesis that the given data have been
obtained by random sampling from a specified population with definite values for its parameters. The
data can usually be arranged in the form of a frequency distribution, in which we are given the
observed frequencies. The corresponding theoretical frequencies are obtained from knowledge
of the population, and our problem is to test the compatibility of the observed and theoretical
frequencies, i.e. to determine whether the deviations of the observed frequencies from the theoretical
frequencies are small enough to be regarded as due to fluctuations of sampling, or whether they
indicate that the data could not possibly have come from a population giving rise to the theoretical
frequencies.
In other words, when some theoretical distribution is fitted to the given data, we are always
interested in knowing how well this distribution fits the observed data. The chi-square test can
give an answer to this. If the calculated value of chi-square (χ²) is less than the table value at a
certain level of significance, the fit is considered to be a good one, which means that the divergence
between the observed and expected frequencies is attributable to fluctuations of sampling. But if the
calculated value of chi-square (χ²) is greater than its table value, the fit is not considered to be a
good one.
We use this test to determine whether the difference between the theoretical and observed values can be
attributed to chance or not.

Explain how you can test the unbiasedness of a die using the χ² distribution.
Using the chi-square distribution we can test the unbiasedness of a die.
Here we set the null hypothesis (H0) as "the die is unbiased" and the alternative hypothesis (H1) as
"the die is biased".
Once we perform the experiment, the data obtained can be arranged in the form of a frequency
distribution, in which we are given the observed frequencies. The corresponding theoretical frequencies are
obtained from knowledge of the population: the number of times we throw the die and the probability
of any face turning up. For example, if we throw a die 132 times, the expected frequency is obtained by
multiplying 132 by 1/6, since 1/6 is the probability of any face turning up when we throw a die. Once we
have the observed and expected frequencies, we can use chi-square as a test statistic. If the calculated
value of chi-square (χ²) is less than the table value at a certain level of significance, we accept the null
hypothesis and the fit is considered to be a good one, which means that the divergence between the
observed and expected frequencies is attributable to fluctuations of sampling. But if the calculated value of
chi-square (χ²) is greater than its table value, the fit is not considered to be a good one.

Example: A die is thrown 132 times with the following results:

Number turned up   1    2    3    4    5    6
Frequency          16   20   25   14   29   28

Test the hypothesis that the die is unbiased.

Solution: Let us take the null hypothesis that the die is unbiased. If that is true,
the probability of obtaining any one of the six faces is 1/6, and as such
the expected frequency of any one face coming upward is 132 × 1/6 = 22.
Calculation for χ²

No. turned up   Observed frequency (O)   Expected frequency (E)   (O − E)²   (O − E)²/E
1               16                       22                       36         1.64
2               20                       22                       4          0.18
3               25                       22                       9          0.41
4               14                       22                       64         2.91
5               29                       22                       49         2.23
6               28                       22                       36         1.64
Total           132                      132                                 9.01

χ² = Σ (Oi − Ei)² / Ei = 9.01
2
df 6 1 5 and tabulated 5 (0.05) 11.07

Since calculated value of chi-square is less than tabulated value. Hence


we accept null hypothesis. Thus, we may conclude that the die is
unbiased.
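The goodness-of-fit calculation above can be checked with a few lines of Python. Computed exactly, the statistic is 9.0; the worked table arrives at 9.01 because each term was rounded to two decimals before summing.

```python
# Chi-square goodness-of-fit statistic for the die example.
observed = [16, 20, 25, 14, 29, 28]
n = sum(observed)                 # 132 throws
expected = n / 6                  # 22 per face under H0: the die is unbiased
chi_stat = sum((o - expected) ** 2 / expected for o in observed)
print(round(chi_stat, 2))  # 9.0
# Compare with the tabulated value 11.07 for 5 df at the 5% level:
print(chi_stat < 11.07)    # True: accept H0, the die is unbiased
```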
3. The following table gives the number of motorcycle accidents that occurred during the various days
of the week.

Days               Sunday   Monday   Tuesday   Wednesday   Thursday   Friday
No. of accidents   14       18       12        11          15         14

Test whether the accidents are uniformly distributed over the week. Take the 0.05 level of significance.
2001-q-9
Solution:
Let us take the null hypothesis that the accidents are uniformly distributed over the week. If that is
true, the probability that a motorcycle accident occurs on any given day of the week is 1/6.

The expected frequency of motorcycle accidents on each day of the week is
84 × 1/6 = 14.
Calculation for χ²

Day     Observed frequency (O)   Expected frequency (E)   (O − E)²   (O − E)²/E
1       14                       14                       0          0
2       18                       14                       16         1.1429
3       12                       14                       4          0.2857
4       11                       14                       9          0.6429
5       15                       14                       1          0.0714
6       14                       14                       0          0
Total   84                       84                                  2.1429

χ² = Σ (Oi − Ei)² / Ei = 2.1429

df = 6 − 1 = 5, and the tabulated value χ²5(0.05) = 11.07
Since the calculated chi-square is less than the tabulated value,

χ²Cal < χ²Tab i.e. 2.1429 < 11.070

we accept the null hypothesis. So, the accidents are uniformly distributed over the week.
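The accident example can be checked the same way; computed exactly, the statistic is 30/14 ≈ 2.1429, well below the 5% critical value for 5 degrees of freedom.

```python
# Chi-square test for uniformity of accidents over six days.
observed = [14, 18, 12, 11, 15, 14]
expected = sum(observed) / len(observed)   # 84 / 6 = 14
chi_stat = sum((o - expected) ** 2 / expected for o in observed)
print(round(chi_stat, 4))  # 2.1429
print(chi_stat < 11.07)    # True: accept H0, accidents are uniform over the week
```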
