
Statistics 111

Updated lecture notes


Fall 2013
Warren J. Ewens
wewens@sas.upenn.edu
Room 324 Leidy Labs
(corner 38th St and Hamilton Walk).
These notes provide an outline of some of the material to be discussed in the first few lectures of STAT 111. They are important because they provide the framework for all the material given during the entire course. Also, much of this material is not given in the textbook for the course.
Introduction and basic ideas
What is Statistics?
Statistics is the science of analyzing data in whose generation chance has played some part.
This explains why statistical methods are important in many areas, including for example sociology, psychology, biology, economics and anthropology. In these areas there are many chance mechanisms at work. For example, in biology the random transmission of one chromosome from a pair of chromosomes from parent to offspring introduces a chance mechanism into many areas of genetics. Second, in all the above areas data are usually derived from some random sample of individuals. A different sample would almost certainly yield different data, so that the sampling process introduces a second chance element. Finally, in economics the values of quantities such as the Dow Jones industrial average cannot be predicted in advance, since their values are affected by many chance events that we cannot know of in advance.
Everyone knows that you cannot make much progress in such areas as physics, astronomy
and chemistry without using mathematics. Similarly, you cannot make much progress in
such areas as psychology, sociology and biology without using statistics.
Because of the chance, or random, aspect in the generation of statistical data, it is necessary, in discussing statistics, to also consider aspects of probability theory. The syllabus for this course thus starts with an introduction to probability theory, and this is reflected in these introductory notes. But before discussing probability theory, we have to discuss the relation between probability theory and statistics.
The relation between probability theory and statistics
Most of the examples given in the class concern simple situations and are not taken from the
sociological, psychological, etc. contexts. This is done so that the basic ideas of probability
theory and statistics will not be confounded with the complexities arising in those areas. So
we start here with a simple example concerning the flipping of a coin.
Suppose that we have a coin that we suspect of being biased towards heads. To check up on this suspicion we flip the coin (say) 2,000 times and observe the number of heads that we get. If the coin is fair, we would, beforehand, expect to see about 1,000 heads. If, once we flipped the coin, we got 1,973 heads, we would obviously (and reasonably) claim that we have very good evidence that the coin is biased towards heads. If you think about it, the reasoning that you went through in coming to this conclusion was something like this: "If the coin is fair it is extremely unlikely that I would get 1,973 heads from 2,000 flips. Thus since I did in fact get 1,973 heads, I have strong evidence that the coin is unfair."
Equally obviously, if we got 1,005 heads, we would conclude that we do not have good evidence that the coin is biased towards heads. Again, the reason for coming to this conclusion is that a fair coin can easily give 1,005 (or more) heads from 2,000 flips.
But these are extreme cases, and reality often has to deal with more gray-area cases.
What if we saw 1,072 heads? Intuition and common sense might not help in such a case.
What we have to do is to calculate the probability that we would get 1,072 or more heads if
the coin is fair. If this probability is low we might conclude that we have significant evidence that the coin is biased towards heads. If this probability is fairly high we might conclude that we do not have significant evidence that the coin is biased.
The conclusion that we draw is an act of statistical inference, or a statistical induction.
An inference, or an induction, is a conclusion that we draw about reality, based on some
observation or observations. The reason why this is a statistical inference (or induction) is
that it is based on a probability calculation. No statistical inference can be made without
first making the relevant corresponding probability calculation.
In the above example, probability theory calculations (which we will do later) show that the probability of getting 1,072 or more heads from 2,000 flips of a fair coin is very low (less than 0.01). Thus having observed 1,072 heads in our 2,000 flips, we would reasonably conclude that we have significant evidence that the coin is biased.
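The probability statements made above about 1,072 heads and about 1,005 heads can be checked directly. The following sketch (our illustration, not part of these notes; the function name fair_coin_at_least is ours) does the exact calculation in Python using only the standard library:

```python
from math import comb

def fair_coin_at_least(k, n):
    """Exact P(at least k heads in n flips of a fair coin)."""
    return sum(comb(n, x) for x in range(k, n + 1)) / 2**n

print(fair_coin_at_least(1072, 2000))  # about 0.0007, well below 0.01
print(fair_coin_at_least(1005, 2000))  # about 0.42, so 1,005 heads is unremarkable
```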
Here is a more important example. Suppose that we are using some medicine (the current medicine) to cure some illness. From long experience we know that, for any person having this illness, the probability that the current medicine cures that patient is 0.8. A new medicine is proposed as being better than the current one. To test whether this claim is justified we plan to conduct a clinical trial, in which the new medicine will be given to 2,000 people suffering from the disease in question. If the new medicine is only as effective as the current one we would, beforehand, expect it to cure about 1,600 of these people. If after the clinical trial is conducted the proposed new medicine cured 1,945 people, no-one would doubt that it is better than the current medicine. Again, the reason for this opinion is something like: "If the new medicine has the same cure rate as the current one, it is extremely unlikely that it would cure 1,945 people out of 2,000. But it did cure 1,945 people, and therefore I have significant evidence that its cure rate is higher than that of the current medicine."
But, equally obviously, if the proposed medicine cured 1,615 people we do not have strong evidence that it is better than the current medicine. The reason for this is that if the new medicine is only as effective as the current one, that is, if the probability of a cure with the new medicine is the same (0.8) as that for the current medicine, we can easily observe 1,615 (or more) people cured with the new medicine.
Again these are extreme cases, and reality often has to deal with more gray-area cases. What if the new medicine cured 1,628 people? Intuition and common sense might not help in such a case. What we have to do is to calculate the probability that we would get 1,628 or more people cured with the new medicine if it is only as effective as the current medicine. This probability is about 0.11, and because this is not a really small probability we might conclude that we do not have significant evidence that the new medicine is superior to the current one. Drawing this conclusion is an act of statistical inference.
Statistics is a difficult subject for two reasons. First, we have to think of the situation both before and after our experiment (in the medicine case the experiment is giving the new medicine to the individuals in the clinical trial), and go back and forth several times between these two time points in any statistical operation. This is not easy. Second, before the experiment we have to consider aspects of probability theory. Unfortunately our minds are not wired up well to think in terms of probabilities. (Think of the "two fair coins" example given in class, and also the Monty Hall situation.)
The central point is this: no statistical operation can be carried out without considering the situation before the experiment is performed. Because, at this time point, we do not know what will happen in our experiment, these considerations involve probability calculations. We therefore have to consider the general features of the "before the experiment" probability theory situation and the relation between these aspects and the "after the experiment" statistical aspects. We will do this later, after first looking more closely at the relation between the deductive processes of probability and the inductive processes of statistics.
Deductions (implications) and inductions (inferences)
Probability theory is a deductive activity, and uses deductions (also called implications). It
starts with some assumed state of the world (for example that the coin is fair), and enables
us to make various probability calculations relevant to our proposed experiment. Statistics
is the converse, or inductive, operation, and uses inductions (also called inferences). It starts
with data from our experiment and attempts to make objective statements about the un-
known real world (whether the coin is fair or not). These inductive statements are always
based on some probability calculation. The relation between probability and statistics can be
seen from the following diagram:

    Some unknown reality, and a hypothesis about it.
        ---> Probability theory (deductive) --->
        <--- Statistical inference (inductive): uses data (what is observed in an experiment) to test this hypothesis. <---
This diagram makes it clear that to learn how to conduct a statistical procedure we first have to discuss probability on its own. We now do this.
Probability Theory
Events and their probabilities
As has been discussed above, any discussion of Statistics requires a prior discussion of prob-
ability theory. In this section an introduction to probability theory is given as it applies to
probabilities of events.
Events
An event is something which does or does not happen when some experiment is performed,
field survey is conducted, etc. Consider for example a Gallup poll, in which (say) 2,000
people are asked, before an election involving two candidates, Smith and Jones, whether
they will vote for Smith or Jones. Here are some events that could occur:-
1. More people say that they will vote for Smith than say they will vote for Jones.
2. At least 1,200 people say that they will vote for Jones.
3. Exactly 1,124 people say that they will vote for Smith.
We will later discuss probability theory relating to Gallup polls. However, all the exam-
ples given below relate to events involving rolling a six-sided die, since that is a simple and
easily understood situation. Here are some events that could occur in that context:-
1. An even number turns up.
2. The number 3 turns up.
3. A 3, 4, 5 or a 6 turns up.
Clearly there are many other events that we could consider. Also, with two or more rolls of the die, we have events like "a 6 turns up both times on two rolls of a die", "in ten rolls of a die, a 3 never turns up", and so on.
Notation
We denote events by upper-case letters at the beginning of the alphabet, and also the letter S and the symbol ∅. So in the die-rolling example we might have:-
1. A is the event that an even number turns up.
2. B is the event that the number 3 turns up.
3. C is the event that a 3, 4, 5 or a 6 turns up.
The letter S has a special meaning. In the die-rolling example it is the event that the
number turning up is 1, 2, 3, 4, 5 or 6. In other words, S is the certain event. It comprises
all possible events that could occur.
The symbol ∅ also has a special meaning. This is the so-called empty event. It is an event that cannot occur, such as rolling both an even number and an odd number in one single roll of a die. It is an important event when considering intersections of events - see below.
Unions, intersections and complements of events
Given a collection of events we can define various derived events. The most important of these are unions of events, intersections of events, and complements of events. These are defined as follows:-
(i) Unions of events: If D and E are events, the union of D and E, written D ∪ E, is the event that either D, or E, or both occur. In the die-rolling example above, A ∪ B is the event that a 2, 3, 4 or 6 turns up, A ∪ C is the event that a 2, 3, 4, 5 or 6 turns up, and B ∪ C is the event that a 3, 4, 5 or a 6 turns up. (Notice that in this case B ∪ C is the same as C.)
(ii) Intersections of events: If D and E are events, the intersection of D and E, written D ∩ E, is the event that both D and E occur. In the die-rolling example above, A ∩ B is the empty event ∅, since A and B cannot both occur, A ∩ C is the event that a 4 or a 6 turns up, and B ∩ C is the event that the number 3 turns up. (Notice that in this case B ∩ C is the same as B.)
(iii) Complements of events: If D is an event, Dᶜ is the complementary event to D. It is the event that D does not occur. In the three examples above, Aᶜ is the event that an odd number turns up, Bᶜ is the event that some number other than 3 turns up, and Cᶜ is the event that a 1 or a 2 turns up.
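As a small illustration (ours, not part of these notes), the die-rolling events can be represented as Python sets, for which union, intersection and complement are built in:

```python
S = {1, 2, 3, 4, 5, 6}    # the certain event: every possible outcome
A = {2, 4, 6}             # an even number turns up
B = {3}                   # the number 3 turns up
C = {3, 4, 5, 6}          # a 3, 4, 5 or 6 turns up

print(A | B)   # union A ∪ B: {2, 3, 4, 6}
print(A & C)   # intersection A ∩ C: {4, 6}
print(A & B)   # A ∩ B: set(), i.e. the empty event
print(S - A)   # complement of A: {1, 3, 5}, an odd number turns up
```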
Probabilities of events
The concept of a probability is quite a complex one. These complexities are not discussed here: we will be satisfied with a straightforward intuitive concept of probability as in some sense meaning a long-term frequency. Thus we would say, for a fair coin, that the probability of a head is 1/2, in the sense that we think that in a very large number of flips of this coin, we will get a head almost exactly half the time. We are interested here in the probabilities of events, and we write the probability of the event A as P(A), the probability of the event B as P(B), and so on.
Probabilities of derived events
The probabilities for the union and the intersection of two events are linked by the following equation. If D and E are any two events,

P(D ∪ E) = P(D) + P(E) − P(D ∩ E).

To check this equation, we note that P(A ∪ C) = 5/6, and this is given by P(A) + P(C) − P(A ∩ C) = 1/2 + 2/3 − 1/3 = 5/6. The other examples can be checked similarly.
It is always true that for any event D, P(Dᶜ) = 1 − P(D). This is obvious: the probability that D does not occur is 1 minus the probability that D does occur.
Suppose that the die in the die-rolling example is fair. Then the probabilities of the
various union and intersection events discussed above are as follows:-
P(A) = 1/2.    P(A ∪ B) = 2/3.    P(A ∩ B) = 0.
P(B) = 1/6.    P(A ∪ C) = 5/6.    P(A ∩ C) = 1/3.
P(C) = 2/3.    P(B ∪ C) = 2/3.    P(B ∩ C) = 1/6.

Notice that the probability of the empty event ∅ is 0.
Mutually exclusive events
Two events D and E are mutually exclusive if they cannot both occur together. Then their intersection is the empty event ∅ and

P(D ∪ E) = P(D) + P(E).    (1)

A generalization of this formula is that if D, E, F, . . . , H are all mutually exclusive events (that is, no two of them can happen together), then

P(D ∪ E ∪ F ∪ · · · ∪ H) = P(D) + P(E) + P(F) + · · · + P(H).    (2)
Independence of events
Two events D and E are said to be independent if P(D ∩ E) = P(D) × P(E). It can be seen from the calculations given above that A and B are not independent, B and C are not independent, but that A and C are independent.
The intuitive meaning of independence is that if two events are independent, and if you
are told that one of them has occurred, then this information does not change the probability
that the other event occurs. Thus in the above example, if you are given the information
that an even number turned up (event A), then the probability that a 3, 4, 5 or a 6 turns up
(event C) is still 2/3, which is the probability of the event C without this information being given. Similarly if you are told that the number that turned up was a 3, 4, 5 or 6, then the probability that an even number turned up is still 1/2. (The calculations confirming this are given in the next section.)
The generalization of the above is the following: if D, E, F, . . . , H are independent events, then the probability of the intersection of all these events is

P(D ∩ E ∩ F ∩ · · · ∩ H) = P(D) × P(E) × P(F) × · · · × P(H).    (3)
An example
A fair six-sided die is to be rolled twice. What is the probability that the sum of the two numbers to turn up is 6?
This sum can be 6 in five mutually exclusive ways:-
1 turns up on the first roll, 5 turns up on the second roll (event D).
2 turns up on the first roll, 4 turns up on the second roll (event E).
3 turns up on the first roll, 3 turns up on the second roll (event F).
4 turns up on the first roll, 2 turns up on the second roll (event G).
5 turns up on the first roll, 1 turns up on the second roll (event H).
The union of these five events is the event "the sum of the two numbers to turn up is 6".
Next, the number to turn up on the first roll is independent of the number to turn up on the second roll. Thus the probability of each of the above five events is

(1/6) × (1/6) = 1/36.
Using the formula for the probability of the union of mutually exclusive events, the probability of the event "the sum of the two numbers to turn up is 6" is

1/36 + 1/36 + 1/36 + 1/36 + 1/36 = 5/36.
The calculations above assume that the die is fair. For an unfair die we might reach a different conclusion than the one that we reach for a fair die. For example, if the die is biased, so that the probabilities for a 1, 2, 3, 4, 5 or 6 turning up are, respectively, 0.1, 0.3, 0.1, 0.2, 0.2 and 0.1, then with the events A, B and C as defined above, P(A) = 0.6, P(C) = 0.6 and P(A ∩ C) is 0.3. Since 0.6 × 0.6 = 0.36 ≠ 0.3, the events A and C are now not independent.
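Both of these calculations can be checked by brute force. The sketch below (ours, not part of the notes) enumerates the 36 ordered outcomes of two fair rolls, and then re-does the independence check for the biased die, using exact fractions to avoid rounding:

```python
from fractions import Fraction
from itertools import product

# Fair die rolled twice: every ordered pair has probability (1/6) x (1/6) = 1/36.
pairs = product(range(1, 7), repeat=2)
print(sum(Fraction(1, 36) for a, b in pairs if a + b == 6))   # 5/36

# The biased die of the text: P(1), ..., P(6) = 0.1, 0.3, 0.1, 0.2, 0.2, 0.1.
p = {1: Fraction(1, 10), 2: Fraction(3, 10), 3: Fraction(1, 10),
     4: Fraction(1, 5), 5: Fraction(1, 5), 6: Fraction(1, 10)}
A, C = {2, 4, 6}, {3, 4, 5, 6}

def prob(event):
    return sum(p[v] for v in event)

print(prob(A), prob(C), prob(A & C))        # 3/5, 3/5, 3/10
print(prob(A) * prob(C) == prob(A & C))     # False: A and C are not independent
```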
Conditional probabilities
We often wish to calculate the probability of some event D, given that some other event E
has occurred. Such a probability is called a conditional probability, and is denoted P(D|E).
The conditional probability P(D|E) is calculated by the formula

P(D|E) = P(D ∩ E) / P(E).    (4)
It is essential to calculate P(D|E) using this formula: using any other approach, and in
particular using common sense, will usually give an incorrect answer.
If the events D and E are independent, then P(D|E) = P(D). In other words, D and E are independent if the knowledge that E has occurred does not change the probability that D occurs. In the fair die example given above, equation (4) shows that P(A|C) = (1/3)/(2/3) = 1/2, and this is equal to P(A). This confirms that A and C are independent (for a fair die). In the unfair die example given above, equation (4) shows that P(A|C) = 0.3/0.6 = 0.5, and this is not equal to P(A), which is 0.6. This confirms that for this unfair die A and C are not independent.
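A short sketch of equation (4) applied to both dice (again ours, not from the notes; the helper names prob and cond are our own):

```python
from fractions import Fraction

A, C = {2, 4, 6}, {3, 4, 5, 6}
fair   = {v: Fraction(1, 6) for v in range(1, 7)}
biased = {1: Fraction(1, 10), 2: Fraction(3, 10), 3: Fraction(1, 10),
          4: Fraction(1, 5), 5: Fraction(1, 5), 6: Fraction(1, 10)}

def prob(p, event):
    return sum(p[v] for v in event)

def cond(p, d, e):
    """P(D | E) = P(D ∩ E) / P(E), equation (4)."""
    return prob(p, d & e) / prob(p, e)

print(cond(fair, A, C), prob(fair, A))       # 1/2 and 1/2: independent for the fair die
print(cond(biased, A, C), prob(biased, A))   # 1/2 and 3/5: not independent for the biased die
```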
Probability: One Discrete Random Variable
Random variables and data
In this section we define some terms that we will use often. We do this in terms of the coin flipping example, but the corresponding definitions for other examples are easy to imagine.
Before we flip the coin the number of heads that we will get is unknown to us. This number is therefore called a random variable. It is a "before the experiment" concept. By the word "data" we mean the observed value of a random variable once some experiment is performed. In the coin example, once we have flipped the coin the data is simply the number of heads that we did get. It is the observed value of the random variable once the experiment of flipping the coin is carried out. It is an "after the experiment" concept.
To assist us with keeping the distinction between random variables and data clear, and as a matter of notational convention, a random variable (a "before the experiment is carried out" concept) is always denoted by an upper-case Roman letter. We use the upper-case letter X in these notes for this purpose. It is a concept of our mind - at this stage we do not know its value. In the coin example the random variable X is the concept of our mind "the number of heads we will get, tomorrow, when we flip the coin".
The notation for the "after the experiment is done" data is the corresponding lower-case letter. So after we have flipped the coin we would denote the number of heads that we did get by the corresponding lower-case letter x. Thus it makes sense, after the coin has been flipped, to say x = 1,142. It does not make sense before the coin is flipped to say X = 1,142. This second statement "does not compute".
There are therefore two notational conventions that we always use: upper-case Roman letters for random variables, lower-case Roman letters for data. We will later find a third notational convention (for parameters).
Definition: one discrete random variable
In this section we give informal definitions for discrete random variables and their probability distributions rather than the formal definitions often found in statistics textbooks. Continuous random variables will be considered in a later section.
A discrete random variable is a numerical quantity that, in some future experiment involving some degree of randomness, will take one value from some discrete set of possible values. These possible values are usually known before the experiment: in the coin example the possible values of X, the number of heads that will turn up, tomorrow, when we will flip the coin 2,000 times, are clearly 0, 1, 2, 3, . . . , 2,000. In practice the possible values of a discrete random variable often consist of the numbers 0, 1, 2, 3, . . . , k, for some number k.
The probability distribution of a discrete random variable; parameters
The probability distribution of a discrete random variable X is a listing of the possible values
that this random variable can take, together with their respective probabilities. If there are
k possible values of X, namely v_1, v_2, . . . , v_k, with respective probabilities P(v_1), P(v_2), . . . , P(v_k), this probability distribution can be written generically as

Possible values of X:          v_1      v_2     . . .    v_k
Respective probabilities:    P(v_1)   P(v_2)    . . .   P(v_k)        (5)
In some cases we know (or hypothesize) the probabilities of the possible values v_1, v_2, . . . , v_k. For example, if in the coin example we know that the coin is fair, the probability distribution of X, the number of heads that we would get on two flips of the coin, is found as follows.
The possible values of X are 0, 1 and 2. We will get 0 heads if both flips give tails, and since the outcomes of the two flips are independent, the probability of this is (1/2) × (1/2) = 1/4. We can get 1 head in two ways: head on the first flip and tail on the second, and tail on the first flip and head on the second. These events are mutually exclusive, and arguing as above each has probability 1/4. Thus the total probability of 1 head is 1/2. Similarly the probability of 2 heads is 1/4. This leads to the following probability distribution:-

Possible values of X:          0      1      2
Respective probabilities:    .25    .50    .25        (6)

Here P(0) = .25, P(1) = .5, P(2) = .25.
Suppose more generally that the probability of getting a head on any flip is some value θ. We continue to define X as the number of heads that we would get on two flips of the coin, and the possible values of X are still 0, 1 and 2. We will get 0 heads if both flips give tails, and since the outcomes of the two flips are independent, the probability of this is (1 − θ) × (1 − θ) = (1 − θ)². As above, we can get 1 head in two ways: head on the first flip and tail on the second, and tail on the first flip and head on the second. These events are mutually exclusive, and arguing as above each has probability θ(1 − θ). Thus the probability of getting one head is 2θ(1 − θ). Similarly the probability of 2 heads is θ². This leads to the following probability distribution:-

Possible values of X:            0              1             2
Respective probabilities:    (1 − θ)²      2θ(1 − θ)         θ²        (7)

In this case,

P(0) = (1 − θ)²,   P(1) = 2θ(1 − θ),   P(2) = θ².    (8)

Here θ is a so-called parameter: see more on these below.
The probability distribution (8) can be generalized to the case of an arbitrary number of flips of the coin - see (9) below.
The binomial distribution
There are many important discrete probability distributions that arise often in the applications of probability and statistics to real-world problems. Each one of these distributions is appropriate under some collection of requirements specific to that distribution. Here we focus only on the most important of these distributions, namely the binomial distribution, and consider first the requirements for it to be appropriate.
The binomial distribution arises if, and only if, all four of the following requirements hold. First, we plan to conduct some fixed number n of trials. (By "fixed" we mean fixed in advance, and not, for example, determined by the outcomes of the trials as they occur.) Second, there must be exactly two possible outcomes on each trial. The two outcomes are often called, for convenience, "success" and "failure". (Here we might regard getting a head on the flip of a coin as a success and a tail as a failure.) Third, the various trials must be independent - the outcome of any trial must not affect the outcome of any other trial. Finally, the probability of success must be the same on all trials. One must be careful when using a binomial distribution that all four of these conditions hold. We reasonably believe that these conditions hold when flipping a coin.
We often denote the probability of success on each trial by θ, since in practice this is often unknown. That is, it is a parameter. The random variable of interest is the total number X of successes in the n trials. The probability distribution of X is given by the (binomial distribution) formula

P(x) = (n choose x) θ^x (1 − θ)^(n−x),    x = 0, 1, 2, . . . , n.    (9)
The binomial coefficient in (9), spoken as "n choose x" and written here as (n choose x), is the number of different orderings in which x successes can arise in the n trials. It is calculated as n!/(x!(n − x)!), where x! = x(x − 1)(x − 2) · · · 3 × 2 × 1.
The factor 2 in (8) is an example of a binomial coefficient, reflecting the fact that there are two orders (success followed by failure, and failure followed by success) in which we can obtain one success and one failure in two trials. This is also given by the calculation 2!/(1! × 1!) = 2.
In the expression (9), θ is the parameter, and n is called the index, of the binomial distribution. The probabilities in (8) are binomial distribution probabilities for the case n = 2, and can be found from (9) by putting n = 2 and considering the respective values x = 0, 1 and 2.
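A minimal sketch of formula (9) in Python (not part of the notes; the function name binom_pmf and the value of θ used here are ours):

```python
from math import comb

def binom_pmf(x, n, theta):
    """Equation (9): P(x) = (n choose x) * theta**x * (1 - theta)**(n - x)."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

theta = 0.3   # an arbitrary illustrative value of the parameter
# The case n = 2 reproduces (8): (1 - theta)^2, 2*theta*(1 - theta), theta^2.
print([binom_pmf(x, 2, theta) for x in range(3)])        # approximately [0.49, 0.42, 0.09]
print(sum(binom_pmf(x, 10, theta) for x in range(11)))   # the probabilities sum to 1
```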
Parameters
The quantity θ introduced above is a parameter. In general a parameter is some unknown
constant. In the binomial case it is the unknown probability of success in (9). Almost all of
Statistics consists of:-
(i) Estimating the value of a parameter.
(ii) Giving some idea of the precision of our estimate of a parameter (sometimes called the
margin of error).
(iii) Testing hypotheses about the value of a parameter.
We shall consider these three activities later in the course. In the coin example, these
would be:-
(i) Estimating the value of the binomial parameter θ.
(ii) Giving some idea of the precision of our estimate of this parameter.
(iii) Testing hypotheses about the numerical value of this parameter, for example testing the
hypothesis that θ = 1/2.
The mean of a discrete random variable
The mean of a random variable is often confused with the concept of an average, and it is
important to keep a clear distinction between the two concepts. The mean of the discrete random variable X whose probability distribution is given in (5) above is defined as

v_1 P(v_1) + v_2 P(v_2) + · · · + v_k P(v_k).    (10)
In more mathematical notation this is

Σ_{i=1}^{k} v_i P(v_i),    (11)

the summation being over all possible values (v_1, v_2, . . . , v_k) that the random variable X can
take. As an example, the mean of a random variable having the binomial distribution (9) is

Σ_{x=0}^{n} x (n choose x) θ^x (1 − θ)^(n−x),    (12)

and this can be shown, after some algebra, to be nθ.
As a second example, consider the (random) number (which we denote by X) to turn up when a die is rolled. The possible values of X are 1, 2, 3, 4, 5 and 6. If the die is fair, each of these values has probability 1/6. Application of equation (10) shows that the mean of X is

1 × (1/6) + 2 × (1/6) + 3 × (1/6) + 4 × (1/6) + 5 × (1/6) + 6 × (1/6) = 3.5.    (13)
Suppose on the other hand that the die is unfair, and that the probability distribution of
the (random) number X to turn up is:-
Possible values of X 1 2 3 4 5 6
Respective probabilities 0.15 0.25 0.10 0.15 0.30 0.05
(14)
In this case the mean of X is
1 × 0.15 + 2 × 0.25 + 3 × 0.10 + 4 × 0.15 + 5 × 0.30 + 6 × 0.05 = 3.35.    (15)
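Definition (10) is easy to evaluate directly. The sketch below (ours, not from the notes) represents a probability distribution as a Python dictionary mapping each possible value to its probability:

```python
def mean(dist):
    """Equation (10): the mean of a discrete random variable given as {value: probability}."""
    return sum(v * p for v, p in dist.items())

fair_die   = {v: 1/6 for v in range(1, 7)}
unfair_die = {1: 0.15, 2: 0.25, 3: 0.10, 4: 0.15, 5: 0.30, 6: 0.05}

print(mean(fair_die))     # 3.5, as in (13) (up to floating-point rounding)
print(mean(unfair_die))   # 3.35, as in (15) (up to floating-point rounding)
```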
There are several important points to note about the mean of a discrete random variable:-
(i) The notation μ is often used for a mean. In many practical situations the mean μ of a discrete random variable X is unknown to us, because we do not know the numerical values of the probabilities P(x). That is to say, μ is a parameter, and this is why we use Greek notation for it. As an example, if in the binomial distribution case we do not know the value of the parameter θ, then we do not know the value (= nθ) of the mean of that distribution.
(ii) The mean of a probability distribution is its "center of gravity", that is, its "knife-edge balance point".
(iii) Testing hypotheses about the value of a mean is perhaps the most important of statistical operations. An important example of tests of hypotheses about means is a t test. Different t tests will be discussed in this course.
(iv) The word "average" is not an alternative for the word "mean", and has a quite different interpretation from that of "mean". This distinction will be discussed often in class.
The variance of a discrete random variable
A quantity of importance equal to that of the mean of a random variable is its variance. The variance (denoted by σ²) of the discrete random variable X whose probability distribution is given in (5) above is defined by

σ² = (v_1 − μ)² P(v_1) + (v_2 − μ)² P(v_2) + . . . + (v_k − μ)² P(v_k).    (16)
In more mathematical terms we write this as

σ² = Σ_{i=1}^{k} (v_i − μ)² P(v_i),    (17)

the summation being taken over all possible values of the random variable X.
In the case of a fair die, we have already calculated (in (13)) the mean of X, the (random) number to turn up on a roll of the die, to be 3.5. Application of (16) shows that the variance of X is

σ² = (1 − 3.5)² × (1/6) + (2 − 3.5)² × (1/6) + (3 − 3.5)² × (1/6) + (4 − 3.5)² × (1/6) + (5 − 3.5)² × (1/6) + (6 − 3.5)² × (1/6) = 35/12.    (18)
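The same dictionary representation used for the mean handles definition (16); this check (ours, not from the notes) reproduces 35/12:

```python
def mean(dist):
    return sum(v * p for v, p in dist.items())

def variance(dist):
    """Equation (16): the variance of a discrete random variable given as {value: probability}."""
    mu = mean(dist)
    return sum((v - mu) ** 2 * p for v, p in dist.items())

fair_die = {v: 1/6 for v in range(1, 7)}
print(variance(fair_die))   # 35/12, i.e. about 2.9167, as in (18)
```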
There are several important points to note about the variance of a discrete random variable:-
(i) The variance has the standard notation σ², anticipated above.
(ii) The variance is a measure of the dispersion of the probability distribution of the random variable around its mean. Thus a random variable with a small variance is likely to be close to its mean (see Figure 1).
[Figure 1: a distribution with smaller variance and a distribution with larger variance.]
(iii) A quantity that is often more useful than the variance of a probability distribution is the standard deviation. This is defined as the positive square root of the variance, and (naturally enough) is denoted by σ.
(iv) The variance, like the mean, is often unknown to us. This is why we denote it by a Greek letter.
(v) The variance of the number of successes in the binomial distribution (9) can be shown, after some algebra, to be nθ(1 − θ).
Many Random Variables
Introduction
Almost every application of statistical methods in psychology, sociology, biology and similar
areas requires the analysis of many observations. For example, if a psychologist wanted
to assess the effects of sleep deprivation on the time needed to answer the questions in a
questionnaire, he/she would want to test a fairly large number of people in order to get
reasonably reliable results. Before this experiment is performed, the various times that the
people in the experiment will need to answer the questions are all random variables.
In line with the approach in this course, ideas about many observations will often be
discussed in the simple case of rolling a die (fair or unfair) many times. Here the observations
are the numbers that turn up on the various rolls of the die. If we wish to test whether this
die is fair, we would plan to roll it many times, and thus plan to get many observations,
before making our assessment. As with the sleep deprivation example, before we actually
roll the die the numbers that will turn up on the various rolls are all random variables. To
assess the implications of the numbers which do, later, turn up when we get around to rolling
the die, and of the times needed in the sleep deprivation example, we have to consider the
probability theory for many random variables.
Notation
Since we are now considering many random variables, the notation X for one single random variable is no longer sufficient for us. We denote the first random variable by X_1, the second by X_2, and so on. Suppose that in the die example we denote the planned number of rolls by n. We would then denote the (random) number that will turn up on the first roll of the die by X_1, the (random) number that will turn up on the second roll of the die by X_2, . . . , the (random) number that will turn up on the n-th roll of the die by X_n.
As with a single random variable (see notes, page 8), we need a separate notation for the actual observed numbers that did turn up once the die was rolled (n times). We denote these by x_1, x_2, . . . , x_n. To assess (for example) whether we can reasonably assume that the die is fair we would use these numbers, but also we would have to use the theory of the n random variables X_1, X_2, . . . , X_n.
As in the case of one random variable, a statement in the die example like "X_6 = 4" makes no sense. It does not compute. On the other hand, once the die has been rolled, the statement "x_6 = 4" does make sense. It means that a 4 turned up on the sixth roll of the die. In the sleep example, a statement like "x_6 = 23.7" also makes sense. It means that the time that the sixth person in the experiment took to complete the questionnaire was 23.7 minutes. By contrast, before the experiment was conducted, the time X_6 that the sixth person will take to complete the questionnaire is unknown. It is a random variable. Thus a statement like "X_6 = 23.7" does not make sense.
Independently and identically distributed random variables
The die example introduces two important concepts. We would reasonably assume that X_1, X_2, . . . , X_n all have the same probability distribution, since it is the same die that is being rolled each time. For example, we would assume that the probability that a three turns up on roll 77 (whatever it might be) is the same as the probability that a three turns up on roll 144. Further, we would also reasonably assume that the various random variables X_1, X_2, . . . , X_n are all independent of each other. That is, we would reasonably assume that the value of any one of these would not affect the value of any other one. Whatever number turned up on roll 77 has no influence on the number turning up on roll 144.
Random variables which are independent of each other, and which all have the same
probability distribution, are said to be iid (independently and identically distributed). This
concept is discussed again below.
The assumptions that the various random variables X_1, X_2, . . . , X_n are all independent
of each other, and that they all have the same probability distribution, are often made in
the application of statistical methods. However, in areas such as psychology, sociology and
biology that are more scientifically important and complex than rolling a die, the assumption
of identically and independently distributed random variables might not be reasonable. Thus
if twin sisters were used in the sleep deprivation example, the times that they take to complete
the questionnaire might not be independent, since we might expect them to be quite similar
because of the common environment and genetic make-up of the twins. If the people in
the experiment were not all of the same age it might not be reasonable to assume that the
times needed are identically distributed - people of different ages might perhaps be expected to tend to need different amounts of time. Thus in practice care must often be exercised
and common sense used when applying the theory of iid random variables in areas such as
psychology, sociology and biology.
The mean and variance of a sum and of an average
Given n random variables X_1, X_2, . . . , X_n, two very important derived random variables are their sum, denoted by T_n, and defined as

T_n = X_1 + X_2 + · · · + X_n,    (19)

and their average, denoted by X̄, and defined by

X̄ = (X_1 + X_2 + · · · + X_n)/n = T_n/n.    (20)
Since both T_n and X̄ are functions of the random variables X_1, X_2, . . . , X_n they are themselves random variables. In the die example we do not know, before we roll the die, what the sum or the average of the n numbers that will turn up will be.
Both the sum and the average, being random variables, each have a probability distribution, and thus each has a mean and a variance. These must be related in some way to the mean and the variance of each of X_1, X_2, . . . , X_n. The general theory of many random variables shows that if X_1, X_2, . . . , X_n are iid, with (common) mean μ and (common) variance σ², then the mean and the variance of the random variable T_n are, respectively,

mean of T_n = nμ,    variance of T_n = nσ²,    (21)

and the mean and the variance of the random variable X̄ are given respectively by

mean of X̄ = μ,    variance of X̄ = σ²/n.    (22)
In STAT 111 we will call these four formulas "the four magic formulas" and will refer to them often. Thus you have to know them by heart.
Equations (21) and (22) apply of course in the particular case n = 2. However in this case two further equations are important. If we define the random variable D by D = X_1 − X_2 (think of D as standing for "difference") then

mean of D = 0,    variance of D = 2σ².    (23)

These are also "magic formulas" and we will refer to them several times, especially when making comparison studies. Thus you also have to know these two formulas by heart.
Two generalizations
More generally, suppose that X_1, X_2, . . . , X_n are independent random variables with respective means μ_1, μ_2, . . . , μ_n and respective variances σ_1², σ_2², . . . , σ_n². Then

mean of T_n = μ_1 + μ_2 + · · · + μ_n,    variance of T_n = σ_1² + σ_2² + · · · + σ_n²,    (24)

and

mean of X̄ = (μ_1 + μ_2 + · · · + μ_n)/n,    variance of X̄ = (σ_1² + σ_2² + · · · + σ_n²)/n².    (25)

The formulas in (21) and (22) are, respectively, special cases of these formulas.
Next, the generalization of the formulas in (23) is that D_ij, defined by D_ij = X_i − X_j, is a random variable and that

mean of D_ij = μ_i − μ_j,    variance of D_ij = σ_i² + σ_j².    (26)

The formulas in (23) are, respectively, special cases of these formulas.
An example of the use of equations (22)
In the case of a fair die, each X_i has mean 3.5 and variance 35/12, as given by (13) and (18), and thus standard deviation √(35/12), or about 1.708. On the other hand if n, the number of rolls, is 1,000, the variance of X̄ = (X_1 + X_2 + · · · + X_1,000)/1,000 is, from the second equation in (22), 35/12,000. Therefore the standard deviation of X̄ is √(35/12,000), or about 0.0540. This small standard deviation implies that once we roll the die 1,000 times, it is very likely that the observed average of the numbers that actually turned up will be very close to 3.5. This is no more than what intuition suggests. We will later do a JMP experiment to confirm this.
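The kind of experiment referred to here can also be sketched in a few lines of Python (this is our illustration, not part of the notes): simulate many groups of 1,000 rolls and look at the spread of the observed averages.

```python
import random
from statistics import mean, stdev

random.seed(111)
n_rolls, n_repeats = 1000, 2000

# For each repeat, roll a fair die 1,000 times and record the average of the numbers seen.
averages = [mean(random.randint(1, 6) for _ in range(n_rolls)) for _ in range(n_repeats)]

print(mean(averages))    # close to 3.5, the mean of the average given by (22)
print(stdev(averages))   # close to sqrt(35/12,000), i.e. about 0.0540
print(sum(3.392 < a < 3.608 for a in averages) / n_repeats)   # close to 0.95 (see below)
```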
Later we will see that if the die is fair, the probability that the observed average x̄ is between the mean of X̄ minus two standard deviations of X̄ (that is, 3.5 − 2 × 0.0540 = 3.392) and the mean of X̄ plus two standard deviations of X̄ (that is, 3.5 + 2 × 0.0540 = 3.608) is about 95%. This statement is one of probability theory. It is an implication, or deduction.
So here is a window into Statistics. Suppose that we have now rolled the die 1,000 times, and the observed average x̄ is 3.382. This is outside the range 3.392 to 3.608, the range within which the average is about 95% likely to lie if the die is fair. Then we have good evidence that the die is not fair. This claim is an act of Statistics. It is an inference, or induction.
We will later make many statistical inferences, all of which will be based on the relevant
corresponding probability theory calculation.
The proportion of successes in n binomial trials
The random variable in the binomial distribution is the number of successes in n binomial
trials, with probability distribution given in (9). In some applications it is necessary to
consider instead the proportion of successes in these trials (more exactly, the proportion of
trials leading to success). If X is the number of successes in n binomial trials, then this
proportion is X/n, which we will denote by P.
P is a discrete random variable, and its possible values are 0, 1/n, 2/n, . . . , (n − 1)/n, 1. It has a probability distribution which can be found from the binomial distribution (9), since the probability that P = i/n is the same as the probability that X = i for any value of i. Because P is a random variable it has a mean and variance. These are

mean of P = θ,    variance of P = θ(1 − θ)/n.    (27)
These equations bear a similarity to the formulas for the mean and variance of an average
given in (22).
We will see later, when testing for the equality of two binomial parameters, why it is often
necessary to operate via the proportion of trials giving success rather than by the number
of trials giving success.
The standard deviation and the standard error
In the die example in the previous section the standard deviation of X̄ = (X_1 + X_2 + · · · + X_1,000)/1,000 is about 0.0540. The standard deviation of an average such as this is sometimes called "the standard error of the mean". (This terminology is unfortunate and causes much confusion - it should be "the standard deviation of the average". How can a mean have a standard deviation? A mean is a parameter, and only random variables can have a standard deviation.) Many textbooks use this unfortunate terminology. Watch out for it.
Means and averages
It is crucial to remember that a mean and an average are two entirely different things. (The textbook, and other textbooks, are sometimes not too good on making this distinction.) A mean is a parameter, that is, some constant number whose value is often unknown to us. For example, with an unfair die for which the probabilities for the number to turn up on any roll are unknown, the mean of the number to turn up is unknown. It is a parameter which we might wish to estimate or test hypotheses about. We will always denote a mean by the Greek letter μ.
By contrast, an average as defined above (i.e. X̄) is a random variable. It has a probability distribution and thus has a mean and a variance. Thus it makes sense to say (as (22) and (25) say) "the mean of the average is such and such".
There is also a second concept of an average, and this was already referred to in the die-rolling example above. This is the actual average x̄ of the numbers that actually turned up once the 1,000 rolls were completed. This is a number, for example 3.382. You can think of this as the realized value of the random variable X̄ once the rolling had taken place.
Thus there are three related concepts: first a mean (a parameter), second a "before the experiment" average X̄ (a random variable, and a concept of probability theory), and third an "after the experiment" average x̄ (a calculated number, and a concept of Statistics). They are all important and must not be confused with each other.
Why do we need all three concepts? Suppose that we wish to estimate a mean (first concept), or to test some hypothesis about a mean (for example that it is 3.5). We would do this by using the third concept, the "after the experiment" observed average x̄. How good x̄ is as an estimate of the mean, or what hypothesis testing conclusion we might draw given the observed value of x̄, both depend on the properties of the random variable X̄ (the second concept), in particular its mean and variance.
Continuous Random Variables
Definition
Some random variables by their very nature are discrete, such as the number of heads in 2,000 flips of a coin. Other random variables, by contrast, are continuous. Continuous random variables can take any value in some continuous range of values. Measurements such as height and blood pressure are of this type. Here we denote the range of a continuous random variable by (L, H), (L = lowest possible value, H = highest possible value of the continuous random variable), and use this notation throughout.
Probabilities for continuous random variables are not allocated to specific values, but rather are allocated to ranges of values. The probability that a continuous random variable takes some specified numerical value is zero.
We use the same notation for continuous random variables as we do for discrete random
variables, so that we denote a continuous random variable in upper case, for example by X.
Every continuous random variable X has an associated density function f(x). The density
function f(x) is the continuous random variable analogue of a discrete random variable
probability distribution such as (9). This density function can be drawn as a curve in the
(x, f(x)) plane. (Examples will be given in class.) The probability that the random variable
X takes a value in some given range a to b is the area under this curve between a and b.
From a calculus point of view (for those who have a good calculus background) this
probability is obtained by integrating this density function over the range a to b. For example,
the probability that the (continuous) random variable X having density function f(x) takes
a value between a and b (with a < b) is given by

P(a < X < b) = ∫_a^b f(x) dx.    (28)
Because the probability that a continuous random variable takes some specified numerical value is zero, the three probabilities P(a ≤ X < b), P(a < X ≤ b), and P(a ≤ X ≤ b) are also given by the right-hand side in (28).
As a particular case of equation (28),

∫_L^H f(x) dx = 1.    (29)
This equation simply states that a random variable must take some value in its range of
possible values.
For those who do not have a calculus background, don't worry - we will never do any of these integration procedures.
The mean and variance of a continuous random variable
The mean μ and variance σ² of a continuous random variable X having range (L, H) and density function f(x) are defined respectively by

μ = ∫_L^H x f(x) dx    (30)

and

σ² = ∫_L^H (x − μ)² f(x) dx.    (31)
Again, if you do not have a calculus background, don't worry about it. We will never do any of these integration procedures. The main thing to remember is that these definitions are the natural analogues of the corresponding definitions for a discrete random variable, that is, that the mean is the "center of gravity", or the "knife-edge balance point", of the density function f(x), and the variance is a measure of the dispersion, or "spread-out-ness", of the density function around the mean. (Examples will be given in class.)
Also, the remarks about the mean and the variance of a continuous random variable are very similar to those for a discrete random variable given above. In particular we denote a mean by μ and a variance by σ². In a research context the mean μ and the variance σ² of the random variable of interest are often unknown to us. That is, they are parameters, as is indicated by the Greek notation that we use for them. Many statistical procedures involve estimating, and testing hypotheses about, the mean and the variance of continuous random variables.
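Although the course never asks you to do these integrations by hand, equations (28)-(31) can be explored numerically. The sketch below is ours, not from the notes: it uses a toy density of our own choosing, f(x) = 2x on the range (0, 1), and approximates each integral by a midpoint Riemann sum.

```python
def f(x):
    return 2 * x   # a toy density on the range (L, H) = (0, 1)

def integrate(g, a, b, steps=100_000):
    """Approximate the integral of g from a to b by a midpoint Riemann sum."""
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

L, H = 0.0, 1.0
print(integrate(f, L, H))                # about 1: equation (29)
print(integrate(f, 0.2, 0.6))            # P(0.2 < X < 0.6): about 0.32, equation (28)
mu = integrate(lambda x: x * f(x), L, H)               # mean, equation (30): about 2/3
var = integrate(lambda x: (x - mu) ** 2 * f(x), L, H)  # variance, equation (31): about 1/18
print(mu, var)
```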
The normal distribution
There are many continuous probability distributions relevant to statistical operations. We
discuss the most important one in this section, namely the normal, or Gaussian, distribution.
The (continuous) random variable X has a normal, or Gaussian, distribution if its range (i.e. set of possible values) is (−∞, +∞) and its density function f(x) is given by

f(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)).    (32)

The shape of this density function is the famous (infamous?) "bell-shaped curve". (Here π is the well-known geometrical value of about 3.1416 and e is the equally important exponential constant of about 2.7183.)
It can be shown that the mean of this normal distribution is μ and its variance is σ², and these parameters are built into the functional form of the distribution, as (32) shows. A random variable having this distribution is said to be an N(μ, σ²) random variable.
As stated above, the probability that a continuous random variable takes a value between a and b is found by a calculus operation, which gives the area under the density function of the random variable between a and b. Thus for example the probability that a random variable having a normal distribution with mean 6 and variance 16 takes a value between 5 and 8 is

∫_5^8 (1/(4√(2π))) e^(−(x−6)²/32) dx.    (33)
Amazingly, the processes of mathematics do not allow us to evaluate the integral in (33) exactly: it is just too hard. (This indicates an interesting limit as to what mathematics can do.) So how would we find the probability given in (33)? It has to be done using a chart. So we now have to discuss the normal distribution chart.
There is a whole family of normal distributions, each member of the family corresponding to some pair of (μ, σ²) values. (The case μ = 6 and σ² = 16 just considered is an example of one member of this family.) However, probability charts are available only for one particular member of this family, namely the normal distribution for which μ = 0 and σ² = 1. This is sometimes called the standardized normal distribution, for reasons which will appear shortly. (The normal distribution chart that you will be given refers to this specific member of the normal distribution family.)
The way that the chart works is best described by a few examples. (We will also do some examples in class.) The chart gives "less than" probabilities for a variety of positive numbers, generically denoted by z. Thus the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value less than 0.5 is 0.6915. The probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value less than 1.73 is 0.9582. Note that the chart only goes up to the z value 3.09; for any z greater than this, the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value less than z is taken, to a good enough approximation, to be 1.
We usually have to consider more complicated examples than this. For example, the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between 0.5 and 1.73 is 0.9582 − 0.6915 = 0.2667. The probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between 1.23 and 2.46 is 0.9931 − 0.8907 = 0.1024.
As a different form of calculation, we often have to find "greater than" probabilities. For example, the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value exceeding 1.44 is 1 minus the probability that it takes a value less than 1.44, namely 1 − 0.9251 = 0.0749.
Even more complicated calculations arise when negative numbers are involved. Here we have to use the symmetry of the normal distribution around the value 0. For example, the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between −1.22 and 0 is the same as the probability that it takes a value between 0 and +1.22, and this is 0.8888 − 0.5 = 0.3888. The probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value less than −0.87 is the same as the probability that it takes a value greater than +0.87, and this is 1 − 0.8078 = 0.1922.
Finally, perhaps the most complicated calculation concerns the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between some given negative number and some given positive number. Suppose for example that we want to find the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between −1.28 and +0.44. This is the probability that it takes a value between −1.28 and 0 plus the probability that it takes a value between 0 and +0.44. This in turn is the probability that it takes a value between 0 and +1.28 plus the probability that it takes a value between 0 and +0.44. This is (0.8997 − 0.5000) + (0.6700 − 0.5000) = 0.5697.
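All of these chart look-ups can be reproduced (to more decimal places) with Python's standard library; this sketch is ours, not part of the notes:

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)   # the standardized normal distribution

print(Z.cdf(0.5))                   # about 0.6915: the "less than 0.5" chart entry
print(Z.cdf(1.73) - Z.cdf(0.5))     # about 0.2667: P(0.5 < Z < 1.73)
print(1 - Z.cdf(1.44))              # about 0.0749: P(Z > 1.44)
print(Z.cdf(0) - Z.cdf(-1.22))      # about 0.3888: P(-1.22 < Z < 0)
print(Z.cdf(0.44) - Z.cdf(-1.28))   # about 0.5697: P(-1.28 < Z < 0.44)
```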
Why is there a probability chart only for this one particular member of the normal
distribution family? Suppose that a random variable X has the normal distribution (32),
that is, with arbitrary mean μ and arbitrary variance σ². Then the standardized random variable Z, defined by Z = (X − μ)/σ, has a normal distribution with mean 0, variance 1 (trust me on this). This standardization procedure can be used to find probabilities for a random variable having any normal distribution.
For example, if X is a random variable having a normal distribution with mean 6 and variance 16 (and thus standard deviation 4), P(7 < X < 10), that is the probability of the event 7 < X < 10, can be found by standardizing and creating a Z statistic:

P(7 < X < 10) = P((7 − 6)/4 < (X − 6)/4 < (10 − 6)/4) = P(0.25 < Z < 1),    (34)

and this probability is found from the standardized normal distribution chart (or from computer packages) to be 0.8413 − 0.5987 = 0.2426.
As a slightly more complicated example, the probability that the random variable X of the previous paragraph takes a value in the event 4 < X < 11 can be found by standardizing and creating a Z statistic:

P(4 < X < 11) = P((4 − 6)/4 < (X − 6)/4 < (11 − 6)/4) = P(−0.5 < Z < 1.25),    (35)

and this probability is found from the kind of manipulations discussed above to be (0.6915 − 0.5000) + (0.8944 − 0.5000) = 0.5859.
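The standardization used in (34) and (35) can be wrapped in a small helper (ours, not from the notes):

```python
from statistics import NormalDist

Z = NormalDist()   # mean 0, variance 1

def prob_between(a, b, mu, sigma):
    """P(a < X < b) for X ~ N(mu, sigma^2), via the standardization Z = (X - mu)/sigma."""
    return Z.cdf((b - mu) / sigma) - Z.cdf((a - mu) / sigma)

# X has mean 6 and variance 16, so standard deviation 4.
print(prob_between(7, 10, 6, 4))   # about 0.2426, as in (34)
print(prob_between(4, 11, 6, 4))   # about 0.5858, agreeing with (35) up to chart rounding
```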
Two useful properties of the normal distribution, often used in conjunction with this standardization procedure, are that if the random variable Z has a normal distribution with mean 0 and variance 1, then

P(Z > +1.645) = 0.05    (36)

and

P(−1.96 < Z < +1.96) = 0.95,    (37)

or equivalently

P(Z < −1.96) + P(Z > +1.96) = 0.05.    (38)
The standardized quantity Z, defined as Z = (X − μ)/σ, where X is a random variable with mean μ and standard deviation σ, will be referred to often below, and the symbol Z is reserved, in these notes and in Statistics generally, for this standardized quantity. One of the applications of the normal distribution is to provide approximations for probabilities for various random variables, almost always using the standardized quantity Z.
One frequently-used approximation derives from equation (37) by approximating the value 1.96 by 2. This is

P(−2 < Z < +2) ≈ 0.95.    (39)

Remembering that Z = (X − μ)/σ, this equation implies that if X is a random variable having a normal distribution with mean μ and variance σ², then

P(μ − 2σ < X < μ + 2σ) ≈ 0.95.    (40)

A similar calculation, using the normal distribution chart, shows that

P(μ − 2.575σ < X < μ + 2.575σ) ≈ 0.99.    (41)

Applications of (40) often arise from the Central Limit Theorem, discussed immediately below.
The Central Limit Theorem
An important property of an average and of a sum of several random variables derives from the so-called Central Limit Theorem. This states that if the random variables X_1, X_2, . . . , X_n are independently and identically distributed, then no matter what the probability distribution of these random variables might be, the average X̄ = (X_1 + X_2 + · · · + X_n)/n and the sum X_1 + X_2 + · · · + X_n both have approximately a normal distribution. This approximation becomes more accurate the larger n is, and is usually very good for values of n greater than about 50. Since many statistical procedures deal with sums or averages, the Central Limit Theorem ensures that we often deal with the normal distribution in these procedures. Also we often use the formulas (21) and (22) for the mean and variance of a sum and of an average, and the approximation (40), when doing this.
We have already seen an example of this (in the Section "An example of the use of equations (22)"). The average X̄ of the numbers to turn up on 1,000 rolls of a fair die is a random variable with mean 3.5 and variance 35/12,000, and thus standard deviation √(35/12,000) ≈ 0.0540. The Central Limit Theorem states that to a very close approximation this average has a normal distribution with this mean and this variance. Then application of (40) shows that to a very close approximation,

P(3.5 − 2 × 0.0540 < X̄ < 3.5 + 2 × 0.0540) ≈ 0.95,    (42)

and this led to the (probability theory) statement given in the Section "An example of the use of equations (22)" that the probability that X̄ takes a value between 3.392 and 3.608 is about 95%. We also saw how this statement gives us a window into Statistics.
The Central Limit Theorem also applies to the binomial distribution. Suppose that X has a binomial distribution with index n (the number of trials) and parameter θ (the probability of success on each trial), and thus mean nθ and variance nθ(1 − θ). In the binomial context the Central Limit Theorem states that X has, to a very close approximation, a normal distribution with this mean and this variance. It similarly states that the proportion P of successes has, to a very close approximation, a normal distribution with mean θ and variance θ(1 − θ)/n.
Here is an application of this result. Suppose that it is equally likely that a newborn will be a boy as a girl. If this is true, the number of boys in a sample of 2,500 newborns has approximately a normal distribution with mean 1,250 and variance (from note (v) about variances) of (1/2) × (1/2) × 2,500 = 625, and hence standard deviation √625 = 25. Then (41) shows that the probability is about 0.99 that the number of boys in this sample will be between

1,250 − 2.575 × 25   and   1,250 + 2.575 × 25,

that is, about between 1,185 and 1,315. We saw these numbers in Homework 1.
Here is the corresponding window into Statistics. IF a newborn is equally likely to be
a boy as a girl, then the probability is about 99% that in a sample of 2,500 newborns, the
number of boys that we see will be between 1185 and 1315. (This is a probability theory
deduction, or implication. It is a "zig". It is made under the assumption that a newborn is
equally likely to be a boy as a girl.) However, when we actually took this sample we saw 1,334
boys. We therefore have good evidence that it is NOT equally likely for a newborn to be a
boy as a girl. (This is an induction, or inference. It is a "zag". It is a statement of Statistics.
It cannot be made without the corresponding probability theory "zig" calculation.)
The above example illustrates how we increase our knowledge in a context involving
randomness (here the randomness induced by the sampling process) by a probability the-
ory/Statistics zig-zag process. (In fact it is now known that it is NOT equally likely for a
newborn to be a boy as a girl.)
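As a small arithmetic check on this example, here is a short sketch (again in Python, for illustration
only) that computes the approximate 99% interval for the number of boys under the null assumption and
notes that the observed count of 1,334 falls outside it.

    import math

    n = 2_500           # sample size
    p_null = 0.5        # hypothesised probability that a newborn is a boy

    mean = n * p_null                              # 1,250
    sd = math.sqrt(n * p_null * (1 - p_null))      # 25

    lower = mean - 2.575 * sd
    upper = mean + 2.575 * sd
    print("Approximate 99% interval:", round(lower), "to", round(upper))

    observed = 1_334
    print("Observed count inside the interval?", lower < observed < upper)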
Statistics
Introduction
So far in these notes we have been contemplating the situation before some experiment is
carried out, so that we have been discussing random variables and their properties. We now
do our experiment. As indicated above, if before the experiment we had been considering
several random variables X₁, X₂, . . . , Xₙ, we denote the actually observed values of these
random variables, once the experiment has been carried out, by x₁, x₂, . . . , xₙ. These observed
values are our data. As an example, if an experiment consisted of the rolling of a die n = 3
times, and after the experiment we observe that a 5 turned up on the first roll and a 3
on both the second and third rolls, we would say that x₁ = 5, x₂ = 3, x₃ = 3. These are
our data values. It does not make sense to say, before the experiment has been done, that
X₁ = 5, X₂ = 3, X₃ = 3. Such a statement "does not compute".
The three main activities of Statistics are the estimation of the numerical values of a
parameter or parameters, assessing the accuracy of these estimates, and testing hypotheses
about the numerical values of parameters. We now consider each of these in turn.
Estimation (of a parameter)
Comments on the die-rolling example
In much of the discussion in these notes (and the course) so far the values of the various
parameters entering the probability distributions considered were taken as being known. A
good example was the fair die simulation: we knew in advance that the die is fair, so we
knew in advance the values of the mean (3.5) and the variance (35/12) of the number to
turn up on any roll of the die.
However, in practice these parameters are usually unknown, and must be estimated from
data. This means that our JMP die-rolling experiment is very atypical. The reason is
that in this JMP experiment we know that the die is fair, so that we know, for example, that the
mean of the (random variable) average of the numbers turning up after (say) 1,000 rolls
of the die is 3.5. The real-life situation, especially in research, is that we will not know the
relevant mean. For example, we might be interested in the mean blood-sugar reading of
diabetics. To get some idea about what this mean might be we would take a sample of (say)
1,000 diabetics, measure the blood-sugar reading for each of these 1,000 people and use the
average of these to estimate this (unknown) mean. This is a natural and (as we see later)
correct thing to do.
So think of the JMP die-rolling example as a proof of principle: because we know that
the die is fair, we know in advance that the mean of the (random variable) average of the
numbers turning up after (say) 1,000 rolls of the die is 3.5. We also know that it has
a small variance (35/12,000). This value (3.5) of the mean and this small variance imply
that, once we have rolled the die 1,000 times, our actually observed average should be very
close to the mean of 3.5. And this is what we saw happen. This suggests that in a real-life
example, where we do not know the numerical value of a mean, using an observed
average should give us a pretty good idea of what the mean is. Later we will refine this idea.
General principles
In this section we consider general aspects of estimation procedures. Much of the theory
concerning estimation of parameters is the same for both discrete and continuous random
variables, so in this section we use the notation X for both.
Let X₁, X₂, . . . , Xₙ be n independently and identically distributed (iid) random variables, each having
a probability distribution P(x; θ) (for discrete random variables) or density function f(x; θ)
(for continuous random variables), depending in both cases (as the notation implies) on
some unknown parameter θ. We have now done our experiment, so that we now have the
corresponding data values x₁, x₂, . . . , xₙ. How can we use these data values to estimate the
parameter θ? (Note that we are estimating the parameter θ, not calculating it. Even after
we have the data values we still will not know what the numerical value of θ is. But at least
if we use good estimation procedures we should have a reasonable approximate idea of its
value.)
Before discussing particular cases we have to consider general principles of estimation.
An estimator of the parameter θ is some function of the random variables X₁, X₂, . . . , Xₙ,
and thus may be written θ̂(X₁, X₂, . . . , Xₙ), a notation that emphasizes that this estimator
is itself a random variable. For convenience we generally use the shorthand notation θ̂. (We
pronounce θ̂ as "theta-hat".) The quantity θ̂(x₁, x₂, . . . , xₙ), calculated from the observed data
values x₁, x₂, . . . , xₙ of X₁, X₂, . . . , Xₙ, is called the estimate of θ. The "hat" terminology is
a signal to us that we are talking about either an estimator or an estimate.
Note the two different words "estimate" and "estimator". The estimate of θ is calculated from
our data, and will then be just some number. How good this estimate is depends on the
properties of the (random variable) estimator θ̂, in particular its mean and its variance.
Various desirable criteria have been proposed for an estimator to satisfy, and we now
discuss three of these.
First, a desirable property of an estimator is that it be unbiased. An estimator θ̂ is said
to be an unbiased estimator of θ if its mean value is equal to θ. If θ̂ is an unbiased estimator
of θ we say that the corresponding estimate θ̂(x₁, x₂, . . . , xₙ), calculated from the observed
data values x₁, x₂, . . . , xₙ, is an unbiased estimate of θ. It is shooting at the right target.
Because of the randomness involved in the generation of our data, it will almost certainly
not exactly hit the target. But at least it is shooting in the right direction.
As an example, think of the average that you got of the number that turned up on your
1,000 rolls of a fair die in the JMP experiment. We will later show that, if you did not
know the mean (3.5) in the die case, you would use your average to estimate it. And almost
certainly your average was not exactly 3.5. But at least it should have been close to 3.5,
since it was shooting at the right target.
Second, if an estimator θ̂ of some parameter θ is unbiased, we would also want the variance
of θ̂ to be small, since if it is, the observed value θ̂(x₁, x₂, . . . , xₙ) calculated from your data,
that is your estimate of θ, should be close to θ. This is why a variance is an important
probability theory concept.
Finally, it would also be desirable if θ̂ has, either exactly or approximately, a normal
distribution, since then well-known properties of this distribution can be used to provide
properties of θ̂. In particular, we often use the two-standard-deviation rule in assessing the
precision of our estimate, and this rule derives from the normal distribution. Fortunately,
several of the estimators we consider are unbiased, have a small variance, and have an
approximately normal distribution.
Estimation of the binomial parameter
The binomial distribution gives the probability distribution of the random variable X, the
number of successes from n binomial trials with the binomial parameter θ (the probability
of success on each trial). This random variable and its probability distribution are purely
probability theory concepts.
The corresponding statistical question is: I have now done my experiment and observed
x successes from n trials. What can I say about θ? In particular, how should I estimate θ?
How precise can I expect my estimate to be?
The classical, and perhaps natural, estimate of θ is p = x/n, the observed proportion of
successes. This is indeed our estimate of θ. What are the properties of this estimate?
These depend on the properties of the random variable P, the (random) proportion of
successes before we do the experiment. We know (from the relevant probability theory
formulas) that the mean of P is θ and that the variance of P is θ(1 − θ)/n. What does this
imply?
First, since we know that the random variable P has a mean of θ, the estimate p of θ,
once we have done our experiment, is an unbiased estimate of θ. It is shooting at the right
target. That is good news.
Just as important, we want to ask: how precise is this estimate? An estimate of a param-
eter without any indication of its precision is not of much value. The precision of p as an
estimate of θ depends on the variance, and thus on the standard deviation, of the random
variable P. We know that the variance of P is θ(1 − θ)/n, so that the standard deviation of
P is √(θ(1 − θ)/n). We now use two facts.
(i) From the Central Limit Theorem as applied to the random variable P, we know that P
has, to a very accurate approximation, a normal distribution (with mean θ and variance
θ(1 − θ)/n).
(ii) Once we know that P has a normal distribution (to a sufficiently good approximation) we
are free to adapt either (40) or (41), which are normal distribution results, to the question of
the precision of p as an estimate of θ. First we have to find out what these equations reduce
to, or imply, in the binomial distribution context. They become
P(θ − 2√(θ(1 − θ)/n) < P < θ + 2√(θ(1 − θ)/n)) ≈ 0.95, (43)
and
P(θ − 2.575√(θ(1 − θ)/n) < P < θ + 2.575√(θ(1 − θ)/n)) ≈ 0.99. (44)
The first inequality implies, in words, something like this:
"Before we do our experiment we can say that the random variable P takes a value within
2√(θ(1 − θ)/n) of θ with probability of about 95%."
From this we can say:
"After we have done our experiment, it is about 95% likely that the observed proportion
p of successes is within 2√(θ(1 − θ)/n) of θ."
We now turn this second statement inside-out (using the "if I am within 10 yards of
you, you are within 10 yards of me" idea), and say:
"It is about 95% likely that, once we have done our experiment, θ is within 2√(θ(1 − θ)/n)
of the observed proportion p of successes."
Writing this somewhat loosely, we can say
P(p − 2√(θ(1 − θ)/n) < θ < p + 2√(θ(1 − θ)/n)) ≈ 0.95. (45)
We still have a problem. Since we do not know the value of θ we do not know the value
of the expression √(θ(1 − θ)/n) occurring twice in (45). However, at least we have an estimate
of θ, namely p. Since (45) is already an approximation, we make a further approximation
and say, again somewhat loosely,
P(p − 2√(p(1 − p)/n) < θ < p + 2√(p(1 − p)/n)) ≈ 0.95. (46)
This leads to the so-called (approximate) 95% confidence interval for θ of
p − 2√(p(1 − p)/n) to p + 2√(p(1 − p)/n). (47)
As an example, suppose that n = 1,000 and p is 0.47. The sort of thing we would say is:
"I estimate the value of θ to be 0.47, and I am (approximately) 95% certain that θ is between
0.47 − 2√(0.47 × 0.53/1,000) (i.e. 0.4384) and 0.47 + 2√(0.47 × 0.53/1,000) (i.e. 0.5016)." In
saying this we have not only indicated our estimate of θ, but we have also given some idea
of the precision, or reliability, of that estimate.
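If you want to reproduce this calculation by computer, here is a minimal sketch (in Python, as an
illustration only), using the values n = 1,000 and p = 0.47 from the example above.

    import math

    n = 1_000
    p = 0.47

    half_width = 2 * math.sqrt(p * (1 - p) / n)   # two-standard-deviation rule, as in (47)
    print("Estimate of theta:", p)
    print("Approximate 95% confidence interval:",
          round(p - half_width, 4), "to", round(p + half_width, 4))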
Notes on the above.
(i) The range of values 0.4384 to 0.5016 in the above example is usually called a 95% con-
fidence interval for θ. The interpretation of this statement is that we are (approximately)
95% certain that θ is within this range of values. Thus the confidence interval gives us an
idea of the precision of the estimate 0.47.
(ii) In research papers, books, etc., you will often see the above result written as
θ = 0.47 ± 0.0316.
(iii) The precision of the estimate 0.47 as indicated by the confidence interval depends on
the variance θ(1 − θ)/n of the random variable P. This is why we have to consider random
variables, their properties and in particular their variances.
(iv) It is a mathematical fact that p(1 − p) can never exceed 1/4. Further, for quite a wide
range of values of p near 1/2, p(1 − p) is quite close to 1/4. So if we approximate p(1 − p)
by 1/4, and remember that √(1/4) = 1/2, we arrive from (47) at a conservative confidence
interval for θ as
p − √(1/n) to p + √(1/n). (48)
This formula is quite easy to remember and you may use it in place of (47).
(v) What was the sample size? Suppose that a TV announcer says, before an election be-
tween two candidates Smith and Jones, that a Gallup poll predicts that 52% of the voters
will vote for Smith, with a margin of error of 3%. The TV announcer has no idea where
that 3% (= 0.03) came from, but in effect it came from the (approximate) 95% confidence
interval (47) or (more likely) from (48). So we can work out, from (48), how many individu-
als were in the sample that led to the estimate 52%, or 0.52. All we have to do is to equate
√(1/n) with 0.03. We find n = 1,111. (Probably their sample size was 1,000, and with this
value the margin of error is √(1/1,000) = 0.0316. They just approximated this by 0.03.)
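The arithmetic in note (v) is easy to check; here is a two-line sketch (Python, illustration only).

    import math

    margin_of_error = 0.03
    print("Implied sample size:", round(1 / margin_of_error ** 2))          # about 1,111
    print("Margin of error for n = 1,000:", round(math.sqrt(1 / 1_000), 4)) # 0.0316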
(vi) All of the above relates to an (approximate) 95% confidence interval for θ. If you want to
be more conservative, and have a 99% confidence interval, you can start with the inequalities
in (44) which, compared to (43) (which led to our 95% confidence interval), replace the 2
in (43) by 2.575. Carrying through the same sort of argument that led to (47) and (48), we
would arrive at an (approximate) 99% confidence interval for θ of
p − 2.575√(p(1 − p)/n) to p + 2.575√(p(1 − p)/n) (49)
in place of (47), or
p − 1.2875√(1/n) to p + 1.2875√(1/n) (50)
in place of (48).
Example. This example is from the field of medical research. Suppose that someone proposes
an entirely new medicine for curing some illness. Beforehand we know nothing about the
properties of this medicine, and in particular we do not know the probability θ that it will
cure someone of the illness involved. Thus θ is an (unknown) parameter. We want to carry out
a clinical trial to estimate θ. Suppose now that we have given the new medicine to 10,000
people with the illness and of these, 8,716 were cured. Then we estimate θ to be 0.8716.
Since we want to be very precise in a medical context we might prefer to use the 99% con-
fidence interval (50) instead of the 95% confidence interval (48). Since √(1/10,000) is 0.01,
we would say: "I estimate the probability of a cure to be 0.8716, and I am about 99% certain
that the probability of a cure with this proposed medicine is between 0.8716 − 0.012875 (=
0.8587) and 0.8716 + 0.012875 (= 0.8845)."
(vii) Notice that the length of both confidence intervals (50) and (48) is proportional to
1/√n. This means that if you want to be twice as accurate you need four times the sample
size, that if you want to be three times as accurate you need nine times the sample size, and
so on. This is why your medicines are so expensive: the FDA requires considerable accuracy
before a medicine can be put on the market, and this often implies that a very large sample
size is needed to meet this required level of accuracy.
(viii) Often in research publications the result of an estimation procedure is written as
something like: estimate ± some measure of precision of the estimate. Thus the result in
the medical example above might be written as something like: θ = 0.8716 ± 0.012875.
This can be misleading because, for example, it is not indicated if this is a 95% or a 99%
confidence interval. Also, it is not the best way to present the conclusion.
(ix) The width of the confidence interval, and hence the precision of the estimate, ultimately
depends on the variance of the random variable P. This is why we have to discuss (a) random
variables and (b) variances of random variables.
Estimation of a mean (μ)
Suppose that we wish to estimate the mean blood sugar level of diabetics. We take a random
sample of n diabetics and measure their blood sugar levels, getting data values x₁, x₂, . . . , xₙ.
It is natural to estimate the mean μ by the average x̄ of these observed values. What are
the properties of this estimate? To answer these questions we have to zig-zag backwards and
forwards between probability theory and Statistics.
We start with probability theory and think of the situation before we got our
data. We think of the data values x₁, x₂, . . . , xₙ as the observed values of n iid random
variables X₁, X₂, . . . , Xₙ, all having some continuous probability density function with (un-
known to us) mean μ and (unknown to us) variance σ². Our aim then is to estimate μ from
the data and to assess the precision of our estimate. The form of the density function of
each X is unknown to us. We can however conceptualize about this distribution graphically:-
(figure: a sketch of a typical continuous density function)
There is some (unknown to us) probability (the shaded area below) that the blood sugar
level of a randomly chosen diabetic lies between the values a and b:-
(figure: the same density function, with the area between a and b shaded)
This distribution has some (unknown to us) mean μ (which is what we want to estimate)
at the balance point of this density function, as indicated by the arrow:-
(figure: the same density function, with an arrow marking the mean μ at its balance point)
We continue to think of the situation before we get our data. That is, we continue to think in
terms of probability theory. We conceptualize about the average X̄ of the random variables
X₁, X₂, . . . , Xₙ. Since the mean value of X̄ is μ (from equation (22)), X̄ is an unbiased
estimator of μ. Thus x̄ is an unbiased estimate of μ. That is good news: it is shooting at
the right target. So our natural estimate is the correct one.
Much more important: how precise is it as an estimate of μ? This depends on the variance
of X̄. We know (see (22)) that the variance of X̄ is σ²/n, and even though we do not know
the value of σ², this result is still useful to us. Next, the Central Limit Theorem shows
that the probability distribution of X̄ is approximately normal when n is large, so that to a
good approximation we can use the two-standard-deviation rule. These facts lead us to an
approximate 95% confidence interval for μ.
The 95% confidence interval for μ
Suppose first that we know the numerical value of σ². (In practice it is very unlikely that
we would know this, but we will remove this assumption soon.) The two-standard-deviation
rule, deriving from properties of the normal distribution, then shows that for large n,
P(μ − 2σ/√n < X̄ < μ + 2σ/√n) ≈ 0.95. (51)
The inequalities in (51) can be written in the equivalent "turned inside-out" form
P(X̄ − 2σ/√n < μ < X̄ + 2σ/√n) ≈ 0.95. (52)
This leads to an approximate 95% confidence interval for μ, given the data values x₁, x₂, . . . , xₙ,
as
x̄ − 2σ/√n to x̄ + 2σ/√n. (53)
This interval is valuable in providing a measure of accuracy of the estimate x̄ of μ. To be
told that the estimate of a mean is 14.7 and that it is approximately 95% likely that the
mean is between 14.3 and 15.1 is far more useful information than being told only that the
estimate of a mean is 14.7.
The main problem with the above is that, in practice, the variance σ² is usually unknown,
so that (53) is not immediately applicable. However, it is possible to estimate σ² from the
data values x₁, x₂, . . . , xₙ. The theory here is not easy, so here is a "trust me" result: the
estimate s² of σ² found from observed data values x₁, x₂, . . . , xₙ is
s² = (x₁² + x₂² + ··· + xₙ² − n(x̄)²)/(n − 1). (54)
This leads (see (53)) to an even more approximate 95% confidence interval for μ as
x̄ − 2s/√n to x̄ + 2s/√n. (55)
This estimated confidence interval is useful, since it provides a measure of the accuracy of
the estimate x̄ and it can be computed entirely from the data. Further theory shows that in
practice it is reasonably accurate.
Some notes on the above
1. The number 2 appearing in the confidence interval (55) comes, eventually, from the
two-standard-deviation rule. This rule is only an approximate one so that, as mentioned
above, the 95% confidence interval (55) is only reasonably accurate.
2. Why do we have n − 1 in the denominator of the formula (54) for s² and not n (the sample
size)? This question leads to the concept of degrees of freedom, which we shall discuss later.
3. The effect of changing the sample size. Consider two investigators both interested in the
blood sugar levels of diabetics. Suppose that n = 10 for the first investigator (i.e. her sample
size was 10). Suppose that n = 40 for the second investigator (i.e. his sample size was 40).
The two investigators will estimate μ by their respective values of x̄. Since both estimates
are unbiased, that is both are shooting at the same target (μ), they should be reasonably
close to each other. Similarly, their respective estimates of σ² should be reasonably close to
each other, since both are unbiased estimates of σ².
On the other hand, the length of the confidence interval for μ for the second investigator
will be about half that of the first investigator, since he will have a √40 involved in the
calculation of his confidence interval, not the √10 that the first investigator will have (see
(55) and note that 1/√40 is half of 1/√10). This leads to the next point.
4. To be twice as accurate you need four times the sample size. To be three times as accu-
rate you need nine times the sample size. This happens because the length of the confidence
interval (55) is 4s/√n. The fact that there is a √n in the denominator and not an n explains
this phenomenon. This is why research is often expensive: to get really accurate estimates
one often needs very large sample sizes. To be 10 times as accurate you need 100 times the
sample size!
5. How large a sample size do you need before you do your experiment in order to get some
desired degree of precision of the estimate of the mean μ? One cannot answer this question
in advance, since the precision of the estimate depends on σ², which is unknown. Often one
runs a pilot experiment to estimate σ², and from this one can get a good idea of what sample
size is needed to get the required level of precision, using the above formulas.
6. The quantity s/√n is often called "the standard error of the mean". This statement
incorporates three errors. More precisely it should be: "the estimated standard deviation of
the estimator of the mean".
A numerical example. For many years corn has been grown using a standard seed processing
method. A new method is proposed in which the seed is kiln-dried before sowing. We want
to assess various properties of this new method. In particular we want to estimate μ, the
mean yield per acre (in pounds) under the new method, and to find two limits between
which we are approximately 95% certain that μ lies.
We plan to do this by sowing n = 11 separate acres of land with the new seed type and
measuring the yield per acre for each of these 11 acres. At this point, before we do the
experiment, these yields are unknown to us. They are random variables, and we think of
their values, before we carry out this experiment, as the random variables X₁, X₂, . . . , X₁₁.
We know (as above) that the average X̄ of these random variables has mean μ, so we know
that the estimate x̄ will be unbiased.
With this conceptualization behind us, we now apply this new style of seed to our 11
separate acre lots, and we get the following values (pounds per acre):-
1903, 1935, 1910, 2496, 2108, 1961, 2060, 1444, 1612, 1316, 1511.
These are our data values, which we have previously generically denoted by x₁, x₂, . . . , xₙ.
Now to our estimation and confidence interval procedures. We estimate the mean μ of
the yield per acre by the average
(1903 + 1935 + ··· + 1511)/11 = 1841.46.
We know from the above theory that this is an unbiased estimate of μ, the mean yield per
acre.
To calculate the approximate 95% confidence interval (55) for μ we first have to calculate
s², our estimate of the variance σ² of the probability distribution of yield with this new seed
type. The estimate of σ² is, from (54),
s² = ((1903)² + (1935)² + ··· + (1511)² − 11(1841.46)²)/10 = 117,468.9.
Following (55), these calculations lead to our approximate 95% confidence interval for μ as
1841.46 − 2√117,468.9/√11 to 1841.46 + 2√117,468.9/√11, (56)
that is, from 1634.78 to 2048.14.
Since the individual yields are clearly given rounded to whole numbers, it is not appropriate
to be more accurate than this in our final statement, which is: "We estimate the mean by
1841, and we are about 95% certain that it is between 1635 and 2048."
Often in research publications the above result might be written μ = 1841 ± 206. This can
be misleading because, for example, it is not indicated if this is a 95% or a 99% confidence
interval. Also, it is not the best way to present the conclusion.
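For readers who would like to check these numbers by computer rather than by hand, here is a short
sketch (in Python, as an illustration only) that reproduces the calculation of the average, of s² and of the
approximate 95% confidence interval from the 11 yields.

    import math

    yields = [1903, 1935, 1910, 2496, 2108, 1961, 2060, 1444, 1612, 1316, 1511]
    n = len(yields)

    x_bar = sum(yields) / n                                               # about 1841.45
    s_squared = (sum(y ** 2 for y in yields) - n * x_bar ** 2) / (n - 1)  # equation (54)
    half_width = 2 * math.sqrt(s_squared) / math.sqrt(n)                  # as in (55)

    print("Estimate of the mean:", round(x_bar, 2))
    print("Estimate of the variance:", round(s_squared, 1))
    print("Approximate 95% CI:", round(x_bar - half_width, 2), "to", round(x_bar + half_width, 2))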
Estimating the difference between two binomial parameters
Let's start with an example. Is there a difference between men and women in their atti-
tudes in the pro-life/pro-choice debate? We approach this question from a statistical point
of view as follows.
Let θ₁ be the (unknown) probability that a woman is pro-choice and let θ₂ be the (un-
known) probability that a man is pro-choice. So we are interested in the difference θ₁ − θ₂.
Our aim is to take a sample of n₁ women and n₂ men and find out for each person whether
he/she is pro-life or pro-choice. Our aim then is to estimate θ₁ − θ₂ and to find an approxi-
mate 95% confidence interval for θ₁ − θ₂.
Suppose now that we have taken our sample, and that x₁ of the n₁ women are pro-choice,
and x₂ of the n₂ men are pro-choice. We would estimate θ₁ by x₁/n₁, which we will write as
p₁, and would estimate θ₂ by x₂/n₂, which we will write as p₂. Thus we would (correctly)
estimate θ₁ − θ₂ by the difference d = p₁ − p₂.
What are the properties of this estimate? These are determined by the properties of
the random variable D = P₁ − P₂, where, before we took our sample, P₁ = X₁/n₁ is the
proportion of women who will be pro-choice and P₂ = X₂/n₂ is the proportion of men who
will be pro-choice. Both P₁ and P₂ are random variables.
Notice that P₁ − P₂ is a difference. In comparing two groups we are often involved with
differences. That is why we have done some probability theory about differences.
Now we know that the mean of P₁ is θ₁ (this is one of the "magic formulas" for the pro-
portion of successes in the binomial context) and we also know that the mean of P₂ is θ₂
(this uses the same "magic formula"). Thus from the first equation in (26), giving the mean of
a difference of two random variables with possibly different means, the mean of D is θ₁ − θ₂.
Thus D is an unbiased estimator of θ₁ − θ₂ and correspondingly d = p₁ − p₂ is an unbiased
estimate of θ₁ − θ₂. It is shooting at the right target. It is the estimate of θ₁ − θ₂ that we
will use.
More important: how precise is this estimate? To answer this we have to find the variance
of the estimator P₁ − P₂. Now the variance of P₁ is θ₁(1 − θ₁)/n₁ (from the variance of the
proportion of successes in n₁ binomial trials). Similarly the variance of P₂ is θ₂(1 − θ₂)/n₂.
From the second equation in (26), giving the variance of a difference of two random variables
with possibly different variances, the variance of D is θ₁(1 − θ₁)/n₁ + θ₂(1 − θ₂)/n₂.
We do not of course know the numerical value of this variance, since we do not know the
values of θ₁ and θ₂. However, we have an estimate of θ₁, namely p₁, and an estimate of θ₂,
namely p₂. So we could estimate this variance by p₁(1 − p₁)/n₁ + p₂(1 − p₂)/n₂.
Using the same sort of argument that led to (46), we could then say
P(p₁ − p₂ − 2√(p₁(1 − p₁)/n₁ + p₂(1 − p₂)/n₂) < θ₁ − θ₂ <
p₁ − p₂ + 2√(p₁(1 − p₁)/n₁ + p₂(1 − p₂)/n₂)) ≈ 0.95. (57)
This leads to the so-called (approximate) 95% confidence interval for θ₁ − θ₂ of
p₁ − p₂ − 2√(p₁(1 − p₁)/n₁ + p₂(1 − p₂)/n₂) to
p₁ − p₂ + 2√(p₁(1 − p₁)/n₁ + p₂(1 − p₂)/n₂). (58)
These formulas are pretty clumsy, so we carry out the same approximation that we did
when estimating a single binomial parameter (see the discussion leading to (48)). That is,
we use the mathematical fact that neither p₁(1 − p₁) nor p₂(1 − p₂) can ever exceed 1/4.
Further, for quite a wide range of values of any fraction f near 1/2, f(1 − f) is quite close
to 1/4. So if we approximate both p₁(1 − p₁) and p₂(1 − p₂) by 1/4, and remember that
√(1/4) = 1/2, we arrive at a conservative confidence interval for θ₁ − θ₂ as
p₁ − p₂ − √(1/n₁ + 1/n₂) to p₁ − p₂ + √(1/n₁ + 1/n₂). (59)
Numerical example. Suppose that we interview n₁ = 1,000 women and n₂ = 800 men
on the pro-life/pro-choice question. We find that 624 of the women are pro-choice and 484
of the men are. So we estimate θ₁ by p₁ = 624/1,000 = 0.624 and we estimate θ₂ by
p₂ = 484/800 = 0.605. So we estimate the difference between the proportion of women
who are pro-choice and the proportion of men who are pro-choice to be 0.624 − 0.605 =
0.019. Further, we are approximately 95% certain that the actual difference is between
0.019 − √(1/1,000 + 1/800) = 0.019 − 0.047 = −0.028 and 0.019 + √(1/1,000 + 1/800) =
0.019 + 0.047 = 0.066. A TV commentator would call 0.047 the margin of error.
Later, when we do hypothesis testing, we will see if the estimate 0.019 differs significantly
from 0.
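Here is a short sketch (Python, illustration only) of the conservative interval (59) for this example.

    import math

    n1, x1 = 1_000, 624    # women: sample size and number pro-choice
    n2, x2 = 800, 484      # men: sample size and number pro-choice

    p1, p2 = x1 / n1, x2 / n2
    d = p1 - p2                             # estimate of theta1 - theta2
    margin = math.sqrt(1 / n1 + 1 / n2)     # conservative margin of error, as in (59)

    print("Estimated difference:", round(d, 3))
    print("Approximate 95% CI:", round(d - margin, 3), "to", round(d + margin, 3))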
Estimating the difference between two means
As in the previous section, let's start with an example. In fact we will follow the structure
of the last section fairly closely. Is the mean blood pressure of women equal to the mean
blood pressure of men? We approach this question from a statistical point of view as follows.
Let μ₁ be the (unknown) mean blood pressure for a woman and μ₂ be the (unknown)
mean blood pressure for a man. So we are interested in the difference μ₁ − μ₂. Our aim is
to take a sample of n₁ women and n₂ men and measure the blood pressures of all n₁ + n₂
people. Our aim then is to estimate μ₁ − μ₂ and to find an approximate 95% confidence
interval for μ₁ − μ₂.
Clearly we estimate the mean blood pressure for women by x̄₁, the average of the blood
pressures of the n₁ women in the sample, and similarly we estimate the mean blood pressure
for men by x̄₂, the average of the blood pressures of the n₂ men in the sample. We then
estimate μ₁ − μ₂ by x̄₁ − x̄₂.
How accurate is this estimate? This depends on the variance of the random variable
X̄₁ − X̄₂. Using the formula for the variance of a difference, as well as the formula for the
variance of an average, this variance is σ₁²/n₁ + σ₂²/n₂, where σ₁² is the (unknown) variance of blood
pressure among women and σ₂² is the (unknown) variance of blood pressure among men. We
do not know either σ₁² or σ₂² and these will have to be estimated from the data. If the blood
pressures of the n₁ women are denoted x₁₁, x₁₂, . . . , x₁ₙ₁, we estimate σ₁² (see equation (54)) by
s₁² = (x₁₁² + x₁₂² + ··· + x₁ₙ₁² − n₁(x̄₁)²)/(n₁ − 1). (60)
Similarly, if the blood pressures of the n₂ men are denoted x₂₁, x₂₂, . . . , x₂ₙ₂, we estimate σ₂²
(see equation (54)) by
s₂² = (x₂₁² + x₂₂² + ··· + x₂ₙ₂² − n₂(x̄₂)²)/(n₂ − 1). (61)
Thus we estimate σ₁²/n₁ + σ₂²/n₂ by s₁²/n₁ + s₂²/n₂.
Finally, our approximate 95% confidence interval for μ₁ − μ₂ is
x̄₁ − x̄₂ − 2√(s₁²/n₁ + s₂²/n₂) to x̄₁ − x̄₂ + 2√(s₁²/n₁ + s₂²/n₂). (62)
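Here is a minimal sketch (Python, illustration only) of formula (62). The blood pressure readings used
below are made-up numbers, chosen purely so that the code can be run end to end.

    import math

    # Hypothetical blood pressure readings, purely to illustrate formula (62).
    women = [118, 122, 115, 130, 124, 119, 127, 121]
    men = [128, 135, 126, 140, 131, 129, 133]

    def mean_and_s_squared(values):
        """Return the sample average and the estimate (54) of the variance."""
        n = len(values)
        avg = sum(values) / n
        s_sq = (sum(v ** 2 for v in values) - n * avg ** 2) / (n - 1)
        return avg, s_sq

    x_bar1, s_sq1 = mean_and_s_squared(women)
    x_bar2, s_sq2 = mean_and_s_squared(men)

    d = x_bar1 - x_bar2
    half_width = 2 * math.sqrt(s_sq1 / len(women) + s_sq2 / len(men))   # as in (62)
    print("Estimated difference of means:", round(d, 2))
    print("Approximate 95% CI:", round(d - half_width, 2), "to", round(d + half_width, 2))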
Regression
How does one thing depend on another? How does the GNP of a country depend on the
number of people in full-time employment? How does the reaction time of a person to some
stimulus depend on the amount of sleep deprivation administered to that person? How does
the growth height of a plant in a greenhouse depend on the amount of water that we give
the plant during the growing period? Many practical questions are of the "how does this
depend on that?" type.
These questions are answered by the technique of regression. Regression problems can get
pretty complicated, so we consider here only one case of regression: How does some random
non-controllable quantity Y depend on some non-random controllable quantity x?
Notice two things about the notation. First, we denote the random quantity in upper case
- see Y above. This is in accordance with the notational convention of denoting random
variables in upper case. We denote the controllable non-random quantity in lower case - see
x above. Second, we are denoting the random variable by Y . Up to now we have denoted
random variables using the letter X. We switch the notation from X to Y in the regression
context to because we will later plot our data values in the standard x-y plane, and it is
natural to plot the observed values of the random variable as the y values.
We will use the plant and water example to demonstrate the central regression concepts.
First we think of the situation before our experiment, and consider some typical generic plant
to which we plan to give x units of water. At this stage the eventual growth height Y of
this plant is a random variable - we do not know what it will be. We make the assumption
that the mean of Y is of the linear form:-
mean of Y = α + βx, (63)
where α and β are parameters, that is quantities whose value we do not know. In fact our
main aim once the experiment is finished is to estimate the numerical values of these param-
eters and to get some idea of the precision of our estimates. Note that we assume that the
mean growth height potentially depends on x, and indeed our main aim is to assess the way
it depends on x.
We also assume that
variance of Y = σ², (64)
where σ² is another parameter whose value we do not know and which, once the experiment
is finished, we wish to estimate. The fact that there is a (positive) variance for Y derives
from the fact that there is some, perhaps much, uncertainty about what the value of the
plant growth will be after we have done the experiment. There are many factors that we
do not know about, such as soil fertility, which imply that Y is a random variable, with a
variance.
So we are involved with three parameters, α, β and σ². We do not know the value of any
one of them. As stated above, one of our aims, once the experiment is over, is to estimate
these parameters from our data.
Of these three the most important one to us is β. The interpretation of β is that it is the
mean increase in growth height per unit increase in the amount of water given. If β = 0
this mean increase is zero, and equation (63) shows that the mean growth height does not
depend on the amount of water that we give the plant. So we will be interested, later, in
seeing if our estimate of β, once we have our data, is close to zero or not.
Taking a break from regression for a moment, equation (63) reminds us that the algebraic
equation y = a + bx defines a geometric line in the x-y plane, as shown:-
(figure: the line y = a + bx, showing the intercept a on the y axis and the slope b)
The interpretation of a is that it is the intercept of this line on the y axis (as shown). The
interpretation of b is that it is the slope of the line. If b = 0 the line is horizontal, and then
the values of y for points on the line are all the same, whatever the value of x.
Now back to the regression context. We plan to use some pre-determined number n of plants
in our greenhouse experiment, planning to give the plants respectively x₁, x₂, . . . , xₙ units of
water. These x values do not have to be all different from each other, but it is essential that
they are not all equal. In fact there is a strategy question about how we would choose the
values of x₁, x₂, . . . , xₙ, which is discussed later.
We are still thinking of the situation before we conduct our experiment. At this stage we
conceptualize about the growth heights Y₁, Y₂, . . . , Yₙ of the n plants. (Y₁ corresponds to the
plant getting x₁ units of water, Y₂ corresponds to the plant getting x₂ units of water, and so
on.) These are all random variables - we do not know in advance of doing the experiment
what values they will take. Then from equation (63), the mean of Y₁ is α + βx₁, the mean
of Y₂ is α + βx₂, and so on. The variance of Y₁ is σ², the variance of Y₂ is also σ², and so
on. We assume that the various Yᵢ values are independent. However, they are clearly not
assumed to be identically distributed, since if for example xᵢ ≠ xⱼ, that is the amount of
water to be given to plant i differs from that to be given to plant j, the means of Yᵢ and Yⱼ
are different if β ≠ 0 and the assumptions embodied in (63) are true.
Equation (63) shows that the mean of Y is a linear function of x. This means that once
we have our data they should (if the assumption in (63) is correct) approximately lie on a
straight line. We do not expect them to lie exactly on a straight line: we can expect random
deviations from a straight line because of factors unknown to us such as differences in soil
composition among the pots that the various plants are grown in, temperature differences
from the environment of one plant to another, etc. The fact that deviations from a straight
line are to be expected is captured by the concept of the variance σ². The larger this (un-
known to us) variance is, the larger these deviations from a line would tend to be.
All the above refers to the situation before we conduct our experiment. We now do the
experiment, and we obtain growth heights y₁, y₂, . . . , yₙ. (The plant getting x₁ units of wa-
ter had growth height y₁, the plant getting x₂ units of water had growth height y₂, and so on.)
The first thing that we have to do is to plot the (x₁, y₁), (x₂, y₂), . . . , (xₙ, yₙ) values on a
graph. Equation (63) shows that when we do this the data points should more or less lie
on a straight line. Suppose that our data points are as shown below:-
(figure: a scatter plot of data points lying more or less on a straight line)
These data points more or less lie on a straight line. If the data points are more or less
on a straight line (as above, and deciding this is really a matter of judgement) we can go
ahead with our analysis. If they are clearly not on a straight line (see the next figure) then
you should not proceed with the analysis.
(figure: a scatter plot of data points clearly not lying on a straight line)
There are methods for dealing with data that clearly do not lie close to being on a straight
line, but we do not consider them here. So from now on we assume that the data are more
or less on a straight line.
Our first aim is to use the data to estimate α, β and σ². To do this we have to calculate
various quantities. These are
x̄ = (x₁ + x₂ + ··· + xₙ)/n,   ȳ = (y₁ + y₂ + ··· + yₙ)/n, (65)
as well as the quantities sxx, syy and sxy, defined by
sxx = (x₁ − x̄)² + (x₂ − x̄)² + ··· + (xₙ − x̄)², (66)
syy = (y₁ − ȳ)² + (y₂ − ȳ)² + ··· + (yₙ − ȳ)², (67)
sxy = (x₁ − x̄)(y₁ − ȳ) + (x₂ − x̄)(y₂ − ȳ) + ··· + (xₙ − x̄)(yₙ − ȳ). (68)
The most important parameter is β, since if β = 0 the growth height for any plant does not
depend on the amount of water given to the plant. The derivation of unbiased estimates
here is complicated, so we just give the "trust me" results:-
We estimate β by b, defined by
b = sxy/sxx. (69)
We estimate α by a, defined by
a = ȳ − b x̄. (70)
Finally, we estimate σ² by sr², defined by
sr² = (syy − b² sxx)/(n − 2). (71)
These are the three estimates that we want for our further analysis.
Notes on this.
1. The suffix "r" in sr² stands for the word "regression". The formula for sr² relates only to
the regression context.
2. You will usually do a regression analysis by a statistical package (we will do an example
in class), so in practice you will usually not have to do the computations for these estimates.
3. It can be shown (the math is too difficult to give here) that a is an unbiased estimate of
α, that b is an unbiased estimate of β, and that sr² is an unbiased estimate of σ².
4. How accurate is the estimate b of β? Again here there is some difficult math that you will
have to take on trust. The bottom line is that we are about 95% certain that β is between
b − 2sr/√sxx and b + 2sr/√sxx. (72)
This is our (approximate) 95% confidence interval for β. Clearly the 2 in this result comes
from the two-standard-deviation rule. You will have to take the 2sr/√sxx part of it on trust.
5. This result introduces a strategy concept into our choice of the values x₁, x₂, . . . , xₙ, the
amounts of water that we plan to put on the various plants. The width of this confidence
interval is proportional to 1/√sxx. Thus the larger we make sxx the shorter is the length
of this confidence interval and the more precise we can be about the value of β. How can we
make sxx large? We can do this by spreading the x values as far away from their average as
we reasonably can. However, two further considerations then come into play. We should keep
the various x values within the range of values which is of interest to us. Also, we do not want
to make half the x values at the lower end of this interesting range and the other half at the
upper end of this interesting range. If we did this we would have no idea what is happening
in the middle of this range - see the picture below to illustrate the case where we put about
half our x values at the same low value and the other half of our x values at the same high value.
(figure: a scatter plot with the x values clustered at two extreme values only, and nothing in between)
So in practice we tend to put quite a few x values near the extremes but also string quite a
few x values in the middle of the interesting range. This will be illustrated in a numerical
example later.
6. A second result goes in the other direction. Suppose that the amounts of water put on
the various plants were close to each other. In other words the x values would be close to
each other and all would then be close to their average. This would mean that sxx would be
small. So 2sr/√sxx would be large, and the confidence interval (72) would be wide. We then
have little confidence in our estimate of β.
An even more extreme case arises if we give all plants the same amount of water. Then
both sxx and sxy would be zero, and the definition of b shows that we would calculate b as
0/0, which mathematically makes no sense. In fact the formula for b is telling you: "You
can't estimate β with the data that you have." The formula here is definitely sending you
a message. In fact it is saying: "You want to assess how the growth height depends on the
amount of water given to the plant. If you give all plants the same amount of water there
is no way that you can do this." It would be the same as a situation where you wanted to
assess how the height of a child depended on his/her age, and all the children in your sample
were of exactly the same age. You clearly could not make this assessment with data of that
type.
Example. We will do an example from the water and plant growth situation.
We have n = 12 plants to which we gave varying amounts of water (see below). After the
experiment we obtained the following data:-
Plant number 1 2 3 4 5 6 7 8 9 10 11 12
Amount of water 16 16 16 18 18 20 22 24 24 26 26 26
Growth height 76.2 77.1 75.7 78.1 77.8 79.2 80.2 82.5 80.7 83.1 82.2 83.6
From these we compute x̄ = (16 + 16 + ··· + 26)/12 = 21 and ȳ = (76.2 + 77.1 + ··· + 83.6)/12 =
79.7. Also we find sxx = 188, syy = 83.54 and sxy = 122.4.
We now compute our estimate b of β as sxy/sxx = 122.4/188 = 0.6510638. (This result is
given to 7 decimal places so as to compare with the JMP printout. In practice you are not
justified in giving it to an accuracy greater than that of the data, so in practice we would
write b = 0.65.)
Next, our estimate a of α is ȳ − b x̄ = 79.7 − (0.6510638 × 21) = 66.02766. (Again this result
is given to 7 decimal places so as to compare with the JMP printout. In practice you are not
justified in giving it to an accuracy greater than that of the data, so in practice we would
write a = 66.03.)
Finally, we estimate σ² by sr², calculated in this case (see (71)) as
(83.54 − (0.6510638)² × 188)/10 = 0.384979.
How accurate is our estimate b of β? First, from the theory it is unbiased. It was found by
a process which is truly aiming at β. Next, we are approximately 95% certain that β is
between
0.65 − 2√0.384979/√188 and 0.65 + 2√0.384979/√188, (73)
that is, from 0.56 to 0.74.
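The whole calculation can be reproduced with a short sketch (Python, as an illustration only; in practice
you would use JMP, as discussed below).

    import math

    water = [16, 16, 16, 18, 18, 20, 22, 24, 24, 26, 26, 26]
    height = [76.2, 77.1, 75.7, 78.1, 77.8, 79.2, 80.2, 82.5, 80.7, 83.1, 82.2, 83.6]
    n = len(water)

    x_bar = sum(water) / n
    y_bar = sum(height) / n
    s_xx = sum((x - x_bar) ** 2 for x in water)                            # equation (66)
    s_yy = sum((y - y_bar) ** 2 for y in height)                           # equation (67)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(water, height))   # equation (68)

    b = s_xy / s_xx                                    # estimate of beta, equation (69)
    a = y_bar - b * x_bar                              # estimate of alpha, equation (70)
    s_r_squared = (s_yy - b ** 2 * s_xx) / (n - 2)     # estimate of sigma squared, equation (71)

    half_width = 2 * math.sqrt(s_r_squared) / math.sqrt(s_xx)              # as in (72)
    print("b =", round(b, 4), " a =", round(a, 2), " s_r^2 =", round(s_r_squared, 4))
    print("Approximate 95% CI for beta:", round(b - half_width, 2), "to", round(b + half_width, 2))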
Notes on this
1. Our so-called estimated regression line is y = 66.03 + 0.65x. That is the equation of the
line that appears on the JMP screen. We could use this line, for example, to say that we
estimate the mean growth height for a plant given 21 units of water to be 66.03 + 0.65 × 21
= 79.68.
2. Never extrapolate beyond the range of x values used in the experiment. For example it is not appro-
priate to say that we estimate the mean growth height for a plant given 1,000 units of water
is 66.03 + 0.65 × 1,000 = 716.03. (You probably would have killed the plant if you gave it
this much water.)
3. Later we will consider testing the hypothesis that the growth height of the plant does not
depend on the amount of water given to it. This is equivalent to testing the hypothesis β = 0.
4. Notice the choices of the amount of water in the above example. We gave three plants
the lowest amount of water (16) and three plants the highest amount of water (26). We also
strung a few values out between these values, in accordance with the discussion above about
the choice of x values.
We will do this example by JMP in class. There will also be a handout discussing the JMP
output and the interpretation of various things in this output. That handout should be
regarded as part of these notes.
Testing hypotheses
Background
In hypothesis testing we attempt to answer questions. Here are some simple examples.
Is this coin fair? Is a woman equally likely to be left-handed as a man is? Is there any
difference between men and women so far as blood pressure is concerned? Is there any effect
of the amount of water given to a plant on its growth height?
We always re-phrase these questions in terms of questions about parameters:-
If the probability of a head is θ, is θ = 1/2? If the probability that a woman is left-handed
is θ₁, and the probability that a man is left-handed is θ₂, is θ₁ = θ₂? If the mean blood
pressure for a woman is μ₁, and the mean blood pressure for a man is μ₂, is μ₁ = μ₂? Is β = 0?
We re-phrase these questions in this way because we know how to estimate parameters and
to get some idea of the precision of our estimates. So re-phrasing questions in terms of
questions about parameters helps us to answer them. Attempting to answer them is an
activity of hypothesis testing.
The general approach to hypothesis testing
We will consider two equivalent approaches to hypothesis testing. The first approach pre-
dates the availability of statistical packages, while the second approach is to some extent
motivated by the availability of these packages. We will discuss both approaches. Both
approaches involve five steps. The first three steps in both approaches are the same, and
we consider these three steps first. We will illustrate all steps by considering two problems
involving the binomial distribution.
Step 1
Statistical hypothesis testing involves the test of a null hypothesis (which we write in short-
hand as H0) against an alternative hypothesis (which we write in shorthand as H1). The
first step in a hypothesis testing procedure is to declare the relevant null hypothesis H0 and
the relevant alternative hypothesis H1. The null hypothesis, as the name suggests, usually
states that "nothing interesting is happening". This comment is discussed in more detail
below. The choice of null and alternative hypotheses should be made before the data are
seen. Also, the nature of the alternative hypothesis must be decided before the data are seen:
this is also discussed in more detail below. To decide on a hypothesis as a result of the data
is to introduce a bias into the procedure, invalidating any conclusion that might be drawn
from it. Our aim is eventually to accept or to reject the null hypothesis as the result of an
objective statistical procedure, using our data in making this decision.
It is important to clarify the meaning of the expression "the null hypothesis is accepted".
This expression means that there is no statistically significant evidence for rejecting the null
hypothesis in favor of the alternative hypothesis. A better expression for "accepting" is thus
"not rejecting". So instead of saying "We accept H0", it is best to say "We do not have
significant evidence to reject H0".
The alternative hypothesis will be one of three types: "one-sided up", "one-sided down",
and "two-sided". In any one specific situation which one of these three types is appropriate
must be decided in advance of getting the data. The context of the situation will generally
make it clear which is the appropriate alternative hypothesis.
All the above seems very abstract, so as stated above we will illustrate the steps in the
hypothesis testing procedure by two examples, both involving the binomial distribution.
Example 1
It is essential for a gambling casino that the various games offered are fair, since an astute
gambler will soon notice if they are unfair and bet accordingly. As a simplified example,
suppose that one game involves flipping a coin, and it is essential, from the point of view
of the casino operator, that this coin be fair. The casino operator now plans to carry out a
hypothesis testing procedure. If the probability of getting a head on any flip of the coin is
denoted θ, the null hypothesis H0 for the casino operator then states that θ = 1/2. (No bias
in the coin. Nothing interesting happening.)
In the casino example it is important, from the point of view of the casino operator, to detect
a bias of the coin towards either heads or tails (if there is such a bias). Thus in this case the
alternative hypothesis H1 is the two-sided alternative θ ≠ 1/2. This alternative hypothesis is
said to be composite: it does not specify some numerical value for θ (as H0 does). Instead
it specifies a whole collection of values. It often happens that the alternative hypothesis is
composite.
Example 2
This example comes from the field of medical research. Suppose that we have been using
some medicine for some illness for many years (we will call this the "current" medicine),
and we in effect know from much experience that the probability of a cure with the current
medicine is 0.84. A new medicine is proposed and we wish to assess whether it is better
than the current medicine. Here the only interesting possibility is that it is better than
the current medicine: if it is equally effective as the current medicine, or (even worse) less
effective than the current medicine, we would not want to introduce it.
Let θ be the (unknown) probability of a cure with the new medicine. Here the null hypoth-
esis is θ = 0.84. If this null hypothesis is true the new medicine is equally effective as the
current one, since its cure rate would be equal to that of the current medicine. The natural
alternative in this case is "one-sided up", namely θ > 0.84. This is the only case of interest
to us. This is also a composite hypothesis.
Notice how, in both examples, the nature of the alternative hypothesis is determined by the
context, and that in both cases the null and alternative hypotheses are stated before the
data are seen.
Step 2
Since the decision to accept or reject H0 will be made on the basis of data derived from some
random process, it is possible that an incorrect decision will be made, that is, to reject H0
when it is true (a Type I error, or false positive), or to accept H0 when it is false (a Type
II error, or false negative). This is illustrated in the following table:-

                     We accept H0        We reject H0
H0 is true           OK                  Type I error
H0 is false          Type II error       OK
When testing a null hypothesis against an alternative it is not possible to ensure that the
probabilities of making a Type I error and a Type II error are both arbitrarily small unless
we are able to make the number of observations as large as is needed to do this. In practice
we are seldom able to get enough observations to do this.
This dilemma is resolved in practice by observing that there is often an asymmetry in the
implications of making the two types of error. In the two examples given above there might
be more concern about making the false positive claim and less concern about making the
false negative claim. This would be particularly true in the medicine example: we are
anxious not to claim that the new medicine is better than the current one if it is not better.
If we make this claim and the new medicine is not better than the current one, many millions
of dollars will have been spent manufacturing the new medicine, only to find later that it is
not better than the current one. For this reason, a frequently adopted procedure is to focus
on the Type I error, and to fix the numerical value of this error at some acceptably low level
(usually 1% or 5%), and not to attempt to control the numerical value of the Type II error.
The value chosen is denoted α. The choice of the values 1% and 5% is reasonable, but is also
clearly arbitrary. The choice 1% is a more conservative one than the choice 5% and is often
made in a medical context. Step 2 of the hypothesis testing procedure consists in choosing
the numerical value for the Type I error, that is, in choosing the numerical value of α. This
choice is entirely at your discretion. In the two examples that we are considering we will
choose 1% for the medical example and 5% for the coin example.
Step 3
The third step in the hypothesis testing procedure consists in determining a test statistic.
This is the quantity calculated from the data whose numerical value leads to acceptance or
rejection of the null hypothesis. In the coin example the natural test statistic is the number
of heads that we will get after we have flipped the coin in our testing procedure. In the
medicine case the natural test statistic is the number of people cured with the new medicine in
a clinical trial. These are both more or less obvious, and both are the correct test statistics:
however, in more complicated cases the choice of a test statistic is not so straightforward.
As stated above there are two (equivalent) approaches to hypothesis testing. Which approach
we use is simply a matter of our preference. As also stated above, the first three steps (as
outlined above) are the same for both approaches. Steps 4 and 5 differ under the two
approaches, so we now consider them separately.
Approach 1, Step 4
Under Approach 1, Step 4 in the procedure consists in determining which observed values of
the test statistic lead to rejection of H0. This choice is made so as to ensure that the test has
the numerical value for the Type I error chosen in Step 2. We first illustrate this step with
the medicine example, where the calculations are simpler than in the coin example. First we
review steps 1 - 3 in this example.
Step 1. We write the (unknown) probability of a cure with the new medicine as θ. The null
hypothesis claims that θ = 0.84 and the alternative hypothesis claims that θ > 0.84.
Step 2. Since this is a medical example we choose a Type I error of 1%.
Step 3. Suppose that we plan to give the new medicine to 5,000 patients. We will reject the
null hypothesis if the number x of patients who were cured with the new medicine is large
enough. In other words x is our test statistic.
Now we proceed to steps 4 and 5.
Step 4. How large does x, the number of patients cured with the new medicine, have to
be before we will reject the null hypothesis? We will reject the null hypothesis if x ≥ A,
where A is chosen so that the Type I error takes the desired value 1%. We now have to do
a probability theory "zig", and consider, before the clinical trial is conducted, the random
variable X, the number of people who will be cured with the new medicine. Then we will
reject the null hypothesis if x ≥ A, where (using a probability theory "zig") A is chosen so
that P(X ≥ A when θ = 0.84) = 0.01.
How do we calculate the value of A? We will use the Central Limit Theorem and a Z chart. If
θ = 0.84 the mean of X is (5,000)(0.84) = 4,200 and the variance of X is (5,000)(0.84)(0.16)
= 672, using the formula for the mean and the variance of a binomial random variable.
(Why binomial? Because there are two possible outcomes on each trial for each patient -
cured or not cured.) Next, to a very close approximation X can be taken as having a normal
distribution with this mean and this variance when the null hypothesis is true. So to this
level of approximation, A has to be such that
P(X ≥ A) = 0.01,
where X has a normal distribution with mean 4,200 and variance 672. We now do a "z-ing":
we want
P((X − 4,200)/√672 ≥ (A − 4,200)/√672) = 0.01.
Now when the null hypothesis is true, (X − 4,200)/√672 is a Z, and the Z charts now show that
(A − 4,200)/√672 has to be equal to 2.326. (You have to use the Z chart "inside-out" to find this value.)
Solving the equation (A − 4,200)/√672 = 2.326 we find that A = 4,260.30. To be conservative, we
choose the value 4,261.
To conclude Step 4, we have made the calculation that if the number of people cured with
the new medicine is 4,261 or more we will reject the null hypothesis and claim that the new
medicine is superior to the current one.
It is now straightforward to do step 5, so we do it.
Approach 1, Step 5
The final step in the testing procedure is straightforward. We do the clinical trial and count
the number of people cured with the new medicine. If this number is 4,261 or larger we
reject the null hypothesis and claim that the new medicine is superior to the current one. If
this number is less than 4,261 we say that we do not have significant evidence that the new
medicine is better than the current one.
Note on this
The value 4,261 is sometimes called the critical point and the range of values 4,261 or
more is sometimes called the critical region.
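As a computational aside (not part of the testing procedure itself), the value of A can be checked numerically. The following is a minimal sketch, assuming Python with the scipy package is available; the variable names are mine, not from the notes.

```python
# Sketch: critical point A for the medicine example via the normal approximation.
# Assumes scipy is installed; variable names are illustrative only.
from math import sqrt, ceil
from scipy.stats import norm

n, theta0, alpha = 5000, 0.84, 0.01          # patients, null cure probability, Type I error
mean = n * theta0                            # 4,200
sd = sqrt(n * theta0 * (1 - theta0))         # sqrt(672)

A = norm.ppf(1 - alpha, loc=mean, scale=sd)  # P(X >= A) = alpha under the null hypothesis
print(round(A, 2))                           # about 4260.30
print(ceil(A))                               # conservative (rounded-up) critical point: 4261
```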
The coin example.
First we review steps 1, 2 and 3.
Step 1. We write θ as the probability of a head on each flip. The null hypothesis claims that
θ = 1/2 and the alternative hypothesis claims that θ ≠ 1/2.
Step 2. We choose a numerical value for α as 5%.
Step 3. Suppose that we plan to flip the coin 10,000 times in our experiment. The test
statistic is x, the number of heads that we will get after we have ipped the coin 10,000
times.
Now we proceed to steps 4 and 5.
Step 4. This test is two-sided, so we will reject the null hypothesis if x is either too large or
too small. How large or how small? We will reject the null hypothesis if x ≤ A or if x ≥ B,
where A and B have to be chosen so that α = 5%. We now go on a probability theory zig,
and consider the random variable X, the random number of times we will get heads before
the experiment is done. We have to choose A and B so that

P(X ≤ A) + P(X ≥ B) = 0.05 when H_0 is true.

Choosing A and B so as to satisfy this requirement ensures that the Type I error is indeed
5%. We usually adopt the symmetric requirement

P(X ≤ A) = P(X ≥ B) = 0.025 when H_0 is true.
Let us first calculate B. When the null hypothesis is true, X has a binomial distribution with
mean 5,000 and variance 2,500 (using the formula for the mean and the variance of a binomial
random variable). The standard deviation of X is thus √2,500 = 50. To a sufficiently close
approximation, when the null hypothesis is true X has a normal distribution with this mean
and this standard deviation. Thus when the null hypothesis is true, (X − 5,000)/50 is a Z.
Carrying out a Z-ing procedure, we get

P( (X − 5,000)/50  ≥  (B − 5,000)/50 ) = 0.025.
Since (X − 5,000)/50 is a Z when the null hypothesis is true, the Z charts show that
(B − 5,000)/50 = 1.96, and solving this equation for B we get B = 5,098.
Carrying out a similar operation for A we find A = 4,902.
Step 5. We now flip the coin 10,000 times, and if the number
of heads is 4,902 or fewer, or 5,098 or more, we reject the null hypothesis and claim that we
have significant evidence that the coin is biased. If the number of heads is between 4,903 and
5,097 inclusive, we say that we do not have significant evidence to reject the null hypothesis.
That is, we do not have significant evidence to claim that the coin is unfair.
Note on this
The values 4,902 and 5,098 are sometimes called the critical points and the range of
values x ≤ 4,902 or x ≥ 5,098 is sometimes called the critical region.
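Again as a computational aside, the two critical points can be found numerically. This is a sketch only, assuming the scipy package is available; the names are mine.

```python
# Sketch: two-sided critical points A and B for the coin example,
# via the normal approximation to the binomial. Assumes scipy is installed.
from math import sqrt, floor, ceil
from scipy.stats import norm

n, theta0, alpha = 10000, 0.5, 0.05
mean = n * theta0                                  # 5,000
sd = sqrt(n * theta0 * (1 - theta0))               # 50

A = norm.ppf(alpha / 2, loc=mean, scale=sd)        # lower critical point
B = norm.ppf(1 - alpha / 2, loc=mean, scale=sd)    # upper critical point
print(floor(A), ceil(B))                           # about 4902 and 5098
```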
We now consider Approach 2 to hypothesis testing.
As stated above, steps 1, 2 and 3 are the same under Approach 2 as they are under Approach
1. So we now move to steps 4 and 5 under Approach 2, again using the coin and the medicine
examples.
Approach 2, Step 4
Under Approach 2 we now do our experiment and note the observed value of the test statis-
tic. Thus in the medicine example we do the clinical trial (with the 5,000 patients) and
observe the number of people cured under the new medicine. In the coin example we flip
the coin 10,000 times and observe the number of heads that we got.
Approach 2, Step 5
This step involves the calculation of a so-called P-value. Once the data are obtained we
calculate the probability of obtaining the observed value of the test statistic, or one more
extreme in the direction indicated by the alternative hypothesis, assuming that the null
hypothesis is true. This probability is called the P-value. If the P-value is less than or
equal to the chosen Type I error, the null hypothesis is rejected. This procedure always
leads to a conclusion identical to that based on the significance point approach.
For example, suppose that in the medicine example the number of people cured under
the new medicine was 4,272. Using the normal distribution approximation to the binomial,
the P-value is the probability that a random variable X having a normal distribution with
mean 4,200 and variance 672 (the null hypothesis mean and variance) takes a value 4,272
or more. This is a straightforward probability theory zig operation, carried out using a
Z-ing procedure and normal distribution charts. We have
P(X ≥ 4,272) = P( (X − 4,200)/√672  ≥  (4,272 − 4,200)/√672 ),

and since (X − 4,200)/√672 is a Z when the null hypothesis is true, and (4,272 − 4,200)/√672 = 2.78, we obtain,
from the Z chart, a P-value of 0.0027. This is less than the chosen Type I error (0.01) so we
reject the null hypothesis. This is exactly the same conclusion that we would have reached
using Approach 1, since the observed value 4,272 exceeds the critical point 4,261.
As a different example, suppose that the number cured with the new medicine was 4,250.
This observed value does not exceed the critical point 4,261, so under Approach 1 we would
not reject the null hypothesis. Using the P-value approach (Approach 2), we would calculate
the P-value as

P(X ≥ 4,250) = P( (X − 4,200)/√672  ≥  (4,250 − 4,200)/√672 ),

and since (X − 4,200)/√672 is a Z when the null hypothesis is true, and (4,250 − 4,200)/√672 = 1.93, we obtain,
from the Z chart, a P-value of 0.0268. This is more than the Type I error of 0.01, so we
do not have enough evidence to reject the null hypothesis. That is, we do not have enough
evidence to claim that the new medicine is better than the current one. This conclusion
agrees with the one we found under Approach 1.
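The two P-values just found can be checked with a short calculation. The sketch below assumes the scipy package is available; it is not part of the hypothesis testing procedure itself.

```python
# Sketch: one-sided P-values for the medicine example, normal approximation.
# Assumes scipy is installed.
from math import sqrt
from scipy.stats import norm

mean, sd = 4200, sqrt(672)       # null hypothesis mean and standard deviation

for observed in (4272, 4250):
    p_value = norm.sf(observed, loc=mean, scale=sd)   # P(X >= observed) under the null
    print(observed, round(p_value, 4))                # about 0.0027 and 0.0269 (0.0268 from the chart, using z = 1.93)
```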
The coin example
The P-value calculation for a two-sided alternative hypothesis such as in the coin case
is more complicated than in the medicine example. Suppose for example that we obtained
5,088 heads from the 10,000 tosses. This is 88 more than the null hypothesis mean of 5,000.
The P-value is then the probability of obtaining 5,088 or more heads plus the probability
of getting 4,912 or fewer heads if the coin is fair (that is, if the null hypothesis is true),
since values 4,912 or fewer are as extreme as, or more extreme than, the observed value
5,088 for a two-sided alternative. For example, 4,906 is more extreme than 5,088 in that it
differs from the null hypothesis mean (5,000) by 94, which is more than the 88 by which 5,088 differs.
So using a normal distribution approximation, the P-value is the probability that a random
variable having a normal distribution with mean 5,000 and standard deviation 50 takes a
value 5,088 or more, plus the probability that a random variable having a normal distribu-
tion with mean 5,000 and standard deviation 50 takes a value 4,912 or fewer. Doing a Z-ing,
this is 0.0392 + 0.0392 = 0.0784. This exceeds the value chosen for α (0.05), so we do not
have enough evidence to reject the null hypothesis. This agrees with the conclusion that we
reached using the significance point approach (see Approach 1, step 5).
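The corresponding two-sided calculation for the coin example, again as a sketch assuming the scipy package:

```python
# Sketch: two-sided P-value for 5,088 heads in the coin example, normal approximation.
# Assumes scipy is installed.
from scipy.stats import norm

mean, sd, observed = 5000, 50, 5088
upper = norm.sf(observed, loc=mean, scale=sd)               # P(X >= 5,088)
lower = norm.cdf(2 * mean - observed, loc=mean, scale=sd)   # P(X <= 4,912), the mirror-image tail
print(round(upper + lower, 4))                              # about 0.0784
```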
DOES ANY OF ALL THIS SEEM FAMILIAR?
Remember the material discussed in the first few days of the class about deductions (impli-
cations) and inductions (inferences). In all of the above examples we have started with a
probability theory deduction, or implication. A deduction, or implication, starts with the
word "If". In all the above examples, and in all hypothesis testing procedures, this is "If the
null hypothesis is true, then...". The calculation that followed is a probability theory zig.
This probability theory calculation was followed by a statistical inference, or induction.
This is the zag corresponding to the probability theory zig. Here are the examples given
above, formatted in this way.
In the medicine example, under Approach 1: If the null hypothesis (that the proposed
new medicine has the same cure probability, 0.84, as the current medicine) is true, then the
probability that the new medicine will cure 4,261 or more people out of 5,000 is 0.01. This
is a probability theory zig.
Suppose we chose a Type I error of 1% (= 0.01) and that, when the experiment had been
conducted, we found that the new medicine cured 4,272 people. What is the corresponding
statistical inference, or induction? We would reject the null hypothesis, since the observed
number cured (4,272) is in the critical region. That is, it is greater than the value 4,261
calculated by the probability theory zig. This is the corresponding statistical zag. It
cannot be made without the probability theory zig.
In the coin example, under Approach 1: If the null hypothesis is true, then the probability
that the coin will give 4,902 or fewer heads, or 5,098 or more heads, is 0.05. This is a prob-
ability theory zig. It leads to the critical region: 4,902 or fewer, or 5,098 or more, heads.
Suppose we chose a Type I error of 5% (= 0.05) and that, when the experiment had been
conducted, we found that the number of heads was 4,922. This value is not in the critical
region, so we say we have no significant evidence that the coin is biased. This is the corre-
sponding statistical zag. It cannot be made without the probability theory zig.
Under Approach 2, (carried out via P-values), we reach the same conclusions - see below.
In the medicine example, under Approach 2: If the null hypothesis (that the proposed
new medicine has the same cure probability, 0.84, as the current medicine) is true, then the
probability that the new medicine will cure 4,272 or more people out of 5,000 is 0.0027.
When the experiment had been conducted, we found that the new medicine cured 4,272
people. From the probability theory result just calculated, the P value is 0.0027. This is
less than our chosen Type I error of 0.01, so we reject the null hypothesis. This is the cor-
responding statistical zag. It cannot be made without the probability theory zig (which
led to the P-value calculation of 0.0027).
In the coin example, under Approach 2: If the null hypothesis (that the coin is fair) is true,
the probability of getting 5,088 or more heads, or 4,912 or fewer heads, is 0.0392 + 0.0392
= 0.0784. This is a probability theory zig.
When the experiment was conducted, we found that the number of heads was 5,088. From
the probability theory result just calculated, the P value is 0.0784. This is greater than our
chosen Type I error of 0.05, so we do not have enough evidence to reject the null hypothesis.
This is the corresponding statistical zag. It cannot be made without the probability theory
zig (which led to the P-value calculation of 0.0784).
General note. All our probability calculations assume that the null hypothesis is true. This
is because we are focusing on the Type I error, which is the probability of rejecting the null
hypothesis when it is true. All our hypothesis-testing calculations will be made under the
assumption that the null hypothesis is true.
Testing for the equality of two binomial parameters
Suppose we want to test whether there is any difference between men and women in their
attitude on the pro-choice / pro-life question. If there is a difference, we have no a priori
view as to whether men are more, or are less, likely to be pro-choice than women are.
We can rephrase this question in terms of asking whether two binomial parameters are equal.
We write θ_1 as the probability that a woman is pro-choice and θ_2 as the probability that a
man is pro-choice. The null hypothesis (no difference) is

H_0: θ_1 = θ_2 = θ (unspecified).

Note that the value of θ is unspecified. That is, the null hypothesis just claims that θ_1 and
θ_2 are equal: it does not specify what their (common) numerical value is.
Since we have no a priori view as to whether men are more, or are less, likely to be pro-choice
than women are, the alternative hypothesis is two-sided:

H_1: θ_1 ≠ θ_2.
Declaring H_0 and H_1 completes step 1 of the hypothesis testing procedure.
Step 2. In step 2 we choose α, the numerical value of the Type I error. Let's choose 5%.
It is important to note that, since we want to ensure that we do indeed have a value of α
equal to 5%, in all the calculations that we do below we assume that the null
hypothesis is true.
Step 3. In this step we create a test statistic. This is a much more complicated business
than it was in the medicine and the coin examples. We have to think in advance what
the data will look like. Suppose that we plan to ask n_1 women what their attitude is on the
pro-choice / pro-life question, and that we plan to ask n_2 men what their attitude is on this
question. Write n_1 + n_2 = n, the total number of people whose attitude we will ask about.
Of the women, some number x_1 will say they are pro-choice and of the men, some number
x_2 will say they are pro-choice. We can think of the data as being arranged in a so-called
two-by-two table, as shown.
         pro-choice          pro-life                total
women    x_1                 n_1 − x_1               n_1
men      x_2                 n_2 − x_2               n_2
total    c_1 (= x_1 + x_2)   c_2 (= n − x_1 − x_2)   n (= n_1 + n_2)
A reasonable start for a test statistic is to compare x_1/n_1 with x_2/n_2. (Comparing x_1 with x_2
makes no sense if n_1 ≠ n_2.) This observation is enough for us to go back to the time before
we took the survey. At this time the number of women who will say they are pro-choice is a
random variable, which we write as X_1. Similarly the number of men who will say they are
pro-choice is a random variable, which we write as X_2. The proportion of women who will
say they are pro-choice is X_1/n_1, and this is also a random variable. Similarly the proportion
of men who will say they are pro-choice is X_2/n_2, and this is also a random variable.
Suppose now that the null hypothesis is true. Then (from one of the magic formulas) the
mean of X_1/n_1 is θ and the mean of X_2/n_2 is also θ. Therefore (from another magic
formula) the mean of X_1/n_1 − X_2/n_2 is 0.
We next have to find the variance of X_1/n_1 − X_2/n_2 when the null hypothesis is true. From
yet another magic formula, this is

θ(1 − θ)/n_1 + θ(1 − θ)/n_2.

This shows that when the null hypothesis is true,

(X_1/n_1 − X_2/n_2) / √( θ(1 − θ)/n_1 + θ(1 − θ)/n_2 )          (74)
is a Z, that is to a very close approximation has a normal distribution with mean 0 and
variance 1.
The problem that we have now is that we do not know the numerical value of θ. So we
will have to estimate it from the data. We are assuming that the null hypothesis is true,
and so we estimate θ (the probability that a person, male or female, is pro-choice) by the
proportion of people in the sample who are pro-choice, namely c_1/n. Similarly we estimate
1 − θ (the probability that a person, male or female, is pro-life) by the proportion of people
in the sample who are pro-life, namely c_2/n. We therefore replace the statistic in (74) by

(X_1/n_1 − X_2/n_2) / √( (c_1/n)(c_2/n)/n_1 + (c_1/n)(c_2/n)/n_2 )          (75)
When the null hypothesis is true, this quantity has, to a reasonable approximation, a Z
distribution - that is, a normal distribution with mean 0, variance 1.
Our eventual test statistic is calculated from the data, so it is the data analogue of the
quantity in (75), namely

(x_1/n_1 − x_2/n_2) / √( (c_1/n)(c_2/n)/n_1 + (c_1/n)(c_2/n)/n_2 )          (76)
This quantity is rather messy, and it can be simplified to

(x_1 n_2 − x_2 n_1) / √( c_1 c_2 n_1 n_2 / n )          (77)
This (after all these calculations) is our test statistic, and this then concludes Step 3.
The procedure in Steps 4 and 5 depends on whether we use Approach 1 or Approach 2. We
first consider Approach 1.
Under Approach 1, in Step 4 we ask what values of the test statistic lead us to reject the null
hypothesis. Since the alternative hypothesis is two-sided, sufficiently large negative or suffi-
ciently large positive values of the quantity in (77) will lead us to reject the null hypothesis.
How large positive or how large negative? This depends on the value we chose for α, namely
5%. It also depends on the fact that if the null hypothesis is true, the random variable in
(75) has, to a reasonable approximation, a Z distribution. Now Z charts show that the
probability that a Z is less than -1.96 or greater than +1.96 is 0.05 (= 5%). These are our
upper and lower significance points. This leads to
Step 5. We get our data and compute the numerical value of the test statistic (77). If this
value is -1.96 or less, or +1.96 or more, we will reject the null hypothesis. If the numerical
value of the test statistic (77) is between -1.96 and +1.96 we do not have enough
evidence to reject the null hypothesis.
The ONLY statistical step is the simple one, Step 5. Essentially all the other steps are
related to the probability theory part of the procedure. That is why so much emphasis is
placed in this course on probability theory.
A second example
Testing a new medicine is often done using a comparison of the new medicine with a placebo
(i.e. a harmless and unbeneficial mixture made out of (say) flour, sugar and water). Suppose
that we plan to give the proposed medicine to n_1 people and the placebo to n_2 people. We
write θ_1 as the probability that the new medicine leads to a cure and θ_2 as the probability
that the placebo leads to a cure. The null hypothesis (no difference) is

H_0: θ_1 = θ_2 = θ (unspecified).
If the null hypothesis is true the proposed new medicine is ineffective: its cure probability is
the same as that for the placebo. As in the previous example, the value of θ is unspecified.
That is, the null hypothesis just claims that θ_1 and θ_2 are equal: it does not specify what
their (common) numerical value is.
Since we are only interested in the possibility that the proposed medicine is beneficial, the
alternative hypothesis is one-sided up:

H_1: θ_1 > θ_2.
Declaring H_0 and H_1 completes step 1 of the hypothesis testing procedure.
Step 2. In step 2 we choose α, the numerical value of the Type I error. Since this is a
medical example, we choose 1%. It is important to note that, since we want
to ensure that we do indeed have a value of α equal to 1%, in all the calculations that we
do below we assume that the null hypothesis is true.
Step 3. In this step we create a test statistic. Suppose that x_1 of the n_1 people given the
proposed medicine are cured, and that x_2 of the n_2 people given the placebo are cured. We
can form the data in a table just like the one above:
                          cured               not cured               total
given proposed medicine   x_1                 n_1 − x_1               n_1
given placebo             x_2                 n_2 − x_2               n_2
total                     c_1 (= x_1 + x_2)   c_2 (= n − x_1 − x_2)   n (= n_1 + n_2)
With this interpretation of x_1, n_1, x_2, n_2, c_1, c_2 and n, the test statistic is as in (77) above.
Step 4. We now have to find what values of the test statistic lead us to reject the null hypothesis.
This is a one-sided up test, so we would reject the null hypothesis if x_1/n_1 is significantly
larger than x_2/n_2. Going back to the original form of the test statistic (76), or equivalently
(77), we see that sufficiently large positive values of the statistic (77) will lead us to reject
the null hypothesis.
How large? In other words, what is the critical point? Here we have to remember that the
Type I error is 1%.
When the null hypothesis is true, the random variable (74), on which the statistic (75) and
thus (77) is based, has approximately a normal distribution, mean 0, variance 1. Z
charts show that the probability that such a random variable exceeds 2.326 is 0.01. So 2.326
is the required critical value: we will reject the null hypothesis if the statistic (77) is 2.326
or more, and otherwise say that we do not have enough evidence to reject the null hypothesis.
Step 5 is now straightforward. We get our data, compute the numerical value of the test
statistic (77), and reject the null hypothesis if this numerical value is equal to, or exceeds,
2.326. If it is less than 2.326 we say that we do not have enough evidence to reject the null
hypothesis.
The ONLY statistical step is the simple one, Step 5. Essentially all the other steps are
related to the probability theory part of the procedure. That is why so much emphasis is
placed in this course on probability theory.
We now consider the procedure under Approach 2. Steps 1, 2 and 3 are the same as for
Approach 1, so we start by considering Step 4.
Under Approach 2, Step 4 consists of getting the data and calculating the numerical value
of the test statistic. In both the pro-choice/pro-life example and the medicine example, this
is the test statistic (77) (with the appropriate interpretation of x_1, n_1, x_2, n_2, c_1, c_2 and n
in each case - for example, in the pro-choice/pro-life case x_1 meant the number of women
who were pro-choice, while in the medicine case x_1 meant the number of people given the
proposed medicine who were cured).
Step 5 involves the calculation of the P-value corresponding to the observed numerical value
of the test statistic (77). Remember the definition of a P-value: it is the probability of get-
ting the observed value of the test statistic, or one more extreme in the direction indicated
by the alternative hypothesis, when the null hypothesis is true. If this P-value is less than or
equal to the value of α (the probability of making a Type I error) that we chose in Step 2,
we reject the null hypothesis. If the P-value is greater than the value of α that we chose in Step
2 we do not have enough evidence to reject the null hypothesis. This will be illustrated in
the numerical examples below.
Numerical example 1: the pro-choice/pro-life example.
Suppose that we ask 300 women their view on the pro-choice/pro-life question, and that
188 of these say they are pro-choice. Suppose that we also ask 200 men their view on the
pro-choice/pro-life question, and that 118 of these say they are pro-choice. The data table
now looks like this:-
pro-choice pro-life total
women 188 112 300
men 118 82 200
total 306 194 500
The numerical value of the test statistic (77) is

( (188)(200) − (118)(300) ) / √( (306)(194)(300)(200)/500 ) = 0.8243.
Remember that the alternative hypothesis in this example is two-sided and that our cho-
sen value of α was 5% (= 0.05). This meant that we would reject the null hypothesis if the
numerical value of the test statistic (77) turned out to be -1.96 or less, or +1.96 or more.
Given the numerical value 0.8243 of the test statistic, using Approach 1 we say we do not
have enough evidence to reject the null hypothesis. In more practical language, we do not
have enough evidence to claim that there is a difference between men and women on the
pro-choice/pro-life question.
Under Approach 2 we have to calculate the P-value corresponding to the observed value
0.8243. Remembering again that the alternative hypothesis in this example is two-sided, the P-
value is the probability that a Z is 0.8243 or more, or -0.8243 or less. Z charts show that
this probability is about 0.2049 + 0.2049 = 0.41. This exceeds the value 0.05 chosen for α,
so we draw the same conclusion that we drew using Approach 1: we do not have enough ev-
idence to claim that there is a difference between men and women on the pro-choice/pro-life
question.
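The statistic (77) and its two-sided P-value can be computed directly from the table. Here is a minimal sketch, assuming the scipy package is available; the variable names are mine, not from the notes.

```python
# Sketch: the two-sample z statistic (77) and its two-sided P-value
# for the pro-choice/pro-life data. Assumes scipy is installed.
from math import sqrt
from scipy.stats import norm

x1, n1 = 188, 300          # women: number pro-choice, number asked
x2, n2 = 118, 200          # men:   number pro-choice, number asked
n = n1 + n2
c1 = x1 + x2               # column total: pro-choice
c2 = n - c1                # column total: pro-life

z = (x1 * n2 - x2 * n1) / sqrt(c1 * c2 * n1 * n2 / n)
p_value = 2 * norm.sf(abs(z))          # two-sided, since H_1 is two-sided
print(round(z, 4), round(p_value, 4))  # about 0.8243 and 0.41
```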
Numerical example 2: the medicine example.
Suppose that we give the proposed new medicine to 1,000 people and that of these, 765 are
cured. Suppose that we also gave the placebo to 800 people, of whom 550 are cured. The
data table now looks like this:-
cured not cured total
given proposed medicine 765 235 1,000
given placebo 550 250 800
total 1,315 485 1,800
The numerical value of the test statistic (77) is

( (765)(800) − (550)(1,000) ) / √( (1,315)(485)(1,000)(800)/1,800 ) = 3.68.
We start with Approach 1. Remember that the alternative hypothesis in this example is
one-sided up and that our chosen value of α was 1% (= 0.01). This meant that we would
reject the null hypothesis if the numerical value of the test statistic (77) turned out to be 2.326
or more. Since the observed value did turn out to be greater than 2.326 we reject the null
hypothesis and claim that we have significant evidence that the proposed new medicine is
effective.
Under Approach 2 we have to calculate the P-value corresponding to the observed value
3.68. Remembering again that the alternative hypothesis in this example is one-sided up, the
P-value is the probability that a Z is 3.68 or more. Z charts do not go as high as the
value 3.68, which is off the chart. The P-value is less than the P-value corresponding to
the last chart value (3.09), which is 0.001. Therefore we draw the same conclusion that we
drew using Approach 1: we reject the null hypothesis and claim that we have significant
evidence that the proposed new medicine is effective.
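For reference, the tail probability beyond 3.68 is easily found numerically (a sketch assuming the scipy package):

```python
# Sketch: one-sided P-value for z = 3.68 in the medicine/placebo example.
# Assumes scipy is installed.
from scipy.stats import norm

print(norm.sf(3.68))   # about 0.000117, far below the chosen alpha of 0.01
```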
Some notes on this.
1. The statistic (77) is sometimes written (in research papers) as z. This is because if
the null hypothesis is true, it is the observed value of a random variable having, to a close
approximation, a normal distribution with mean 0, variance 1.
2. Let's run through the various probability theory results that we have used, using the pro-
choice/pro-life case as an example. First we recognized that we were in a binomial situation.
Second we realized that we were involved with proportions. Then we had to remember the
formulas for the mean and the variance of a proportion. Then we realized that we were in-
volved with the difference between two proportions, so we had to remember the formulas for
the mean and the variance of a difference. Then we had to remember the Z-ing procedure.
Finally we had to be able to use the Z chart.
3. What are we testing in the pro-choice/pro-life example? We are not testing whether
people are more likely to be pro-choice than pro-life. We are testing for a difference between
men and women on their attitude on this question.
4. There were various approximations that we used. For example we did not know the vari-
ance θ(1 − θ)/n_1 + θ(1 − θ)/n_2 and we approximated it by the estimate (c_1/n)(c_2/n)/n_1 +
(c_1/n)(c_2/n)/n_2. Next, when the null hypothesis is true the test statistic does not exactly
have a Z distribution (a normal distribution with mean 0, variance 1). However its dis-
tribution is very close to a Z distribution so we are happy to use this approximation. On
the other hand there is a procedure, called Fisher's exact test, where no approximations at
all are made. This is quite complicated and we will not consider this procedure in this course.
5. It is arbitrary, in the data table, which row we use for women (row 1 or row 2), and also
arbitrary which column we use for pro-choice (column 1 or column 2). We also could have
used the columns for men/women and the rows for pro-choice/ pro-life.
6. It is essential that you use the original numbers, and not for example percentages, in the
data table and the resulting calculations. Consider two data tables where the numbers in
the first table are all 10 times the numbers in the second table. The percentages are the
same in both tables, but the value of z in the first table will be √10 times the value of z in
the second table.
7. For obvious reasons, tests of the kind just considered are often called two-by-two table
tests (two rows of data, two columns of data). What you are testing using the data in such
a table is association, not cause and effect. Here is an example illustrating this.
Suppose that we take a sample of teen-aged children and classify each child as eating 8
or more junk food meals per week, or eating fewer than 8 junk food meals per week (row
classification), and obese or not obese (column classification). Suppose that with the data ob-
tained we get a significant value for z. It is natural to argue that eating the junk food causes
obesity. But a fast-food company might claim that children who are by nature obese have
a desire to eat junk food, in other words that the cause and effect relation goes the other way.
8. You can think of the statistic

(x_1/n_1 − x_2/n_2) / √( (c_1/n)(c_2/n)/n_1 + (c_1/n)(c_2/n)/n_2 )          (78)
given above, and repeated here for convenience, as a signal-to-noise ratio. The numerator,
x_1/n_1 − x_2/n_2, is the signal - it is your best estimate of θ_1 − θ_2. However this signal on its own is
not enough - it has to be compared to the estimate of its standard deviation (the noise,
that is, the denominator in (78)), before you can assess whether it is significant.
9. What is the relation of this to conditional probabilities of events? Let A be the event
that a person is pro-choice and B be the event that a person is a woman. The null hypoth-
esis claims that P(A|B) = P(A). That is, the null hypothesis claims that being given the
information that a person is a woman does not change the probability that that person is
pro-choice.
10. This note, and everything from here on in this section about two-by-two table tests,
assumes that the alternative hypothesis is two-sided.
First we change the notation, and write the entries in the data table as

          Column 1   Column 2   total
Row 1     o_11       o_12       r_1
Row 2     o_21       o_22       r_2
total     c_1        c_2        n          (79)
Here r_1 and r_2 denote row totals, c_1 and c_2 denote column totals, n is the grand total, and
the o_ij notation stands for observed numbers in each of the four cells of the table. For
example, o_12 is the observed number in row 1, column 2. This new notation is flexible and
allows generalizations to tables that are bigger than two-by-two.
With this new notation, the z statistic (78) becomes

(o_11/r_1 − o_21/r_2) / √( (c_1/n)(c_2/n)/r_1 + (c_1/n)(c_2/n)/r_2 )          (80)
Algebraic manipulations show that this can be written as

z = √n (o_11 o_22 − o_21 o_12) / √( r_1 r_2 c_1 c_2 ).          (81)
This is the form that we use for z from now on.
Suppose that we chose our value of α (the numerical value for the Type I error) to be 5%.
Then we would reject the null hypothesis if z ≤ -1.96 or if z ≥ +1.96. (Remember that we
are ONLY considering two-sided tests.) This is the same as rejecting the null hypothesis if
z² ≥ (1.96)² = 3.8415. Now
z² = n (o_11 o_22 − o_21 o_12)² / ( r_1 r_2 c_1 c_2 ).          (82)
So we would reject the null hypothesis if z² ≥ 3.8415.
z² is usually called chi-square and written as χ². (More precisely it is usually called chi-
square with one degree of freedom. We will discuss degrees of freedom later.) However in
this class we use Greek symbols ONLY for parameters, and z² is not a parameter: it is the
observed value of a random variable. So we will not use the symbol χ². So we need a new
notation for chi-square. We will soon be generalizing this sort of problem to data tables
bigger than two-by-two, where the notation z² is no longer appropriate. So from now on we
will use the notation c² for z²: c is the first letter in the word chi.
Numerical example. Let's re-do the above pro-choice/pro-life example in terms of c². The
data were

        pro-choice   pro-life   total
women   188          112        300
men     118          82         200
total   306          194        500

Here c² = 500[(188)(82) − (112)(118)]²/[(300)(200)(306)(194)] = 0.6794. Since this is less
than 3.8415, we do not have enough evidence to reject the null hypothesis.
Note that 0.6794 is the square of the value 0.8243 that we calculated earlier for z: this is as
it should be, since c² is the square of z.
We will soon generalize this type of problem to data tables that are bigger than two-by-two.
To do this it is convenient to re-write c² in a different form. To do this we first form a table
of so-called expected numbers corresponding to the observed data values in the data table
(79). This table is

          Column 1   Column 2   total
Row 1     e_11       e_12       r_1
Row 2     e_21       e_22       r_2
total     c_1        c_2        n          (83)
The definition of the expected numbers e_11, e_12, e_21 and e_22 is that

e_11 = r_1 c_1/n,   e_12 = r_1 c_2/n,   e_21 = r_2 c_1/n,   e_22 = r_2 c_2/n.          (84)
What is the logic behind these definitions? This can be explained by referring to the pro-
choice/pro-life data table above. The proportion of women in the sample is 300/500 =
0.6, or 60%. If there is no difference between men and women on the pro-choice/pro-life
question, we would expect that about 60% of the 306 people who were pro-choice would be
women. Now 60% of 306 is (300)(306)/500. Using the notation of the general data table
(79), (300)(306)/500 is r_1 c_1/n, and this is e_11. So if there is no association between row and
column modes of categorization, we would expect about r_1 c_1/n observations in the upper
left-hand cell of the data table. Similar arguments lead to the interpretations of e_12, e_21 and
e_22.
The alternative (and more flexible) formula for c² is

c² = (o_11 − e_11)²/e_11 + (o_12 − e_12)²/e_12 + (o_21 − e_21)²/e_21 + (o_22 − e_22)²/e_22.          (85)
As a check on the calculations, the numerical values of the e_ij are

e_11 = 183.6,   e_12 = 116.4,   e_21 = 122.4,   e_22 = 77.6.          (86)
This leads to a calculation of c² as

c² = (188 − 183.6)²/183.6 + (112 − 116.4)²/116.4 + (118 − 122.4)²/122.4 + (82 − 77.6)²/77.6,          (87)

and this leads to the same value of c², namely 0.6794, as found by the previous formula.
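The same value of c² is obtained from a few lines of code. Here is a minimal sketch in plain Python (no package needed); the variable names are mine.

```python
# Sketch: c^2 for the 2x2 pro-choice/pro-life table, using expected numbers
# e_ij = r_i * c_j / n as in (84) and the formula (85). Plain Python only.
obs = [[188, 112],
       [118, 82]]

row_totals = [sum(row) for row in obs]          # r_1, r_2
col_totals = [sum(col) for col in zip(*obs)]    # c_1, c_2
n = sum(row_totals)

c2 = 0.0
for i in range(2):
    for j in range(2):
        e = row_totals[i] * col_totals[j] / n   # expected number e_ij
        c2 += (obs[i][j] - e) ** 2 / e

print(round(c2, 4))   # about 0.6794; the 5% critical point is 3.8415
```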
Some notes on this.
1. We use the formula (85) for c² because it generalizes to tables that are bigger than two-
by-two (see later).
2. You can think of c² as a measure of the distance between the set o_11, o_12, o_21, o_22 of
observed values and the set e_11, e_12, e_21, e_22 of expected values. If the four marginal totals
r_1, r_2, c_1 and c_2 are given, these expected values are actually the means of the numbers in
the respective cells of the table assuming that the null hypothesis is true.
3. Remember: you use c² only if the original alternative hypothesis is two-sided.
4. When you calculate each e_ij, the value that you get will usually not be a whole number.
To be sufficiently accurate in your calculations, compute each e_ij value with four decimal
place accuracy, and then report your eventual c² value to two decimal place accuracy.
5. Only sufficiently large positive values of c² are significant.
6. Continuing on from note 5, if α, the numerical value of our Type I error, had been 5%
(as assumed above) we reject the null hypothesis if c² ≥ 3.8415. What would be the critical
point of c² if the numerical value of our Type I error had been 1%? We calculated 3.8415
as the square of 1.96, where P(Z ≤ -1.96) + P(Z ≥ +1.96) = 0.05. Now Z charts show
that P(Z ≤ -2.576) + P(Z ≥ +2.576) = 0.01. Thus the critical point of c² if the numerical
value of our Type I error had been chosen to be 1% is (2.576)² = 6.6349 (to the accuracy of the charts).
7. Going on from Note 6, you have been given a chi-square chart and you will find the
number 6.6349 on it (for a numerical value 0.01 for the Type I error and one degree of
freedom). So what are degrees of freedom and what is one degree of freedom?
Suppose that the four marginal totals r_1, r_2, c_1 and c_2 are given. Then you can only freely
fill in one number in the four cells of the table. The remaining three numbers will then
automatically be defined. You only have one degree of freedom. This concept becomes
important when we move to tables that are bigger than two-by-two.
8. What is the relation of the procedure to the deductions/implications and the induc-
tions/inferences material in the first few lectures? Here it is.
Let's go back to the time before we do our experiment. Our eventual data will be in the
form of a two-by-two table. It is simplest to think first of the case where the row and column
totals were both chosen in advance. Before we get our data, the numbers in the four cells of
the table are random variables, and at this time we can imagine the following table:

          Column 1   Column 2   total
Row 1     O_11       O_12       r_1
Row 2     O_21       O_22       r_2
total     c_1        c_2        n          (88)
Here for example O_11 is the random variable describing the number that we will get for row
1, column 1. So we can think of the random quantity

C² = (O_11 − e_11)²/e_11 + (O_12 − e_12)²/e_12 + (O_21 − e_21)²/e_21 + (O_22 − e_22)²/e_22.          (89)
When the null hypothesis is true, C² is a random variable and has, to a very close ap-
proximation, the so-called chi-square distribution with one degree of freedom. This is a
trust-me result: the theory is very complicated. Tables of critical (= significance) points
of this distribution were referred to above. Now to the deductions/implications and the
inductions/inferences aspect of this.
If the null hypothesis is true, the probability that the random variable C² takes a value
greater than or equal to 3.8415 is 5%. (We were able to arrive at this number ourselves,
using many probability theory results.) This calculation is a probability theory zig.
Suppose that we do our experiment and that our data value of c² is 4.032. Given the
above probability theory deduction/implication, we can say that we have significant
evidence that the null hypothesis is not true. This is the corresponding statistical zag. It
is an easy step to take, given the probability theory deduction/implication.
9. All the above follows Approach 1. Approach 2 would require us, in Step 4, to get the data
and calculate the value of c². Step 5 would require us to find the P-value corresponding to
this number. Doing this is possible but not easy. So for this hypothesis testing procedure
the most useful approach is Approach 1, together with using charts of critical points of the
chi-square distribution.
Tables bigger than two-by-two
We developed the statistic c² above since that generalizes to the case of tables that are bigger
than two-by-two. Suppose that we have a table of data with some arbitrary number r of
rows and some arbitrary number c of columns. Again we suppose that the row and column
totals are fixed in advance of getting the data.
To fix ideas, suppose that we have four strains of mice. There are five coat colors: black,
brown, white, yellow and gray. We want to test for association between the strain of a mouse
and its coat color.
Step 1. The null hypothesis claims that there is no association and the alternative hypothesis
claims that there is some association (of an unspecified type). This concludes Step 1. There
is no other allowable null or alternative hypothesis.
In Step 2 we choose a value 5% for α.
To fix ideas, let's go straight to the data table. The data are:-
                  color
           black   brown   white   yellow   gray   Total
        1    34      41      20      23       32     150
strain  2    40      47      27      31       55     200
        3    71      77      38      51       63     300
        4    80      81      44      56       89     350
Total       225     246     129     161      239   1,000          (90)

Table 1: Two-way table data.
For Step 3 we generalize the above table and suppose that the data are:-
               column
          1      2      3      ...   c      Total
       1  o_11   o_12   o_13   ...   o_1c   r_1
row    2  o_21   o_22   o_23   ...   o_2c   r_2
       .  ...    ...    ...    ...   ...    ...
       r  o_r1   o_r2   o_r3   ...   o_rc   r_r
Total     c_1    c_2    c_3    ...   c_c    n          (91)

Table 2: General two-way table data.
We first compute the expected value corresponding to the observed value o_ij. This is e_ij,
defined by e_ij = r_i c_j/n. The general chi-square test statistic, given in terms of Sigma
notation, is

c² = Σ_{i=1}^{r} Σ_{j=1}^{c} (o_ij − e_ij)²/e_ij,          (92)

which can be regarded as a measure of the difference of the o_ij and e_ij values. (Note that
this is a generalized version of the c² given in (85).) This is the end of Step 3: the quantity in
(92) is the test statistic. Equivalently,
c² = (o_11 − e_11)²/e_11 + (o_12 − e_12)²/e_12 + ... + (o_rc − e_rc)²/e_rc.
Steps 4 and 5 depend on the approach taken. We first consider Approach 1.
Step 4. This is a very difficult step. What values of c² will lead us to reject the null
hypothesis? We have to go back to the time before we get our data. At this time we are
considering random variables, and before we do our experiment we can envision a table such
as the following, where the O_ij quantities are, at this stage, random variables:-
               column
          1      2      3      ...   c      Total
       1  O_11   O_12   O_13   ...   O_1c   r_1
row    2  O_21   O_22   O_23   ...   O_2c   r_2
       .  ...    ...    ...    ...   ...    ...
       r  O_r1   O_r2   O_r3   ...   O_rc   r_r
Total     c_1    c_2    c_3    ...   c_c    n          (93)

Table 3: Two-way table: random variables.
Here is another trust-me result. When the null hypothesis is true the random variable

C² = Σ_{i=1}^{r} Σ_{j=1}^{c} (O_ij − e_ij)²/e_ij          (94)

has an approximate chi-square distribution with ν = (r − 1)(c − 1) degrees of freedom. (Note
that this is a generalized version of the C² given in (89).) Only sufficiently large values of the
observed value c² of C² lead us to reject the null hypothesis. How large? These are found
from the chi-square distribution with ν degrees of freedom. Critical (= significance) points
of this distribution are given in chi-square charts. The actual significance point relevant to
any specific case depends on (i) the value of α chosen in Step 2, and (ii) the number of
degrees of freedom in that case.
Step 5. Get the data, calculate c² and reject the null hypothesis if the observed value of c²
is greater than or equal to the relevant significance point in the chi-square chart.
The mouse coat color example.
The first thing that we have to do is to calculate the expected values. The easiest way of
remembering how to do this is that the expected value in any cell is the corresponding row
total times the corresponding column total divided by the grand total. Thus, for example,
the expected value in the upper left-hand cell is (150)(225)/1,000 = 33.7500. (Remember
(as I have done here) to carry four decimal place accuracy in the calculations of the expected
values.) Continuing in this way we get the following table of expected values:-
                    color
           black     brown     white     yellow    gray      Total
        1  33.7500   36.9000   19.3500   24.1500   35.8500     150
strain  2  45.0000   49.2000   25.8000   32.2000   47.8000     200
        3  67.5000   73.8000   38.7000   48.3000   71.7000     300
        4  78.7500   86.1000   45.1500   56.3500   83.6500     350
Total       225       246       129       161       239      1,000          (95)

Table 4: Two-way table expected values.
We then evaluate c² as

c² = (34 − 33.7500)²/33.7500 + ... + (89 − 83.6500)²/83.6500          (96)

(20 terms in all in the sum), and this is about 5.02. This is the end of Step 4.
Step 5 is easy: since 5.02 is well less than the critical value 21.03 (12 degrees of freedom,
α = 5%), we have no reason to reject the null hypothesis.
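The whole calculation for the mouse data can be written compactly. Here is a minimal sketch in plain Python; the variable names are mine.

```python
# Sketch: the r x c chi-square statistic (92) for the mouse coat color data,
# with expected numbers e_ij = r_i * c_j / n. Plain Python only.
obs = [[34, 41, 20, 23, 32],
       [40, 47, 27, 31, 55],
       [71, 77, 38, 51, 63],
       [80, 81, 44, 56, 89]]

row_totals = [sum(row) for row in obs]
col_totals = [sum(col) for col in zip(*obs)]
n = sum(row_totals)

c2 = sum((obs[i][j] - row_totals[i] * col_totals[j] / n) ** 2
         / (row_totals[i] * col_totals[j] / n)
         for i in range(len(obs)) for j in range(len(obs[0])))

df = (len(obs) - 1) * (len(obs[0]) - 1)   # (r - 1)(c - 1) = 12 degrees of freedom
print(round(c2, 2), df)                   # about 5.02 and 12; 5% critical point 21.03
```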
All the above follows Approach 1. In Approach 2 we would, in Step 4, get the data and
calculate the numerical value of c² (5.02). In Step 5 we would then have to find the
P-value corresponding to this. This is extremely difficult and can only be done by
a computer package. Thus if you do not have such a package available you have to use
Approach 1 and the chi-square chart.
Notes on this.
1. Most of the notes for a 2 × 2 table continue to hold for general r × c tables. One exception
is this: while in a 2 × 2 table one could find a P-value by working back to the original z
statistic, in the case of tables bigger than 2 × 2 there is no corresponding z statistic and cal-
culation of a P-value is very difficult mathematically. You can only use Approach 2 (which is
based on the calculation of a P-value) if you do the analysis by a statistical computer package.
2. As a matter of terminology, 2 × 2, and more generally r × c, tables are called contingency
tables.
3. So far we have assumed that the row and column totals in a contingency table were
fixed in advance. This is often the case, and this is why we denoted these in lower case,
for example r_i and c_j. However sometimes only the row totals are fixed in advance (which
seems to be the case in the mouse example), sometimes only the column totals are fixed
in advance, and sometimes neither set of totals is fixed in advance. However, all the above
theory continues to hold even when one or both sets of marginal totals were not fixed in
advance. This is why we use lower-case letters for row and column totals: we might as well
think of them as being fixed in advance.
4. Degrees of freedom. This is an elusive concept, and the calculation of the number of de-
grees of freedom in any particular statistical analysis depends on that analysis: the number
of degrees of freedom differs from one form of analysis to another.
In an r × c contingency table analysis the number of degrees of freedom is (r − 1)(c − 1).
Thus in the mouse example the number of degrees of freedom is 3 × 4 = 12. Why is this?
In calculating the number of degrees of freedom we take the row and column totals to be
given. Thus in the mouse example we can freely choose four of the five numbers in row 1,
but not the fifth number: it must be such that the numbers in this row add up to the total
in row 1. Similarly we can freely choose four of the five numbers in rows 2 and 3, but not
the fifth number in those rows. Having done this we cannot choose any numbers in row 4
freely: the numbers in this row must be such as to lead to the given column totals. Thus in
this case we have 4 + 4 + 4 = 12 degrees of freedom. Similar arguments lead to the general
value (r − 1)(c − 1) for the number of degrees of freedom in an r × c contingency table. Note
that in a 2 × 2 table we only have one degree of freedom.
5. Finally, the details of the probability theory deduction/implication part of this are now
obscure to us, since the math is very difficult for a table bigger than 2 × 2. In the mouse
example it is, in effect, the following:-
If there is no association between strain and coat color, the probability that the eventually
computed value of c² will be greater than or equal to 21.03 is 5%. The value 21.03 is only
arrived at after a mathematically difficult probability calculation, hidden to us because of
the complexities of the math.
The corresponding statistical induction/inference is this. The observed value of c² is 5.02.
Based on the value 21.03 calculated by deductive probability theory methods, we have no
significant evidence of an association between the strain of a mouse and its coat color.
Another use of chi-square: testing for a specified probability distribution
Chi-square is like a Swiss army knife: it can be used for many quite different purposes. In
this section we consider a new form of the chi-square statistic that is quite different from the
form used in contingency tables. Never confuse this new form of chi-square discussed below
with the form used in contingency tables. In particular, do not use the statistic (92) for the
new use of chi-square described in this section.
We start with an example. Is this die fair?
This is the same as asking the following question. If X is the (random) number to turn up
on a (future) roll of the die, is the probability distribution of X given by the following:-

Possible values of X :    1     2     3     4     5     6
Probability :            1/6   1/6   1/6   1/6   1/6   1/6          (97)
We start (as always) with Step 1. Here there is no choice. The null hypothesis is (in this die
example) that the die is fair. Equivalently, the null hypothesis is that (97) is the probability
distribution of X. The alternative hypothesis is that the die is unfair, but in an unspecified
way. No other choices are possible. So Step 1 is easy: you have no choice about the null and
alternative hypotheses.
In Step 2 we choose the value of α (the probability of making a Type I error when the null
hypothesis is true). As before, we usually choose either 1% or 5%.
The choice of test statistic (Step 3), that is, what we will calculate from the eventual data in
order to either accept or reject the null hypothesis, is not so obvious. Suppose that we plan
to roll the die n times and record the number that turns up on each roll. One possibility for
the test statistic is the average of the numbers that will turn up on these n rolls. Before we
roll the die this average is a random variable. We know the mean of this average if the null
hypothesis is true (3.5) and we also know the variance of this average if the null hypothesis
is true (35/(12n)), and thus we could easily form a z statistic from the eventual data which
would allow us to test the null hypothesis.
However this is not a good test statistic, as the following example shows. Suppose that we
roll the die 10,000 times, and that a 1 turns up 5,000 times and a 6 turns up 5,000
times. No other number turns up. Clearly this is almost certainly a biased die. Yet the
average of the numbers that turned up is 3.5, which is the mean of the (random variable)
average to turn up when the null hypothesis is true. If we then used the average of the
numbers that did turn up after we had rolled the die n times as our test statistic, we would
not reject the null hypothesis. This is clearly unreasonable. So we need a better test statistic.
This observation shows that the choice of a test statistic is not a straightforward matter.
Some test statistics will be better than others. There is a deep mathematical theory which
leads to the choice of a best test statistic in any given testing procedure. We do not con-
sider this theory here. Instead we go straight to what this theory shows is the best test
statistic in the die example, namely a chi-square statistic. To repeat a comment made
earlier: this chi-square statistic (see details below) is quite different from the chi-square
statistic used in contingency tables. Here are the details of this new chi-square statistic.
The chi-square statistic for any testing for a specified probability distribution situation is
similar to that in a contingency table in that it considers a test statistic of the form sum
over all possibilities of (observed − expected)²/expected.
However the details, in particular the details about how we calculate expected numbers,
differ from those in a contingency table. To see how this works in the current context we
first consider the die example.
Remember that by expected value we mean, loosely, more or less what we would expect
to see if the null hypothesis is true. So if we plan to roll the die n times, what would these
expected values be? We would expect to see a 1 turn up about n/6 times, a 2 turn up about
n/6 times and so on. In fact these are exact means under the null hypothesis. If we consider
a 1 turning up as a success, then the binomial formula for the mean number of times a 1
will turn up is exactly n/6. The same holds for all six numbers, 1 through 6.
Suppose now that we have rolled the die n times, and that a 1 turned up n_1 times, a 2 turned
up n_2 times, ...., a 6 turned up n_6 times. Then the appropriate test statistic is

c² = Σ_{i=1}^{6} (n_i − n/6)²/(n/6).          (98)
This concludes Step 3: in the die case this is the correct test statistic. (The appropriate test
statistic in more general cases will be discussed later.)
Steps 4 and 5 depend on whether we use Approach 1 or Approach 2. We consider Approach
1 first.
Step 4. What values of the test statistic (98) lead us to reject the null hypothesis (that
the die is fair)? First, this test statistic (and all chi-square statistics) cannot be negative,
and for all chi-square statistics sufficiently large positive values lead us to reject the null
hypothesis. In the die case, how large? First, this depends on the value of α chosen in Step
2. For concreteness, suppose that we chose α = 0.05 = 5%.
To proceed further, we have to go back to the time before we roll the die. At this time the
numbers N_1, N_2, ..., N_6 of times that a 1, a 2, ...., a 6 will turn up are random variables. Thus the
quantity

C² = Σ_{i=1}^{6} (N_i − n/6)²/(n/6)          (99)

is a random variable. To a very close approximation it has a chi-square distribution when the
null hypothesis is true. (This has been anticipated in the notations c² and C².) How many
degrees of freedom does it have? We have to ask how many of the numbers n_1, n_2, ..., n_6
can be chosen freely, given that the number of rolls (n) is fixed in advance. The answer is five.
Once (for example) n_1, n_2, ..., n_5 are given, n_6 is automatically determined. Note that this
calculation of degrees of freedom is quite different from that arising in contingency tables.
This implies that we will reject the null hypothesis (with α = 0.05) if c² ≥ A, where A is
chosen so that P(C² ≥ A when the null hypothesis is true) = 0.05. The calculation of A is
quite complicated, and the relevant value is given in the chi-square chart. Since α = 0.05
and we have five degrees of freedom, the chart shows that A = 11.0705.
Step 5 is now easy. We get the data, compute the value of c², and reject the null hypothe-
sis if the value obtained is 11.0705 or larger. This concludes the procedure under Approach 1.
Under Approach 2, Step 4 is to get the data and compute the value of c² (the first part of
Step 5 under Approach 1). Step 5 is to find the P-value associated with the value of c² that
we found. This is very difficult and can only be done via a statistical computer package.
Thus Approach 2 is feasible only if you have access to such a package.
Notes. Before doing a numerical example we make some notes.
1. The expected numbers (n/6) for the die example are all the same. This is not usually the
case: see a later example.
2. The expected numbers in the die example (and in general) are not necessarily whole
numbers. Calculate them to four decimal place accuracy, and present your eventual c
2
value
to two decimal place accuracy.
3. Although the statistic (98) is different from that in a contingency table, it does share the
same characteristic of the contingency table chi-square statistic in that it can be thought of
as a measure of the difference between, or distance between, the various observed numbers
and the numbers expected (in fact the mean numbers) if the null hypothesis is true. The
larger these differences are the larger c² is, and if these differences are large enough c² will
be large enough for us to reject the null hypothesis.
Numerical example
We have a die and want to test if it is fair. The null hypothesis is that it is fair and the
alternative hypothesis is that it is unfair in some unspecified way. Suppose that we choose a
value α = 0.05, that is, we choose the probability of making a false positive claim (a Type I
error) to be 0.05, or 5%. Going straight to Step 5, suppose that we roll the die 5,000 times
and get the following data:-

Number turning up:       1     2     3     4     5     6
Number of times seen:   861   812   820   865   821   821          (100)
Each expected number is 5000/6 = 833.3333 (to four decimal place accuracy). The value of
c² is thus

(861 − 833.3333)²/833.3333 + (812 − 833.3333)²/833.3333 + (820 − 833.3333)²/833.3333
+ (865 − 833.3333)²/833.3333 + (821 − 833.3333)²/833.3333 + (821 − 833.3333)²/833.3333,          (101)
and this is 3.2464, or to two decimal place accuracy, 3.25. The 5% critical point is 11.0705,
so we do not have enough evidence to claim that the die is unfair. Thinking of this another
way, a fair die can quite easily give an array of observed numbers such as that above.
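The arithmetic in (101) is easy to reproduce. Here is a minimal sketch in plain Python; the variable names are mine.

```python
# Sketch: the goodness-of-fit chi-square (98) for the die data. Plain Python only.
counts = [861, 812, 820, 865, 821, 821]
n = sum(counts)                  # 5,000 rolls
expected = n / 6                 # 833.3333 under the null hypothesis of a fair die

c2 = sum((observed - expected) ** 2 / expected for observed in counts)
print(round(c2, 2))   # about 3.25, with 5 degrees of freedom; 5% critical point 11.0705
```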
Another example
Before considering the general case we consider a second example. The seeds of a certain
plant are either yellow and smooth (YS), yellow and wrinkled (YW), green and smooth
(GS) or green and wrinkled (GW). A simple genetic theory claims that in a given system
of crossing (mating), the seed color and smoothness of any plant will be YS with probability
9/16, YW with probability 3/16, GS with probability 3/16 or GW with probability 1/16.
We wish to test this theory. (By way of background, this theory claims that the color of the
seed is determined by the genes at one genetic locus, with the probability that these genes
lead to yellow seeds being 3/4 and lead to green seeds with probability 1/4, and that the
smoothness of the seed is determined independently by the genes at a different genetic locus,
with the probability that these genes lead to smooth seeds being 3/4 and lead to wrinkled
seeds with probability 1/4. There are reasons why this theory might not be correct: if for
example the color and smoothness gene loci are on the same chromosome, the independence
assumption is not correct.)
Step 1. The null hypothesis is that the theory is correct, and the alternative hypothesis is
that it is incorrect, but in an unspecified way. These are the only allowable null and alter-
native hypotheses.
Step 2. In this step we choose the value of α: let's choose 5%.
Step 3. The test statistic will be of the chi-square form, but the details will dier from
those in the die example. Suppose that we observe the seeds of n plants. Then if the null
hypothesis is true, the expected number of plants with YS seeds is (9n)/16, the expected
number of plants with YW seeds is (3n)/16, the expected number of plants with GS seeds
is (3n)/16, the expected number of plants with GW seeds is n/16. These expected numbers
(unlike those in the die example) are not all equal, and this is usually the case in chi-square
procedures of this kind. If we eventually see, in our sample of n plants, n
1
YS plants, n
2
YW
plants, n
3
GS plants, and n
4
GW plants, the test statistic will be of the standard (chi-square)
form for this kind of test, namely
c² = (n₁ − 9n/16)²/(9n/16) + (n₂ − 3n/16)²/(3n/16) + (n₃ − 3n/16)²/(3n/16) + (n₄ − n/16)²/(n/16).   (102)
Steps 4 and 5 depend on whether we use Approach 1 or Approach 2. We consider Approach
1 first. Under Approach 1 we have to determine the critical value of c² that will lead us
to reject the null hypothesis. The probability theory zig that determines this is that c²
is the observed value of a random variable which, when the null hypothesis is true, has (to
a very close approximation) a chi-square distribution with three degrees of freedom. Why
three? Because if n, the sample size, is given, only three of the numbers n₁, n₂, n₃ and n₄
can be freely chosen.
Chi-square charts of critical points then show that the null hypothesis will be rejected if
c² ≥ 7.8147.
As always, Step 5 is now straightforward. We get the data (that is, the values of n₁, n₂, n₃, n₄
and n), compute the value of c², and reject the null hypothesis if c² ≥ 7.8147. If c² < 7.8147
we say we do not have enough evidence to reject the null hypothesis.
Approach 2. Under Approach 2 we get the data and compute the value of c² (Step 4). In Step 5 we
find the P-value associated with this value of c². This is very difficult and can only be done
with a computer statistical package. If the P-value is 0.05 or less we reject the null
hypothesis. If the P-value is greater than 0.05 we say we do not have enough evidence to
reject the null hypothesis.
Numerical example (This example describes the classic experiment by Mendel that led to the
discovery of genes, chromosomes, and eventually to genomics, so crucial in today's scientific
research.) The situation (yellow or green, smooth or wrinkled) is as above, and as above we
choose α = 0.05. In this example n = 250, so that the four expected numbers are 9(250)/16
= 140.6250 for YS, 3(250)/16 = 46.8750 for YW, 3(250)/16 = 46.8750 for GS, and 250/16 =
15.6250 for GW. The observed numbers were n₁ = 152, n₂ = 39, n₃ = 53 and n₄ = 6. Thus,
following (102),
c² = (152 − 140.6250)²/140.6250 + (39 − 46.8750)²/46.8750 + (53 − 46.8750)²/46.8750 + (6 − 15.6250)²/15.6250 = 8.97.   (103)
This value exceeds 7.8147, and so we reject the null hypothesis. (This turned out to be the
correct conclusion: the assumption of independence was not correct. The genes for color
and smoothness are on the same chromosome. Finding this out was a definite step forward
in genetics, in particular in establishing the fact that genes lie on chromosomes.)
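The same kind of check can be done for this example. Again this is only a sketch, assuming Python with SciPy; the course itself uses JMP:

# A minimal sketch of the seed-color calculation in (102)/(103).
from scipy import stats

n = 250
observed = [152, 39, 53, 6]                              # YS, YW, GS, GW counts
expected = [9 * n / 16, 3 * n / 16, 3 * n / 16, n / 16]  # 140.625, 46.875, 46.875, 15.625

chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2_stat)   # about 8.97, which exceeds the 5% critical point 7.8147
print(p_value)     # below 0.05, so the null hypothesis is rejected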
The general case
The die and the genetics examples should indicate how this test is done in the general case.
We have k categories, which we call categories 1, 2, ...., k. For example, in the genetics case,
category 1 is yellow and smooth.
Suppose that we plan to take n independent observations. The null hypothesis states that
the probability that any of these observations will fall in category 1 is P₀(1), the probability
that an observation is in category 2 is P₀(2), . . . , the probability that an observation is in
category k is P₀(k). (The suffix 0 denotes the null hypothesis.) These probabilities are given
numerical values. That is, they do not involve parameters.
Suppose that we have now done our experiment, and that n₁ observations did fall in category
1, n₂ did fall in category 2, . . . , nₖ did fall in category k.
Let's go through the five hypothesis testing steps.
Step 1. The null hypothesis is as given above, stating the respective probabilities for an
observation to fall in the various categories. The alternative hypothesis is that these are not
the correct probabilities, but the alternative hypothesis does not specify in what way they
are incorrect.
Step 2. Choose the value of α (typically 1% or 5%).
Step 3. The test statistic is the general form of that used in the die and the genetics
examples, namely
c² = (n₁ − nP₀(1))²/(nP₀(1)) + . . . + (nₖ − nP₀(k))²/(nP₀(k)).   (104)
Approach 1, Step 4. What values of c² lead to rejection of the null hypothesis? Sufficiently
large (positive) values. How large? This depends on the value of α chosen in Step 2, and the
number of degrees of freedom. So, how many degrees of freedom do we have? The answer is
k − 1. NOTE: not n − 1. The chart of critical points of chi-square will now show how large c²
has to be before we reject the null hypothesis.
Approach 1, Step 5. Get the data, calculate c², and if the value found is greater than or
equal to the appropriate chart value, reject the null hypothesis. If the value found is less
than the appropriate chart value, we do not have enough evidence to reject the null hypothesis.
Approach 2, Step 4. Get the data and calculate c².
Approach 2, Step 5. Find the P-value associated with the observed value of c². This can
only be done by a computer statistical package. If the P-value is less than or equal to α we
reject the null hypothesis. If the P-value is greater than α we do not have enough evidence
to reject the null hypothesis.
Notes on this.
1. Most of the notes for r-by-c tables also apply for this form of chi-square. In particular,
use the actual counts and not percentages.
2. Remember that the number of degrees of freedom is k − 1, not n − 1. Remember also
that this is a different use of chi-square from the use in r-by-c tables.
3. Most of the time the categories are descriptive, such as "green and smooth". In the die
example we could follow this and say that category 1 is "a 1 turns up", that category 2 is
"a 2 turns up", etc.
4. Remember that this "is this the correct distribution?" use of chi-square is completely
different from the use of chi-square in an r-by-c table.
A problem for you.
We conduct 1,000 binomial trials. The probability of success on each trial is θ. The null
hypothesis claims that θ = 0.3. We actually saw 321 (= n₁) successes (and thus n₂ = 679
failures) once the trials were conducted.
1. Focus only on the number of successes. Calculate a z statistic, using the observed number
of successes (321) and the null hypothesis mean and variance of the number of successes in
1,000 trials. From this calculate z². Do this exactly.
2. Focus only on the proportion of trials giving success. Calculate a z statistic, using the
observed value of this proportion (321/1,000) and the null hypothesis mean and variance of
this proportion in 1,000 trials. From this calculate z². Do this exactly.
3. We see 321 successes and 679 failures in 1,000 trials. Calculate a c² statistic, using
equation (104). (This will be the sum of two terms, one relating to successes, one relating to
failures.)
Which was the correct calculation to make in order to test the null hypothesis, the one
calculated as in 1, the one calculated as in 2, or the one calculated as in 3? (Your numerical
calculations should tell you the answer to this question.)
Here are the answers to the problem above.
1. z = (321 − 300)/√(1000(.3)(.7)) = 21/√210, so that z² = 441/210 = 2.1.
2. z = (.321 − .3)/√((.3)(.7)/1,000) = .021/√.00021, so that z² = .000441/.00021 = 2.1.
3. c² = (321 − 300)²/300 + (679 − 700)²/700 = 441/300 + 441/700 = 1.47 + 0.63 = 2.1.
Thus all three methods give the same answer. They are all correct. Note however that
methods 1 and 2 rely on binomial calculations, which are OK here since there are only two
categories (success and failure). The c² method extends to any number of categories (such
as 6 in the die example and 4 in the genetics example).
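A minimal sketch, assuming Python with SciPy, that confirms the three calculations agree:

# Checking that the z-squared and chi-square calculations above all give 2.1.
import math
from scipy import stats

n, theta0 = 1000, 0.3
successes, failures = 321, 679

# Method 1: z from the number of successes
z1 = (successes - n * theta0) / math.sqrt(n * theta0 * (1 - theta0))
# Method 2: z from the proportion of successes
z2 = (successes / n - theta0) / math.sqrt(theta0 * (1 - theta0) / n)
# Method 3: chi-square with two categories, as in equation (104)
c2, _ = stats.chisquare(f_obs=[successes, failures], f_exp=[n * theta0, n * (1 - theta0)])

print(z1 ** 2, z2 ** 2, c2)   # all three are 2.1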
A more complicated situation
In all the above examples the probabilities for the k categories as given by the null hypothesis
were all numbers, for example 1/6, 1/6, . . . , 1/6 in the die example and 9/16, 3/16, 3/16, 1/16
in the genetics example. In some more complex cases the probabilities are given in terms of
a parameter or several parameters. The procedure in such cases is as follows:-
(i) Estimate the parameters from the data. (There is advanced statistical theory that shows
you how to do this. You cannot be expected to know this theory, so you will always be told
how this estimation is to be done.)
(ii) Calculate c² in the normal way, but with the parameter(s) replaced by the parameter
estimate(s).
(iii) You lose one further degree of freedom in the chi-square for every parameter that you
estimate. (So if there are k categories and you estimate m parameters, the number of degrees
of freedom is k − m − 1.)
Example (again from genetics).
Under the null hypothesis, individuals in a certain population are either of genetic type aa
(with probability (1 − θ)²), of genetic type Aa (with probability 2θ(1 − θ)), or of genetic
type AA (with probability θ²). Here θ is a parameter whose numerical value is not specified.
(You can see the binomial distribution is relevant here.) The alternative hypothesis is that
the null hypothesis is not true in an unspecified way.
Given that in our data we have n₁ individuals of type aa, n₂ individuals of type Aa, and n₃
individuals of type AA, with n₁ + n₂ + n₃ = n, how do we test the null hypothesis against
the alternative hypothesis? That is, what is our test statistic and what values of the test
statistic will lead us to reject the null hypothesis?
Suppose first (and unrealistically) that the numerical value of θ is given. Then we would
form the chi-square statistic
(n₁ − n(1 − θ)²)²/(n(1 − θ)²) + (n₂ − 2nθ(1 − θ))²/(2nθ(1 − θ)) + (n₃ − nθ²)²/(nθ²).   (105)
We would test the null hypothesis using chi-square charts with two degrees of freedom.
However, of course, we do not know the numerical value of θ. We will have to estimate it
from the data. So here is a trust me result: the estimate of θ is θ̂ = (n₂ + 2n₃)/(2n). So
we form the appropriate c² statistic by replacing θ wherever we see it in (105) by θ̂. This
gives the test statistic
c² = (n₁ − n(1 − θ̂)²)²/(n(1 − θ̂)²) + (n₂ − 2nθ̂(1 − θ̂))²/(2nθ̂(1 − θ̂)) + (n₃ − nθ̂²)²/(nθ̂²).   (106)
We refer the value so calculated to chi-square charts with 3 − 1 − 1 = 1 degree of freedom (since
we always lose one degree of freedom, and we lose an extra one here because we estimated
one parameter).
Numerical example.
Suppose that n₁ = 466, n₂ = 444 and n₃ = 90, so that n = 1,000. Then θ̂ = (444 +
180)/2,000 = 0.312. Then c² is calculated as
c² = (466 − 1,000(0.688)²)²/(1,000(0.688)²) + (444 − 2,000(0.312)(0.688))²/(2,000(0.312)(0.688)) + (90 − 1,000(0.312)²)²/(1,000(0.312)²),   (107)
and the numerical value of this is
(466 − 473.344)²/473.344 + (444 − 429.312)²/429.312 + (90 − 97.344)²/97.344,
which is about 1.17. Let's assume that we had chosen α = 0.05 in Step 2. Then the relevant
critical point value from the chi-square chart is 3.8415. Since 1.17 is less than this, we do
not have enough evidence to reject the null hypothesis.
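Here is a minimal sketch of the same calculation, assuming Python with SciPy. The ddof argument is used to remove the extra degree of freedom for the estimated parameter θ:

# Estimated-parameter chi-square for the genetic-type example.
from scipy import stats

n1, n2, n3 = 466, 444, 90
n = n1 + n2 + n3
theta_hat = (n2 + 2 * n3) / (2 * n)              # 0.312

expected = [n * (1 - theta_hat) ** 2,            # 473.344
            2 * n * theta_hat * (1 - theta_hat), # 429.312
            n * theta_hat ** 2]                  # 97.344

# ddof=1 removes one extra degree of freedom for the estimated parameter,
# giving 3 - 1 - 1 = 1 degree of freedom for the P-value.
chi2_stat, p_value = stats.chisquare(f_obs=[n1, n2, n3], f_exp=expected, ddof=1)
print(chi2_stat)   # about 1.17, below the 5% critical point 3.8415
print(p_value)     # above 0.05, so the null hypothesis is not rejected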
A final thought about P-values
As indicated often in class, if you want an exact P-value in a chi-square test with more than
one degree of freedom, you will have to use a statistical computer package (Approach 2).
However even with Approach 1, where you use a chart of critical points of chi-square, you
can put bounds on the P-value. Here is how it works.
Consider for example the case where the chi-square has 10 degrees of freedom. For α = 0.05
the critical value is 18.3070 and for α = 0.01 the critical value is 23.2093. What this means is
that if C² is a random variable having a chi-square distribution with 10 degrees of freedom,
P(C² ≥ 18.3070) = 0.05 and P(C² ≥ 23.2093) = 0.01. So we know then, for example, that
P(C² ≥ 21.3456) is somewhere between 0.01 and 0.05, but we do not know exactly where it
is in that range.
Suppose now that you have your data in a chi-square test with 10 degrees of freedom and
have computed c², and that numerically the c² value is 21.3456. What can be said about the
P-value corresponding to this? Remember what a P-value is: it is the probability of getting
the observed value of your test statistic (here, 21.3456) or one more extreme in the direction
of the alternative hypothesis (here greater than or equal to 21.3456) when the null hypothesis
is true (here, that C² does have a chi-square distribution with 10 degrees of freedom). From
the above, the P-value must be between 0.01 and 0.05. That is all you can say about the
numerical value of the P-value. Thus if a researcher used Approach 1 (in which an exact
P-value cannot be obtained), he or she will use the chi-square chart and say something like:
χ² = 21.3456, 0.01 < P < 0.05. You will often see this sort of statement in research papers.
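A statistical package finds the exact P-value from the chi-square distribution itself. A minimal sketch, assuming Python with SciPy:

# Exact P-value for an observed chi-square value of 21.3456 with 10 degrees of freedom.
from scipy import stats

p_value = stats.chi2.sf(21.3456, df=10)   # P(C^2 >= 21.3456) for 10 degrees of freedom
print(p_value)                            # lies between 0.01 and 0.05, as argued above

# The chart values can be recovered the same way:
print(stats.chi2.isf(0.05, df=10))        # 18.3070
print(stats.chi2.isf(0.01, df=10))        # 23.2093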
Tests on means
Perhaps the most frequently used tests in Statistics concern tests on means. Tests on means
(in this course) are carried out using so-called t tests. (So-called ANOVA tests, which also
are tests on means but for data more complicated than that which we consider in this course,
are discussed in STAT 112.) We shall discuss four different t tests in this course.
The one-sample t test.
It is easiest to start with an example. Different people have different body temperatures,
due to differences in physiology, age, and so on. That is, the temperature of a person taken
at random is a (continuous) random variable. So there is a probability distribution of tem-
peratures. From considerable past experience, suppose that we know that the mean of this
distribution is 98.6 for normal healthy individuals.
We are concerned that the mean temperature for people 24 hours after an operation ex-
ceeds 98.6. To investigate this we plan to get some data from n people 24 hours after they
have had this operation. At this stage these temperatures are random variables: we do not
know what values they will take. So we denote them in upper case: X₁ for the first person
in the sample, X₂ for the second person in the sample, . . . , Xₙ for the nth person in the
sample. X₁, X₂, . . . , Xₙ are iid random variables each having some probability distribution
with unknown mean, which we denote by μ. Under the null hypothesis (that the mean body
temperature of people 24 hours after the operation is the same as that for normal healthy
people), μ = 98.6. From the context of the situation, the natural alternative hypothesis is
one-sided up, that is that μ is greater than 98.6. This concludes Step 1 in the hypothesis
testing procedure: we have declared our null and alternative hypotheses.
Step 2 consists of choosing the false positive rate α, the probability of claiming that μ is
greater than 98.6 when it is in fact equal to 98.6. In other words, this is the probability
we allow ourselves for rejecting the null hypothesis when it is true. Since this is a medical
situation, let's choose α = 0.01.
Step 3 consists of choosing the test statistic. Looking ahead to our actual data x₁, x₂, . . . , xₙ,
it is clear that a key component in the test statistic will be a comparison of the average of
these values (x̄) with the null hypothesis mean 98.6. That is, a key component in the test
statistic will be the difference x̄ − 98.6.
(This is the comparison of an average with a mean. This then is why it has been emphasized
so frequently in this course that an average and a mean are two entirely different concepts.
To use these two words as meaning the same thing makes it almost impossible to understand
the testing procedure that we are developing.)
However, the difference x̄ − 98.6 is not enough. To see why this is so, we consider first the
quite unrealistic case where we know the variance σ² of the probability distribution of tem-
peratures of people 24 hours after they have had this operation. (We consider the realistic
case, where we do not know this variance, later.)
Now let's go on a probability theory zig. Consider the situation before the data were
obtained. At this stage the average X̄ is a random variable. If the null hypothesis is true, it
has mean 98.6 (the mean of each individual X, by one of the magic formulas), and variance
σ²/n (by another magic formula). If we also assume that the Xs have a normal distribution,
then
(X̄ − 98.6)/(σ/√n) = (X̄ − 98.6)√n/σ
is a Z if the null hypothesis is true. So once we have our data we can compute
(x̄ − 98.6)√n/σ   (108)
and reject the null hypothesis if the value of this is 2.326 or more. (The value 2.326 comes
from the Z chart and the fact that α = 0.01: make sure that you can work out where 2.326
came from, using the Z chart.)
The above procedure is very suggestive to us for the far more realistic case where we do not
know the variance σ² of the probability distribution of temperatures of people 24 hours after
they have had this operation. However we know how to estimate a variance, by
s² = (x₁² + x₂² + . . . + xₙ² − n(x̄)²)/(n − 1).
So a sensible thing to do is to replace the test statistic (108) by t, defined as
t = (x̄ − 98.6)√n/s.   (109)
This is our test statistic for the temperatures example. This concludes Step 3 of the test-
ing procedure in the temperatures example.
More generally, if the null hypothesis claims that the mean of the temperature of a person
24 hours after the operation is μ₀, the test statistic is
t = (x̄ − μ₀)√n/s.   (110)
This is the general form of the test statistic t in the one-sample case.
Approach 1, Step 4. What values of t will lead us to reject the null hypothesis? Because
of the nature of the alternative hypothesis in the temperatures example, sufficiently large
positive values. This is a one-sided up example. How large? To answer this question we
go back to the situation before we get our data. The random variable analogue of t is the
random variable T, defined (in parallel with (109)) as
T = (X̄ − 98.6)√n/S,   (111)
where
S² = (X₁² + X₂² + . . . + Xₙ² − n(X̄)²)/(n − 1).
So T is very similar to a Z. The only difference is that T has the estimator of a standard
deviation in the denominator instead of a known standard deviation. Because of this dif-
ference T does not have the Z distribution. Instead it has the so-called t distribution with
n − 1 degrees of freedom. (Why n − 1? That will be discussed later.) This is just one more
well-known and well-studied probability distribution of central importance in Statistics (like
the chi-square distribution). Again like the chi-square distribution, the mathematical form
is not given here. So there will be several trust me results coming along soon.
The form of the t distribution tells us how large t has to be in the temperatures example
for us to reject the null hypothesis. Charts indicating this will be given out in class. For
example, if n = 10 (so that you have 9 degrees of freedom) and α = 0.01, we would reject
the null hypothesis if t ≥ 2.821.
Note. The t chart that you will be given is designed for one-sided up tests (such as the
temperature example). If the alternative hypothesis had been that μ (the mean temperature
for people 24 hours after the operation) is less than 98.6, then with n = 10 and α = 0.01
we would reject the null hypothesis if t ≤ −2.821. If the alternative hypothesis had been
μ ≠ 98.6 then with n = 10 and α = 0.01 we would reject the null hypothesis if t ≥ 3.250 or
if t ≤ −3.250.
Approach 1, Step 5. Get the data, compute t, compare the value with the relevant chart
value, and accept or reject the null hypothesis accordingly.
Approach 2, Step 4. Get the data and calculate t.
Approach 2, Step 5. Find the P-value associated with the observed value of t. This can
only be done with a computer package. An example using JMP will be discussed in class.
Numerical example - the temperature case
Suppose that we get the temperatures of n = 10 people 24 hours after they have the opera-
tion. These values were 98.9, 98.6, 99.3, 98.7, 98.7, 98.4, 99.0, 98.5, 98.8, 98.8. The average
of these is
(98.9 + 98.6 + 99.3 + 98.7 + 98.7 + 98.4 + 99.0 + 98.5 + 98.8 + 98.8)/10 = 98.77.
Next we have to calculate s². This is
s² = (98.9² + 98.6² + 99.3² + 98.7² + 98.7² + 98.4² + 99.0² + 98.5² + 98.8² + 98.8² − 10(98.77)²)/9,
and this is 0.066778. From this, s = 0.258414. Thus the calculated value of t is
t = (98.77 − 98.6)√10/0.258414 = 2.0803.
Since there are 9 degrees of freedom and we chose α = 0.01, the critical value (from the
t chart) is 2.821 (as given above). Our observed value is less than this, so we do not have
enough evidence to reject the null hypothesis. In other words, we do not have enough evi-
dence to claim that there is a significant increase in temperature 24 hours after the operation.
Approach 2, Step 4. Get the data and calculate t. As above, we get 2.0803.
Approach 2, Step 5. Use a statistical package to find the P-value corresponding to t = 2.0803.
You will be given a JMP handout doing this. The P-value shown is 0.0336. Since this is not
less than or equal to 0.01, we draw the same conclusion as we did under Approach 1: we
do not have enough evidence to claim that there is a significant increase in temperature 24
hours after the operation.
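A minimal sketch of this test, assuming Python with a reasonably recent SciPy (the alternative keyword needs SciPy 1.6 or later); the course itself uses JMP:

# One-sample, one-sided-up t test for the temperature data.
from scipy import stats

temps = [98.9, 98.6, 99.3, 98.7, 98.7, 98.4, 99.0, 98.5, 98.8, 98.8]

t_stat, p_value = stats.ttest_1samp(temps, popmean=98.6, alternative='greater')
print(t_stat)    # 2.0803
print(p_value)   # 0.0336, not small enough when alpha = 0.01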
Notes on the one-sample t test.
1. Degrees of freedom. In a one-sample t test the number of degrees of freedom is one less
than the number of data values (that is, degrees of freedom = n − 1). Trust me.
2. Think of t as a signal to noise ratio. The signal is the numerator quantity (x̄ − μ₀), telling
you the difference of the average of the data values from the null hypothesis mean, together
with the square root of the sample size √n. The noise is s. (An example will be given later
illustrating the importance of the noise.)
3. The only difference between a one-sample t statistic and a z statistic is that t has a
standard deviation estimate in the denominator (s) and a z has a known standard deviation
(σ).
4. The t chart that you will be given is designed for one-sided up tests. It can still be used
for one-sided down and two-sided tests - see examples later.
5. The critical points in a t chart are calculated under the assumption that the random
variables X₁, X₂, . . . , Xₙ have a normal distribution.
6. What do you do if you are not prepared to assume that X₁, X₂, . . . , Xₙ have a normal
distribution? You use a non-parametric (= distribution-free) test. See later.
7. The general form of a one-sample t statistic is t = (x̄ − μ₀)√n/s. Here μ₀ is the null hypothesis
mean.
8. The t test is scale-free. If one person measures temperature in degrees Fahrenheit and
uses a null hypothesis mean of 98.6, and another measures the same temperatures in degrees
centigrade and uses the equivalent null hypothesis mean of 37.0, they would get the same
numerical value for t. (It would be a disaster if this did not happen.)
9. More on degrees of freedom. Remember that if in the operation example we knew
the variance σ² of the temperatures of people 24 hours after the operation, and if we chose
α = 0.01, we would reject the null hypothesis if z ≥ 2.326. How many degrees of freedom
does 2.326 correspond to in a t test? Infinity.
10. Suppose that in the temperatures example the alternative hypothesis had been that
the temperature of a person 24 hours after the operation tends to be lower than 98.6. Then the
test would be one-sided down, and with α = 0.01 and n = 10 as before, we would reject the
null hypothesis if t ≤ −2.821. If the alternative hypothesis had been that the temperature
of a person 24 hours after the operation differs from 98.6, but in an unspecified direction, then
the test would be two-sided, and with α = 0.01 and n = 10 as before, we would reject the
null hypothesis if t ≤ −3.250 or if t ≥ +3.250.
11. Under Approach 1 we do not attempt to compute a P-value. However we can usually
put a bound on the P-value. For example, in the original one-sided up temperature t
test, suppose that (again with n = 10) we had obtained (as above) a value of t of 2.0803.
What bounds can we put on the P-value from the values in the t chart? Remember: the
P-value in this example is the probability of getting a t value of 2.0803 or more when the null
hypothesis is true. The t chart shows that when the null hypothesis is true, the probability
of getting a value of t greater than or equal to 1.833 is 0.05 and the probability of getting a
value of t greater than or equal to 2.821 is 0.01. Thus the probability of getting a value of t
greater than or equal to the observed value 2.0803 when the null hypothesis is true must be
between 0.01 and 0.05. And (as we saw from the computer output using Approach 2) it is
0.0336, which is indeed between 0.01 and 0.05.
You will often see a result such as this written as: t = 2.0803, 0.01 < P < 0.05.
12. The t chart and the two-standard-deviation rule. If the true mean of X₁, X₂, . . . , Xₙ is
μ, then t = (x̄ − μ)√n/s has the t distribution with n − 1 degrees of freedom. Suppose for
example that n = 21, so that you have 20 degrees of freedom. Then the critical points in
the t chart show that P(−2.086 ≤ t ≤ +2.086) = 0.95. Some algebraic manipulations then
show that
P(x̄ − 2.086s/√n ≤ μ ≤ x̄ + 2.086s/√n) = 0.95.
Thus an exact 95% confidence interval for μ is
x̄ − 2.086s/√n  to  x̄ + 2.086s/√n.
Similarly if n = 41, so that you have 40 degrees of freedom, an exact 95% confidence interval
for μ is
x̄ − 2.021s/√n  to  x̄ + 2.021s/√n.
Both of these are close to the approximate 95% confidence interval for μ deriving from the
two-standard-deviation rule, namely
x̄ − 2s/√n  to  x̄ + 2s/√n.
Looking down the chart of the critical values of t corresponding to α = 0.025, you can see
that from n = 20 onwards the exact 95% confidence interval for μ is very close to the ap-
proximate confidence interval. In fact for n = 60 they are essentially identical.
This of course is why the approximate two-standard deviation rule has been emphasized
so much in class.
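A minimal sketch, assuming Python with SciPy, of how the exact 95% multiplier approaches the approximate value 2 as the degrees of freedom grow:

# Exact two-sided 95% critical points of the t distribution for several degrees of freedom.
from scipy import stats

for df in [9, 20, 40, 60]:
    multiplier = stats.t.ppf(0.975, df)   # exact two-sided 95% critical point
    print(df, round(multiplier, 3))       # 2.262, 2.086, 2.021, 2.000 approximately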
13. Robustness. As noted above, if you do a t test and use the t chart to assess significance,
you are implicitly assuming that the random variables X₁, X₂, . . . , Xₙ have a normal distri-
bution. (The critical points in the t chart were calculated under this assumption.) So it is
natural to ask: How far wrong do you go in using a t test if X₁, X₂, . . . , Xₙ do not have a
normal distribution?
The answer is: not much. The t test is said to be robust. The critical points in the t chart
are quite accurate even if X₁, X₂, . . . , Xₙ do not have a normal distribution.
In part this happens because of the Central Limit Theorem. A crucial component of the t
statistic is x̄, and from the Central Limit Theorem one can reasonably assume that X̄ has
close to a normal distribution whatever the probability distribution of X₁, X₂, . . . , Xₙ might
be.
********************
Example 2. This (a) gives an example of a one-sided down test, and (b) shows the importance
of the noise (s) in the denominator of the t statistic.
You are a member of a consumer group and you are concerned that your local supermarket
is putting in less than 5 pounds of coffee in bags which they label "5 pounds". We want to
check on this and, to do this, we plan to buy 10 bags of coffee and weigh them. (With 10
bags we will have 9 degrees of freedom in our eventual t test.)
Step 1. Let μ be the true mean weight of coffee put in the bags. The null hypothesis is μ = 5
and (because of the context) the alternative hypothesis is μ < 5.
Step 2. Let's choose α = 0.05. Because of the nature of the alternative hypothesis and the
fact that we will have 9 degrees of freedom, we will reject the null hypothesis if our eventual
value of t is −1.833 or less.
Steps 3, 4 and 5 combined.
Case 1. Suppose that the weights of the 10 bags are (in pounds)
4.96, 5.04, 4.93, 5.06, 4.92, 4.96, 5.03, 4.91, 4.98, 5.01.
From these values we find x̄ = 4.98 and s² = 0.0028. From this,
t = (4.98 − 5.00)√10/√0.0028 = −1.1952.
Because this is not less than −1.833 we do not have enough evidence to reject the null hy-
pothesis. (In an upcoming homework you will do this test using JMP, and will also find the
P-value associated with the t value of −1.1952.)
Case 2. As above, but suppose that the weights of the 10 bags are (in pounds)
4.98, 4.97, 4.99, 4.99, 5.01, 4.97, 4.98, 4.96, 4.99, 4.96.
From these values we find x̄ = 4.98 (the same as in Case 1), but now s² = 0.0002444 (much
less than in Case 1). From this,
t = (4.98 − 5.00)√10/√0.0002444 = −4.0452.
Because this now is less than −1.833 we have enough evidence to reject the null hypothesis.
Note that the averages are the same in Cases 1 and 2, and that the only difference between
the two cases is that there is a smaller level of noise in Case 2.
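A minimal sketch of the two cases, assuming Python with SciPy 1.6 or later, showing how the smaller noise in Case 2 changes the conclusion even though the signal is the same:

# One-sided-down t tests for the two coffee-bag cases.
from scipy import stats

case1 = [4.96, 5.04, 4.93, 5.06, 4.92, 4.96, 5.03, 4.91, 4.98, 5.01]
case2 = [4.98, 4.97, 4.99, 4.99, 5.01, 4.97, 4.98, 4.96, 4.99, 4.96]

# Test the null hypothesis mu = 5 against the alternative mu < 5
print(stats.ttest_1samp(case1, popmean=5.0, alternative='less'))  # t about -1.195, not significant
print(stats.ttest_1samp(case2, popmean=5.0, alternative='less'))  # t about -4.045, significant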
A two-sided example
Remember: the t chart that you have is designed for one-sided up tests. However it is
easily adapted for two-sided tests, provided you remember what the chart values are telling
you. This was discussed in Note 10 above.
If for example in the temperatures case we are concerned about both an increase and a
decrease in the temperatures of people 24 hours after the operation, we will have to conduct
a two-sided test. If as before we choose α = 0.01, we put half of this value (i.e. 0.005) on
the up side and the other half on the down side. If n = 10, so that we have 9 degrees of
freedom, the chart shows that we will reject the null hypothesis if t ≥ 3.250 or if t ≤ −3.250.
Two-sample t tests
These are used very often and thus are very important.
As the name suggests, with two-sample t tests we have two sets of data. Here is an example.
Suppose that we want to assess whether people tend to lose the ability to remember a set
of instructions as they get older. To address this question we plan to get a sample of m
people aged 18-45 (group 1) and a sample of n people aged 60-75 (group 2) and sub-
ject each person to a memory test (see how long they take to remember a set of instructions).
We are still in the planning phase. We will do all this tomorrow. So at this stage the lengths
of time for the people in group 1 are random variables, which we denote by X₁₁, X₁₂, . . . , X₁ₘ.
Similarly the lengths of time for the people in group 2 are random variables, which we denote
by X₂₁, X₂₂, . . . , X₂ₙ. (Notice that the first suffix indicates group membership.)
We assume that X₁₁, X₁₂, . . . , X₁ₘ are iid random variables, each having a normal distribu-
tion with mean μ₁ and (unknown) variance σ². (The normal distribution assumption will be
discussed later). Similarly we assume that X₂₁, X₂₂, . . . , X₂ₙ are iid random variables, each
having a normal distribution with mean μ₂ and (unknown) variance σ². (Again, the normal
distribution assumption will be discussed later. Note also that we are assuming the same
variance in both groups: this assumption also will be discussed later.)
Step 1. The null hypothesis claims that μ₁ = μ₂.
(Note here the parallel with what we did in the binomial case. We first tested for the value
of one single binomial parameter: "Is this coin fair?" and moved on from that to testing for
the equality of two binomial parameters: "Is the probability (θ₁) that a woman is pro-choice
the same as the probability (θ₂) that a man is pro-choice?". Similarly here we first tested
for the value of one single mean: "Is the mean 98.6?", and we have now moved to testing
for the equality of two means.)
The alternative comes from the context, and in this example it is that μ₂ > μ₁ (the mean
time for the older group is larger than that for the younger group). So this will
be a one-sided test. (Whether it will be one-sided up or one-sided down will be discussed
later.) In other cases the alternative hypothesis could be two-sided (μ₁ ≠ μ₂). It all depends
on the context.
Step 2. Here we decide on the value of α, usually either 5% or 1%. This is the false positive
probability. In the memory example, it is the probability, which you are prepared to accept,
of rejecting the null hypothesis (that the mean memory time is the same for both age
groups) when it is true.
Step 3. What is the test statistic? Let's now go forward in time to when we have our
data. We will have data values x₁₁, x₁₂, . . . , x₁ₘ from the people in group 1 and data values
x₂₁, x₂₂, . . . , x₂ₙ from the people in group 2. It is clear that a key part of the signal in the
t statistic that we are developing will be the difference between the averages x̄₁ and x̄₂. We
could use the difference x̄₁ − x̄₂ or the difference x̄₂ − x̄₁. Which choice we make will eventu-
ally decide if the test is one-sided up or one-sided down: more on this later. To be specific
here, let's choose the difference x̄₁ − x̄₂. As stated above, this will be a key component of
the signal in the numerator in the t test statistic that we are developing.
However, as with the one-sample t test, this on its own is not enough. We have to consider
the noise. To see what this should be, let's pretend for the moment that the variance σ²
(discussed above) is known. We now have to do a probability theory zig, and consider
the situation before we get our data. Since we know that our eventual t statistic will have
x̄₁ − x̄₂ in the numerator, at this stage we think about the random variable X̄₁ − X̄₂. What
is the variance of this random variable?
Here we have to use two of the magic formulas of probability theory. The formula for the
variance of an average shows us that the variance of X̄₁ is σ²/m and the variance of X̄₂ is σ²/n.
Next we have to use the magic formula for the variance of the difference of two (independent)
random variables, namely the sum of their variances. Putting all of this together, the variance of
X̄₁ − X̄₂ is
Variance of X̄₁ − X̄₂ = σ²/m + σ²/n.   (112)
Again using magic formulas, the mean of X̄₁ is μ₁ and the mean of X̄₂ is μ₂. Now suppose that
the null hypothesis μ₁ = μ₂ is true. Using a further magic formula, the mean of X̄₁ − X̄₂ is
then zero. Putting this together with the result given in (112), when the null hypothesis is
true, the quantity
(X̄₁ − X̄₂)/√(σ²/m + σ²/n)   (113)
is a Z, that is, it is a random variable having a normal distribution with mean zero, variance
1. We could use the Z chart to test for its significance.
Algebraic manipulations in (113) show that the statistic defined in (113) can be written in
the more convenient form
(X̄₁ − X̄₂)√(mn/(m + n))/σ.   (114)
To say it again, if the null hypothesis μ₁ = μ₂ is true, the quantity in (114) is a Z, that
is, has a normal distribution with mean zero, variance 1.
However, the problem with the above is that we do not know σ. So we will have to estimate
it from the data. This will eventually lead us to a t statistic, since as discussed previously,
the only difference between a z statistic and a t statistic is that in a t we have an estimate of
a standard deviation in the denominator, whereas in a Z statistic we have a known standard
deviation in the denominator, as in (114).
This leads to a trust me result. We want to use the data from both groups, combined
in some way, to estimate σ², and from this to estimate σ. The data x₁₁, x₁₂, . . . , x₁ₘ from
group 1, taken on their own, would lead to the estimate
s₁² = (x₁₁² + x₁₂² + · · · + x₁ₘ² − m(x̄₁)²)/(m − 1).   (115)
Similarly the data x₂₁, x₂₂, . . . , x₂ₙ from group 2, taken on their own, would lead to the
estimate
s₂² = (x₂₁² + x₂₂² + · · · + x₂ₙ² − n(x̄₂)²)/(n − 1).   (116)
Here is the trust me result. Our estimate of σ², combining the estimates s₁² from group 1
and s₂² from group 2, is s², defined by
s² = ((m − 1)s₁² + (n − 1)s₂²)/(m + n − 2).   (117)
This is a weighted average of s₁² and s₂², the weights being the respective degrees of freedom
in the two groups. Combining this with the thinking that led to the z in (114), the test
statistic that we will use is t, defined by
t = (x̄₁ − x̄₂)√(mn/(m + n))/s.   (118)
This is the end of Step 3. We have arrived at our test statistic. It is what we will eventually
calculate from our data.
Step 4, Approach 1. In this step we ask: What values of t will lead us to reject the null
hypothesis? We answer this question in two stages. But before starting, recall that we
arbitrarily have x̄₁ − x̄₂ in the numerator of the test statistic t. We could equally have
decided to have x̄₂ − x̄₁ in the numerator. If we had done this our test statistic would have
been t′, defined by
t′ = (x̄₂ − x̄₁)√(mn/(m + n))/s.   (119)
Either choice of test statistic is allowed. However it is crucial that you think very carefully,
once you have decided which statistic to use, (118) or (119), about the sided-ness of your
test. This is determined by the alternative hypothesis and the choice of statistic that you
make.
Suppose, for example, that we use the test statistic t as given in (118), and that the alterna-
tive hypothesis had been μ₁ < μ₂. What value of t would we expect to get if this alternative
hypothesis is true? If the alternative hypothesis is true, x̄₁ will tend to be less than x̄₂ and
so t will tend to be negative. Thus sufficiently large negative values of t, as defined in (118),
will lead us to reject the null hypothesis.
Similarly, suppose that the alternative hypothesis had been μ₁ > μ₂. If this alternative
hypothesis is true, x̄₁ will tend to be greater than x̄₂ and so t will tend to be positive. Thus
sufficiently large positive values of t, as defined in (118), would in that case lead us to reject
the null hypothesis.
What would the corresponding procedure be if you decided to use the test statistic t′? Sup-
pose first that the alternative hypothesis had been μ₁ < μ₂. What value of t′ would we expect
to get if this alternative hypothesis is true? If this alternative hypothesis is true, x̄₁ will tend
to be less than x̄₂ and so t′ will tend to be positive. Thus sufficiently large positive values of
t′ would lead you to reject the null hypothesis.
Does this contradict the conclusion that you would have reached using t as the test statistic?
NO. Observe that t′ is the negative of t: if for example t = −1.66, then t′ = +1.66. So
large negative values of t correspond to large positive values of t′, and you will reach the same
conclusion whether you use t or t′.
To be concrete we will use t as defined in (118) as our test statistic, but the above discussion
shows that we must be careful to work out, if we have a one-sided test, whether we will have
a one-sided up or a one-sided down test in any specific example.
If our test is two-sided then it does not matter: sufficiently large positive or sufficiently large
negative values of t will lead us to reject the null hypothesis, and this will correspond exactly to
sufficiently large positive or sufficiently large negative values of t′.
How large? Here is another trust me result. If the null hypothesis is true, then the statistic
t is the observed value of a random variable having the t distribution with m + n − 2 degrees
of freedom. (You get m − 1 degrees of freedom from group 1 and n − 1 degrees of freedom
from group 2, and thus m + n − 2 degrees of freedom altogether.)
The degrees of freedom calculation together with the value of α will tell you how large pos-
itive, or how large negative, or how large positive or negative, t has to be before the null
hypothesis is rejected. Here are some examples.
Suppose that m = 10 and n = 13. You therefore have 10 + 13 − 2 = 21 degrees of freedom.
Suppose that the alternative hypothesis is μ₁ > μ₂ and that you chose α = 0.05. Then you
have a one-sided up test, and the t chart shows that you will reject the null hypothesis if
t > 1.721. If the alternative hypothesis had been μ₁ < μ₂ and you chose α = 0.05, then you
have a one-sided down test, and you will reject this null hypothesis if t < −1.721. If the
alternative hypothesis had been μ₁ ≠ μ₂ you have a two-sided test, and with α = 0.05 you
would reject this null hypothesis if t > 2.080 or if t < −2.080.
Step 5, Approach 1. This is the easy step. Get the data, calculate t, refer the value you get
to the appropriate value on the t chart, and accept or reject the null hypothesis depending
on this comparison.
Step 4, Approach 2. In this step we calculate the value of t.
Step 5, Approach 2. In this step we find the P-value associated with the value of t calculated
in Step 4. This can only be done via a statistical computer package. JMP prints out three
P-values, one for the case where the test is one-sided up, one for the case where the test is
one-sided down, and one for the case where the test is two-sided.
Notes on the two-sample t test.
Most of the notes for the one-sample t test apply also for a two-sample t test. There are
however some differences. Here are three of them.
1. An obvious difference is in the formula for the number of degrees of freedom in the two
cases (n − 1 compared to m + n − 2).
2. A key assumption made above is that the variance corresponding to group 1 is the same
as that corresponding to group 2. (No corresponding assumption is needed for one-sample
t tests.) The assumption of equal variances is one that you might, or might not, like to
make, depending perhaps on your experience with the sort of data you are involved with.
There is a way of proceeding if you are not prepared to make this assumption. However this
procedure is complex and we will not consider it in this class. Note that when using JMP
for a two-sample t test you can specify whether you want to make this assumption or not.
JMP will compute the value of t and its P-value corresponding to whatever choice you made.
3. The main part of the signal in a one-sample t test is the difference between an average
and a mean, i.e. x̄ − μ₀. The main part of the signal in a two-sample t test is the difference
between two averages (x̄₁ − x̄₂).
4. This note concerns a computational simplification for the calculation of s². The formula
given in (117) for s² is (to repeat that formula)
s² = ((m − 1)s₁² + (n − 1)s₂²)/(m + n − 2).
Now s₁² is defined by
s₁² = (x₁₁² + x₁₂² + · · · + x₁ₘ² − m(x̄₁)²)/(m − 1),
and similarly s₂² is defined by
s₂² = (x₂₁² + x₂₂² + · · · + x₂ₙ² − n(x̄₂)²)/(n − 1).
It follows from these three equations that
s² = (x₁₁² + x₁₂² + · · · + x₁ₘ² − m(x̄₁)² + x₂₁² + x₂₂² + · · · + x₂ₙ² − n(x̄₂)²)/(m + n − 2).
This is perhaps the simplest formula for computing s².
5. There is no need, in a two-sample t test, for m and n to be equal. However in practice it
is wise to make them as reasonably equal as possible, given the constraints of the situation
involved.
6. The null hypothesis in any two-sample t test is that two means are equal (in the notation
above, that μ₁ = μ₂). However no statement is made or required as to what the common
value of the mean is under the null hypothesis.
Numerical example of a two-sample t test.
A group of m = 10 young people were given a certain stimulus and the reaction time of each
of these 10 people was measured (in thousandths of a second). At the same time a group
of n = 8 old people were given the same stimulus and the reaction time of each of these 8
people was measured (again, in thousandths of a second). The null hypothesis is that the
mean reaction time for younger people is the same as that of older people. The natural
alternative hypothesis is that the mean reaction time for older people is longer than that
for younger people. These comments in effect conclude Step 1: we have stated our null and
alternative hypotheses.
In Step 2 we choose the value of α, the false positive rate (here the probability that we
conclude that the mean reaction time for older people is longer than that for younger
people when in fact it is the same). Let's choose 1%.
Steps 3, 4 and 5 combined. Here group 1 will be the young people and group 2 will be the
old people. At this point we have to ask ourselves: what kind of values of the t statistic will
lead us to reject the null hypothesis? Suppose that m, the number of young people, is 10
and that n, the number of older people, is 8. Looking at the numerator in the t statistic we
see that if the alternative hypothesis is true, t will tend to be negative. This implies that we
will reject the null hypothesis if t is sufficiently large negative. How large negative? We will
have 10 + 8 − 2 = 16 degrees of freedom, and with α = 0.01 the t chart shows that we will
reject the null hypothesis if t ≤ −2.583.
We are now set up to do the calculations. The data are as follows. For the 10 young people
the reaction times were
382, 446, 483, 378, 414, 420, 452, 391, 399, 426
and for the 8 old people the reaction times were
423, 474, 456, 432, 513, 480, 498, 448.
From these values we compute x̄₁ = 419.1 and x̄₂ = 465.5. The value of s² is 17143/16 =
1071. Thus
t = (419.1 − 465.5)√(80/18)/√1071 = −2.9890.
Since this is less than the critical value −2.583 we reject the null hypothesis, and claim that
we have significant evidence that the mean reaction time for older people exceeds that for
younger people.
Let's review the logic behind this, focusing on the probability theory implication/deduction
activity (the zig) and the statistical inference/induction activity (the zag).
If the null hypothesis is true, then difficult math which we will not see shows that the prob-
ability that the eventual value of t will be less than −2.583 is very small, in fact 0.01. This is
the probability theory zig.
After we got our data we found that the calculated value of t was −2.9890. Since this is less
than −2.583, we make the inference, or induction, that we have good evidence that the null
hypothesis is not true. This is the statistical zag.
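A minimal sketch of this two-sample test, assuming Python with SciPy 1.6 or later; the course itself uses JMP:

# Pooled-variance, one-sided-down two-sample t test for the reaction-time data.
from scipy import stats

young = [382, 446, 483, 378, 414, 420, 452, 391, 399, 426]   # group 1
old = [423, 474, 456, 432, 513, 480, 498, 448]               # group 2

# equal_var=True gives the pooled-variance test used above; alternative='less'
# says that the group 1 mean is smaller under the alternative hypothesis.
t_stat, p_value = stats.ttest_ind(young, old, equal_var=True, alternative='less')
print(t_stat)    # -2.9890
print(p_value)   # below 0.01, so the null hypothesis is rejected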
The paired two-sample t test
In this test, although we start with two samples of data, we finish up doing a one-sample
test. Here is an example from the temperature 24 hours after an operation context.
As before, we are concerned that the temperature of people 24 hours after an operation
is unduly elevated. This time we measure the temperature of the people undergoing the
operation both 24 hours before, and also 24 hours after, the operation. Suppose that we
consider n people who will have this operation. Going straight to the data, we will have n
"24 hour before the operation" temperatures x₁₁, x₁₂, . . . , x₁ₙ and also n "24 hour after the
operation" temperatures x₂₁, x₂₂, . . . , x₂ₙ. Here the temperatures x₁₁ and x₂₁ are from the
same person (person number 1), x₁₂ and x₂₂ are from person number 2, and so on.
There is now a natural pairing of the temperatures taken from the same person. We could
ignore this and do a two-sample t test. But doing this does not take advantage of this natural
pairing. We will see how we do take advantage of this pairing in Step 3 below.
First, let's consider Step 1.
Step 1.
As before, the null hypothesis is that the mean temperature 24 hours before the operation
and the mean temperature 24 hours after the operation are equal. The alternative hypothesis
of interest to us is that mean temperature goes up after the operation. This concludes Step
1 - we have specified both our hypotheses.
Step 2. Let's again choose α = 0.01.
Steps 3, 4 and 5
Let's look forward to our eventual data. It is arbitrary whether we choose Group 1 as the
"24 hour before" readings or the "24 hour after" readings, so long as we think about what
this means about whether the test will be one-sided up or one-sided down. Let's choose the
"24 hour before" readings as Group 1.
Suppose that there are n people in the experiment and we denote the n "24 hour be-
fore" data values as x₁₁, x₁₂, . . . , x₁ₙ and the n "24 hour after" data values as x₂₁, x₂₂, . . . , x₂ₙ.
We now take advantage of the pairing to focus on the differences d₁ = x₁₁ − x₂₁, d₂ =
x₁₂ − x₂₂, . . . , dₙ = x₁ₙ − x₂ₙ. From now on we discard the original xᵢⱼ data values and
use only these differences. This means that we will be doing a one-sample t test (on the dᵢ
values), even though our original data formed two samples.
To find the relevant test statistic we do a probability theory zig. If X₁ is the temperature
of a randomly chosen person 24 hours before the operation and X₂ is the temperature of
this same person 24 hours after the operation, then under the null hypothesis the mean of
the difference D = X₁ − X₂ between these two temperatures is zero. Under the alternative
hypothesis the mean of this difference is negative.
Following the standard procedure for a one-sample test, we will form a t statistic from the
respective differences d₁, d₂, . . . , dₙ. The numerator in this statistic will be the average d̄
of these differences. This implies that the test will be one-sided down: if the alternative
hypothesis is true, this average will tend to be negative. We also calculate s_d² using the
formula
s_d² = (d₁² + d₂² + · · · + dₙ² − n(d̄)²)/(n − 1).
The eventual t statistic is then
t = d̄√n/s_d.   (120)
Here is a trust me result. If the null hypothesis is true, this is the observed value of a
random variable having the t distribution with n − 1 degrees of freedom. So we now refer
the calculated value of this t to the t chart with n − 1 degrees of freedom, remembering that
in this example the test is one-sided down. Thus if for example n = 15 and α = 0.01, we
would reject the null hypothesis if t is less than or equal to −2.624.
Two questions. 1. If we had done a two-sample t test (which is quite possible, using the
data) we would have 2n − 2 degrees of freedom. What has happened to the remaining n − 1
degrees of freedom?
2. What is the advantage of using the paired t test? It eliminates the natural person-to-
person variation in temperature. This is illustrated by the following example. Before doing
the example recall that we defined dⱼ as x₁ⱼ − x₂ⱼ, and this led to the test being one-sided
down. We now address these two points in turn.
1. Suppose that we take the temperatures of n = 8 people both before and after the opera-
tion. Going straight to the data, suppose that we get:-
Person number 1 2 3 4 5 6 7 8
Temperature before 98.0 98.9 98.1 97.8 98.0 98.8 98.8 98.1
Temperature after 98.4 98.8 98.4 98.1 98.2 99.0 98.7 98.5
From these we form the differences:-
Difference −.4 +.1 −.3 −.3 −.2 −.2 +.1 −.4
The average of these differences is −.2. Next,
s_d² = ((−.4)² + (+.1)² + · · · + (−.4)² − 8(−.2)²)/7,
and this is 0.04. Thus the numerical value of t is
t = (−.2)√8/√0.04 = −2.828.
This would be significant if α = 0.05 and would almost be significant if α = 0.01 (7 degrees
of freedom).
We now do the analysis using the ordinary (i.e. unpaired) two-sample t test and then com-
pare the results of the two procedures.
The average (x̄₁) of the before temperatures is 98.3125 and the average (x̄₂) of the after
temperatures is 98.5125. The difference (x̄₁ − x̄₂) between these two averages is −0.2. (This is
identical to the average of the differences calculated doing the paired t test, and this equal-
ity will always happen. Thus this part of the signal will be identical in the two procedures.)
Further calculation shows that the pooled estimate s² of variance is s² = 2.0175/14 = 0.144107.
Thus under this approach the numerical value of t is
t = (−0.2)√(64/16)/√0.144107 = −1.0537.
This is nowhere near significant (14 degrees of freedom).
What has happened to cause the difference in the two results is that the natural individual-
to-individual variation in temperature (which is not of direct interest to us) has contributed
to the noise, decreasing the absolute numerical value of the t statistic. This makes it more difficult to
detect a significant signal. The paired t test overcomes this problem by taking differences of
temperatures within individuals, and this removes any individual-to-individual variation
in temperature from the procedure.
The take-home message is: If there is a logical reason to pair, then do a paired t test. This
will give you a sharper result.
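A minimal sketch of the paired and unpaired analyses of these data, assuming Python with SciPy 1.6 or later:

# Paired versus unpaired t tests for the before/after temperature data.
from scipy import stats

before = [98.0, 98.9, 98.1, 97.8, 98.0, 98.8, 98.8, 98.1]
after = [98.4, 98.8, 98.4, 98.1, 98.2, 99.0, 98.7, 98.5]

# Paired test on the within-person differences: t about -2.828
print(stats.ttest_rel(before, after, alternative='less'))
# Unpaired test ignoring the pairing: t about -1.054, nowhere near significant
print(stats.ttest_ind(before, after, equal_var=True, alternative='less'))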
2. What happened to the "lost" degrees of freedom, the difference between the 2n − 2 degrees
of freedom for the ordinary (i.e. unpaired) test and the n − 1 degrees of freedom
for the paired test?
Suppose that we had also been interested in investigating if there is significant individual-
to-individual variation in temperature. These lost n − 1 degrees of freedom are used for
this test. This is the beginning of the concept of ANOVA (the Analysis of Variance). The
thinking of ANOVA is the following.
Any body of data will almost always exhibit variation. We can ascribe some of this variation
to one cause (here, before and after the operation) and some of this variation to another
cause (here, individual-to-individual variation). ANOVA does two tests (one for a before
to after difference, one for an individual-to-individual difference) for the price of one set of
data. Some of the overall degrees of freedom are used for one test and some for the other test.
The word Analysis (in the Analysis of Variance) means sub-division. We subdivide the
variation in the data into meaningful components, and test each component for significance.
Final note. It was of course arbitrary that we chose the before temperatures as Group
1. What would have happened if we chose the after temperatures as Group 1? First, the
test would now be one-sided up, since under the alternative hypothesis we now expect the
average of (the new) Group 1 to exceed that of (the new) Group 2. Second, with the above
data the value of t would now be +2.828.
Putting these two observations together, we would reach the same conclusion as we did be-
fore. It would be a disaster if we did not.
t tests in regression.
Preview. We first review our previous discussion on regression, where our focus was on
estimating parameters, not on testing hypotheses about them. We will get onto tests of
hypotheses in regression soon.
Regression concerns the question: How does one thing depend on another? In the example
that we considered earlier, we asked: How does the growth height of a plant depend on the
amount of water given the plant in the growing period?
The amount of water given to any plant is NOT random: we can choose this for ourselves.
So we denote this in lower case letters. The growth height of any plant IS random: this will
depend on various factors such as soil fertility that we do not know much about. The basic
(linear model) assumption is that if Y is the (random) growth height for a plant given x
units of water, then the mean of Y is of the form α + βx and the variance of Y is σ². Here
α, β and σ² are all (unknown) parameters which we previously learned how to estimate
from our eventual data.
What are the data, once the experiment is completed? These are the respective amounts
of water x₁, x₂, . . . , xₙ given to n plants and the corresponding growth heights y₁, y₂, . . . , yₙ.
The first thing that we did was to calculate five quantities from these data. These were
x̄ = (x₁ + x₂ + . . . + xₙ)/n,   ȳ = (y₁ + y₂ + . . . + yₙ)/n,   (121)
as well as the quantities s_xx, s_yy and s_xy, defined by
s_xx = (x₁ − x̄)² + (x₂ − x̄)² + . . . + (xₙ − x̄)²,   (122)
s_yy = (y₁ − ȳ)² + (y₂ − ȳ)² + . . . + (yₙ − ȳ)²,   (123)
s_xy = (x₁ − x̄)(y₁ − ȳ) + (x₂ − x̄)(y₂ − ȳ) + . . . + (xₙ − x̄)(yₙ − ȳ).   (124)
From these we estimated β by b, defined as b = s_xy/s_xx. We also estimated σ² by s_r², defined
by s_r² = (s_yy − b²s_xx)/(n − 2).
In hypothesis testing the most interesting question is: Is there any effect of the amount of
water on the growth height? This is identical to asking the question: Is β = 0? So our
estimate b of β will be the key component in the numerator signal of our eventual t statistic.
Before going further we note three things about s_r².
1. The suffix r denotes regression: this estimate is specific to the regression context.
2. Why is there an n − 2 in the calculation of s_r²? This is because here we have n − 2 degrees
of freedom: more on this later.
3. Another trust me result: the estimate of the variance of the estimator corresponding
to b is not s_r². It is s_r²/s_xx.
We now put all of this together to formulate our regression t test.
First we have to assume that before the experiment the (random variable) growth heights
have a normal distribution. (This assumption was NOT necessary in our previous estimation
activities.)
The null hypothesis is no effect of the amount of water on the growth height, which trans-
lates into β = 0. In general the alternative hypothesis could be one-sided up (β > 0),
one-sided down (β < 0), or two-sided (β ≠ 0). In the water example the natural
alternative hypothesis, from the context, is β > 0. This concludes Step 1: we have formu-
lated our null and alternative hypotheses.
Step 2. Here we choose α, the false positive rate (the probability of our claiming that there
is a positive effect of water on growth height if in fact there is no effect). Let's go with 5%.
Step 3. What is the test statistic? Clearly it will be a t statistic with b in the numera-
tor and our estimate of the standard deviation of the estimator corresponding to b in the
denominator. Thus from the above, this statistic is
t = b/(s_r/√s_xx) = b√s_xx/s_r.   (125)
This concludes Step 3: we have formulated our test statistic.
Approach 1, Step 4. What values of t will lead us to reject the null hypothesis? This depends
on three things:-
(i) The sided-ness of the test. If the alternative hypothesis is β > 0 then sufficiently large
positive values of t lead us to reject the null hypothesis. If the alternative hypothesis is
β < 0 then sufficiently large negative values of t lead us to reject the null hypothesis. If the
alternative hypothesis is β ≠ 0 then sufficiently large positive or negative values of t lead us
to reject the null hypothesis.
(ii) The value chosen for α. This will determine the appropriate column in the t chart.
(iii) The number of degrees of freedom that we have. As above: trust me. We have n − 2
degrees of freedom.
Step 5. This is easy. Get the data, calculate t, do the test.
Approach 2,Step 4. Calculate t.
Approach 2,Step 5. Find the P-value corresponding to the calculated value of t. This can
only be done with a computer package. There will be a JMP handout in class about this.
(The JMP handout is not well formatted, and it is not immediately clear where you have
to look in the handout to get the information that you want. The so-called Intercept t
statistic tests the null hypothesis = 0. This is not of interest to us. The so-called Water
t statistic tests the null hypothesis = 0. This is the test of interest to us. The value of
t given is 14.39 (with 10 degrees of freedom). Unfortunatley the print-out only gives the
P-value for a two-sided test. It should also give the P-value for a one -sided up, and also for
a one-sided down, test. The P-value for a one-sided test would be half the value given for a
two-sided test.)
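If you do not have the JMP output handy, the relation between the one-sided and two-sided P-values can be checked directly. Here is a minimal sketch using Python's scipy (not a package used in this course), with the t value and degrees of freedom quoted above:

```python
from scipy.stats import t as t_dist

t_obs, df = 14.39, 10                           # values quoted from the JMP output
p_one_sided_up = t_dist.sf(t_obs, df)           # P(T >= t_obs) under the null hypothesis
p_two_sided = 2 * t_dist.sf(abs(t_obs), df)     # the P-value that JMP reports
print(p_one_sided_up, p_two_sided)              # the one-sided value is half the two-sided one
```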
Generalization. In the above example the null hypothesis was $\beta = 0$. In some cases the null hypothesis is of the form $\beta = \beta_0$, where $\beta_0$ is some specified numerical value. The relevant t statistic is then
$$t = \frac{(b - \beta_0)\sqrt{s_{xx}}}{s_r}. \tag{126}$$
In general the alternative hypothesis could be one-sided up ($\beta > \beta_0$), one-sided down ($\beta < \beta_0$), or two-sided ($\beta \neq \beta_0$). There are still $n - 2$ degrees of freedom.
A note on degrees of freedom. As stated in class, the numerical value for the number of degrees of freedom is sometimes mysterious. Your best approach is possibly just to trust me in any situation as to how many degrees of freedom you have. However, if you want some idea of how this number is calculated in the t test context, it is the number of observations minus the number of parameters that you have estimated before you estimate the relevant variance. In the one-sample t test we estimated one parameter (the mean, estimated by $\bar{x}$) before we estimated the variance. So we have $n - 1$ degrees of freedom. The same is true in the paired t test. In the two-sample t test we estimated two means $\mu_1$ and $\mu_2$ (respectively by $\bar{x}_1$ and $\bar{x}_2$) before estimating the variance, so we have $m + n - 2$ degrees of freedom. In the regression case we estimated two parameters ($\alpha$ and $\beta$) before estimating the variance ($\sigma^2$). So we have $n - 2$ degrees of freedom.
Another way of looking at this is as follows. Suppose that in the plant example you only had two plants (n = 2). Then you have no degrees of freedom. Why is this?
With n = 2 you will have two points in the x-y data diagram. You can always fit a line exactly through these two points. But there would be no scatter of points around that line, and such a scatter is what is needed to estimate a variance. You then just cannot estimate $\sigma^2$ if n = 2.
Numerical example
We consider the plant and water example discussed when estimating parameters in the regression context. We have from that example $b = 0.6510638$, $s_{xx} = 188$ and $s_r^2 = 0.384979$. (The value given for $s_r^2$ in the slides and handouts was incorrect. It will be corrected in the canvas version.) This gives
$$t = \frac{0.6510638\sqrt{188}}{\sqrt{0.384979}} = 14.38744,$$
in agreement with the value in the JMP output.
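The arithmetic in this example is easy to reproduce. A minimal Python sketch (again, Python is not used in this course) using the quantities just quoted:

```python
import math

b, sxx, sr2 = 0.6510638, 188, 0.384979     # values from the plant and water example
t = b * math.sqrt(sxx) / math.sqrt(sr2)    # equation (125)
print(round(t, 3))                         # about 14.387, i.e. the 14.39 in the JMP output
```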
Non-parametric (= distribution-free) tests.
Remember: any time that you use a t test you are implicitly assuming that the data are
the observed values of random variables having a normal distribution. Although (as noted
above) t tests are quite robust, that is they work quite well even if the data are not the
observed values of random variables having a normal distribution, it is handy to have tests
available that do not make this implicit assumption. These tests are called non-parametric
tests. A better expression is perhaps distribution-free: they are free of the normal distri-
bution assumption. However non-parametric tests do make some assumptions, as indicated
below.
We start with three non-parametric tests that are alternatives to the two-sample t test. Remember the background to these tests: we have m observations (i.e. data values) in Group 1 and n observations (i.e. data values) in Group 2. The null hypothesis in all three tests is that these m + n observations are the observed values of random variables all having the same probability distribution. To be concrete we will only consider the case where the alternative hypothesis is that the probability distribution corresponding to the observations in Group 1 is the same as that corresponding to the observations in Group 2 except that it is moved to the right. Thus if the alternative hypothesis is true we expect the observations in Group 1 to tend to be larger than those in Group 2.
Non-parametric alternative number 1 to two-sample t tests: the two-by-two table test.
This is a rather crude and quick test. We start by putting all m + n observations into an ordered sequence from lowest to highest and then find some number A close to the half-way point. For example, if the data are:-
Group 1: 56, 66, 65, 44, 51.
Group 2: 51, 65, 63, 48, 47, 50, 61,
then the ordered sequence is
44, 47, 48, 50, 51, 51, 56, 61, 63, 65, 65, 66.
We could choose A = 53. With this choice half the data values are less than A and half are greater than A.
The next step is to form a two-by-two table. Each data value will be categorized by rows as either Group 1 (in row 1) or Group 2 (in row 2). Also, each data value will be categorized by columns as either less than A (in column 1) or greater than A (in column 2). For example, the first observation in Group 1 (56) would go in row 1 (since it is in Group 1) and column 2 (since it is greater than 53). We now form a two-by-two table of counts, each count indicating how many observations are in each of the four cells. With the above data this table would be as follows.
          Less than A   Greater than A   Total
Group 1        2               3            5
Group 2        4               3            7
Total          6               6           12          (127)
If the null hypothesis is true there is no association between the row mode of categorization (Group membership) and the column mode of categorization (less than A or greater than A). So we now do a standard two-by-two table test, using z (since this is a one-sided test). We have to be very careful about whether this is a one-sided up or a one-sided down test. You have to think about the logic of the situation.
The general form of the data in a two-by-two table is, for this test,
          Less than A   Greater than A   Total
Group 1       x_1         n_1 - x_1       n_1
Group 2       x_2         n_2 - x_2       n_2
Total         c_1            c_2           n           (128)
and the original form of the z statistic was
$$z = \frac{x_1/n_1 - x_2/n_2}{\sqrt{\dfrac{(c_1/n)(c_2/n)}{n_1} + \dfrac{(c_1/n)(c_2/n)}{n_2}}}. \tag{129}$$
Now ask yourself: If the alternative hypothesis is true, what sort of value should z tend to take? Well, if the alternative hypothesis (that the Group 1 probability distribution is moved to the right of the Group 2 probability distribution, so that the observations in Group 1 should tend to be larger than those in Group 2) is true, then the fraction $x_1/n_1$ should tend to be less than the fraction $x_2/n_2$, so that z should tend to be negative. Thus this test is one-sided down.
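Applied to the table in (127), the calculation in (129) is short. A minimal Python sketch (the counts come straight from the table above):

```python
import math

x1, n1 = 2, 5                              # Group 1: count below A, group size
x2, n2 = 4, 7                              # Group 2: count below A, group size
c1, c2 = x1 + x2, (n1 - x1) + (n2 - x2)    # column totals: 6 and 6
n = n1 + n2                                # 12

z = (x1 / n1 - x2 / n2) / math.sqrt(
    (c1 / n) * (c2 / n) / n1 + (c1 / n) * (c2 / n) / n2
)
print(round(z, 2))                         # negative, as expected for a one-sided down test
```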
Non-parametric alternative number 2 to the two-sample t tests: the Mann-Whitney (aka
Wilcoxon two-sample) test.
In this procedure we put all the m + n data values into one sequence, ordered from lowest to highest, as in the previous procedure. We then assign them ranks (1, 2, . . . , m + n), with the smallest data value getting rank 1, the next smallest rank 2, . . . , the largest rank m + n. (If there are ties you share out the ranks - see the example below.) Thus with the data values for the previous procedure we would have
Data values: 44, 47, 48, 50, 51, 51, 56, 61, 63, 65, 65, 66.
Rank:         1   2   3   4  5.5 5.5  7   8   9 10.5 10.5 12.
The test statistic is the sum u of the ranks for the Group 1 data values. In the above example this is 1 + 5.5 + 7 + 10.5 + 12 = 36. To assess the significance of this value we have to find the null hypothesis mean and variance of the corresponding random variable U, the (random) sum of the ranks for Group 1 before the experiment was performed.
To find this mean we have to do two things:-
(i) Remember that the sum 1 + 2 + 3 + 4 + ... + a of the first a numbers is a(a + 1)/2. If m = 5, n = 7 as above, so that m + n = 12, the sum of all the ranks (taking both groups together) is 1 + 2 + 3 + 4 + ... + 12 = 78. In general this sum is (m + n)(m + n + 1)/2.
(ii) Use a proportionality argument (see the Thanksgiving cake homework question) to argue that since Group 1 comprises a proportion 5/12 of all the data values, then if the null hypothesis is true, the mean of U in the above numerical example is (5/12)(78) = 32.5. This argument is correct. In general, the mean of U is {m/(m + n)}(m + n)(m + n + 1)/2 = m(m + n + 1)/2 if the null hypothesis is true.
The null hypothesis variance of U is much harder to establish, so here is a "trust me" result: this null hypothesis variance is mn(m + n + 1)/12. In the above example this is (5)(7)(13)/12 = 37.9167.
We now compute a z statistic from the observed sum u of the ranks for Group 1 and this mean and variance as
$$z = \frac{u - \dfrac{m(m+n+1)}{2}}{\sqrt{\dfrac{mn(m+n+1)}{12}}}. \tag{130}$$
This concludes Step 3: we will use this z as our (new) test statistic.
Step 4. What values of z will lead us to reject the null hypothesis? If the null hypothesis is true z is the observed value of a random variable having very close to a Z distribution. (The Central Limit Theorem helps with this claim, since u is a sum.) So we will use Z charts in the testing procedure. To proceed further we have to think what values of z would tend to arise if the alternative hypothesis is true. In the above example, where the alternative hypothesis is that the probability distribution for Group 1 is shifted to the right relative to that of Group 2, we would expect the Group 1 data values to tend to attract higher ranks than the Group 2 data values, so that u would tend to exceed m(m + n + 1)/2 and z would tend to be positive. Thus significantly large positive values of z will lead us to reject the null hypothesis. So the test is one-sided up. How large z has to be depends on the value chosen for $\alpha$: if $\alpha = 0.05$ we reject the null hypothesis if $z \geq 1.645$ and if $\alpha = 0.01$ we reject the null hypothesis if $z \geq 2.326$.
Step 5. In the above example
$$z = \frac{36 - 32.5}{\sqrt{37.9167}} = 0.57,$$
and this is not significant for any reasonable value of $\alpha$. We do not have enough evidence to reject the null hypothesis.
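The ranking and the z calculation above are easy to check by machine. A minimal Python sketch (scipy's rankdata shares out tied ranks in the same way as the hand calculation; Python is not part of this course):

```python
import numpy as np
from scipy.stats import rankdata

group1 = np.array([56, 66, 65, 44, 51])
group2 = np.array([51, 65, 63, 48, 47, 50, 61])
m, n = len(group1), len(group2)

ranks = rankdata(np.concatenate([group1, group2]))   # ranks of all m + n values, ties averaged
u = ranks[:m].sum()                                  # sum of the Group 1 ranks: 36

mean_u = m * (m + n + 1) / 2                         # null hypothesis mean: 32.5
var_u = m * n * (m + n + 1) / 12                     # null hypothesis variance: 37.9167
z = (u - mean_u) / np.sqrt(var_u)                    # equation (130)
print(u, round(z, 2))                                # 36.0 and about 0.57
```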
Notes on this procedure.
1. This non-parametric procedure uses the data more efficiently than did non-parametric procedure 1. In procedure 1 all we looked at was whether an observation was more than A or less than A. We disregarded how much more than A it was, or how much less than A it was, whereas the procedure just described to some extent takes this information into account.
2. Despite this comment, we have lost some information by replacing each data value by its ranking value. The next procedure that we consider does not do this.
3. This procedure is very popular, since it can be shown that in some sense of the word "efficient", it is about 95% as efficient as the two-sample t test even when the data do have a normal distribution.
Non-parametric alternative number 3 to the two-sample t tests: the permutation test.
In this procedure we retain the original data values. We permute the data in all possible ways and calculate the two-sample t statistic for each permutation. (One of these permutations will correspond to the actual data.) The null hypothesis is rejected if the value of t calculated from the actual data is a significantly extreme one of all these permuted values.
It is best to demonstrate the idea with an example. Suppose that the null hypothesis is that men and women have the same mean blood pressure, and that we plan to test this null hypothesis by taking the blood pressures of m = 5 men and n = 5 women. The blood pressures of the five men are 122, 131, 98, 114, 132 and the blood pressures of the five women are 113, 110, 127, 99, 119.
There are $\binom{10}{5} = 252$ permutations of the data such that 5 values are "men" values and the remaining 5 are "women" values. Each will lead to a value of t. Here are some of the 252 permutations and the corresponding values of t:-
Permutation 1 (the real data):
men: 122, 131, 98, 114, 132    women: 113, 110, 127, 99, 119    t = 1.27.
Permutation 2:
men: 127, 114, 132, 99, 113    women: 110, 122, 131, 119, 98    t = 1.01.
Permutation 3:
men: 98, 99, 114, 110, 119    women: 127, 131, 132, 113, 122    t = -2.30.
...................................................................................
Permutation 252:
men: 132, 99, 113, 127, 131    women: 110, 119, 122, 98, 114    t = 1.58.
Suppose that the alternative hypothesis is that the mean blood pressure for men is higher than the mean blood pressure for women. If we had chosen $\alpha = 0.05$ we would reject the null hypothesis (that men and women have the same probability distribution of blood pressure) if the observed value of t, here 1.27, is among the highest 0.05 x 252 = 12.6, or conservatively in practice 12, of these 252 permutation t values.
Here is the logic behind this. If the null hypothesis is true, then given the 10 data values, but with no labeling as to gender, all of the 252 permutation values of t are equally likely. Thus if the null hypothesis is true the probability that the actual value of t is among the 12 largest of these 252 t values is 12/252, or 4.76%. (It is not possible with 5 + 5 = 10 data values to have a test with an exact value of $\alpha$ equal to 5%.) Our test is a bit conservative.
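Enumerating the 252 relabellings is easy by machine. Here is a minimal Python sketch of the procedure (scipy's ttest_ind computes the usual two-sample t statistic; the exact t values it produces need not match the illustrative ones quoted above):

```python
from itertools import combinations
from scipy.stats import ttest_ind

data = [122, 131, 98, 114, 132, 113, 110, 127, 99, 119]      # men first, then women
observed_t = ttest_ind(data[:5], data[5:], equal_var=True).statistic

count_at_least = 0
total = 0
for men_idx in combinations(range(10), 5):                   # all 252 relabellings
    men = [data[i] for i in men_idx]
    women = [data[i] for i in range(10) if i not in men_idx]
    t_perm = ttest_ind(men, women, equal_var=True).statistic
    total += 1
    if t_perm >= observed_t:
        count_at_least += 1

# The fraction of permutation t values at least as large as the observed one
# is the one-sided permutation P-value.
print(observed_t, count_at_least, total, count_at_least / total)
```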
Notes on these three non-parametric procedures.
1. If the data are the observed values of random variables having a normal distribution, it can be shown that in some sense the t test is "optimal". If no assumption is made as to whether the data are the observed values of random variables having a normal distribution there is no unique "optimal" procedure. That is why there are several non-parametric procedures.
2. What do you potentially lose by using a non-parametric procedure? If the data had been the observed values of random variables having a normal distribution you have used a non-optimal procedure. On the other hand, as noted above, the Mann-Whitney (Wilcoxon two-sample) test is about 95% as efficient as the two-sample t test even when the data do have a normal distribution.
3. The third non-parametric procedure described above is clearly computer intensive. If m = n = 10 there are 184,756 possible permutations of the data. However, present-day computers could easily handle this. If m = n = 20 there are about $1.378 \times 10^{11}$ possible permutations, and this might be too many even for a powerful computer. In this case you would take a random sample of (say) a million permutations (including the actual one) and see if the actual t is among the most extreme of these.
Non-parametric alternatives to the one sample t test.
Remember that if you use a t test you are implicitly assuming that the data are the observed values of random variables having a normal distribution. Remember also that the normal distribution is symmetric about its mean $\mu$. We do have to make an assumption in the two non-parametric alternatives to the one sample t test discussed below, namely that the data $x_1, x_2, \ldots, x_n$ are the observed values of n iid random variables $X_1, X_2, \ldots, X_n$, each X having the same symmetric probability distribution. We denote the point of symmetry of this distribution, which is also the mean of the distribution, by $\mu$.
Step 1. In both tests that we consider the null hypothesis claims that $\mu = \mu_0$, where $\mu_0$ is some prescribed numerical value. The alternative hypothesis can be one-sided up ($\mu > \mu_0$), one-sided down ($\mu < \mu_0$), or two-sided ($\mu \neq \mu_0$). Which is the appropriate alternative hypothesis depends on the context.
Step 2. Here we choose the value of $\alpha$, typically 5% or 1%.
The remaining steps depend on the test considered. We consider two tests in turn.
Non-parametric alternative number 1 to the one sample t test: The sign, or binomial, test.
Step 3. The test statistic is y, the number of data values greater than the null hypothesis mean $\mu_0$.
Step 4. What values of y lead us to reject the null hypothesis? If the alternative hypothesis is one-sided up ($\mu > \mu_0$) then sufficiently large values of y lead us to reject the null hypothesis. If the alternative hypothesis is one-sided down ($\mu < \mu_0$) then sufficiently small values of y lead us to reject the null hypothesis. If the alternative hypothesis is two-sided ($\mu \neq \mu_0$) then sufficiently large values or sufficiently small values of y lead us to reject the null hypothesis.
How large or small? To answer this question we consider the probability distribution of the random variable Y corresponding to y. If the null hypothesis is true, Y has a binomial distribution with parameter 1/2 (this is where the assumption that the distribution of the random variables $X_1, X_2, \ldots, X_n$ is symmetric comes in) and index n. So we now do a standard binomial test, as discussed previously in these notes.
This is a weak test since it takes no account of the extent to which any data value is less than, or more than, $\mu_0$. So we will not consider a numerical example. The next test takes this information into account, at least to some extent.
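Even so, the mechanics of the binomial calculation are easy to sketch. The following minimal Python sketch reuses the height data from the Wilcoxon example below purely for illustration (Python and scipy are not part of this course):

```python
from scipy.stats import binom

heights = [67.2, 69.3, 69.7, 68.5, 69.8, 68.9, 69.4]   # the data used in the next example
mu0 = 69
n = len(heights)

y = sum(1 for value in heights if value > mu0)          # number of values above mu0
p_one_sided_up = binom.sf(y - 1, n, 0.5)                # P(Y >= y) under the null hypothesis
print(y, p_one_sided_up)
```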
Non-parametric alternative number 2 to the one sample t test: The Wilcoxon one-sample test.
Steps 3, 4 and 5 rolled together.
Suppose that we have our data values $x_1, x_2, \ldots, x_n$. We first construct the differences $x_1 - \mu_0, x_2 - \mu_0, \ldots, x_n - \mu_0$. Some of these differences will probably be negative, some positive. Next we ignore the signs of these differences and construct the absolute values $|x_1 - \mu_0|, |x_2 - \mu_0|, \ldots, |x_n - \mu_0|$. These absolute values are then put in order from smallest to largest, and ranks are then assigned to these absolute differences, the smallest getting rank 1, ..., the largest getting rank n. The test statistic is w, the sum of the ranks of the originally positive differences.
Before going further we consider a numerical example. We are interested in the heights of adult males and wish to test the null hypothesis that the mean height of adult males is 69. (All heights are measured in inches.) Thus $\mu_0 = 69$. We are not willing to believe that the height of an adult male taken at random has a normal distribution, but we are willing to believe that it has a symmetric distribution about its mean, so we plan to use the Wilcoxon one-sample test. Suppose that the heights of n = 7 randomly chosen adult males are:-
Heights:               67.2  69.3  69.7  68.5  69.8  68.9  69.4
Differences:           -1.8   0.3   0.7  -0.5   0.8  -0.1   0.4
Absolute differences:   1.8   0.3   0.7   0.5   0.8   0.1   0.4
Rank:                     7     2     5     4     6     1     3
Thus here w = 2 + 5 + 6 + 3 = 16.
What values of w lead us to reject the null hypothesis? If the alternative hypothesis is one-sided up ($\mu > 69$), and this alternative hypothesis is true, we will expect many of the differences to be positive and thus w would tend to be large. Hence sufficiently large values of w lead us to reject the null hypothesis. If the alternative hypothesis is one-sided down ($\mu < 69$), and this alternative hypothesis is true, we will expect few of the differences to be positive and thus w would tend to be small. Hence sufficiently small values of w lead us to reject the null hypothesis. If the alternative hypothesis is two-sided ($\mu \neq 69$), and this alternative hypothesis is true, we will expect that either many of the differences will be positive and thus w would tend to be large, or that many of the differences will be negative and thus w would tend to be small. Hence sufficiently large or small values of w lead us to reject the null hypothesis.
How large or how small? To answer this question we consider the "before the experiment" random variable W corresponding to w. With n data values the sum of all the ranks is $1 + 2 + \cdots + n = n(n + 1)/2$. This is 7(8)/2 = 28 in the above example. Under the null hypothesis the probability that any given one of the X values is greater than $\mu_0$ is 1/2 (this is where the assumption of symmetry of the distribution of each X comes in). This implies that if the null hypothesis is true, the mean value of W is n(n + 1)/4.
It is much harder to calculate the variance of W when the null hypothesis is true, so here is a "trust me" result: when the null hypothesis is true the variance of W is n(n + 1)(2n + 1)/24. Further, when the null hypothesis is true, W has close to a normal distribution (the Central Limit Theorem helps with this assertion). Thus instead of using w as test statistic it is convenient to use z, defined by
$$z = \frac{w - \dfrac{n(n+1)}{4}}{\sqrt{\dfrac{n(n+1)(2n+1)}{24}}}.$$
This is the end of Step 3: we take z as just defined as the test statistic.
What values of z lead us to reject the null hypothesis? When the null hypothesis is true, z is the observed value of a Z, that is, a random variable having very close to a normal distribution with mean 0 and variance 1. From the discussion above concerning the values of w which would lead to rejection of the null hypothesis, we would reject the null hypothesis in favor of the alternative hypothesis $\mu > \mu_0$ if z is sufficiently large positive, we would reject the null hypothesis in favor of the alternative hypothesis $\mu < \mu_0$ if z is sufficiently large negative, and we would reject the null hypothesis in favor of the alternative hypothesis $\mu \neq \mu_0$ if z is sufficiently large positive or negative. How large depends on the value chosen for $\alpha$ in Step 2, and the critical value(s) will be numbers such as $\pm 1.645$, $\pm 2.326$, $\pm 1.96$, and $\pm 2.575$, found from the z chart.
Suppose that in the above example the alternative hypothesis had been $\mu > 69$ and that we chose $\alpha = 0.05$. Then we would reject the null hypothesis if $z \geq 1.645$. With the above values
$$z = \frac{16 - \dfrac{7(8)}{4}}{\sqrt{\dfrac{7(8)(15)}{24}}} = 0.34,$$
and we do not have enough evidence to reject the null hypothesis.
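The whole calculation for this example can be checked with a few lines of code. A minimal Python sketch (again using scipy's rankdata, which is not part of this course):

```python
import numpy as np
from scipy.stats import rankdata

heights = np.array([67.2, 69.3, 69.7, 68.5, 69.8, 68.9, 69.4])
mu0 = 69
n = len(heights)

diffs = heights - mu0
ranks = rankdata(np.abs(diffs))           # ranks of the absolute differences
w = ranks[diffs > 0].sum()                # sum of the ranks of the positive differences: 16

mean_w = n * (n + 1) / 4                  # null hypothesis mean: 14
var_w = n * (n + 1) * (2 * n + 1) / 24    # null hypothesis variance: 35
z = (w - mean_w) / np.sqrt(var_w)
print(w, round(z, 2))                     # 16.0 and about 0.34
```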
On the horizon
The topics which naturally follow from those considered in this course are:-
1. ANOVA (The Analysis of Variance), considered initially as the extension of the two-sample t test to more than two samples, with many generalizations.
2. Correlation. (The relation between two random variables, for example the heights and weights of randomly chosen adult males.)
weights of randomly chosen adult males.)
3. Non-parametric methods in regression and other areas.
4. Sampling from a nite population.
5. Descriptive statistics.
6. Multiple regression.
These and other topics will be discussed in STAT 112.