Contents
1. Probability
   1.1  Introduction
   1.2  Sample spaces
   1.3  Events
   1.4  Partitioning sets and events
   1.5  Probability: a way of measuring sets
   1.6  Probabilities of combined events
   1.7  The Partition Theorem
   1.8  Examples of basic probability calculations
   1.9  Formal probability proofs: non-examinable
   1.10 Conditional Probability
   1.11 Examples of conditional probability and partitions
   1.12 Bayes' Theorem: inverting conditional probabilities
   1.13 Statistical Independence
   1.14 Random Variables
   1.15 Key Probability Results for Chapter 1
   1.16 Chains of events and probability trees: non-examinable
   1.17 Equally likely outcomes and combinatorics: non-examinable

2. Discrete Probability Distributions
   2.1  Introduction
   2.2  The probability function, fX(x)
   2.3  Bernoulli trials
   2.4  Example of the probability function: the Binomial Distribution
   2.5  The cumulative distribution function, FX(x)
   2.6  Hypothesis testing
   2.7  Example: Presidents and deep-sea divers
   2.8  Example: Birthdays and sports professionals
   2.9  Likelihood and estimation
   2.10 Random numbers and histograms
   2.11 Expectation
   2.12 Variable transformations
   2.13 Variance
   2.14 Mean and variance of the Binomial(n, p) distribution

3. Modelling with Discrete Probability Distributions
   3.1  Binomial distribution
   3.2  Geometric distribution
   3.3  Negative Binomial distribution
   3.4  Hypergeometric distribution: sampling without replacement
   3.5  Poisson distribution
   3.6  Subjective modelling

4. Continuous Random Variables
   4.1  Introduction
   4.2  The probability density function
   4.3  The Exponential distribution
   4.4  Likelihood and estimation for continuous random variables
   4.5  Hypothesis tests
   4.6  Expectation and variance
   4.7  Exponential distribution mean and variance
   4.8  The Uniform distribution
   4.9  The Change of Variable Technique: finding the distribution of g(X)
   4.10 Change of variable for non-monotone functions: non-examinable
   4.11 The Gamma distribution
   4.12 The Beta Distribution: non-examinable

5. The Normal Distribution and the Central Limit Theorem
   5.1  The Normal Distribution
   5.2  The Central Limit Theorem (CLT)

6. Wrapping Up
   6.1  Estimators: the good, the bad, and the estimator PDF
   6.2  Hypothesis tests: in search of a distribution
Chapter 1: Probability
1.1 Introduction

Definition: A probability is a number between 0 and 1 representing how likely it is that an event will occur.

Probabilities can be:

1. Frequentist (based on frequencies),
e.g. P(someone tossing a fair coin gets a head) = (number of heads) / (number of tosses), over a long run of tosses.

2. Subjective: probability represents a person's degree of belief that an event will occur,
e.g. I think there is an 80% chance it will rain today, written as P(rain) = 0.80.
Regardless of how we obtain probabilities, we always combine and manipulate them according to the same rules.

1.2 Sample spaces

Definition: A random experiment is an experiment whose outcome is not known until it is observed.

Definition: A sample space, Ω, is a set of outcomes of a random experiment. Every possible outcome must be listed once and only once.

Definition: A sample point is an element of the sample space. For example, if the sample space is Ω = {s1, s2, s3}, then each si is a sample point.
Examples:

Experiment: Toss a coin twice and observe the result.
Sample space: Ω = {HH, HT, TH, TT}. An example of a sample point is: HT.

Experiment: Toss a coin twice and count the number of heads.
Sample space: Ω = {0, 1, 2}.

Experiment: Toss a coin twice and observe whether the two tosses are the same (e.g. HH or TT).
Sample space: Ω = {same, different}.

Discrete and continuous sample spaces

Definition: A sample space is finite if it has a finite number of elements.

Definition: A sample space is discrete if there are gaps between the different elements, or if the elements can be listed, even if an infinite list (e.g. 1, 2, 3, . . .). In mathematical language, a sample space is discrete if it is finite or countable.

Definition: A sample space is continuous if there are no gaps between the elements, so the elements cannot be listed (e.g. the interval [0, 1]).

Examples:
Ω = {0, 1, 2, 3}            (discrete and finite)
Ω = {4.5, 4.6, 4.7}         (discrete, finite)
Ω = {0, 1, 2, 3, . . .}     (discrete, infinite)
Ω = {HH, HT, TH, TT}        (discrete, finite)
Ω = {[0, 90), [90, 360)}    (discrete, finite)
1.3 Events

Suppose you are setting out to create a science of randomness. Somehow you need to harness the idea of randomness, which is all about the unknown, and express it in terms of mathematics. How would you do it?

So far, we have introduced the sample space, Ω, which lists all possible outcomes of a random experiment, and might seem unexciting.

[Photo: Kolmogorov (1903-1987), one of the founders of probability theory.]

However, Ω is a set. It lays the ground for a whole mathematical formulation of randomness, in terms of set theory. The next concept that you would need to formulate is that of something that happens at random, or an event. How would you express the idea of an event in terms of set theory?

Definition: An event is a subset of the sample space. That is, any collection of outcomes forms an event.

Example: Toss a coin twice. Sample space: Ω = {HH, HT, TH, TT}.
Let event A be the event that there is exactly one head. We write: A = "exactly one head".
Then A = {HT, TH}. A is a subset of Ω, as in the definition. We write A ⊆ Ω.

Definition: Event A occurs if we observe an outcome that is a member of the set A.

Note: Ω is a subset of itself, so Ω is an event. The empty set, ∅ = {}, is also a subset of Ω. This is called the null event, or the event with no outcomes.

Example: Experiment: throw 2 dice.
Sample space: Ω = {(1, 1), (1, 2), . . . , (1, 6), (2, 1), (2, 2), . . . , (2, 6), . . . , (6, 6)}.
Event A = "sum of two faces is 5" = {(1, 4), (2, 3), (3, 2), (4, 1)}.
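The notes themselves are code-free, but as an informal aside, "events are subsets of the sample space" translates directly into sets in a programming language. A minimal sketch in Python (an illustration only, not part of the course material):

```python
from itertools import product

# Sample space for throwing two dice: all 36 ordered pairs (i, j).
omega = set(product(range(1, 7), repeat=2))

# Event A = "sum of the two faces is 5": a subset of omega.
A = {s for s in omega if sum(s) == 5}

print(sorted(A))   # [(1, 4), (2, 3), (3, 2), (4, 1)]
print(len(A), len(omega))
```

For equally likely outcomes (Section 1.17), the probability of any such event is then simply len(A) / len(omega).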
Combining Events

Formulating random events in terms of sets gives us the power of set theory to describe all possible ways of combining or manipulating events. For example, we need to describe things like coincidences (events happening together), alternatives, opposites, and so on. We do this in the language of set theory.

Example: Suppose our random experiment is to pick a person in the class and see what form(s) of transport they used to get to campus today. The different forms of transport can be displayed as overlapping events on a Venn diagram.
1. Alternatives: the union ("or") operator, ∪

We wish to describe an event that is composed of several different alternatives. For example, the event that you used a motor vehicle to get to campus is the event that your journey involved a car, or a bus, or both. To represent the set of journeys involving either alternative, we shade all outcomes in Bus and all outcomes in Car.

Overall, we have shaded all outcomes in the UNION of Bus and Car.

We write the event that you used a motor vehicle as the event Bus ∪ Car, read as "Bus union Car". The union operator, ∪, denotes Bus OR Car OR both.

Note: Be careful not to confuse "Or" and "And". To shade the union of Bus and Car, we had to shade everything in Bus AND everything in Car. To remember whether union refers to "Or" or "And", consider what an outcome must satisfy to be in the union:

A ∪ B = {s : s ∈ A or s ∈ B or both}.
2. Concurrences and coincidences: the intersection ("and") operator, ∩

The intersection is an event that occurs when two or more events ALL occur together.

For example, consider the event that your journey today involved BOTH a car AND a train. To represent this event, we shade all outcomes in the OVERLAP of Car and Train.

We write the event that you used both car and train as Car ∩ Train, read as "Car intersect Train". The intersection operator, ∩, denotes both Car AND Train together:

A ∩ B = {s : s ∈ A and s ∈ B}.
3. Opposites: the complement or "not" operator

The complement of an event is the opposite of the event: everything EXCEPT the event.

For example, consider the event that your journey today did NOT involve walking. To represent this event, we shade all outcomes in Ω except those in the event Walk.

We write the event "not Walk" as Walk̄.

Definition: The complement of event A is written Ā, and is given by Ā = {s : s ∉ A}.

Examples: Experiment: Pick a person in this class at random.
Sample space: Ω = {all people in class}.
Let event A = "person is male" and event B = "person travelled by bike today".
Suppose I pick a male who did not travel by bike. Say whether the following events have occurred:

1) A: Yes.    2) B: No.    3) Ā: No.    4) B̄: Yes.

5) Ā ∪ B = {female or bike rider or both}: No.
6) A ∩ B̄ = {male and non-biker}: Yes.
7) A ∩ B = {male and bike rider}: No.
8) Ā ∪ B̄ = everything outside A ∩ B. A ∩ B did not occur, so Ā ∪ B̄ did occur: Yes.

Question: What is the event Ω̄?  Ω̄ = ∅.

Challenge: can you express A ∩ B using only unions and complements?
Answer: A ∩ B is the complement of Ā ∪ B̄.
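As a sketch only (the miniature four-person "class" and its outcome labels are invented for illustration), the answers above can be checked mechanically with set operations, taking the complement as a set difference from the sample space:

```python
# Each outcome is a (gender, biked_today) pair; a tiny hypothetical sample space.
omega = {("male", True), ("male", False), ("female", True), ("female", False)}

A = {s for s in omega if s[0] == "male"}   # A = "person is male"
B = {s for s in omega if s[1]}             # B = "person travelled by bike"

A_bar = omega - A   # complement of A, relative to omega
B_bar = omega - B

outcome = ("male", False)   # we picked a male who did not bike

print(outcome in A)                # True:  A occurred
print(outcome in B)                # False: B did not occur
print(outcome in (A & B))          # False: A ∩ B did not occur
print(outcome in (A_bar | B_bar))  # True:  Ā ∪ B̄ occurred
```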
Limitations of Venn diagrams

Venn diagrams are generally useful for up to 3 events, although they are not used to provide formal proofs. For more than 3 events, the diagram might not be able to represent all possible overlaps of events. (This was probably the case for our transport Venn diagram.)

Example: [Venn diagrams showing (a) A ∩ B ∩ C and (b) A ∪ B ∪ C for three overlapping events A, B, C.]
Properties of union, intersection, and complement

The following properties hold.

(i) ∅̄ = Ω and Ω̄ = ∅.

(ii) For any event A,
     A ∪ Ā = Ω   and   A ∩ Ā = ∅.

(iii) For any events A and B,
      A ∪ B = B ∪ A,   A ∩ B = B ∩ A.      (Commutative.)

(iv) De Morgan's laws:
     (a) the complement of (A ∪ B) is Ā ∩ B̄;
     (b) the complement of (A ∩ B) is Ā ∪ B̄.
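De Morgan's laws can be spot-checked by brute force on any finite sample space. A small Python sketch (the two-dice events here are chosen arbitrarily for the check):

```python
from itertools import product

omega = set(product(range(1, 7), repeat=2))   # two-dice sample space
A = {s for s in omega if s[0] >= 4}           # first die shows 4 or more
B = {s for s in omega if sum(s) == 7}         # faces sum to 7

def comp(E):
    """Complement of E, relative to the sample space."""
    return omega - E

assert comp(A | B) == comp(A) & comp(B)   # complement of (A ∪ B) is Ā ∩ B̄
assert comp(A & B) == comp(A) | comp(B)   # complement of (A ∩ B) is Ā ∪ B̄
print("De Morgan's laws verified on this sample space")
```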
Distributive laws

We are familiar with the fact that multiplication is distributive over addition. This means that, if a, b, and c are any numbers, then

a × (b + c) = a × b + a × c.

However, addition is not distributive over multiplication:

a + (b × c) ≠ (a + b) × (a + c).

For set union and set intersection, union is distributive over intersection, AND intersection is distributive over union. Thus, for any sets A, B, and C,

A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C),

and

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).

More generally, for several events A and B1, B2, . . . , Bn,

A ∪ (B1 ∩ B2 ∩ . . . ∩ Bn) = (A ∪ B1) ∩ (A ∪ B2) ∩ . . . ∩ (A ∪ Bn),

i.e.  A ∪ ( ⋂_{i=1}^{n} Bi ) = ⋂_{i=1}^{n} (A ∪ Bi),

and

A ∩ (B1 ∪ B2 ∪ . . . ∪ Bn) = (A ∩ B1) ∪ (A ∩ B2) ∪ . . . ∪ (A ∩ Bn),

i.e.  A ∩ ( ⋃_{i=1}^{n} Bi ) = ⋃_{i=1}^{n} (A ∩ Bi).
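The distributive laws are easy to sanity-check on small arbitrary sets; a quick sketch (the three sets are invented for the check):

```python
# Arbitrary small sets of integers, standing in for events.
A = {1, 2, 3, 4}
B = {3, 4, 5}
C = {4, 5, 6, 7}

# Union distributes over intersection, and intersection over union.
assert A | (B & C) == (A | B) & (A | C)
assert A & (B | C) == (A & B) | (A & C)
print("distributive laws verified")
```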
1.4 Partitioning sets and events

The idea of a partition is fundamental in probability manipulations. Later in this chapter we will encounter the important Partition Theorem. For now, we give some background definitions.

Definition: Two events A and B are mutually exclusive, or disjoint, if A ∩ B = ∅.

This means events A and B cannot happen together. If A happens, it excludes B from happening, and vice-versa.

[Venn diagram: two disjoint sets A and B.]

Question: does this mean that A and B are independent? No: quite the opposite. A EXCLUDES B from happening, so B depends strongly on whether or not A happens.

Definition: Any number of events A1, A2, . . . , Ak are mutually exclusive if every pair of the events is mutually exclusive: i.e. Ai ∩ Aj = ∅ for all i, j with i ≠ j.

[Venn diagram: three disjoint sets A1, A2, A3.]

Definition: A partition of the sample space Ω is a collection of mutually exclusive events whose union is Ω. That is, sets B1, B2, . . . , Bk form a partition of Ω if

Bi ∩ Bj = ∅ for all i, j with i ≠ j,   and   B1 ∪ B2 ∪ . . . ∪ Bk = Ω.
[Venn diagram: B1, . . . , B5 partition Ω.]

Partitioning an event A

Any set or event A can be partitioned: it doesn't have to be Ω. If B1, . . . , Bk form a partition of Ω, then (A ∩ B1), . . . , (A ∩ Bk) form a partition of A.

[Venn diagram: the partition B1, . . . , Bk of Ω cuts A into chunks A ∩ B1, . . . , A ∩ Bk.]

We will see that this is very useful for finding the probability of event A. This is because it is often easier to find the probability of small chunks of A (the partitioned sections) than to find the whole probability of A at once. The partition idea shows us how to add the probabilities of these chunks together: see later.
1.5 Probability: a way of measuring sets

Remember that you are given the job of building the science of randomness. This means somehow measuring chance. If I sent you away to measure heights, the first thing you would ask is what you are supposed to be measuring the heights of. People? Trees? Mountains?

We have the same question when setting out to measure chance. Chance of what? The answer is sets. It was clever to formulate our notions of events and sample spaces in terms of sets: it gives us something to measure. "Probability", the name that we give to our chance-measure, is a way of measuring sets.

You probably already have a good idea for a suitable way to measure the size of a set or event. Why not just count the number of elements in it?

In fact, this is often what we do to measure probability (although counting the number of elements can be far from easy!) But there are circumstances where this is not appropriate.

What happens, for example, if one set is far more likely than another, but they have the same number of elements? Should they be the same probability?

First set: {Springboks win}. Second set: {All Blacks win}. Each set has one element, but we may not want to give them the same probability.

Another difficulty arises when the sample space is infinite or continuous. Should the intervals [3, 4] and [13, 14] be the same probability, just because they are the same length? Yes they should, if (say) our random experiment is to pick a random number on [0, 20]; but no they shouldn't (hopefully!) if our experiment was the time in years taken by a student to finish their degree.

Most of this course is about probability distributions. A probability distribution is a rule according to which probability is apportioned, or distributed, among the different sets or events in the sample space.

At its simplest, a probability distribution just lists every element in the sample space and allots it a probability between 0 and 1, such that the total sum of probabilities is 1.

In the rugby example, we could use the following probability distribution:
P(Springboks win) = 0.3,  P(All Blacks win) = 0.7.
In general, we have the following definition for discrete sample spaces.

Discrete probability distributions

Definition: Let Ω = {s1, s2, . . .} be a discrete sample space. A discrete probability distribution on Ω is a set of real numbers {p1, p2, . . .} associated with the sample points {s1, s2, . . .} such that:

1. 0 ≤ pi ≤ 1 for all i;
2. Σ_i pi = 1.

pi is called the probability of the event that the outcome is si. We write: pi = P(si).

The rule for measuring the probability of any set, or event, A ⊆ Ω, is to sum the probabilities of the elements of A:

P(A) = Σ_{i ∈ A} pi.
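As an illustrative sketch (the two-coin-toss distribution used here is assumed, not taken from this page), the definition translates directly into code: store the pi, check the two conditions, and sum over an event's elements.

```python
from fractions import Fraction

# A discrete distribution on omega = {0, 1, 2}:
# the number of heads in two fair coin tosses.
p = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

# Check the two conditions of the definition.
assert all(0 <= pi <= 1 for pi in p.values())
assert sum(p.values()) == 1

def prob(A):
    """P(A) for an event A: sum the probabilities of the elements of A."""
    return sum(p[s] for s in A)

print(prob({1, 2}))   # P(at least one head) = 3/4
```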
Continuous probability distributions

On a continuous sample space Ω, e.g. Ω = [0, 1], we can not list all the elements and give them an individual probability. We will need more sophisticated methods detailed later in the course.

However, the same principle applies. A continuous probability distribution is a rule under which we can calculate a probability between 0 and 1 for any set, or event, A ⊆ Ω.

Probability Axioms

For any sample space, discrete or continuous, all of probability theory is based on the following three definitions, or axioms.

Axiom 1: P(Ω) = 1.
Axiom 2: 0 ≤ P(A) ≤ 1 for all events A.
Axiom 3: If A1, A2, . . . , An are mutually exclusive events (no overlap), then
         P(A1 ∪ A2 ∪ . . . ∪ An) = P(A1) + P(A2) + . . . + P(An).

Note: The axioms can never be "proved": they are definitions. If our rule for measuring sets satisfies the three axioms, it is a valid probability distribution.

The idea of the axioms is that ALL possible properties of probability should be derivable using ONLY these three axioms. To see how this works, see Section 1.9 (non-examinable).

The definition of a discrete probability distribution given above clearly satisfies the axioms. The challenge of defining a probability distribution on a continuous sample space is left till later.

Note: P(∅) = 0.

Note: Remember that an EVENT is a SET: an event is a subset of the sample space.
1.6 Probabilities of combined events

In Section 1.3 we discussed unions, intersections, and complements of events. We now look at the probabilities of these combinations. Everything below applies to events (sets) in either a discrete or a continuous sample space.

1. Probability of a union

Let A and B be events on a sample space Ω. There are two cases for the probability of the union A ∪ B:

1. A and B are mutually exclusive (no overlap): i.e. A ∩ B = ∅.
2. A and B are not mutually exclusive: A ∩ B ≠ ∅.

For Case 1, we get the probability of A ∪ B straight from Axiom 3:

If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B).

For Case 2, we have the formula: for ALL events A and B,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Note: The formula for Case 2 applies also to Case 1: just substitute P(A ∩ B) = P(∅) = 0.

For three or more events: e.g. for any A, B, and C,

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).
Explanation

For any events A and B, P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

The formal proof of this formula is in Section 1.9 (non-examinable). To understand the formula, think of the Venn diagrams:

[Venn diagrams: A ∪ B drawn as all of A, together with B \ (A ∩ B).]

When we add P(A) + P(B), we count the intersection A ∩ B twice, so we must subtract it once. Alternatively, think of A ∪ B as two disjoint sets: all of A, and the bits of B without the intersection. So P(A ∪ B) = P(A) + {P(B) − P(A ∩ B)}.
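The formula can be verified numerically on an equally likely sample space. A sketch with two arbitrarily chosen two-dice events:

```python
from itertools import product
from fractions import Fraction

omega = set(product(range(1, 7), repeat=2))   # two fair dice, equally likely

def prob(E):
    return Fraction(len(E), len(omega))

A = {s for s in omega if s[0] == 6}        # first die shows a 6
B = {s for s in omega if sum(s) >= 10}     # total is at least 10

lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)      # inclusion-exclusion
assert lhs == rhs
print(lhs)   # 1/4
```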
2. Probability of an intersection

There is no easy formula for P(A ∩ B). We might be able to use statistical independence (Section 1.13). If A and B are not statistically independent, we often use conditional probability (Section 1.10).

3. Probability of a complement

P(Ā) = 1 − P(A). This is obvious, but a formal proof is given in Section 1.9.
1.7 The Partition Theorem

The Partition Theorem is one of the most useful tools for probability calculations. It is based on the fact that probabilities are often easier to calculate if we break down a set into smaller parts.

Recall that a partition of Ω is a collection of non-overlapping sets B1, . . . , Bm which together cover everything in Ω.

[Venn diagram: B1, B2, B3, B4 partitioning Ω.]

Also, if B1, . . . , Bm form a partition of Ω, then (A ∩ B1), . . . , (A ∩ Bm) form a partition of the set or event A.

[Venn diagram: A cut into chunks A ∩ B1, A ∩ B2, A ∩ B3, A ∩ B4.]

The probability of event A is therefore the sum of its parts:

P(A) = P(A ∩ B1) + P(A ∩ B2) + P(A ∩ B3) + P(A ∩ B4).

The Partition Theorem is a mathematical way of saying "the whole is the sum of its parts".

Theorem 1.7: The Partition Theorem. (Proof in Section 1.9.)

If B1, B2, . . . , Bm form a partition of Ω, then for ANY event A,

P(A) = Σ_{i=1}^{m} P(A ∩ Bi).

Note: Recall the formal definition of a partition. Sets B1, B2, . . . , Bm form a partition of Ω if Bi ∩ Bj = ∅ for all i ≠ j, and ⋃_{i=1}^{m} Bi = Ω.
1.8 Examples of basic probability calculations

An Australian survey asked people what sort of car they would like if they could choose any car at all. 13% of respondents had children and chose a large car. 12% of respondents did not have children and chose a large car. 33% of respondents had children.

Find the probability that a respondent:
(a) chose a large car;
(b) either had children or chose a large car (or both).

First formulate events: let C = "respondent has children" and L = "respondent chose a large car".

Information given: P(C ∩ L) = 0.13; P(C̄ ∩ L) = 0.12; P(C) = 0.33.

(a) P(L) = P(L ∩ C) + P(L ∩ C̄) = 0.13 + 0.12 = 0.25.   (Partition Theorem)

(b) P(C ∪ L) = P(C) + P(L) − P(C ∩ L) = 0.33 + 0.25 − 0.13 = 0.45.   (Section 1.6)
Example 2: Facebook statistics for New Zealand university students aged between 18 and 24 suggest that 22% are interested in music, while 34% are interested in sport.

Formulate events: M = "interested in music", S = "interested in sport".

Information given: P(M) = 0.22; P(S) = 0.34; P(M ∪ S) = 0.48.

(a) What is P(M)?

P(M) = 0.22 (given).

(b) What is P(M ∩ S)?

P(M ∪ S) = P(M) + P(S) − P(M ∩ S), thus

P(M ∩ S) = P(M) + P(S) − P(M ∪ S) = 0.22 + 0.34 − 0.48 = 0.08.   (Section 1.6)

(d) Find the probability that a student is interested in music, but not sport.

P(M ∩ S̄) = P(M) − P(M ∩ S) = 0.22 − 0.08 = 0.14.   (Partition Theorem)
1.9 Formal probability proofs: non-examinable

If you are a mathematician, you will be interested to see how properties of probability are proved formally. Only the Axioms, together with standard set-theoretic results, may be used.

Theorem: The probability measure P has the following properties.

(i) P(∅) = 0.

(ii) P(Ā) = 1 − P(A) for any event A.

(iii) (Partition Theorem.) If B1, B2, . . . , Bm form a partition of Ω, then for any event A,

P(A) = Σ_{i=1}^{m} P(A ∩ Bi).

(iv) P(A ∪ B) = P(A) + P(B) − P(A ∩ B) for any events A, B.

Proof:

i) For any A, we have A = A ∪ ∅; and A ∩ ∅ = ∅ (mutually exclusive).
So P(A) = P(A ∪ ∅) = P(A) + P(∅)   (Axiom 3)
⟹ P(∅) = 0.

ii) Ω = A ∪ Ā; and A ∩ Ā = ∅ (mutually exclusive).
So 1 = P(Ω) = P(A ∪ Ā) = P(A) + P(Ā)   (Axioms 1 and 3)
⟹ P(Ā) = 1 − P(A).

iii) Suppose B1, . . . , Bm are a partition of Ω: then Bi ∩ Bj = ∅ if i ≠ j, and ⋃_{i=1}^{m} Bi = Ω.
Thus, (A ∩ Bi) ∩ (A ∩ Bj) = A ∩ (Bi ∩ Bj) = A ∩ ∅ = ∅, for i ≠ j,
i.e. (A ∩ B1), . . . , (A ∩ Bm) are mutually exclusive also.

So,
Σ_{i=1}^{m} P(A ∩ Bi) = P( ⋃_{i=1}^{m} (A ∩ Bi) )    (Axiom 3)
                      = P( A ∩ ⋃_{i=1}^{m} Bi )      (Distributive laws)
                      = P(A ∩ Ω)
                      = P(A).

iv) A ∪ B = (A ∩ B̄) ∪ (A ∩ B) ∪ (Ā ∩ B).
These 3 events are mutually exclusive: e.g. (A ∩ B̄) ∩ (A ∩ B) = A ∩ (B̄ ∩ B) = A ∩ ∅ = ∅, etc.
So,
P(A ∪ B) = P(A ∩ B̄) + P(A ∩ B) + P(Ā ∩ B)          (Axiom 3)
         = [P(A) − P(A ∩ B)]                        (from (iii), using B and B̄)
           + P(A ∩ B)
           + [P(B) − P(A ∩ B)]                      (from (iii), using A and Ā)
         = P(A) + P(B) − P(A ∩ B).
1.10 Conditional Probability

Conditioning is another of the fundamental tools of probability: probably the most fundamental tool. It is especially helpful for calculating the probabilities of intersections, such as P(A ∩ B), which themselves are critical for the useful Partition Theorem.

Additionally, the whole field of stochastic processes (Stats 320 and 325) is based on the idea of conditional probability. What happens next in a process depends, or is conditional, on what has happened beforehand.

Dependent events

Suppose A and B are two events on the same sample space. There will often be dependence between A and B. This means that if we know that B has occurred, it changes our knowledge of the chance that A will occur.

Example: Toss a die once.

Let event A = "get a 6".
Let event B = "get 4 or better".

If the die is fair, then P(A) = 1/6 and P(B) = 1/2.

However, if we know that B has occurred, then there is an increased chance that A has occurred:

P(A occurs given that B has occurred) = 1/3.      (result 6, out of the results 4 or 5 or 6)

We write P(A | B) = 1/3.

Question: what would be P(B | A)?

P(B | A) = P(B occurs, given that A has occurred)
         = P(get 4 or better, given that we know we got a 6)
         = 1.
Conditioning as "reducing the sample space"

Sally wants to use Facebook to find a boyfriend at Uni. Her friend Kate tells her not to bother, because "there are more women than men on Facebook". Here are the 2012 figures for Facebook users at the University of Auckland:

Relationship status    Male    Female    Total
Single                  700      560      1260
In a relationship       460      660      1120
Total                  1160     1220      2380

Is Kate right? Should Sally sign out of Facebook for good?

No, because out of the SINGLE people on Facebook, there are a lot more men than women!

Conditioning is all about the sample space of interest. The table above shows the following sample space:

Ω = {Facebook users at UoA}.

But the sample space that should interest Sally is different: it is

S = {members of Ω who are SINGLE}.

Suppose we pick a person from those in the table. Define event M to be: M = "person is male".

Kate is referring to the following probability:

P(M) = (# Ms) / (total # in table) = 1160 / 2380 = 0.49.

Kate is correct that there are more women than men on Facebook, but she is using the wrong sample space so her answer is not relevant.
Now suppose we reduce our sample space from

Ω = {everyone in the table}

to

S = {single people in the table}.

Then

P(person is male, given that the person is single) = (# single males) / (# singles)
                                                   = 700 / 1260
                                                   = 0.56.

We write:

P(M | S) = 0.56.

This is the probability that Sally is interested in, and she can rest assured that there are more single men than single women out there.

Example: Define event R that a person is in a relationship. What is the proportion of males among people in a relationship, P(M | R)?

P(M | R) = (# males in a relationship) / (# in a relationship)
         = 460 / 1120
         = 0.41.
We could follow the same working for any pair of events, A and B:

P(A | B) = (# who are A and B) / (# who are B)
         = [ (# who are A and B) / (# in Ω) ] / [ (# who are B) / (# in Ω) ]
         = P(A ∩ B) / P(B).

This is our definition of conditional probability:

Definition: Let A and B be two events. The conditional probability that event A occurs, given that event B has occurred, is written P(A | B), and is given by

P(A | B) = P(A ∩ B) / P(B).

Read P(A | B) as "probability of A, given B".

Note: P(A | B) gives P(A and B, from within the set of Bs only).

Note: Follow the reasoning above carefully. It is important to understand why the conditional probability is the probability of the intersection within the new sample space.
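Using the counts from the Facebook table, the definition can be checked numerically. A Python sketch (the dictionary encoding of the table is mine):

```python
from fractions import Fraction

# Counts from the Facebook table: (gender, status) -> count.
counts = {("male", "single"): 700, ("female", "single"): 560,
          ("male", "rel"): 460,   ("female", "rel"): 660}
total = sum(counts.values())   # 2380

def prob(pred):
    """P(event) = (# matching outcomes) / (# in omega), weighting by counts."""
    return Fraction(sum(n for k, n in counts.items() if pred(k)), total)

M = lambda k: k[0] == "male"     # person is male
S = lambda k: k[1] == "single"   # person is single

# Definition: P(M | S) = P(M ∩ S) / P(S).
p_M_given_S = prob(lambda k: M(k) and S(k)) / prob(S)
print(float(p_M_given_S))   # 0.5555..., i.e. 0.56 to 2 d.p.
```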
Note: In the Facebook example, we found that P(M | S) = 0.56, and P(M | R) = 0.41. This means that a single person on UoA Facebook is more likely to be male than female, but a person in a relationship is more likely to be female than male! Why the difference? Your guess is as good as mine, but I think it's because men in a relationship are too busy buying flowers for their girlfriends to have time to spend on Facebook.

The symbol P belongs to the sample space Ω

Recall the first of our probability axioms: P(Ω) = 1. This shows that the symbol P is defined with respect to Ω. That is, P BELONGS to the sample space Ω. If we change the sample space, we need to change the symbol P. This is what we do in conditional probability:

to change the sample space from Ω to B, say, we change from the symbol P to the symbol P( · | B).

The symbol P( · | B) should behave exactly like the symbol P. For example:

P(C ∪ D) = P(C) + P(D) − P(C ∩ D),

so

P(C ∪ D | B) = P(C | B) + P(D | B) − P(C ∩ D | B).

Trick for checking conditional probability calculations: A useful trick for checking a conditional probability expression is to replace the conditioned set by Ω, and see whether the expression is still true.

For example, is P(A | B) + P(Ā | B) = 1?

Answer: Replace B by Ω: this gives P(A | Ω) + P(Ā | Ω) = P(A) + P(Ā) = 1. So, yes, the expression is true.
Is P(A | B) + P(Ā | B̄) = 1?

Try to replace the conditioning set by Ω: we can't! There are two conditioning sets: B and B̄. The expression is NOT true. It doesn't make sense to try to add together probabilities from two different sample spaces.

The Multiplication Rule

For any events A and B,

P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A).

Proof:

P(A | B) P(B) = [ P(A ∩ B) / P(B) ] × P(B) = P(A ∩ B),

and

P(B | A) P(A) = [ P(B ∩ A) / P(A) ] × P(A) = P(B ∩ A) = P(A ∩ B).
New statement of the Partition Theorem

The Multiplication Rule gives us a new statement of the Partition Theorem: If B1, . . . , Bm partition Ω, then for any event A,

P(A) = Σ_{i=1}^{m} P(A ∩ Bi) = Σ_{i=1}^{m} P(A | Bi) P(Bi).

Both formulations of the Partition Theorem are very widely used, but especially the conditional formulation Σ_{i=1}^{m} P(A | Bi) P(Bi).

Warning: Be sure to use this new version of the Partition Theorem correctly: it is

P(A) = Σ_{i=1}^{m} P(A | Bi) P(Bi),   NOT   P(A) = Σ_{i=1}^{m} P(A | Bi).
Conditional probability and Peter Pan

When Peter Pan was hungry but had nothing to eat, he would pretend to eat. (An excellent strategy, I have always found.)

Conditional probability is the Peter Pan of Stats 210. When you don't know something that you need to know, pretend you know it.

Conditioning on an event is like pretending that you know that the event has happened. For example, if you know the probability of getting to work on time in different weather conditions, but you don't know what the weather will be like today, pretend you do: condition on each type of weather in turn, then combine the answers with the Partition Theorem.
1.11 Examples of conditional probability and partitions

Tom gets the bus to campus every day. The bus is on time with probability 0.6, and late with probability 0.4. The sample space can be written as Ω = {bus journeys}. We can formulate events as follows:

T = "on time"; L = "late".

From the information given, the events have probabilities:

P(T) = 0.6;  P(L) = 0.4.

(a) Do the events T and L form a partition of the sample space Ω? Explain why or why not.

Yes: they cover all possible journeys (probabilities sum to 1), and there is no overlap in the events by definition.

The buses are sometimes crowded and sometimes noisy, both of which are problems for Tom as he likes to use the bus journeys to do his Stats assignments. When the bus is on time, it is crowded with probability 0.5. When it is late, it is crowded with probability 0.7. The bus is noisy with probability 0.8 when it is crowded, and with probability 0.4 when it is not crowded.

(b) Formulate events C and N corresponding to the bus being crowded and noisy. Do the events C and N form a partition of the sample space? Explain why or why not.

Let C = "crowded", N = "noisy". C and N do NOT form a partition of Ω. It is possible for the bus to be noisy when it is crowded, so there must be some overlap between C and N.

(c) Write down probability statements corresponding to the information given above. Your answer should involve two statements linking C with T and L, and two statements linking N with C.

P(C | T) = 0.5;  P(C | L) = 0.7.
P(N | C) = 0.8;  P(N | C̄) = 0.4.

(d) Find the probability that the bus is crowded, and the probability that it is noisy.

P(C) = P(C | T)P(T) + P(C | L)P(L) = 0.5 × 0.6 + 0.7 × 0.4 = 0.58.   (Partition Theorem)

P(N) = P(N | C)P(C) + P(N | C̄)P(C̄) = 0.8 × 0.58 + 0.4 × (1 − 0.58) = 0.632.   (Partition Theorem)
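The two Partition Theorem calculations for Tom's bus are easy to reproduce in code; a quick sketch:

```python
# Probabilities from the bus example.
p_T, p_L = 0.6, 0.4                    # on time / late: a partition of omega
p_C_given_T, p_C_given_L = 0.5, 0.7    # crowded, conditional on T and L
p_N_given_C, p_N_given_notC = 0.8, 0.4 # noisy, conditional on C and not-C

# Partition Theorem: P(C) = P(C|T)P(T) + P(C|L)P(L).
p_C = p_C_given_T * p_T + p_C_given_L * p_L

# Partition Theorem again, now conditioning on C and its complement.
p_N = p_N_given_C * p_C + p_N_given_notC * (1 - p_C)

print(round(p_C, 2))   # 0.58
print(round(p_N, 3))   # 0.632
```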
1.12 Bayes' Theorem: inverting conditional probabilities

The Multiplication Rule gives P(B ∩ A) = P(A | B) P(B), and also P(B ∩ A) = P(A ∩ B) = P(B | A) P(A).

Thus

P(B | A) = P(A | B) P(B) / P(A).      (⋆)

This is the simplest form of Bayes' Theorem, named after Thomas Bayes (1702-61), English clergyman and founder of Bayesian Statistics.

Bayes' Theorem allows us to "invert" the conditioning, i.e. to express P(B | A) in terms of P(A | B). This is very useful. For example, it might be easy to calculate P(later event | earlier event), but we might only observe the later event and wish to deduce the probability that the earlier event occurred, P(earlier event | later event).

Full statement of Bayes' Theorem:

Theorem 1.12: Let B1, B2, . . . , Bm form a partition of Ω. Then for any event A, and for any j = 1, . . . , m,

P(Bj | A) = P(A | Bj) P(Bj) / Σ_{i=1}^{m} P(A | Bi) P(Bi).      (Bayes' Theorem)

Proof: Immediate from (⋆) (put B = Bj), and the Partition Rule, which gives P(A) = Σ_{i=1}^{m} P(A | Bi) P(Bi).

Special case: if B and B̄ partition Ω, then

P(B | A) = P(A | B) P(B) / { P(A | B) P(B) + P(A | B̄) P(B̄) }.
Example: The case of the Perdious Gardener. Mr Smith owns a hysterical rosebush. It will die with probability 1/2 if watered, and with probability 3/4 if not watered. Worse still, Smith employs a perdious gardener who will fail to water the rosebush with probability 2/3. Smith returns from holiday to nd the rosebush . . . DEAD!!! What is the probability that the gardener did not water it?
P(D | W ) =
3 4
P(W ) =
2 3
1 (so P(W ) = 3 )
Fourth step: compare this to what we know Need to invert the conditioning, so use Bayes Theorem:
P(W̄ | D) = P(D | W̄) P(W̄) / [P(D | W̄) P(W̄) + P(D | W) P(W)] = (3/4 × 2/3) / (3/4 × 2/3 + 1/2 × 1/3) = 3/4.
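The gardener's posterior can be checked with exact rational arithmetic. A quick Python sketch using the same events D and W:

```python
from fractions import Fraction

# Given: P(D|W) = 1/2, P(D|W') = 3/4, P(W') = 2/3, so P(W) = 1/3.
p_D_given_W = Fraction(1, 2)
p_D_given_notW = Fraction(3, 4)
p_notW = Fraction(2, 3)
p_W = 1 - p_notW

# Bayes' Theorem: P(W'|D) = P(D|W')P(W') / [P(D|W')P(W') + P(D|W)P(W)]
posterior = (p_D_given_notW * p_notW) / (
    p_D_given_notW * p_notW + p_D_given_W * p_W)

print(posterior)  # 3/4
```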
Example: The case of the Defective Ketchup Bottle. Ketchup bottles are produced in 3 different factories, accounting for 50%, 30%, and 20% of the total output respectively. The percentages of bottles from the 3 factories that are defective are respectively 0.4%, 0.6%, and 1.2%. A statistics lecturer who eats only ketchup finds a defective bottle in her lunchbox. What is the probability that it came from Factory 1?
Solution:
1. Events: let Fi = bottle comes from Factory i (i = 1, 2, 3); let D = bottle is defective.
2. Information given:
P(F1) = 0.5;  P(F2) = 0.3;  P(F3) = 0.2;
P(D | F1) = 0.004;  P(D | F2) = 0.006;  P(D | F3) = 0.012.
3. Looking for:
P(F1 | D)
4. Bayes Theorem:
P(F1 | D) = P(D | F1) P(F1) / [P(D | F1) P(F1) + P(D | F2) P(F2) + P(D | F3) P(F3)]
= (0.004 × 0.5) / (0.004 × 0.5 + 0.006 × 0.3 + 0.012 × 0.2)
= 0.002 / 0.0062
= 0.323 (3 d.p.).
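The same inversion works mechanically for any partition. The helper below (a sketch; the name bayes_posterior is ours, not from the notes) implements Theorem 1.12 directly and reproduces the ketchup answer:

```python
def bayes_posterior(priors, likelihoods):
    """Bayes' Theorem over a partition B_1..B_m: returns P(B_j | A)
    for each j, given priors P(B_j) and likelihoods P(A | B_j)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)  # Partition Theorem: P(A)
    return [j / total for j in joint]

# Ketchup bottles: factory shares and defect rates as stated above.
post = bayes_posterior([0.5, 0.3, 0.2], [0.004, 0.006, 0.012])
print(round(post[0], 3))  # P(F1 | D), about 0.323
```

The function returns the whole posterior distribution, so the same call also gives P(F2 | D) and P(F3 | D).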
1.13 Statistical Independence
Two events A and B are statistically independent if the occurrence of one does not affect the occurrence of the other: A and B are statistically independent if P(A ∩ B) = P(A) × P(B).
Now P(A | B) = P(A ∩ B) / P(B), so if A and B are independent, then P(A | B) = P(A)P(B)/P(B) = P(A): knowing that B has occurred gives no information about whether A occurs.
Statistical independence for calculating the probability of an intersection
In Section 1.6 we said that it is often hard to calculate P(A ∩ B). We usually have two choices.
1. IF A and B are statistically independent, then P(A ∩ B) = P(A) × P(B).
2. If A and B are not known to be statistically independent, we usually have to use conditional probability: P(A ∩ B) = P(A | B) P(B).
Example: Toss a fair coin and a fair die together. The coin and die are physically independent.
Sample space: Ω = {H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6} — all 12 outcomes equally likely.
P(6 on the die) = 1/6, and P(H ∩ 6) = P({H6}) = 1/12; also, P(H) × P(6) = 1/2 × 1/6 = 1/12 = P(H ∩ 6), so the events 'head' and '6' are statistically independent.
Pairwise independence does not imply mutual independence
Example: A jar contains 4 balls: one red, one white, one blue, and one red, white & blue. Draw one ball at random. Let A = ball has red on it, B = ball has white on it, C = ball has blue on it.
Two balls satisfy A, so P(A) = 2/4 = 1/2; likewise P(B) = P(C) = 1/2.
Pairwise independent: Consider P(A ∩ B) = 1/4 (only the red-white-blue ball has both red and white on it). But P(A) × P(B) = 1/2 × 1/2 = 1/4 = P(A ∩ B), so A and B are independent.
Likewise, P(A ∩ C) = P(A)P(C), and P(B ∩ C) = P(B)P(C). So A, B and C are pairwise independent.
Mutually independent? Consider
P(A ∩ B ∩ C) = 1/4    (one of 4 balls),
while
P(A)P(B)P(C) = 1/2 × 1/2 × 1/2 = 1/8 ≠ P(A ∩ B ∩ C).
So A, B and C are NOT mutually independent, although they are pairwise independent.
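Because the sample space has only four equally likely outcomes, the whole example can be checked by enumeration. A Python sketch (the colour encoding of the balls is ours):

```python
from fractions import Fraction

# The four equally likely balls, encoded by the colours they carry.
balls = [{"red"}, {"white"}, {"blue"}, {"red", "white", "blue"}]

def prob(colours):
    """P(drawn ball carries ALL of the given colours)."""
    hits = sum(1 for b in balls if colours <= b)
    return Fraction(hits, len(balls))

pA, pB, pC = prob({"red"}), prob({"white"}), prob({"blue"})
assert prob({"red", "white"}) == pA * pB            # pairwise independent
assert prob({"red", "blue"}) == pA * pC
assert prob({"white", "blue"}) == pB * pC
assert prob({"red", "white", "blue"}) != pA * pB * pC  # NOT mutually independent
print(prob({"red", "white", "blue"}), pA * pB * pC)    # 1/4 1/8
```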
1.14 Random Variables
We have one more job to do in laying the foundations of our science of randomness. So far we have come up with the following ideas:
1. Things that happen are sets, also called events.
2. We measure chance by measuring sets, using a measure called probability.
Finally, what are the sets that we are measuring? It is a nuisance to have lots of different sample spaces: Ω = {head, tail}; Ω = {same, different}; Ω = {Springboks, All Blacks}.
All of these sample spaces could be represented more concisely in terms of numbers: Ω = {0, 1}.
On the other hand, there are many random experiments that genuinely produce random numbers as their outcomes.
A random experiment whose possible outcomes are real numbers is called a random variable.
In fact, any random experiment can be made to have outcomes that are real numbers, simply by mapping the sample space to a set of real numbers using
a function.
For example: define the function X : Ω → R by X(Springboks win) = 0; X(All Blacks win) = 1.
This gives us our formal definition of a random variable:
Definition: A random variable (r.v.) is a function from a sample space Ω to the real numbers R. We write X : Ω → R.
Although this is the formal definition, the intuitive definition of a random variable is probably more useful. Intuitively, remember that a random variable is the number that results from a random experiment.
2. Giving a name to a large class of random experiments that genuinely produce random numbers, and for which we want to develop general rules for finding averages, variances, relationships, and so on.
Example: Toss a coin 3 times. The sample space is
Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.
One example of a random variable is X : Ω → R such that, for sample point si, we have X(si) = # heads in outcome si. So X(HHH) = 3, X(THT) = 1, etc.
Another example is Y : Ω → R such that Y(si) = 1 if the 2nd toss is a head, and 0 otherwise.
Example: toss a fair coin 3 times. All outcomes are equally likely: P(HHH) = P(HHT) = . . . = P(TTT) = 1/8.
Let X : Ω → R be such that X(s) = # heads in s. Then
P(X = 0) = P({TTT}) = 1/8;
P(X = 1) = P({HTT, THT, TTH}) = 3/8;
P(X = 2) = P({HHT, HTH, THH}) = 3/8;
P(X = 3) = P({HHH}) = 1/8.
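These four probabilities can be recovered by enumerating the eight equally likely sample points. A short Python sketch:

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# Sample space: all 2^3 equally likely sequences of H/T.
omega = ["".join(s) for s in product("HT", repeat=3)]

# The random variable X maps each sample point to its number of heads.
counts = Counter(s.count("H") for s in omega)
fX = {x: Fraction(n, len(omega)) for x, n in counts.items()}

for x in sorted(fX):
    print(x, fX[x])  # 0 1/8, 1 3/8, 2 3/8, 3 1/8
```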
2. Conditional probability:
P(A | B) = P(A ∩ B) / P(B)    for any A, B with P(B) > 0.
Or: P(A ∩ B) = P(A | B) P(B).

3. Bayes' Theorem: for any A, B,
P(A | B) = P(B | A) P(A) / P(B).
This is a simplified version of Bayes' Theorem. It shows how to invert the conditioning, i.e. how to find P(A | B) when you know P(B | A).
4. Bayes' Theorem slightly more generalized: for any A, B,
P(A | B) = P(B | A) P(A) / [P(B | A) P(A) + P(B | Ā) P(Ā)].
5. Complete version of Bayes' Theorem: if sets A1, . . . , Am form a partition of the sample space, i.e. they do not overlap (mutually exclusive) and collectively cover all possible outcomes (their union is the sample space), then for any j,
P(Aj | B) = P(B | Aj) P(Aj) / [P(B | A1) P(A1) + . . . + P(B | Am) P(Am)].
6. Partition Theorem: if A1, . . . , Am form a partition of the sample space, then
P(B) = P(B ∩ A1) + P(B ∩ A2) + . . . + P(B ∩ Am).
This can also be written as:
P(B) = P(B | A1) P(A1) + P(B | A2) P(A2) + . . . + P(B | Am) P(Am).
These are both very useful formulations.
7. Multiplication rule: for any A, B,
P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A).

8. Statistical independence: if A and B are independent, then
P(A ∩ B) = P(A) × P(B),  and  P(A | B) = P(A),  and  P(B | A) = P(B).

9. Conditional probability: if P(B) > 0, then we can treat P(· | B) just like P: e.g. if A1 and A2 are mutually exclusive, then P(A1 ∪ A2 | B) = P(A1 | B) + P(A2 | B) (compare with P(A1 ∪ A2) = P(A1) + P(A2)); and P(Ā | B) = 1 − P(A | B) for any A.
The fact that P(· | B) is a valid probability measure is easily verified by checking that it satisfies Axioms 1, 2, and 3.
10. Unions: for any A, B, C,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B);
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).
The second expression is obtained by writing P(A ∪ B ∪ C) = P(A ∪ (B ∪ C)) and applying the first expression to A and (B ∪ C), then applying it again to expand P(B ∪ C).
1.16 Chains of events and probability trees: non-examinable The multiplication rule is very helpful for calculating probabilities when events
happen in sequence.
Example: Two balls are drawn at random without replacement from a box containing 4 white and 2 red balls. Find the probability that: (a) they are both white, (b) the second ball is red.
Solution
Let event Wi = 'ith ball is white' and Ri = 'ith ball is red'.
a) P(W1 ∩ W2) = P(W2 ∩ W1) = P(W2 | W1) P(W1).
Now P(W1) = 4/6 and P(W2 | W1) = 3/5.
So P(W1 ∩ W2) = 3/5 × 4/6 = 2/5.
b) Looking for P(2nd ball is red). To find this, we have to condition on what happened in the first draw. The event '2nd ball is red' is actually the event {W1R2, R1R2} = (W1 ∩ R2) ∪ (R1 ∩ R2). So
P(2nd ball is red) = P(W1 ∩ R2) + P(R1 ∩ R2)    (mutually exclusive)
= P(R2 | W1) P(W1) + P(R2 | R1) P(R1)
= 2/5 × 4/6 + 1/5 × 2/6
= 1/3.
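Both answers can be confirmed by listing all equally likely ordered pairs of draws, treating the six balls as distinct. A Python sketch:

```python
from itertools import permutations
from fractions import Fraction

balls = ["W"] * 4 + ["R"] * 2         # 4 white, 2 red, treated as distinct objects
draws = list(permutations(balls, 2))  # all 6*5 = 30 equally likely ordered pairs

p_both_white = Fraction(sum(1 for a, b in draws if a == b == "W"), len(draws))
p_second_red = Fraction(sum(1 for a, b in draws if b == "R"), len(draws))

print(p_both_white)  # 2/5
print(p_second_red)  # 1/3
```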
Probability trees Probability trees are a graphical way of representing the multiplication rule.
[Probability tree for the two draws:]
First draw: P(W1) = 4/6; P(R1) = 2/6.
Second draw: P(W2 | W1) = 3/5, P(R2 | W1) = 2/5; P(W2 | R1) = 4/5, P(R2 | R1) = 1/5.
Write conditional probabilities on the branches, and multiply to get the probability of an intersection: e.g.
P(W1 ∩ W2) = 4/6 × 3/5, or P(R1 ∩ W2) = 2/6 × 4/5.
More than two events
To find P(A1 ∩ A2 ∩ A3) we can apply the multiplication rule successively:
P(A1 ∩ A2 ∩ A3) = P(A3 ∩ (A1 ∩ A2))
= P(A3 | A1 ∩ A2) P(A1 ∩ A2)
= P(A3 | A1 ∩ A2) P(A2 | A1) P(A1).
Remember as: P(A1) × P(A2 | A1) × P(A3 | A2 ∩ A1).
In general, for n events A1, A2, . . . , An, we have
P(A1 ∩ A2 ∩ . . . ∩ An) = P(A1) P(A2 | A1) P(A3 | A2 ∩ A1) . . . P(An | An−1 ∩ . . . ∩ A1).
Example: A box contains w white balls and r red balls. Draw 3 balls without replacement. What is the probability of getting the sequence white, red, white?
Answer:
P(W1 ∩ R2 ∩ W3) = P(W1) P(R2 | W1) P(W3 | R2 ∩ W1)
= [w / (w + r)] × [r / (w + r − 1)] × [(w − 1) / (w + r − 2)].
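The chain of conditional probabilities generalises to any colour sequence. The helper below (our own function name, seq_prob) applies the multiplication rule draw by draw, updating the counts after each draw:

```python
from fractions import Fraction

def seq_prob(w, r, sequence):
    """P(drawing exactly this colour sequence, e.g. 'WRW') without
    replacement, starting from w white and r red balls.
    Chain rule: P(A1 ∩ ... ∩ An) = P(A1) P(A2|A1) ... P(An|An-1 ∩ ... ∩ A1)."""
    p = Fraction(1)
    for colour in sequence:
        total = w + r
        if colour == "W":
            p *= Fraction(w, total)
            w -= 1
        else:
            p *= Fraction(r, total)
            r -= 1
    return p

print(seq_prob(4, 2, "WRW"))  # 1/5, i.e. (4/6)(2/5)(3/4)
```

With the 4-white, 2-red box from the earlier example, seq_prob(4, 2, "WW") also recovers the 2/5 found in part (a).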
1.17 Equally likely outcomes and combinatorics: non-examinable
Sometimes, all the outcomes in a discrete finite sample space are equally likely. This makes it easy to calculate probabilities. If:
i) Ω = {s1, . . . , sk};
ii) each outcome si is equally likely, so p1 = p2 = . . . = pk = 1/k;
iii) event A contains r of these k outcomes;
then P(A) = r/k = (# outcomes in A) / (# outcomes in Ω).
Example: For a 3-child family, possible outcomes from oldest to youngest are:
Ω = {GGG, GGB, GBG, GBB, BGG, BGB, BBG, BBB} = {s1, s2, s3, s4, s5, s6, s7, s8}.
Let {p1, p2, . . . , p8} be a probability distribution on Ω. If every baby is equally likely to be a boy or a girl, then all of the 8 outcomes in Ω are equally likely, so p1 = p2 = . . . = p8 = 1/8.
Let event A be A = 'oldest child is a girl'. Then A = {GGG, GGB, GBG, GBB}. Event A contains 4 of the 8 equally likely outcomes, so event A occurs with probability P(A) = 4/8 = 1/2.

Counting equally likely outcomes
To count the number of equally likely outcomes in an event, we often need to use permutations or combinations. These give the number of ways of choosing r objects from n distinct objects. For example, if we wish to select 3 objects from n = 5 objects (a, b, c, d, e), we have choices abc, abd, abe, acd, ace, . . . .
1. Number of Permutations, nPr
The number of permutations, nPr, is the number of ways of selecting r objects from n distinct objects when different orderings constitute different choices. That is, choice (a, b, c) counts separately from choice (b, a, c). Then
#permutations = nPr = n! / (n − r)!.
2. Number of Combinations, nCr = (n choose r)
The number of combinations, nCr, is the number of ways of selecting r objects from n distinct objects when different orderings constitute the same choice. That is, choice (a, b, c) and choice (b, a, c) are the same. Then
#combinations = nCr = (n choose r) = nPr / r! = n! / ((n − r)! r!)
(because nPr counts each permutation r! times, and we only want to count it once: so divide nPr by r!).
Use the same rule on the numerator and the denominator
When P(A) = (# outcomes in A) / (# outcomes in Ω), we can often think about the problem either with different orderings constituting different choices, or with different orderings constituting the same choice. The critical thing is to use the same rule for both the numerator and the denominator.
Example: (a) Tom has five elderly great-aunts who live together in a tiny bungalow. They insist on each receiving separate Christmas cards, and threaten to disinherit Tom if he sends two of them the same picture. Tom has Christmas cards with 12 different designs. In how many different ways can he select 5 different designs from the 12 designs available?
Order of cards is not important, so use combinations. The number of ways of selecting 5 distinct designs from 12 is
12C5 = (12 choose 5) = 792.
b) The next year, Tom buys a pack of 40 Christmas cards, featuring 10 different pictures with 4 cards of each picture. He selects 5 cards at random to send to his great-aunts. What is the probability that at least two of the great-aunts receive the same picture?
Looking for P(at least 2 cards the same) = P(A) (say). It is easiest to find P(all 5 cards are different) = P(Ā).
The number of outcomes in Ā is
(# ways of selecting 5 different designs) = 40 × 36 × 32 × 28 × 24.
(40 choices for the first card; 36 for the second, because the 4 cards with the first design are excluded; etc. Note that order matters: e.g. we are counting choice 12345 separately from choice 23154.)
The total number of outcomes is
(total # ways of selecting 5 cards from 40) = 40 × 39 × 38 × 37 × 36.
(Note: order mattered above, so we need order to matter here too.) So
P(Ā) = (40 × 36 × 32 × 28 × 24) / (40 × 39 × 38 × 37 × 36) = 0.392.
Thus
P(A) = P(at least 2 cards are the same design) = 1 − P(Ā) = 1 − 0.392 = 0.608.
Alternative solution if order does not matter on numerator and denominator: (much harder method)
P(Ā) = [(10 choose 5) × 4^5] / (40 choose 5).
This works because there are (10 choose 5) ways of choosing 5 different designs from 10, and there are 4 choices of card within each of the 5 chosen groups. So the total number of ways of choosing 5 cards of different designs is (10 choose 5) × 4^5. The total number of ways of choosing 5 cards from 40 is (40 choose 5).
Exercise: Check that this gives the same answer for P(Ā) as before.
Note: Problems like these belong to the branch of mathematics called Combinatorics: the science of counting.
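The exercise — checking that the ordered and unordered counts agree — is easy with exact arithmetic. A Python sketch using math.comb:

```python
from fractions import Fraction
from math import comb

# Ordered counting: 40*36*32*28*24 favourable ordered hands out of 40*39*38*37*36.
ordered = Fraction(40 * 36 * 32 * 28 * 24, 40 * 39 * 38 * 37 * 36)

# Unordered counting: choose 5 designs from 10, then one of 4 cards per design.
unordered = Fraction(comb(10, 5) * 4**5, comb(40, 5))

assert ordered == unordered          # same rule top and bottom -> same answer
print(round(float(ordered), 3))      # P(all different), about 0.392
print(round(1 - float(ordered), 3))  # P(at least two the same), about 0.608
```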
the probability function of a random variable lists the values the random variable can take, and their probabilities.
2. Hypothesis testing: I toss a coin ten times and get nine heads. How unlikely is that? Can we continue to believe that the coin is fair when it produces nine heads out of ten tosses?
3. Likelihood and estimation: what if we know that our random variable is (say) Binomial(5, p), for some p, but we don't know the value of p? We will see how to estimate the value of p using maximum likelihood estimation.
4. Expectation and variance of a random variable: the expectation of a random variable is the value it takes on average; the variance of a random variable measures how much the random variable varies about its average.
5. Change of variable procedures: calculating probabilities and expectations of g(X), where X is a random variable and g(X) is a function, e.g. g(X) = √X or g(X) = X².
6. Modelling: we have a situation in real life that we know is random. But what does the randomness look like? Is it highly variable, or is there little variability? Does it sometimes give results much higher than average, but never give results much lower (long-tailed distribution)? We will see how different probability distributions are suitable for different circumstances. Choosing a probability distribution to fit a situation is called modelling.
The probability function fX (x) lists all possible values of X, and gives a probability to each value.
Recall that a random variable, X, assigns a real number to every possible outcome of a random experiment. The random variable is discrete if the set of real values it can take is nite or countable, eg. {0,1,2,. . . }.
Random variable X gives numbers to the possible outcomes:
If he chooses . . .   the Ferrari, X = 1;  the Porsche, X = 2;  the MG, X = 3.

Definition: The probability function, fX(x), for a discrete random variable X, is given by
fX(x) = P(X = x),    for all possible outcomes x of X.

Example: Which car?
Outcome:                                 Ferrari   Porsche   MG
x:                                          1         2       3
Probability function, fX(x) = P(X = x):    1/6       1/6     4/6

e.g. the probability that he selects the Ferrari is fX(1) = P(X = 1) = 1/6.
Example: Toss a fair coin once, and let X=number of heads. Then X= 0 with probability 0.5, 1 with probability 0.5.
We can also write the probability function as:
fX(x) = 1/6 if x = 1;  1/6 if x = 2;  4/6 if x = 3;  0 otherwise.
The probability function of X is given by:
x:       0    1
fX(x):  0.5  0.5
or equivalently
fX(x) = 0.5 if x = 0;  0.5 if x = 1;  0 otherwise.
ii) Σx fX(x) = 1;
iii) P(X ∈ A) = Σ_{x ∈ A} fX(x);
e.g. in the car example, P(X ∈ {1, 2}) = P(X = 1 or 2) = P(X = 1) + P(X = 2) = 1/6 + 1/6 = 2/6.
2.3 Bernoulli trials
Many of the discrete random variables that we meet are based on counting the outcomes of a series of trials called Bernoulli trials. Jacques Bernoulli was a Swiss mathematician in the late 1600s. He and his brother Jean, who were bitter rivals, both studied mathematics secretly against their father's will. Their father wanted Jacques to be a theologian and Jean to be a merchant.
Definition: A random experiment is called a set of Bernoulli trials if it consists of a series of trials such that:
i) each trial has only two possible outcomes, labelled success and failure;
ii) the probability of success, p, remains constant for all trials;
iii) the trials are independent, i.e. the event 'success in trial i' does not depend on the outcome of any other trial.
That is,
P(Y = 1) = P(success) = p,
P(Y = 0) = P(failure) = 1 − p.
The Binomial distribution counts the number of successes in a fixed number of Bernoulli trials.
Definition: Let X be the number of successes in n independent Bernoulli trials each with probability of success p. Then X has the Binomial distribution with parameters n and p. We write X ~ Bin(n, p), or X ~ Binomial(n, p).
Thus X ~ Bin(n, p) if X is the number of successes out of n independent trials, each of which has probability p of success.
Probability function
If X ~ Binomial(n, p), then the probability function for X is
fX(x) = P(X = x) = (n choose x) p^x (1 − p)^(n−x),    for x = 0, 1, . . . , n.
Explanation:
For X = x, we need an outcome with x successes and (n − x) failures. A single outcome with x successes and (n − x) failures has probability
p^x × (1 − p)^(n−x),
where:
(1) p^x: succeeds x times, each with probability p;
(2) (1 − p)^(n−x): fails (n − x) times, each with probability (1 − p).
There are (n choose x) possible outcomes with x successes and (n − x) failures, because we must choose which x of the n trials are the successes. Multiplying the number of such outcomes by the probability of each gives the probability function above.

Notes:
1. fX(x) = 0 if x is not one of 0, 1, 2, . . . , n.
2. Check that Σ_{x=0}^{n} fX(x) = 1:
Σ_{x=0}^{n} fX(x) = Σ_{x=0}^{n} (n choose x) p^x (1 − p)^(n−x)
= (p + (1 − p))^n    (Binomial Theorem)
= 1^n
= 1.
It is this connection with the Binomial Theorem that gives the Binomial Distribution its name.
Example 1: Let X ~ Binomial(n = 4, p = 0.2). Write down the probability function of X.
x:        0       1       2       3       4
fX(x):  0.4096  0.4096  0.1536  0.0256  0.0016
Example 2: Let X be the number of times I get a '6' out of 10 rolls of a fair die.
1. What is the distribution of X?
2. What is the probability that X ≥ 2?

1. X ~ Binomial(n = 10, p = 1/6).
2. P(X ≥ 2) = 1 − P(X < 2)
= 1 − P(X = 0) − P(X = 1)
= 1 − (10 choose 0) (1/6)^0 (5/6)^10 − (10 choose 1) (1/6)^1 (5/6)^9
= 0.515.
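Example 2 can be verified with a few lines of code. A Python sketch with a hand-rolled Binomial probability function (binom_pmf is our own helper name):

```python
from math import comb

def binom_pmf(x, n, p):
    """fX(x) = C(n, x) p^x (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 1 / 6
p_at_least_2 = 1 - binom_pmf(0, n, p) - binom_pmf(1, n, p)
print(round(p_at_least_2, 3))  # 0.515
```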
Example 3: Let X be the number of girls in a three-child family. What is the distribution of X?
Assume: (i) each child is equally likely to be a boy or a girl; (ii) all children are independent of each other. Then X ~ Binomial(n = 3, p = 0.5).
Shape of the Binomial distribution The shape of the Binomial distribution depends upon the values of n and p. For small n, the distribution is almost symmetrical for values of p close to 0.5, but highly skewed for values of p close to 0 or 1. As n increases, the distribution becomes more and more symmetrical, and there is noticeable skew only if p is very close to 0 or 1. The probability functions for various values of n and p are shown below.
[Plots: probability functions for n = 10, p = 0.5 (roughly symmetrical); n = 10, p = 0.9 (highly skewed); and n = 100, p = 0.9 (nearly symmetrical again).]
Sum of independent Binomial random variables: if X and Y are independent, and X ~ Binomial(n, p), Y ~ Binomial(m, p), then X + Y ~ Bin(n + m, p).
This is because X counts the number of successes out of n trials, and Y counts the number of successes out of m trials: so overall, X + Y counts the total number of successes out of n + m trials.
Note: X and Y must both share the same value of p.
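The claim X + Y ~ Bin(n + m, p) can be verified numerically by convolving the two probability functions. A Python sketch with illustrative values n = 4, m = 6, p = 0.3 (our choice, not from the notes):

```python
from math import comb, isclose

def binom_pmf(x, n, p):
    """fX(x) = C(n, x) p^x (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, m, p = 4, 6, 0.3
# Convolution: P(X + Y = k) = sum over x of P(X = x) P(Y = k - x).
for k in range(n + m + 1):
    conv = sum(binom_pmf(x, n, p) * binom_pmf(k - x, m, p)
               for x in range(max(0, k - m), min(n, k) + 1))
    assert isclose(conv, binom_pmf(k, n + m, p))
print("convolution matches Bin(n + m, p) for every k")
```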
2.5 The cumulative distribution function, FX(x)
We have defined the probability function, fX(x), as fX(x) = P(X = x). The probability function tells us everything there is to know about X.
The cumulative distribution function, or just distribution function, written as FX(x), is an alternative function that also tells us everything there is to know about X.
Definition: The (cumulative) distribution function (c.d.f.) is
FX(x) = P(X ≤ x)    for −∞ < x < ∞.

If you are asked to give the distribution of X, you could answer by giving either the distribution function, FX(x), or the probability function, fX(x). Each of these functions encapsulates all possible information about X.

The distribution function FX(x) as a probability sweeper
The cumulative distribution function, FX(x), sweeps up all the probability up to and including the point x.
[Plots: FX(x) sweeping up the probability for X ~ Bin(10, 0.5) and X ~ Bin(10, 0.9).]
Example: Let X have probability function
x:       0    1    2
fX(x):  1/4  1/2  1/4
Then the distribution function is
FX(x) = P(X ≤ x) =
  0                        if x < 0,
  0.25                     if 0 ≤ x < 1,
  0.25 + 0.5 = 0.75        if 1 ≤ x < 2,
  0.25 + 0.5 + 0.25 = 1    if x ≥ 2.
[Plots of the probability function f(x) and the step-function distribution function F(x).]
Note that FX(x) is a step function: it jumps by amount fX(y) at every point y with positive probability.
Reading off probabilities from the distribution function
As well as using the probability function to find the distribution function, we can also use the distribution function to find probabilities. For an integer-valued random variable X:
fX(x) = P(X = x) = P(X ≤ x) − P(X ≤ x − 1) = FX(x) − FX(x − 1).
This is why the distribution function FX(x) contains as much information as the probability function, fX(x): we can use either one to find the other.
In general: FX(b) = FX(a) + P(a < X ≤ b)    if b > a.
So FX(b) − FX(a) = P(a < X ≤ b)    if b > a.
Warning: endpoints
Be careful of endpoints and the difference between ≤ and <. For example, P(X < 10) = P(X ≤ 9) = FX(9).
Examples: Let X ~ Binomial(100, 0.4). In terms of FX(x), what is:
1. P(X ≤ 30)?  FX(30).
2. P(X < 30)?  P(X ≤ 29) = FX(29).
3. P(X ≥ 56)?  1 − P(X < 56) = 1 − P(X ≤ 55) = 1 − FX(55).
4. P(X > 42)?  1 − P(X ≤ 42) = 1 − FX(42).
5. P(50 ≤ X ≤ 60)?  P(X ≤ 60) − P(X ≤ 49) = FX(60) − FX(49).
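Each of these endpoint translations can be sanity-checked by direct summation. A Python sketch for X ~ Binomial(100, 0.4):

```python
from math import comb, isclose

n, p = 100, 0.4

def f(x):
    """Probability function of X ~ Binomial(100, 0.4)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def F(x):
    """Distribution function FX(x) = P(X <= x)."""
    return sum(f(k) for k in range(x + 1))

assert isclose(sum(f(k) for k in range(30)), F(29))              # P(X < 30) = FX(29)
assert isclose(sum(f(k) for k in range(56, n + 1)), 1 - F(55))   # P(X >= 56) = 1 - FX(55)
assert isclose(sum(f(k) for k in range(43, n + 1)), 1 - F(42))   # P(X > 42) = 1 - FX(42)
assert isclose(sum(f(k) for k in range(50, 61)), F(60) - F(49))  # P(50<=X<=60) = FX(60)-FX(49)
print("all endpoint identities hold")
```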
Properties of the distribution function
1) FX(−∞) = P(X ≤ −∞) = 0, and FX(+∞) = P(X ≤ +∞) = 1. (These are true because all values of X are strictly between −∞ and +∞.)
2) FX(x) is a non-decreasing function of x: that is, if x1 < x2, then FX(x1) ≤ FX(x2).
3) P(a < X ≤ b) = FX(b) − FX(a) if b > a.
4) F is right-continuous: that is, lim_{h↓0} F(x + h) = F(x).
2.6 Hypothesis testing
You have probably come across the idea of hypothesis tests, p-values, and significance in other courses. Common hypothesis tests include t-tests and chi-squared tests. However, hypothesis tests can be conducted in much simpler circumstances than these. The concept of the hypothesis test is at its easiest to understand with the Binomial distribution in the following example. All other hypothesis tests throughout statistics are based on the same idea.

Example: Weird Coin?
I toss a coin 10 times and get 9 heads. How weird is that?
What is 'weird'?
Getting 9 heads out of 10 tosses: we'll call this weird.
Getting 10 heads out of 10 tosses: even more weird!
Getting 8 heads out of 10 tosses: less weird.
Getting 1 head out of 10 tosses: same as getting 9 tails out of 10 tosses: just as weird as 9 heads if the coin is fair.
Getting 0 heads out of 10 tosses: same as getting 10 tails: more weird than 9 heads if the coin is fair.

Set of weird outcomes
If our coin is fair, the outcomes that are as weird or weirder than 9 heads are: 9 heads, 10 heads, 1 head, and 0 heads.
Probability of observing something at least as weird as 9 heads, if the coin is fair:
We can add the probabilities of all the outcomes that are at least as weird as 9 heads out of 10 tosses, assuming that the coin is fair: P(X = 9) + P(X = 10) + P(X = 1) + P(X = 0), where X is the number of heads out of 10 tosses.
For X ~ Binomial(10, 0.5), we have:
P(X = 9) + P(X = 10) + P(X = 1) + P(X = 0)
= (10 choose 9)(0.5)^9(0.5)^1 + (10 choose 10)(0.5)^10(0.5)^0 + (10 choose 1)(0.5)^1(0.5)^9 + (10 choose 0)(0.5)^0(0.5)^10
= 0.00977 + 0.00098 + 0.00977 + 0.00098
= 0.021.
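The figure 0.021 is exact arithmetic over the 2^10 = 1024 equally likely toss sequences. A quick Python check:

```python
from math import comb

n = 10
# Outcomes at least as weird as 9 heads under a fair coin: 0, 1, 9, 10 heads.
weird = [0, 1, 9, 10]
p_value = sum(comb(n, k) for k in weird) / 2**n  # each sequence has prob (0.5)^10

print(p_value)            # 0.021484375, i.e. 22/1024
print(round(p_value, 3))  # 0.021
```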
Is this weird?
Yes, it is quite weird. If we had a fair coin and tossed it 10 times, we would only expect to see something as extreme as 9 heads on about 2.1% of occasions.
Is the coin fair?
Obviously, we can't say. It might be: after all, on 2.1% of occasions that you toss a fair coin 10 times, you do get something as weird as 9 heads or more.
However, 2.1% is a small probability, so it is still very unusual for a fair coin to produce something as weird as what we've seen. If the coin really was fair, it would be very unusual to get 9 heads or more.
We can deduce that, EITHER we have observed a very unusual event with a fair coin, OR the coin is NOT fair.
Our observed information is X, the number of heads out of 10 tosses. We write down the distribution of X if the coin is fair: X ~ Binomial(10, 0.5). We calculate the probability of observing something AT LEAST AS EXTREME as our observation, X = 9, if the coin is fair: prob = 0.021.
The probability is small (2.1%). We conclude that this is unlikely with a fair coin, so we have observed some evidence that the coin is NOT fair.
Null hypothesis and alternative hypothesis
We express the steps above as two competing hypotheses.
Null hypothesis: the first alternative, that the coin IS fair.
We expect to believe the null hypothesis unless we see convincing evidence that it is wrong.
Alternative hypothesis: the second alternative, that the coin is NOT fair.
In hypothesis testing, we often use this same formulation.
The null hypothesis is specific: it specifies an exact distribution for our observation, X ~ Binomial(10, 0.5).
The alternative hypothesis is general: it simply states that the null hypothesis is wrong. It does not say what the right answer is.
We use H0 and H1 to denote the null and alternative hypotheses respectively.
The null hypothesis is H0: the coin is fair.
The alternative hypothesis is H1: the coin is NOT fair.
More precisely, we write:
H0: p = 0.5,
H1: p ≠ 0.5,
where p is the probability that the coin lands on heads.
Think of null hypothesis as meaning the default: the hypothesis we will accept unless we have a good reason not to.
In general: small p-values mean strong evidence against H0; large p-values mean little evidence against H0.
Note: Be careful not to confuse the term p-value, which is 0.021 in our example, with the Binomial probability p. Our hypothesis test is designed to test whether the Binomial probability is p = 0.5. To test this, we calculate the p-value of 0.021 as a measure of the strength of evidence against the hypothesis that p = 0.5.
Interpreting the hypothesis test
There are different schools of thought about how a p-value should be interpreted.
Most people agree that the p-value is a useful measure of the strength of evidence against the null hypothesis. The smaller the p-value, the stronger the evidence against H0.
Some people go further and use an accept/reject framework. Under this framework, the null hypothesis H0 should be rejected if the p-value is less than 0.05 (say), and accepted if the p-value is greater than 0.05.
In this course we use the strength of evidence interpretation. The p-value measures how far out our observation lies in the tails of the distribution specified by H0. We do not talk about accepting or rejecting H0. This decision should usually be taken in the context of other scientific information. However, as a rule of thumb, consider that p-values of 0.05 and less start to suggest that the null hypothesis is doubtful.

Statistical significance
You have probably encountered the idea of statistical significance in other courses.
In the coin example, we can say that our test of H0: p = 0.5 against H1: p ≠ 0.5 is significant at the 5% level, because the p-value is 0.021, which is < 0.05.
This means: we have some evidence that p ≠ 0.5.
It does not mean: the difference between p and 0.5 is large, or the difference between p and 0.5 is important in practical terms.
'Statistically significant' means that we have evidence that there IS a difference. It says NOTHING about the SIZE, or the IMPORTANCE, of the difference.
It means substantial evidence of a difference, not evidence of a substantial difference.
Beware!
The p-value gives the probability of seeing something as weird as what we did see, if H0 is true. This means that about 5% of the time, we will get a p-value < 0.05 WHEN H0 IS TRUE!
Similarly, about once in every thousand tests, we will get a p-value < 0.001 when H0 is true!
2.7 Example: Presidents and deep-sea divers
Men in the class: would you like to have daughters? Then become a deep-sea diver, a fighter pilot, or a heavy smoker. Would you prefer sons? Easy! Just become a US president.
Numbers suggest that men in different professions tend to have more sons than daughters, or the reverse. Presidents have sons, fighter pilots have daughters. But is it real, or just chance? We can use hypothesis tests to decide.
The facts
The 44 US presidents from George Washington to Barack Obama have had a total of 153 children, comprising 88 sons and only 65 daughters: a sex ratio of 1.4 sons for every daughter.
Two studies of deep-sea divers revealed that the men had a total of 190 children, comprising 65 sons and 125 daughters: a sex ratio of 1.9 daughters for every son.
Could this happen by chance? Is it possible that the men in each group really had a 50-50 chance of producing sons and daughters?
For the divers: If I tossed a coin 190 times and got only 65 heads, could I continue to believe that the coin was fair?
Hypothesis test for the presidents We set up the competing hypotheses as follows.
Let X be the number of daughters out of 153 presidential children. Then X ~ Binomial(153, p), where p is the probability that each child is a daughter.
H0: p = 0.5.
H1: p ≠ 0.5.
We need the probability of getting a result AT LEAST AS EXTREME as X = 65 daughters, if H0 is true and p really is 0.5.
Which results are at least as extreme as X = 65?
X = 0, 1, 2, . . . , 65, for even fewer daughters.
X = (153 − 65) = 88, . . . , 153, for too many daughters, because we would be just as surprised if we saw only 65 sons, i.e. 88 daughters.
Probabilities for X ~ Binomial(n = 153, p = 0.5):
[Plot of the probability function of X ~ Binomial(153, 0.5).]
Calculating the p-value
The p-value for the president problem is given by
P(X ≤ 65) + P(X ≥ 88),    where X ~ Binomial(153, 0.5).
In principle, we could calculate this as
P(X = 0) + P(X = 1) + . . . + P(X = 65) + P(X = 88) + . . . + P(X = 153)
= (153 choose 0)(0.5)^0(0.5)^153 + (153 choose 1)(0.5)^1(0.5)^152 + . . .
This would take a lot of calculator time! Instead, we use a computer with a package such as R. R command for the p-value The R command for calculating the lower-tail p-value for the Binomial(n = 153, p = 0.5) distribution is pbinom(65, 153, 0.5). Typing this in R gives: > pbinom(65, 153, 0.5) [1] 0.03748079
This gives us the lower-tail p-value only: P(X ≤ 65) = 0.0375.
To get the overall p-value, we have two choices:
1. Multiply the lower-tail p-value by 2: 2 × 0.0375 = 0.0750. In R:
> 2 * pbinom(65, 153, 0.5)
[1] 0.07496158
This works because, with H0: p = 0.5, the Binomial distribution is symmetric, so the upper-tail p-value is by definition the same as the lower-tail p-value. The upper tail gives us the probability of finding something equally surprising at the opposite end of the distribution.
2. Calculate the upper-tail p-value explicitly (this only works for H0: p = 0.5): by symmetry, P(X ≥ 88) = P(X ≤ 65) = 0.0375, so the overall p-value is 0.0375 + 0.0375 = 0.075. (Same as before.)
Note: The R command pbinom is equivalent to the cumulative distribution function for the Binomial distribution: pbinom(65, 153, 0.5) = FX(65), where X ~ Binomial(153, 0.5).
The overall p-value in this example is 2 × FX(65).
Note: In the R command pbinom(65, 153, 0.5), the order in which you enter the numbers 65, 153, and 0.5 is important. If you enter them in a different order, you will not get the answer you want. An alternative is to use the longhand command pbinom(q=65, size=153, prob=0.5), in which case you can enter the terms in any order.
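As a cross-check on the R output, the same quantity 2 × FX(65) can be computed exactly with integer arithmetic. A Python sketch:

```python
from math import comb

n = 153
# FX(65) = P(X <= 65) for X ~ Binomial(153, 0.5): under H0, each of the
# 2^153 son/daughter sequences is equally likely.
lower_tail = sum(comb(n, k) for k in range(66))  # outcomes with <= 65 daughters
p_value = 2 * lower_tail / 2**n

print(round(p_value, 5))  # 0.07496, matching 2 * pbinom(65, 153, 0.5) in R
```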
Summary: are presidents more likely to have sons?
Back to our hypothesis test. Recall that X was the number of daughters out of 153 presidential children, and X ~ Binomial(153, p), where p is the probability that each child is a daughter.
Null hypothesis:        H0: p = 0.5.
Alternative hypothesis: H1: p ≠ 0.5.
p-value:                2 × FX(65) = 0.075.
What does this mean?
The p-value of 0.075 means that, if the presidents really were as likely to have daughters as sons, there would only be a 7.5% chance of observing something as unusual as only 65 daughters out of the total 153 children.
This is slightly unusual, but not very unusual.
We conclude that there is no real evidence that presidents are more likely to have sons than daughters. The observations are compatible with the possibility that there is no difference.
Does this mean presidents are equally likely to have sons and daughters? No: the observations are also compatible with the possibility that there is a difference. We just don't have enough evidence either way.
Hypothesis test for the deep-sea divers
For the deep-sea divers, there were 190 children: 65 sons, and 125 daughters.
Let X be the number of sons out of 190 diver children. Then X ~ Binomial(190, p), where p is the probability that each child is a son.
Note: We could just as easily formulate our hypotheses in terms of daughters instead of sons. Because pbinom is defined as a lower-tail probability, however, it is usually easiest to formulate them in terms of the low result (sons).
H0: p = 0.5.
H1: p ≠ 0.5.
We need the probability of getting a result AT LEAST AS EXTREME as X = 65 sons, if H0 is true and p really is 0.5.
Results at least as extreme as X = 65 are:
X = 0, 1, 2, . . . , 65, for even fewer sons.
X = (190 − 65) = 125, . . . , 190, for the equally surprising result in the opposite direction (too many sons).
[Plot of the probability function of X ~ Binomial(190, 0.5).]
R command for the p-value
p-value = 2 * pbinom(65, 190, 0.5).
Typing this in R gives:
> 2*pbinom(65, 190, 0.5)
[1] 1.603136e-05
This is 0.000016, or a little more than one chance in 100 thousand.
We conclude that it is extremely unlikely that this observation could have occurred by chance, if the deep-sea divers had equal probabilities of having sons and daughters.
We have very strong evidence that deep-sea divers are more likely to have daughters than sons. The data are not really compatible with H0.
What next?
p-values are often badly used in science and business. They are regularly treated as the end point of an analysis, after which no more work is needed. Many scientific journals insist that scientists quote a p-value with every set of results, and often only p-values less than 0.05 are regarded as interesting. The outcome is that some scientists do every analysis they can think of until they finally come up with a p-value of 0.05 or less.
A good statistician will recommend a different attitude. It is very rare in science
2.8 Example: Birthdays and sports professionals
Have you ever wondered what makes a professional sports player? Talent? Dedication? Good coaching? Or is it just that they happen to have the right birthday. . . ?
The following text is taken from Malcolm Gladwell's book Outliers. It describes the play-by-play for the first goal scored in the 2007 finals of the Canadian ice hockey junior league for star players aged 17 to 19. The two teams are the Tigers and Giants. There's one slight difference . . . instead of the players' names, we're given their birthdays.
March 11 starts around one side of the Tigers' net, leaving the puck for his teammate January 4, who passes it to January 22, who flips it back to March 12, who shoots point-blank at the Tigers' goalie, April 27. April 27 blocks the shot, but it's rebounded by Giants' March 6. He shoots! Tigers defensemen February 9 and February 14 dive to block the puck while January 10 looks on helplessly. March 6 scores!
Notice anything funny? Here are some figures. There were 25 players in the Tigers squad, born between 1986 and 1990. Out of these 25 players, 14 of them were born in January, February, or March. Is it believable that this should happen by chance, or do we have evidence that there is a birthday effect in becoming a star ice hockey player?
Hypothesis test
Let X be the number of the 25 players who are born from January to March.
We need to set up hypotheses of the following form: Null hypothesis: H0: there is no birthday effect. Alternative hypothesis: H1: there is a birthday effect.
Under H0, there is no birthday effect. So the probability that each player has a birthday in January to March is about 1/4 (3 months out of a possible 12 months).
Thus the distribution of X under H0 is X ~ Binomial(25, 1/4). Under H1, there is a birthday effect, so p ≠ 1/4. Our formulation for the hypothesis test is therefore as follows:
H0: p = 1/4.   H1: p ≠ 1/4.
Probability of getting a result AT LEAST AS EXTREME as X = 14 Jan to March players, if H0 is true and p really is 0.25.
Upper tail: X = 14, 15, . . . , 25, for even more Jan to March players. Lower tail: an equal probability in the opposite direction, for too few Jan to March players.
Note: We do not need to calculate the values corresponding to our lower-tail p-value. It is more complicated in this example than in Section 2.7, because we do not have Binomial probability p = 0.5. In fact, the boundary of the lower tail lies somewhere between 0 and 1 players, so it cannot be specified exactly. We get round this problem for calculating the p-value by just multiplying the upper-tail p-value by 2.
Probabilities for X ~ Binomial(n = 25, p = 0.25)
[Figure: probability function of X ~ Binomial(25, 0.25), with probabilities from 0.00 to 0.15 on the vertical axis and x ticks from 10 to 24 on the horizontal axis; the upper tail X ≥ 14 is marked.]
R command for the p-value
We need twice the UPPER-tail p-value: p-value = 2 × (1 − pbinom(13, 25, 0.25)). (Recall P(X ≥ 14) = 1 − P(X ≤ 13).)
Typing this in R gives:
> 2*(1-pbinom(13, 25, 0.25))
[1] 0.001831663
This p-value is very small. It means that if there really was no birthday effect, we would expect to see results as extreme as this in only about one in every 500 random selections of a 25-player ice hockey team. Something beyond ordinary chance seems to be going on. The data are barely compatible with H0.
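For a quick cross-check of this upper-tail calculation outside R, here is a hedged Python sketch (standard library only; binom_cdf is our own helper playing the role of pbinom):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k + 1))

# Twice the upper tail: 2 * P(X >= 14) = 2 * (1 - P(X <= 13)) under H0: p = 0.25.
p_value = 2 * (1 - binom_cdf(13, 25, 0.25))
print(p_value)  # about 0.00183, matching 2*(1-pbinom(13, 25, 0.25)) in R
```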
Why should there be a birthday effect?
These data are just one example of a much wider, and astonishingly strong, phenomenon. Professional sports players, not just in ice hockey, but in soccer, baseball, and other sports, have strong birthday clustering. Why?
It's because these sports select talented players for age-class star teams at young ages, about 10 years old. In ice hockey, the cut-off date for age-class teams is January 1st. A 10-year-old born in December is competing against players who are nearly a year older, born in January, February, and March. The age difference makes a big difference in terms of size, speed, and physical coordination. Most of the talented players at this age are simply older and bigger. But there then follow years in which they get the best coaching and the most practice. By the time they reach 17, these players really are the best.
2.9 Likelihood and estimation
So far, the hypothesis tests have only told us whether the Binomial probability p might be, or probably isn't, equal to the value specified in the null hypothesis. They have told us nothing about the size, or potential importance, of the departure from H0.
For example, for the deep-sea divers, we found that it would be very unlikely to observe as many as 125 daughters out of 190 children if the chance of having a daughter really was p = 0.5.
But what does this say about the actual value of p? Remember the p-value for the test was 0.000016. Do you think that: 1. p could be as big as 0.8?
The test doesn't even tell us this much! If there was a huge sample size (number of children), we COULD get a p-value as small as 0.000016 even if the true probability was 0.51.
Common sense, however, gives us a hint. Because there were almost twice as many daughters as sons, my guess is that the probability of having a daughter is something close to p = 2/3. We need some way of formalizing this.
Estimation
The process of using observations to suggest a value for a parameter is called estimation. The value suggested is called the estimate of the parameter.
In the case of the deep-sea divers, we wish to estimate the probability p that the child of a diver is a daughter. The common-sense estimate to use is
p̂ = (number of daughters) / (total number of children) = 125/190 = 0.658.
However, there are many situations where our common sense fails us. For example, what would we do if we had a regression-model situation (see other courses) and wished to specify an alternative form for p, such as
p = α + β × (diver age)?
How would we estimate the unknown intercept α and slope β, given known information on diver age and number of daughters and sons?
We need a general framework for estimation that can be applied to any situation. The most useful and general method of obtaining parameter estimates is the method of maximum likelihood estimation.
Likelihood
Likelihood is one of the most important concepts in statistics. Return to the deep-sea diver example. X is the number of daughters out of 190 children. We know that X ~ Binomial(190, p), and we wish to estimate the value of p. The available data is the observed value of X: X = 125.
Suppose for a moment that p = 0.5. What is the probability of observing X = 125?
P(X = 125) = (190 choose 125) × (0.5)^125 × (0.5)^65 = 3.97 × 10^(−6).
What about p = 0.6? What would be the probability of observing X = 125 if p = 0.6?
P(X = 125) = (190 choose 125) × (0.6)^125 × (0.4)^65 = 0.016.
This still looks quite unlikely, but it is almost 4000 times more likely than getting X = 125 when p = 0.5.
So far, we have discovered that it would be thousands of times more likely to observe X = 125 if p = 0.6 than it would be if p = 0.5. This suggests that p = 0.6 is a better estimate than p = 0.5. You can probably see where this is heading. If p = 0.6 is a better estimate than p = 0.5, what if we move p even closer to our common-sense estimate of 0.658?
P(X = 125) = (190 choose 125) × (0.658)^125 × (0.342)^65 = 0.061.
This is even more likely than for p = 0.6. So p = 0.658 is the best estimate yet.
Can we do any better? What happens if we increase p a little more, say to p = 0.7?
P(X = 125) = (190 choose 125) × (0.7)^125 × (0.3)^65 = 0.028.
This has decreased from the result for p = 0.658, so our observation of 125 is LESS likely under p = 0.7 than under p = 0.658.
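These spot evaluations of the likelihood are easy to script. The sketch below (Python, standard library; the function name likelihood is ours) recomputes P(X = 125) for each candidate p and shows the rise-then-fall pattern described above:

```python
from math import comb

def likelihood(p, x=125, n=190):
    """P(X = x) when X ~ Binomial(n, p), viewed as a function of p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

for p in (0.5, 0.6, 0.658, 0.7):
    print(p, likelihood(p))
# The likelihood increases from p = 0.5 to p = 0.658, then decreases at p = 0.7.
```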
Overall, we can plot a graph showing how likely our observation of X = 125 is under each different value of p.
[Figure: the curve P(X = 125) when X ~ Bin(190, p), plotted against p from 0.50 to 0.80; the vertical axis runs from 0.00 to 0.06.]
The graph reaches a clear maximum. This is a value of p at which the observation X = 125 is MORE LIKELY than at any other value of p. This maximum likelihood value of p is our maximum likelihood estimate. We can see that the maximum occurs somewhere close to our common-sense estimate of p = 0.658.
The likelihood function
Look at the graph we plotted overleaf:
Horizontal axis: the unknown parameter, p.
Vertical axis: the probability of our observation, P(X = 125), under each value of p.
This function is called the likelihood function. It is a function of the unknown parameter p. For our fixed observation X = 125, the likelihood function shows how LIKELY the observation 125 is for every different value of p. The likelihood function is:
L(p) = P(X = 125) when X ~ Binomial(190, p)
     = (190 choose 125) × p^125 × (1 − p)^(190 − 125)
     = (190 choose 125) × p^125 × (1 − p)^65.
This function of p is the curve shown on the graph on page 82.
In general, if our observation were X = x rather than X = 125, the likelihood function is a function of p giving P(X = x) when X ~ Binomial(190, p). We write:
L(p ; x) = P(X = x) when X ~ Binomial(190, p)
         = (190 choose x) × p^x × (1 − p)^(190 − x).
Difference between the likelihood function and the probability function
The likelihood function is a probability of x, but it is a FUNCTION of p. The likelihood gives the probability of a FIXED observation x, for every possible value of the parameter p. Compare this with the probability function, which is the probability of every different value of x, for a FIXED value of p.
[Figure: two panels. Left: the likelihood L(p ; x) for x = 125, plotted against p from 0.50 to 0.80. Right: the probability function P(X = x) when p = 0.6, plotted against x from 100 to 140. Both vertical axes run from 0.00 to 0.06.]
Likelihood function, L(p ; x). Function of p for xed x. Gives P(X = x) as p changes. (x = 125 here, but could be anything.)
Probability function, fX (x). Function of x for xed p. Gives P(X = x) as x changes. (p = 0.6 here, but could be anything.)
Maximizing the likelihood
We have decided that a sensible parameter estimate for p is the maximum likelihood estimate: the value of p at which the observation X = 125 is more likely than at any other value of p. We can find the maximum likelihood estimate using calculus. The likelihood function is
L(p ; 125) = (190 choose 125) × p^125 × (1 − p)^65.
We wish to find the value of p that maximizes this expression. To find the maximizing value of p, differentiate the likelihood with respect to p:
dL/dp = (190 choose 125) × [125 p^124 (1 − p)^65 − 65 p^125 (1 − p)^64]   (Product Rule)
      = (190 choose 125) × p^124 (1 − p)^64 × [125(1 − p) − 65p]
      = (190 choose 125) × p^124 (1 − p)^64 × (125 − 190p).
The maximizing value of p occurs when dL/dp = 0. This gives:
(190 choose 125) × p^124 (1 − p)^64 × (125 − 190p) = 0
⇒ 125 − 190p = 0
⇒ p = 125/190 = 0.658.
For the diver example, the maximum likelihood estimate of 125/190 is the same as the common-sense estimate. This gives us confidence that the method of maximum likelihood is sensible.
The 'hat' notation for an estimate
It is conventional to write the estimated value of a parameter with a 'hat', like this: p̂. For example,
p̂ = 125/190.
Summary of the maximum likelihood procedure
1. Write down the distribution of X in terms of the unknown parameter: X ~ Binomial(190, p).
2. Write down the observed value of X: Observed data: X = 125.
3. Write down the likelihood function for this observed value:
L(p ; 125) = P(X = 125) when X ~ Binomial(190, p) = (190 choose 125) × p^125 × (1 − p)^65.
4. Differentiate the likelihood with respect to the parameter, and set to 0 for the maximum:
dL/dp = (190 choose 125) × p^124 (1 − p)^64 × (125 − 190p) = 0, when p = p̂.
5. Solve for p̂: From the graph, we can see that p = 0 and p = 1 are not maxima.
p̂ = 125/190.
This is the maximum likelihood estimate (MLE) of p.
Verifying the maximum
Strictly speaking, when we find the maximum likelihood estimate using
dL/dp = 0 at p = p̂,
we should verify that the result is a maximum (rather than a minimum) by showing that
d²L/dp² < 0 at p = p̂.
In Stats 210, we will be relaxed about this. You will usually be told to assume that the MLE occurs in the interior of the parameter range. Where possible, it is always best to plot the likelihood function, as on page 82. This confirms that the maximum likelihood estimate exists and is unique. In particular, care must be taken when the parameter has a restricted range like 0 < p < 1 (see later).
Estimators
For the example above, we had observation X = 125, and the maximum likelihood estimate of p was
p̂ = 125/190.
It is clear that we could follow through the same working with any value of X, which we can write as X = x, and we would obtain
p̂ = x/190.
Exercise: Check this by maximizing the likelihood using x instead of 125.
This means that even before we have made our observation of X, we can provide a RULE for calculating the maximum likelihood estimate once X is observed:
Rule: Let X ~ Binomial(190, p). Whatever value of X we observe, the maximum likelihood estimate of p will be p̂ = X/190.
Note that this expression is now a random variable: it depends on the random value of X. A random variable specifying how an estimate is calculated from an observation is called an estimator. In the example above, the maximum likelihood estimaTOR of p is
p̂ = X/190.
General maximum likelihood estimator for Binomial(n, p)
Take any situation in which our observation X has the distribution X ~ Binomial(n, p), where n is known and p is to be estimated.
1. Write down the distribution of X in terms of the unknown parameter: X ~ Binomial(n, p). (n is known.)
2. Write down the observed value of X: Observed data: X = x.
3. Write down the likelihood function for this observed value:
L(p ; x) = P(X = x) when X ~ Binomial(n, p) = (n choose x) × p^x × (1 − p)^(n − x).
4. Differentiate the likelihood with respect to the parameter, and set to 0 for the maximum:
dL/dp = (n choose x) × p^(x−1) (1 − p)^(n−x−1) × (x − np) = 0, when p = p̂. (Exercise)
5. Solve for p̂:
p̂ = x/n.
The maximum likelihood estimaTOR of p is p̂ = X/n. (Just replace the x in the MLE with an X, to convert from the estimate to the estimator.)
By deriving the general maximum likelihood estimator for any problem of this sort, we can plug in values of n and x to get an instant MLE for any Binomial(n, p) problem in which n is known.
Example: Recall the president problem in Section 2.7. Out of 153 children, 65 were daughters. Let p be the probability that a presidential child is a daughter. What is the maximum likelihood estimate of p?
Solution: Plug in the numbers n = 153, x = 65:
p̂ = x/n = 65/153 = 0.425.
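The closed-form answer p̂ = x/n can be double-checked numerically. This Python sketch (our own helper names, standard library only) compares the closed form with a crude grid search over p:

```python
from math import comb

def likelihood(p, x, n):
    """P(X = x) when X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, x = 153, 65            # presidents' children: 65 daughters out of 153
mle = x / n               # closed-form MLE
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda p: likelihood(p, x, n))
print(round(mle, 3), best)  # both approximately 0.425
```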
Note: We showed in Section 2.7 that p was not significantly different from 0.5 in this example. However, the MLE of p is definitely different from 0.5.
This comes back to the meaning of 'significantly different' in the statistical sense. Saying that p is not significantly different from 0.5 just means that we can't DISTINGUISH any difference between p and 0.5 from routine sampling variability.
We expect that p probably IS different from 0.5, just by a little. The maximum likelihood estimate gives us the best estimate of p.
Note: We have only considered the class of problems for which X ~ Binomial(n, p) and n is KNOWN. If n is not known, we have a harder problem: we have two parameters, and one of them (n) should only take discrete values 1, 2, 3, . . . . We will not consider problems of this type in Stats 210.
2.10 Random numbers and histograms
We often wish to generate random numbers from a given distribution. Statistical packages like R have custom-made commands for doing this. To generate (say) 100 random numbers from the Binomial(n = 190, p = 0.6) distribution in R, we use:
rbinom(100, 190, 0.6)
or in long-hand,
rbinom(n=100, size=190, prob=0.6)
Caution: the R inputs n and size are the opposite of what you might expect: n gives the required sample size, and size gives the Binomial parameter n!
Histograms
The usual graph used to visualise a set of random numbers is the histogram. The height of each bar of the histogram shows how many of the random numbers fall into the interval represented by the bar.
For example, if each histogram bar covers an interval of length 5, and if 24 of the random numbers fall between 105 and 110, then the height of the histogram bar for the interval (105, 110) would be 24.
Here are histograms from applying the command rbinom(100, 190, 0.6) three different times.
[Figure: three histograms, each showing the frequency of x for a fresh set of 100 random numbers from rbinom(100, 190, 0.6); x runs from about 80 to 140, and the three histograms differ because the draws are random.]
Each graph shows 100 random numbers from the Binomial(n = 190, p = 0.6)
distribution.
Note: The histograms above have been specially adjusted so that each histogram bar covers an interval of just one integer. For example, the height of the bar plotted at x = 109 shows how many of the 100 random numbers are equal to 109.
Usually, histogram bars would cover a larger interval, and the histogram would be smoother. For example, on the right is a histogram using the default settings in R, obtained from the command hist(rbinom(100, 190, 0.6)). Each histogram bar covers an interval of 5 integers.
[Figure: default R histogram of 100 random numbers from Binomial(190, 0.6); Frequency on the vertical axis (0 to 30), x from 100 to 130 in bars of width 5.]
In all the histograms above, the sum of the heights of all the bars is 100, because there are 100 observations.
Histograms as the sample size increases
Histograms are useful because they show the approximate shape of the underlying probability distribution.
[Figures: histograms of samples of increasing size drawn from the Binomial(190, 0.6) distribution, with frequencies rising from tens to thousands as the sample size grows; the final panel shows the probability function P(X = x) when p = 0.6, with vertical axis from 0.00 to 0.06 and x from 80 to 140. As the sample size increases, the histogram shape settles down to the shape of the probability function.]
2.11 Expectation
Given a random variable X that measures something, we often want to know what is the average value of X. For example, here are 30 random observations taken from the distribution X ~ Binomial(n = 190, p = 0.6):
R command: rbinom(30, 190, 0.6)
116 116 117 122 111 112 114 120 112 102 125 116 97 105 108 117 118 111 116 121 107 113 120 114 114 124 116 118 119 120
[Figure: probability function P(X = x) when p = 0.6, for x from 100 to 140.]
The average, or mean, of the first ten values is:
(116 + 116 + . . . + 112 + 102) / 10 = 114.2.
The mean of the first twenty values is:
(116 + 116 + . . . + 116 + 121) / 20 = 113.8.
The mean of the first thirty values is:
(116 + 116 + . . . + 119 + 120) / 30 = 114.7.
The answers all seem to be close to 114. What would happen if we took the average of hundreds of values?
100 values from Binomial(190, 0.6):
R command: mean(rbinom(100, 190, 0.6))
Result: 114.86
Note: You will get a different result every time you run this command.
1000 values from Binomial(190, 0.6):
R command: mean(rbinom(1000, 190, 0.6))
Result: 114.02
1 million values from Binomial(190, 0.6):
R command: mean(rbinom(1000000, 190, 0.6))
Result: 114.0001
The average seems to be converging to the value 114. The larger the sample size, the closer the average seems to get to 114. If we kept going for larger and larger sample sizes, we would keep getting answers closer and closer to 114. This is because 114 is the DISTRIBUTION MEAN: the mean value that we would get if we were able to draw an infinite sample from the Binomial(190, 0.6) distribution.
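This stabilising of sample means can be imitated without R. The sketch below (Python, standard library; rbinom here is our own simulator built from Bernoulli trials, not R's) draws increasingly large samples and prints their means:

```python
import random

random.seed(1)

def rbinom(size, n, p):
    """size random numbers from Binomial(n, p), each a sum of n Bernoulli trials."""
    return [sum(random.random() < p for _ in range(n)) for _ in range(size)]

for size in (100, 1000, 10000):
    sample = rbinom(size, 190, 0.6)
    print(size, sum(sample) / size)
# The means settle down near the distribution mean 190 * 0.6 = 114.
```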
This distribution mean is called the expectation, or expected value, of the Binomial(190, 0.6) distribution.
Definition: The expected value, also called the expectation or mean, of a discrete random variable X, can be written as either E(X), or EX, or μX, and is given by
μX = E(X) = Σx x fX(x) = Σx x P(X = x).
The expected value is a measure of the centre, or average, of the set of values that X can take, weighted according to the probability of each value.
If we took a very large sample of random numbers from the distribution of X , their average would be approximately equal to X .
For the Binomial(190, 0.6) distribution,
E(X) = Σ_{x=0}^{190} x × (190 choose x) × (0.6)^x × (0.4)^(190−x).
Although it is not obvious, the answer to this sum is n × p = 190 × 0.6 = 114. We will see why in Section 2.14.
Explanation of the formula for expectation
We will move away from the Binomial distribution for a moment, and use a simpler example. Let the random variable X be defined as
X = 1 with probability 0.9, and X = −1 with probability 0.1.
X takes only the values 1 and −1. What is the average value of X?
Using (1 + (−1))/2 = 0 would not be useful, because it ignores the fact that usually X = 1, and only occasionally is X = −1.
Instead, think of observing X many times, say 100 times. Roughly 90 of these 100 times will have X = 1. Roughly 10 of these 100 times will have X = −1.
As the sample gets large, the average of the sample will get ever closer to
0.9 × 1 + 0.1 × (−1) = 0.8,
or in general,
E(X) = Σx x × P(X = x).
Linear property of expectation
Expectation is a linear operator:
Theorem 2.11: Let a and b be constants. Then
E(aX + b) = aE(X) + b.
Proof: Immediate from the definition of expectation.
E(aX + b) = Σx (ax + b) fX(x)
          = a Σx x fX(x) + b Σx fX(x)
          = a E(X) + b × 1
          = aE(X) + b.
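Theorem 2.11 can be checked numerically on the two-valued X from the example above. A minimal Python sketch (names are ours):

```python
# X = 1 with probability 0.9, and X = -1 with probability 0.1.
pmf = {1: 0.9, -1: 0.1}

def expectation(pmf):
    """E(X) = sum over x of x * P(X = x)."""
    return sum(x * prob for x, prob in pmf.items())

a, b = 3.0, 5.0
lhs = sum((a * x + b) * prob for x, prob in pmf.items())  # E(aX + b) directly
rhs = a * expectation(pmf) + b                            # a E(X) + b
print(lhs, rhs)  # equal, as Theorem 2.11 promises
```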
Example: finding expectation from the probability function
Example 1: Let X ~ Binomial(3, 0.2). Write down the probability function of X and find E(X).
We have:
fX(x) = P(X = x) = (3 choose x) × (0.2)^x × (0.8)^(3−x) for x = 0, 1, 2, 3.

x                    0       1       2       3
fX(x) = P(X = x)   0.512   0.384   0.096   0.008

Then
E(X) = Σ_{x=0}^{3} x P(X = x) = 0 × 0.512 + 1 × 0.384 + 2 × 0.096 + 3 × 0.008 = 0.6.
Note: We have E(X) = 0.6 = 3 × 0.2 for X ~ Binomial(3, 0.2). We will prove in Section 2.14 that whenever X ~ Binomial(n, p), then E(X) = np.
Example 2: Let Y ~ Bernoulli(p) (Section 2.3). That is,
Y = 1 with probability p, and Y = 0 with probability 1 − p.
Find E(Y).

y             0        1
P(Y = y)    1 − p      p

E(Y) = 0 × (1 − p) + 1 × p = p.
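Both examples can be verified numerically; a short Python sketch (standard library only, names ours):

```python
from math import comb

# Example 1: E(X) from the Binomial(3, 0.2) probability function.
n, p = 3, 0.2
pmf = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}
mean = sum(x * prob for x, prob in pmf.items())
print(mean)  # about 0.6 = n * p

# Example 2: Bernoulli(p) is the n = 1 case, so E(Y) = 0 * (1 - p) + 1 * p = p.
bern_mean = sum(y * prob for y, prob in {0: 1 - p, 1: p}.items())
print(bern_mean)  # about 0.2 = p
```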
We have to find E(XY) either using their joint probability function (see later), or using their covariance (see later).
2. Special case: when X and Y are INDEPENDENT:
E(XY) = E(X) × E(Y).
2.12 Variable transformations
We often wish to transform random variables through a function. For example, given the random variable X, possible transformations of X include:
X², √X, 4X³, . . .
We often summarize all possible variable transformations by referring to Y = g(X) for some function g.
For discrete random variables, it is very easy to find the probability function for Y = g(X), given that the probability function for X is known. Simply transform the values, and leave the probabilities alone.
For example, let X ~ Binomial(3, 0.2) as in Example 1 above, and let Y = X². Then the probability function for Y = X² is:

y            0²      1²      2²      3²
P(Y = y)   0.512   0.384   0.096   0.008

This is because Y takes the value 0² whenever X takes the value 0, and so on. Thus the probability that Y = 0² is the same as the probability that X = 0.
Overall, we would write the probability function of Y = X² as:

y            0       1       4       9
P(Y = y)   0.512   0.384   0.096   0.008
Example 2: Mr Chance hires out giant helium balloons for advertising. His balloons come in three sizes: heights 2m, 3m, and 4m. 50% of Mr Chance's customers choose to hire the cheapest 2m balloon, while 30% hire the 3m balloon and 20% hire the 4m balloon.
The amount of helium gas in cubic metres required to fill the balloons is h³/2, where h is the height of the balloon. Find the probability function of Y, the amount of helium gas required for a randomly chosen customer.
Let X be the height of balloon ordered by a random customer. The probability function of X is:

height, x (m)    2     3     4
P(X = x)        0.5   0.3   0.2

Let Y be the amount of gas required: Y = X³/2. The probability function of Y is:

gas, y (m³)      4    13.5    32
P(Y = y)        0.5    0.3   0.2
Expected value of a transformed random variable
We can find the expectation of a transformed random variable just like any other random variable. For example, in Example 1 we had X ~ Binomial(3, 0.2), and Y = X². The probability function for X is:

x           0       1       2       3
P(X = x)  0.512   0.384   0.096   0.008

and for Y = X²:

y           0       1       4       9
P(Y = y)  0.512   0.384   0.096   0.008
Thus the expectation of Y = X² is:
E(Y) = E(X²) = 0 × 0.512 + 1 × 0.384 + 4 × 0.096 + 9 × 0.008 = 0.84.
Note: E(X²) is NOT the same as {E(X)}². Check that {E(X)}² = 0.36.
To make the calculation quicker, we could cut out the middle step of writing down the probability function of Y. Because we transform the values and keep the probabilities the same, we have:
E(X²) = 0² × 0.512 + 1² × 0.384 + 2² × 0.096 + 3² × 0.008.
If we write g(X) = X², this becomes:
E{g(X)} = E(X²) = g(0) × 0.512 + g(1) × 0.384 + g(2) × 0.096 + g(3) × 0.008.
Clearly the same arguments can be extended to any function g(X) and any discrete random variable X:
E{g(X)} = Σx g(x) P(X = x).
Transform the values, and leave the probabilities alone.
Definition: For any function g and discrete random variable X, the expected value of g(X) is given by
E{g(X)} = Σx g(x) P(X = x) = Σx g(x) fX(x).
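The 'transform the values, leave the probabilities alone' rule is one line of code. A Python sketch (names ours) reproducing the X² calculation above:

```python
from math import comb

n, p = 3, 0.2
pmf = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

def expect_g(g, pmf):
    """E{g(X)} = sum of g(x) * P(X = x): transform values, keep probabilities."""
    return sum(g(x) * prob for x, prob in pmf.items())

e_x2 = expect_g(lambda x: x**2, pmf)
mean_sq = expect_g(lambda x: x, pmf) ** 2
print(e_x2, mean_sq)  # about 0.84 and 0.36: E(X^2) is not {E(X)}^2
```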
Example: Recall Mr Chance and his balloon-hire business from page 101. Let X be the height of balloon selected by a randomly chosen customer. The probability function of X is:

height, x (m)    2     3     4
P(X = x)        0.5   0.3   0.2

(a) What is the average amount of gas required per customer?
Gas required was X³/2 from page 101. Average gas per customer is E(X³/2):
E(X³/2) = Σx (x³/2) P(X = x) = (2³/2) × 0.5 + (3³/2) × 0.3 + (4³/2) × 0.2 = 12.45 m³ gas.
(b) Mr Chance charges $400h to hire a balloon of height h. What is his expected earning per customer?
Expected earning is E(400X) = 400 E(X)   (expectation is linear)
= 400 × (2 × 0.5 + 3 × 0.3 + 4 × 0.2) = 400 × 2.7 = $1080.
(c) What is his expected earning over the next 5 customers?
Let Z1, . . . , Z5 be the earnings from the next 5 customers. Each Zi has E(Zi) = 1080 by part (b). The total expected earning is
E(Z1 + Z2 + . . . + Z5) = E(Z1) + E(Z2) + . . . + E(Z5) = 5 × 1080 = $5400.
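Parts (a) to (c) can be checked with a few lines of Python (pmf values from the example; helper name ours):

```python
pmf_x = {2: 0.5, 3: 0.3, 4: 0.2}   # height of the hired balloon, in metres

def expect_g(g, pmf):
    """E{g(X)} = sum of g(x) * P(X = x)."""
    return sum(g(x) * p for x, p in pmf.items())

gas = expect_g(lambda h: h**3 / 2, pmf_x)   # (a) expected gas per customer
earn = 400 * expect_g(lambda h: h, pmf_x)   # (b) E(400X) = 400 E(X) by linearity
total = 5 * earn                            # (c) five customers in a row
print(gas, earn, total)  # about 12.45 m^3, $1080, $5400
```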
Suppose
X = 3 with probability 3/4, and X = 8 with probability 1/4.
Then 3/4 of the time, X takes value 3, and 1/4 of the time, X takes value 8. So
E(X) = (3/4) × 3 + (1/4) × 8 :
add up the values times how often they occur.
What about E(√X)?
√X = √3 with probability 3/4, and √8 with probability 1/4.
So
E(√X) = (3/4) × √3 + (1/4) × √8 :
again, add up the values times how often they occur.
Common mistakes:
i) E(√X) = √(E(X)) = √((3/4) × 3 + (1/4) × 8). Wrong!
ii) E(√X) = √(3/4) × √3 + √(1/4) × √8. Wrong!
iii) E(√X) = √((3/4) × 3) + √((1/4) × 8). Wrong!
2.13 Variance
Example: Mrs Tractor runs the Rational Bank of Remuera. Every day she hopes to fill her cash machine with enough cash to see the well-heeled citizens of Remuera through the day. She knows that the expected amount of money withdrawn each day is $50,000. How much money should she load in the machine? $50,000?
No: $50,000 is the average, near the centre of the distribution. About half the time, the money required will be GREATER than the average.
How much money should Mrs Tractor put in the machine if she wants to be 99% certain that there will be enough for the day's transactions? Answer: it depends how much the amount withdrawn varies above and below its mean.
For questions like this, we need the study of variance. Variance is the average squared distance of a random variable from its own mean.
Definition: The variance of a random variable X is written as either Var(X) or σX², and is given by
σX² = Var(X) = E[(X − μX)²] = E[(X − EX)²].
Similarly, the variance of a function of X is
Var(g(X)) = E[(g(X) − E(g(X)))²].
Note: The standard deviation of X is the square root of the variance:
sd(X) = √Var(X) = √(σX²) = σX.
Variance as the average squared distance from the mean
The variance is a measure of how spread out the values of X are. It is the average squared distance between a value of X and the central (mean) value, μX.
[Figure: possible values x1, x2, . . . , x6 of X marked on a line, with the central value μX in the middle and distances such as x2 − μX and x4 − μX shown.]
Var(X) = E[(X − μX)²]
(1) Take the distance from observed values of X to the central point, μX. Square it to balance positive and negative distances.
(2) Then take the average over all values X can take: i.e. if we observed X many times, find what would be the average squared distance between X and μX.
Note: The mean, μX, and the variance, σX², of X are just numbers: there is nothing random or variable about them.
Example: Let X = 3 with probability 3/4, and X = 8 with probability 1/4. Then
E(X) = μX = 3 × (3/4) + 8 × (1/4) = 4.25,
Var(X) = σX² = E(X²) − μX² = (3² × 3/4 + 8² × 1/4) − (4.25)² = 4.6875.
When we observe X, we get either 3 or 8: this is random. But μX is fixed at 4.25, and σX² is fixed at 4.6875, regardless of the outcome of X.
For a discrete random variable,
Var(X) = E[(X − μX)²] = Σx (x − μX)² fX(x) = Σx (x − μX)² P(X = x).
A useful shortcut is Var(X) = E(X²) − (EX)². Proof:
Var(X) = E[(X − μX)²]   (by definition)
       = E[X² − 2μX X + μX²].
Here μX is a constant, and X is the random variable, so by Theorem 2.11,
       = E(X²) − 2μX E(X) + μX²
       = E(X²) − 2μX² + μX²   (because E(X) = μX)
       = E(X²) − μX².
Note: e.g. E(X²) = Σx x² fX(x), and μX = Σx x fX(x).
Thus the variance can be computed from E(X²) and μX.
Theorem 2.13B: If a and b are constants and g(x) is a function, then
i) Var(aX + b) = a² Var(X).
ii) Var(a g(X) + b) = a² Var{g(X)}.
Proof:
i) Var(aX + b) = E[(aX + b − E(aX + b))²]
             = E[(aX + b − aE(X) − b)²]   (by Thm 2.11)
             = E[a²(X − E(X))²]
             = a² E[(X − E(X))²]   (by Thm 2.11)
             = a² Var(X).
ii) The same argument with g(X) in place of X gives Var(a g(X) + b) = a² Var{g(X)}.
Example: finding expectation and variance from the probability function
Recall Mr Chance's balloons from page 101. The random variable Y is the amount of gas required by a randomly chosen customer. The probability function of Y is:

gas, y (m³)     4    13.5    32
P(Y = y)       0.5    0.3   0.2

Find Var(Y).
We know that E(Y) = μY = 12.45 from page 103.
First method: use Var(Y) = E[(Y − μY)²]:
Var(Y) = (4 − 12.45)² × 0.5 + (13.5 − 12.45)² × 0.3 + (32 − 12.45)² × 0.2 = 112.47.
Second method: use E(Y²) − μY²: (usually easier)
E(Y²) = 4² × 0.5 + 13.5² × 0.3 + 32² × 0.2 = 267.475.
So Var(Y) = 267.475 − (12.45)² = 112.47, as before.
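Both variance formulas are quick to confirm in code; a Python sketch using the gas pmf above (names ours):

```python
pmf_y = {4: 0.5, 13.5: 0.3, 32: 0.2}               # gas required, m^3
mu = sum(y * p for y, p in pmf_y.items())          # E(Y) = 12.45

var_direct = sum((y - mu)**2 * p for y, p in pmf_y.items())   # E[(Y - mu)^2]
e_y2 = sum(y**2 * p for y, p in pmf_y.items())                # E(Y^2)
var_shortcut = e_y2 - mu**2                                   # E(Y^2) - mu^2
print(var_direct, var_shortcut)  # both about 112.47
```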
Variance of a sum of random variables: Var(X + Y)
There are two cases when finding the variance of a sum:
1. General case: Var(X + Y) = Var(X) + Var(Y) + 2 cov(X, Y), where cov(X, Y) is the covariance of X and Y (see later).
2. Special case: when X and Y are INDEPENDENT, Var(X + Y) = Var(X) + Var(Y).
Interlude: TRUE or FALSE?
Guess whether each of the following statements is true or false.
1. Toss a fair coin 10 times. The probability of getting 8 or more heads is less than 1%.
2. Toss a fair coin 200 times. The chance of getting a run of at least 6 heads or 6 tails in a row is less than 10%.
3. Consider a classroom with 30 pupils of age 5, and one teacher of age 50. The probability that the pupils all outlive the teacher is about 90%.
4. Open the Business Herald at the pages giving share prices, or open an atlas at the pages giving country areas or populations. Pick a column of figures. The figures are over 5 times more likely to begin with the digit 1 than with the digit 9.
Answers: 1. FALSE: it is 5.5%. 2. FALSE: it is 97%. 3. FALSE: in NZ the probability is about 50%. 4. TRUE: in fact they are 6.5 times more likely.
2.14 Mean and variance of the Binomial(n, p) distribution
Let X ~ Binomial(n, p). We have mentioned several times that E(X) = np. We now prove this and the additional result for Var(X). If X ~ Binomial(n, p), then:
E(X) = μX = np,
Var(X) = σX² = np(1 − p).
Easy proof: X as a sum of Bernoulli random variables. X is the number of successes out of n independent trials, so we can write X = Y1 + Y2 + . . . + Yn, where each
Yi = 1 if trial i is a success, and Yi = 0 if trial i is a failure.
That is, Yi counts as a 1 if trial i is a success, and as a 0 if trial i is a failure. Overall, Y1 + . . . + Yn is the total number of successes out of n independent trials, which is the same as X.
Note: Each Yi is a Bernoulli(p) random variable (Section 2.3).
Now if X = Y1 + Y2 + . . . + Yn, and Y1, . . . , Yn are independent, then:
E(X) = E(Y1) + E(Y2) + . . . + E(Yn),
and, because Y1, . . . , Yn are independent,
Var(X) = Var(Y1) + Var(Y2) + . . . + Var(Yn).
The probability function of each Yi is:

y             0        1
P(Yi = y)   1 − p      p

So E(Yi) = 0 × (1 − p) + 1 × p = p.
Also, E(Yi²) = 0² × (1 − p) + 1² × p = p,
so Var(Yi) = E(Yi²) − (E(Yi))² = p − p² = p(1 − p).
Therefore
E(X) = E(Y1) + . . . + E(Yn) = n × p = np,
and
Var(X) = Var(Y1) + . . . + Var(Yn) = n × p(1 − p) = np(1 − p).
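The formulae E(X) = np and Var(X) = np(1 − p) can also be confirmed numerically from the probability function for a small case; a Python sketch (n = 10, p = 0.3 chosen arbitrarily):

```python
from math import comb

n, p = 10, 0.3
pmf = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}
mean = sum(x * q for x, q in pmf.items())
var = sum((x - mean)**2 * q for x, q in pmf.items())
print(mean, var)  # about 3.0 and 2.1, i.e. np and np(1 - p)
```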
Hard proof: for mathematicians (non-examinable)
We show below how the Binomial mean and variance formulae can be derived directly from the probability function.
E(X) = Σ_{x=0}^{n} x fX(x) = Σ_{x=0}^{n} x (n choose x) p^x (1 − p)^(n−x) = Σ_{x=0}^{n} x × n! / ((n − x)! x!) × p^x (1 − p)^(n−x).
But x × n!/x! = n!/(x − 1)! for x ≥ 1, and the x = 0 term contributes nothing, so
E(X) = Σ_{x=1}^{n} n! / ((n − x)! (x − 1)!) × p^x (1 − p)^(n−x).
Next: make n's into (n − 1)'s, x's into (x − 1)'s, wherever possible: e.g. n − x = (n − 1) − (x − 1), n! = n(n − 1)!, p^x = p × p^(x−1), etc. This gives
E(X) = np Σ_{x=1}^{n} ((n − 1) choose (x − 1)) p^(x−1) (1 − p)^((n−1)−(x−1)).
We need to show this sum equals 1. Substituting y = x − 1,
Σ_{y=0}^{n−1} ((n − 1) choose y) p^y (1 − p)^((n−1)−y) = 1,
because it is the sum of the Binomial(n − 1, p) probability function over all its values. So
E(X) = np.
For the variance, the trick is to find E[X(X − 1)] first. Here goes:
E[X(X − 1)] = Σ_{x=0}^{n} x(x − 1) (n choose x) p^x (1 − p)^(n−x).
The first two terms (x = 0 and x = 1) are 0 due to the x(x − 1) in the numerator. Thus
E[X(X − 1)] = n(n − 1)p² Σ_{x=2}^{n} ((n − 2) choose (x − 2)) p^(x−2) (1 − p)^((n−2)−(x−2))
            = n(n − 1)p² Σ_{y=0}^{m} (m choose y) p^y (1 − p)^(m−y),   if m = n − 2, y = x − 2.
The sum equals 1, so E[X(X − 1)] = n(n − 1)p². Thus
Var(X) = E(X²) − (E(X))²
       = E(X²) − E(X) + E(X) − (E(X))²
       = E[X(X − 1)] + E(X) − (E(X))²
       = n(n − 1)p² + np − n²p²
       = np(1 − p).
Note the steps: take out x(x − 1) and replace n by (n − 2), x by (x − 2) wherever possible.
Variance of the MLE for the Binomial p parameter
In Section 2.9 we derived the maximum likelihood estimator for the Binomial parameter p.
Reminder: Take any situation in which our observation X has the distribution X ~ Binomial(n, p), where n is KNOWN and p is to be estimated. Make a single observation X = x. The maximum likelihood estimator of p is
p̂ = X/n.
An estimator is a random variable, so to assess how precise it is we need some idea of its variability.
For example, in the deep-sea diver example introduced in Section 2.7, we estimated the probability that a diver has a daughter as
p̂ = X/n = 125/190 = 0.658.
What is our margin of error on this estimate? Do we believe it is 0.658 ± 0.3 (say), in other words almost useless, or do we believe it is very precise, perhaps 0.658 ± 0.02?
We assess the usefulness of estimators using their variance. Given p̂ = X/n, we have:
Var(p̂) = Var(X/n)
       = (1/n²) Var(X)   (by Theorem 2.13B)
       = (1/n²) × np(1 − p)   (for X ~ Binomial(n, p))
       = p(1 − p)/n.   (*)
In practice, however, we do not know the true value of p, so we cannot calculate the exact Var(p̂). Instead, we have to ESTIMATE Var(p̂) by replacing the unknown p in equation (*) by p̂. We call our estimated variance Vâr(p̂):
Vâr(p̂) = p̂(1 − p̂)/n.
The standard error of p̂ is:
se(p̂) = √Vâr(p̂).
This result occurs because the Central Limit Theorem guarantees that p̂ will be approximately Normally distributed in large samples (large n). We will study the Central Limit Theorem in Chapter 5.
The expression p̂ ± 1.96 × se(p̂) gives an approximate 95% confidence interval for p under the Normal approximation.
Example: For the deep-sea diver example, with n = 190, p̂ = 0.658,
so:
se(p̂) = √(0.658 × 0.342 / 190) = 0.034.
For our final answer, we should therefore quote:
p̂ = 0.658 ± 1.96 × 0.034 = 0.658 ± 0.067,
or
p̂ = 0.658 with a 95% confidence interval of (0.591, 0.725).
In Chapter 2 we introduced several fundamental ideas: hypothesis testing, likelihood, expectation, and variance. Each of these was illustrated by the Binomial distribution. We now introduce several other discrete distributions and discuss their properties and usage. First we revise Bernoulli trials and the Binomial distribution.
Bernoulli Trials
A set of Bernoulli trials is a series of trials such that:
i) each trial has only 2 possible outcomes: Success and Failure;
ii) the probability of success, p, is constant for all trials;
iii) the trials are independent.
Examples: 1) Repeated tossing of a fair coin: each toss is a Bernoulli trial with P(success) = P(head) = 1/2.
2) Having children: each child can be thought of as a Bernoulli trial with outcomes {girl, boy} and P(girl) = 0.5.
3.1 Binomial distribution
Description: X ~ Binomial(n, p) if X is the number of successes out of a fixed number n of Bernoulli trials, each with P(success) = p.
Probability function: fX(x) = P(X = x) = (n choose x) p^x (1 − p)^(n−x) for x = 0, 1, . . . , n.
Mean: E(X) = np.
Variance: Var(X) = np(1 − p).
Sum of independent Binomials: If X ~ Binomial(n, p) and Y ~ Binomial(m, p), and if X and Y are independent, and if X and Y both share the same parameter p, then
X + Y ~ Binomial(n + m, p).
119
[Figure: Binomial probability functions for n = 10, p = 0.9 (skewed for p close to 1) and n = 100, p = 0.9 (less skew for p = 0.9 if n is large).]
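The Binomial probability function and its mean and variance formulae can be verified numerically; a Python sketch (not part of the original notes):

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.9
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

# The probabilities sum to 1 (Binomial Theorem), and the mean and
# variance match the formulae E(X) = np and Var(X) = np(1 - p).
mean = sum(x * f for x, f in enumerate(pmf))
var = sum(x**2 * f for x, f in enumerate(pmf)) - mean**2

print(round(sum(pmf), 10))  # 1.0
print(round(mean, 10))      # 9.0  (= np)
print(round(var, 10))       # 0.9  (= np(1-p))
```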
3.2 Geometric distribution

Like the Binomial distribution, the Geometric distribution is defined in terms of a sequence of Bernoulli trials. The Binomial distribution counts the number of successes out of a fixed number of trials. The Geometric distribution counts the number of trials before the first success occurs. This means that the Geometric distribution counts the number of failures before the first success.
Properties of the Geometric distribution

i) Description
X ~ Geometric(p) if X is the number of failures before the first success in a series of Bernoulli trials with P(success) = p.

ii) Probability function
For X ~ Geometric(p),

f_X(x) = P(X = x) = (1-p)^x p   for x = 0, 1, 2, ...

Explanation: P(X = x) = (1-p)^x \times p, because we need x failures (probability (1-p)^x), followed by one success (probability p).

Difference between Geometric and Binomial: For the Geometric distribution, the trials must always occur in the order FF...FS: x failures, then the single success. For the Binomial distribution, failures and successes can occur in any order: e.g. FF...FS, FSF...F, SF...F, etc. This is why the Geometric distribution has probability function P(x failures, 1 success) = (1-p)^x p, while the Binomial distribution has probability function P(x failures, 1 success) = \binom{x+1}{x} (1-p)^x p.

iii) Mean and variance
For X ~ Geometric(p), writing q = 1 - p:

E(X) = \frac{1-p}{p} = \frac{q}{p},    Var(X) = \frac{1-p}{p^2} = \frac{q}{p^2}.
iv) Sum of independent Geometric random variables
If X1, ..., Xk are independent, and each Xi ~ Geometric(p), then X1 + ... + Xk ~ Negative Binomial(k, p) (see later).

v) Shape
Geometric probabilities are always greatest at x = 0. The distribution always has a long right tail (positive skew). The length of the tail depends on p. For small p, there could be many failures before the first success, so the tail is long. For large p, a success is likely to occur almost immediately, so the tail is short.
[Figure: Geometric probability functions for p = 0.3 (small p), p = 0.5 (moderate p), and p = 0.9 (large p).]
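The Geometric mean and variance formulae can be checked numerically by truncating the infinite sum; a Python sketch (not part of the original notes):

```python
# Geometric(p): X = number of failures before the first success.
p = 0.3
q = 1 - p

# Truncate the infinite sum at a large cutoff; the tail beyond it is negligible.
xs = range(500)
pmf = [q**x * p for x in xs]

mean = sum(x * f for x, f in zip(xs, pmf))
var = sum(x**2 * f for x, f in zip(xs, pmf)) - mean**2

print(round(sum(pmf), 6))  # 1.0 (up to truncation)
print(round(mean, 6))      # q/p  = 0.7/0.3  = 2.333333
print(round(var, 6))       # q/p^2 = 0.7/0.09 = 7.777778
```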
vi) Likelihood
For any random variable, the likelihood function is just the probability function expressed as a function of the unknown parameter. If: X ~ Geometric(p); p is unknown; the observed value of X is x; then the likelihood function is:

L(p ; x) = p(1-p)^x   for 0 < p < 1.
Example: we observe a fish making 5 failed jumps before reaching the top of a waterfall. We wish to estimate the probability of success for each jump.
Then L(p ; 5) = p(1-p)^5 for 0 < p < 1. Maximize L with respect to p to find the MLE, \hat{p}.
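The maximization can be sketched numerically in Python (not part of the original notes). Differentiating L(p; x) = p(1-p)^x and setting dL/dp = 0 gives the closed form \hat{p} = 1/(1+x), which the grid search recovers:

```python
# Grid-search maximization of the Geometric likelihood L(p; x) = p(1-p)^x
# for the fish example: x = 5 failed jumps before the first success.
x = 5

def L(p):
    return p * (1 - p) ** x

grid = [i / 100000 for i in range(1, 100000)]  # p in (0, 1)
p_hat = max(grid, key=L)

print(round(p_hat, 4))   # 0.1667, i.e. p-hat = 1/(1+x) = 1/6
```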
For mathematicians: proof of Geometric mean and variance formulae (non-examinable)

We wish to prove that E(X) = \frac{1-p}{p} and Var(X) = \frac{1-p}{p^2} when X ~ Geometric(p).

We use the following results:

\sum_{x=1}^{\infty} x q^{x-1} = \frac{1}{(1-q)^2}   (3.1)

and

\sum_{x=2}^{\infty} x(x-1) q^{x-2} = \frac{2}{(1-q)^3}.   (3.2)

Proof of (3.1) and (3.2): Consider the infinite sum of a geometric progression (for |q| < 1):

\sum_{x=0}^{\infty} q^x = \frac{1}{1-q}.

Differentiate both sides with respect to q:

\frac{d}{dq} \sum_{x=0}^{\infty} q^x = \frac{d}{dq} \left( \frac{1}{1-q} \right),   giving   \sum_{x=1}^{\infty} x q^{x-1} = \frac{1}{(1-q)^2}.
Note that the lower limit of the summation becomes x = 1 because the term for x = 0 vanishes on differentiation. The proof of (3.2) is obtained similarly, by differentiating both sides of (3.1) with respect to q (Exercise).
Now, for X ~ Geometric(p) with q = 1 - p:

E(X) = \sum_{x=0}^{\infty} x P(X = x)
     = \sum_{x=0}^{\infty} x p q^x   (where q = 1 - p)
     = p \sum_{x=1}^{\infty} x q^x   (lower limit becomes x = 1 because the term in x = 0 is zero)
     = pq \sum_{x=1}^{\infty} x q^{x-1}
     = pq \cdot \frac{1}{(1-q)^2}   (by equation (3.1))
     = \frac{pq}{p^2}   (because 1 - q = p)
     = \frac{q}{p},

as required.
For Var(X), we use

Var(X) = E(X^2) - (EX)^2 = E\{X(X-1)\} + E(X) - (EX)^2.   (⋆)

Now

E\{X(X-1)\} = \sum_{x=0}^{\infty} x(x-1) P(X = x)
            = \sum_{x=0}^{\infty} x(x-1) p q^x
            = p q^2 \sum_{x=2}^{\infty} x(x-1) q^{x-2}   (the terms in x = 0 and x = 1 are zero)
            = p q^2 \cdot \frac{2}{(1-q)^3}   (by equation (3.2))
            = \frac{2q^2}{p^2}.

Thus by (⋆),

Var(X) = \frac{2q^2}{p^2} + \frac{q}{p} - \frac{q^2}{p^2} = \frac{q^2}{p^2} + \frac{q}{p} = \frac{q(q+p)}{p^2} = \frac{q}{p^2},

as required, since q + p = 1.
3.3 Negative Binomial distribution

The Negative Binomial distribution is a generalised form of the Geometric distribution:
the Geometric distribution counts the number of failures before the first success;
the Negative Binomial distribution counts the number of failures before the k-th success.

If every trial has probability p of success, we write: X ~ NegBin(k, p).

Examples:
1) X = number of boys before the second girl in a family: X ~ NegBin(k = 2, p = 0.5).
2) Tom needs to pass 24 papers to complete his degree. He passes each paper with probability p, independently of all other papers. Let X be the number of papers Tom fails before completing his degree: X ~ NegBin(k = 24, p).

i) Description
X ~ NegBin(k, p) if X is the number of failures before the k-th success in a series of Bernoulli trials with P(success) = p.

ii) Probability function
For X ~ NegBin(k, p),

f_X(x) = P(X = x) = \binom{k+x-1}{x} p^k (1-p)^x   for x = 0, 1, 2, ...
Explanation: For X = x, we need x failures and k successes. The trials stop when we reach the k-th success, so the last trial must be a success. This leaves x failures and k-1 successes to occur in any order: a total of k-1+x trials. For example, if x = 3 failures and k = 2 successes, we could have:

FFFSS, FFSFS, FSFFS, SFFFS.

So:

P(X = x) = \binom{k+x-1}{x} p^k (1-p)^x,

where \binom{k+x-1}{x} counts the arrangements of (k-1) successes and x failures among the first (k-1+x) trials, p^k is for the k successes, and (1-p)^x is for the x failures.

iii) Mean and variance
For X ~ NegBin(k, p), writing q = 1 - p:

E(X) = \frac{k(1-p)}{p} = \frac{kq}{p},    Var(X) = \frac{k(1-p)}{p^2} = \frac{kq}{p^2}.
These results can be proved from the fact that the Negative Binomial distribution is obtained as the sum of k independent Geometric random variables:

X = Y1 + ... + Yk, where the Yi are independent and each Yi ~ Geometric(p).

Then:
E(X) = k E(Yi) = \frac{kq}{p},
Var(X) = k Var(Yi) = \frac{kq}{p^2}.

iv) Sum of independent Negative Binomial random variables
If X and Y are independent, and X ~ NegBin(k, p), Y ~ NegBin(m, p), with the same value of p, then X + Y ~ NegBin(k + m, p).
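These formulae can also be checked numerically from the probability function; a Python sketch (not part of the original notes):

```python
from math import comb

def nb_pmf(x, k, p):
    """P(X = x) for X ~ NegBin(k, p): x failures before the k-th success."""
    return comb(k + x - 1, x) * p**k * (1 - p)**x

k, p = 3, 0.5
pmf = [nb_pmf(x, k, p) for x in range(400)]  # truncate the infinite support

mean = sum(x * f for x, f in enumerate(pmf))
var = sum(x**2 * f for x, f in enumerate(pmf)) - mean**2

print(round(sum(pmf), 6))  # 1.0 (up to truncation)
print(round(mean, 6))      # kq/p   = 3.0
print(round(var, 6))       # kq/p^2 = 6.0
```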
v) Shape
The Negative Binomial is flexible in shape. Below are the probability functions for various different values of k and p.
[Figure: NegBin(k, p) probability functions for k = 3, p = 0.5; k = 3, p = 0.8; and k = 10, p = 0.5.]
vi) Likelihood
As always, the likelihood function is the probability function expressed as a function of the unknown parameters. If: X ~ NegBin(k, p); k is known; p is unknown; the observed value of X is x; then the likelihood function is:

L(p ; x) = \binom{k+x-1}{x} p^k (1-p)^x.
Example: Tom fails a total of 4 papers before finishing his degree. What is his pass probability for each paper?
X = # failed papers before 24 passed papers: X ~ NegBin(24, p). Maximize L(p ; 4) with respect to p to find the MLE, \hat{p}.
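A numeric sketch of Tom's estimation problem in Python (not part of the original notes). Setting dL/dp = 0 gives the closed form \hat{p} = k/(k+x), which the grid search recovers:

```python
# Grid-search maximization of the Negative Binomial likelihood for Tom:
# k = 24 passed papers, x = 4 observed failures. The binomial coefficient
# does not depend on p, so it can be dropped from the maximization.
k, x = 24, 4

def L(p):
    return p**k * (1 - p)**x

grid = [i / 100000 for i in range(1, 100000)]  # p in (0, 1)
p_hat = max(grid, key=L)

print(round(p_hat, 4))   # 0.8571, i.e. p-hat = k/(k+x) = 24/28
```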
3.4 Hypergeometric distribution: sampling without replacement

The Hypergeometric distribution is used when we are sampling without replacement from a finite population. Suppose we have N objects, of which M are 'special'. We remove n objects at random without replacement. Let X = number of the n removed objects that are special. Then X ~ Hypergeometric(N, M, n).

Example: Ron has a box of Chocolate Frogs. There are 20 chocolate frogs in the box. Eight of them are dark chocolate, and twelve of them are white chocolate. Ron grabs a random handful of 5 chocolate frogs and stuffs them into his mouth when he thinks that no-one is looking. Let X be the number of dark chocolate frogs he picks: X ~ Hypergeometric(N = 20, M = 8, n = 5).
ii) Probability function
For X ~ Hypergeometric(N, M, n),

f_X(x) = P(X = x) = \frac{\binom{M}{x} \binom{N-M}{n-x}}{\binom{N}{n}}   for x = max(0, n+M-N) to x = min(n, M).
Explanation: We need to choose x special objects and n-x other objects.

Number of ways of selecting x special objects from the M available: \binom{M}{x}.
Number of ways of selecting n-x other objects from the N-M available: \binom{N-M}{n-x}.
Total number of ways of choosing x special objects and (n-x) other objects: \binom{M}{x} \binom{N-M}{n-x}.
Overall number of ways of choosing n objects from N: \binom{N}{n}.

Thus:

P(X = x) = \frac{\binom{M}{x} \binom{N-M}{n-x}}{\binom{N}{n}}.
Note: We need 0 ≤ x ≤ M (number of special objects) and 0 ≤ n-x ≤ N-M (number of other objects). After some working, this gives us the stated constraint that x runs from max(0, n+M-N) to min(n, M).

Example: What is the probability that Ron selects 3 white and 2 dark chocolates?
X = # dark chocolates. There are N = 20 chocolates, including M = 8 dark chocolates. We need

P(X = 2) = \frac{\binom{8}{2} \binom{12}{3}}{\binom{20}{5}} = 0.397.
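Ron's example can be verified with a direct computation; a Python sketch (not part of the original notes):

```python
from math import comb

def hyper_pmf(x, N, M, n):
    """P(X = x) for X ~ Hypergeometric(N, M, n)."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

# Ron's chocolate frogs: N = 20 frogs, M = 8 dark, handful of n = 5.
N, M, n = 20, 8, 5
p2 = hyper_pmf(2, N, M, n)
print(round(p2, 4))   # 0.3973

# The pmf sums to 1 over the valid range of x.
total = sum(hyper_pmf(x, N, M, n)
            for x in range(max(0, n + M - N), min(n, M) + 1))
print(round(total, 10))  # 1.0
```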
iii) Mean and variance
For X ~ Hypergeometric(N, M, n),

E(X) = np,    Var(X) = np(1-p) \left( \frac{N-n}{N-1} \right),   where p = \frac{M}{N}.
iv) Shape
The Hypergeometric distribution is similar to the Binomial distribution when n/N is small, because removing n objects does not change the overall composition of the population very much when n/N is small. For n/N < 0.1 we often approximate the Hypergeometric(N, M, n) distribution by the Binomial(n, p = M/N) distribution.
[Figure: Hypergeometric(30, 12, 10) probability function compared with Binomial(10, 12/30).]
Note: The Hypergeometric distribution can be used for opinion polls, because these involve sampling without replacement from a finite population. The Binomial distribution is used when the population is sampled with replacement.
As noted above, Hypergeometric(N, M, n) → Binomial(n, M/N) as N → ∞.

A note about distribution names

Discrete distributions often get their names from mathematical power series. Binomial probabilities sum to 1 because of the Binomial Theorem:

\left( p + (1-p) \right)^n = \sum_{x=0}^{n} \binom{n}{x} p^x (1-p)^{n-x} = 1.

Negative Binomial probabilities sum to 1 by the Negative Binomial expansion: i.e. the Binomial expansion with a negative power, -k:

p^k \left( 1 - (1-p) \right)^{-k} = p^k \sum_{x=0}^{\infty} \binom{k+x-1}{x} (1-p)^x = 1.
3.5 Poisson distribution

When is the next volcano due to erupt in Auckland?

Poisson Process
The Poisson process counts the number of events occurring in a fixed time or space, when events occur independently at a constant average rate.

Example: Let X be the number of road accidents in a year in New Zealand. Suppose that:
i) all accidents are independent of each other;
ii) accidents occur at a constant average rate of λ per year;
iii) accidents cannot occur simultaneously.
Then the number of accidents in a year, X, has the distribution X ~ Poisson(λ).
Number of accidents in one year
Let X be the number of accidents to occur in one year: X ~ Poisson(λ). The probability function for X ~ Poisson(λ) is

P(X = x) = \frac{λ^x}{x!} e^{-λ}   for x = 0, 1, 2, ...

Number of accidents in t years
Let X_t be the number of accidents to occur in time t years. Then X_t ~ Poisson(λt), and

P(X_t = x) = \frac{(λt)^x}{x!} e^{-λt}   for x = 0, 1, 2, ...

General definition of the Poisson process
Take any sequence of random events such that:
i) all events are independent;
ii) events occur at a constant average rate of λ per unit time;
iii) events cannot occur simultaneously.
Let X_t be the number of events to occur in time t. Then X_t ~ Poisson(λt), and

P(X_t = x) = \frac{(λt)^x}{x!} e^{-λt}   for x = 0, 1, 2, ...
Note: For a Poisson process in space, let X_A = # events in an area (or volume) of size A. Then X_A ~ Poisson(λA). Example: X_A = number of raisins in a volume A of currant bun.
Where does the Poisson formula come from? (Sketch idea, for mathematicians; non-examinable)

The formal definition of the Poisson process is as follows.

Definition: The random variables {X_t : t > 0} form a Poisson process with rate λ if:
i) events occurring in any time interval are independent of those occurring in any other disjoint time interval;
ii) \lim_{\delta t \to 0} \frac{P(\text{exactly one event occurs in } (t, t+\delta t])}{\delta t} = λ;
iii) \lim_{\delta t \to 0} \frac{P(\text{more than one event occurs in } (t, t+\delta t])}{\delta t} = 0.

These conditions can be used to derive a partial differential equation on a function known as the probability generating function of X_t. The partial differential equation is solved to provide the form P(X_t = x) = \frac{(λt)^x}{x!} e^{-λt}.

Poisson distribution
The Poisson distribution is not just used in the context of the Poisson process. It is also used in many other situations, often as a subjective model (see Section 3.6). Its properties are as follows.

i) Probability function
For X ~ Poisson(λ),

f_X(x) = P(X = x) = \frac{λ^x}{x!} e^{-λ}   for x = 0, 1, 2, ...
ii) Mean and variance
The mean and variance of the Poisson(λ) distribution are both λ:

E(X) = Var(X) = λ   when X ~ Poisson(λ).

Notes:
1. It makes sense that E(X) = λ: by definition, λ is the average number of events per unit time in the Poisson process.
2. The variance of the Poisson distribution increases with the mean (in fact, variance = mean). This is often the case in real life: there is more uncertainty associated with larger numbers than with smaller numbers.

iii) Sum of independent Poisson random variables
If X and Y are independent, and X ~ Poisson(λ), Y ~ Poisson(μ), then X + Y ~ Poisson(λ + μ).

iv) Shape
The shape of the Poisson distribution depends upon the value of λ. For small λ, the distribution has positive (right) skew. As λ increases, the distribution becomes more and more symmetrical, until for large λ it has the familiar bell-shaped appearance. The probability functions for various λ are shown below.
[Figure: Poisson(λ) probability functions for λ = 1, λ = 3.5, and λ = 100.]
v) Likelihood and estimator variance
As always, the likelihood function is the probability function expressed as a function of the unknown parameters. If: X ~ Poisson(λ); λ is unknown; the observed value of X is x; then the likelihood function is:

L(λ ; x) = \frac{λ^x}{x!} e^{-λ}   for 0 < λ < ∞.
Example: Let X = # babies born in a day in Mt Roskill. Assume that X ~ Poisson(λ).

Observation: X = 28 babies.

Likelihood:

L(λ ; 28) = \frac{λ^{28}}{28!} e^{-λ}.

Maximize L with respect to λ to find the MLE, \hat{λ}. We find that \hat{λ} = x = 28. Thus the estimator variance is Var(\hat{λ}) = Var(X) = λ, because X ~ Poisson(λ). Because we don't know λ, we have to estimate the variance: \widehat{Var}(\hat{λ}) = \hat{λ}.

vi) R command for the p-value: If X ~ Poisson(λ), then the R command for P(X ≤ x) is ppois(x, lambda).

Proof of Poisson mean and variance formulae (non-examinable)
We wish to prove that E(X) = Var(X) = λ for X ~ Poisson(λ).
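The maximization for the Mt Roskill example can be sketched numerically in Python (not part of the original notes):

```python
from math import lgamma, log

# Log-likelihood for one observation x = 28 from Poisson(lambda).
# The MLE is lambda-hat = x; a grid search over lambda recovers it.
x = 28

def loglik(lam):
    return x * log(lam) - lam - lgamma(x + 1)

grid = [i / 100 for i in range(1, 10001)]  # lambda in (0, 100]
lam_hat = max(grid, key=loglik)

print(lam_hat)   # 28.0
```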
So

E(X) = \sum_{x=0}^{\infty} x \frac{λ^x}{x!} e^{-λ}
     = λ e^{-λ} \sum_{x=1}^{\infty} \frac{λ^{x-1}}{(x-1)!}   (the term in x = 0 is zero)
     = λ e^{-λ} \sum_{y=0}^{\infty} \frac{λ^y}{y!}   (putting y = x - 1)
     = λ e^{-λ} e^{λ}
     = λ.

Similarly,

E\{X(X-1)\} = \sum_{x=0}^{\infty} x(x-1) \frac{λ^x}{x!} e^{-λ}
            = λ^2 e^{-λ} \sum_{x=2}^{\infty} \frac{λ^{x-2}}{(x-2)!}
            = λ^2 e^{-λ} \sum_{y=0}^{\infty} \frac{λ^y}{y!}   (putting y = x - 2)
            = λ^2.

So Var(X) = E\{X(X-1)\} + E(X) - (EX)^2 = λ^2 + λ - λ^2 = λ, as required.
3.6 Subjective modelling

Most of the distributions we have talked about in this chapter are exact models for the situation described. For example, the Binomial distribution describes exactly the distribution of the number of successes in n Bernoulli trials. However, there is often no exact model available. If so, we will use a subjective model. In a subjective model, we pick a probability distribution to describe a situation just because it has properties that we think are appropriate to the situation, such as the right sort of symmetry or skew, or the right sort of relationship between variance and mean.
Example: Distribution of word lengths for English words.

Let X = number of letters in an English word chosen at random from the dictionary. If we plot the frequencies on a barplot, we see that the shape of the distribution is roughly Poisson.
[Figure: barplot of word lengths from 25109 English words, with the fitted distribution X − 1 ~ Poisson(6.22) overlaid.]
The Poisson probabilities (with λ estimated by maximum likelihood) are plotted as points overlaying the barplot. We need to use X ~ 1 + Poisson because X cannot take the value 0. The fit of the Poisson distribution is quite good.
In this example we cannot say that the Poisson distribution represents the number of events in a fixed time or space: instead, it is being used as a subjective model.

Here are stroke counts from 13061 Chinese characters. X is the number of strokes in a randomly chosen character. The best-fitting Poisson distribution (found by MLE) is overlaid, and the fit is awful.

[Figure: barplot of stroke counts from 13061 Chinese characters, with the best-fitting Poisson probabilities overlaid.]

The best-fitting Negative Binomial distribution (found by MLE) is NegBin(k = 23.7, p = 0.64). The fit is very good. However, X does not represent the number of failures before the k-th success: the NegBin is a subjective model.

[Figure: the same barplot with the best-fitting NegBin(23.7, 0.64) probabilities overlaid.]
But suppose that X takes values in a continuous set, e.g. [0, ∞) or (0, 1). We can't even begin to list all the values that X can take. For example, how would you list all the numbers in the interval [0, 1]? The smallest number is 0, but what is the next smallest? 0.01? 0.0001? 0.0000000001? We just end up talking nonsense. In fact, there are so many numbers in any continuous set that each of them individually must have probability 0.
A continuous random variable takes values in a continuous interval (a, b). It describes a continuously varying quantity such as time or height. When X is continuous, P(X = x) = 0 for ALL x. The probability function is meaningless.
Although we cannot assign a probability to any single value of X, we are able to assign probabilities to intervals: e.g. P(X = 1) = 0, but P(0.999 ≤ X ≤ 1.001) can be > 0. This means we should use the distribution function, F_X(x) = P(X ≤ x).
The cumulative distribution function, F_X(x)

Recall that for discrete random variables:
F_X(x) = P(X ≤ x);
F_X(x) is a step function;
P(a < X ≤ b) = P(X ∈ (a, b]) = F(b) - F(a).

For a continuous random variable:
F_X(x) = P(X ≤ x);
F_X(x) is a continuous function;
as before, P(a < X ≤ b) = P(X ∈ (a, b]) = F(b) - F(a).

However, for a continuous random variable, P(X = a) = 0, so it makes no difference whether we write P(a < X ≤ b) or P(a ≤ X ≤ b). Endpoints are not important for continuous r.v.s. Endpoints are very important for discrete r.v.s.
4.2 The probability density function

Although the cumulative distribution function gives us an interval-based tool for dealing with continuous random variables, it is not very good at telling us what the distribution looks like. For this we use a different tool called the probability density function. The probability density function (p.d.f.) is the best way to describe and recognise a continuous random variable. We use it all the time to calculate probabilities and to gain an intuitive feel for the shape and nature of the distribution.

Using the p.d.f. is like recognising your friends by their faces. You can chat on the phone, write emails or send txts to each other all day, but you never really know a person until you've seen their face. Just like a cell-phone for keeping in touch, the cumulative distribution function is a tool for facilitating our interactions with the continuous random variable. However, we never really understand the random variable until we've seen its 'face', the probability density function. Surprisingly, it is quite difficult to describe exactly what the probability density function is. In this section we take some time to motivate and describe this fundamental idea.

All-time top-ten 100m sprint times

The histogram below shows the best 10 sprint times from the 168 all-time top male 100m sprinters. There are 1680 times in total, representing the top 10 times up to 2002 from each of the 168 sprinters. Out of interest, here are the summary statistics:

Min. 9.78; 1st Qu. 10.08; Median 10.15; Mean 10.14; 3rd Qu. 10.21; Max. 10.41.
[Figure: histograms of the 1680 sprint times (time, 9.8 to 10.4 seconds) using interval widths of 0.1s, 0.05s, 0.02s, and 0.01s.]
We see that each histogram has broadly the same shape, although the heights of the bars change every time we change the interval width.
We could fit a curve over any of these histograms to show the desired shape, but the problem is that the histograms are not standardized: every time we change the interval width, the heights of the bars change. How can we derive a curve or function that captures the common shape of the histograms, but keeps a constant height? What should that height be?

The standardized histogram

We now focus on an idealized (smooth) version of the sprint times distribution, rather than using the exact 1680 sprint times observed. We are aiming to derive a curve, or function, that captures the shape of the histograms, but will keep the same height for any choice of histogram bar width.

First idea: plot the probabilities instead of the frequencies.
The height of each histogram bar now represents the probability of getting an observation in that bar.
[Figure: histograms of the sprint times with bar heights equal to probabilities, for three different interval widths.]
This doesn't work, because the height (probability) still depends upon the bar width: narrower bars capture less probability, so the heights shrink as the bars narrow.

Second idea: plot the probabilities divided by the bar width. The height of each histogram bar now represents the probability of getting an observation in that bar, divided by the width of the bar.
[Figure: standardized histograms (probability divided by interval width) of the sprint times for interval widths 0.1s, 0.05s, 0.02s, and 0.01s, each with the same smooth curve overlaid.]
This seems to be exactly what we need! The same curve fits nicely over all the histograms and keeps the same height regardless of the bar width. These histograms are called standardized histograms. The nice-fitting curve is the probability density function. But... what is it?!
The probability density function

We have seen that there is a single curve that fits nicely over any standardized histogram from a given distribution. This curve is called the probability density function (p.d.f.). We will write the p.d.f. of a continuous random variable X as f_X(x).

The p.d.f. f_X(x) is NOT the probability of x: for example, in the sprint times we can have f_X(x) = 4, so it is definitely NOT a probability. However, as the histogram bars of the standardized histogram get narrower, the bars get closer and closer to the p.d.f. curve. The p.d.f. is in fact the limit of the standardized histogram as the bar width goes to 0.

The height of each bar of the standardized histogram at the point x is:

height = \frac{\text{probability}}{\text{interval width}} = \frac{P(x ≤ X ≤ x + t)}{t} = \frac{F_X(x + t) - F_X(x)}{t},

where F_X(x) is the cumulative distribution function. Now consider the limit as the histogram bar width t goes to 0: this limit is DEFINED TO BE the probability density function at x:

f_X(x) = \lim_{t \to 0} \frac{F_X(x + t) - F_X(x)}{t}   by definition.
It is defined to be a single, unchanging curve that describes the SHAPE of any histogram drawn from the distribution of X.
Formal definition of the probability density function

Definition: Let X be a continuous random variable with distribution function F_X(x). The probability density function (p.d.f.) of X is defined as

f_X(x) = \frac{dF_X}{dx} = F_X'(x).
It gives:
the RATE at which probability is accumulating at any given point, F_X'(x);
the SHAPE of the distribution of X.

Using the probability density function to calculate probabilities

As well as showing us the shape of the distribution of X, the probability density function has another major use: it calculates probabilities by integration. Suppose we want to calculate P(a ≤ X ≤ b). We already know that:

P(a ≤ X ≤ b) = F_X(b) - F_X(a).

But we also know that \frac{dF_X}{dx} = f_X(x), so

F_X(x) = \int f_X(x)\,dx   (without constants).

In fact:

F_X(b) - F_X(a) = \int_a^b f_X(x)\,dx.
This is a very important result:

Let X be a continuous random variable with probability density function f_X(x). Then

P(a ≤ X ≤ b) = P(X ∈ [a, b]) = \int_a^b f_X(x)\,dx.

It also gives us the total area under the p.d.f. curve:

total area = \int_{-\infty}^{\infty} f_X(x)\,dx = F_X(∞) - F_X(-∞) = 1 - 0 = 1.

This says that the total area under the p.d.f. curve is equal to the total probability that X takes a value between -∞ and +∞, which is 1.
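These two identities can be checked numerically. A Python sketch (not part of the original notes), using the illustrative p.d.f. f(x) = 2x on [0, 1], for which F(x) = x²:

```python
# Numerically integrate the p.d.f. f(x) = 2x on [0, 1] and compare with the
# c.d.f. F(x) = x^2. Simple midpoint rule; f is just an illustrative choice.
def f(x):
    return 2 * x

def integrate(g, a, b, n=100000):
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

# Total area under the p.d.f. is 1.
print(round(integrate(f, 0, 1), 6))       # 1.0

# P(0.2 <= X <= 0.5) = F(0.5) - F(0.2) = 0.25 - 0.04 = 0.21.
print(round(integrate(f, 0.2, 0.5), 6))   # 0.21
```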
Using the p.d.f. to calculate the distribution function, F_X(x)

Suppose we know the probability density function, f_X(x), and wish to calculate the distribution function, F_X(x). We use the following formula:

F_X(x) = \int_{-\infty}^{x} f_X(u)\,du.

Proof:

\int_{-\infty}^{x} f_X(u)\,du = F_X(x) - F_X(-∞) = F_X(x) - 0 = F_X(x) = P(X ≤ x).

Note the dummy variable u inside the integral: writing F_X(x) = \int_{-\infty}^{x} f_X(x)\,dx, with x as both the upper limit and the variable of integration, is nonsense!
Why do we need f_X(x)? Why not stick with F_X(x)?

These graphs show F_X(x) and f_X(x) from the men's 100m sprint times (X is a random top-ten 100m sprint time).

[Figure: F(x) and f(x) for the sprint times, for x between 9.8 and 10.4 seconds.]

Just using F_X(x) gives us very little intuition about the problem. For example, which is the region of highest probability? Using the p.d.f., f_X(x), we can see that it is about 10.1 to 10.2 seconds. Using the c.d.f., F_X(x), we would have to inspect the part of the curve with the steepest gradient.
Example: Let

f_X(x) = k e^{-2x}   for 0 < x < ∞, and f_X(x) = 0 otherwise.

(i) Find the constant k.
(ii) Find P(1 < X ≤ 3).
(iii) Find the cumulative distribution function, F_X(x), for all x.

(i) We need:

\int_{-\infty}^{\infty} f_X(x)\,dx = 1

\int_{-\infty}^{0} 0\,dx + \int_0^{\infty} k e^{-2x}\,dx = 1

k \left[ \frac{e^{-2x}}{-2} \right]_0^{\infty} = 1

-\frac{k}{2} \left( e^{-\infty} - e^{0} \right) = 1

-\frac{k}{2} (0 - 1) = 1

k = 2.
(ii)

P(1 < X ≤ 3) = \int_1^3 f_X(x)\,dx
             = \int_1^3 2 e^{-2x}\,dx
             = \left[ \frac{2 e^{-2x}}{-2} \right]_1^3
             = e^{-2} - e^{-6}
             = 0.133.
(iii) For x ≤ 0:

F_X(x) = \int_{-\infty}^{x} 0\,du = 0.

For x > 0:

F_X(x) = \int_{-\infty}^{x} f_X(u)\,du
       = \int_{-\infty}^{0} 0\,du + \int_0^x 2 e^{-2u}\,du
       = \left[ -e^{-2u} \right]_0^x
       = 1 - e^{-2x}   for x > 0.

So overall,

F_X(x) = 0 for x ≤ 0, and F_X(x) = 1 - e^{-2x} for x > 0.
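The worked example can be verified numerically; a Python sketch (not part of the original notes):

```python
from math import exp

# The worked example: f(x) = 2 e^{-2x} on (0, inf), F(x) = 1 - e^{-2x}.
def f(x):
    return 2 * exp(-2 * x)

def integrate(g, a, b, n=200000):
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

# (ii) P(1 < X <= 3) = e^{-2} - e^{-6} ≈ 0.133.
print(round(integrate(f, 1, 3), 4))   # 0.1329

# (iii) compare the integral of f with the closed-form F at x = 1.
print(round(integrate(f, 0, 1), 6))   # 0.864665 = 1 - e^{-2}
```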
Remember to check that the p.d.f. is valid: \int_{-\infty}^{\infty} f_X(x)\,dx = 1.

1. If you only need to calculate one probability P(a ≤ X ≤ b): integrate the p.d.f.:

P(a ≤ X ≤ b) = \int_a^b f_X(x)\,dx.

2. If you will need to calculate several probabilities, it is easiest to find the distribution function, F_X(x):

F_X(x) = \int_{-\infty}^{x} f_X(u)\,du.

Then use:

P(a ≤ X ≤ b) = F_X(b) - F_X(a)   for any a, b.

Endpoints: DO NOT MATTER for continuous random variables:

P(X ≤ a) = P(X < a)   and   P(X ≥ a) = P(X > a).
4.3 The Exponential distribution

When will the next volcano erupt in Auckland? We never quite answered this question in Chapter 3. The Poisson distribution was used to count the number of volcanoes that would occur in a fixed space of time, but it does not tell us how long we must wait for the next one: the waiting time is a continuous random variable.

To find the distribution of a continuous random variable, we often work with the cumulative distribution function, F_X(x). This is because F_X(x) = P(X ≤ x) gives us a probability, unlike the p.d.f. f_X(x). We are comfortable with handling and manipulating probabilities.

Suppose that {N_t : t > 0} forms a Poisson process with rate λ = 1/1000, where N_t is the number of volcanoes to have occurred by time t, starting from now. We know that N_t ~ Poisson(λt), so

P(N_t = n) = \frac{(λt)^n}{n!} e^{-λt}.
Let X be a continuous random variable giving the number of years waited before the next volcano, starting now. We will derive an expression for F_X(x).

(i) When x < 0:

F_X(x) = P(X ≤ x) = P(less than 0 time before next volcano) = 0.

(ii) When x ≥ 0:

F_X(x) = P(X ≤ x)
       = P(amount of time waited for next volcano is ≤ x)
       = P(there is at least one volcano between now and time x)
       = P(# volcanoes between now and time x is ≥ 1)
       = P(N_x ≥ 1)
       = 1 - P(N_x = 0)
       = 1 - \frac{(λx)^0}{0!} e^{-λx}
       = 1 - e^{-λx}.

Overall:

F_X(x) = P(X ≤ x) = 1 - e^{-λx} for x ≥ 0, and F_X(x) = 0 for x < 0.
The distribution of the waiting time X is called the Exponential distribution because of the exponential formula for F_X(x).

Example: What is the probability that there will be a volcanic eruption in Auckland within the next 50 years?

Put λ = 1/1000. We need:

P(X ≤ 50) = F_X(50) = 1 - e^{-50/1000} = 0.049.

There is about a 5% chance that there will be a volcanic eruption in Auckland over the next 50 years. This is the figure given by the Auckland Regional Council at the above web link (under Future Hazards).
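A one-line numeric check of the volcano example in Python (not part of the original notes):

```python
from math import exp

# Waiting time X ~ Exponential(lambda) with lambda = 1/1000 eruptions/year.
# F(x) = 1 - e^{-lambda x} for x >= 0.
lam = 1 / 1000

def F(x):
    return 1 - exp(-lam * x) if x >= 0 else 0.0

print(round(F(50), 3))   # 0.049 — about a 5% chance within 50 years
```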
The Exponential Distribution

We have defined the Exponential(λ) distribution to be the distribution of the waiting time (time between events) in a Poisson process with rate λ. We write X ~ Exponential(λ), or X ~ Exp(λ). However, just like the Poisson distribution, the Exponential distribution has many other applications: it does not always have to arise from a Poisson process.

Let X ~ Exponential(λ). Note: λ > 0 always.

Distribution function:

F_X(x) = P(X ≤ x) = 1 - e^{-λx} for x ≥ 0, and 0 for x < 0.

Probability density function:

f_X(x) = F_X'(x) = λ e^{-λx} for x ≥ 0, and 0 for x < 0.
Link with the Poisson process

Let {N_t : t > 0} be a Poisson process with rate λ. Then:
N_t is the number of events to occur by time t;
N_t ~ Poisson(λt), so P(N_t = n) = \frac{(λt)^n}{n!} e^{-λt};
define X to be either the time till the first event, or the time from now until the next event, or the time between any two events.

Then X ~ Exponential(λ). X is called the waiting time of the process.
Memorylessness

We have said that the waiting time of the Poisson process can be defined either as the time from the start to the first event, or the time from now until the next event, or the time between any two events.

[Cartoon: the Exponential distribution has a 'memory like a sieve' — START, NOW, FIRST EVENT.]

All of these quantities have the same distribution: X ~ Exponential(λ). The derivation of the Exponential distribution was valid for all of them, because events occur at a constant average rate in the Poisson process.

This property of the Exponential distribution is called memorylessness: the distribution of the time from now until the first event is the same as the distribution of the time from the start until the first event: the time already elapsed is forgotten.

The Exponential distribution is famous for this memoryless property: it is the only memoryless continuous distribution. For volcanoes, memorylessness means that the 600 years we have waited since the last eruption count for nothing: the distribution of the time still to wait is unchanged.
For private reading: proof of memorylessness

Let X ~ Exponential(λ) be the total time waited for an event. Let Y be the amount of extra time waited for the event, given that we have already waited time t (say). We wish to prove that Y has the same distribution as X, i.e. that the time t already waited has been 'forgotten'. This means we need to prove that Y ~ Exponential(λ).

Proof: We will work with F_Y(y) and prove that it is equal to 1 - e^{-λy}. This proves that Y is Exponential(λ) like X.

First note that X = t + Y, because X is the total time waited and Y is the time waited after time t. Also, we must condition on the event {X > t}, because we know that we have already waited time t. So

F_Y(y) = P(Y ≤ y) = P(X ≤ t + y | X > t)
       = \frac{P(X ≤ t + y \text{ AND } X > t)}{P(X > t)}   (definition of conditional probability)
       = \frac{P(t < X ≤ t + y)}{1 - P(X ≤ t)}
       = \frac{F_X(t + y) - F_X(t)}{1 - F_X(t)}
       = \frac{(1 - e^{-λ(t+y)}) - (1 - e^{-λt})}{1 - (1 - e^{-λt})}
       = \frac{e^{-λt} - e^{-λ(t+y)}}{e^{-λt}}
       = \frac{e^{-λt}(1 - e^{-λy})}{e^{-λt}}
       = 1 - e^{-λy}.

So Y ~ Exponential(λ), as required. Thus the conditional probability of waiting time y extra, given that we have already waited time t, is the same as the probability of waiting time y in total. The time t already waited is 'forgotten'.
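The memoryless identity P(X > t + y | X > t) = P(X > y) can be confirmed numerically; a Python sketch (not part of the original notes; the values of λ, t, y are arbitrary):

```python
from math import exp

# Memorylessness check for X ~ Exponential(lambda):
# P(X > t + y | X > t) should equal P(X > y) for any t.
lam, t, y = 0.5, 3.0, 2.0

def surv(x):                    # survival function P(X > x) = e^{-lambda x}
    return exp(-lam * x)

cond = surv(t + y) / surv(t)    # P(X > t + y | X > t)
print(round(cond, 10))
print(round(surv(y), 10))       # identical: e^{-lambda y}
```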
4.4 Likelihood and estimation for continuous random variables

For discrete random variables, we found the likelihood using the probability function, f_X(x) = P(X = x). For continuous random variables, we find the likelihood using the probability density function, f_X(x) = \frac{dF_X}{dx}. Although the notation f_X(x) means something different for continuous and discrete random variables, it is used in exactly the same way for likelihood and estimation.
Note: Both discrete and continuous r.v.s have the same definition for the cumulative distribution function: F_X(x) = P(X ≤ x).

Example: Exponential likelihood
Suppose that: X ~ Exponential(λ); λ is unknown; the observed value of X is x. Then the likelihood function is:

L(λ ; x) = f_X(x) = λ e^{-λx}   for 0 < λ < ∞.
We estimate λ by setting \frac{dL}{dλ} = 0 to find the MLE, \hat{λ}.
Two or more independent observations

Suppose that X1, ..., Xn are continuous random variables such that:
X1, ..., Xn are INDEPENDENT;
all the Xi's have the same p.d.f., f_X(x).

Then the likelihood is f_X(x1) f_X(x2) ... f_X(xn).
Example: Suppose that X1, X2, ..., Xn are independent, and Xi ~ Exponential(λ) for all i. Find the maximum likelihood estimate of λ.

[Figure: likelihood graph for λ = 2 and n = 10, with x1, ..., x10 generated by the R command rexp(10, 2).]

Solution:

L(λ ; x1, ..., xn) = \prod_{i=1}^{n} f_X(x_i) = \prod_{i=1}^{n} λ e^{-λ x_i} = λ^n e^{-λ \sum_{i=1}^{n} x_i}.

Define \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, so that \sum_{i=1}^{n} x_i = n\bar{x}. Thus

L(λ ; x1, ..., xn) = λ^n e^{-λ n \bar{x}}   for 0 < λ < ∞.

Solve \frac{dL}{dλ} = 0 to find the MLE:

\frac{dL}{dλ} = n λ^{n-1} e^{-λ n \bar{x}} (1 - λ \bar{x}) = 0.

The MLE of λ is \hat{λ} = \frac{1}{\bar{x}}.
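The figure's setup can be mimicked numerically in Python (not part of the original notes; the seed and grid limits are arbitrary choices):

```python
import random
from math import log

# Simulate n = 10 draws from Exponential(lambda = 2), mirroring the R
# command rexp(10, 2), then check that the log-likelihood
# n*log(lam) - lam*n*xbar is maximized at lambda-hat = 1/xbar.
random.seed(1)
n, lam_true = 10, 2.0
xs = [random.expovariate(lam_true) for _ in range(n)]
xbar = sum(xs) / n

def loglik(lam):
    return n * log(lam) - lam * n * xbar

grid = [i / 1000 for i in range(1, 20001)]   # lambda in (0, 20]
lam_grid = max(grid, key=loglik)

print(abs(lam_grid - 1 / xbar) < 2e-3)   # True: MLE is 1/xbar
```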
4.5 Hypothesis tests

Hypothesis tests for continuous random variables are just like hypothesis tests for discrete random variables. The only difference is: endpoints matter for discrete random variables, but not for continuous random variables.
Example: discrete. Suppose H0: X ~ Binomial(n = 10, p = 0.5), and we have observed the value x = 7. Then the upper-tail p-value is

P(X ≥ 7) = 1 - P(X ≤ 6) = 1 - F_X(6).

Example: continuous. Suppose H0: X ~ Exponential(2), and we have observed the value x = 7. Then the upper-tail p-value is

P(X ≥ 7) = 1 - P(X < 7) = 1 - P(X ≤ 7) = 1 - F_X(7).

Other than this trap, the procedure for hypothesis testing is the same: Use H0 to specify the distribution of X completely, and offer a one-tailed or two-tailed alternative hypothesis H1. Find the one-tailed or two-tailed p-value as the probability of seeing an observation at least as weird as what we have seen, if H0 is true. That is, find the probability under the distribution specified by H0 of seeing an observation further out in the tails than the value x that we have seen.

Example with the Exponential distribution

A very very old person observes that the waiting time from Rangitoto to the next volcanic eruption in Auckland is 1500 years. Test the hypothesis that λ = 1/1000 against the one-sided alternative that λ < 1/1000.
Note: If λ < 1/1000, we would expect to see BIGGER values of X, NOT smaller. This is because X is the time between volcanoes, and λ is the rate at which volcanoes occur. A smaller value of λ means volcanoes occur less often, so the time X between them is BIGGER.
Hypotheses: H0: λ = 1/1000 versus H1: λ < 1/1000 (one-tailed test).

Make observation: x = 1500 years.

Values weirder than x = 1500 years: all values BIGGER than x = 1500.

p-value: P(X ≥ 1500) when X ~ Exponential(λ = 1/1000). So

p-value = P(X ≥ 1500)
        = 1 - P(X ≤ 1500)
        = 1 - F_X(1500)   when X ~ Exponential(λ = 1/1000)
        = 1 - (1 - e^{-1500/1000})
        = e^{-1.5}
        = 0.223.
Interpretation: There is no evidence against H0. The observation x = 1500 years is consistent with the hypothesis that λ = 1/1000, i.e. that volcanoes erupt once every 1000 years on average.
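The p-value computation in Python (not part of the original notes):

```python
from math import exp

# Upper-tail p-value for the volcano test: H0: lambda = 1/1000, observed
# waiting time x = 1500 years. p = P(X >= 1500) = e^{-lambda x}.
lam, x = 1 / 1000, 1500
p_value = exp(-lam * x)
print(round(p_value, 3))   # 0.223 — no evidence against H0
```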
[Figure: p.d.f. f(x) of the Exponential(λ = 1/1000) distribution for x from 0 to 5000 years.]
4.6 Expectation and variance

Remember that the expectation of a discrete random variable is the long-term average:

μ_X = E(X) = \sum_x x P(X = x) = \sum_x x f_X(x).

(For each value x, we add in the value and multiply by the proportion of times we would expect to see that value: P(X = x).)

For a continuous random variable, replace the probability function with the probability density function, and replace the sum by an integral:

μ_X = E(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx.

Note: There exists no concept of a probability function f_X(x) = P(X = x) for continuous random variables. In fact, if X is continuous, then P(X = x) = 0 for all x.

The idea behind expectation is the same for both discrete and continuous random variables. E(X) is:
the long-term average of X;
a sum of values multiplied by how common they are: \sum x f(x) or \int x f(x)\,dx.

Expectation is also the balance point of f_X(x) for both continuous and discrete X. Imagine f_X(x) cut out of cardboard and balanced on a pencil.
The discrete and continuous formulas correspond directly:

Discrete: $E(X) = \sum_x x\, f_X(x)$ and $E(g(X)) = \sum_x g(x)\, f_X(x)$.

Continuous: $E(X) = \int_{-\infty}^{\infty} x\, f_X(x)\, dx$ and $E(g(X)) = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx$.
Variance

If X is continuous, its variance is defined in exactly the same way as for a discrete random variable:

$$\mathrm{Var}(X) = \sigma_X^2 = E\big[(X - \mu_X)^2\big] = E(X^2) - \mu_X^2 = E(X^2) - (EX)^2.$$

For a continuous random variable, we can either compute the variance using

$$\mathrm{Var}(X) = E\big[(X - \mu_X)^2\big] = \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x)\, dx,$$

or

$$\mathrm{Var}(X) = E(X^2) - (EX)^2 = \int_{-\infty}^{\infty} x^2 f_X(x)\, dx - (EX)^2.$$
Properties of expectation and variance

All properties of expectation and variance are exactly the same for continuous and discrete random variables. In particular, when X and Y are INDEPENDENT:

$$E(XY) = E(X)\,E(Y) \quad \text{when } X, Y \text{ independent};$$
$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) \quad \text{when } X, Y \text{ independent}.$$
4.7 Exponential distribution mean and variance

When $X \sim \mathrm{Exponential}(\lambda)$:

$$E(X) = \frac{1}{\lambda} \quad \text{and} \quad \mathrm{Var}(X) = \frac{1}{\lambda^2}.$$

Note: If X is the waiting time for a Poisson process with rate $\lambda$ events per year (say), it makes sense that $E(X) = \frac{1}{\lambda}$. For example, if $\lambda = 4$ events per hour, the average time waited between events is $\frac{1}{4}$ hour.
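Both formulas are easy to verify by simulation. A minimal sketch in Python (an illustration, not part of the notes), using $\lambda = 4$ as in the note above:

```python
import random
import statistics

# Monte Carlo check of E(X) = 1/lambda and Var(X) = 1/lambda^2
# for X ~ Exponential(lambda), with lambda = 4 events per hour.
random.seed(1)
lam = 4.0
sample = [random.expovariate(lam) for _ in range(200_000)]
print(statistics.mean(sample))      # close to 1/4
print(statistics.variance(sample))  # close to 1/16
```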
Proof:

$$E(X) = \int_{-\infty}^{\infty} x f_X(x)\, dx = \int_0^{\infty} x \lambda e^{-\lambda x}\, dx.$$

Recall integration by parts: $\int u \frac{dv}{dx}\, dx = uv - \int v \frac{du}{dx}\, dx$.

Let $u = x$, so $\frac{du}{dx} = 1$, and let $\frac{dv}{dx} = \lambda e^{-\lambda x}$, so $v = -e^{-\lambda x}$. Then

$$E(X) = \int_0^{\infty} x \lambda e^{-\lambda x}\, dx = \Big[-x e^{-\lambda x}\Big]_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\, dx = 0 + \Big[-\tfrac{1}{\lambda} e^{-\lambda x}\Big]_0^{\infty} = \frac{1}{\lambda} e^{0} = \frac{1}{\lambda}.$$

So $E(X) = \frac{1}{\lambda}$.

For the variance, first find $E(X^2)$:

$$E(X^2) = \int_{-\infty}^{\infty} x^2 f_X(x)\, dx = \int_0^{\infty} x^2 \lambda e^{-\lambda x}\, dx.$$

Integrating by parts again, with $u = x^2$ and $\frac{dv}{dx} = \lambda e^{-\lambda x}$, so $v = -e^{-\lambda x}$:

$$E(X^2) = \Big[-x^2 e^{-\lambda x}\Big]_0^{\infty} + \int_0^{\infty} 2x e^{-\lambda x}\, dx = 0 + \frac{2}{\lambda} \int_0^{\infty} x \lambda e^{-\lambda x}\, dx = \frac{2}{\lambda} E(X) = \frac{2}{\lambda^2}.$$

Thus

$$\mathrm{Var}(X) = E(X^2) - (EX)^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}.$$
Interlude: Guess the Mean, Median, and Variance

For any distribution:
- the mean is the average that would be obtained if a large number of observations were drawn from the distribution;
- the median is the half-way point of the distribution: every observation has a 50-50 chance of being above the median or below the median;
- the variance is the average squared distance of an observation from the mean.

Given the probability density function of a distribution, we should be able to guess roughly the distribution mean, median, and variance . . . but it isn't easy! Have a go at the examples below. As a hint:
- the mean is the balance-point of the distribution. Imagine that the p.d.f. is made of cardboard and balanced on a rod. The mean is the point where the rod would have to be placed for the cardboard to balance.
- the median is the half-way point, so it divides the p.d.f. into two equal areas of 0.5 each.
- the variance is the average squared distance of observations from the mean; so to get a rough guess (not exact), it is easiest to guess an average distance from the mean and square it.
[Figure: p.d.f. $f(x)$ of a right-skewed distribution on $0 \le x \le 300$. Guess the mean, median, and variance.]
Answers:

[Figure: the same p.d.f., with the mean, median, and variance marked.]
Notes: The mean is larger than the median. This always happens when the distribution has a long right tail (positive skew) like this one. The variance is huge . . . but when you look at the numbers along the horizontal axis, it is quite believable that the average squared distance of an observation from the mean is 1182. Out of interest, the distribution shown is a Lognormal distribution. Example 2: Try the same again with the example below. Answers are written below the graph.
[Figure: p.d.f. $f(x)$ for Example 2.]
4.8 The Uniform distribution

X has a Uniform distribution on the interval [a, b] if X is equally likely to fall anywhere in the interval [a, b]. We write $X \sim \mathrm{Uniform}[a, b]$, or $X \sim U[a, b]$. Equivalently, $X \sim \mathrm{Uniform}(a, b)$, or $X \sim U(a, b)$.

Probability density function, $f_X(x)$:

$$f_X(x) = \begin{cases} \dfrac{1}{b - a} & \text{if } a \le x \le b, \\ 0 & \text{otherwise.} \end{cases}$$

[Figure: $f_X(x)$ is flat at height $\frac{1}{b-a}$ between $a$ and $b$, and 0 elsewhere.]
Distribution function, $F_X(x)$:

$$F_X(x) = \int_{-\infty}^{x} f_Y(y)\, dy = \int_a^x \frac{1}{b - a}\, dy = \Big[\frac{y}{b - a}\Big]_a^x = \frac{x - a}{b - a} \quad \text{if } a \le x \le b.$$

[Figure: $F_X(x)$ rises linearly from 0 at $x = a$ to 1 at $x = b$.]

Thus

$$F_X(x) = \begin{cases} 0 & \text{if } x < a, \\ \dfrac{x - a}{b - a} & \text{if } a \le x \le b, \\ 1 & \text{if } x > b. \end{cases}$$
Mean and variance: if $X \sim \mathrm{Uniform}[a, b]$,

$$E(X) = \frac{a + b}{2} \quad \text{and} \quad \mathrm{Var}(X) = \frac{(b - a)^2}{12}.$$

Proof:

$$E(X) = \int_a^b x f(x)\, dx = \int_a^b \frac{x}{b - a}\, dx = \frac{1}{b - a} \cdot \frac{1}{2}(b^2 - a^2) = \frac{1}{b - a} \cdot \frac{1}{2}(b - a)(b + a) = \frac{a + b}{2}.$$

$$\mathrm{Var}(X) = E\big[(X - \mu_X)^2\big] = \int_a^b \frac{(x - \mu_X)^2}{b - a}\, dx = \frac{1}{b - a} \Big[\frac{(x - \mu_X)^3}{3}\Big]_a^b = \frac{1}{b - a} \cdot \frac{(b - \mu_X)^3 - (a - \mu_X)^3}{3}.$$

Now $\mu_X = E(X) = \frac{a + b}{2}$, so $b - \mu_X = \frac{b - a}{2}$ and $a - \mu_X = \frac{a - b}{2}$. Thus

$$\mathrm{Var}(X) = \frac{1}{b - a} \cdot \frac{1}{3}\left(\frac{(b - a)^3}{2^3} - \frac{(a - b)^3}{2^3}\right) = \frac{1}{b - a} \cdot \frac{2(b - a)^3}{24} = \frac{(b - a)^2}{12}.$$

Example: if $X \sim \mathrm{Uniform}[0, 1]$, then $\mu_X = E(X) = \frac{1}{2}$ and $\sigma_X^2 = \mathrm{Var}(X) = \frac{1}{12}(1 - 0)^2 = \frac{1}{12}$.
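A quick simulation check of these two formulas (a Python illustration, not part of the notes), with a = 3 and b = 8 chosen arbitrarily:

```python
import random
import statistics

# Check E(X) = (a+b)/2 and Var(X) = (b-a)^2/12 for X ~ Uniform[a, b].
random.seed(2)
a, b = 3.0, 8.0
sample = [random.uniform(a, b) for _ in range(200_000)]
print(statistics.mean(sample))      # close to (3+8)/2 = 5.5
print(statistics.variance(sample))  # close to 25/12 = 2.083...
```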
4.9 The Change of Variable Technique: finding the distribution of g(X)

Let X be a continuous random variable. Suppose the p.d.f. of X, $f_X(x)$, is known; we wish to find the p.d.f. of $Y = g(X)$. We use the Change of Variable technique.

Example: Let $X \sim \mathrm{Uniform}(0, 1)$, and let $Y = -\log(X)$. The p.d.f. of X is $f_X(x) = 1$ for $0 < x < 1$. What is the p.d.f. of Y, $f_Y(y)$?

Change of variable technique for monotone functions

Suppose that $g(X)$ is a monotone function $\mathbb{R} \to \mathbb{R}$. This means that g is an increasing function, or g is a decreasing function. When g is monotone, it is invertible, or (1-1) (one-to-one). That is, for every y there is a unique x such that $g(x) = y$. This means that the inverse function, $g^{-1}(y)$, is well-defined as a function for a certain range of y. When $g : \mathbb{R} \to \mathbb{R}$, as it is here, then g can only be (1-1) if it is monotone.
[Figure: left panel, the function $y = g(x) = x^2$ plotted for $0 \le x \le 1$; right panel, its inverse $x = g^{-1}(y) = \sqrt{y}$.]
Change of Variable formula

Let $g : \mathbb{R} \to \mathbb{R}$ be a monotone function and let $Y = g(X)$. Then the p.d.f. of $Y = g(X)$ is

$$f_Y(y) = f_X\big(g^{-1}(y)\big) \left|\frac{d}{dy} g^{-1}(y)\right|.$$

Write $x = x(y) = g^{-1}(y)$. Then the formula reads $f_Y(y) = f_X(x(y)) \left|\frac{dx}{dy}\right|$.

Working for change of variable questions:
1) Show you have checked that $g(x)$ is monotone over the required range.
2) Write $y = y(x)$ for x in <range of x>, e.g. for $a < x < b$.
3) So $x = x(y)$ for y in <range of y>: for $y(a) < y(x) < y(b)$ if y is increasing; for $y(a) > y(x) > y(b)$ if y is decreasing.
4) Then $\frac{dx}{dy}$ = <expression involving y>.
5) So $f_Y(y) = f_X(x(y)) \left|\frac{dx}{dy}\right|$ by the Change of Variable formula, $= \ldots$. Quote the range of values of y as part of the FINAL answer.
Refer back to the question to find $f_X(x)$: you often have to deduce this from information like $X \sim \mathrm{Uniform}(0, 1)$ or $X \sim \mathrm{Exponential}(\lambda)$. Or it may be given explicitly.

Note: There should be no x's left in the answer! $x(y)$ and $\frac{dx}{dy}$ are expressions involving y only.
Example 1: Let $X \sim \mathrm{Uniform}(0, 1)$, and let $Y = -\log(X)$. Find the p.d.f. of Y.

[Figure: $y = -\log(x)$ plotted for $0 < x < 1$: the curve decreases from $\infty$ towards 0.]

1) $y(x) = -\log(x)$ is monotone decreasing, so we can apply the Change of Variable formula.
2) $y = -\log(x)$ for $0 < x < 1$.
3) So $x = x(y) = e^{-y}$ for $-\log(0) > y > -\log(1)$, i.e. for $0 < y < \infty$.
4) Then $\frac{dx}{dy} = -e^{-y}$, so $\left|\frac{dx}{dy}\right| = e^{-y}$.
5) So $f_Y(y) = f_X(x(y)) \left|\frac{dx}{dy}\right| = f_X(e^{-y})\, e^{-y}$.

But $X \sim \mathrm{Uniform}(0, 1)$, so $f_X(x) = 1$ for $0 < x < 1$, and hence $f_X(e^{-y}) = 1$ for $0 < y < \infty$. Thus

$$f_Y(y) = f_X(e^{-y})\, e^{-y} = e^{-y} \quad \text{for } 0 < y < \infty.$$

So $Y \sim \mathrm{Exponential}(1)$.
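This result is easy to check by simulation. A minimal Python sketch (an illustration, not part of the notes): if $Y = -\log(X)$ with $X \sim \mathrm{Uniform}(0,1)$, then Y should behave like Exponential(1), with $E(Y) = 1$ and $P(Y \le 1) = 1 - e^{-1} \approx 0.632$.

```python
import math
import random
import statistics

# Y = -log(X) for X ~ Uniform(0,1) should be Exponential(1).
random.seed(3)
ys = [-math.log(random.random()) for _ in range(200_000)]
print(statistics.mean(ys))                # close to 1
print(sum(y <= 1 for y in ys) / len(ys))  # close to 1 - e^{-1} = 0.632...
```

This is in fact a standard way of generating Exponential random numbers from Uniform ones.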
Note: In change of variable questions, you lose a mark for:
1. not stating that g(x) is monotone over the required range of x;
2. not giving the range of y for which the result holds, as part of the FINAL answer (e.g. $f_Y(y) = \ldots$ for $0 < y < \infty$).
Example 2: Let X have p.d.f. $f_X(x) = \frac{x^3}{4}$ for $0 < x < 2$, and let $Y = 1/X$. Find the p.d.f. of Y.

The function $y(x) = 1/x$ is monotone decreasing for $0 < x < 2$, so we can apply the Change of Variable formula.

Let $y = y(x) = 1/x$ for $0 < x < 2$. Then $x = x(y) = 1/y$ for $\frac{1}{2} < y < \infty$, and

$$\left|\frac{dx}{dy}\right| = |-y^{-2}| = \frac{1}{y^2}.$$

So

$$f_Y(y) = f_X(x(y)) \left|\frac{dx}{dy}\right| = \frac{(x(y))^3}{4}\left|\frac{dx}{dy}\right| = \frac{1}{4}\Big(\frac{1}{y}\Big)^3 \frac{1}{y^2} = \frac{1}{4y^5} \quad \text{for } \frac{1}{2} < y < \infty.$$

Thus

$$f_Y(y) = \begin{cases} \dfrac{1}{4y^5} & \text{for } \frac{1}{2} < y < \infty, \\ 0 & \text{otherwise.} \end{cases}$$
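The answer can be checked by simulation (a Python illustration, not part of the notes). Sampling X by inverse-CDF, using $F_X(x) = x^4/16$ so $x = 2u^{1/4}$, and integrating $f_Y$ gives $F_Y(y) = 1 - \frac{1}{16 y^4}$ for $y > \frac{1}{2}$; in particular $P(Y \le 1) = 1 - \frac{1}{16} = 0.9375$.

```python
import random

# Check f_Y(y) = 1/(4 y^5) for Y = 1/X when f_X(x) = x^3/4 on (0, 2).
# Inverse-CDF sampling of X: F_X(x) = x^4/16, so X = 2 * U^{1/4}.
random.seed(4)
ys = [1 / (2 * random.random() ** 0.25) for _ in range(200_000)]
print(sum(y <= 1 for y in ys) / len(ys))  # close to 0.9375
```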
For mathematicians: proof of the change of variable formula

Separate into cases where g is increasing and where g is decreasing.

i) g increasing. g is increasing if $u < w \iff g(u) < g(w)$. Note that putting $u = g^{-1}(x)$ and $w = g^{-1}(y)$, we obtain

$$g^{-1}(x) < g^{-1}(y) \iff g(g^{-1}(x)) < g(g^{-1}(y)) \iff x < y,$$

so $g^{-1}$ is also increasing. Now

$$F_Y(y) = P(Y \le y) = P(g(X) \le y) = P(X \le g^{-1}(y)) = F_X(g^{-1}(y)).$$

Differentiating,

$$f_Y(y) = \frac{d}{dy} F_X(g^{-1}(y)) = f_X(g^{-1}(y)) \frac{d}{dy} g^{-1}(y).$$

Because $g^{-1}$ is increasing (by the above), $\frac{d}{dy} g^{-1}(y) > 0$, and thus $f_Y(y) = f_X(g^{-1}(y)) \left|\frac{d}{dy} g^{-1}(y)\right|$ as required.

ii) g decreasing, i.e. $u > w \iff g(u) < g(w)$. (Putting $u = g^{-1}(x)$ and $w = g^{-1}(y)$ gives $g^{-1}(x) > g^{-1}(y) \iff x < y$, so $g^{-1}$ is also decreasing.) In this case

$$F_Y(y) = P(g(X) \le y) = P(X \ge g^{-1}(y)) = 1 - F_X(g^{-1}(y)),$$

so

$$f_Y(y) = -f_X(g^{-1}(y)) \frac{d}{dy} g^{-1}(y) = f_X(g^{-1}(y)) \left|\frac{d}{dy} g^{-1}(y)\right|,$$

because $g^{-1}$ decreasing means $\frac{d}{dy} g^{-1}(y) < 0$.
4.10 Change of variable for non-monotone functions: non-examinable

Suppose that $Y = g(X)$ and g is not monotone. We wish to find the p.d.f. of Y. We can sometimes do this by using the distribution function directly.

Example: Let X have any distribution, with distribution function $F_X(x)$. Let $Y = X^2$. Find the p.d.f. of Y.

For $y \ge 0$,

$$F_Y(y) = P(Y \le y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y}).$$

So

$$F_Y(y) = \begin{cases} 0 & \text{if } y < 0, \\ F_X(\sqrt{y}) - F_X(-\sqrt{y}) & \text{if } y \ge 0. \end{cases}$$

So the p.d.f. of Y is

$$f_Y(y) = \frac{d}{dy}F_Y(y) = \frac{d}{dy}F_X(\sqrt{y}) - \frac{d}{dy}F_X(-\sqrt{y}) = \frac{1}{2}y^{-1/2} f_X(\sqrt{y}) + \frac{1}{2}y^{-1/2} f_X(-\sqrt{y}) = \frac{1}{2\sqrt{y}}\Big(f_X(\sqrt{y}) + f_X(-\sqrt{y})\Big) \quad \text{for } y \ge 0.$$
Example: Let $X \sim \mathrm{Normal}(0, 1)$. This is the familiar bell-shaped distribution (see later). The p.d.f. of X is:

$$f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.$$

Find the p.d.f. of $Y = X^2$. Using the result above, and noting that $f_X(\sqrt{y}) = f_X(-\sqrt{y})$ by symmetry,

$$f_Y(y) = \frac{1}{2\sqrt{y}} \cdot 2 f_X(\sqrt{y}) = \frac{1}{\sqrt{2\pi y}} e^{-y/2} \quad \text{for } y \ge 0.$$

This is in fact the Chi-squared distribution with $\nu = 1$ degrees of freedom. The Chi-squared distribution is a special case of the Gamma distribution (see next section). This example has shown that if $X \sim \mathrm{Normal}(0, 1)$, then $Y = X^2 \sim$ Chi-squared(df = 1).
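A simulation check (Python illustration, not part of the notes): the Chi-squared(df = 1) distribution has mean 1 and variance 2, so squares of standard Normal draws should show those moments.

```python
import random
import statistics

# If X ~ Normal(0,1), then Y = X^2 ~ Chi-squared(df=1):
# E(Y) = 1 and Var(Y) = 2.
random.seed(5)
ys = [random.gauss(0, 1) ** 2 for _ in range(200_000)]
print(statistics.mean(ys))      # close to 1
print(statistics.variance(ys))  # close to 2
```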
4.11 The Gamma distribution

The Gamma($k$, $\lambda$) distribution is a very flexible family of distributions. It is defined as the sum of k independent Exponential($\lambda$) r.v.s.

For $X \sim \mathrm{Gamma}(k, \lambda)$,

$$f_X(x) = \begin{cases} \dfrac{\lambda^k}{\Gamma(k)}\, x^{k-1} e^{-\lambda x} & \text{if } x \ge 0, \\ 0 & \text{otherwise.} \end{cases}$$

Here $\Gamma(k)$, called the Gamma function of k, is a constant that ensures $f_X(x)$ integrates to 1, i.e. $\int_0^{\infty} f_X(x)\, dx = 1$. It is defined as

$$\Gamma(k) = \int_0^{\infty} y^{k-1} e^{-y}\, dy.$$

When k is an integer, $\Gamma(k) = (k - 1)!$

Mean and variance of the Gamma distribution: for $X \sim \mathrm{Gamma}(k, \lambda)$,

$$E(X) = \frac{k}{\lambda} \quad \text{and} \quad \mathrm{Var}(X) = \frac{k}{\lambda^2}.$$

Relationship with the Chi-squared distribution

The Chi-squared distribution with $\nu$ degrees of freedom, $\chi^2_\nu$, is a special case of the Gamma distribution:

$$\chi^2_\nu = \mathrm{Gamma}\Big(k = \frac{\nu}{2},\ \lambda = \frac{1}{2}\Big).$$

So if $Y \sim \chi^2_\nu$, then $E(Y) = \frac{k}{\lambda} = \nu$, and $\mathrm{Var}(Y) = \frac{k}{\lambda^2} = 2\nu$.
Gamma p.d.f.s

[Figure: Gamma p.d.f.s for k = 1, k = 2, and k = 5.]

Notice: right skew (long right tail); flexibility in shape controlled by the 2 parameters, k and $\lambda$.

Distribution function, $F_X(x)$

There is no closed form for the distribution function of the Gamma distribution. If $X \sim \mathrm{Gamma}(k, \lambda)$, then $F_X(x)$ can only be calculated by computer.
Proof that $E(X) = \frac{k}{\lambda}$ and $\mathrm{Var}(X) = \frac{k}{\lambda^2}$ (non-examinable)

$$E(X) = \int_0^{\infty} x f_X(x)\, dx = \int_0^{\infty} \frac{\lambda^k}{\Gamma(k)}\, x^k e^{-\lambda x}\, dx.$$

Letting $y = \lambda x$, so $\frac{dx}{dy} = \frac{1}{\lambda}$:

$$E(X) = \frac{1}{\lambda\, \Gamma(k)} \int_0^{\infty} y^k e^{-y}\, dy = \frac{\Gamma(k+1)}{\lambda\, \Gamma(k)} = \frac{k\, \Gamma(k)}{\lambda\, \Gamma(k)} = \frac{k}{\lambda},$$

using the Gamma function property $\Gamma(k+1) = k\, \Gamma(k)$. Similarly, with the same substitution $y = \lambda x$,

$$E(X^2) = \int_0^{\infty} x^2 f_X(x)\, dx = \int_0^{\infty} \frac{\lambda^k}{\Gamma(k)}\, x^{k+1} e^{-\lambda x}\, dx = \frac{1}{\lambda^2\, \Gamma(k)} \int_0^{\infty} y^{k+1} e^{-y}\, dy = \frac{\Gamma(k+2)}{\lambda^2\, \Gamma(k)} = \frac{(k+1)k}{\lambda^2}.$$

So

$$\mathrm{Var}(X) = E(X^2) - (EX)^2 = \frac{k(k+1)}{\lambda^2} - \frac{k^2}{\lambda^2} = \frac{k}{\lambda^2}.$$
Gamma distribution arising from the Poisson process

Recall that the waiting time between events in a Poisson process with rate $\lambda$ has the Exponential($\lambda$) distribution. That is, if $X_i$ = time waited between event $i - 1$ and event $i$, then $X_i \sim \mathrm{Exponential}(\lambda)$. The time waited from time 0 to the time of the kth event is $X_1 + X_2 + \ldots + X_k$, the sum of k independent Exponential($\lambda$) r.v.s. Thus the time waited until the kth event in a Poisson process with rate $\lambda$ has the Gamma($k$, $\lambda$) distribution.

Note: There are some similarities between the Exponential($\lambda$) distribution and the (discrete) Geometric(p) distribution. Both distributions describe the waiting time before an event. In the same way, the Gamma($k$, $\lambda$) distribution is similar to the (discrete) Negative Binomial(k, p) distribution, as they both describe the waiting time before the kth event.
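The sum-of-Exponentials definition gives a direct simulation check of the Gamma mean and variance (a Python illustration, not part of the notes), here with k = 5 and $\lambda = 2$:

```python
import random
import statistics

# Sum of k independent Exponential(lambda) r.v.s ~ Gamma(k, lambda),
# with E(X) = k/lambda and Var(X) = k/lambda^2.
random.seed(6)
k, lam = 5, 2.0
sample = [sum(random.expovariate(lam) for _ in range(k)) for _ in range(100_000)]
print(statistics.mean(sample))      # close to k/lambda = 2.5
print(statistics.variance(sample))  # close to k/lambda^2 = 1.25
```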
4.12 The Beta Distribution: non-examinable

The Beta distribution has two parameters, $\alpha$ and $\beta$. We write $X \sim \mathrm{Beta}(\alpha, \beta)$.

P.d.f.:

$$f(x) = \begin{cases} \dfrac{1}{B(\alpha, \beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1} & \text{for } 0 < x < 1, \\ 0 & \text{otherwise.} \end{cases}$$

The function $B(\alpha, \beta)$ is the Beta function and is defined by the integral

$$B(\alpha, \beta) = \int_0^1 x^{\alpha - 1} (1 - x)^{\beta - 1}\, dx = \frac{\Gamma(\alpha)\, \Gamma(\beta)}{\Gamma(\alpha + \beta)}.$$
The Normal distribution is the familiar bell-shaped distribution. It is probably the most important distribution in statistics, mainly because of its link with the Central Limit Theorem, which states that any large sum of independent random variables tends towards the Normal distribution.

5.1 The Normal Distribution

The Normal distribution has two parameters, the mean, $\mu$, and the variance, $\sigma^2$. $\mu$ and $\sigma^2$ satisfy $-\infty < \mu < \infty$ and $\sigma^2 > 0$. We write $X \sim \mathrm{Normal}(\mu, \sigma^2)$, or $X \sim N(\mu, \sigma^2)$.

Probability density function, $f_X(x)$:

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x - \mu)^2 / (2\sigma^2)} \quad \text{for } -\infty < x < \infty.$$

Distribution function, $F_X(x)$

There is no closed form for the distribution function of the Normal distribution. If $X \sim \mathrm{Normal}(\mu, \sigma^2)$, then $F_X(x)$ can only be calculated by computer.

R command: $F_X(x)$ = pnorm(x, mean=$\mu$, sd=sqrt($\sigma^2$)).
Mean and Variance

For $X \sim \mathrm{Normal}(\mu, \sigma^2)$, $E(X) = \mu$ and $\mathrm{Var}(X) = \sigma^2$.

Linear transformations

If $X \sim \mathrm{Normal}(\mu, \sigma^2)$, then for any constants a and b,

$$aX + b \sim \mathrm{Normal}\big(a\mu + b,\ a^2\sigma^2\big).$$

In particular,

$$\frac{X - \mu}{\sigma} \sim \mathrm{Normal}(0, 1).$$
Proof: Let $a = \frac{1}{\sigma}$ and $b = -\frac{\mu}{\sigma}$, and let

$$Z = aX + b = \frac{X - \mu}{\sigma}.$$

Then

$$Z \sim \mathrm{Normal}(a\mu + b,\ a^2\sigma^2) = \mathrm{Normal}\Big(\frac{\mu}{\sigma} - \frac{\mu}{\sigma},\ \frac{\sigma^2}{\sigma^2}\Big) = \mathrm{Normal}(0, 1).$$

$Z \sim \mathrm{Normal}(0, 1)$ is called the standard Normal random variable.

General proof that $aX + b \sim \mathrm{Normal}(a\mu + b,\ a^2\sigma^2)$:

Let $X \sim \mathrm{Normal}(\mu, \sigma^2)$, and let $Y = aX + b$. We wish to find the distribution of Y. Use the change of variable technique.

1) $y(x) = ax + b$ is monotone, so we can apply the Change of Variable technique.
2) Let $y = y(x) = ax + b$ for $-\infty < x < \infty$.
3) Then $x = x(y) = \frac{y - b}{a}$ for $-\infty < y < \infty$.
4) $\left|\frac{dx}{dy}\right| = \left|\frac{1}{a}\right| = \frac{1}{|a|}$.
5) So

$$f_Y(y) = f_X(x(y)) \left|\frac{dx}{dy}\right| = f_X\Big(\frac{y - b}{a}\Big)\, \frac{1}{|a|} = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\left(\frac{y - b}{a} - \mu\right)^2 / (2\sigma^2)} \cdot \frac{1}{|a|} = \frac{1}{\sqrt{2\pi a^2\sigma^2}}\, e^{-(y - (a\mu + b))^2 / (2 a^2\sigma^2)}$$

for $-\infty < y < \infty$. But this is the p.d.f. of a Normal($a\mu + b$, $a^2\sigma^2$) random variable. So, if $X \sim \mathrm{Normal}(\mu, \sigma^2)$, then

$$aX + b \sim \mathrm{Normal}\big(a\mu + b,\ a^2\sigma^2\big).$$
Sums of Normal random variables: if $X_1, \ldots, X_n$ are independent with $X_i \sim \mathrm{Normal}(\mu_i, \sigma_i^2)$, then

$$a_1 X_1 + \ldots + a_n X_n \sim \mathrm{Normal}\big(a_1\mu_1 + \ldots + a_n\mu_n,\ a_1^2\sigma_1^2 + \ldots + a_n^2\sigma_n^2\big).$$
For mathematicians: proof that $\int_{-\infty}^{\infty} f_X(x)\, dx = 1$, i.e. that

$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x - \mu)^2 / (2\sigma^2)}\, dx = 1.$$

The proof relies on the result $\int_{-\infty}^{\infty} e^{-y^2}\, dy = \sqrt{\pi}$. This result is non-trivial to prove. See Calculus courses for details. Using this result, the proof that $\int_{-\infty}^{\infty} f_X(x)\, dx = 1$ follows by using the change of variable $y = \frac{x - \mu}{\sqrt{2}\,\sigma}$ in the integral.
Proof that $E(X) = \mu$ and $\mathrm{Var}(X) = \sigma^2$:

$$E(X) = \int_{-\infty}^{\infty} x f_X(x)\, dx = \int_{-\infty}^{\infty} x\, \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x - \mu)^2 / (2\sigma^2)}\, dx.$$

Change variable: put $z = \frac{x - \mu}{\sigma}$: then $x = \sigma z + \mu$ and $\frac{dx}{dz} = \sigma$. Thus

$$E(X) = \int_{-\infty}^{\infty} (\sigma z + \mu)\, \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz = \sigma \int_{-\infty}^{\infty} \frac{z}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz + \mu \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz.$$

The first integrand is an odd function of z (i.e. $g(-z) = -g(z)$), so it integrates to 0 over the range $-\infty$ to $\infty$. The second integral is 1, because it is the integral of the Normal(0, 1) p.d.f. So $E(X) = \sigma \cdot 0 + \mu \cdot 1 = \mu$.

For the variance, putting $z = \frac{x - \mu}{\sigma}$ again,

$$\mathrm{Var}(X) = E\big[(X - \mu)^2\big] = \sigma^2 \int_{-\infty}^{\infty} \frac{z^2}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz = \sigma^2 \left( \Big[-\frac{z}{\sqrt{2\pi}}\, e^{-z^2/2}\Big]_{-\infty}^{\infty} + \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz \right) \quad \text{(integration by parts)}$$

$$= \sigma^2 \{0 + 1\} = \sigma^2.$$
5.2 The Central Limit Theorem (CLT)

also known as . . . the Piece of Cake Theorem

The Central Limit Theorem (CLT) is one of the most fundamental results in statistics. In its simplest form, it states that if a large number of independent random variables are drawn from any distribution, then the distribution of their sum (or alternatively their sample average) always converges to the Normal distribution.

Theorem (The Central Limit Theorem):

Let $X_1, \ldots, X_n$ be independent r.v.s with mean $\mu$ and variance $\sigma^2$, from ANY distribution. For example, $X_i \sim \mathrm{Binomial}(n, p)$ for each i, so $\mu = np$ and $\sigma^2 = np(1 - p)$.

Then the sum $S_n = X_1 + \ldots + X_n = \sum_{i=1}^n X_i$ has a distribution that tends to Normal as $n \to \infty$.

The mean of the Normal distribution is $E(S_n) = \sum_{i=1}^n E(X_i) = n\mu$.

The variance of the Normal distribution is

$$\mathrm{Var}(S_n) = \mathrm{Var}\Big(\sum_{i=1}^n X_i\Big) = \sum_{i=1}^n \mathrm{Var}(X_i) = n\sigma^2, \quad \text{because } X_1, \ldots, X_n \text{ are independent.}$$

So

$$S_n = X_1 + X_2 + \ldots + X_n \to \mathrm{Normal}(n\mu,\ n\sigma^2) \quad \text{as } n \to \infty.$$
Notes:
1. This is a remarkable theorem, because the limit holds for any distribution of $X_1, \ldots, X_n$.
2. A sufficient condition on X for the Central Limit Theorem to apply is that Var(X) is finite. Other versions of the Central Limit Theorem relax the conditions that $X_1, \ldots, X_n$ are independent and have the same distribution.
3. The speed of convergence of $S_n$ to the Normal distribution depends upon the distribution of X. Skewed distributions converge more slowly than symmetric Normal-like distributions. It is usually safe to assume that the Central Limit Theorem applies whenever $n \ge 30$. It might apply for as little as n = 4.

The Central Limit Theorem in action: simulation studies

The following simulation study illustrates the Central Limit Theorem, making use of several of the techniques learnt in STATS 210. We will look particularly at how fast the distribution of $S_n$ converges to the Normal distribution.

Example 1: Triangular distribution: $f_X(x) = 2x$ for $0 < x < 1$. Find E(X) and Var(X):
[Figure: the triangular p.d.f. $f(x) = 2x$ on $0 < x < 1$.]

$$\mu = E(X) = \int_0^1 x f_X(x)\, dx = \int_0^1 2x^2\, dx = \Big[\frac{2x^3}{3}\Big]_0^1 = \frac{2}{3}.$$

$$E(X^2) = \int_0^1 x^2 f_X(x)\, dx = \int_0^1 2x^3\, dx = \Big[\frac{2x^4}{4}\Big]_0^1 = \frac{1}{2}.$$

$$\sigma^2 = \mathrm{Var}(X) = E(X^2) - (EX)^2 = \frac{1}{2} - \Big(\frac{2}{3}\Big)^2 = \frac{1}{2} - \frac{4}{9} = \frac{1}{18}.$$
Let $S_n = X_1 + \ldots + X_n$. Then, by independence,

$$E(S_n) = n\mu = \frac{2n}{3} \quad \text{and} \quad \mathrm{Var}(S_n) = \mathrm{Var}(X_1 + \ldots + X_n) = n\sigma^2 = \frac{n}{18}.$$

So $S_n \approx \mathrm{Normal}\big(\frac{2n}{3}, \frac{n}{18}\big)$ for large n, by the Central Limit Theorem.

The graph shows histograms of 10 000 values of $S_n = X_1 + \ldots + X_n$ for n = 1, 2, 3, and 10. The Normal p.d.f. Normal($n\mu$, $n\sigma^2$) = Normal($\frac{2n}{3}$, $\frac{n}{18}$) is superimposed across the top. Even for n as low as 10, the Normal curve is a very good approximation.
[Figure: histograms of 10 000 simulated values of $S_n$ for n = 1, 2, 3, and 10, each with the Normal($\frac{2n}{3}$, $\frac{n}{18}$) p.d.f. superimposed.]
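The simulation study for the triangular distribution can be reproduced in a few lines (a Python sketch, not part of the notes). The c.d.f. is $F(x) = x^2$, so inverse-CDF sampling gives $X = \sqrt{U}$ for $U \sim \mathrm{Uniform}(0,1)$:

```python
import random
import statistics

# CLT check for the triangular density f(x) = 2x on (0, 1):
# F(x) = x^2, so inverse-CDF sampling gives X = sqrt(U).
# S_n should be approximately Normal(2n/3, n/18).
random.seed(7)
n = 10
s_vals = [sum(random.random() ** 0.5 for _ in range(n)) for _ in range(50_000)]
print(statistics.mean(s_vals))      # close to 2n/3 = 6.667
print(statistics.variance(s_vals))  # close to n/18 = 0.556
```

Plotting a histogram of `s_vals` against the Normal($\frac{2n}{3}$, $\frac{n}{18}$) curve reproduces the n = 10 panel of the figure.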
Example 2: U-shaped distribution: $f_X(x) = \frac{3}{2} x^2$ for $-1 < x < 1$.

[Figure: the U-shaped p.d.f. on $-1 < x < 1$.]

We find that $E(X) = \mu = 0$ and $\mathrm{Var}(X) = \sigma^2 = \frac{3}{5}$. (Exercise)

Then, by independence, $E(S_n) = n\mu = 0$ and $\mathrm{Var}(S_n) = \mathrm{Var}(X_1 + \ldots + X_n) = n\sigma^2 = \frac{3n}{5}$, so

$$S_n \approx \mathrm{Normal}\Big(0, \frac{3n}{5}\Big) \quad \text{for large } n.$$

Even with this highly non-Normal distribution for X, the Normal curve provides a good approximation to $S_n = X_1 + \ldots + X_n$ for n as small as 10.
[Figure: histograms of simulated values of $S_n$ for n = 1, 2, 3, and 10 from the U-shaped distribution, each with the Normal($0$, $\frac{3n}{5}$) p.d.f. superimposed.]
Normal approximation to the Binomial distribution, using the CLT

Let $Y \sim \mathrm{Binomial}(n, p)$. We can think of Y as the sum of n Bernoulli random variables: $Y = X_1 + X_2 + \ldots + X_n$, where

$$X_i = \begin{cases} 1 & \text{if trial } i \text{ is a success (prob} = p), \\ 0 & \text{otherwise (prob} = 1 - p). \end{cases}$$

So Y is a sum of n independent r.v.s, and the CLT applies. Thus,

$$\mathrm{Bin}(n, p) \to \mathrm{Normal}\big(\underbrace{np}_{\text{mean of Bin}(n,p)},\ \underbrace{np(1-p)}_{\text{var of Bin}(n,p)}\big) \quad \text{as } n \to \infty \text{ with } p \text{ fixed.}$$

The Binomial distribution is therefore well approximated by the Normal distribution when n is large, for any fixed value of p. The Normal distribution is also a good approximation to the Poisson($\lambda$) distribution when $\lambda$ is large.
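A quick numerical illustration of the Binomial approximation (Python, not part of the notes): comparing the exact $P(Y \le 55)$ for $Y \sim \mathrm{Binomial}(100, 0.5)$ with the Normal(50, 25) value, using a continuity correction of 0.5.

```python
import math

# Normal approximation to Binomial(n=100, p=0.5): mean np = 50, variance 25.
n, p = 100, 0.5
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(56))

def phi(z):
    """Standard Normal c.d.f. via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

approx = phi((55.5 - 50) / 5)  # continuity-corrected Normal value
print(round(exact, 4), round(approx, 4))  # both close to 0.86
```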
[Figure: probability functions closely matched by Normal curves; the labelled panel shows the Poisson($\lambda = 100$) distribution.]
Why the Piece of Cake Theorem?
- The Central Limit Theorem makes whole realms of statistics into a piece of cake.
- After seeing a theorem this good, you deserve a piece of cake!

Example: Remember the margin of error for an opinion poll? An opinion pollster wishes to estimate the level of support for Labour in an upcoming election. She interviews n people about their voting preferences. Let p be the true, unknown level of support for the Labour party in New Zealand. Let X be the number of the n people interviewed by the opinion pollster who plan to vote Labour. Then $X \sim \mathrm{Binomial}(n, p)$.

At the end of Chapter 2, we said that the maximum likelihood estimator for p is

$$\hat{p} = \frac{X}{n}.$$

In a large sample (large n), we now know that

$$X \approx \mathrm{Normal}(np, npq) \quad \text{where } q = 1 - p,$$

so

$$\hat{p} = \frac{X}{n} \approx \mathrm{Normal}\Big(p, \frac{pq}{n}\Big).$$
So

$$\frac{\hat{p} - p}{\sqrt{\frac{pq}{n}}} \approx \mathrm{Normal}(0, 1).$$

Now if $Z \sim \mathrm{Normal}(0, 1)$, we find (using a computer) that the 95% central probability region of Z is from $-1.96$ to $+1.96$:

$$P(-1.96 < Z < 1.96) = 0.95.$$

Check in R: pnorm(1.96, mean=0, sd=1) - pnorm(-1.96, mean=0, sd=1)

Putting $Z = \dfrac{\hat{p} - p}{\sqrt{pq/n}}$, we obtain

$$P\left(-1.96 < \frac{\hat{p} - p}{\sqrt{\frac{pq}{n}}} < 1.96\right) \approx 0.95.$$

Rearranging,

$$P\left(\hat{p} - 1.96\sqrt{\frac{pq}{n}} < p < \hat{p} + 1.96\sqrt{\frac{pq}{n}}\right) \approx 0.95.$$

This enables us to form an estimated 95% confidence interval for the unknown parameter p:

estimated 95% confidence interval is $\hat{p} - 1.96\sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}}$ to $\hat{p} + 1.96\sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}}$.

About 95% of the time, these random end-points will enclose the true unknown value, p.
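The claim that the random end-points enclose the true p about 95% of the time can be checked by repeated sampling (a Python illustration, not part of the notes; the true p = 0.4 and n = 1000 are arbitrary choices):

```python
import math
import random

# Coverage check: p-hat +/- 1.96 * sqrt(p-hat (1 - p-hat) / n)
# should contain the true p roughly 95% of the time.
random.seed(8)
p_true, n = 0.4, 1000
reps = 2000
covered = 0
for _ in range(reps):
    x = sum(random.random() < p_true for _ in range(n))  # one Binomial draw
    p_hat = x / n
    half = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    covered += (p_hat - half < p_true < p_hat + half)
print(covered / reps)  # close to 0.95
```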
Confidence intervals are extremely important for helping us to assess how useful our estimate is.
Using the Central Limit Theorem to find the distribution of the mean, $\overline{X}$

Let $X_1, \ldots, X_n$ be independent, identically distributed with mean $E(X_i) = \mu$ and variance $\mathrm{Var}(X_i) = \sigma^2$ for all i. The sample mean, $\overline{X}$, is defined as:

$$\overline{X} = \frac{X_1 + X_2 + \ldots + X_n}{n}.$$

So $\overline{X} = \frac{S_n}{n}$, where $S_n = X_1 + \ldots + X_n \approx \mathrm{Normal}(n\mu, n\sigma^2)$ by the CLT. Because $\overline{X}$ is a scalar multiple of a Normal r.v. as n grows large, $\overline{X}$ itself is approximately Normal for large n:

$$\overline{X} = \frac{X_1 + X_2 + \ldots + X_n}{n} \approx \mathrm{Normal}\Big(\mu, \frac{\sigma^2}{n}\Big) \quad \text{as } n \to \infty.$$

The following three statements of the Central Limit Theorem are equivalent:

$$\overline{X} = \frac{X_1 + \ldots + X_n}{n} \approx \mathrm{Normal}\Big(\mu, \frac{\sigma^2}{n}\Big) \quad \text{as } n \to \infty;$$

$$S_n = X_1 + \ldots + X_n \approx \mathrm{Normal}(n\mu, n\sigma^2) \quad \text{as } n \to \infty;$$

$$\frac{S_n - n\mu}{\sqrt{n\sigma^2}} = \frac{\overline{X} - \mu}{\sqrt{\sigma^2/n}} \approx \mathrm{Normal}(0, 1) \quad \text{as } n \to \infty.$$

The essential point to remember about the Central Limit Theorem is that large sums or sample means of independent random variables converge to a Normal distribution, whatever the distribution of the original r.v.s.

More general version of the CLT

A more general form of the CLT states that, if $X_1, \ldots, X_n$ are independent, and $E(X_i) = \mu_i$, $\mathrm{Var}(X_i) = \sigma_i^2$ (not necessarily all equal), then

$$Z_n = \frac{\sum_{i=1}^n (X_i - \mu_i)}{\sqrt{\sum_{i=1}^n \sigma_i^2}} \to \mathrm{Normal}(0, 1) \quad \text{as } n \to \infty.$$
Other versions of the CLT relax the condition that X1 , . . . , Xn are independent.
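The sample-mean form of the CLT is also easy to see by simulation (a Python sketch, not part of the notes), here with Exponential(1) data, so $\mu = 1$ and $\sigma^2 = 1$:

```python
import random
import statistics

# Sample means of n i.i.d. Exponential(1) r.v.s should be approximately
# Normal(mu, sigma^2/n) = Normal(1, 1/n) for large n.
random.seed(10)
n = 40
means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(20_000)]
print(statistics.mean(means))      # close to 1
print(statistics.variance(means))  # close to 1/40 = 0.025
```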
Chapter 6: Wrapping Up
Probably the two major ideas of this course are: likelihood and estimation; hypothesis testing.
Most of the techniques that we have studied along the way are there to help us with these two goals: expectation, variance, distributions, change of variable, and the Central Limit Theorem. Let's see how these different ideas all come together.
6.1 Estimators: the good, the bad, and the estimator PDF

We have seen that an estimator is a capital letter replacing a small letter. What's the point of that?

Example: Let $X \sim \mathrm{Binomial}(n, p)$ with known n and observed value $X = x$. The maximum likelihood estimate of p is $\hat{p} = \frac{x}{n}$. The maximum likelihood estimator of p is $\hat{p} = \frac{X}{n}$.

Example: Let $X \sim \mathrm{Exponential}(\lambda)$ with observed value $X = x$. The maximum likelihood estimate of $\lambda$ is $\hat{\lambda} = \frac{1}{x}$. The maximum likelihood estimator of $\lambda$ is $\hat{\lambda} = \frac{1}{X}$.

Why are we interested in estimators? The answer is that estimators are random variables. This means they have distributions, means, and variances that tell us how well we can trust our single observation, or estimate, from this distribution.
Good and bad estimators

Suppose that $X_1, X_2, \ldots, X_n$ are independent, and $X_i \sim \mathrm{Exponential}(\lambda)$ for all i. $\lambda$ is unknown, and we wish to estimate it. In Chapter 4 we calculated the maximum likelihood estimator of $\lambda$:

$$\hat{\lambda} = \frac{n}{X_1 + X_2 + \ldots + X_n} = \frac{1}{\overline{X}}.$$

Now $\hat{\lambda}$ is a random variable with a distribution. For a given value of n, we can calculate the p.d.f. of $\hat{\lambda}$. How? We know that $T = X_1 + \ldots + X_n \sim \mathrm{Gamma}(n, \lambda)$, and $\hat{\lambda} = n/T$, so we can find the p.d.f. of $\hat{\lambda}$ using the change of variable technique.

Here are the p.d.f.s of $\hat{\lambda}$ for two different values of n:
- Estimator 1: n = 100. 100 pieces of information about $\lambda$.
- Estimator 2: n = 10. 10 pieces of information about $\lambda$.

[Figure: the p.d.f. of Estimator 1 (n = 100) is tightly concentrated about the true $\lambda$ (unknown); the p.d.f. of Estimator 2 (n = 10) is much more spread out.]

Clearly, the more information we have, the better. The p.d.f. for n = 100 is focused much more tightly about the true value $\lambda$ (unknown) than the p.d.f. for n = 10.

It is important to recognise what we do and don't know in this situation:
- What we don't know: the true $\lambda$, and where on the p.d.f. curve our estimate has fallen.
- What we do know: the p.d.f. curve itself. For a tightly focused curve like Estimator 1's, everywhere on the curve is good!

This is why we are so concerned with estimator variance. A good estimator has low estimator variance: everywhere on the estimator's p.d.f. curve is guaranteed to be good. A poor estimator has high estimator variance: some places on the estimator's p.d.f. curve may be good, while others may be very bad. Because we don't know where we are on the curve, we can't trust any estimate from this poor estimator. The estimator variance tells us how much the estimator can be trusted.

Note: We were lucky in this example to happen to know that $T = X_1 + \ldots + X_n \sim \mathrm{Gamma}(n, \lambda)$ when $X_i$ are i.i.d. Exponential($\lambda$), so we could find the p.d.f. of our estimator $\hat{\lambda} = n/T$. We won't usually be so lucky: so what should we do?
Example: calculating the maximum likelihood estimator

The following question is in the same style as the exam questions. Let X be a continuous random variable with probability density function

$$f_X(x) = \begin{cases} \dfrac{2(s - x)}{s^2} & \text{for } 0 < x < s, \\ 0 & \text{otherwise.} \end{cases}$$

(a) Show that $E(X) = \dfrac{s}{3}$.

Use $E(X) = \displaystyle\int_0^s x f_X(x)\, dx = \frac{2}{s^2}\int_0^s x(s - x)\, dx = \frac{2}{s^2}\Big(\frac{s^3}{2} - \frac{s^3}{3}\Big) = \frac{2}{s^2} \cdot \frac{s^3}{6} = \frac{s}{3}$.

(b) Show that $E(X^2) = \dfrac{s^2}{6}$.

Use $E(X^2) = \displaystyle\int_0^s x^2 f_X(x)\, dx = \frac{2}{s^2}\int_0^s x^2(s - x)\, dx = \frac{2}{s^2}\Big(\frac{s^4}{3} - \frac{s^4}{4}\Big) = \frac{2}{s^2} \cdot \frac{s^4}{12} = \frac{s^2}{6}$.

(c) Hence $\mathrm{Var}(X) = E(X^2) - (EX)^2 = \dfrac{s^2}{6} - \dfrac{s^2}{9} = \dfrac{s^2}{18}$.

(d) Suppose that we make a single observation $X = x$. Write down the likelihood function, $L(s\,;\,x)$, and state the range of values of s for which your answer is valid.

$$L(s\,;\,x) = \frac{2(s - x)}{s^2} \quad \text{for } x < s < \infty.$$

(e) The likelihood graph for a particular value of x is shown here. Show that the maximum likelihood estimator of s is $\hat{s} = 2X$. You should refer to the graph in your answer.

[Figure: the likelihood $L(s\,;\,x)$ plotted against s: it rises from 0 at $s = x$ to a single maximum, then decays towards 0 as $s \to \infty$.]
Write $L(s\,;\,x) = 2s^{-2}(s - x) = 2s^{-1} - 2x s^{-2}$. Then

$$\frac{dL}{ds} = -2s^{-2} + 4x s^{-3} = 2s^{-3}(2x - s).$$

At the MLE, $\frac{dL}{ds} = 0$, so $s \to \infty$ or $s = 2x$. From the graph, we can see that $s = \infty$ is not the maximum. So $\hat{s} = 2x$. Thus the maximum likelihood estimator is

$$\hat{s} = 2X.$$
(f) Find the estimator variance, $\mathrm{Var}(\hat{s})$, in terms of s. Hence find the estimated variance, $\widehat{\mathrm{Var}}(\hat{s})$, in terms of $\hat{s}$.

$$\mathrm{Var}(\hat{s}) = \mathrm{Var}(2X) = 2^2\, \mathrm{Var}(X) = 4 \cdot \frac{s^2}{18} = \frac{2s^2}{9} \quad \text{by (c)}.$$

So also:

$$\widehat{\mathrm{Var}}(\hat{s}) = \frac{2\hat{s}^2}{9}.$$
(g) Suppose we make the single observation $X = 3$. Find the maximum likelihood estimate of s, and its estimated variance and standard error.

$$\hat{s} = 2X = 2 \times 3 = 6.$$

$$\widehat{\mathrm{Var}}(\hat{s}) = \frac{2\hat{s}^2}{9} = \frac{2 \times 6^2}{9} = 8.$$

$$\widehat{\mathrm{se}}(\hat{s}) = \sqrt{\widehat{\mathrm{Var}}(\hat{s})} = \sqrt{8} = 2.82.$$

This means $\hat{s}$ is a POOR estimator: the twice standard-error interval would be $6 - 2 \times 2.82$ to $6 + 2 \times 2.82$: that is, 0.36 to 11.64! Taking the twice standard error interval strictly applies only to the Normal distribution, but it is a useful rule of thumb to see how good the estimator is.
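The estimator variance in part (f) can be checked by simulation (a Python sketch, not part of the question). Sampling X by inverse-CDF, $F_X(x) = 1 - (1 - x/s)^2$, so $X = s(1 - \sqrt{1 - U})$, and forming $\hat{s} = 2X$ with the true s = 6 should give a variance close to $\frac{2s^2}{9} = 8$:

```python
import math
import random
import statistics

# Check Var(s-hat) = 2 s^2 / 9 for s-hat = 2X, where X has density
# f(x) = 2(s - x)/s^2 on (0, s). Inverse-CDF: X = s (1 - sqrt(1 - U)).
random.seed(9)
s = 6.0
s_hats = [2 * s * (1 - math.sqrt(1 - random.random())) for _ in range(200_000)]
print(statistics.variance(s_hats))  # close to 2 * 36 / 9 = 8
```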
(h) Write a sentence in plain English to explain what the maximum likelihood estimate from part (g) represents. The value s = 6 is the value of s under which the observation X = 3 is more likely than it is at any other value of s.
6.2 Hypothesis tests: in search of a distribution

When we do a hypothesis test, we need a test statistic: some random variable with a distribution that we can specify exactly under $H_0$ and that differs under $H_1$. It is finding the distribution that is the difficult part.

Weird coin: is my coin fair? Let X be the number of heads out of 10 tosses. $X \sim \mathrm{Binomial}(10, p)$. We have an easy distribution and can do a hypothesis test.

Too many daughters? Do divers have more daughters than sons? Let X be the number of daughters out of 190 diver children. $X \sim \mathrm{Binomial}(190, p)$. Easy.
Too long between volcanoes? Let X be the length of time between volcanic eruptions. If we assume volcanoes occur as a Poisson process, then $X \sim \mathrm{Exponential}(\lambda)$. We have a simple distribution and test statistic (X): we can test the observed length of time between eruptions and see if this is a believable observation under a hypothesized value of $\lambda$.

More advanced tests

Most things in life are not as easy as the three examples above. Here are some observations. Do they come from a distribution (any distribution) with mean 0?

3.96 2.32 -1.81 -0.14 3.22 1.37 -0.17 1.85 0.61 -0.58 1.07 -0.52 0.40 1.54 -1.42 -0.85 0.51 1.66 1.48 1.54
Answer: yes, they are Normal(0, 4), but how can we tell? What about these?

3.3 -30.0 -8.1 8.1 -7.8 -9.0 3.4 -1.3 8.1 -13.7 12.6 -5.0 -9.6 -6.6 1.4 -5.6 -6.4 -11.8 2.5 9.0

Again, yes they do (Normal(0, 100) this time), but how can we tell? The unknown variance (4 versus 100) interferes, so that the second sample does not cluster about its mean of 0 at all.

What test statistic should we use? If we don't know that our data are Normal, and we don't know their underlying variance, what can we use as our X to test whether $\mu = 0$?

Answer: a clever person called W. S. Gossett (1876-1937) worked out an answer. He called himself only "Student", possibly because he (or his employers) wanted it to be kept secret that he was doing his statistical research as part of his employment at Guinness Brewery. The test that Student developed is the familiar Student's t-test. It was originally developed to help Guinness decide how large a sample of people should be used in its beer tastings!
Student used the following test statistic for the unknown mean, $\mu$:

$$T = \frac{\overline{X} - \mu}{\sqrt{\dfrac{\sum_{i=1}^n (X_i - \overline{X})^2}{n(n-1)}}}.$$

Under $H_0$, the p.d.f. of T is proportional to

$$\left(1 + \frac{t^2}{n - 1}\right)^{-n/2}.$$
The distribution of T is Student's t-distribution, derived as the ratio of a Normal random variable and an independent Chi-squared random variable. If $\mu \ne 0$, observations of T will tend to lie out in the tails of this distribution. The Student's t-test is exact when the distribution of the original data $X_1, \ldots, X_n$ is Normal. For other distributions, it is still approximately valid in large samples, by the Central Limit Theorem.

It looks difficult . . . It is! Most of the statistical tests in common use have deep (and sometimes quite impenetrable) theory behind them. As you can probably guess, Student did not derive the distribution above without a great deal of hard work. The result, however, is astonishing. With the help of our best friend the Central Limit Theorem, Student's T-statistic gives us a test for $\mu = 0$ (or any other value) that can be used with any large enough sample.

The Chi-squared test for testing proportions in a contingency table also has a deep theory, but once researchers had derived the distribution of a suitable test statistic, the rest was easy. In the Chi-squared goodness-of-fit test, the Pearson's chi-square test statistic is shown to have a Chi-squared distribution under $H_0$. It produces larger values under $H_1$.

One interesting point to note is the pivotal role of the Central Limit Theorem in all of this. The Central Limit Theorem produces approximate Normal distributions. Normal random variables squared produce Chi-squared random variables. Normals divided by Chi-squareds produce t-distributed random variables. A ratio of two Chi-squared distributions produces an F-distributed random variable. All these things are not coincidental: the Central Limit Theorem rocks!