David Nott
standj@nus.edu.sg
Department of Statistics and Applied Probability
National University of Singapore
GEM2900: Understanding Uncertainty & Statistical Thinking DSAP, NUS, Semester 2, 2008/2009 – 1
More about sampling
Then p̂ = X/n is the proportion in the sample who answer yes, and this estimates
the proportion p in the whole population who would say yes.
We said that X (and p̂) can be thought of as random variables, since they would
vary randomly if we were to draw different samples.
The distribution of the random variable p̂ can tell us something about the reliability
of the results of the poll (for example, the standard deviation would tell us
something about how much p̂ is expected to vary from sample to sample).
We had an estimate of the standard deviation of p̂, the standard error, and 1.96
times the standard error was what we called the margin of error. The margin of
error has the property that if we construct intervals p̂ plus or minus the margin
of error over many samples, then for about 95% of samples the intervals will
contain the true p.
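To make the recap concrete, here is a minimal sketch of the calculation. It assumes the usual estimate of the standard error, the square root of p̂(1 − p̂)/n, which is not written out on this slide. The poll counts are hypothetical, and `margin_of_error` is a name chosen for illustration.

```python
import math

def margin_of_error(yes_count, n, z=1.96):
    """Return (p_hat, standard_error, margin) for a yes/no poll of size n."""
    p_hat = yes_count / n
    # Estimated standard deviation of p_hat (assumed formula: sqrt(p(1-p)/n)).
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, standard_error, z * standard_error

# Hypothetical poll: 520 of 1000 respondents answer yes.
p_hat, se, moe = margin_of_error(520, 1000)
print(f"p_hat = {p_hat:.3f}, standard error = {se:.4f}, margin = {moe:.4f}")
print(f"approximate 95% interval: ({p_hat - moe:.3f}, {p_hat + moe:.3f})")
```

Because n sits under a square root, the sample size must be quadrupled to halve the margin of error.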
Example: the worst opinion poll ever
The result of the election was that Roosevelt got 62% and Landon
38%.
The Literary Digest poll was based on 2.3 million responses - small
sample size was certainly not an issue!
How was the poll conducted? When they selected people to be included in the poll
they chose people from address lists such as their own subscription list.
Also, they mailed postcards to 10 million people, and their poll results were based
on the 2.3 million who responded. The tendency to respond may be correlated
with what you’re trying to measure (someone who has just lost their job and may
be sympathetic to Roosevelt may very likely not respond).
Using a smaller sample size but making greater efforts to ensure that the sample
was randomly chosen would have yielded a much more accurate result.
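A small calculation can illustrate why a huge sample does not help when the tendency to respond depends on the answer. This is only a sketch: the response rates below are hypothetical, not figures from the Literary Digest poll, and `observed_share` is a name chosen for illustration.

```python
def observed_share(true_share, response_rate_a, response_rate_b):
    """Expected share for candidate A among the people who respond.

    true_share      -- fraction of the population supporting A
    response_rate_a -- probability that an A supporter responds
    response_rate_b -- probability that a B supporter responds
    """
    responders_a = true_share * response_rate_a
    responders_b = (1 - true_share) * response_rate_b
    return responders_a / (responders_a + responders_b)

true_share = 0.62  # Roosevelt's share of the vote, from the text

# Hypothetical: Roosevelt supporters respond less often than Landon supporters.
print(observed_share(true_share, 0.15, 0.35))

# Equal response rates: the responders mirror the population.
print(observed_share(true_share, 0.25, 0.25))
```

The sample size does not appear anywhere in this calculation, so mailing more postcards leaves the bias unchanged.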
Bias in polls
Confidentiality in polls
The question I will ask is potentially embarrassing. The question is: have you ever
cheated in an exam at NUS?
So if they answer ‘yes’ I cannot be sure whether they really cheated (responding
truthfully) or whether they just flipped two heads.
Now, I’d expect that since the probability of two heads when I flip a
coin twice is 1/4, about 1/4 × 20 = 5 people are going to answer
yes regardless of whether they’ve cheated or not.
So let’s subtract 5 from both 20 and 10, and then I have an estimate
of 5 people out of 15 who are answering yes truthfully.
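The subtraction above can be written as a small function. This sketch assumes the set-up implied by the text: each respondent flips a coin twice in private, answers yes regardless of the truth if both flips are heads, and otherwise answers truthfully. The figures of 20 respondents and 10 yes answers are read off from the subtraction in the text, and the function name is chosen for illustration.

```python
def randomized_response_estimate(n_respondents, n_yes, p_forced_yes=0.25):
    """Estimate the proportion of truthful 'yes' answers.

    p_forced_yes is the chance of being told to say yes regardless of the
    truth (two heads in two fair coin flips gives 1/4).
    """
    expected_forced = p_forced_yes * n_respondents   # about 5 of 20
    truthful_yes = n_yes - expected_forced           # 10 - 5 = 5
    truthful_total = n_respondents - expected_forced # 20 - 5 = 15
    return truthful_yes / truthful_total

print(randomized_response_estimate(20, 10))
```

With so few respondents the number of forced yes answers will often differ from 5, so the estimate is rough. It can even be negative if fewer people answer yes than the expected number of forced answers.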
Olofsson (2007) states the following law of rare events: suppose we are dealing
with some rare, unpredictable event that occurs on average λ times. The number
of occurrences is said to follow a Poisson distribution, with the probability of k
events given by
\[ P(k \text{ occurrences}) = \frac{\exp(-\lambda)\,\lambda^{k}}{k!}. \]
We won’t be too precise about the exact conditions needed for this law of rare
events to hold.
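As a quick illustration of the formula, here is a sketch that evaluates the Poisson probabilities directly. The rate λ = 1 is a hypothetical choice, and `poisson_pmf` is a helper name chosen for illustration.

```python
import math

def poisson_pmf(k, lam):
    """P(k occurrences) = exp(-lam) * lam**k / k! for an average rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 1.0  # hypothetical average number of occurrences
for k in range(5):
    print(k, round(poisson_pmf(k, lam), 4))
```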
Poisson probability distribution
\[ P(X = x) = p_X(x) = e^{-\lambda}\,\frac{\lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots \]
for some λ > 0. We also write X ∼ Poisson(λ).
E[X] = Var[X] = λ
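The claim that the mean and variance both equal λ can be checked numerically. The sketch below sums over x = 0, 1, ..., 99 as a stand-in for the infinite sum, which is an approximation that is adequate for small λ. The value λ = 3 is hypothetical and the function names are chosen for illustration.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**x / math.factorial(x)

def poisson_mean_and_variance(lam, cutoff=100):
    """Approximate E[X] and Var[X] by truncating the sums at x = cutoff - 1."""
    xs = range(cutoff)
    mean = sum(x * poisson_pmf(x, lam) for x in xs)
    second_moment = sum(x * x * poisson_pmf(x, lam) for x in xs)
    return mean, second_moment - mean**2

print(poisson_mean_and_variance(3.0))
```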
P (X = x) ≈ P (Y = x)
The probability that no success was observed among the n (independent) Bernoulli trials is
Matching Problem
7 couples attend a dancing class where the instructor pairs everyone off
at random. What is the probability that at least one couple gets to dance
together? We calculated the exact answer in a previous lecture using the
inclusion-exclusion formula:
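\[ 7 \times \frac{1}{7} - 21 \times \frac{1}{42} + 35 \times \frac{1}{210} - 35 \times \frac{1}{840} + 21 \times \frac{1}{2520} - 7 \times \frac{1}{5040} + 1 \times \frac{1}{5040} \]
\[ = 1 - \frac{1}{2} + \frac{1}{6} - \frac{1}{24} + \frac{1}{120} - \frac{1}{720} + \frac{1}{5040} \approx 0.6321 \]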
Do you notice a pattern in the fractions? For n couples the answer would be
\[ \frac{1}{1!} - \frac{1}{2!} + \frac{1}{3!} - \frac{1}{4!} + \frac{1}{5!} - \frac{1}{6!} + \cdots \pm \frac{1}{n!} \approx 0.6321 \]
Olofsson (2007)
An application of the law of rare events
We can think of the problem as writing down the numbers 1 to 7 in one column,
and then writing down a random permutation of the numbers 1 to 7 in a second
column alongside. What is the chance of a match?
For any position, the chance of a match is 1/7. So with 7 positions we expect an
average of 7 × 1/7 = 1 match. If a match is considered “rare” then the law of
rare events says that the chance of zero matches is exp(−1) and hence the
chance of one or more matches is 1 − exp(−1). This is 0.6321. Remarkably, the
law of rare events agrees with the exact answer to four decimal places here.
Note that the expression that we got from the inclusion-exclusion formula is
1 minus the first eight terms in the series expansion of exp(−1), for those of
you who know what this means.
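The comparison between the exact answer and the law of rare events can be reproduced in a few lines. This is a sketch using exact fractions for the inclusion-exclusion sum. The function name is chosen for illustration.

```python
import math
from fractions import Fraction

def prob_at_least_one_match(n):
    """Exact chance of at least one match among n couples paired at random.

    Uses the inclusion-exclusion sum 1/1! - 1/2! + 1/3! - ... +/- 1/n!.
    """
    return sum(Fraction((-1) ** (k + 1), math.factorial(k))
               for k in range(1, n + 1))

exact = prob_at_least_one_match(7)
approx = 1 - math.exp(-1)  # law of rare events with an average of 1 match

print("exact fraction:", exact)
print("exact decimal: ", float(exact))
print("approximation: ", approx)
```

Calling the function with larger n shows that the exact answer barely changes as the number of couples grows.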
Poisson probability distribution (cont.)
Under the model, there is a probability of 0.543 of a zero count for the
number of deaths in one year for one of the corps. With 200 corps-years
observed, we expect 54.3% of the 200 corps-years to result in a zero, i.e.
an expected count of about 108.67 (computed from the unrounded probability
0.54335; the rounded value gives 0.543 × 200 = 108.6).
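The expected counts for the other categories follow in the same way. The fitted rate λ is not stated on this slide, so the sketch below backs it out from the quoted figures using P(X = 0) = exp(−λ). This is an inference from the text rather than a value taken from the lecture, and the variable names are chosen for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of exactly k events at rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n_corps_years = 200           # from the text
expected_zero_count = 108.67  # from the text

# P(X = 0) = exp(-lam), so the rate implied by the quoted figures is:
p_zero = expected_zero_count / n_corps_years
implied_lam = -math.log(p_zero)  # inferred, not stated in the text

# Expected number of corps-years with k = 0, 1, 2, 3, 4 deaths.
expected_counts = [n_corps_years * poisson_pmf(k, implied_lam) for k in range(5)]

print("implied rate:", round(implied_lam, 3))
for k, e in enumerate(expected_counts):
    print(k, round(e, 2))
```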
Poisson probability distribution (cont.)
\[ \chi^2 = \sum_{i=1}^{c} \frac{(o_i - e_i)^2}{e_i} \]
if we have data that falls into c categories; \(o_i\) and \(e_i\) are the observed and
expected frequencies for category i, respectively.
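The statistic is simple to compute once the observed and expected frequencies are lined up by category. Here is a minimal sketch. The counts are hypothetical and are not the data from the lecture, and the function name is chosen for illustration.

```python
def chi_square_statistic(observed, expected):
    """Sum of (o_i - e_i)^2 / e_i over the c categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for c = 3 categories (not the lecture's data).
observed = [18, 22, 20]
expected = [20.0, 20.0, 20.0]
print(chi_square_statistic(observed, expected))
```

A value near zero means the observed counts sit close to what the model expects. Larger values indicate a poorer fit.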