
Statistics W4150: Introduction to Probability and Statistics
Professor Philip Protter
1029 SSW; pep2117@columbia.edu; 212-851-2145
Lectures, Week 1

September 8 & 10, 2015

1 / 41

Details

Office Hours: 2pm to 4pm Fridays in 1029 SSW (121st and Amsterdam)
Textbook: Probability and Statistics, Fourth Edition, by M.H. DeGroot & Mark Schervish, ISBN 978-0-321-50046-5 (On reserve in the math library)
We will communicate primarily through Courseworks
Weekly homework: Homework is due in class on Tuesdays
One midterm, and a final
Grading: Homeworks 15%, Midterm 35%, Final 50%

2 / 41

Probability versus Statistics


Tossing a coin: heads or tails
What is the probability? What does this mean?

   lim_{n→∞} (# of heads in n tosses) / n    (1)

Does this limit exist? (Recall from calculus that limits need not exist)
Example: a_n = (-1)^n; then a_n = 1 if n is even and a_n = -1 if n is odd
lim sup a_n = 1 and lim inf a_n = -1
We can never really know the limit in (1); we make an educated guess; this is statistics

3 / 41

The Axioms of Probability


1. For every set A, P(A) ≥ 0
2. P(S) = 1, where S is the entire space of possible outcomes
3. If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)
4. Two, or even a finite number of events, is not enough, even for simple experiments (e.g., toss a coin until the first time heads appears); so we have a more subtle axiom, replacing (3): For every sequence of events A_1, A_2, . . . with A_i ∩ A_j = ∅ when i ≠ j, we have

   P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i)

4 / 41

Consequences of the Axioms of Probability


P(A^c) = 1 - P(A)
If A ⊂ B then P(A) ≤ P(B)
Any event A has 0 ≤ P(A) ≤ 1
For any events A and B (not necessarily disjoint)

   P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

Subadditivity: For any sequence of events A_1, A_2, . . . we have

   P(∪_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ P(A_i)

5 / 41

Counting
In elementary probability, much attention is paid to counting, which is more complicated than one might a priori think


Rolling two dice: Possible outcome sets:
{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
and
(1, 1), (1, 2), . . . , (1, 6)
(2, 1), (2, 2), . . . , (2, 6)
. . .
(6, 1), (6, 2), . . . , (6, 6)
We can assign probabilities 1/11 to each outcome in the first example, but it does not correspond to our experience
Better to assign probabilities (1/6)^2 = 1/36 to each outcome in the second example

6 / 41

We can then count the number of outcomes corresponding, for example, to the event {8}, to get (2, 6), (3, 5), (4, 4), (5, 3), (6, 2); hence P({8}) = 5/36 (checked in the sketch below)
We note that the outcome (2, 6) is symmetric to (6, 2). So we need only count (2, 6) and (3, 5), to get 2 events. Then we double that and add 1 more for (4, 4) to arrive at 5 events corresponding to the outcome {8}
In more complicated situations, the counting process can be difficult
In this example, the two dice represent an experiment with two parts. Each die represents one part.
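A quick Python sketch (my own check, not from the slides) that brute-forces this count over the 36 ordered outcomes:

```python
from itertools import product
from fractions import Fraction

# All 36 equally likely ordered outcomes of rolling two dice.
outcomes = list(product(range(1, 7), repeat=2))

# Outcomes whose sum is 8: (2,6), (3,5), (4,4), (5,3), (6,2).
favorable = [o for o in outcomes if sum(o) == 8]

print(len(favorable), len(outcomes))            # 5 36
print(Fraction(len(favorable), len(outcomes)))  # 5/36
```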

7 / 41

Multiplication Rule: If we have an experiment with multiple parts, we multiply the number of items in each part to get the total number of outcomes.
Example: The experiment is to toss a coin and roll a die. The number of total outcomes is then 2 × 6 = 12. The probability of an event is 1/2 × 1/6 = 1/12, or simply take

   1 / (Total number of outcomes)

to get the probability of an individual event.

8 / 41

Permutations

The canonical model is that we have an urn filled with n numbered balls. We repeatedly draw samples of size k (typically k < n) from the urn. We do not replace the k balls after they are drawn.
We can record the order of the samples as we take them out, k at a time. Or we can disregard the order. To relate the counting procedures, we have first to count how many different ways we can sample k balls from n.
This counts how many permutations we can have, and is denoted P_{n,k}

9 / 41

Permutations
Theorem: The number of permutations of n elements taken k at a time is

   P_{n,k} = n(n - 1)(n - 2) · · · (n - k + 1)

Since n! = n(n - 1)(n - 2) · · · (3)(2)(1), we see that

   P_{n,k} = n! / (n - k)!    (2)

Warning: n! gets big very quickly, making calculations using (2) sometimes difficult
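A small Python check of formula (2), assuming an illustrative choice n = 10, k = 3 (`math.perm` requires Python 3.8+):

```python
import math

n, k = 10, 3  # hypothetical values for illustration

# P_{n,k} = n (n-1) ... (n-k+1) = n! / (n-k)!
via_factorials = math.factorial(n) // math.factorial(n - k)

print(via_factorials, math.perm(n, k))  # 720 720 -- the two formulas agree
```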

10 / 41

Stirling's Formula
Theorem [Stirling's Formula]:

   lim_{n→∞} [(2π)^{1/2} n^{n + 1/2} e^{-n}] / n! = 1

Example:

   70!/50! ≈ [(2π)^{1/2} 70^{70.5} e^{-70}] / [(2π)^{1/2} 50^{50.5} e^{-50}] = 3.940 × 10^{35}

See the textbook, Example 1.7.12 on page 31.
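A sketch of this comparison in Python (my own check, not the textbook's code); it computes the ratio 70!/50! exactly and via Stirling's approximation:

```python
import math

def stirling(n):
    # Stirling's approximation to n!: (2*pi)^(1/2) * n^(n + 1/2) * e^(-n)
    return math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)

exact = math.factorial(70) / math.factorial(50)
approx = stirling(70) / stirling(50)

print(f"{exact:.3e}  {approx:.3e}")  # both are about 3.94e+35
```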

11 / 41

Combinations
Now the idea is that we sample k balls from the n in the urn (k < n), again without replacement, but this time without regard to order. The number of ways we can do this we call C_{n,k}.
Theorem: The number of distinct subsets of size k that can be chosen from a set of size n is

   C_{n,k} = P_{n,k} / k! = n! / (k!(n - k)!)

There is a more common notation for C_{n,k}, and that is the binomial coefficient (n choose k). That is, we have

   (n choose k) = n! / (k!(n - k)!)
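A minimal Python check, again with the illustrative values n = 10, k = 3 (`math.comb` is Python 3.8+):

```python
import math

n, k = 10, 3  # hypothetical values for illustration

via_perm = math.perm(n, k) // math.factorial(k)          # P_{n,k} / k!
via_factorials = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))  # n!/(k!(n-k)!)

print(via_perm, via_factorials, math.comb(n, k))  # 120 120 120
```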

12 / 41

There are many examples in the textbook on how to use these ideas (Chapter 1)
We now begin Chapter 2: Conditional Probability
We have discussed in simple situations how to calculate P(A) for an event A
Suppose we are told something has happened, call it B; this could change the probability of A
For example, horse racing, and it rains at the race track
Or for example, the Super Bowl: A = {Denver wins}, and B = {Peyton Manning has stomach flu}
P(A|B) denotes the probability of A given the knowledge that B has occurred
This is different from P(A)
If P(B) > 0, then

   P(A|B) = P(A ∩ B) / P(B)

This is a simple but powerful concept
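A small sketch (my own illustration, not from the slides) that computes a conditional probability by enumeration, for two dice with the assumed events A = {sum is 8} and B = {first die is even}:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))   # two fair dice, 36 outcomes

A = {o for o in outcomes if sum(o) == 8}          # the sum is 8
B = {o for o in outcomes if o[0] % 2 == 0}        # the first die is even

def prob(event):
    return Fraction(len(event), len(outcomes))

# P(A|B) = P(A ∩ B) / P(B)
print(prob(A))                 # 5/36
print(prob(A & B) / prob(B))   # 1/6 -- knowing B changed the probability of A
```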

13 / 41

Example

Your roommate goes to donate blood. The nurse at the bloodmobile gives his blood a screening test, which is 99% accurate, and rejects his blood due to AIDS contamination
Your roommate considers suicide; what do you do?

14 / 41

The nurse said the test was 99% accurate. She meant P(TP|A) = 0.99
TP = test positive; A = AIDS
You asked about false positives; she said it was wrong about 5% of the time; does this mean his probability of AIDS is 0.99 - 0.05 = 0.94?
No!

15 / 41

You explain to your roommate he wants P(A|TP) and this is very different

   P(A|TP) = P(A ∩ TP) / P(TP);    P(TP|A) = P(A ∩ TP) / P(A) = 0.99

According to the CDC, 1.1 million have AIDS in the general population; the US population is 314 million
So the incidence of AIDS in the population is

   1.1/314 = 0.0035 = P(A)

Therefore

   P(A ∩ TP) = [P(A ∩ TP) / P(A)] P(A) = 0.99 × 0.0035 = 0.00347

Caveat: We are being sleazy here; not everyone is equally likely to have AIDS.

16 / 41

5% false positives means that P(TP) ≈ 0.05
So we conclude P(A|TP) = P(A ∩ TP) / P(TP) = 0.00347 / 0.05 ≈ 0.069, or around 7%
Suicide is not indicated
The number of mistakes involving not understanding conditional probabilities is amazing
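The whole calculation fits in a few lines of Python (a sketch using only the numbers quoted on the slides):

```python
# Numbers from the slides: P(TP|A) = 0.99, P(TP) ~ 0.05, P(A) = 1.1/314.
p_tp_given_a = 0.99
p_a = 1.1 / 314
p_tp = 0.05

p_a_and_tp = p_tp_given_a * p_a    # P(A ∩ TP) = P(TP|A) P(A)
p_a_given_tp = p_a_and_tp / p_tp   # P(A|TP) = P(A ∩ TP) / P(TP)

print(round(p_a_and_tp, 5), round(p_a_given_tp, 3))  # 0.00347 0.069
```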

17 / 41

Going backwards with conditional probabilities


Let A_1, A_2, . . . , A_n be a partition of S
This means they are all disjoint (A_i ∩ A_j = ∅ if i ≠ j) and their union is all of S
Theorem (The Partition Equation): Let A_1, A_2, . . . , A_n be a partition of S. For any event B we have

   P(B) = Σ_{i=1}^n P(B|A_i) P(A_i)

This uses the relation P(B|A_i) = P(B ∩ A_i) / P(A_i), which implies that P(B ∩ A_i) = P(B|A_i) P(A_i), together with

   P(B) = P(B ∩ (∪_{i=1}^n A_i)) = P(∪_{i=1}^n (B ∩ A_i)) = Σ_{i=1}^n P(B ∩ A_i)

18 / 41

Bayes Theorem

Theorem (Bayes): Let A_1, A_2, . . . , A_n be a partition of S. Then

   P(A_i|B) = P(B|A_i) P(A_i) / Σ_{j=1}^n P(B|A_j) P(A_j)

Our AIDS example is a special case of Bayes Theorem
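A sketch of Bayes Theorem as a function, applied to the screening-test example with the two-set partition {A, A^c}; the false-positive rate among non-carriers (0.047) is my assumption, chosen so that P(TP) comes out near the 0.05 used on the slides:

```python
def bayes_posterior(prior, likelihood, i):
    """Posterior P(A_i | B) for a partition A_1..A_n, given prior P(A_j)
    and likelihood P(B | A_j); the denominator is the partition equation."""
    p_b = sum(p * l for p, l in zip(prior, likelihood))   # P(B) = sum_j P(B|A_j) P(A_j)
    return likelihood[i] * prior[i] / p_b

prior = [1.1 / 314, 1 - 1.1 / 314]   # P(A), P(A^c)
likelihood = [0.99, 0.047]           # P(TP|A), and an assumed P(TP|A^c)

print(round(bayes_posterior(prior, likelihood, 0), 3))  # roughly 0.069
```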

19 / 41

Independence
Conditional probability and Independence are both new ideas and powerful ones
Intuitively, two events A and B are independent if either event occurring is unrelated to the other
Examples:
a) A = {Denver wins the Super Bowl}; B = {Joe's Cafe in the NWC Building runs out of coffee}
b) A = {It rains}; B = {You forget your umbrella}
c) A = {You are late}; B = {There is a stoppage in the subway trains}
d) A = {first trial of a scientific experiment}; B = {second identical trial of the same experiment}

20 / 41

Math Interpretation of Independence


How do we model this intuition?
We use conditional probability: If knowledge that B has occurred does not change our probability of A, then A and B should be independent
This becomes

   P(A|B) = P(A), i.e., P(A ∩ B) / P(B) = P(A), assuming P(B) > 0    (3)

This is algebraically equivalent to

   P(A ∩ B) = P(A) P(B)    (4)

Definition (3) is highly intuitive while Definition (4) is not
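A brute-force check of Definition (4) in Python, for the assumed events A = {first die is even} and B = {sum is 7} on two fair dice:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))

A = {o for o in outcomes if o[0] % 2 == 0}   # first die is even
B = {o for o in outcomes if sum(o) == 7}     # the sum is 7

def prob(event):
    return Fraction(len(event), len(outcomes))

# Definition (4): A and B are independent iff P(A ∩ B) = P(A) P(B)
print(prob(A & B) == prob(A) * prob(B))      # True -- these two events are independent
```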

21 / 41

In spite of its lack of intuition, we primarily use Definition (4), because it extends easily to 3 or more events
Definition: Events A_1, A_2, . . . , A_n are mutually independent if for every k ≤ n and every choice of distinct indices i_1, . . . , i_k we have

   P(A_{i_1} ∩ · · · ∩ A_{i_k}) = P(A_{i_1}) P(A_{i_2}) · · · P(A_{i_k})    (5)

Caution: There exist examples of 3 events A, B, C which are two-by-two independent, but all three are not independent (a sketch of the classic example follows below)
So we really do need (5) to hold for all k ≤ n
Please read Section 2.3 on Conditionally Independent Events
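Here is the sketch of the classic example referenced in the caution above (two fair coin tosses; A = first toss is heads, B = second toss is heads, C = the two tosses agree); the three events are pairwise independent but not mutually independent:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product("HT", repeat=2))      # two fair coin tosses

A = {o for o in outcomes if o[0] == "H"}      # first toss is heads
B = {o for o in outcomes if o[1] == "H"}      # second toss is heads
C = {o for o in outcomes if o[0] == o[1]}     # the two tosses agree

def prob(event):
    return Fraction(len(event), len(outcomes))

pairs_ok = all(prob(X & Y) == prob(X) * prob(Y) for X, Y in [(A, B), (A, C), (B, C)])
triple_ok = prob(A & B & C) == prob(A) * prob(B) * prob(C)

print(pairs_ok, triple_ok)  # True False -- pairwise but not mutually independent
```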

22 / 41

Random Variables
We are familiar with the concept of a variable; from algebra, and from calculus
For example, if we consider the equation x^2 - 9 = 0, then x is the variable (the unknown)
When we solve the equation, we see that x = 3 or x = -3
Before we solve the equation however, x can be anything; it can vary over all of R
A random variable X is a variable that varies randomly
A random variable can a priori take on any value, and its actual value does not come from solving an equation, but from performing an experiment whose outcome is not known in advance (e.g., tossing a coin)

23 / 41

Definition: A random variable X is a function mapping the sample space S into R
We insist that random variables have numerical values (e.g., they cannot be valued as heads, or tails)
We use capital letters for random variables to distinguish them from algebraic and calculus variables
Example: We toss a coin 10 times. Each time the coin comes up heads we record the value 1. When it comes up tails we record the value 0.
Let X_i denote the outcome of the i-th toss
Let Y = X_1 + X_2 + · · · + X_10. Then each X_i is a random variable, and so also is Y. Y is the (random) number of heads that occur in 10 tosses of the coin
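A tiny simulation sketch (my own, assuming a fair coin) that produces one realization of the X_i and of Y:

```python
import random

random.seed(0)  # fix the seed so the run is reproducible
p = 0.5         # assumed P(Heads)

# X_i = 1 if the i-th toss is heads, 0 if tails; Y = X_1 + ... + X_10
X = [1 if random.random() < p else 0 for _ in range(10)]
Y = sum(X)

print(X, Y)  # one realization: a list of ten 0/1 values and their sum
```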

24 / 41

The random variables X_i just described take on only 2 possible values, and Y takes on 11 possible values
Random variables that take on a finite number or countably infinite number of values are called discrete random variables
Example: Toss a coin until the first time heads appears
The sample space is of necessity infinite
This is an example of a countably infinite discrete random variable

25 / 41

In addition to discrete random variables, we can also have continuous random variables
We can model time discretely (day 1, day 2, etc.; or minute 1, minute 2, etc.) or we can model it continuously, as the interval [0, ∞), or some subinterval [a, b]
In the example of tossing a coin until the first time heads appears, we can let X be the amount of time elapsed until heads appears; X takes values on the time interval [0, ∞)
Definition: A random variable is continuous if it takes its values in an interval of R, AND no single value has positive probability
That means for any given real number α, P(X = α) = 0

26 / 41

We do not know the outcome of a random variable in advance, and its outcome is random, but that does not mean we know nothing about it!
For example, we know that most coins have P(Heads) ≈ 1/2; most dice have P(die = 6) ≈ 1/6 (But remember the movie Ocean's Eleven)
Although we cannot know the outcome in advance, we can in fact predict what it will be with some confidence
This is because we have some idea in advance of the probabilities of the outcomes
From a Statistical standpoint, this is what we find interesting
We call it the distribution of a random variable

27 / 41

Probability Distributions of Random Variables

Probability distributions are very different, depending on whether the random variable in question is discrete or continuous
The probability distribution of a discrete random variable X is the collection p(x) = P(X = x) for all the possible values of x
For example, for our random variables X_i that are 1 or 0 depending on the outcome of heads or tails on the i-th toss of a coin, we have P(X_i = 1) = p and P(X_i = 0) = 1 - p, where P(Heads) = p

28 / 41

For our random variable Y the possible values are {0, 1, 2, 3, . . . , 10} and the calculations are a little complicated
P(Y = 0) = P(all 10 tosses give tails) = (1 - p)^{10} = q^{10}, where q = 1 - p = P(Tails)
P(Y = 1) = 10 p q^9, where the 10 comes from the fact that there are 10 choices for the only heads to appear
P(Y = 2) = (10 choose 2) p^2 q^8
P(Y = k) = (10 choose k) p^k q^{10-k}
We have this for every k, 0 ≤ k ≤ 10, so this gives us the entire probability distribution of Y
Note, as a curiosity, that we now know Σ_{k=0}^{10} (10 choose k) p^k (1 - p)^{10-k} = 1 for any p, 0 ≤ p ≤ 1
Often we abbreviate probability distribution with just the word distribution
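A short Python sketch of this distribution for an arbitrary illustrative p (here p = 1/3), using exact fractions so the sum-to-one check is exact:

```python
import math
from fractions import Fraction

n, p = 10, Fraction(1, 3)   # p is an arbitrary illustrative choice
q = 1 - p

pmf = {k: math.comb(n, k) * p**k * q**(n - k) for k in range(n + 1)}

print(pmf[0] == q**10)      # True: P(Y = 0) = q^10
print(sum(pmf.values()))    # 1 -- the probabilities sum to one for any p
```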

29 / 41

Parameters of a Probability Distribution


In our previous example, we were able to give the complete distributions of X_i and Y in terms of p = P(Heads)
The number p, which can be anything between 0 and 1, is called a parameter
Once we know p we know the entire distribution of the random variables X_i and also of Y
So we have a distribution for each value of p: We call this a parameterized family of distributions
We also have a Cumulative Distribution Function (CDF) of a discrete random variable, written

   F(x) = P(X ≤ x) = Σ_{y ≤ x} P(X = y) = Σ_{y ≤ x} p(y)    (p(y) = P(X = y))

30 / 41

A Graph of a CDF of a Discrete RV

31 / 41

An Example
Let X = the number of tosses needed until Heads appears for the first time
P(Heads) = p
The distribution of X is given by

   P(X = k) = p(k) = (1 - p)^{k-1} p for k = 1, 2, 3, . . . , and = 0 otherwise

For this example, and any integer k ≥ 1, we have

   F(k) = Σ_{y ≤ k} p(y) = Σ_{y=1}^{k} (1 - p)^{y-1} p = p Σ_{y=0}^{k-1} (1 - p)^y = p Σ_{y=0}^{k-1} q^y

where q = 1 - p

32 / 41

The last term on the previous slide is p Σ_{y=0}^{k-1} q^y
In your calculus class you learned a formula for a finite sum of geometric form:

   Σ_{i=0}^{n} α^i = (1 - α^{n+1}) / (1 - α)   for 0 < α < 1

We use this and the result from the previous slide that F(k) = p Σ_{y=0}^{k-1} q^y to get

   F(k) = p (1 - q^k) / (1 - q) = (1 - q)(1 - q^k) / (1 - q) = 1 - q^k   for k a positive integer

This is called the geometric distribution
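A quick sketch verifying F(k) = 1 - q^k for a few k, with the arbitrary illustrative choice p = 1/4:

```python
from fractions import Fraction

p = Fraction(1, 4)    # arbitrary illustrative P(Heads)
q = 1 - p

def pmf(k):
    # P(X = k) = (1 - p)^(k-1) p for k = 1, 2, 3, ...
    return q**(k - 1) * p

for k in range(1, 6):
    cdf_by_sum = sum(pmf(y) for y in range(1, k + 1))
    print(k, cdf_by_sum == 1 - q**k)   # True for every k: F(k) = 1 - q^k
```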

33 / 41

For a real variable x we define the notation [x] to be the largest integer that is less than or equal to x
[x] is often referred to as the integer part of x
With this notation, we can define F(x) for all x ∈ R, essentially embedding the cdf into R, by writing

   F(x) = 1 - (1 - p)^{[x]} for x ≥ 1, and 0 otherwise

34 / 41

A Graph of a CDF of a Discrete RV Embedded in R

35 / 41

The Binomial Distribution


We have a finite sequence of n independent trials of an experiment, each with only two outcomes
Let Y be the number of successes
Next assume that for every trial P(Success) = p, 0 < p < 1
Then Y is said to have a Binomial Distribution
There are two parameters, p and n
We have already calculated (in Lecture 4) the distribution of Y when it is B(10, p)
The general formula is

   P(Y = k) = (n choose k) p^k q^{n-k}   for 0 ≤ k ≤ n, where q = 1 - p

36 / 41

The Poisson Distribution


Siméon Denis Poisson was a French mathematician, geometer, and physicist.
Poisson is also the French word for fish
A distribution named after Poisson is the Poisson Distribution
A random variable X has a Poisson Distribution with parameter λ > 0 if

   P(X = k) = λ^k e^{-λ} / k!   for k = 0, 1, 2, . . .

We know that we must have Σ_{k=0}^{∞} P(X = k) = 1
We also know from calculus (sorry, second semester) that

   e^x = Σ_{k=0}^{∞} x^k / k!

37 / 41

By the preceding, we see why we need the term e^{-λ}
The Poisson distribution is useful in many applications, but the primary one is the approximation of the Binomial
Let Y_{n,p} be a Binomial with parameters (n, p), and let X be Poisson with parameter λ
Theorem: Suppose that n → ∞ and p → 0 in such a way that lim_{n→∞; p→0} np = λ > 0. Then P(Y_{n,p} = k) → P(X = k), where X is a Poisson with parameter λ.
The import of this theorem is that for a Binomial(n, p) when n is large and p is small, we can multiply them together to get np = λ and then approximate the probabilities of the Binomial random variable with those of a Poisson
This is useful because for large n it is hard to compute the probability distribution of a Binomial, but easy to do so for a Poisson
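A sketch of the approximation in Python, with the assumed illustrative values n = 1000 and λ = np = 3:

```python
import math

lam = 3.0                    # lambda = n p, held fixed
n, p = 1000, lam / 1000      # large n, small p

def binom_pmf(k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k):
    return math.exp(-lam) * lam**k / math.factorial(k)

for k in range(6):
    print(k, round(binom_pmf(k), 5), round(poisson_pmf(k), 5))
# the two columns are already close, and get closer as n grows with np fixed
```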

38 / 41

Recall a continuous function in calculus is one for which lim_{y→x} f(y) = f(x) for all x in the domain of f
This is not what we mean by a continuous random variable
Indeed, we do not have a concept of points of the domain, which is the sample space S, converging
Instead what we mean by a continuous random variable is that the cdf F itself is continuous
Recall F(x) = P(X ≤ x). We know x ↦ F(x) is a non-decreasing function, and lim_{x→-∞} F(x) = 0 and lim_{x→+∞} F(x) = 1

39 / 41

Densities
Suppose that F is not only continuous, but differentiable, and its derivative is continuous, too
Recall that this is not always the case
If F is differentiable with F′(x) = f(x), then by the Fundamental Theorem of Calculus we have

   F(b) = ∫_{-∞}^{b} f(x) dx;   F(b) - F(a) = ∫_{a}^{b} f(x) dx

If F is the cdf of the random variable X, then we have

   P(a < X ≤ b) = ∫_{a}^{b} f(x) dx

40 / 41

Densities II
The function f = F′ is called the density corresponding to the random variable X
F(x) = ∫_{-∞}^{x} f(u) du is the cdf of X
We can get the density by differentiating the cdf, and we get the cdf by anti-differentiating the density
Important observation: F(x) = P(X ≤ x), so 1 - F(x) = P(X > x)
Example: The Uniform Distribution on [A, B]

   f_{A,B}(x) = 1/(B - A) if A ≤ x ≤ B, and 0 otherwise

Example of a less standard density:

   f(x) = (1/2) sin(x) for 0 ≤ x ≤ π, and 0 otherwise
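A numerical sketch of these two densities (the uniform endpoints A = 0, B = 2 are my own illustrative choice); it checks that each integrates to 1 and that P(a < X ≤ b) is the integral of the density from a to b:

```python
import math

def uniform_pdf(x, A=0.0, B=2.0):           # Uniform[A, B] with assumed A = 0, B = 2
    return 1 / (B - A) if A <= x <= B else 0.0

def sine_pdf(x):                            # the density (1/2) sin(x) on [0, pi]
    return 0.5 * math.sin(x) if 0 <= x <= math.pi else 0.0

def integral(f, a, b, n=100_000):           # simple midpoint-rule integration
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# Each density integrates to 1 over its support:
print(round(integral(uniform_pdf, 0, 2), 6), round(integral(sine_pdf, 0, math.pi), 6))

# P(0.5 < X <= 1.5) for the Uniform[0, 2] is F(1.5) - F(0.5) = 0.5:
print(round(integral(uniform_pdf, 0.5, 1.5), 6))
```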

41 / 41
