
Statistics W4150: Introduction to Probability and Statistics
Professor Philip Protter
1029 SSW; pep2117@columbia.edu; 212-851-2145
Lectures, Week 1

September 8 & 10, 2015

1 / 41

Details

Office Hours: 2pm to 4pm Fridays in 1029 SSW (121st and Amsterdam)
Textbook: Probability and Statistics, Fourth Edition, by M.H. DeGroot & Mark Schervish, ISBN 978-0-321-50046-5 (On reserve in the math library)
We will communicate primarily through Courseworks
Weekly homework: Homework is due in class on Tuesdays
One midterm, and a final
Grading: Homeworks 15%, Midterm 35%, Final 50%

2 / 41

Probability versus Statistics


Tossing a coin: heads or tails
What is the probability? What does this mean?

   lim_{n→∞} (# of heads in n tosses) / n    (1)

Does this limit exist? (Recall from calculus that limits need not exist)
Example: a_n = (-1)^n; then a_n = 1 if n is even and a_n = -1 if n is odd
lim sup a_n = 1 and lim inf a_n = -1
We can never really know the limit in (1); we make an educated guess; this is statistics

3 / 41

The Axioms of Probability


1. For every set A, P(A) ≥ 0
2. P(S) = 1, where S is the entire space of possible outcomes
3. If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)
4. Two, or even a finite number of events, is not enough, even for simple experiments (e.g., toss a coin until the first time heads appears); so we have a more subtle axiom, replacing (3): For every sequence of events A_1, A_2, . . . with A_i ∩ A_j = ∅ when i ≠ j, we have

   P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i)

4 / 41

Consequences of the Axioms of Probability


P(A^c) = 1 - P(A)
If A ⊂ B then P(A) ≤ P(B)
Any event A has 0 ≤ P(A) ≤ 1
For any events A and B (not necessarily disjoint)

   P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

Subadditivity: For any sequence of events A_1, A_2, . . . we have

   P(∪_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ P(A_i)

5 / 41

Counting
In elementary probability, much attention is paid to counting, which is more complicated than one might a priori think


Rolling two dice: Possible outcome sets:
{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
and
(1, 1), (1, 2), . . . , (1, 6)
(2, 1), (2, 2), . . . , (2, 6)
. . .
(6, 1), (6, 2), . . . , (6, 6)
We can assign probabilities 1/11 to each outcome in the first example, but it does not correspond to our experience
Better to assign probabilities (1/6)^2 = 1/36 to each outcome in the second example

6 / 41

We can then count the number of outcomes corresponding, for example, to the event {8}, to get (2, 6), (3, 5), (4, 4), (5, 3), (6, 2); hence P({8}) = 5/36 (checked in the sketch below)
We note that the outcome (2, 6) is symmetric to (6, 2). So we need only count (2, 6) and (3, 5), to get 2 events. Then we double that and add 1 more for (4, 4) to arrive at 5 events corresponding to the outcome {8}
In more complicated situations, the counting process can be difficult
In this example, the two dice represent an experiment with two parts. Each die represents one part.
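A quick Python sketch (my own check, not from the slides) that brute-forces this count over the 36 ordered outcomes:

```python
from itertools import product
from fractions import Fraction

# All 36 equally likely ordered outcomes of rolling two dice.
outcomes = list(product(range(1, 7), repeat=2))

# Outcomes whose sum is 8: (2,6), (3,5), (4,4), (5,3), (6,2).
favorable = [o for o in outcomes if sum(o) == 8]

print(len(favorable), len(outcomes))            # 5 36
print(Fraction(len(favorable), len(outcomes)))  # 5/36
```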

7 / 41

Multiplication Rule: If we have an experiment with multiple parts, we multiply the number of items in each part to get the total number of outcomes.
Example: The experiment is to toss a coin and roll a die. The number of total outcomes is then 2 × 6 = 12. The probability of an event is 1/2 × 1/6 = 1/12, or simply take

   1 / (Total number of outcomes)

to get the probability of an individual event.

8 / 41

Permutations

The canonical model is that we have an urn filled with n numbered balls. We repeatedly draw samples of size k (typically k < n) from the urn. We do not replace the k balls after they are drawn.
We can record the order of the samples as we take them out, k at a time. Or we can disregard the order. To relate the counting procedures, we have first to count how many different ways we can sample k balls from n.
This counts how many permutations we can have, and is denoted P_{n,k}

9 / 41

Permutations
Theorem: The number of permutations of n elements taken k at a time is

   P_{n,k} = n(n - 1)(n - 2) · · · (n - k + 1)

Since n! = n(n - 1)(n - 2) · · · (3)(2)(1), we see that

   P_{n,k} = n! / (n - k)!    (2)

Warning: n! gets big very quickly, making calculations using (2) sometimes difficult
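A small Python check of formula (2), assuming an illustrative choice n = 10, k = 3 (`math.perm` requires Python 3.8+):

```python
import math

n, k = 10, 3  # hypothetical values for illustration

# P_{n,k} = n (n-1) ... (n-k+1) = n! / (n-k)!
via_factorials = math.factorial(n) // math.factorial(n - k)

print(via_factorials, math.perm(n, k))  # 720 720 -- the two formulas agree
```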

10 / 41

Stirling's Formula
Theorem [Stirling's Formula]:

   lim_{n→∞} [(2π)^{1/2} n^{n + 1/2} e^{-n}] / n! = 1

Example:

   70!/50! ≈ [(2π)^{1/2} 70^{70.5} e^{-70}] / [(2π)^{1/2} 50^{50.5} e^{-50}] = 3.940 × 10^{35}

See the textbook, Example 1.7.12 on page 31.
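A sketch of this comparison in Python (my own check, not the textbook's code); it computes the ratio 70!/50! exactly and via Stirling's approximation:

```python
import math

def stirling(n):
    # Stirling's approximation to n!: (2*pi)^(1/2) * n^(n + 1/2) * e^(-n)
    return math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)

exact = math.factorial(70) / math.factorial(50)
approx = stirling(70) / stirling(50)

print(f"{exact:.3e}  {approx:.3e}")  # both are about 3.94e+35
```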

11 / 41

Combinations
Now the idea is that we sample k balls from the n in the urn (k < n), again without replacement, but this time without regard to order. The number of ways we can do this we call C_{n,k}.
Theorem: The number of distinct subsets of size k that can be chosen from a set of size n is

   C_{n,k} = P_{n,k} / k! = n! / (k!(n - k)!)

There is a more common notation for C_{n,k}, and that is the binomial coefficient (n choose k). That is, we have

   (n choose k) = n! / (k!(n - k)!)
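A minimal Python check, again with the illustrative values n = 10, k = 3 (`math.comb` is Python 3.8+):

```python
import math

n, k = 10, 3  # hypothetical values for illustration

via_perm = math.perm(n, k) // math.factorial(k)          # P_{n,k} / k!
via_factorials = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))  # n!/(k!(n-k)!)

print(via_perm, via_factorials, math.comb(n, k))  # 120 120 120
```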

12 / 41

There are many examples in the textbook on how to use these ideas (Chapter 1)
We now begin Chapter 2: Conditional Probability
We have discussed in simple situations how to calculate P(A) for an event A
Suppose we are told something has happened, call it B; this could change the probability of A
For example, horse racing, and it rains at the race track
Or for example, the Super Bowl: A = {Denver wins}, and B = {Peyton Manning has stomach flu}
P(A|B) denotes the probability of A given the knowledge that B has occurred
This is different from P(A)
If P(B) > 0, then

   P(A|B) = P(A ∩ B) / P(B)

This is a simple but powerful concept
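A small sketch (my own illustration, not from the slides) that computes a conditional probability by enumeration, for two dice with the assumed events A = {sum is 8} and B = {first die is even}:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))   # two fair dice, 36 outcomes

A = {o for o in outcomes if sum(o) == 8}          # the sum is 8
B = {o for o in outcomes if o[0] % 2 == 0}        # the first die is even

def prob(event):
    return Fraction(len(event), len(outcomes))

# P(A|B) = P(A ∩ B) / P(B)
print(prob(A))                 # 5/36
print(prob(A & B) / prob(B))   # 1/6 -- knowing B changed the probability of A
```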

13 / 41

Example

Your roommate goes to donate blood. The nurse at the bloodmobile gives his blood a screening test, which is 99% accurate, and rejects his blood due to AIDS contamination
Your roommate considers suicide; what do you do?

14 / 41

The nurse said the test was 99% accurate. She meant P(TP|A) = 0.99
TP = test positive; A = AIDS
You asked about false positives; she said it was wrong about 5% of the time; does this mean his probability of AIDS is 0.99 - 0.05 = 0.94?
No!

15 / 41

You explain to your roommate he wants P(A|TP) and this is very different

   P(A|TP) = P(A ∩ TP) / P(TP);    P(TP|A) = P(A ∩ TP) / P(A) = 0.99

According to the CDC, 1.1 million have AIDS in the general population; the US population is 314 million
So the incidence of AIDS in the population is

   1.1/314 = 0.0035 = P(A)

Therefore

   P(A ∩ TP) = [P(A ∩ TP) / P(A)] P(A) = 0.99 × 0.0035 = 0.00347

Caveat: We are being sleazy here; not everyone is equally likely to have AIDS.

16 / 41

5% false positives means that P(TP) ≈ 0.05
So we conclude P(A|TP) = P(A ∩ TP) / P(TP) = 0.00347 / 0.05 ≈ 0.069, or around 7%
Suicide is not indicated
The number of mistakes involving not understanding conditional probabilities is amazing
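The whole calculation fits in a few lines of Python (a sketch using only the numbers quoted on the slides):

```python
# Numbers from the slides: P(TP|A) = 0.99, P(TP) ~ 0.05, P(A) = 1.1/314.
p_tp_given_a = 0.99
p_a = 1.1 / 314
p_tp = 0.05

p_a_and_tp = p_tp_given_a * p_a    # P(A ∩ TP) = P(TP|A) P(A)
p_a_given_tp = p_a_and_tp / p_tp   # P(A|TP) = P(A ∩ TP) / P(TP)

print(round(p_a_and_tp, 5), round(p_a_given_tp, 3))  # 0.00347 0.069
```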

17 / 41

Going backwards with conditional probabilities


Let A_1, A_2, . . . , A_n be a partition of S
This means they are all disjoint (A_i ∩ A_j = ∅ if i ≠ j) and their union is all of S
Theorem (The Partition Equation): Let A_1, A_2, . . . , A_n be a partition of S. For any event B we have

   P(B) = Σ_{i=1}^n P(B|A_i) P(A_i)

This uses the relation P(B|A_i) = P(B ∩ A_i) / P(A_i), which implies that P(B ∩ A_i) = P(B|A_i) P(A_i), together with

   P(B) = P(B ∩ (∪_{i=1}^n A_i)) = P(∪_{i=1}^n (B ∩ A_i)) = Σ_{i=1}^n P(B ∩ A_i)

18 / 41

Bayes Theorem

Theorem (Bayes): Let A_1, A_2, . . . , A_n be a partition of S. Then

   P(A_i|B) = P(B|A_i) P(A_i) / Σ_{j=1}^n P(B|A_j) P(A_j)

Our AIDS example is a special case of Bayes Theorem
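A sketch of Bayes Theorem as a function, applied to the screening-test example with the two-set partition {A, A^c}; the false-positive rate among non-carriers (0.047) is my assumption, chosen so that P(TP) comes out near the 0.05 used on the slides:

```python
def bayes_posterior(prior, likelihood, i):
    """Posterior P(A_i | B) for a partition A_1..A_n, given prior P(A_j)
    and likelihood P(B | A_j); the denominator is the partition equation."""
    p_b = sum(p * l for p, l in zip(prior, likelihood))   # P(B) = sum_j P(B|A_j) P(A_j)
    return likelihood[i] * prior[i] / p_b

prior = [1.1 / 314, 1 - 1.1 / 314]   # P(A), P(A^c)
likelihood = [0.99, 0.047]           # P(TP|A), and an assumed P(TP|A^c)

print(round(bayes_posterior(prior, likelihood, 0), 3))  # roughly 0.069
```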

19 / 41

Independence
Conditional probability and Independence are both new ideas and powerful ones
Intuitively, two events A and B are independent if either event occurring is unrelated to the other
Examples:
a) A = {Denver wins the Super Bowl}; B = {Joe's Cafe in the NWC Building runs out of coffee}
b) A = {It rains}; B = {You forget your umbrella}
c) A = {You are late}; B = {There is a stoppage in the subway trains}
d) A = {first trial of a scientific experiment}; B = {second identical trial of the same experiment}

20 / 41

Math Interpretation of Independence


How do we model this intuition?
We use conditional probability: If knowledge that B has occurred does not change our probability of A, then A and B should be independent
This becomes

   P(A|B) = P(A), i.e., P(A ∩ B) / P(B) = P(A), assuming P(B) > 0    (3)

This is algebraically equivalent to

   P(A ∩ B) = P(A) P(B)    (4)

Definition (3) is highly intuitive while Definition (4) is not
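A brute-force check of Definition (4) in Python, for the assumed events A = {first die is even} and B = {sum is 7} on two fair dice:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))

A = {o for o in outcomes if o[0] % 2 == 0}   # first die is even
B = {o for o in outcomes if sum(o) == 7}     # the sum is 7

def prob(event):
    return Fraction(len(event), len(outcomes))

# Definition (4): A and B are independent iff P(A ∩ B) = P(A) P(B)
print(prob(A & B) == prob(A) * prob(B))      # True -- these two events are independent
```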

21 / 41

In spite of its lack of intuition, we primarily use Definition (4), because it extends easily to 3 or more events
Definition: Events A_1, A_2, . . . , A_n are mutually independent if for every k ≤ n and every choice of distinct indices i_1, . . . , i_k we have

   P(A_{i_1} ∩ · · · ∩ A_{i_k}) = P(A_{i_1}) P(A_{i_2}) · · · P(A_{i_k})    (5)

Caution: There exist examples of 3 events A, B, C which are two-by-two independent, but all three are not independent (a sketch of the classic example follows below)
So we really do need (5) to hold for all k ≤ n
Please read Section 2.3 on Conditionally Independent Events
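Here is the sketch of the classic example referenced in the caution above (two fair coin tosses; A = first toss is heads, B = second toss is heads, C = the two tosses agree); the three events are pairwise independent but not mutually independent:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product("HT", repeat=2))      # two fair coin tosses

A = {o for o in outcomes if o[0] == "H"}      # first toss is heads
B = {o for o in outcomes if o[1] == "H"}      # second toss is heads
C = {o for o in outcomes if o[0] == o[1]}     # the two tosses agree

def prob(event):
    return Fraction(len(event), len(outcomes))

pairs_ok = all(prob(X & Y) == prob(X) * prob(Y) for X, Y in [(A, B), (A, C), (B, C)])
triple_ok = prob(A & B & C) == prob(A) * prob(B) * prob(C)

print(pairs_ok, triple_ok)  # True False -- pairwise but not mutually independent
```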

22 / 41

Random Variables
We are familiar with the concept of a variable; from algebra, and from calculus
For example, if we consider the equation x^2 - 9 = 0, then x is the variable (the unknown)
When we solve the equation, we see that x = 3 or x = -3
Before we solve the equation however, x can be anything; it can vary over all of R
A random variable X is a variable that varies randomly
A random variable can a priori take on any value, and its actual value does not come from solving an equation, but from performing an experiment whose outcome is not known in advance (e.g., tossing a coin)

23 / 41

Definition: A random variable X is a function mapping the sample space S into R
We insist that random variables have numerical values (e.g., they cannot be valued as heads, or tails)
We use capital letters for random variables to distinguish them from algebraic and calculus variables
Example: We toss a coin 10 times. Each time the coin comes up heads we record the value 1. When it comes up tails we record the value 0.
Let X_i denote the outcome of the i-th toss
Let Y = X_1 + X_2 + · · · + X_10. Then each X_i is a random variable, and so also is Y. Y is the (random) number of heads that occur in 10 tosses of the coin
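A tiny simulation sketch (my own, assuming a fair coin) that produces one realization of the X_i and of Y:

```python
import random

random.seed(0)  # fix the seed so the run is reproducible
p = 0.5         # assumed P(Heads)

# X_i = 1 if the i-th toss is heads, 0 if tails; Y = X_1 + ... + X_10
X = [1 if random.random() < p else 0 for _ in range(10)]
Y = sum(X)

print(X, Y)  # one realization: a list of ten 0/1 values and their sum
```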

24 / 41

The random variables X_i just described take on only 2 possible values, and Y takes on 11 possible values
Random variables that take on a finite number or countably infinite number of values are called discrete random variables
Example: Toss a coin until the first time heads appears
The sample space is of necessity infinite
This is an example of a countably infinite discrete random variable

25 / 41

In addition to discrete random variables, we can also have continuous random variables
We can model time discretely (day 1, day 2, etc.; or minute 1, minute 2, etc.) or we can model it continuously, as the interval [0, ∞), or some subinterval [a, b]
In the example of tossing a coin until the first time heads appears, we can let X be the amount of time elapsed until heads appears; X takes values on the time interval [0, ∞)
Definition: A random variable is continuous if it takes its values in an interval of R, AND no single value has positive probability
That means for any given real number α, P(X = α) = 0

26 / 41

We do not know the outcome of a random variable in advance, and its outcome is random, but that does not mean we know nothing about it!
For example, we know that most coins have P(Heads) ≈ 1/2; most dice have P(die = 6) ≈ 1/6 (But remember the movie Ocean's Eleven)
Although we cannot know the outcome in advance, we can in fact predict what it will be with some confidence
This is because we have some idea in advance of the probabilities of the outcomes
From a Statistical standpoint, this is what we find interesting
We call it the distribution of a random variable

27 / 41

Probability Distributions of Random Variables

Probability distributions are very different, depending on whether the random variable in question is discrete or continuous
The probability distribution of a discrete random variable X is the collection p(x) = P(X = x) for all the possible values of x
For example, for our random variables X_i that are 1 or 0 depending on the outcome of heads or tails on the i-th toss of a coin, we have P(X_i = 1) = p and P(X_i = 0) = 1 - p, where P(Heads) = p

28 / 41

For our random variable Y the possible values are {0, 1, 2, 3, . . . , 10} and the calculations are a little complicated
P(Y = 0) = P(all 10 tosses give tails) = (1 - p)^{10} = q^{10}, where q = 1 - p = P(Tails)
P(Y = 1) = 10 p q^9, where the 10 comes from the fact that there are 10 choices for the only heads to appear
P(Y = 2) = (10 choose 2) p^2 q^8
P(Y = k) = (10 choose k) p^k q^{10-k}
We have this for every k, 0 ≤ k ≤ 10, so this gives us the entire probability distribution of Y
Note, as a curiosity, that we now know Σ_{k=0}^{10} (10 choose k) p^k (1 - p)^{10-k} = 1 for any p, 0 ≤ p ≤ 1
Often we abbreviate probability distribution with just the word distribution
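A short Python sketch of this distribution for an arbitrary illustrative p (here p = 1/3), using exact fractions so the sum-to-one check is exact:

```python
import math
from fractions import Fraction

n, p = 10, Fraction(1, 3)   # p is an arbitrary illustrative choice
q = 1 - p

pmf = {k: math.comb(n, k) * p**k * q**(n - k) for k in range(n + 1)}

print(pmf[0] == q**10)      # True: P(Y = 0) = q^10
print(sum(pmf.values()))    # 1 -- the probabilities sum to one for any p
```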

29 / 41

Parameters of a Probability Distribution


In our previous example, we were able to give the complete distributions of X_i and Y in terms of p = P(Heads)
The number p, which can be anything between 0 and 1, is called a parameter
Once we know p we know the entire distribution of the random variables X_i and also of Y
So we have a distribution for each value of p: We call this a parameterized family of distributions
We also have a Cumulative Distribution Function (CDF) of a discrete random variable, written

   F(x) = P(X ≤ x) = Σ_{y ≤ x} P(X = y) = Σ_{y ≤ x} p(y)    (p(y) = P(X = y))

30 / 41

A Graph of a CDF of a Discrete RV

31 / 41

An Example
Let X = the number of tosses needed until Heads appears for the first time
P(Heads) = p
The distribution of X is given by

   P(X = k) = p(k) = (1 - p)^{k-1} p for k = 1, 2, 3, . . . , and = 0 otherwise

For this example, and any integer k ≥ 1, we have

   F(k) = Σ_{y ≤ k} p(y) = Σ_{y=1}^{k} (1 - p)^{y-1} p = p Σ_{y=0}^{k-1} (1 - p)^y = p Σ_{y=0}^{k-1} q^y

where q = 1 - p

32 / 41

The last term on the previous slide is p Σ_{y=0}^{k-1} q^y
In your calculus class you learned a formula for a finite sum of geometric form:

   Σ_{i=0}^{n} α^i = (1 - α^{n+1}) / (1 - α)   for 0 < α < 1

We use this and the result from the previous slide that F(k) = p Σ_{y=0}^{k-1} q^y to get

   F(k) = p (1 - q^k) / (1 - q) = (1 - q)(1 - q^k) / (1 - q) = 1 - q^k   for k a positive integer

This is called the geometric distribution
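A quick sketch verifying F(k) = 1 - q^k for a few k, with the arbitrary illustrative choice p = 1/4:

```python
from fractions import Fraction

p = Fraction(1, 4)    # arbitrary illustrative P(Heads)
q = 1 - p

def pmf(k):
    # P(X = k) = (1 - p)^(k-1) p for k = 1, 2, 3, ...
    return q**(k - 1) * p

for k in range(1, 6):
    cdf_by_sum = sum(pmf(y) for y in range(1, k + 1))
    print(k, cdf_by_sum == 1 - q**k)   # True for every k: F(k) = 1 - q^k
```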

33 / 41

For a real variable x we define the notation [x] to be the largest integer that is less than or equal to x
[x] is often referred to as the integer part of x
With this notation, we can define F(x) for all x ∈ R, essentially embedding the cdf into R, by writing

   F(x) = 1 - (1 - p)^{[x]} for x ≥ 1, and 0 otherwise

34 / 41

A Graph of a CDF of a Discrete RV Embedded in R

35 / 41

The Binomial Distribution


We have a finite sequence of n independent trials of an experiment, each with only two outcomes
Let Y be the number of successes
Next assume that for every trial P(Success) = p, 0 < p < 1
Then Y is said to have a Binomial Distribution
There are two parameters, p and n
We have already calculated (in Lecture 4) the distribution of Y when it is B(10, p)
The general formula is

   P(Y = k) = (n choose k) p^k q^{n-k}   for 0 ≤ k ≤ n, where q = 1 - p

36 / 41

The Poisson Distribution


Siméon Denis Poisson was a French mathematician, geometer, and physicist.
Poisson is also the French word for fish
A distribution named after Poisson is the Poisson Distribution
A random variable X has a Poisson Distribution with parameter λ > 0 if

   P(X = k) = λ^k e^{-λ} / k!   for k = 0, 1, 2, . . .

We know that we must have Σ_{k=0}^{∞} P(X = k) = 1
We also know from calculus (sorry, second semester) that

   e^x = Σ_{k=0}^{∞} x^k / k!

37 / 41

By the preceding, we see why we need the term e^{-λ}
The Poisson distribution is useful in many applications, but the primary one is the approximation of the Binomial
Let Y_{n,p} be a Binomial with parameters (n, p), and let X be Poisson with parameter λ
Theorem: Suppose that n → ∞ and p → 0 in such a way that lim_{n→∞; p→0} np = λ > 0. Then P(Y_{n,p} = k) → P(X = k), where X is a Poisson with parameter λ.
The import of this theorem is that for a Binomial(n, p) when n is large and p is small, we can multiply them together to get np = λ and then approximate the probabilities of the Binomial random variable with those of a Poisson
This is useful because for large n it is hard to compute the probability distribution of a Binomial, but easy to do so for a Poisson
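A sketch of the approximation in Python, with the assumed illustrative values n = 1000 and λ = np = 3:

```python
import math

lam = 3.0                    # lambda = n p, held fixed
n, p = 1000, lam / 1000      # large n, small p

def binom_pmf(k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k):
    return math.exp(-lam) * lam**k / math.factorial(k)

for k in range(6):
    print(k, round(binom_pmf(k), 5), round(poisson_pmf(k), 5))
# the two columns are already close, and get closer as n grows with np fixed
```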

38 / 41

Recall a continuous function in calculus is one for which lim_{y→x} f(y) = f(x) for all x in the domain of f
This is not what we mean by a continuous random variable
Indeed, we do not have a concept of points of the domain, which is the sample space S, converging
Instead what we mean by a continuous random variable is that the cdf F itself is continuous
Recall F(x) = P(X ≤ x). We know x ↦ F(x) is a non-decreasing function, and lim_{x→-∞} F(x) = 0 and lim_{x→+∞} F(x) = 1

39 / 41

Densities
Suppose that F is not only continuous, but differentiable, and its derivative is continuous, too
Recall that this is not always the case
If F is differentiable with F′(x) = f(x), then by the Fundamental Theorem of Calculus we have

   F(b) = ∫_{-∞}^{b} f(x) dx;   F(b) - F(a) = ∫_{a}^{b} f(x) dx

If F is the cdf of the random variable X, then we have

   P(a < X ≤ b) = ∫_{a}^{b} f(x) dx

40 / 41

Densities II
The function f = F′ is called the density corresponding to the random variable X
F(x) = ∫_{-∞}^{x} f(u) du is the cdf of X
We can get the density by differentiating the cdf, and we get the cdf by anti-differentiating the density
Important observation: F(x) = P(X ≤ x), so 1 - F(x) = P(X > x)
Example: The Uniform Distribution on [A, B]

   f_{A,B}(x) = 1/(B - A) if A ≤ x ≤ B, and 0 otherwise

Example of a less standard density:

   f(x) = (1/2) sin(x) for 0 ≤ x ≤ π, and 0 otherwise
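A numerical sketch of these two densities (the uniform endpoints A = 0, B = 2 are my own illustrative choice); it checks that each integrates to 1 and that P(a < X ≤ b) is the integral of the density from a to b:

```python
import math

def uniform_pdf(x, A=0.0, B=2.0):           # Uniform[A, B] with assumed A = 0, B = 2
    return 1 / (B - A) if A <= x <= B else 0.0

def sine_pdf(x):                            # the density (1/2) sin(x) on [0, pi]
    return 0.5 * math.sin(x) if 0 <= x <= math.pi else 0.0

def integral(f, a, b, n=100_000):           # simple midpoint-rule integration
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# Each density integrates to 1 over its support:
print(round(integral(uniform_pdf, 0, 2), 6), round(integral(sine_pdf, 0, math.pi), 6))

# P(0.5 < X <= 1.5) for the Uniform[0, 2] is F(1.5) - F(0.5) = 0.5:
print(round(integral(uniform_pdf, 0.5, 1.5), 6))
```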

41 / 41
