
Probability Theory

Muhammad Waliji August 11, 2006

Abstract

This paper introduces some elementary notions in Measure-Theoretic Probability Theory. Several probabilistic notions of the convergence of a sequence of random variables are discussed. The theory is then used to prove the Law of Large Numbers. Finally, the notions of conditional expectation and conditional probability are introduced.

1 Heuristic Introduction

Probability theory is concerned with the outcome of experiments that are random in nature, that is, experiments whose outcomes cannot be predicted in advance. The set of possible outcomes, ω, of an experiment is called the sample space, denoted by Ω. For instance, if our experiment consists of rolling a die, we will have Ω = {1, 2, 3, 4, 5, 6}. A subset, A, of Ω is called an event. For instance A = {1, 3, 5} corresponds to the event ‘an odd number is rolled’. In elementary probability theory, one is normally concerned with sample spaces that are either finite or countable. In this case, one often assigns a probability to every single outcome. That is, we have a probability function P : Ω → [0, 1], where P(ω) is the probability that ω occurs. Here, we insist that

Σ_{ω∈Ω} P(ω) = 1.

However, if the sample space is uncountable, then this condition becomes nonsensical. Two elementary types of problems fall into this category and hence cannot be dealt with by elementary probability theory: an infinite number of repeated coin tosses (or dice rolls), and a number drawn at random from [0, 1]. This illustrates the importance of uncountable sample spaces. The solution to this problem is to use the theory of measures. Instead of assigning probabilities to outcomes in the sample space, one restricts attention to a certain class of events that form a structure known as a σ-field, and assigns probabilities to these special kinds of events.


2 σ-Fields, Probability Measures, and Distribution Functions

Definition 2.1. A class of subsets of Ω, F, is a σ-field if the following hold:

(i) ∅ ∈ F and Ω ∈ F

(ii) A ∈ F ⟹ A^c ∈ F

(iii) A_1, A_2, … ∈ F ⟹ ⋃_n A_n ∈ F

Note that this implies that σ-fields are also closed under countable intersections.

Definition 2.2. The σ-field generated by a class of sets, A, is the smallest σ-field containing A. It is denoted σ(A).
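For a finite sample space, the generated σ-field can be computed directly by closing the class under complements and unions. The following Python sketch is purely illustrative; the sample space and generating class are arbitrary choices, not taken from the text.

    from itertools import combinations

    def generated_sigma_field(omega, generators):
        """Close a class of subsets of a finite omega under complements and unions."""
        omega = frozenset(omega)
        sigma = {frozenset(), omega} | {frozenset(g) for g in generators}
        changed = True
        while changed:
            changed = False
            current = list(sigma)
            for a in current:                      # close under complements
                if omega - a not in sigma:
                    sigma.add(omega - a)
                    changed = True
            for a, b in combinations(current, 2):  # close under pairwise unions
                if a | b not in sigma:
                    sigma.add(a | b)
                    changed = True
        return sigma

    omega = {1, 2, 3, 4, 5, 6}
    A = [{1, 3, 5}]                      # the event "an odd number is rolled"
    F = generated_sigma_field(omega, A)
    print(sorted(sorted(s) for s in F))  # [[], [1, 2, 3, 4, 5, 6], [1, 3, 5], [2, 4, 6]]

On a finite space, countable unions reduce to finite unions, so iterating pairwise unions to a fixed point suffices.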

Definition 2.3. Let F be a σ-field. A function P : F → [0, 1] is a probability measure if P(∅) = 0, P(Ω) = 1, and whenever (A_n)_{n∈N} is a disjoint collection of sets in F, we have

P(⋃_{n=1}^∞ A_n) = Σ_{n=1}^∞ P(A_n).

Throughout this paper, unless otherwise noted, the words increasing, decreasing, and monotone are always meant in their weak sense. Suppose {A_n} is a sequence of sets. We say that {A_n} is an increasing sequence if A_1 ⊂ A_2 ⊂ ···. We say that {A_n} is a decreasing sequence if A_1 ⊃ A_2 ⊃ ···. In both of these cases, the sequence {A_n} is said to be monotone. If A_n is increasing, then set lim_n A_n := ⋃_n A_n. If A_n is decreasing, then set lim_n A_n := ⋂_n A_n. The following properties follow immediately from the definitions.

Lemma 2.4. Let F be a σ-field, and let P be a probability measure on it.

(i) P(A^c) = 1 − P(A).

(ii) If A ⊂ B, then P(A) ≤ P(B).

(iii) P(⋃_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ P(A_i).

(iv) If {A_n} is a monotone sequence in F, then lim_n P(A_n) = P(lim_n A_n).

Definition 2.5. Suppose Ω is a set, F is a field on Ω, and P is a probability measure on F. Then, the ordered pair (Ω, F) is called a measurable space. The triple (Ω, F, P) is called a probability space. The probability space is finitely additive or countably additive depending on whether P is finitely or countably additive.

Definition 2.6. Let (X, τ) be a topological space. The σ-field B(X, τ) generated by τ is called the Borel σ-field. In particular, B(X, τ) is the smallest σ-field containing all open and closed sets of X. The sets of B(X, τ) are called Borel sets.


When the topology τ, or even the space X, is obvious from the context, B(X, τ) will often be abbreviated B(X) or even just B. A particularly important situation in probability theory is when Ω = R and F is the collection of Borel sets in R.

Definition 2.7. A distribution function is an increasing right-continuous function F : R → [0, 1] such that

lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1.

We can associate probability measures on (R, B) with distribution functions. Namely, the distribution function associated with P is F(x) := P((−∞, x]). Conversely, each distribution function defines a probability measure on the reals.
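As a small illustrative sketch of this correspondence (the choice of measure is arbitrary, not from the text), the Python snippet below builds F(x) = P((−∞, x]) for a measure concentrated on finitely many points and checks the limiting behavior required by Definition 2.7.

    def make_distribution_function(points, masses):
        """Return F(x) = P((-inf, x]) for a measure putting the given masses on the given points."""
        def F(x):
            return sum(p for t, p in zip(points, masses) if t <= x)
        return F

    # a fair die: mass 1/6 on each of 1, ..., 6
    F = make_distribution_function([1, 2, 3, 4, 5, 6], [1 / 6] * 6)
    print(F(-10.0), F(3.5), F(100.0))   # approximately 0, 0.5, and 1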

3 Random Variables, Transformations, and Expectation

We have now stated the basic objects that we will be studying and discussed their elementary properties. We now introduce the concept of a random variable. Let Ω be the set of all possible drawings of lottery numbers. The function X : Ω → R which indicates the payoff X(ω) to a player associated with a drawing ω is an example of a random variable. The expectation of a random variable is the “average” or “expected” value of X.

Definition 3.1. Let (Ω_1, F_1) and (Ω_2, F_2) be measurable spaces. A function T : Ω_1 → Ω_2 is a measurable transformation if the preimage of any measurable set is a measurable set. That is, T is a measurable transformation if (∀A ∈ F_2)(T^{-1}(A) ∈ F_1).

Lemma 3.2. It is sufficient to check the condition in Definition 3.1 for those A in a class that generates F_2. More precisely, suppose that A generates F_2. Then, if (∀A ∈ A)(T^{-1}(A) ∈ F_1), then T is a measurable transformation.

Proof. Let C := {A ⊂ Ω_2 : T^{-1}(A) ∈ F_1}. Then, C is a σ-field, and A ⊂ C. But then, σ(A) = F_2 ⊂ C, which is exactly what we wanted.

Definition 3.3. Let (Ω, F) be a measurable space. A measurable function or a random variable is a measurable transformation from (Ω, F) into (R, B).

Lemma 3.4. If f : R → R is a continuous function, then f is a measurable transformation from (R, B) to (R, B).

Definition 3.5. Given a set A, the indicator function for A is the function

I_A(ω) := 1 if ω ∈ A, and I_A(ω) := 0 if ω ∉ A.


If A ∈ F, then I_A is a measurable function. Note that many elementary operations, including composition, arithmetic, max, min, and others, when performed upon measurable functions, again yield measurable functions. Let (Ω_1, F_1, P) be a probability space and (Ω_2, F_2) a measurable space. A measurable transformation T : Ω_1 → Ω_2 naturally induces a probability measure PT^{-1} on (Ω_2, F_2). In the case of a random variable, X, the induced measure on R will generally be denoted α. The distribution function associated with α will be denoted F_X. α will sometimes be called a probability distribution. Now that we have a notion of measure and of measurable functions, we can develop a notion of the “integral” of a function. The integral will have the probabilistic interpretation of being an expected (or average) value. For the precise definition of the Lebesgue integral, see any textbook on Measure Theory.

Definition 3.6. Suppose X is a random variable. Then the expectation of X is EX := ∫_Ω X(ω) dP.

We conclude this section with a useful change of variables formula for integrals.

Proposition 3.7. Let (Ω_1, F_1, P) be a probability space and let (Ω_2, F_2) be a measurable space. Suppose T : Ω_1 → Ω_2 is a measurable transformation and f : Ω_2 → R is a measurable function. Then, PT^{-1} is a probability measure on (Ω_2, F_2) and fT : Ω_1 → R is a measurable function. Furthermore, f is integrable iff fT is integrable, and

∫_{Ω_1} fT(ω_1) dP = ∫_{Ω_2} f(ω_2) dPT^{-1}.
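The following Monte Carlo sketch illustrates Proposition 3.7 under an arbitrary choice of T and f (hypothetical, not from the text): averaging f(T(ω)) over draws from P agrees with integrating f against the induced measure PT^{-1}.

    import numpy as np

    rng = np.random.default_rng(0)

    # Omega_1 = [0, 1] with P the uniform measure; T(w) = w**2; f(x) = x.
    # Left side: integrate f(T(w)) over Omega_1 against P, here by Monte Carlo.
    omega = rng.uniform(0.0, 1.0, size=1_000_000)
    lhs = np.mean(omega ** 2)

    # Right side: integrate f against the induced measure P T^{-1}, the law of U^2,
    # which has density 1/(2*sqrt(x)) on (0, 1]; analytically the integral is 1/3.
    rhs = 1.0 / 3.0

    print(lhs, rhs)   # the Monte Carlo value is close to the exact 1/3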

4 Notions of Convergence

We will now introduce some notions of the convergence of random variables. Note that we will often not explicitly state the dependence of a function X (ω) on ω. Hence, sets of the form {ω : X(ω) > 0} will often be abbreviated {X > 0}. For the remainder of this section, let X n be a sequence of random variables.

Definition 4.1. The sequence X_n converges almost surely (almost everywhere) to a random variable X if X_n(ω) → X(ω) for all ω outside of a set of probability 0.

Definition 4.2. The sequence X_n converges in probability (in measure) to a function X if, for every ε > 0,

lim_n P{ω : |X_n(ω) − X(ω)| ≥ ε} = 0.

This is denoted X_n →^P X.


Proposition 4.3. If X n converges almost surely to X, then X n converges in probability to X.

Proof. We have {ω : X_n(ω) does not converge to X(ω)} ⊂ N for some N with P(N) = 0. That is, for every ε > 0,

⋂_{n=1}^∞ ⋃_{m=n}^∞ {|X_m − X| ≥ ε} ⊂ N.

Therefore, given ε > 0, we have

lim_n P{|X_n − X| ≥ ε} ≤ lim_{n→∞} P(⋃_{m=n}^∞ {|X_m − X| ≥ ε}) = P(⋂_{n=1}^∞ ⋃_{m=n}^∞ {|X_m − X| ≥ ε}) ≤ P(N) = 0,

where the middle equality uses Lemma 2.4(iv), since the unions are decreasing in n, thereby completing the proof.

Note, however, that the converse is not true. Let Ω = [0, 1] with Lebesgue measure. Consider the sequence of sets A_1 = [0, 1/2], A_2 = [1/2, 1], A_3 = [0, 1/3], A_4 = [1/3, 2/3], and so on. Then, the indicator functions I_{A_n} converge in probability to 0, since P(A_n) → 0. However, I_{A_n}(ω) does not converge for any ω, since each ω lies in infinitely many A_n; in particular the sequence does not converge almost surely. A small simulation of this sequence is sketched below.
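A quick Python sketch of these sliding intervals (illustrative only): the lengths P(A_n) tend to 0, which gives convergence in probability of I_{A_n} to 0, yet the fixed point ω = 0.3 keeps landing in some interval of every block, so I_{A_n}(ω) returns to 1 infinitely often and cannot converge.

    def sliding_intervals(max_m):
        """[0,1/2], [1/2,1], then thirds, then quarters, and so on."""
        for m in range(2, max_m + 1):
            for k in range(m):
                yield (k / m, (k + 1) / m)

    omega = 0.3                      # any fixed point of [0, 1]
    lengths, indicators = [], []
    for a, b in sliding_intervals(6):
        lengths.append(b - a)                          # P(A_n) -> 0
        indicators.append(1 if a <= omega <= b else 0)

    print(lengths)      # 1/2, 1/2, 1/3, 1/3, 1/3, 1/4, ...
    print(indicators)   # a 1 appears in every block, so I_{A_n}(0.3) does not converge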

However, the following holds as a sort of converse:

Proposition 4.4. Suppose f_n converges in probability to f. Then, there is a subsequence f_{n_k} of f_n such that f_{n_k} converges almost surely to f.

Proof. Let B_n^ε := {ω : |f_n(ω) − f(ω)| ≥ ε}. Then, for a subsequence (n_i),

f_{n_i} → f almost surely   iff   for every ε > 0, P(⋂_i ⋃_{j≥i} B_{n_j}^ε) = 0.

We know that for any ε, lim_{n→∞} P(B_n^ε) = 0. Now, notice that

P(⋂_n ⋃_{m≥n} B_m^ε) ≤ inf_n P(⋃_{m≥n} B_m^ε) ≤ inf_n Σ_{m=n}^∞ P(B_m^ε) = lim_{n→∞} Σ_{m=n}^∞ P(B_m^ε).

Furthermore, if ε_1 < ε_2, then B_n^{ε_2} ⊂ B_n^{ε_1}, so that P(B_n^{ε_1}) ≥ P(B_n^{ε_2}). Let δ_i := 1/2^i. Now, note that (∀i)(∃n_i)(∀n ≥ n_i)(P(B_n^{δ_i}) < δ_i), and we may take the n_i to be strictly increasing. Choose ε > 0. Note that (∃m)(δ_m < ε). Hence,

P(⋂_i ⋃_{j≥i} B_{n_j}^ε) ≤ lim_{i→∞} Σ_{j=i}^∞ P(B_{n_j}^ε) ≤ lim_{i→∞} Σ_{j=i}^∞ P(B_{n_j}^{δ_j}) ≤ lim_{i→∞} Σ_{j=i}^∞ δ_j = 0,

where the second inequality holds for i ≥ m, since then δ_j ≤ δ_m < ε. This is what we wanted.


Definition 4.5. A sequence of probability measures {α_n} on R converges weakly to α if whenever α{a} = α{b} = 0, for a < b ∈ R, we have

lim_n α_n[a, b] = α[a, b].

A sequence of random variables {X_n} converges weakly to X if the induced probability measures {α_n} converge weakly to α. This is denoted α_n ⇒ α or X_n ⇒ X.

Lemma 4.6. Suppose α_n and α are probability measures on R with associated distribution functions F_n and F. Then, α_n ⇒ α iff F_n(x) → F(x) for each continuity point x of F.

Proof. First, note that x is a continuity point of F iff α{x} = 0. Let a < b be continuity points of F. Suppose F_n(x) → F(x) for each continuity point x of F. Then,

lim_n α_n[a, b] = lim_n F_n(b) − F_n(a) = F(b) − F(a) = α[a, b].

For the converse, suppose α_n ⇒ α. Then,

lim_n F_n(b) − F_n(a) = lim_n α_n[a, b] = α[a, b].

Now, we can let a → −∞ in such a way that a is always a continuity point of F. Then, we get lim_n F_n(b) = α(−∞, b] = F(b).
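A tiny numerical illustration of Lemma 4.6, with hypothetical distributions not taken from the text: let α_n be the point mass at 1/n and α the point mass at 0. Then F_n(x) → F(x) at every continuity point x ≠ 0 of F, while F_n(0) = 0 fails to converge to F(0) = 1; weak convergence nevertheless holds.

    def F_n(x, n):
        """Distribution function of the point mass at 1/n."""
        return 1.0 if x >= 1.0 / n else 0.0

    def F(x):
        """Distribution function of the point mass at 0 (the weak limit)."""
        return 1.0 if x >= 0.0 else 0.0

    for x in (-0.5, 0.0, 0.25, 1.0):
        print(x, [F_n(x, n) for n in (1, 10, 100, 1000)], F(x))
    # at x = -0.5, 0.25, 1.0 (continuity points of F) the values F_n(x) approach F(x);
    # at x = 0.0 (the discontinuity of F) F_n(0) = 0 for every n, while F(0) = 1.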

The next result shows that weak convergence is actually ‘weak’:

Proposition 4.7. Suppose X_n converges in probability to X. Then, X_n converges weakly to X.

Proof. Let F_n, F be the distribution functions of X_n, X respectively. Suppose x is a continuity point of F. Note that

{X ≤ x − ε} \ {|X_n − X| ≥ ε} ⊂ {X_n ≤ x}

and

{X_n ≤ x} = {X_n ≤ x and X ≤ x + ε} ∪ {X_n ≤ x and X > x + ε} ⊂ {X ≤ x + ε} ∪ {|X_n − X| ≥ ε}.

Therefore,

P{X ≤ x − ε} − P{|X_n − X| ≥ ε} ≤ P{X_n ≤ x} ≤ P{X ≤ x + ε} + P{|X_n − X| ≥ ε}.

Since for each ε > 0, lim_n P{|X_n − X| ≥ ε} = 0, when we let n → ∞, we have

F(x − ε) ≤ lim inf_{n→∞} F_n(x) ≤ lim sup_{n→∞} F_n(x) ≤ F(x + ε).

Finally, since F is continuous at x, letting ε → 0, we have

lim_n F_n(x) = F(x),

so that X_n ⇒ X.

The converse is not true in general. However, if X has a degenerate distribution (takes a single value with probability one), then the converse is true.

Proposition 4.8. Suppose X_n ⇒ X, and X has a degenerate distribution such that P{X = a} = 1. Then, X_n →^P X.

Proof. Let α_n and α be the distributions on R induced by X_n and X respectively. Given ε > 0, we have

lim_n α_n[a − ε, a + ε] = α[a − ε, a + ε] = 1.

Hence,

lim_n P{|X_n − X| ≤ ε} = 1,

and so

lim_n P{|X_n − X| > ε} = 0.

5 Product Measures and Independence

Suppose (Ω_1, F_1) and (Ω_2, F_2) are two measurable spaces. We want to construct a product measurable space with sample space Ω_1 × Ω_2.

Definition 5.1. Let A = {A × B : A ∈ F 1 , B ∈ F 2 }. Let F 1 ×F 2 be the σ-field generated by A. F 1 × F 2 is called the product σ-field of F 1 and F 2 .

If P 1 and P 2 are probability measures on the measurable spaces above, then P 1 × P 2 (A × B) := P 1 (A)P 2 (B) gives a probability measure on A. This can be extended in a canonical way to the σ-field F 1 × F 2 .

Definition 5.2. P 1 ×P 2 is called the product probability measure of P 1 and P 2 .

Let Ω := Ω_1 × Ω_2, F := F_1 × F_2, and P := P_1 × P_2. Note that when calculating integrals with respect to a product probability measure, we can normally perform an iterated integral in any order with respect to the component probability measures. This result is known as Fubini's Theorem. Before we define a notion of independence, we will give some heuristic considerations. Two events A and B should be independent if A occurring has nothing to do with B occurring. If we denote by P_A(X) the probability that X occurs given that A has occurred, then we see that P_A(X) = P(A ∩ X)/P(A). Now, suppose that A and B are indeed independent. This means that P_A(B) = P(B). But then, P(B) = P(A ∩ B)/P(A), so that P(A ∩ B) = P(A)P(B). This leads us to define:


Definition 5.3. Let (Ω, F, P ) be a probability space. Let A i ∈ F for every i. Let X i be a random variable for every i.

(i) A_1, …, A_n are independent if P(A_1 ∩ ··· ∩ A_n) = P(A_1)···P(A_n).

(ii) A collection of events {A_i}_{i∈I} is independent if every finite subcollection is independent.

(iii) X_1, …, X_n are independent if for any n sets A_1, …, A_n ∈ B(R), the events {X_i ∈ A_i}_{i=1}^n are independent.

(iv) A collection of random variables {X_i}_{i∈I} is independent if every finite subcollection is independent.

Lemma 5.4. Suppose X, Y are random variables on (Ω, F, P ), with induced distributions α, β on R respectively. Then, X and Y are independent if and only if the distribution induced on R 2 by (X, Y ) is α × β.

Lemma 5.5. Suppose X, Y are independent random variables, and suppose that f, g are measurable functions. Then, f (X) and g(Y ) are also independent random variables.

Proposition 5.6. Let X, Y be independent random variables, and let f, g be measurable functions. Suppose that E|f (X)| and E|g(Y )| are both finite. Then, E[f (X)g(Y )] = E[f (X)]E[g(Y )].

Proof. Let α be the distribution on R induced by f(X), and let β be the distribution induced by g(Y). Then, the distribution on R² induced by (f(X), g(Y)) is α × β. So,

E[f(X)g(Y)] = ∫_Ω f(X(ω))g(Y(ω)) dP = ∫_R ∫_R uv dα dβ = (∫_R u dα)(∫_R v dβ) = E[f(X)]E[g(Y)].
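A Monte Carlo sketch of Proposition 5.6, with an arbitrary choice of distributions and functions (not from the text): for independent X and Y, the sample average of f(X)g(Y) should match the product of the sample averages up to simulation error.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1_000_000

    x = rng.exponential(scale=2.0, size=n)       # X, independent of Y
    y = rng.normal(loc=1.0, scale=3.0, size=n)   # Y
    f, g = np.sqrt, np.cos                       # f(X) = sqrt(X), g(Y) = cos(Y)

    lhs = np.mean(f(x) * g(y))                   # E[f(X) g(Y)]
    rhs = np.mean(f(x)) * np.mean(g(y))          # E[f(X)] E[g(Y)]
    print(lhs, rhs)                              # agree up to Monte Carlo error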

6 Characteristic Functions

The inverse Fourier transform of a probability distribution plays a central role in probability theory.

Definition 6.1. Let α be a probability measure on R. Then, the characteristic function of α is

φ_α(t) = ∫_R e^{ıtx} dα.

If X is a random variable, the characteristic function of the distribution on R induced by X will sometimes be denoted φ_X. The results below demonstrate the importance of the characteristic function in probability.
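Before stating those results, here is a small numerical sketch of Definition 6.1 (with an arbitrary choice of distribution, not from the text): the empirical average of e^{ıtX} over many samples approximates φ_X(t), and for a standard normal X the exact value is e^{−t²/2}.

    import numpy as np

    rng = np.random.default_rng(2)
    samples = rng.standard_normal(500_000)            # X ~ N(0, 1)

    for t in (0.0, 0.5, 1.0, 2.0):
        phi_hat = np.mean(np.exp(1j * t * samples))   # empirical characteristic function
        phi_exact = np.exp(-t * t / 2.0)              # known value for the standard normal
        print(t, phi_hat, phi_exact)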


Proposition 6.2. Suppose α and β are probability measures on R with characteristic functions φ and ψ respectively. Suppose further that for each t ∈ R, φ(t) = ψ(t). Then, α = β.

Theorem 6.3. Let α_n, α be probability measures on R with distribution functions F_n and F and characteristic functions φ_n and φ. Then, the following are equivalent:

(i) α_n ⇒ α.

(ii) for any bounded continuous function f : R → R, lim_n ∫_R f(x) dα_n = ∫_R f(x) dα.

(iii) for every t ∈ R, lim_n φ_n(t) = φ(t).

Theorem 6.4. Suppose α_n is a sequence of probability measures on R, with characteristic functions φ_n. Suppose that for each t ∈ R, lim_n φ_n(t) =: φ(t) exists and φ is continuous at 0. Then, there is a probability distribution α such that φ is the characteristic function of α. Furthermore, α_n ⇒ α.

Next, we show how to recover the moments of a random variable from its characteristic function.

Definition 6.5. Suppose X is a random variable. Then, the kth moment of X is EX^k. The kth absolute moment of X is E|X|^k.

Proposition 6.6. Let X be a random variable. Suppose that the kth moment of X exists. Then, the characteristic function φ of X is k times continuously differentiable, and

φ^{(k)}(0) = ı^k EX^k.
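A numerical sketch of Proposition 6.6 in the case k = 1, for an arbitrary illustrative distribution: since φ′(0) = ı EX, a central-difference estimate of φ′ at 0, applied to the empirical characteristic function, recovers the mean.

    import numpy as np

    rng = np.random.default_rng(3)
    samples = rng.gamma(shape=2.0, scale=1.5, size=500_000)   # a distribution with mean 3

    phi = lambda t: np.mean(np.exp(1j * t * samples))         # empirical characteristic function
    h = 1e-3
    derivative_at_0 = (phi(h) - phi(-h)) / (2 * h)            # ≈ φ'(0) = ı EX

    print(derivative_at_0.imag)   # ≈ 3.0
    print(samples.mean())         # sample mean, for comparison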

Now, a result on affine transforms of a random variable:

Proposition 6.7. Suppose X is a random variable, and Y = aX + b. Let φ_X and φ_Y be the characteristic functions of X and Y. Then, φ_Y(t) = e^{ıtb} φ_X(at).

We will often be interested in the sums of independent random variables. Suppose that X and Y are independent random variables with induced distributions α and β on R respectively. Then, the induced distribution of (X, Y) on R² is α × β. Consider the map f : R² → R given by f(x, y) = x + y. Then, the distribution on R induced from α × β by f is denoted α ∗ β, and is called the convolution of α and β. α ∗ β is the distribution of the sum of X and Y.

Proposition 6.8. Suppose X and Y are independent random variables with distributions α and β respectively. Then, φ_{X+Y}(t) = φ_X(t)φ_Y(t).


Proof.

φ_{α∗β}(t) = ∫_R e^{ıtz} d(α ∗ β) = ∫_R ∫_R e^{ıt(x+y)} dα dβ = ∫_R ∫_R e^{ıtx} e^{ıty} dα dβ = (∫_R e^{ıtx} dα)(∫_R e^{ıty} dβ) = φ_α(t)φ_β(t).

7 Useful Bounds and Inequalities

Here, we will prove some useful bounds regarding random variables and their moments.

Definition 7.1. Let X be a random variable. Then, the variance of X is Var(X) := E[(X − EX)²] = EX² − (EX)².

The variance is a measure of how far X is spread, on average, from its mean. It exists if X has a finite second moment. It is often denoted σ².

Lemma 7.2. Suppose X, Y are independent random variables. Then, Var(X + Y) = Var(X) + Var(Y).

Proposition 7.3 (Markov's Inequality). Let ε > 0. Suppose X is a random variable with finite kth absolute moment. Then, P{|X| ≥ ε} ≤ (1/ε^k) E|X|^k.

Proof.

P{|X| ≥ ε} = ∫_{|X|≥ε} dP ≤ (1/ε^k) ∫_{|X|≥ε} |X|^k dP ≤ (1/ε^k) ∫_Ω |X|^k dP = (1/ε^k) E|X|^k.

Corollary 7.4 (Chebyshev's Inequality). Suppose X is a random variable with finite 2nd moment. Then,

P{|X − EX| ≥ ε} ≤ (1/ε²) Var(X).
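A simulation sketch of Chebyshev's inequality, with an arbitrary choice of distribution (not from the text): the empirical frequency of {|X − EX| ≥ ε} stays below Var(X)/ε².

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.exponential(scale=1.0, size=1_000_000)    # EX = 1, Var(X) = 1

    for eps in (1.0, 2.0, 3.0):
        empirical = np.mean(np.abs(x - 1.0) >= eps)   # estimate of P{|X - EX| >= eps}
        bound = 1.0 / eps ** 2                        # Chebyshev bound Var(X)/eps^2
        print(eps, empirical, bound)                  # the frequency stays below the bound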

The following is also a useful fact:

Lemma 7.5. Suppose X is a nonnegative random variable. Then,

Σ_{m=1}^∞ P{X ≥ m} ≤ EX.

Proof.

EX ≥ Σ_{n=1}^∞ n P{n ≤ X < n + 1} = Σ_{m=1}^∞ Σ_{n=m}^∞ P{n ≤ X < n + 1} = Σ_{m=1}^∞ P{X ≥ m}.

8 The Borel-Cantelli Lemma

First, let us introduce some terminology. Let A_1, A_2, … be sets. Then,

lim sup_{n→∞} A_n := ⋂_{n=1}^∞ ⋃_{m=n}^∞ A_m.

lim sup_n A_n consists of those ω that appear in A_n infinitely often (i.o.). Also,

lim inf_{n→∞} A_n := ⋃_{n=1}^∞ ⋂_{m=n}^∞ A_m.

lim inf_n A_n consists of those ω that appear in all but finitely many A_n.

Theorem 8.1 (Borel-Cantelli Lemma). Let A_1, A_2, … ∈ F. If Σ_{n=1}^∞ P(A_n) < ∞, then P(lim sup_n A_n) = 0. Furthermore, suppose that the A_i are independent. Then, if Σ_{n=1}^∞ P(A_n) = ∞, then P(lim sup_n A_n) = 1.

Proof. Suppose Σ_{n=1}^∞ P(A_n) < ∞. Then,

P(⋂_{n=1}^∞ ⋃_{m=n}^∞ A_m) = lim_{n→∞} P(⋃_{m=n}^∞ A_m) ≤ lim_{n→∞} Σ_{m=n}^∞ P(A_m) = 0.

For the converse, it is enough to show that

P(⋃_{n=1}^∞ ⋂_{m=n}^∞ A_m^c) = 0,

and so it is also enough to show that

P(⋂_{m=n}^∞ A_m^c) = 0

for all n. By independence, and since 1 − x ≤ e^{−x}, we have

P(⋂_{m=n}^∞ A_m^c) ≤ P(⋂_{m=n}^{n+k} A_m^c) = ∏_{m=n}^{n+k} (1 − P(A_m)) ≤ exp(−Σ_{m=n}^{n+k} P(A_m)).

Since the last sum diverges, taking the limit as k → ∞, we get

P(⋂_{m=n}^∞ A_m^c) = 0.
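A simulation sketch of both halves of the lemma, with an arbitrary choice of independent events (not from the text): with P(A_n) = 1/n² the sum converges and only a few events occur, with the last occurrence typically early; with P(A_n) = 1/n the sum diverges and, the events being independent, occurrences keep appearing arbitrarily late. A finite truncation can only suggest the almost-sure statements.

    import numpy as np

    rng = np.random.default_rng(5)
    N = 100_000
    n = np.arange(1, N + 1)

    def occurrences(probs):
        """Simulate independent events A_1, ..., A_N with the given probabilities;
        return how many occur and the index of the last occurrence (0 if none)."""
        hits = rng.random(N) < probs
        last = int(n[hits][-1]) if hits.any() else 0
        return int(hits.sum()), last

    print(occurrences(1.0 / n ** 2))   # summable case: a handful of events, all early
    print(occurrences(1.0 / n))        # divergent case: roughly log N events, some very late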

9 The Law of Large Numbers

Let X_1, X_2, … be random variables that are independent and identically distributed (iid). Let S_n := X_1 + ··· + X_n. We will be interested in the asymptotic behavior of the average S_n/n. If X_i has a finite expectation, then we would expect S_n/n to settle down to EX_i. This is known as the Law of Large Numbers. There are two varieties of this law: the Weak Law of Large Numbers and the Strong Law of Large Numbers. The weak law states that the average converges in probability to EX_i. The strong law, however, states that the average converges almost surely to EX_i. However, the strong law is significantly harder to prove, and requires a bit of additional machinery. For the rest of this section, fix a probability space (Ω, F, P).

Theorem 9.1 (The Weak Law of Large Numbers). Suppose X_1, X_2, … are iid random variables with mean EX_i = m < ∞. Then, S_n/n →^P m.

Proof. Let φ be the characteristic function of X_i. Then, the characteristic function of S_n is [φ(t)]^n. Then, by 6.7, the characteristic function of S_n/n is ψ_n(t) = [φ(t/n)]^n. Furthermore, by 6.6, φ is differentiable, and φ′(0) = ım. Therefore, we can form the Taylor expansion,

φ(t/n) = 1 + ımt/n + o(1/n),

and so

ψ_n(t) = [1 + ımt/n + o(1/n)]^n.

Taking the limit as n → ∞, we get

lim_{n→∞} ψ_n(t) = e^{ımt},

which is the characteristic function of the distribution degenerate at m. Therefore, by Theorem 6.3 and Proposition 4.8, S_n/n converges in probability to m.
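A simulation sketch of the law of large numbers, with an arbitrary choice of distribution (not from the text): the running averages S_n/n of iid draws settle down toward the mean m as n grows.

    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.uniform(0.0, 1.0, size=1_000_000)                    # iid draws with mean m = 0.5
    running_average = np.cumsum(x) / np.arange(1, x.size + 1)    # S_n / n

    for k in (10, 1_000, 100_000, 1_000_000):
        print(k, running_average[k - 1])                         # approaches m = 0.5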

Theorem 9.2 (The Strong Law of Large Numbers). Suppose X_1, X_2, … are iid random variables with EX_i = m < ∞. Let S_n = X_1 + ··· + X_n. Then, S_n/n converges to m almost surely.

Proof. We can decompose an arbitrary random variable X_i into its positive and negative parts: X_i^+ := X_i I_{X_i ≥ 0} and X_i^− := −X_i I_{X_i < 0}, so that X_i = X_i^+ − X_i^−. Then, we have S_n = (X_1^+ + ··· + X_n^+) − (X_1^− + ··· + X_n^−) =: S_n^+ − S_n^−. Hence, it is enough to prove the Theorem for nonnegative X_i. Now, let Y_i := X_i I_{X_i ≤ i}, and let S*_n := Y_1 + ··· + Y_n. Furthermore, let α > 1, and set u_n := ⌊α^n⌋. We shall first establish the inequality

Σ_{n=1}^∞ P{|S*_{u_n} − ES*_{u_n}| ≥ ε u_n} < ∞ for every ε > 0.   (9.1)

Since the X_i are independent, we have

Var(S*_n) = Σ_{k=1}^n Var(Y_k) ≤ Σ_{k=1}^n EY_k² = Σ_{k=1}^n E[X_i² I_{X_i ≤ k}] ≤ n E[X_i² I_{X_i ≤ n}].

By Chebyshev's inequality, we have

Σ_{n=1}^∞ P{|S*_{u_n} − ES*_{u_n}| ≥ ε u_n} ≤ Σ_{n=1}^∞ Var(S*_{u_n})/(ε² u_n²) ≤ (1/ε²) Σ_{n=1}^∞ E[X_i² I_{X_i ≤ u_n}]/u_n = (1/ε²) E[X_i² Σ_{n=1}^∞ (1/u_n) I_{X_i ≤ u_n}].   (9.2)

Now, let K := 2α/(α − 1). Let x > 0, and let N := inf{n : u_n ≥ x}. Then, α^N ≥ x. Also, note that α^n ≤ 2u_n, and so 1/u_n ≤ 2α^{−n}. Then,

Σ_{n : u_n ≥ x} 1/u_n ≤ 2 Σ_{n≥N} α^{−n} = 2α^{−N} Σ_{n=0}^∞ α^{−n} ≤ K x^{−1},

and hence,

Σ_{n=1}^∞ (1/u_n) I_{X_i ≤ u_n} ≤ K X_i^{−1} if X_i > 0,

and so, putting this into (9.2), we get

(1/ε²) E[X_i² Σ_{n=1}^∞ (1/u_n) I_{X_i ≤ u_n}] ≤ (1/ε²) E[K X_i] = (K/ε²) EX_i < ∞,

thereby establishing inequality (9.1). Therefore, by the Borel-Cantelli Lemma, we have

P(lim sup_n {|S*_{u_n} − ES*_{u_n}| ≥ ε u_n}) = 0