Probability Theory
Muhammad Waliji
August 11, 2006
Abstract
This paper introduces some elementary notions in Measure-Theoretic Probability Theory. Several probabilistic notions of the convergence of a sequence of random variables are discussed. The theory is then used to prove the Law of Large Numbers. Finally, the notions of conditional expectation and conditional probability are introduced.
1 Heuristic Introduction
Probability theory is concerned with the outcome of experiments that are random in nature, that is, experiments whose outcomes cannot be predicted in advance. The set of possible outcomes ω of an experiment is called the sample space, denoted by Ω. For instance, if our experiment consists of rolling a die, we will have Ω = {1, 2, 3, 4, 5, 6}. A subset A of Ω is called an event. For instance, A = {1, 3, 5} corresponds to the event 'an odd number is rolled'.

In elementary probability theory, one is normally concerned with sample spaces that are either finite or countable. In this case, one often assigns a probability to every single outcome. That is, we have a probability function P : Ω → [0, 1], where P(ω) is the probability that ω occurs. Here, we insist that

Σ_{ω∈Ω} P(ω) = 1.

However, if the sample space is uncountable, then this condition no longer makes sense. Two elementary types of problems fall into this category and hence cannot be dealt with by elementary probability theory: an infinite number of repeated coin tosses (or die rolls), and a number drawn at random from [0, 1]. This illustrates the importance of uncountable sample spaces. The solution to this problem is to use the theory of measures. Instead of assigning probabilities to outcomes in the sample space, one restricts attention to a certain class of events that form a structure known as a σ-field, and assigns probabilities to these special kinds of events.
2 σ-Fields, Probability Measures, and Distribution Functions
Definition 2.1. A class F of subsets of Ω is a σ-field if the following hold:

(i) ∅ ∈ F and Ω ∈ F;

(ii) A ∈ F ⟹ A^c ∈ F;

(iii) A_1, A_2, … ∈ F ⟹ ⋃_n A_n ∈ F.

Note that this implies that σ-fields are also closed under countable intersections.
Definition 2.2. The σ-field generated by a class of sets A is the smallest σ-field containing A. It is denoted σ(A).
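On a finite sample space, σ(A) can be computed by brute force: close the class under complements and unions until a fixed point is reached (countable unions reduce to finite ones here). The following Python sketch is an illustration of the definition, not part of the paper:

```python
# A sketch of sigma(A) on a *finite* sample space: repeatedly close a class
# of subsets under complement and union until nothing new appears.
# Sets are encoded as frozensets so they can be members of a set.

def generated_sigma_field(omega, generators):
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(family)
        for a in current:
            c = omega - a                      # closure under complement
            if c not in family:
                family.add(c); changed = True
        for a in current:
            for b in current:
                u = a | b                      # finite unions suffice here
                if u not in family:
                    family.add(u); changed = True
    return family

F = generated_sigma_field({1, 2, 3, 4, 5, 6}, [{1, 3, 5}])
print(sorted(map(sorted, F)))
```

For the die with A = {{1, 3, 5}}, the routine returns the four sets ∅, {1, 3, 5}, {2, 4, 6}, and Ω.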
Definition 2.3. Let F be a σ-field. A function P : F → [0, 1] is a probability measure if P(∅) = 0, P(Ω) = 1, and whenever (A_n)_{n∈N} is a disjoint collection of sets in F, we have

P(⋃_{n=1}^∞ A_n) = Σ_{n=1}^∞ P(A_n).
Throughout this paper, unless otherwise noted, the words increasing, decreasing, and monotone are always meant in their weak sense. Suppose {A_n} is a sequence of sets. We say that {A_n} is an increasing sequence if A_1 ⊂ A_2 ⊂ ···, and a decreasing sequence if A_1 ⊃ A_2 ⊃ ···. In both of these cases, the sequence {A_n} is said to be monotone. If A_n is increasing, then set lim_n A_n := ⋃_n A_n. If A_n is decreasing, then set lim_n A_n := ⋂_n A_n. The following properties follow immediately from the definitions.
Lemma 2.4. Let F be a σ-field, and let P be a probability measure on it.

(i) P(A^c) = 1 − P(A).

(ii) If A ⊂ B, then P(A) ≤ P(B).

(iii) P(⋃_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ P(A_i).

(iv) If {A_n} is a monotone sequence in F, then lim_n P(A_n) = P(lim_n A_n).
Definition 2.5. Suppose Ω is a set, F is a σ-field on Ω, and P is a probability measure on F. Then, the ordered pair (Ω, F) is called a measurable space, and the triple (Ω, F, P) is called a probability space. The probability space is finitely additive or countably additive according to whether P is finitely or countably additive.
Definition 2.6. Let (X, τ) be a topological space. The σ-field B(X, τ) generated by τ is called the Borel σ-field. In particular, B(X, τ) is the smallest σ-field containing all open and closed sets of X. The sets of B(X, τ) are called Borel sets.
When the topology τ, or even the space X, is obvious from the context, B(X, τ) will often be abbreviated B(X) or even just B. A particularly important situation in probability theory is when Ω = R and F is the σ-field of Borel sets in R.
Definition 2.7. A distribution function is an increasing right-continuous function F : R → [0, 1] such that

lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1.

We can associate probability measures on (R, B) with distribution functions. Namely, the distribution function associated with P is F(x) := P((−∞, x]). Conversely, each distribution function defines a probability measure on the reals.
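As a concrete sketch of Definition 2.7 (my own example, using exact rational arithmetic), the distribution function of a fair die is a step function whose limiting values and jump sizes can be checked directly:

```python
# The distribution function F(x) = P((-inf, x]) of a fair die: a step
# function that jumps by 1/6 at each of x = 1, ..., 6.
from fractions import Fraction

def die_cdf(x):
    # count the faces <= x, each carrying probability 1/6
    return Fraction(sum(1 for k in range(1, 7) if k <= x), 6)

assert die_cdf(0) == 0                              # F -> 0 to the left
assert die_cdf(3) == Fraction(1, 2)
assert die_cdf(10) == 1                             # F -> 1 to the right
assert die_cdf(2) - die_cdf(1) == Fraction(1, 6)    # jump size at x = 2
```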
3 Random Variables, Transformations, and Expectation
We have now stated the basic objects that we will be studying and discussed their elementary properties. We now introduce the concept of a random variable. Let Ω be the set of all possible drawings of lottery numbers. The function X : Ω → R which indicates the payoff X(ω) to a player associated with a drawing ω is an example of a random variable. The expectation of a random variable X is the "average" or "expected" value of X.
Definition 3.1. Let (Ω_1, F_1) and (Ω_2, F_2) be measurable spaces. A function T : Ω_1 → Ω_2 is a measurable transformation if the preimage of any measurable set is a measurable set. That is, T is a measurable transformation if (∀A ∈ F_2)(T^{−1}(A) ∈ F_1).
Lemma 3.2. It is sufficient to check the condition in Definition 3.1 for those A in a class that generates F_2. More precisely, suppose that A generates F_2. Then, if (∀A ∈ A)(T^{−1}(A) ∈ F_1), then T is a measurable transformation.

Proof. Let C := {A ∈ 2^{Ω_2} : T^{−1}(A) ∈ F_1}. Then, C is a σ-field, and A ⊂ C. But then, σ(A) = F_2 ⊂ C, which is exactly what we wanted.
Deﬁnition 3.3. Let (Ω, F) be a measurable space. A measurable function or a random variable is a measurable transformation from (Ω, F) into (R, B).
Lemma 3.4. If f : R → R is a continuous function, then f is a measurable transformation from (R, B) to (R, B).
Definition 3.5. Given a set A, the indicator function for A is the function

I_A(ω) := 1 if ω ∈ A, and 0 if ω ∉ A.
If A ∈ F, then I_A is a measurable function. Note that many elementary operations, including composition, arithmetic, max, min, and others, when performed upon measurable functions, again yield measurable functions.

Let (Ω_1, F_1, P) be a probability space and (Ω_2, F_2) a measurable space. A measurable transformation T : Ω_1 → Ω_2 naturally induces a probability measure PT^{−1} on (Ω_2, F_2). In the case of a random variable X, the induced measure on R will generally be denoted α. The distribution function associated with α will be denoted F_X; α will sometimes be called a probability distribution.

Now that we have a notion of measure and of measurable functions, we can develop a notion of the "integral" of a function. The integral will have the probabilistic interpretation of being an expected (or average) value. For the precise definition of the Lebesgue integral, see any textbook on Measure Theory.
Definition 3.6. Suppose X is a random variable. Then the expectation of X is EX := ∫_Ω X(ω) dP.
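For a random variable on a finite probability space, the integral in Definition 3.6 collapses to a weighted sum. A small exact sketch (the die example is mine, not from the paper):

```python
# For X on a finite sample space, E X = ∫ X dP reduces to sum_w X(w) P({w}).
from fractions import Fraction

omega = [1, 2, 3, 4, 5, 6]                  # a fair die
P = {w: Fraction(1, 6) for w in omega}      # P({w}) for each outcome
X = lambda w: w                             # the payoff: the face value

EX = sum(X(w) * P[w] for w in omega)        # the "integral" as a finite sum
print(EX)   # 7/2
```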
We conclude this section with a useful change of variables formula for integrals.

Proposition 3.7. Let (Ω_1, F_1, P) be a probability space and let (Ω_2, F_2) be a measurable space. Suppose T : Ω_1 → Ω_2 is a measurable transformation and f : Ω_2 → R is a measurable function. Then, PT^{−1} is a probability measure on (Ω_2, F_2) and fT : Ω_1 → R is a measurable function. Furthermore, f is integrable iff fT is integrable, and

∫_{Ω_1} fT(ω_1) dP = ∫_{Ω_2} f(ω_2) dPT^{−1}.
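On a finite sample space, Proposition 3.7 amounts to regrouping a finite sum and can be checked directly. The map T and the function f below are illustrative choices, not from the paper:

```python
# Change of variables on a finite space: ∫ f∘T dP equals ∫ f d(PT^{-1}),
# where PT^{-1} is the measure induced on Omega_2 = {0, 1} by T.
from fractions import Fraction

omega1 = [1, 2, 3, 4, 5, 6]
P = {w: Fraction(1, 6) for w in omega1}
T = lambda w: w % 2                       # measurable map into {0, 1}
f = lambda y: 10 * y

# the induced measure PT^{-1}: push the mass of each outcome through T
PT = {y: sum(P[w] for w in omega1 if T(w) == y) for y in (0, 1)}

lhs = sum(f(T(w)) * P[w] for w in omega1)     # ∫ f∘T dP
rhs = sum(f(y) * PT[y] for y in (0, 1))       # ∫ f d(PT^{-1})
assert lhs == rhs == 5                        # both equal 10 * P(odd) = 5
```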
4 Notions of Convergence
We will now introduce some notions of the convergence of random variables. Note that we will often not explicitly state the dependence of a function X(ω) on ω. Hence, sets of the form {ω : X(ω) > 0} will often be abbreviated {X > 0}. For the remainder of this section, let X_n be a sequence of random variables.
Definition 4.1. The sequence X_n converges almost surely (almost everywhere) to a random variable X if X_n(ω) → X(ω) for all ω outside of a set of probability 0.

Definition 4.2. The sequence X_n converges in probability (in measure) to a function X if, for every ε > 0,

lim_{n→∞} P{ω : |X_n(ω) − X(ω)| ≥ ε} = 0.

This is denoted X_n →_P X.
Proposition 4.3. If X_n converges almost surely to X, then X_n converges in probability to X.

Proof. We have {ω : X_n(ω) does not converge to X(ω)} ⊂ N with P(N) = 0. That is, for every ε > 0,

⋂_{n=1}^∞ ⋃_{m=n}^∞ {|X_m − X| ≥ ε} ⊂ N.

Therefore, given ε > 0, we have

lim_{n→∞} P{|X_n − X| ≥ ε} ≤ lim_{n→∞} P(⋃_{m=n}^∞ {|X_m − X| ≥ ε}) = P(⋂_{n=1}^∞ ⋃_{m=n}^∞ {|X_m − X| ≥ ε}) ≤ P(N) = 0,

thereby completing the proof.
Note, however, that the converse is not true. Let Ω = [0, 1] with Lebesgue measure. Consider the sequence of sets A_1 = [0, 1/2], A_2 = [1/2, 1], A_3 = [0, 1/3], A_4 = [1/3, 2/3], and so on. Then, the indicator functions I_{A_n} converge in probability to 0. However, I_{A_n}(ω) does not converge for any ω, and in particular the sequence does not converge almost surely. However, the following holds as a sort of converse:

Proposition 4.4. Suppose f_n converges in probability to f. Then, there is a subsequence f_{n_k} of f_n such that f_{n_k} converges almost surely to f.

Proof. Let B_n^ε := {ω : |f_n(ω) − f(ω)| ≥ ε}. Then,

f_{n_i} → f almost surely iff, for every ε > 0, P(⋂_i ⋃_{j>i} B_{n_j}^ε) = 0.

We know that for any ε, lim_{n→∞} P(B_n^ε) = 0. Now, notice that

P(⋂_n ⋃_{m≥n} B_m^ε) ≤ inf_n P(⋃_{m≥n} B_m^ε) ≤ inf_n Σ_{m=n}^∞ P(B_m^ε) = lim_{n→∞} Σ_{m=n}^∞ P(B_m^ε).

Furthermore, ε_1 < ε_2 ⇒ B_n^{ε_1} ⊃ B_n^{ε_2} ⇒ P(B_n^{ε_1}) ≥ P(B_n^{ε_2}). Let δ_i := 1/2^i, and choose n_i increasing so that (∀i)(∀n ≥ n_i)(P(B_n^{δ_i}) < δ_i). Choose ε > 0. Note that (∃m)(δ_m < ε). Hence,

P(⋂_i ⋃_{j≥i} B_{n_j}^ε) ≤ lim_{i→∞} Σ_{j=i}^∞ P(B_{n_j}^ε) ≤ lim_{i→∞} Σ_{j=i}^∞ P(B_{n_j}^{δ_j}) ≤ lim_{i→∞} Σ_{j=i}^∞ δ_j = 0,

which is what we wanted.
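The 'typewriter' counterexample can be simulated directly. In the sketch below (the interval endpoints follow the pattern assumed in the example), the interval lengths, and hence P(A_n), tend to 0, while a fixed ω keeps entering and leaving the sets A_n, so I_{A_n}(ω) has no pointwise limit:

```python
# The typewriter sequence: for each k, the intervals [j/k, (j+1)/k] sweep
# across [0, 1]. Their lengths shrink (convergence in probability of the
# indicators to 0), but each omega is hit in every sweep (no a.s. limit).

def typewriter(k_max):
    """Intervals [j/k, (j+1)/k] for k = 1..k_max, j = 0..k-1, in order."""
    return [(j / k, (j + 1) / k) for k in range(1, k_max + 1) for j in range(k)]

intervals = typewriter(60)
omega = 0.3
hits = [a <= omega <= b for (a, b) in intervals]
lengths = [b - a for (a, b) in intervals]

assert lengths[-1] < 0.02      # P(A_n) -> 0 along the sequence
assert sum(hits) >= 60         # omega lands in some A_n during every sweep...
assert not all(hits[-10:])     # ...but also keeps leaving the sets
```

Picking one interval per sweep whose length is summable gives a subsequence of indicators that is eventually 0 at almost every ω, in the spirit of Proposition 4.4.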
Definition 4.5. A sequence of probability measures {α_n} on R converges weakly to α if whenever α{a} = α{b} = 0, for a < b ∈ R, we have

lim_{n→∞} α_n[a, b] = α[a, b].

A sequence of random variables {X_n} converges weakly to X if the induced probability measures {α_n} converge weakly to α. This is denoted α_n ⇒ α or X_n ⇒ X.
Lemma 4.6. Suppose α_n and α are probability measures on R with associated distribution functions F_n and F. Then, α_n ⇒ α iff F_n(x) → F(x) for each continuity point x of F.

Proof. First, note that x is a continuity point of F iff α{x} = 0. Suppose F_n(x) → F(x) for each continuity point x of F. Let a < b be continuity points of F. Then,

lim_{n→∞} α_n[a, b] = lim_{n→∞} (F_n(b) − F_n(a)) = F(b) − F(a) = α[a, b].

For the converse, suppose α_n ⇒ α. Then,

lim_{n→∞} (F_n(b) − F_n(a)) = lim_{n→∞} α_n[a, b] = α[a, b].

Now, we can let a → −∞ in such a way that a is always a continuity point of F. Then, we get lim_n F_n(b) = α(−∞, b] = F(b).
The next result shows that weak convergence is actually ‘weak’:
Proposition 4.7. Suppose X_n converges in probability to X. Then, X_n converges weakly to X.
Proof. Let F_n, F be the distribution functions of X_n, X respectively. Suppose x is a continuity point of F. Note that

{X ≤ x − ε} \ {|X_n − X| ≥ ε} ⊂ {X_n ≤ x}

and

{X_n ≤ x} = {X_n ≤ x and X ≤ x + ε} ∪ {X_n ≤ x and X > x + ε} ⊂ {X ≤ x + ε} ∪ {|X_n − X| ≥ ε}.

Therefore,

P{X ≤ x − ε} − P{|X_n − X| ≥ ε} ≤ P{X_n ≤ x} ≤ P{X ≤ x + ε} + P{|X_n − X| ≥ ε}.

Since for each ε > 0, lim_n P{|X_n − X| ≥ ε} = 0, when we let n → ∞, we have

F(x − ε) ≤ lim inf_{n→∞} F_n(x) ≤ lim sup_{n→∞} F_n(x) ≤ F(x + ε).

Finally, since F is continuous at x, letting ε → 0, we have

lim_{n→∞} F_n(x) = F(x)

so that X_n ⇒ X.
The converse is not true in general. However, if X has a degenerate distribution (takes a single value with probability one), then the converse is true.

Proposition 4.8. Suppose X_n ⇒ X, and X has a degenerate distribution with P{X = a} = 1. Then, X_n →_P X.

Proof. Let α_n and α be the distributions on R induced by X_n and X respectively. Given ε > 0, we have

lim_{n→∞} α_n[a − ε, a + ε] = α[a − ε, a + ε] = 1.

Hence,

lim_{n→∞} P{|X_n − X| ≤ ε} = 1,

and so

lim_{n→∞} P{|X_n − X| > ε} = 0.
5 Product Measures and Independence
Suppose (Ω _{1} , F _{1} ) and (Ω _{2} , F _{2} ) are two measurable spaces. We want to construct a product measurable space with sample space Ω _{1} × Ω _{2} .
Definition 5.1. Let A = {A × B : A ∈ F_1, B ∈ F_2}. Let F_1 × F_2 be the σ-field generated by A. F_1 × F_2 is called the product σ-field of F_1 and F_2.

If P_1 and P_2 are probability measures on the measurable spaces above, then P_1 × P_2(A × B) := P_1(A)P_2(B) gives a probability measure on A. This can be extended in a canonical way to the σ-field F_1 × F_2.
Deﬁnition 5.2. P _{1} ×P _{2} is called the product probability measure of P _{1} and P _{2} .
Let Ω := Ω_1 × Ω_2, F := F_1 × F_2, and P := P_1 × P_2. Note that when calculating integrals with respect to a product probability measure, we can normally perform an iterated integral in any order with respect to the component probability measures. This result is known as Fubini's Theorem.

Before we define a notion of independence, we will give some heuristic considerations. Two events A and B should be independent if A occurring has nothing to do with B occurring. If we denote by P_A(X) the probability that X occurs given that A has occurred, then we see that P_A(X) = P(A ∩ X)/P(A). Now, suppose that A and B are indeed independent. This means that P_A(B) = P(B). But then, P(B) = P(A ∩ B)/P(A), so that P(A ∩ B) = P(A)P(B). This leads us to define:
Definition 5.3. Let (Ω, F, P) be a probability space. Let A_i ∈ F for every i. Let X_i be a random variable for every i.

(i) A_1, …, A_n are independent if P(A_1 ∩ ··· ∩ A_n) = P(A_1)···P(A_n).

(ii) A collection of events {A_i}_{i∈I} is independent if every finite subcollection is independent.

(iii) X_1, …, X_n are independent if for any n sets A_1, …, A_n ∈ B(R), the events {X_i ∈ A_i}_{i=1}^n are independent.

(iv) A collection of random variables {X_i}_{i∈I} is independent if every finite subcollection is independent.
Lemma 5.4. Suppose X, Y are random variables on (Ω, F, P ), with induced distributions α, β on R respectively. Then, X and Y are independent if and only if the distribution induced on R ^{2} by (X, Y ) is α × β.
Lemma 5.5. Suppose X, Y are independent random variables, and suppose that f, g are measurable functions. Then, f (X) and g(Y ) are also independent random variables.
Proposition 5.6. Let X, Y be independent random variables, and let f, g be measurable functions. Suppose that Ef (X) and Eg(Y ) are both ﬁnite. Then, E[f (X)g(Y )] = E[f (X)]E[g(Y )].
Proof. Let α be the distribution on R induced by f(X), and let β be the distribution induced by g(Y). Then, the distribution on R^2 induced by (f(X), g(Y)) is α × β. So,

E[f(X)g(Y)] = ∫_Ω f(X(ω))g(Y(ω)) dP = ∫_R ∫_R uv dα dβ = (∫_R u dα)(∫_R v dβ) = E[f(X)]E[g(Y)].
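Proposition 5.6 can be sanity-checked by a seeded Monte Carlo simulation. The choices of f, g, and the distributions below are my own illustrations, not the paper's:

```python
# Check E[f(X) g(Y)] = E[f(X)] E[g(Y)] for independent X, Y ~ Uniform[0,1],
# with f(x) = x^2 and g(y) = 2y. Both sides should be near (1/3)(1) = 1/3.
import random

random.seed(0)
N = 200_000
xs = [random.random() for _ in range(N)]   # X ~ Uniform[0, 1]
ys = [random.random() for _ in range(N)]   # Y ~ Uniform[0, 1], independent
f = lambda x: x * x
g = lambda y: 2 * y

lhs = sum(f(x) * g(y) for x, y in zip(xs, ys)) / N
rhs = (sum(map(f, xs)) / N) * (sum(map(g, ys)) / N)
assert abs(lhs - 1/3) < 0.01 and abs(rhs - 1/3) < 0.01
```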
6 Characteristic Functions
The inverse Fourier transform of a probability distribution plays a central role in probability theory.
Definition 6.1. Let α be a probability measure on R. Then, the characteristic function of α is

φ_α(t) = ∫_R e^{itx} dα.
If X is a random variable, the characteristic function of the distribution induced on R by X will sometimes be denoted φ_X. The following results demonstrate the importance of the characteristic function in probability.
Proposition 6.2. Suppose α and β are probability measures on R with characteristic functions φ and ψ respectively. Suppose further that for each t ∈ R, φ(t) = ψ(t). Then, α = β.
Theorem 6.3. Let α_n, α be probability measures on R with distribution functions F_n and F and characteristic functions φ_n and φ. Then, the following are equivalent:

(i) α_n ⇒ α.

(ii) for any bounded continuous function f : R → R, lim_{n→∞} ∫_R f(x) dα_n = ∫_R f(x) dα.

(iii) for every t ∈ R, lim_{n→∞} φ_n(t) = φ(t).
Theorem 6.4. Suppose α_n is a sequence of probability measures on R, with characteristic functions φ_n. Suppose that for each t ∈ R, lim_n φ_n(t) =: φ(t) exists and φ is continuous at 0. Then, there is a probability distribution α such that φ is the characteristic function of α. Furthermore, α_n ⇒ α.
Next, we show how to recover the moments of a random variable from its characteristic function.
Definition 6.5. Suppose X is a random variable. Then, the kth moment of X is EX^k. The kth absolute moment of X is E|X|^k.
Proposition 6.6. Let X be a random variable. Suppose that the kth moment of X exists. Then, the characteristic function φ of X is k times continuously differentiable, and

φ^{(k)}(0) = i^k EX^k.
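A numerical illustration of Proposition 6.6 for k = 1 (my own example): for X Bernoulli with parameter 1/2, φ(t) = (1 + e^{it})/2, and a central difference approximates φ'(0) = i·EX = i/2:

```python
# Approximate phi'(0) for X ~ Bernoulli(1/2), whose characteristic function
# is phi(t) = (1 + e^{it})/2, and compare with i * EX = i/2.
import cmath

phi = lambda t: (1 + cmath.exp(1j * t)) / 2

h = 1e-5
dphi0 = (phi(h) - phi(-h)) / (2 * h)       # central difference ~ phi'(0)
assert abs(dphi0 - 0.5j) < 1e-6            # phi'(0) = i * EX = i/2
```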
Now, a result on affine transforms of a random variable:

Proposition 6.7. Suppose X is a random variable, and Y = aX + b. Let φ_X and φ_Y be the characteristic functions of X and Y. Then, φ_Y(t) = e^{itb} φ_X(at).
We will often be interested in sums of independent random variables. Suppose that X and Y are independent random variables with induced distributions α and β on R respectively. Then, the induced distribution of (X, Y) on R^2 is α × β. Consider the map f : R^2 → R given by f(x, y) = x + y. The distribution on R induced from α × β by f is denoted α ∗ β, and is called the convolution of α and β; α ∗ β is the distribution of the sum of X and Y.

Proposition 6.8. Suppose X and Y are independent random variables with distributions α and β respectively. Then, φ_{X+Y}(t) = φ_X(t)φ_Y(t).
Proof.

φ_{α∗β}(t) = ∫_R e^{itz} d(α ∗ β) = ∫_R ∫_R e^{it(x+y)} dα dβ = ∫_R ∫_R e^{itx} e^{ity} dα dβ = (∫_R e^{itx} dα)(∫_R e^{ity} dβ) = φ_α(t)φ_β(t).
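Proposition 6.8 can be checked exactly for two independent fair dice, where the convolution α ∗ β is a finite sum (an illustrative sketch, not from the paper):

```python
# For two fair dice: the characteristic function of the convolution (the
# distribution of X + Y) should equal the product of the two characteristic
# functions, at every t.
import cmath
from collections import Counter

die = {k: 1/6 for k in range(1, 7)}

# convolution alpha * beta: distribution of X + Y
conv = Counter()
for x, px in die.items():
    for y, py in die.items():
        conv[x + y] += px * py

phi = lambda dist, t: sum(p * cmath.exp(1j * t * x) for x, p in dist.items())

for t in (0.0, 0.7, 2.3):
    assert abs(phi(conv, t) - phi(die, t) * phi(die, t)) < 1e-12
```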
7 Useful Bounds and Inequalities
Here, we will prove some useful bounds regarding random variables and their moments.
Deﬁnition 7.1. Let X be a random variable. Then, the variance of X is Var(X) := E[(X − EX) ^{2} ] = EX ^{2} − (EX) ^{2} .
The variance measures how far, on average, X is spread from its mean. It exists if X has a finite second moment. It is often denoted σ^2.

Lemma 7.2. Suppose X, Y are independent random variables. Then, Var(X + Y) = Var(X) + Var(Y).
Proposition 7.3 (Markov's Inequality). Let ε > 0. Suppose X is a random variable with finite kth absolute moment. Then, P{|X| ≥ ε} ≤ (1/ε^k) E|X|^k.

Proof.

P{|X| ≥ ε} = ∫_{|X|≥ε} dP ≤ (1/ε^k) ∫_{|X|≥ε} |X|^k dP ≤ (1/ε^k) ∫_Ω |X|^k dP = (1/ε^k) E|X|^k.
Corollary 7.4 (Chebyshev's Inequality). Suppose X is a random variable with finite 2nd moment. Then,

P{|X − EX| ≥ ε} ≤ (1/ε^2) Var(X).
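Chebyshev's bound can be compared with an empirical tail probability. In this seeded sketch, X is a sum of 10 fair coin flips (the parameters are my own illustrative choices):

```python
# Empirical check of P{|X - EX| >= eps} <= Var(X)/eps^2 for X the sum of
# 10 fair 0/1 coin flips (EX = 5, Var X = 2.5).
import random

random.seed(1)
N = 50_000
samples = [sum(random.randint(0, 1) for _ in range(10)) for _ in range(N)]

mean, var, eps = 5.0, 2.5, 3.0
tail = sum(1 for x in samples if abs(x - mean) >= eps) / N
assert tail <= var / eps**2       # empirical tail vs the Chebyshev bound
```

The bound here is loose: the true tail is about 0.11, while Chebyshev gives roughly 0.28.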
The following is also a useful fact:

Lemma 7.5. Suppose X is a nonnegative random variable. Then,

Σ_{m=1}^∞ P{X ≥ m} ≤ EX.

Proof.

Σ_{m=1}^∞ P{X ≥ m} = Σ_{m=1}^∞ Σ_{n=m}^∞ P{n ≤ X < n + 1} = Σ_{n=1}^∞ n P{n ≤ X < n + 1} ≤ EX,

since X ≥ n on the event {n ≤ X < n + 1}.
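For an integer-valued nonnegative X the inequality in Lemma 7.5 is in fact an equality, so it can be checked exactly. A sketch with a fair die (my own example, exact rational arithmetic):

```python
# Lemma 7.5 for a fair die: sum_m P{X >= m} vs EX, computed exactly.
from fractions import Fraction

pmf = {k: Fraction(1, 6) for k in range(1, 7)}
EX = sum(k * p for k, p in pmf.items())                  # 7/2
tail_sum = sum(sum(p for k, p in pmf.items() if k >= m)
               for m in range(1, 7))

assert tail_sum <= EX
assert tail_sum == EX      # equality, since X is integer-valued
```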
8 The Borel-Cantelli Lemma

First, let us introduce some terminology. Let A_1, A_2, … be sets. Then,

lim sup_{n→∞} A_n := ⋂_{n=1}^∞ ⋃_{m=n}^∞ A_m.

lim sup_n A_n consists of those ω that appear in A_n infinitely often (i.o.). Also,

lim inf_{n→∞} A_n := ⋃_{n=1}^∞ ⋂_{m=n}^∞ A_m.

lim inf_n A_n consists of those ω that appear in all but finitely many A_n.
Theorem 8.1 (Borel-Cantelli Lemma). Let A_1, A_2, … ∈ F. If Σ_{n=1}^∞ P(A_n) < ∞, then P(lim sup_n A_n) = 0. Furthermore, suppose that the A_i are independent. Then, if Σ_{n=1}^∞ P(A_n) = ∞, then P(lim sup_n A_n) = 1.
Proof. Suppose Σ_{n=1}^∞ P(A_n) < ∞. Then,

P(⋂_{n=1}^∞ ⋃_{m=n}^∞ A_m) = lim_{n→∞} P(⋃_{m=n}^∞ A_m) ≤ lim_{n→∞} Σ_{m=n}^∞ P(A_m) = 0.

For the converse, it is enough to show that

P(⋃_{n=1}^∞ ⋂_{m=n}^∞ A_m^c) = 0,

and so it is also enough to show that

P(⋂_{m=n}^∞ A_m^c) = 0

for all n. By independence, and since 1 − x ≤ e^{−x}, we have

P(⋂_{m=n}^∞ A_m^c) ≤ P(⋂_{m=n}^{n+k} A_m^c) = ∏_{m=n}^{n+k} (1 − P(A_m)) ≤ exp(−Σ_{m=n}^{n+k} P(A_m)).

Since the last sum diverges, taking the limit as k → ∞, we get

P(⋂_{m=n}^∞ A_m^c) = 0.
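The dichotomy in Theorem 8.1 can be illustrated with a seeded sample path: independent events A_n with P(A_n) = 1/n² (summable) versus P(A_n) = 1/n (divergent). The parameters are my own; the assertions check only the two analytic hypotheses:

```python
# One sample path of independent events A_n with P(A_n) = p(n). In the
# summable case occurrences should die out; in the divergent case they
# keep appearing (Borel-Cantelli).
import random, math

random.seed(2)
N = 20_000
occurs = lambda p: [n for n in range(1, N + 1) if random.random() < p(n)]

hits_sq = occurs(lambda n: 1 / n**2)    # sum P(A_n) < infinity
hits_harm = occurs(lambda n: 1 / n)     # sum P(A_n) = infinity

print("summable case, occurrences:", hits_sq)
print("divergent case, number of occurrences:", len(hits_harm))

# the two hypotheses of Theorem 8.1, checked analytically:
assert sum(1 / n**2 for n in range(1, N + 1)) < math.pi**2 / 6   # converges
assert sum(1 / n for n in range(1, N + 1)) > math.log(N)         # diverges
```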
9 The Law of Large Numbers

Let X_1, X_2, … be random variables that are independent and identically distributed (iid). Let S_n := X_1 + ··· + X_n. We will be interested in the asymptotic behavior of the average S_n/n. If X_i has a finite expectation, then we would think that S_n/n would settle down to EX_i. This is known as the Law of Large Numbers. There are two varieties of this law: the Weak Law of Large Numbers and the Strong Law of Large Numbers. The weak law states that the average converges in probability to EX_i. The strong law, however, states that the average converges almost surely to EX_i. The strong law is significantly harder to prove, and requires a bit of additional machinery. For the rest of this section, fix a probability space (Ω, F, P).

Theorem 9.1 (The Weak Law of Large Numbers). Suppose X_1, X_2, … are iid random variables with mean EX_i = m < ∞. Then, S_n/n →_P m.

Proof. Let φ be the characteristic function of X_i. Then, the characteristic function of S_n is [φ(t)]^n, and so, by 6.7, the characteristic function of S_n/n is ψ_n(t) = [φ(t/n)]^n. Furthermore, by 6.6, φ is differentiable, and φ'(0) = im. Therefore, we can form the Taylor expansion

φ(t/n) = 1 + imt/n + o(1/n),

and so

ψ_n(t) = (1 + imt/n + o(1/n))^n.

Taking the limit as n → ∞, we get

lim_{n→∞} ψ_n(t) = e^{imt},

which is the characteristic function of the distribution degenerate at m. Therefore, by Theorem 6.4 and Proposition 4.8, S_n/n converges in probability to m.
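The weak law can be watched numerically. A seeded sketch with Uniform[0, 1] draws (the sample sizes and tolerance below are my own illustrative choices):

```python
# Estimate P{|S_n/n - m| >= eps} for iid Uniform[0,1] draws (m = 1/2):
# the deviation probability should shrink as n grows.
import random

random.seed(3)
m, eps = 0.5, 0.05

def fraction_deviating(n, trials=2000):
    """Monte Carlo estimate of P{|S_n/n - m| >= eps}."""
    bad = 0
    for _ in range(trials):
        avg = sum(random.random() for _ in range(n)) / n
        if abs(avg - m) >= eps:
            bad += 1
    return bad / trials

p10, p1000 = fraction_deviating(10), fraction_deviating(1000)
assert p1000 < p10        # deviations become rarer as n grows
assert p1000 < 0.01       # at n = 1000 they have essentially vanished
```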
Theorem 9.2 (The Strong Law of Large Numbers). Suppose X_1, X_2, … are iid random variables with EX_i = m < ∞. Let S_n = X_1 + ··· + X_n. Then, S_n/n converges to m almost surely.

Proof. We can decompose an arbitrary random variable X_i into its positive and negative parts: X_i^+ := X_i I_{{X_i ≥ 0}} and X_i^− := −X_i I_{{X_i < 0}}, so that X_i = X_i^+ − X_i^−. Then, we have S_n = (X_1^+ + ··· + X_n^+) − (X_1^− + ··· + X_n^−) =: S_n^+ − S_n^−. Hence, it is enough to prove the Theorem for nonnegative X_i.

Now, let Y_i := X_i I_{{X_i ≤ i}}, and let S_n^* := Y_1 + ··· + Y_n. Furthermore, let α > 1, and set u_n := ⌊α^n⌋. We shall first establish the inequality

Σ_{n=1}^∞ P{|S_{u_n}^* − ES_{u_n}^*|/u_n ≥ ε} < ∞  for all ε > 0.  (9.1)

Since the X_i are independent, we have

Var(S_n^*) = Σ_{k=1}^n Var(Y_k) ≤ Σ_{k=1}^n EY_k^2 ≤ Σ_{k=1}^n E[X_i^2 I_{{X_i ≤ k}}] ≤ n E[X_i^2 I_{{X_i ≤ n}}].

By Chebyshev's inequality, we have

Σ_{n=1}^∞ P{|S_{u_n}^* − ES_{u_n}^*|/u_n ≥ ε} ≤ Σ_{n=1}^∞ Var(S_{u_n}^*)/(ε^2 u_n^2) ≤ (1/ε^2) Σ_{n=1}^∞ E[X_i^2 I_{{X_i ≤ u_n}}]/u_n = (1/ε^2) E[X_i^2 Σ_{n=1}^∞ (1/u_n) I_{{X_i ≤ u_n}}].  (9.2)

Now, let K := 2α/(α − 1). Let x > 0, and let N := inf{n : u_n ≥ x}. Then, α^N ≥ x. Also, note that α^n ≤ 2u_n, and so u_n^{−1} ≤ 2α^{−n}. Then,

Σ_{u_n ≥ x} 1/u_n ≤ 2 Σ_{n≥N} α^{−n} = 2α^{−N} Σ_{n=0}^∞ α^{−n} = Kα^{−N} ≤ Kx^{−1},

and hence,

Σ_{n=1}^∞ (1/u_n) I_{{X_i ≤ u_n}} ≤ K X_i^{−1}  if X_i > 0,

and so, putting this into (9.2), we get

(1/ε^2) E[X_i^2 Σ_{n=1}^∞ (1/u_n) I_{{X_i ≤ u_n}}] ≤ (1/ε^2) E[X_i^2 · K X_i^{−1}] = (K/ε^2) EX_i < ∞,

thereby establishing inequality (9.1). Therefore, by the Borel-Cantelli Lemma, we have

P(lim sup_n {|S_{u_n}^* − ES_{u_n}^*|/u_n ≥ ε}) = 0.