
ABDELKADER BENHARI

PROBABILITY, STATISTICS AND RANDOM PROCESSES

This course is an introduction to probability, statistics and random processes.

Contents
I. PROBABILITY .................................................................... 6

Basic Ideas of Probability ............................................................... 7


1. Probability Spaces ........................................................... 7
1.1. Discrete Probability Spaces ................................................... 8
1.2. Continuous Probability Spaces ................................................ 9
1.3. Properties of Probability ..................................................... 9
2. Conditional Probability and Statistical Independence ........................ 11
2.1. Conditional Probability ..................................................... 11
2.2. Composite Probability Formulae .............................................. 11
2.3. Bayes Formulae ........................................................... 12
2.4. Statistical Independence .................................................... 12
Appendix Combinatorics ......................................................... 13
Random Variables and Distributions ...................................................... 15
1. Random Variables ............................................................ 15
1.1. Discrete Random Variables .................................................. 15
1.2. Continuous Random Variables ............................................... 18
1.3. Distributions of Functions of Random Variables ................................. 21
2. Random Vectors (Multidimensional Random Variables) .......................... 23
2.1. Discrete Random Vectors ................................................... 24
2.2. Continuous Random Vectors................................................. 24
2.3. Marginal Distributions/Probabilities/Densities ................................... 24
2.4. Conditional Distributions/Probabilities/Densities ................................. 25
2.5. Independence of Random Variables ........................................... 26
2.6. Distributions of Functions of Random Vectors ................................... 26
Mathematical Expectations (Statistical Average) of Random Variables ........................... 31
1. Mathematical Expectations (Statistical Average) ............................. 31
1.1. Definitions ............................................................... 31
1.2. Properties ................................................................ 32
1.3. Moments ................................................................ 32
1.4. Holder Inequality.......................................................... 34
2. Correlation Coefficients and Linear Regression (Approximation) .............. 35
3. Conditional Expectations and Regression Analysis ............................ 37
4. Generating and Characteristic Functions ..................................... 38
5. Normal Random Vectors ....................................................... 40
Memo ........................................................................... 42
Definition ................................................................... 42
Examples ................................................................... 42
Properties ................................................................... 42
Linear Regression ............................................................. 43
Regression .................................................................. 43
Normal Distribution ........................................................... 43
Limit Theorems ...................................................................... 44


1. Inequalities ................................................................ 44
2. Convergences of Sequences of Random Variables ............................... 45
3. The Weak Laws of Large Numbers .............................................. 46
4. The Strong Laws of Large Numbers ............................................ 47
5. The Central Limit Theorems .................................................. 49
Conditioning. Conditioned distribution and expectation. ............................ 51
1. The conditioned probability and expectation. ................................ 51
2. Properties of the conditioned expectation. .................................. 53
3. Regular conditioned distribution of a random variable. ...................... 59
Transition Probabilities ........................................................... 67
1. Definitions and notations. .................................................. 67
2. The product between a probability and a transition probability. ............. 68
3. Contractivity properties of a transition probability. ....................... 70
4. The product between transition probabilities. ............................... 73
5. Invariant measures. Convergence to a stable matrix .......................... 74
Disintegration of the probabilities on product spaces .............................. 75

1. Regular conditioned distributions. Standard Borel Spaces .................... 75


2. The disintegration of a probability on a product of two spaces ............... 78
3. The disintegration of a probability on a product of n spaces ................. 79
The Normal Distribution ........................................................ 83
1. One-dimensional normal distribution .......................................... 83
2. Multidimensional normal distribution ........................................ 83
3. Properties of the normal distribution ........................................ 86
4. Conditioning inside normal distribution ...................................... 88
5. The multidimensional central limit theorem ................................... 91
II. STATISTICS ..................................................................... 95

Basic Concepts ....................................................................... 96


1. Populations, Samples and Statistics ......................................... 97
2. Sample Distributions ........................................................ 99
2.1. χ² (Chi-Square)-Distribution ................................................ 99
2.2. t (Student)-Distribution .................................................... 100


2.3. F-Distribution ........................................................... 100
3. Normal Populations ......................................................... 103
Parameter Estimation ................................................................. 104
1. Point Estimation ........................................................... 105
1.1. Point Estimators ......................................................... 105
1.2. Method of Moments (MOM)................................................ 105
1.3. Maximum Likelihood Estimation (MLE) ...................................... 106
2. Interval Estimation ........................................................ 108
Tests of Hypotheses .................................................................. 111
1. Parameters from a Normal Population ........................................ 112
2. Parameters from two Independent Normal Populations ......................... 115


III. RANDOM PROCESSES ...................................................... 118

Introduction ........................................................................ 119


1. Definition ................................................................. 120
2. Family of Finite-Dimensional Distributions ................................. 121
3. Mathematical Expectations .................................................. 122
4. Examples ................................................................... 123
4.1. Processes with Independent, Stationary or Orthogonal Increments .................. 123
4.2. Normal Processes ........................................................ 124
Markov Processes (1) ................................................................. 125
1. General Properties ......................................................... 126
2. Discrete-Time Markov Chains ................................................ 128
2.1. Transition Probabilities .................................................... 128
2.2. Classification of States .................................................... 130
2.3. Stationary & Limit Distributions ............................................. 135
2.4. Examples: Simple Random Walks ........................................... 136
Appendix Eigenvalue Diagonalization ........................................... 138
Markov Processes (2) ................................................................. 140
1. Continuous-Time Markov Chains .............................................. 141
1.1. Transition Rates.......................................................... 141
1.2. Kolmogorov Forward and Backward Equations ................................. 142
1.3. Fokker-Planck Equations .................................................. 144
1.4. Ergodicity .............................................................. 145
1.5. Birth and Death Processes .................................................. 146
1.6. Poisson Processes ........................................................ 147
Appendix Queuing Theory .................................................... 153
2. Continuous-Time and Continuous-State Markov Processes ...................... 155
2.1. Basic Ideas.............................................................. 155
2.2. Wiener Processes......................................................... 156
Hidden Markov Models ............................................................... 159
1. Definition of Hidden Markov Models ......................................... 160
2. Assumptions in the theory of HMMs .......................................... 161
3. Three basic problems of HMMs .............................................. 163
3.1. The Evaluation Problem ................................................... 163
3.2. The Decoding Problem .................................................... 163
3.3. The Learning Problem ..................................................... 163
4. The Forward/Backward Algorithm and its Application to the Evaluation Problem 165
5. Viterbi Algorithm and its Application to the Decoding Problem .............. 167
6. Baum-Welch Algorithm and its Application to the Learning Problem ........... 169
6.1. Maximum Likelihood (ML) Criterion ......................................... 169
6.2. Baum-Welch Algorithm ................................................... 169
Second-Order Processes and Random Analysis ............................................. 172
1. Second-Order Random Variables and Hilbert Spaces ........................... 173


2. Second-Order Random Processes .............................................. 174


2.1. Orthogonal Increment Random Processes...................................... 174
3. Random Analysis ............................................................ 176
3.1. Limits ................................................................. 176
3.2. Continuity .............................................................. 176
3.3. Derivatives ............................................................. 177
3.4. Integrals ................................................................ 178
Stationary Processes .................................................................. 179
1. Strictly Stationary Processes .............................................. 180
2. Weakly Stationary Processes ................................................ 181
2.1. Definition .............................................................. 181
2.2. Properties of Correlation/Covariance Functions ................................. 181
2.3. Periodicity .............................................................. 182
2.4. Random Analysis ........................................................ 182
2.5. Ergodicity (Statistical Average = Time Average) ................................ 183
2.6. Spectrum Analysis & White Noise ........................................... 184
3. Discrete-Time Sequence Analysis: Auto-Regressive and Moving-Average (ARMA) Models ... 186
3.1. Definition .............................................................. 186
3.2. Transition Functions ...................................................... 186
3.3. Mathematical Expectations ................................................. 188
3.4. Parameter Estimation ..................................................... 189
4. Problems ................................................................... 193
Martingales ....................................................................... 196
1. Simple properties ..................................................... 197
2. Stopping times ........................................................ 199
3. An application: the ruin problem. ..................................... 205
Convergence of martingales ................................................... 207
1. Maximal inequalities .................................................. 207
2. Almost sure convergence of semimartingales ............................ 210
3. Uniform integrability and the convergence of semimartingales in L¹ .... 214
4. Singular martingales. Exponential martingales. ........................ 218
Bibliography: ...................................................................... 221


I. PROBABILITY

Basic Ideas of Probability


1. Probability Spaces
There are two definitions of probabilities for random events: classical and modern. The modern definition of probability is based on measure theory, in which a random event is nothing but a set and its probability is the measure of the set.

Definition (Sigma-Algebra) Let Ω be a set and ℱ a class of subsets of Ω, i.e., a subset of 2^Ω. ℱ is said to be a σ-algebra of Ω if
(1) Ω ∈ ℱ
(2) if A ∈ ℱ, then Ā = Ω − A ∈ ℱ (which implies that ∅ ∈ ℱ)
(3) if Aᵢ ∈ ℱ, where i ∈ I and I is at most a countable index set, then ∪_{i∈I} Aᵢ ∈ ℱ (which means that the class ℱ is closed with respect to union)

Remark 1: 2^Ω is the power set of Ω, i.e., the set of all subsets of Ω.

Remark 2: In measure theory, (Ω, ℱ) is called a measurable space.
Remark 3: Since ∩_{i∈I} Aᵢ is the complement of ∪_{i∈I} Āᵢ, ℱ is also closed with respect to intersection.

Example Let Ω = {ω₁, ω₂} and ℱ = {∅, {ω₁}, {ω₂}, {ω₁, ω₂}}, where ∅ stands for the empty set; ℱ is then a σ-algebra.

Definition (Probability Space) Let Ω be a set, ℱ a σ-algebra of Ω and P a real-valued function defined on ℱ; the triplet (Ω, ℱ, P) is called a probability space if P satisfies the following conditions:
(1) P(A) ≥ 0 for all A ∈ ℱ
(2) P(∪_{i=1}^{+∞} Aᵢ) = Σ_{i=1}^{+∞} P(Aᵢ) for all A₁, A₂, …, Aₙ, … ∈ ℱ such that Aᵢ ∩ Aⱼ = ∅ when i ≠ j
(3) P(Ω) = 1 (which implies that P(∅) = 0)

Remark 1: Usually, Ω is called the sample space, ℱ the field of random events and, for all A ∈ ℱ, P(A) the probability of occurrence of A.

Remark 2: In measure theory, the probability space (Ω, ℱ, P) is also called a measure space.
Remark 3: Two random events A and B are said to be incompatible if AB = ∅. In this case, P(AB) = 0.

1.1. Discrete Probability Spaces


The number of all possible occurrences in a random experiment is countable.

Definition A probability space (Ω, ℱ, P) is called a discrete probability space if the sample space Ω is a countable (finite or denumerably infinite) set and ℱ = 2^Ω.

Remark 1: To specify a discrete probability P, it suffices to specify a mapping p: Ω → [0,1] such that p(ω) ≥ 0 for all ω ∈ Ω and Σ_{ω∈Ω} p(ω) = 1. Then, for all A ∈ ℱ, P(A) = Σ_{ω∈A} p(ω).

Remark 2: If Ω = {ω₁, ω₂, …, ω_N} and p(ωᵢ) = 1/N, where i = 1, 2, …, N, then the resulting triplet (Ω, ℱ, P) is called a classical probability space.

Example Let Ω = {ω₁, ω₂}, ℱ = {∅, {ω₁}, {ω₂}, {ω₁, ω₂}}, and
(1) p(ω₁) = 1/3, p(ω₂) = 2/3; then (Ω, ℱ, P) is a discrete probability space
(2) p(ω₁) = p(ω₂) = 1/2; then (Ω, ℱ, P) is a classical probability space

Example Let Ω = {ω₁, ω₂, …, ωₙ, …}, ℱ = 2^Ω and
p(ωₙ) = (1/n²) / (Σ_{k=1}^{+∞} 1/k²) = 6/(πn)², n = 1, 2, …;
then (Ω, ℱ, P) is a discrete probability space.

1.2. Continuous Probability Spaces


The number of all possible occurrences in a random experiment is uncountable.

Definition A probability space (Ω, ℱ, P) is called a continuous probability space if the sample space Ω is a continuum.

Example (Geometric Probability) Assume that the sample space Ω is an interval, an area or a volume; then the probability of a point falling into a part of Ω is given by
P = (measure of the part of Ω) / (measure of Ω)

1.3. Properties of Probability


Theorem (Finite Measure) Let (Ω, ℱ, P) be a probability space; then for all A ∈ ℱ,
P(A) + P(Ā) = P(Ω) = 1 ⟹ P(A) ≤ 1

Theorem (Monotonicity) Let (Ω, ℱ, P) be a probability space; then for all A, B ∈ ℱ,
A ⊆ B ⟹ P(A) ≤ P(A) + P(B − A) = P(B)

Theorem (Union) Let (Ω, ℱ, P) be a probability space; then for all A, B ∈ ℱ,
P(A ∪ B) = P(A ∪ (B − A)) = P(A) + P(B − A) = P(A) + P(B) − P(A ∩ B)

Theorem (Union) Let (Ω, ℱ, P) be a probability space; then for all A₁, A₂, …, Aₙ ∈ ℱ,
P(∪_{i=1}^{n} Aᵢ) = Σ_{k=1}^{n} (−1)^{k−1} Σ_{1≤i₁<⋯<i_k≤n} P(A_{i₁} ⋯ A_{i_k})

Hint: Proceed by induction on n, starting from
P(∪_{i=1}^{n+1} Aᵢ) = P(∪_{i=1}^{n} Aᵢ) + P(A_{n+1}) − P(∪_{i=1}^{n} (Aᵢ A_{n+1})).
Expanding the two unions on the right-hand side with the induction hypothesis and collecting the terms of equal order k gives
P(∪_{i=1}^{n+1} Aᵢ) = Σ_{k=1}^{n+1} (−1)^{k−1} Σ_{1≤i₁<⋯<i_k≤n+1} P(A_{i₁} ⋯ A_{i_k}).

2. Conditional Probability and Statistical Independence


2.1. Conditional Probability
Definition Let (Ω, ℱ, P) be a probability space and A, B ∈ ℱ; the conditional probability of B, given that A has occurred, is defined as
P(B|A) = P(AB)/P(A), where P(A) > 0.

Theorem Let (Ω, ℱ, P) be a probability space and A ∈ ℱ with P(A) > 0; the triplet (Ω_A, ℱ_A, P_A) is also a probability space, where Ω_A = Ω ∩ A, ℱ_A = {AB | B ∈ ℱ} and P_A(AB) = P(B|A).

2.2. Composite Probability Formulae


Theorem (Composite Probability Formula) Let (Ω, ℱ, P) be a probability space and A ∈ ℱ. If A ⊆ ∪_k E_k, where E_k ∈ ℱ with P(E_k) > 0 and E_i ∩ E_j = ∅ for all i ≠ j, then
P(A) = Σ_k P(A|E_k) P(E_k).
Proof:
P(A) = P(A ∩ ∪_k E_k) = P(∪_k (A E_k)) = Σ_k P(A E_k) = Σ_k P(A|E_k) P(E_k) #
Remark:
A ⊆ B ⟹ A ∩ B = A, and A ∩ ∪_k E_k = ∪_k (A ∩ E_k)

2.3. Bayes Formulae


Theorem (Bayes Formula) Let (Ω, ℱ, P) be a probability space and A ∈ ℱ with P(A) > 0. If A ⊆ ∪_k E_k, where E_k ∈ ℱ with P(E_k) > 0 and E_i ∩ E_j = ∅ for all i ≠ j, then
P(E_i|A) = P(E_i) P(A|E_i) / Σ_k P(E_k) P(A|E_k).
Proof:
P(E_i|A) = P(A E_i)/P(A) = P(E_i) P(A|E_i) / Σ_k P(E_k) P(A|E_k) #
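As a quick numerical illustration of the composite probability and Bayes formulas, the following Python sketch computes a posterior over a finite partition {E_k}. The three-machine scenario and all numbers are hypothetical, chosen only to exercise the formulas.

# Hypothetical setup: three machines E1, E2, E3 produce an item,
# A is the event "the item is defective".
prior = [0.5, 0.3, 0.2]          # P(E_k); the E_k partition the sample space
likelihood = [0.01, 0.02, 0.05]  # P(A | E_k)

# Composite probability formula: P(A) = sum_k P(A|E_k) P(E_k)
p_a = sum(l * p for l, p in zip(likelihood, prior))

# Bayes formula: P(E_i | A) = P(E_i) P(A|E_i) / P(A)
posterior = [p * l / p_a for p, l in zip(prior, likelihood)]

print(p_a)          # 0.021
print(posterior)    # [0.238..., 0.285..., 0.476...]; the posteriors sum to 1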

2.4. Statistical Independence


Definition Let (Ω, ℱ, P) be a probability space and A, B ∈ ℱ; A and B are said to be statistically independent if P(AB) = P(A)P(B).
Remark 1: If A and B are independent and P(B) > 0, then P(A|B) = P(AB)/P(B) = P(A).
Remark 2: Recall that two events A and B are said to be incompatible if AB = ∅. In this case, P(AB) = 0.

Definition Let (Ω, ℱ, P) be a probability space and 𝒜 a subset of ℱ; 𝒜 is said to be statistically independent if for all finite subsets ℬ of 𝒜, P(∩_{A∈ℬ} A) = Π_{A∈ℬ} P(A).
Remark: The pairwise statistical independence of the events of 𝒜 cannot guarantee the statistical independence of 𝒜. For example, if 𝒜 = {A, B, C}, 𝒜 is statistically independent only if P(AB) = P(A)P(B), P(AC) = P(A)P(C), P(BC) = P(B)P(C) and P(ABC) = P(A)P(B)P(C) hold at the same time.

Appendix Combinatorics
Sample Selection Suppose there are m distinguishable elements; how many ways are there to select r elements from these m distinguishable elements?

Order counts? | Repetitions allowed (with/without replacement)? | Number of ways to choose the samples | Remarks
Yes | Yes | m^r | Permutation
Yes | No | m!/(m−r)! | Permutation
No | Yes | (m+r−1)!/(r!(m−1)!) | Combination
No | No | m!/(r!(m−r)!) | Combination

Balls into Cells There are eight different ways in which n balls can be placed into k cells:

Distinguish the balls? | Distinguish the cells? | Can cells be empty? | Number of ways to place n balls into k cells
Yes | Yes | Yes | k^n
Yes | Yes | No | k!·S(n, k)
No | Yes | Yes | (k+n−1)!/(n!(k−1)!)
No | Yes | No | (n−1)!/((k−1)!(n−k)!)
Yes | No | Yes | Σ_{r=1}^{k} S(n, r)
Yes | No | No | S(n, k)
No | No | Yes | Σ_{r=1}^{k} p_r(n)
No | No | No | p_k(n)

where S(n, k) = (1/k!) Σ_{r=1}^{k} (−1)^{k−r} C_k^r r^n is the Stirling number of the second kind and p_k(n) is the number of partitions of the number n into exactly k integer pieces.
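The counting formulas above are easy to check numerically. The sketch below, a minimal illustration using only the Python standard library, evaluates the four sample-selection counts and the Stirling-number formula quoted above; the values of m, r, n, k are arbitrary.

from math import comb, factorial

m, r = 5, 3
print(m ** r)                             # ordered, with replacement: 125
print(factorial(m) // factorial(m - r))   # ordered, without replacement: 60
print(comb(m + r - 1, r))                 # unordered, with replacement: 35
print(comb(m, r))                         # unordered, without replacement: 10

def stirling2(n, k):
    # Stirling number of the second kind, via the formula in the text:
    # (1/k!) * sum_{r=1}^{k} (-1)^(k-r) C(k, r) r^n
    return sum((-1) ** (k - r) * comb(k, r) * r ** n for r in range(1, k + 1)) // factorial(k)

print(stirling2(4, 2))   # 7: ways to split 4 distinguishable balls into 2 nonempty indistinct cells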


Random Variables and Distributions


1. Random Variables
Definition Let (Ω, ℱ, P) be a probability space; a random variable is a function ξ: Ω → R (real numbers) such that for all x ∈ R, E(x) = {ω | ω ∈ Ω, ξ(ω) < x} ∈ ℱ.

Remark 1: In terms of measure theory, a random variable is in fact a measurable function over the measurable space (Ω, ℱ).
Remark 2: In application, a random variable ξ can be used to depict a random experiment and E(x) can be used to depict a result of the experiment, i.e., a random event.

Definition Let (Ω, ℱ, P) be a probability space and ξ a random variable; then the probability
F(x) = P{ω | ω ∈ Ω, ξ(ω) < x}
is called the distribution (function) of ξ.

Theorem Let F(x) be the distribution of a random variable; then
(1) F(x) is monotone increasing
(2) F(x) is continuous from the left
(3) lim_{x→−∞} F(x) = 0, lim_{x→+∞} F(x) = 1

Remark 1: If the distribution F(x) is defined as F(x) = P{ω | ω ∈ Ω, ξ(ω) ≤ x}, then F(x) is continuous from the right.
Remark 2: For all a < b, P{a ≤ ξ < b} = F(b) − F(a).

1.1. Discrete Random Variables


Definition A random variable is said to be a discrete random variable if its distribution function is not continuous.

Remark: If ξ is a discrete random variable, then F(x) = P{ξ < x} = Σ_{k<x} P{ξ = k}. Note that F(x) is continuous from the left. For all x, P{ξ = x} = F(x + 0) − F(x).

1.1.1. Bernoulli Distribution


Example (Bernoulli Distribution) A discrete random variable ξ is said to have a 0-1 (Bernoulli) distribution if
P{ξ = k} = p for k = 1; 1 − p = q for k = 0; 0 otherwise, where p > 0 and p + q = 1.
In this case, we have
F(x) = P{ξ < x} = Σ_{k<x} P{ξ = k} = 0 for x ≤ 0; q for 0 < x ≤ 1; 1 for x > 1.
Note that F(x) is continuous from the left.

1.1.2. Binomial Distribution


Example (Binomial Distribution) A discrete random variable ξ is said to have a binomial distribution if
P{ξ = k} = C_n^k p^k q^{n−k}, where p > 0, p + q = 1, k = 0, 1, …, n, C_n^k = n!/(k!(n−k)!)

Remark 1: Note that (a + b)^n = Σ_{k=0}^{n} C_n^k a^k b^{n−k}.
Remark 2: If {ξ = k} denotes the event that among n independent random experiments exactly k are successful, then P{ξ = k} = C_n^k p^k q^{n−k}.

Theorem If for all n, np_n = λ = const., then
lim_{n→+∞} C_n^k p_n^k (1 − p_n)^{n−k} = (λ^k/k!) e^{−λ}
Proof:
Recall that lim_{x→∞} (1 + t/x)^x = e^t; we have
lim_{n→+∞} C_n^k p_n^k (1 − p_n)^{n−k} = lim_{n→+∞} C_n^k (λ/n)^k (1 − λ/n)^{n−k}
with C_n^k (λ/n)^k = (λ^k/k!)(1 − 1/n)⋯(1 − (k−1)/n) → λ^k/k! and (1 − λ/n)^{n−k} → e^{−λ}, so the limit is (λ^k/k!) e^{−λ}. #
Remark: For n large enough, C_n^k p^k (1 − p)^{n−k} ≈ ((np)^k/k!) e^{−np}.

Example If the variables ξ₁, ξ₂, …, ξₙ are statistically independent and distributed with the same 0-1 distribution, then the variable ξ = Σ_{i=1}^{n} ξᵢ possesses the binomial distribution.
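A quick numerical check of the Poisson limit theorem above; the choice n = 200, p = 0.02 (so λ = np = 4) is an arbitrary illustration of the regime "n large, p small".

from math import comb, exp, factorial

n, p = 200, 0.02
lam = n * p
for k in range(8):
    binom = comb(n, k) * p ** k * (1 - p) ** (n - k)   # C_n^k p^k q^(n-k)
    poisson = lam ** k / factorial(k) * exp(-lam)      # (lam^k / k!) e^(-lam)
    print(k, round(binom, 5), round(poisson, 5))
# the two columns already agree to about two decimal places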

1.1.3. Negative Binomial Distribution


Example (Negative Binomial Distribution) A discrete random variable ξ is said to have a negative binomial distribution if
P{ξ = k} = C_{k−1}^{n−1} p^n q^{k−n}, where p > 0, p + q = 1 and k = n, n+1, …

1.1.4. Geometric Distribution


Example (Geometric Distribution) A discrete random variable ξ is said to have a geometric distribution if
P{ξ = k} = q^{k−1} p, where p > 0, p + q = 1 and k = 1, 2, …

Remark: If {ξ = k} denotes the event that the kth experiment is the first successful one, then P{ξ = k} = q^{k−1} p.

1.1.5. Hypergeometric Distribution


Example (Hypergeometric Distribution) A discrete random variable ξ is said to have a hypergeometric distribution if
P{ξ = k} = C_M^k C_{N−M}^{n−k} / C_N^n, where M < N, k ≤ M, n ≤ N and k = 0, 1, …, n

1.1.6. Poisson Distribution

Example (Poisson Distribution) A discrete random variable ξ is said to have a Poisson distribution if
P{ξ = k} = (λ^k/k!) e^{−λ}, where λ > 0, k = 0, 1, …

1.2. Continuous Random Variables


Definition A random variable is said to be a continuous random variable if its distribution
function is continuous.
Definition A function f(x) is called a probability density function if f(x) ≥ 0 and
∫_{−∞}^{+∞} f(x) dx = 1.
Remark: It can be easily proven that the function
F(x) = ∫_{−∞}^{x} f(ξ) dξ
is a distribution function, i.e., F(x) is monotone increasing, continuous and lim_{x→−∞} F(x) = 0, lim_{x→+∞} F(x) = 1.

Theorem Let ξ be a continuous random variable with distribution F(x); then there must be a probability density function f(x) such that F(x) = ∫_{−∞}^{x} f(ξ) dξ.
Remark: For a continuous random variable, the relation between its distribution and its probability density function is as follows:
F(x) = ∫_{−∞}^{x} f(ξ) dξ ⟺ F′(x) = f(x)

1.2.1. Uniform Distribution


Definition A continuous random variable ξ is said to have a uniform distribution if its density function is as follows:
f(x) = 1/(b−a) for x ∈ (a, b); 0 otherwise

1.2.2. Normal Distribution

Definition A continuous random variable ξ is said to have a normal distribution N(μ, σ²) if its density function is as follows:
f(x) = (1/(√(2π)σ)) e^{−(x−μ)²/(2σ²)}, x ∈ (−∞, +∞)

1.2.3. Exponential Distribution


Definition A continuous random variable ξ is said to have an exponential distribution if its density function is as follows:
f(x) = λe^{−λx} for x ≥ 0; 0 for x < 0, where λ > 0
Remark: The distribution of ξ follows immediately:
F(x) = P{ξ < x} = ∫_{−∞}^{x} f(t) dt = 1 − e^{−λx} for x ≥ 0; 0 for x < 0

Theorem (Necessary Conditions) If a random variable ξ is exponentially distributed with the parameter λ, then for all x ≥ 0 and Δx > 0, we have
P{ξ < x + Δx | ξ ≥ x} = λΔx + o(Δx)
where o(Δx) is a higher-order infinitesimal of Δx, i.e., lim_{Δx→0} o(Δx)/Δx = 0.
Proof:
(1) At first, we have
P{ξ ≥ x + Δx | ξ ≥ x} = P{ξ ≥ x + Δx; ξ ≥ x}/P{ξ ≥ x} = P{ξ ≥ x + Δx}/P{ξ ≥ x} = e^{−λ(x+Δx)}/e^{−λx} = e^{−λΔx} = P{ξ ≥ Δx}
This property is often called memorylessness.
(2) From the memoryless property, we further have
P{ξ < x + Δx | ξ ≥ x} = 1 − P{ξ ≥ x + Δx | ξ ≥ x} = 1 − P{ξ ≥ Δx} = P{ξ < Δx} = 1 − e^{−λΔx} = λΔx + Σ_{k=2}^{+∞} (−1)^{k−1} (λΔx)^k/k! = λΔx + o(Δx) #
Remark: e^{−x} = Σ_{n=0}^{+∞} (−x)^n/n!.

Theorem (Sufficient Conditions) If a continuous random variable ξ satisfies the following conditions
P{ξ ≥ 0} = 1; P{ξ < x + Δx | ξ ≥ x} = λΔx + o(Δx) for all x ≥ 0 and Δx > 0
then it must be exponentially distributed with the parameter λ.
Proof:
Let p(t) = P{ξ ≥ t}; then we have p(0) = P{ξ ≥ 0} = 1 and
p(t + Δt) = P{ξ ≥ t + Δt} = P{ξ ≥ t + Δt; ξ ≥ t} = P{ξ ≥ t + Δt | ξ ≥ t} P{ξ ≥ t} = [1 − P{ξ < t + Δt | ξ ≥ t}] p(t) = [1 − λΔt + o(Δt)] p(t)
which leads to
p′(t) = lim_{Δt→0} (p(t + Δt) − p(t))/Δt = −λ p(t) + lim_{Δt→0} (o(Δt)/Δt) p(t) = −λ p(t)
⟹ p(t) = p(0) e^{−λt} = e^{−λt} ⟹ F(t) = P{ξ < t} = 1 − P{ξ ≥ t} = 1 − p(t) = 1 − e^{−λt}
This shows that the random variable ξ is exponentially distributed. #

Example (Speaking Time) Suppose the probability of a telephone being used at time t and released during the coming period (t, t + Δt] is λΔt + o(Δt); what's the distribution of the time T during which the telephone is being used, i.e., the speaking time of a telephone user?

Example Suppose there are n persons speaking at time t; what's the probability of the event that 2 or more persons finish speaking in the coming time period (t, t + Δt]?
Solution:
Let ξᵢ be a random variable such that {ξᵢ = 1} represents the event that the ith person finishes speaking in the time period (t, t + Δt]; then
P{ξᵢ = 1} = λΔt + o(Δt), P{ξᵢ = 0} = 1 − λΔt + o(Δt)
where i = 1, 2, …, n. Thus, the random variable Σ_{i=1}^{n} ξᵢ represents the number of persons who finish speaking in the coming time period, which leads to
lim_{Δt→0+} P{Σ_{i=1}^{n} ξᵢ ≥ 2}/Δt ≤ lim_{Δt→0+} (1 − P{Σᵢ ξᵢ = 0} − P{Σᵢ ξᵢ = 1})/Δt
= lim_{Δt→0+} (1 − [1 − λΔt + o(Δt)]^n − n[λΔt + o(Δt)][1 − λΔt + o(Δt)]^{n−1})/Δt = 0
This means that P{Σ_{i=1}^{n} ξᵢ ≥ 2} = o(Δt). #
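The memoryless property proved above lends itself to a simulation check: P{ξ ≥ x + Δx | ξ ≥ x} should match P{ξ ≥ Δx}. The sketch below does this for an arbitrary illustrative λ, x and Δx.

import random

random.seed(0)
lam, x, dx = 2.0, 0.5, 0.3
samples = [random.expovariate(lam) for _ in range(1_000_000)]

survived = [s for s in samples if s >= x]
cond = sum(s >= x + dx for s in survived) / len(survived)   # P{xi >= x+dx | xi >= x}
uncond = sum(s >= dx for s in samples) / len(samples)       # P{xi >= dx}
print(cond, uncond)   # both close to exp(-lam*dx), about 0.5488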

1.2.4. Gamma Distribution


Definition A continuous random variable ξ is said to have a Gamma distribution if its density function is as follows:
f(x) = (λ^α/Γ(α)) x^{α−1} e^{−λx} for x > 0; 0 for x ≤ 0, where α > 0, λ > 0
Remark: Gamma function: Γ(α) = ∫_{0}^{+∞} x^{α−1} e^{−x} dx, where α > 0.

1.3. Distributions of Functions of Random Variables


Given the distribution of , what is the distribution of g ( ) ?

Example Let ξ be a random variable, g(x) a continuous function and η = g(ξ).
If the function g(x) is strictly monotone increasing, then
F_η(y) = P{η = g(ξ) < y} = ∫_{g(x)<y} f_ξ(x) dx = ∫_{−∞}^{g⁻¹(y)} f_ξ(x) dx ⟹ f_η(y) = dF_η(y)/dy = f_ξ(g⁻¹(y)) dg⁻¹(y)/dy
If the function g(x) is strictly monotone decreasing, then
F_η(y) = P{η = g(ξ) < y} = ∫_{g(x)<y} f_ξ(x) dx = ∫_{g⁻¹(y)}^{+∞} f_ξ(x) dx ⟹ f_η(y) = dF_η(y)/dy = −f_ξ(g⁻¹(y)) dg⁻¹(y)/dy

Remark 1: To sum up, when g(x) is continuous and strictly monotone,
f_{η=g(ξ)}(y) = f_ξ(g⁻¹(y)) |dg⁻¹(y)/dy|
Remark 2 (differentiation under the integral sign):
d/dx ∫_{f(x)}^{g(x)} h(x, t) dt = g′(x) h(x, g(x)) − f′(x) h(x, f(x)) + ∫_{f(x)}^{g(x)} ∂h(x, t)/∂x dt

Example (Linear Transform) Let ξ be a random variable and η = aξ + b, a > 0; then
F_η(y) = P{η = aξ + b < y} = P{ξ < (y−b)/a} = F_ξ((y−b)/a)
⟹ f_η(y) = dF_η(y)/dy = (1/a) f_ξ((y−b)/a)
Remark: For a ≠ 0, f_η(y) = (1/|a|) f_ξ((y−b)/a).

Example (Parabolic Function) Let ξ be a random variable and η = ξ²; then
F_η(y) = P{η = ξ² < y} = P{−√y < ξ < √y} = F_ξ(√y) − F_ξ(−√y) for y > 0; 0 for y ≤ 0
⟹ f_η(y) = dF_η(y)/dy = (f_ξ(√y) + f_ξ(−√y))/(2√y) for y > 0; 0 for y ≤ 0

Example (Exponential Function) Let ξ be a random variable and η = e^ξ; then
F_η(y) = P{η = e^ξ < y} = P{ξ < ln y} = F_ξ(ln y) for y > 0; 0 for y ≤ 0
⟹ f_η(y) = dF_η(y)/dy = (1/y) f_ξ(ln y) for y > 0; 0 for y ≤ 0

Example (Logarithmic Function) Let ξ be a positive random variable and η = ln ξ; then
F_η(y) = P{η = ln ξ < y} = ∫_{0}^{e^y} f_ξ(x) dx = F_ξ(e^y) ⟹ f_η(y) = dF_η(y)/dy = f_ξ(e^y) e^y

Example (Trigonometric Function) Let ξ be a random variable and η = sin ξ; then
F_η(y) = P{η = sin ξ < y} = 1 for y > 1; Σ_{k=−∞}^{+∞} ∫_{−π−sin⁻¹y+2kπ}^{sin⁻¹y+2kπ} f_ξ(x) dx for −1 < y ≤ 1; 0 for y ≤ −1
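The change-of-variables formula f_η(y) = f_ξ(g⁻¹(y)) |dg⁻¹(y)/dy| can be verified by sampling. The sketch below takes the exponential-function example η = e^ξ with ξ standard normal (an illustrative assumption) and compares an empirical bin probability of η with the analytic density (1/y) f_ξ(ln y) integrated over the same bin.

import math
import random

random.seed(1)
ys = [math.exp(random.gauss(0.0, 1.0)) for _ in range(200_000)]   # eta = e^xi

def f_eta(y):
    # (1/y) f_xi(ln y), with f_xi the standard normal density
    return math.exp(-math.log(y) ** 2 / 2) / (y * math.sqrt(2 * math.pi))

a, b = 1.0, 1.5
empirical = sum(a <= y < b for y in ys) / len(ys)
# crude midpoint integration of f_eta over [a, b]
n = 1000
analytic = sum(f_eta(a + (b - a) * (i + 0.5) / n) for i in range(n)) * (b - a) / n
print(empirical, analytic)   # both about 0.157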

2. Random Vectors (Multidimensional Random Variables)


Definition Let ξ₁, ξ₂, …, ξₙ be n random variables defined on the same probability space; then the vector (ξ₁, ξ₂, …, ξₙ) is called a random vector.

Definition Let (ξ₁, ξ₂, …, ξₙ) be a random vector; then for all (x₁, x₂, …, xₙ) ∈ Rⁿ, the function
F(x₁, x₂, …, xₙ) = P{ξ₁ < x₁; ξ₂ < x₂; …; ξₙ < xₙ}
is called the joint distribution function of (ξ₁, ξ₂, …, ξₙ).

Example Let (ξ, η) be a random vector and F(x, y) its joint distribution; then
P{a ≤ ξ < b; c ≤ η < d} = F(b, d) − F(a, d) − F(b, c) + F(a, c)

2.1. Discrete Random Vectors


Definition If each component of a random vector (ξ₁, ξ₂, …, ξₙ) is a discrete random variable, the random vector (ξ₁, ξ₂, …, ξₙ) is then called a discrete random vector.
Remark: If (ξ₁, ξ₂, …, ξₙ) is a discrete random vector, then
F(x₁, x₂, …, xₙ) = P{ξ₁ < x₁; ξ₂ < x₂; …; ξₙ < xₙ} = Σ_{k₁<x₁} Σ_{k₂<x₂} ⋯ Σ_{kₙ<xₙ} P{ξ₁ = k₁; ξ₂ = k₂; …; ξₙ = kₙ}

2.2. Continuous Random Vectors


Definition If each component of a random vector (ξ₁, ξ₂, …, ξₙ) is a continuous random variable, the random vector (ξ₁, ξ₂, …, ξₙ) is then called a continuous random vector.

Theorem Let (ξ₁, ξ₂, …, ξₙ) be a continuous random vector and F(x₁, x₂, …, xₙ) its joint distribution function; then there is a function of n variables f(x₁, x₂, …, xₙ) such that
(1) f(x₁, x₂, …, xₙ) ≥ 0
(2) ∫_{−∞}^{+∞} ⋯ ∫_{−∞}^{+∞} f(x₁, x₂, …, xₙ) dx₁ dx₂ ⋯ dxₙ = 1
(3) F(x₁, x₂, …, xₙ) = ∫_{−∞}^{x₁} ∫_{−∞}^{x₂} ⋯ ∫_{−∞}^{xₙ} f(ξ₁, ξ₂, …, ξₙ) dξ₁ dξ₂ ⋯ dξₙ
Remark: The function f(x₁, x₂, …, xₙ) is called the joint density function of (ξ₁, ξ₂, …, ξₙ), which characterizes the random vector completely.

2.3. Marginal Distributions/Probabilities/Densities


Definition Let (ξ₁, ξ₂, …, ξₙ) be a random vector and F(x₁, x₂, …, xₙ) its distribution; then the marginal distribution of any sub-vector of (ξ₁, ξ₂, …, ξₙ), say (ξ₁, ξ₂, …, ξ_p), p < n, is given by
F(x₁, x₂, …, x_p) = F(x₁, x₂, …, x_p, x_{p+1} = +∞, …, xₙ = +∞)

Remark: In the discrete case, we prefer the marginal probability as follows:
P{ξ₁ = k₁; …; ξ_p = k_p} = Σ_{k_{p+1}} ⋯ Σ_{kₙ} P{ξ₁ = k₁; …; ξ_p = k_p; ξ_{p+1} = k_{p+1}; …; ξₙ = kₙ}
In the continuous case, we prefer the marginal density as follows:
f(ξ₁, ξ₂, …, ξ_p) = ∫_{−∞}^{+∞} ⋯ ∫_{−∞}^{+∞} f(ξ₁, …, ξ_p, ξ_{p+1}, …, ξₙ) dξ_{p+1} ⋯ dξₙ

2.4. Conditional Distributions/Probabilities/Densities


Definition Let (ξ₁, ξ₂, …, ξₙ) be a discrete random vector and F(x₁, x₂, …, xₙ) its distribution; then the conditional distribution of (ξ₁, ξ₂, …, ξₙ), given that its sub-vector (ξ_{p+1}, …, ξₙ), p < n, has taken a certain value, say (k_{p+1}, …, kₙ), is given by
F_{ξ₁,…,ξ_p | ξ_{p+1},…,ξₙ}(x₁, …, x_p | k_{p+1}, …, kₙ) = Σ_{k₁<x₁} ⋯ Σ_{k_p<x_p} P{ξ₁ = k₁; …; ξ_p = k_p; ξ_{p+1} = k_{p+1}; …; ξₙ = kₙ} / P{ξ_{p+1} = k_{p+1}; …; ξₙ = kₙ}
Remark: Again, in the discrete case, we prefer the conditional probability to the conditional distribution:
P{ξ₁ = k₁; …; ξ_p = k_p | ξ_{p+1} = k_{p+1}; …; ξₙ = kₙ} = P{ξ₁ = k₁; …; ξₙ = kₙ} / P{ξ_{p+1} = k_{p+1}; …; ξₙ = kₙ}

Definition Let (ξ₁, ξ₂, …, ξₙ) be a continuous random vector and F(x₁, x₂, …, xₙ) its distribution; then the conditional distribution of (ξ₁, ξ₂, …, ξₙ), given that the sub-vector (ξ_{p+1}, …, ξₙ), p < n, has taken certain values, say (x_{p+1}, …, xₙ), is given by
F_{ξ₁,…,ξ_p | ξ_{p+1},…,ξₙ}(x₁, …, x_p | x_{p+1}, …, xₙ) = ∫_{−∞}^{x₁} ⋯ ∫_{−∞}^{x_p} f(ξ₁, …, ξ_p, x_{p+1}, …, xₙ) dξ₁ ⋯ dξ_p / f_{ξ_{p+1}⋯ξₙ}(x_{p+1}, …, xₙ)
Remark: In practice, the conditional density
f(ξ₁, …, ξ_p | x_{p+1}, …, xₙ) = f(ξ₁, …, ξ_p, x_{p+1}, …, xₙ) / f_{ξ_{p+1}⋯ξₙ}(x_{p+1}, …, xₙ)
is preferred to the conditional distribution.

2.5. Independence of Random Variables


Definition The random variables ξ₁, ξ₂, …, ξₙ are said to be independent if for all x₁, x₂, …, xₙ ∈ R,
P{ξ₁ < x₁; ξ₂ < x₂; …; ξₙ < xₙ} = P{ξ₁ < x₁} P{ξ₂ < x₂} ⋯ P{ξₙ < xₙ}
or, expressed in distributions,
F_{ξ₁ξ₂⋯ξₙ}(x₁, x₂, …, xₙ) = F_{ξ₁}(x₁) F_{ξ₂}(x₂) ⋯ F_{ξₙ}(xₙ)

Remark 1: If the random variables ξ₁, ξ₂, …, ξₙ are independent, then any subset of ξ₁, ξ₂, …, ξₙ, say ξ_{i₁}, ξ_{i₂}, …, ξ_{i_k}, k < n, is also independent, i.e.,
P{ξ_{i₁} < x_{i₁}; ξ_{i₂} < x_{i₂}; …; ξ_{i_k} < x_{i_k}} = P{ξ_{i₁} < x_{i₁}} P{ξ_{i₂} < x_{i₂}} ⋯ P{ξ_{i_k} < x_{i_k}}
Remark 2: For discrete random variables, the independence can be stated as
P{ξ₁ = x₁; ξ₂ = x₂; …; ξₙ = xₙ} = P{ξ₁ = x₁} P{ξ₂ = x₂} ⋯ P{ξₙ = xₙ}
Also, for continuous random variables, the independence can be stated as
f(x₁, x₂, …, xₙ) = f₁(x₁) f₂(x₂) ⋯ fₙ(xₙ)
where f(x₁, x₂, …, xₙ) is the joint probability density function of ξ₁, ξ₂, …, ξₙ, and fᵢ(x) is the probability density function of ξᵢ, i = 1, 2, …, n.

2.6. Distributions of Functions of Random Vectors


Example (Addition) Let ξ and η be two random variables and ζ = ξ + η; then
F_ζ(z) = P{ζ = ξ + η < z} = ∬_{x+y<z} f(x, y) dx dy = ∫_{−∞}^{+∞} (∫_{−∞}^{z−y} f(x, y) dx) dy
(substituting x + y = u) = ∫_{−∞}^{z} (∫_{−∞}^{+∞} f(u − y, y) dy) du
where f_ζ(z) = dF_ζ(z)/dz = ∫_{−∞}^{+∞} f(z − y, y) dy.
If the random variables ξ and η are independent, then
f_ζ(z) = ∫_{−∞}^{+∞} f_ξ(z − y) f_η(y) dy = (f_ξ * f_η)(z)

Example (Addition) Let T₁, T₂, …, Tₙ, … be independent exponential random variables with the same parameter λ. Show that the distribution of Sₙ = T₁ + T₂ + ⋯ + Tₙ is the gamma distribution:
f_{Sₙ}(x) = λⁿ x^{n−1} e^{−λx}/(n−1)! for x ≥ 0; 0 for x < 0, where n ≥ 1
Solution:
When n = 1, the claim is self-evident. For n ≥ 1, Sₙ = Σ_{k=1}^{n} T_k is first assumed to be gamma-distributed; the density of S_{n+1} = Sₙ + T_{n+1} is then given by
f_{S_{n+1}}(x) = ∫_{−∞}^{+∞} f_{Sₙ}(t) f_{T_{n+1}}(x − t) dt = ∫_{0}^{x} (λⁿ t^{n−1}/(n−1)!) e^{−λt} λ e^{−λ(x−t)} dt = (λ^{n+1} e^{−λx}/(n−1)!) ∫_{0}^{x} t^{n−1} dt = (λ^{n+1} xⁿ/n!) e^{−λx}
By induction, the claim holds for all n. #
Remark: It follows that
lim_{x→0+} P{Sₙ < x}/x = lim_{x→0+} (1/x) ∫_{0}^{x} (λⁿ t^{n−1}/(n−1)!) e^{−λt} dt = lim_{x→0+} (λⁿ x^{n−1}/(n−1)!) e^{−λx} = λ for n = 1; 0 for n ≥ 2
⟹ P{Sₙ < x} = o(x), n ≥ 2
This remark shows that the probability of 2 or more calls by a telephone user during a short period is a higher-order infinitesimal of the period.
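A Monte Carlo check of this example: the sum of n independent Exp(λ) variables should have mean n/λ and variance n/λ², the moments of the Gamma density just derived. The parameters below are arbitrary illustrative choices.

import random

random.seed(2)
lam, n, trials = 1.5, 5, 200_000
sums = [sum(random.expovariate(lam) for _ in range(n)) for _ in range(trials)]

mean = sum(sums) / trials
var = sum((s - mean) ** 2 for s in sums) / trials
print(mean, n / lam)         # about 3.333
print(var, n / lam ** 2)     # about 2.222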

Example (Subtraction) Let ξ and η be two random variables and ζ = ξ − η; then
F_ζ(z) = P{ζ = ξ − η < z} = ∬_{x−y<z} f(x, y) dx dy = ∫_{−∞}^{+∞} (∫_{−∞}^{z+y} f(x, y) dx) dy
(substituting x − y = u) = ∫_{−∞}^{z} (∫_{−∞}^{+∞} f(u + y, y) dy) du
where f_ζ(z) = dF_ζ(z)/dz = ∫_{−∞}^{+∞} f(z + y, y) dy.
If the random variables ξ and η are independent, then
f_ζ(z) = ∫_{−∞}^{+∞} f(z + y, y) dy = ∫_{−∞}^{+∞} f_ξ(z + y) f_η(y) dy

Example (Division) Let ξ and η be two random variables and ζ = ξ/η; then
F_ζ(z) = P{ζ = ξ/η < z} = ∬_{x/y<z} f(x, y) dx dy = ∫_{−∞}^{0} (∫_{zy}^{+∞} f(x, y) dx) dy + ∫_{0}^{+∞} (∫_{−∞}^{zy} f(x, y) dx) dy
(substituting x = uy in each inner integral) = ∫_{−∞}^{z} (∫_{−∞}^{+∞} |y| f(uy, y) dy) du
where f_ζ(z) = dF_ζ(z)/dz = ∫_{−∞}^{+∞} |y| f(zy, y) dy.

Example (Multiplication) Let ξ and η be two random variables and ζ = ξη; then
F_ζ(z) = P{ζ = ξη < z} = ∬_{xy<z} f(x, y) dx dy = ∫_{−∞}^{0} (∫_{z/y}^{+∞} f(x, y) dx) dy + ∫_{0}^{+∞} (∫_{−∞}^{z/y} f(x, y) dx) dy
(substituting xy = u) = ∫_{−∞}^{z} (∫_{−∞}^{+∞} (1/|y|) f(u/y, y) dy) du
where f_ζ(z) = dF_ζ(z)/dz = ∫_{−∞}^{+∞} (1/|y|) f(z/y, y) dy.

Example Suppose ξ and η are independent random variables with the same exponential distribution λ, i.e.,
f(x, y) = f_ξ(x) f_η(y) = λ² e^{−λ(x+y)} for x > 0, y > 0; 0 otherwise
then, for ζ = ξ + η and θ = ξ/η,
F_{ζθ}(u, v) = P{ζ = ξ + η < u; θ = ξ/η < v} = ∬_{0<x+y<u, 0<x/y<v} f(x, y) dx dy for u > 0, v > 0; 0 otherwise.
Substituting p = x + y, q = x/y (so that x = pq/(1+q), y = p/(1+q)),
F_{ζθ}(u, v) = ∬_{0<p<u, 0<q<v} f(pq/(q+1), p/(q+1)) (p/(q+1)²) dp dq for u > 0, v > 0; 0 otherwise
⟹ f_{ζθ}(u, v) = f(uv/(v+1), u/(v+1)) (u/(v+1)²) = (u λ² e^{−λu})/(v+1)² for u > 0, v > 0; 0 otherwise
Remark 1: With x = pq/(1+q), y = p/(1+q), the Jacobian is
J = ∂(x, y)/∂(p, q), |J| = p/(1+q)² ⟹ dx dy = |J| dp dq = (p/(1+q)²) dp dq
Remark 2: f_{ζθ}(u, v) can be obtained in another way:
F_{ζθ}(u, v) = P{ξ + η < u; ξ < vη} = ∬_{0<x+y<u, 0<x<vy} f(x, y) dx dy = (v/(1+v))(1 − e^{−λu} − λu e^{−λu}) for u > 0, v > 0
⟹ f_{ζθ}(u, v) = ∂²F_{ζθ}(u, v)/∂u∂v = (λ²u e^{−λu})/(v+1)² for u > 0, v > 0; 0 otherwise

Theorem (Jacobian Transform) Let
(η₁, η₂, …, ηₙ) = (f₁(ξ₁, …, ξₙ), f₂(ξ₁, …, ξₙ), …, fₙ(ξ₁, …, ξₙ))
be a one-to-one correspondence with inverse
(ξ₁, ξ₂, …, ξₙ) = (g₁(η₁, …, ηₙ), g₂(η₁, …, ηₙ), …, gₙ(η₁, …, ηₙ));
then
F_{η₁η₂⋯ηₙ}(y₁, y₂, …, yₙ) = P{η₁ < y₁; η₂ < y₂; …; ηₙ < yₙ} = ∫⋯∫_{f₁(x₁,…,xₙ)<y₁, …, fₙ(x₁,…,xₙ)<yₙ} f_{ξ₁ξ₂⋯ξₙ}(x₁, x₂, …, xₙ) dx₁ dx₂ ⋯ dxₙ
(substituting uᵢ = fᵢ(x₁, …, xₙ), i = 1, …, n)
= ∫⋯∫_{u₁<y₁, …, uₙ<yₙ} f_{ξ₁ξ₂⋯ξₙ}(g₁(u₁, …, uₙ), …, gₙ(u₁, …, uₙ)) |J| du₁ du₂ ⋯ duₙ
which leads to
f_{η₁η₂⋯ηₙ}(u₁, u₂, …, uₙ) = f_{ξ₁ξ₂⋯ξₙ}(g₁(u₁, …, uₙ), …, gₙ(u₁, …, uₙ)) |J|
where J = det(∂gᵢ/∂uⱼ)_{i,j=1,…,n} is the Jacobian determinant of the inverse transform.

Mathematical Expectations (Statistical Average) of Random Variables
1. Mathematical Expectations (Statistical Average)
1.1. Definitions
Definition Let ξ be a discrete random variable and g(x) a function; then the mathematical expectation of g(ξ) is defined as
E[g(ξ)] = Σ_k g(k) P{ξ = k}
if Σ_k |g(k)| P{ξ = k} < +∞.

Remark 1: If Σ_k |g(k)| P{ξ = k} < +∞, E[g(ξ)] is then said to be well defined.
Remark 2: The definition can be easily generalized to multivariate distributions. For example,
E[g(ξ, η)] = Σ_{i,j} g(i, j) P{ξ = i; η = j}

Definition Let ξ be a continuous random variable and g(x) a function; then the mathematical expectation of g(ξ) is defined as
E[g(ξ)] = ∫_{−∞}^{+∞} g(x) f(x) dx
if ∫_{−∞}^{+∞} |g(x)| f(x) dx < +∞, where f(x) is the density function of ξ.

Remark 1: If ∫_{−∞}^{+∞} |g(x)| f(x) dx < +∞, E[g(ξ)] is then said to be well defined.
Remark 2: The definition can be easily generalized to multivariate distributions. For example,
E[g(ξ, η)] = ∬_{−∞}^{+∞} g(x, y) f(x, y) dx dy
where f(x, y) is the joint density function of ξ and η.

1.2. Properties
Theorem The expectation E[·] is a linear operator, i.e.,
E[a f(ξ) + b g(η)] = a E[f(ξ)] + b E[g(η)]
where f(x) and g(x) are two functions, ξ and η two random variables, and a and b two numbers.

Theorem If two random variables ξ and η are independent, then
E[f(ξ) g(η)] = E[f(ξ)] E[g(η)]
where f(x) and g(x) are two functions.

Theorem Let ξ and η be two random variables; then
E[(ξ − η)²] = 0 ⟺ P{ξ = η} = 1
Remark: In terms of probability, P{ξ = η} = 1 means ξ = η.

1.3. Moments
Definition Let ξ be a random variable; then
E[ξ^k] is called the k-th original moment of ξ if E[ξ^k] is well defined;
E[(ξ − Eξ)^k] is called the k-th central moment of ξ if E[(ξ − Eξ)^k] is well defined.

Remark 1: A random variable ξ is said to be second-order if E[ξ²] is well defined.
Remark 2: The first-order original moment of ξ is called the mean of ξ. The second-order central moment of ξ is called the variance of ξ, often denoted by Dξ.

Example Let ξ be a second-order random variable and η = (ξ − Eξ)/√(Dξ); then
E[η] = 0, D[η] = 1
Remark: The variable η = (ξ − Eξ)/√(Dξ) is often called the standardized/normalized variable of ξ.

Theorem (Variational Inequality) For all numbers α, E[(ξ − Eξ)²] ≤ E[(ξ − α)²].
Hint: E[(ξ − α)²] = E[((ξ − Eξ) + (Eξ − α))²] = E[(ξ − Eξ)²] + (Eξ − α)²

Theorem If ξ₁, ξ₂, …, ξₙ are independent, then
D(Σ_{i=1}^{n} λᵢξᵢ) = E[(Σ_{i=1}^{n} λᵢξᵢ − E Σ_{i=1}^{n} λᵢξᵢ)²] = E[(Σ_{i=1}^{n} λᵢ(ξᵢ − Eξᵢ))²]
= Σ_{i=1}^{n} Σ_{j=1}^{n} λᵢλⱼ E[(ξᵢ − Eξᵢ)(ξⱼ − Eξⱼ)] = Σ_{i=1}^{n} λᵢ² E[(ξᵢ − Eξᵢ)²] = Σ_{i=1}^{n} λᵢ² Dξᵢ
Example
Bernoulli distribution: P{ξ = k} = p for k = 1; 1 − p for k = 0; then
Eξ = p, Dξ = E[(ξ − Eξ)²] = p(1 − p)
Binomial distribution: P{ξ = k} = C_n^k p^k q^{n−k}, k = 0, 1, …, n; then
Eξ = np, Dξ = E[(ξ − Eξ)²] = npq
Poisson distribution: P{ξ = k} = (λ^k/k!) e^{−λ}, k = 0, 1, 2, …; then
Eξ = λ, Dξ = E[(ξ − Eξ)²] = E[ξ²] − (Eξ)² = λ
Uniform distribution: f(x) = 1/(b−a) for x ∈ (a, b); 0 otherwise; then
Eξ = (a+b)/2, Dξ = E[(ξ − Eξ)²] = (b−a)²/12
Exponential distribution: f(x) = λe^{−λx} for x > 0; 0 otherwise; then
Eξ = 1/λ, Dξ = E[(ξ − Eξ)²] = 1/λ²
Normal distribution: f(x) = (1/(√(2π)σ)) e^{−(x−μ)²/(2σ²)}, x ∈ (−∞, +∞); then
Eξ = μ, Dξ = σ²
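Each row of the table can be verified by sampling. As one illustration, the sketch below checks the uniform and binomial rows; the parameter values are arbitrary.

import random

random.seed(3)
N = 200_000

# Uniform(a, b): E = (a+b)/2, D = (b-a)^2/12
a, b = 2.0, 5.0
u = [random.uniform(a, b) for _ in range(N)]
m = sum(u) / N
v = sum((x - m) ** 2 for x in u) / N
print(m, (a + b) / 2)         # about 3.5
print(v, (b - a) ** 2 / 12)   # about 0.75

# Binomial(n, p): E = np, D = npq
n, p = 10, 0.3
s = [sum(random.random() < p for _ in range(n)) for _ in range(N)]
m = sum(s) / N
v = sum((x - m) ** 2 for x in s) / N
print(m, n * p)               # about 3.0
print(v, n * p * (1 - p))     # about 2.1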

1.4. Hölder Inequality

Theorem Suppose ξ and η are two random variables defined on the same probability space; then
E[|ξη|] ≤ (E[|ξ|^p])^{1/p} (E[|η|^q])^{1/q}
where p > 1 and 1/p + 1/q = 1.
Proof:
(1) We first prove that u^α v^β ≤ αu + βv, where u ≥ 0, v ≥ 0, 0 < α < 1 and α + β = 1.
Let's begin with the function y = x^α, where 0 < α < 1. Since y″ = α(α − 1)x^{α−2} < 0 for all x > 0, the curve y = x^α must be concave over the range (0, +∞), which leads to
x^α ≤ αx + β, where β = 1 − α and x > 0.
This is because y = αx + β is the tangent of y = x^α at the point x = 1. Note that the above inequality can also be applied to the case x = 0. Let x = u/v, where v > 0 and u ≥ 0; we then have
u^α v^β ≤ αu + βv.
Again, the above inequality can be applied to the case v = 0.
(2) Let
u = |ξ|^p / E[|ξ|^p], v = |η|^q / E[|η|^q], α = 1/p, β = 1/q, where p > 1 and 1/p + 1/q = 1;
we then obtain from the inequality in (1) that
|ξη| / ((E[|ξ|^p])^{1/p} (E[|η|^q])^{1/q}) ≤ (1/p) |ξ|^p/E[|ξ|^p] + (1/q) |η|^q/E[|η|^q].
Applying the mathematical expectation to both sides of the above inequality gives
E[|ξη|] / ((E[|ξ|^p])^{1/p} (E[|η|^q])^{1/q}) ≤ 1/p + 1/q = 1 ⟹ E[|ξη|] ≤ (E[|ξ|^p])^{1/p} (E[|η|^q])^{1/q}

Remark 1: When p = q = 2, the Hölder inequality is also called the Cauchy-Schwarz inequality. In fact, the Cauchy-Schwarz inequality can be proven directly:
0 ≤ E[(|ξ| + x|η|)²] = x² E[η²] + 2x E[|ξη|] + E[ξ²] for all x ⟹ E²[|ξη|] ≤ E[ξ²] E[η²]
Remark 2: By using the Cauchy-Schwarz inequality,
|ρ| = |E[(ξ − Eξ)(η − Eη)]| / √(Dξ Dη) ≤ √(E[(ξ − Eξ)²] E[(η − Eη)²]) / √(Dξ Dη) = 1

2. Correlation Coefficients and Linear Regression (Approximation)

Definition The (linear) correlation coefficient of two random variables ξ and η is defined as
ρ_{ξη} = E[((ξ − Eξ)/√(Dξ)) ((η − Eη)/√(Dη))] = E[(ξ − Eξ)(η − Eη)] / √(Dξ Dη)
if the expectations concerned are well defined.

Remark 1: If ρ = 0, ξ and η are said to be uncorrelated. It follows that statistical independence must lead to uncorrelation.
Remark 2: Note the differences between the concepts of incompatibility (sets), statistical independence (probability) and uncorrelation (mathematical expectation).

Theorem (Linear Correlation) Let ξ and η be two second-order random variables and ρ the correlation coefficient of ξ and η; then
|ρ| = 1 ⟺ η = aξ + b
where a and b are two numbers.
Proof:
(1) If η = aξ + b (a ≠ 0), then
ρ = E[(ξ − Eξ)(η − Eη)] / √(E[(ξ − Eξ)²] E[(η − Eη)²]) = a E[(ξ − Eξ)²] / (|a| E[(ξ − Eξ)²]) = ±1
(2) If ρ = 1, then
E[((ξ − Eξ)/√(Dξ) − (η − Eη)/√(Dη))²] = 1 + 1 − 2ρ = 0
⟹ P{(η − Eη)/√(Dη) = (ξ − Eξ)/√(Dξ)} = 1 ⟹ P{η = aξ + b} = 1
where a = √(Dη/Dξ), b = Eη − √(Dη/Dξ) Eξ.
(3) If ρ = −1, then
E[((ξ − Eξ)/√(Dξ) + (η − Eη)/√(Dη))²] = 1 + 1 + 2ρ = 0
⟹ P{(η − Eη)/√(Dη) = −(ξ − Eξ)/√(Dξ)} = 1 ⟹ P{η = aξ + b} = 1
where a = −√(Dη/Dξ), b = Eη + √(Dη/Dξ) Eξ. #

Example (Linear Regression) Let ξ and η be two second-order random variables and
e(a, b) = E[(η − (aξ + b))²].
How to choose a and b to make the error e(a, b) as small as possible? By taking partial derivatives of e(a, b) with respect to a and b, one can have
∂e(a, b)/∂a = −2E[(η − aξ − b)ξ] = 0, ∂e(a, b)/∂b = −2E[η − aξ − b] = 0
⟹ aE[ξ²] + bμ₁ = E[ξη], aμ₁ + b = μ₂ ⟹ a = ρσ₂/σ₁, b = μ₂ − aμ₁
where μ₁ = Eξ, μ₂ = Eη, σ₁² = Dξ and σ₂² = Dη. Let
L(ξ) = ρ(σ₂/σ₁)(ξ − μ₁) + μ₂;
L(ξ) is often called the linear regression of η, or the linear approximation to η. The error between a random variable and its linear regression is then given by
e²_min = E[(η − L(ξ))²] = (1 − ρ²)σ₂²
If |ρ| = 1, E[(η − L(ξ))²] = 0, i.e., η = L(ξ). #
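The optimal coefficients a = ρσ₂/σ₁ and b = μ₂ − aμ₁ can be estimated from sample moments. A minimal sketch with synthetic data follows; the generating model η = 3ξ + 0.5 + noise is an arbitrary assumption made only so the recovered coefficients can be checked.

import random

random.seed(4)
N = 100_000
xs = [random.gauss(1.0, 2.0) for _ in range(N)]
ys = [3.0 * x + 0.5 + random.gauss(0.0, 1.0) for x in xs]

mu1 = sum(xs) / N
mu2 = sum(ys) / N
s1 = (sum((x - mu1) ** 2 for x in xs) / N) ** 0.5
s2 = (sum((y - mu2) ** 2 for y in ys) / N) ** 0.5
rho = sum((x - mu1) * (y - mu2) for x, y in zip(xs, ys)) / (N * s1 * s2)

a = rho * s2 / s1
b = mu2 - a * mu1
print(a, b)                        # about 3.0 and 0.5
print((1 - rho ** 2) * s2 ** 2)    # minimal error e^2_min, about the noise variance 1.0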

3. Conditional Expectations and Regression Analysis


Definition Let ξ and η be two random variables; the conditional expectation of η, given ξ = x, is then defined as
E[η|ξ = x] = ∫_{−∞}^{+∞} y f(y|x) dy = ∫_{−∞}^{+∞} y (f(x, y)/f_ξ(x)) dy

Remark: The conditional expectation E[η|ξ = x] is in fact a function of x and E[η|ξ] is then a function of the random variable ξ. The mean of E[η|ξ] is given by:
E[E[η|ξ]] = ∫_{−∞}^{+∞} E[η|ξ = x] f_ξ(x) dx = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} y f(x, y) dy dx = E[η]

Example From
f(y|x) ≥ 0, ∫_{−∞}^{+∞} f(y|x) dy = ∫_{−∞}^{+∞} (f(x, y)/f_ξ(x)) dy = f_ξ(x)/f_ξ(x) = 1
it follows that f(y|x) can be regarded as the density function of a random variable η|_x indexed with x. The mean of η|_x is given by
E[η|_x] = ∫_{−∞}^{+∞} y f(y|x) dy = E[η|ξ = x]
Then, for all functions g(x), it follows that
E[(η|_x − E[η|ξ = x])²] ≤ E[(η|_x − g(x))²]
or, expressed in integral form,
∫_{−∞}^{+∞} (y − E[η|ξ = x])² f(y|x) dy ≤ ∫_{−∞}^{+∞} (y − g(x))² f(y|x) dy

Theorem (Regression) Let ξ and η be two random variables; then for all functions g(x),
E[(η − E[η|ξ])²] ≤ E[(η − g(ξ))²].
Proof:
E[(η − g(ξ))²] = ∬_{−∞}^{+∞} (y − g(x))² f(x, y) dy dx = ∫_{−∞}^{+∞} (∫_{−∞}^{+∞} (y − g(x))² f(y|x) dy) f_ξ(x) dx
≥ ∫_{−∞}^{+∞} (∫_{−∞}^{+∞} (y − E[η|ξ = x])² f(y|x) dy) f_ξ(x) dx = E[(η − E[η|ξ])²] #
Remark: The theorem shows that if one wants to look for a function g(x) such that g(ξ) approaches η best among others, then the conditional expectation E[η|ξ = x] given ξ is the best choice. The resulting variable E[η|ξ] is often called the regression of η with respect to ξ.

4. Generating and Characteristic Functions


Definition Let ξ be a discrete random variable assuming nonnegative integers; then the function g(x) = E[x^ξ] is called the generating function of ξ.

Remark: Since g(x) = E[x^ξ] = Σ_k x^k P(ξ = k), we have
dⁿg(x)/dxⁿ = Σ_k k(k−1)⋯(k−n+1) x^{k−n} P(ξ = k)
lim_{x→1−} dⁿg(x)/dxⁿ = Σ_k k(k−1)⋯(k−n+1) P(ξ = k) = E[ξ(ξ−1)⋯(ξ−n+1)]

Example Let ξ be a random variable satisfying the binomial distribution; the generating function of ξ is then given by
g(x) = E[x^ξ] = Σ_{k=0}^{n} x^k C_n^k p^k q^{n−k} = (xp + q)^n
With the help of g(x), one can calculate the moments of ξ:
E[ξ] = lim_{x→1−} dg(x)/dx = lim_{x→1−} n(xp + q)^{n−1} p = np
E[ξ²] = E[ξ(ξ−1)] + E[ξ] = lim_{x→1−} d²g(x)/dx² + np = n(n−1)p² + np
σ² = E[(ξ − E[ξ])²] = E[ξ²] − E²[ξ] = n(n−1)p² + np − n²p² = np(1 − p) = npq

Example Let ξ be a random variable satisfying the Poisson distribution; the generating function of ξ is then given by
g(x) = E[x^ξ] = Σ_{k=0}^{+∞} x^k (λ^k/k!) e^{−λ} = e^{λx} e^{−λ} = e^{λ(x−1)}
With the help of g(x), one can calculate the moments of ξ:
E[ξ] = lim_{x→1−} dg(x)/dx = lim_{x→1−} λ e^{λ(x−1)} = λ
E[ξ²] = E[ξ(ξ−1)] + E[ξ] = lim_{x→1−} d²g(x)/dx² + λ = λ² + λ
σ² = E[(ξ − E[ξ])²] = E[ξ²] − E²[ξ] = λ² + λ − λ² = λ

Definition Let ξ be a random variable; then the function φ(t) = E[e^{jtξ}] is called the characteristic function of ξ.
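The moment computations via g(x) can be reproduced symbolically. The sketch below (using sympy, an assumption about the available tooling) differentiates the binomial and Poisson generating functions at x = 1, recovering the means and variances computed above.

import sympy as sp

x, p, lam = sp.symbols('x p lam', positive=True)
n = sp.symbols('n', positive=True, integer=True)

g_binom = (x * p + 1 - p) ** n       # binomial generating function, with q = 1 - p
g_poisson = sp.exp(lam * (x - 1))    # Poisson generating function

for g in (g_binom, g_poisson):
    m1 = sp.diff(g, x).subs(x, 1)            # E[xi] = g'(1)
    m2 = sp.diff(g, x, 2).subs(x, 1)         # E[xi(xi-1)] = g''(1)
    var = sp.simplify(m2 + m1 - m1 ** 2)     # sigma^2 = E[xi^2] - E^2[xi]
    print(sp.simplify(m1), var)
# expected output: n*p with variance n*p*(1 - p), then lam with variance lam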

5. Normal Random Vectors


Definition Let ξ = (ξ₁, ξ₂, …, ξₙ)ᵀ be an n-dimensional random vector, μ = E[ξ] = (μ₁, μ₂, …, μₙ)ᵀ and R = E[(ξ − μ)(ξ − μ)ᵀ]; ξ is said to be normal if its n-dimensional joint probability density function is as follows:
f(x) = (2π)^{−n/2} |R|^{−1/2} exp(−(1/2)(x − μ)ᵀ R⁻¹ (x − μ)), where x = (x₁, x₂, …, xₙ)ᵀ ∈ Rⁿ

Remark: When n = 2,
R = [σ₁², ρσ₁σ₂; ρσ₁σ₂, σ₂²], R⁻¹ = (1/(1 − ρ²)) [1/σ₁², −ρ/(σ₁σ₂); −ρ/(σ₁σ₂), 1/σ₂²]
and
f(x, y) = (1/(2πσ₁σ₂√(1 − ρ²))) exp{−(1/(2(1 − ρ²)))[(x − μ₁)²/σ₁² − 2ρ(x − μ₁)(y − μ₂)/(σ₁σ₂) + (y − μ₂)²/σ₂²]}
The 2-dimensional normal distribution is often denoted by N(μ₁, μ₂, σ₁², σ₂², ρ).

Theorem Let (ξ₁, ξ₂) be a 2-dimensional normal random vector and ρ the correlation coefficient; then
ρ = 0 ⟺ ξ₁ and ξ₂ are independent of each other.
Proof:
Since
f(x, y) = (1/(2πσ₁σ₂√(1 − ρ²))) exp{−(1/(2(1 − ρ²)))[(x − μ₁)²/σ₁² − 2ρ(x − μ₁)(y − μ₂)/(σ₁σ₂) + (y − μ₂)²/σ₂²]},
f_{ξ₁}(x) = (1/(√(2π)σ₁)) e^{−(x−μ₁)²/(2σ₁²)}, f_{ξ₂}(y) = (1/(√(2π)σ₂)) e^{−(y−μ₂)²/(2σ₂²)}, ρ = cov(ξ₁, ξ₂)/(σ₁σ₂),
we have
ρ = 0 ⟺ f(x, y) = f_{ξ₁}(x) f_{ξ₂}(y) #

Example The marginal and conditional distributions of a multivariate normal distribution are still normal.

Proof:
Suppose the random vector (ξ, η) is normally distributed N(μ₁, μ₂, σ₁², σ₂², ρ); then
Marginal distributions:
f_ξ(x) = (1/(√(2π)σ₁)) e^{−(x−μ₁)²/(2σ₁²)} = N(μ₁, σ₁²), f_η(y) = (1/(√(2π)σ₂)) e^{−(y−μ₂)²/(2σ₂²)} = N(μ₂, σ₂²)
Conditional distributions:
f(y|x) = f(x, y)/f_ξ(x) = (1/(√(2π)σ₂√(1 − ρ²))) exp{−(y − μ₂ − ρ(σ₂/σ₁)(x − μ₁))²/(2(1 − ρ²)σ₂²)} = N(ρ(σ₂/σ₁)(x − μ₁) + μ₂, (1 − ρ²)σ₂²) #
Remark: Since
E[η|ξ = x] = ∫_{−∞}^{+∞} y f(y|x) dy = ρ(σ₂/σ₁)(x − μ₁) + μ₂,
the random variable E[η|ξ] is nothing but the linear regression of η.

Theorem Let ξ = (ξ₁, ξ₂, …, ξₙ)ᵀ be an n-dimensional normal random vector and A = (a_{ij}) an m×n matrix; then η = Aξ is an m-dimensional normal random vector.

Remark: This theorem shows that the linear transform of a normal random vector is still normal.

Theorem An n-dimensional random vector ξ = (ξ₁, ξ₂, …, ξₙ)ᵀ is normal if and only if for all numbers λ₁, λ₂, …, λₙ, η = Σ_{i=1}^{n} λᵢξᵢ is a normal random variable.

Remark 1: The theorem can also be stated as follows:
The random variables ξ₁, ξ₂, …, ξₙ are jointly normal if and only if every possible linear combination of them is normal.
Remark 2: It is possible that random variables ξ₁, ξ₂, …, ξₙ are not jointly normal even though each of them is normal.
Remark 3: If random variables ξ₁, ξ₂, …, ξₙ are independent and each of them is normal, then for all numbers λ₁, λ₂, …, λₙ, η = Σ_{i=1}^{n} λᵢξᵢ is a normal random variable.
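The conditional-distribution formula f(y|x) = N(ρ(σ₂/σ₁)(x − μ₁) + μ₂, (1 − ρ²)σ₂²) can be checked by sampling a bivariate normal and conditioning on a thin slice around a point x₀. The sketch below uses numpy; all parameter values are illustrative.

import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -2.0])
s1, s2, rho = 2.0, 1.5, 0.6
R = np.array([[s1 ** 2, rho * s1 * s2],
              [rho * s1 * s2, s2 ** 2]])        # covariance matrix

xy = rng.multivariate_normal(mu, R, size=2_000_000)
x0, eps = 2.0, 0.02
cond = xy[np.abs(xy[:, 0] - x0) < eps][:, 1]    # eta given xi close to x0

print(cond.mean(), rho * s2 / s1 * (x0 - mu[0]) + mu[1])   # about -1.55
print(cond.var(), (1 - rho ** 2) * s2 ** 2)                # about 1.44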

Memo
Definition
E[g(ξ)] = ∫_{−∞}^{+∞} g(x) f(x) dx, E[g(ξ)] = Σ_k g(k) P{ξ = k}
E[g(ξ, η)] = ∬_{−∞}^{+∞} g(x, y) f(x, y) dx dy, E[g(ξ, η)] = Σ_{k,m} g(k, m) P{ξ = k; η = m}
Examples
E[ξ], Dξ = E[(ξ − Eξ)²], ρ = E[((ξ − Eξ)/√(Dξ)) ((η − Eη)/√(Dη))]
Properties
E[Σᵢ λᵢξᵢ] = Σᵢ λᵢ E[ξᵢ]
E[f(ξ) g(η)] = E[f(ξ)] E[g(η)], D[Σᵢ λᵢξᵢ] = Σᵢ λᵢ² D[ξᵢ] (statistical independence)
E²[ξη] ≤ E[ξ²] E[η²]
Linear Regression
L(ξ) = ρ(σ₂/σ₁)(ξ − μ₁) + μ₂, E[(η − L(ξ))²] = σ₂²(1 − ρ²)
where μ₁ = Eξ, σ₁² = E[(ξ − μ₁)²], μ₂ = Eη, σ₂² = E[(η − μ₂)²]
Regression
Let g(x) = E[η|ξ = x] = ∫_{−∞}^{+∞} y f(y|x) dy; then for all f(x),
E[(η − g(ξ))²] ≤ E[(η − f(ξ))²]
Normal Distribution
f(x, y) = N(μ₁, μ₂, σ₁², σ₂², ρ) ⟹ f_ξ(x) = N(μ₁, σ₁²), f_η(y) = N(μ₂, σ₂²), f(y|x) = N(ρ(σ₂/σ₁)(x − μ₁) + μ₂, σ₂²(1 − ρ²))
ξ₁, ξ₂, …, ξₙ are jointly normally distributed ⟺ η = Σ_{i=1}^{n} λᵢξᵢ is normal for all numbers λ₁, …, λₙ

Limit Theorems
1. Inequalities
Hájek-Rényi Inequality Let ξ₁, …, ξₙ be independent random variables with finite second moments and C₁, …, Cₙ numbers such that C₁ ≥ ⋯ ≥ Cₙ ≥ 0; then for all 1 ≤ m < n and all ε > 0,
P{max_{m≤j≤n} Cⱼ |Σ_{i=1}^{j} (ξᵢ − Eξᵢ)| ≥ ε} ≤ (1/ε²)(C_m² Σ_{j=1}^{m} Dξⱼ + Σ_{j=m+1}^{n} Cⱼ² Dξⱼ)

Kolmogorov Inequality Let ξ₁, …, ξₙ be independent random variables with finite second moments; then for all ε > 0,
P{max_{1≤j≤n} |Σ_{i=1}^{j} (ξᵢ − Eξᵢ)| ≥ ε} ≤ (1/ε²) Σ_{j=1}^{n} Dξⱼ
Hint: The Kolmogorov inequality can be regarded as a special case of the Hájek-Rényi inequality when letting m = 1 and C₁ = ⋯ = Cₙ = 1.

Chebyshev Inequality Let ξ be a random variable with finite second moment; then for all ε > 0,
P{|ξ − Eξ| ≥ ε} ≤ Dξ/ε²
Hint: The Chebyshev inequality can be regarded as a special case of the Kolmogorov inequality when letting n = 1. The Chebyshev inequality can also be proven directly:
P{|ξ − Eξ| ≥ ε} = ∫_{|x−Eξ|≥ε} f(x) dx ≤ ∫_{|x−Eξ|≥ε} (|x − Eξ|²/ε²) f(x) dx ≤ (1/ε²) ∫_{−∞}^{+∞} |x − Eξ|² f(x) dx = Dξ/ε²
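A numerical comparison of the Chebyshev bound with the exact tail probability. The exponential distribution is an arbitrary illustrative choice; any second-order variable would do.

import random

random.seed(6)
lam = 1.0                  # Exp(1): E = 1, D = 1
N = 500_000
xs = [random.expovariate(lam) for _ in range(N)]
mean, var = 1.0 / lam, 1.0 / lam ** 2

for eps in (1.0, 2.0, 3.0):
    tail = sum(abs(x - mean) >= eps for x in xs) / N
    bound = var / eps ** 2
    print(eps, tail, bound)   # the bound dominates the empirical tail in every row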

2. Convergences of Sequences of Random Variables


Convergence Almost Everywhere A sequence of random variables ξ₁, …, ξₙ, … is said to converge almost everywhere to a random variable ξ if
P{ω | lim_{n→+∞} ξₙ(ω) = ξ(ω)} = 1

Convergence in Probability A sequence of random variables ξ₁, …, ξₙ, … is said to converge in probability to a random variable ξ if for all ε > 0,
lim_{n→+∞} P{ω | |ξₙ(ω) − ξ(ω)| ≥ ε} = 0

Convergence in Distribution A sequence of random variables ξ₁, …, ξₙ, … is said to converge in distribution to a random variable ξ if for all x at which F(x) is continuous,
lim_{n→+∞} Fₙ(x) = F(x)
where F(x) and Fₙ(x) are the distribution functions of ξ and ξₙ, n = 1, 2, …, respectively.
Remark: Note that
lim_{n→+∞} Fₙ(x) = F(x) ⟺ lim_{n→+∞} P{ω | ξₙ(ω) < x} = P{ω | ξ(ω) < x}

Convergence in the rth Mean/Moment A sequence of random variables ξ₁, …, ξₙ, … is said to converge in the rth mean/moment to a random variable ξ if
lim_{n→+∞} E[|ξₙ − ξ|^r] = 0
Remark: If r = 2, the convergence is the well-known mean square convergence.

The relation between the different types of convergence:
Convergence Almost Everywhere ⟹ Convergence in Probability ⟹ Convergence in Distribution

3. The Weak Laws of Large Numbers


Definition A sequence of random variables ξ₁, ξ₂, …, ξₙ, … is said to satisfy the weak law of large numbers if there is a sequence of numbers a₁, a₂, …, aₙ, … such that for all ε > 0,
lim_{n→+∞} P{|(1/n) Σ_{k=1}^{n} ξ_k − aₙ| ≥ ε} = 0
Remark: The convergence involved in the weak laws of large numbers is exactly the type of convergence in probability. In fact, let ηₙ = (1/n) Σ_{k=1}^{n} ξ_k − aₙ, n = 1, 2, …; then
lim_{n→+∞} P{|(1/n) Σ_{k=1}^{n} ξ_k − aₙ| ≥ ε} = lim_{n→+∞} P{|ηₙ| ≥ ε} = 0
This means that the sequence of random variables η₁, η₂, …, ηₙ, … converges in probability to zero.

Theorem (The Weak Law of Large Numbers, Khintchine) Suppose the second-order random variables ξ₁, ξ₂, …, ξₙ, … are independent and identically distributed; then for all ε > 0,
lim_{n→+∞} P{|(1/n) Σ_{k=1}^{n} ξ_k − μ| ≥ ε} = 0
where μ = E[ξ_k].
Proof: By the Chebyshev inequality,
P{|(1/n) Σ_{k=1}^{n} ξ_k − μ| ≥ ε} ≤ E[((1/n) Σ_{k=1}^{n} ξ_k − μ)²]/ε² = σ²/(nε²) → 0 (n → +∞)
where σ² = E[(ξ_k − μ)²]. #

4. The Strong Laws of Large Numbers


Definition A sequence of random variables 1 , 2 , L , n , L is said to satisfy the strong law of
large numbers if there is a sequence of numbers a 1 , a 2 , L , a n , L such that for all > 0

1 n

P lim k a n = 0 = 1

n + n k =1

Remark 1: The convergence involved in the strong laws of larger numbers is exactly the type
of convergence almost everywhere. In fact, let n =

1 n
k a n , n = 1,2,L , then
n k =1

1 n

P lim k a n = 0 = P lim n = 0 = 1
n +

n + n k =1

This means that the sequence of random variables 1 , 2 , L , n , L converges almost


everywhere to zero.

Remark 2: Since the convergence almost everywhere will lead to the convergence in
probability, a sequence of random variables satisfying the strong laws of large number must
satisfy the weak ones:

1 n

1 n

P lim k a n = 0 = 1 lim P k a n = 0 for all > 0


n +

n + n k =1

n k =1

Theorem (The Strong Law of Large Numbers, Kolmogorov) Suppose the second-order
random variables 1 , 2 , L , n , L are independent with each other and

D k
< + , then
2
n =1 n
+

( k E k )

1 n

P lim k =1
= 0 =n P lim k a n = 0 = 1
1
n +
n + n
n
k =1

a n = n k=1E k

Theorem (The Strong Law of Large Numbers, Khintchine). Suppose the second-order random variables ξ_1, ξ_2, …, ξ_k, … are independent and identically distributed; then
$$P\left\{\lim_{n\to+\infty}\frac{1}{n}\sum_{k=1}^{n}\xi_k = \mu\right\} = 1$$
where μ = E[ξ_k].
Hint: Since the random variables ξ_1, ξ_2, …, ξ_k, … are identically distributed, one has
$$\sum_{k=1}^{+\infty}\frac{D[\xi_k]}{k^2} = D[\xi_1]\sum_{k=1}^{+\infty}\frac{1}{k^2} < +\infty,$$
so Kolmogorov's theorem applies.

Remark: If ξ_k satisfies the 0–1 distribution P{ξ_k = 1} = p, P{ξ_k = 0} = 1 − p, then
$$E[\xi_k] = p \quad\text{and}\quad P\left\{\lim_{n\to+\infty}\frac{1}{n}\sum_{k=1}^{n}\xi_k = p\right\} = 1$$
Note that (1/n)Σ_{k=1}^{n} ξ_k represents the frequency of occurrence of the event {ξ_k = 1} in n Bernoulli experiments; the law of large numbers thus implies that the frequency will approximate the corresponding probability p as n → +∞.
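A one-path simulation of this remark (Python sketch, not from the original text; p = 0.3 is an arbitrary choice): along a single sequence of Bernoulli trials, the running frequency settles on p.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
xi = rng.random(100_000) < p                        # one long Bernoulli(p) experiment
freq = np.cumsum(xi) / np.arange(1, xi.size + 1)    # running frequency of {xi_k = 1}
for n in (10, 100, 1000, 10_000, 100_000):
    print(f"n={n:6d}  frequency={freq[n - 1]:.4f}")  # -> p = 0.3 along this path
```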


5. The Central Limit Theorems

Let ξ_1, ξ_2, …, ξ_i, … be a sequence of independent random variables with finite second moments and
$$\eta_n = \frac{\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}E[\xi_i]}{\sqrt{\sum_{i=1}^{n}D[\xi_i]}}, \quad n = 1, 2, \dots$$
The central limit theorems are concerned with the conditions under which the distribution of η_n tends to the standard normal distribution N(0,1) as n → +∞, i.e.,
$$\lim_{n\to+\infty} P\{\eta_n < x\} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt$$

Remark 1: Note that η_n is the standardized variable of the sum Σ_{i=1}^{n} ξ_i.

Remark 2: The convergence involved in the central limit theorems is exactly convergence in distribution. In fact, let Φ(x) = (1/√(2π))∫_{−∞}^{x} e^{−t²/2} dt and Φ_n(x) = P{η_n < x}, n = 1, 2, …; then
$$\lim_{n\to+\infty} P\{\eta_n < x\} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt \iff \lim_{n\to+\infty}\Phi_n(x) = \Phi(x)$$

The Central Limit Theorem (Lindeberg–Lévy Theorem). Let ξ_1, ξ_2, …, ξ_n, … be a sequence of independent and identically distributed (IID) random variables with finite second moment; then
$$\lim_{n\to+\infty} P\{\eta_n < x\} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt$$
where
$$\eta_n = \frac{\sum_{i=1}^{n}\xi_i - n\mu}{\sigma\sqrt{n}}, \quad \mu = E[\xi_i], \quad \sigma^2 = D[\xi_i].$$

The Central Limit Theorem (de Moivre–Laplace Theorem). Let ξ_1, ξ_2, …, ξ_n, … be a sequence of IID random variables with finite second moment. If
$$P\{\xi_i = k\} = \begin{cases} p, & k = 1 \\ 1 - p = q, & k = 0 \end{cases} \quad\text{for all } i,$$
then
$$\lim_{n\to+\infty}\frac{P\left\{\sum_{i=1}^{n}\xi_i = k\right\}}{\frac{1}{\sqrt{2\pi npq}}\,e^{-\frac{(k-np)^2}{2npq}}} = 1, \qquad \lim_{n\to+\infty} P\left\{\frac{\sum_{i=1}^{n}\xi_i - np}{\sqrt{npq}} < x\right\} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt$$

Remark: For the approximate calculation of Σ_{i=1}^{n} ξ_i, we so far have
$$P\left\{\sum_{i=1}^{n}\xi_i = k\right\} \approx \frac{(np)^k}{k!}\,e^{-np}, \quad\text{when } n \text{ is large enough and } p \text{ is small enough}$$
$$P\left\{\sum_{i=1}^{n}\xi_i = k\right\} \approx \frac{1}{\sqrt{2\pi npq}}\,e^{-\frac{(k-np)^2}{2npq}}, \quad\text{when } n \text{ is large enough}$$
$$P\left\{\frac{\sum_{i=1}^{n}\xi_i - np}{\sqrt{npq}} < x\right\} \approx \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt, \quad\text{when } n \text{ is large enough.}$$
In this case (Σ_{i=1}^{n} ξ_i − np)/√(npq) can be regarded as a standard normal variable, which leads to
$$P\left\{0 \le \sum_{i=1}^{n}\xi_i < x\right\} = P\left\{\frac{-np}{\sqrt{npq}} \le \frac{\sum_{i=1}^{n}\xi_i - np}{\sqrt{npq}} < \frac{x-np}{\sqrt{npq}}\right\} \approx \frac{1}{\sqrt{2\pi}}\int_{-np/\sqrt{npq}}^{(x-np)/\sqrt{npq}} e^{-t^2/2}\,dt$$
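The three approximations above are easy to compare numerically. The following Python sketch (not from the original notes; n = 100, p = 0.3, k = 35 are arbitrary illustrative values) computes the exact binomial probability with stdlib integer arithmetic and sets it against the Poisson and normal formulas:

```python
import math

def binom_pmf(n, p, k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 100, 0.3
q = 1 - p
k = 35
local = math.exp(-(k - n*p)**2 / (2*n*p*q)) / math.sqrt(2*math.pi*n*p*q)
poisson = (n*p)**k * math.exp(-n*p) / math.factorial(k)
print(f"exact binomial : {binom_pmf(n, p, k):.6f}")
print(f"local normal   : {local:.6f}")
print(f"poisson        : {poisson:.6f}   (p is not small here, so this one is rough)")

x = 35  # normal approximation of the CDF, Phi computed via math.erf
approx = 0.5 * (1 + math.erf((x - n*p) / math.sqrt(2*n*p*q)))
exact = sum(binom_pmf(n, p, j) for j in range(x))
print(f"P(S<{x}): exact {exact:.4f} vs normal approx {approx:.4f}")
```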

Conditioning. Conditioned distribution and expectation.


1. The conditioned probability and expectation.

Let (Ω, K, P) be a probability space. Let A ∈ K be an event such that P(A) ≠ 0. Let B be another event from K. Define
(1.1) P(B|A) = P(B∩A)/P(A)
This is called the conditioned probability of B given A.
Of course P(B|A) = P(B) ⇔ P(B∩A) = P(B)P(A) ⇔ A and B are independent.
If A is given, we may consider the function P_A : K → [0,1] given by
(1.2) P_A(B) = P(B|A)
It is obvious that P_A is a new probability on the σ-algebra K, called the conditioned probability given A.
The integral of a random variable X with respect to it will be denoted by E(X|A) or E_A(X). The computing formula is
PROPOSITION 1.1. E(X|A) = E(X·1_A)/P(A)
Proof. Obvious for X = 1_B. Then apply the usual method of four steps: X simple, X nonnegative, X arbitrary.
Let now Y be a discrete random variable and I be the set {y : P(Y = y) ≠ 0}. Then I is at most countable and Y admits the canonical representation Y = Σ_{y∈I} y·1_{Y=y} (a.s.). In many statistical problems one gets interested in computing the probability of an event B if one has information about Y. In other words, one wants to know P(B|Y = y). It is natural to define P(B|Y) as
(1.3) P(B|Y) = Σ_{y∈I} P(B|Y = y)·1_{Y=y}
This quantity will be called the conditioned probability of B given the random variable Y.
EXAMPLE. An urn has n labelled balls (that is, I = {1, 2, …, n}). One draws two balls without replacement. The first one is Y and the second one is X. One wants to compute P(X = x|Y) and to compare it with P(X = x). Accepting that we are in the classical context, Ω = I² \ {(i,i) : i ∈ I}, thus |Ω| = n(n−1). Then
P(X = x|Y = y) = P(X = x, Y = y)/P(Y = y) = 0 if x = y, and = 1/(n−1) if x ≠ y
(as, given Y = y, X has only n − 1 possibilities). It means that
P(X = x|Y) = Σ_{y∈I\{x}} (1/(n−1))·1_{Y=y} = (1/(n−1))·1_{Y≠x}
Compare this with P(X = x) = 1/n.
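The urn example can be checked by brute-force enumeration of the equally likely outcomes (a Python sketch, not part of the original text; n = 5 is an arbitrary choice):

```python
from itertools import permutations
from fractions import Fraction

n = 5
omega = list(permutations(range(1, n + 1), 2))  # (Y, X): two draws, no replacement
x = 2
p_x = Fraction(sum(1 for (_, x_) in omega if x_ == x), len(omega))
for y in (1, 2, 3):
    outcomes_y = [(y_, x_) for (y_, x_) in omega if y_ == y]
    p_cond = Fraction(sum(1 for (_, x_) in outcomes_y if x_ == x), len(outcomes_y))
    print(f"P(X={x} | Y={y}) = {p_cond}")   # 0 if y == x, else 1/(n-1)
print(f"P(X={x}) = {p_x}")                   # 1/n
```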

Looking at (1.3) one remarks four things: (i) the conditioned probability is a random variable; (ii) this random variable does not depend as much on Y as on the sets {Y = y}, which form a partition of Ω; (iii) this random variable is measurable with respect to the σ-algebra σ(Y) := Y⁻¹(B(ℝ)); and, finally, (iv) the random variable may be not defined everywhere, but only almost surely: if P(Y = y) = 0, then P(B|Y = y) may be any number from 0 to 1. A convention, as good as any other, would be in this case to decree that P(B|Y = y) = 0.
It means that a more natural definition would be the conditioned probability of B given a partition Δ = (Δ_j)_{j∈I} where I is at most countable. Then the analog of (1.3) would be
(1.4) P(B|Δ) = Σ_{j : P(Δ_j)≠0} P(B|Δ_j)·1_{Δ_j}
Taking into account Proposition 1.1, one is suggested to define
(1.5) E(X|Δ) = Σ_{j : P(Δ_j)≠0} E(X|Δ_j)·1_{Δ_j}, X ∈ L¹
(the condition that X ∈ L¹ means that E|X| < ∞; it is not necessary, but makes things easier).
The definition (1.5) has the advantage that E(1_B|Δ) = P(B|Δ), as it is normal to be.
We want to generalize the definition (1.5) to other situations. The most general situation is when we replace the partition by a σ-algebra. If in (1.5) we denote by F the σ-algebra generated by Δ (remark that A ∈ F ⇔ A = ∪_{j∈J} Δ_j for some J ⊂ I), we can say that the right-hand side of (1.5) is a definition for E(X|F), instead of E(X|Δ). So
(1.6) E(X|F) = Σ_{j : P(Δ_j)≠0} E(X|Δ_j)·1_{Δ_j}, X ∈ L¹
What properties characterize the definition (1.6) which can be generalized to an arbitrary sub-σ-algebra of K?
Remark that if we denote by Y the right-hand side of (1.6), then
(i). Y is F-measurable; moreover, Y ∈ L¹(Ω, F, P);
(ii). If A ∈ F then E(X1_A) = E(Y1_A).
Indeed, E|Y| ≤ Σ_{j : P(Δ_j)≠0} E(|E(X|Δ_j)|·1_{Δ_j}) ≤ Σ_j E(E(|X| |Δ_j)·1_{Δ_j}) = E|X| < ∞. As about the claim (ii), let A ∈ F ⇔ A = ∪_{j∈J} Δ_j for some J ⊂ I. Then E(Y1_A) = Σ_{j∈J} E(Y1_{Δ_j}) (by Lebesgue's dominated convergence) = Σ_{j∈J} E(E(X|Δ_j)·1_{Δ_j}) (since the Δ_j are disjoint) = Σ_{j∈J} E(X|Δ_j)·P(Δ_j) = Σ_{j∈J} E(X·1_{Δ_j}) (by Proposition 1.1) = E(X1_A).
The conditions (i) and (ii) are used to define E(X|F) in general situations.
Definition 1. Let X ∈ L¹(Ω, K, P) and F ⊂ K be a sub-σ-algebra. We say that Y = E(X|F) (read: Y is the conditioned expectation of X given F) iff
(1.7) Y is F-measurable and A ∈ F ⇒ E(X1_A) = E(Y1_A)
Definition 2. Let B ∈ K. By P(B|F) we shall understand E(1_B|F). Read: the conditioned probability of B given F.
Definition 3. Let X be a random variable and F ⊂ K be a sub-σ-algebra. By P∘X⁻¹(B|F) we shall understand the random variable P(X⁻¹(B)|F). Read: the conditioned distribution of X given F.
One may remark that the key concept is that of conditioned expectation.
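To make (1.5)–(1.7) concrete, here is a small numerical sketch (Python; assumptions: a 12-point sample space with equally likely points, F generated by a three-atom partition): E(X|F) is the piecewise-constant variable equal to the average of X over each atom, and it satisfies the defining identity (1.7) on unions of atoms.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=12)                            # X on a 12-point uniform sample space
atoms = [range(0, 4), range(4, 9), range(9, 12)]   # partition generating F

EXF = np.empty_like(X)
for atom in atoms:
    idx = list(atom)
    EXF[idx] = X[idx].mean()                       # E(X|atom) = E(X 1_atom)/P(atom)

A = list(atoms[0]) + list(atoms[2])                # A in F: a union of atoms
print(np.isclose(X[A].sum(), EXF[A].sum()))        # (1.7): E(X 1_A) = E(Y 1_A) -> True
```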

2. Properties of the conditioned expectation.


Property 1. Almost sure unicity. If X is an integrable r.v., then E(X|F) exists and is unique a.s., i.e. if Y₁ and Y₂ are two versions of E(X|F), then Y₁ = Y₂ (a.s.).

Proof. The signed measure XP : F → ℝ, (XP)(A) := ∫X1_A dP, is absolutely continuous with respect to P, since P(A) = 0 ⇒ (XP)(A) = ∫X1_A dP = 0 (as X1_A = 0 a.s.). The Radon–Nikodym theorem says that there must be a density of XP with respect to P: there must exist Y which is F-measurable such that XP = YP. Notice that we think of both measures as living on the σ-algebra F. The unicity is guaranteed by the same Radon–Nikodym theorem; but one may check it directly, as an exercise. If Y₁P = Y₂P, the meaning is that ∫(Y₁ − Y₂)1_A dP = 0 ∀A ∈ F; one may as well choose A = {Y₁ > Y₂} = ∪_{n=1}^{∞} {Y₁ > Y₂ + 1/n} and get that P(Y₁ > Y₂) = 0. In the same way one gets that P(Y₁ < Y₂) = 0, that is P(Y₁ ≠ Y₂) = 0 ⇔ Y₁ = Y₂ (a.s.).

Property 2. Generalizing the usual expectation. Suppose that F is trivial, meaning that A ∈ F ⇒ P(A) ∈ {0,1}. Then E(X|F) = EX. Moreover, if X is already F-measurable, then E(X|F) = X. It means that the F-measurable functions behave as the constants do in the usual case.
Proof. Let Y = E(X|F). As Y is F-measurable, Y must be a constant a.s. Indeed, the sets L_b = {Y < b} belong to F. They are an increasing family, in the sense that b < c ⇒ L_b ⊂ L_c. Their probability can be either 0 or 1. As 0 = P(∩_b L_b) = lim_{b→−∞} P(L_b), some of these sets must have probability 0. Let c = sup{b : P(L_b) = 0}. Then, due to the definition of c, P(L_{c+ε}) = 1 for every ε > 0. In the same way P(L_{c−ε}) = 0. By the monotone continuity of any measure it follows that P(Y ≤ c) = 1 but P(Y < c) = 0 ⇒ P(Y = c) = 1 ⇔ Y = c (a.s.). So Y is a constant a.s. If in (1.7) we take A = Ω, we get that EX = E(X1_A) = E(Y1_A) = EY = Ec = c.
As about the second claim, it is obvious from (1.7).

Property 3. Projectivity. If F ⊂ G are two σ-algebras then E(E(X|G)|F) = E(X|F). As a consequence of Property 2, we get that EX = E(E(X|G)).
Proof. Let Y = E(X|G) and Z = E(X|F). We want to check that E(Y|F) = Z. Firstly, Z is F-measurable. Secondly, let A ∈ F. Then E(Z1_A) = E(X1_A) (by (1.7)) = E(Y1_A) (again by (1.7); notice that A ∈ F ⇒ A ∈ G!). It means that E(Y|F) = Z.


Property 4. Linearity. If a, b ∈ ℝ and X₁, X₂ ∈ L¹ then E(aX₁ + bX₂|F) = aE(X₁|F) + bE(X₂|F) (a.s.)
Proof. Let Y_j = E(X_j|F), j = 1, 2. Let Y = aY₁ + bY₂ and A ∈ F. Then Y is F-measurable and, moreover, E(Y1_A) = E((aY₁ + bY₂)1_A) = aE(Y₁1_A) + bE(Y₂1_A) = aE(X₁1_A) + bE(X₂1_A) (by (1.7)) = E((aX₁ + bX₂)1_A), checking the second condition from (1.7).
Property 5. Monotonicity. If X₁ ≤ X₂ then E(X₁|F) ≤ E(X₂|F) (a.s.)
Proof. Using Property 4, it is enough to check that X ≥ 0 ⇒ E(X|F) ≥ 0 (a.s.). Let Y = E(X|F). Y is F-measurable and A ∈ F ⇒ E(Y1_A) = E(X1_A) ≥ 0 since X ≥ 0. If one puts A = {Y < 0} it follows that E(Y1_A) = −E(Y⁻) ≥ 0 ⇒ E(Y⁻) ≤ 0 ⇒ Y⁻ = 0 (a.s.) ⇒ Y = Y⁺ (a.s.) ⇒ Y ≥ 0 (a.s.).
Property 6. Jensen's inequality. Let X : Ω → I be a random variable and f : I → ℝ be convex (here I is an interval!). Then E(f(X)|F) ≥ f(E(X|F)).
Proof. A convex function f can be written as f = sup{h_a : a ∈ Λ}, with Λ at most countable and the h_a affine functions, h_a(x) = m_a x + n_a (for instance Λ = ℚ∩I and, for a ∈ Λ, h_a a tangent of f at (a, f(a)); it is known that a convex function has at least one tangent at every point).
Then E(f(X)|F) = E(sup{h_a(X) : a ∈ Λ}|F) ≥ sup{E(h_a(X)|F) : a ∈ Λ} (by Property 5, monotonicity) = sup{E(m_a X + n_a|F) : a ∈ Λ} = sup{m_a E(X|F) + n_a : a ∈ Λ} (by linearity and Property 2 — the expectation of a constant is the constant itself) = f(E(X|F)).

Property 7. Contractivity. Let p ∈ [1,∞] and X ∈ Lᵖ. Then ‖E(X|F)‖_p ≤ ‖X‖_p.
As a consequence the conditioned expectation is a linear contraction from Lᵖ(Ω,K,P) to Lᵖ(Ω,F,P).
Proof. There are two cases.
1. 1 ≤ p < ∞. The claim is E|E(X|F)|ᵖ ≤ E|X|ᵖ. Let f(x) = |x|ᵖ. Then f : ℝ → ℝ is convex, so we know that E(f(X)|F) ≥ f(E(X|F)) ⇔ E(|X|ᵖ|F) ≥ |E(X|F)|ᵖ. If we take the expectation, we get E(E(|X|ᵖ|F)) ≥ E(|E(X|F)|ᵖ) which, because of Property 3, is exactly our claim.
2. p = ∞. Let then M = ‖X‖_∞. It means that |X| ≤ M (a.s.) ⇒ E(|X| |F) ≤ E(M|F) = M (by Property 5, monotonicity) ⇒ |E(X|F)| ≤ M (a.s.) ⇒ ‖E(X|F)‖_∞ ≤ M.

Property 8. Conditioned Beppo Levi, Fatou and Lebesgue theorems. Precisely, the claim runs as follows:
1. If Xₙ ≥ g ∈ L¹ and Xₙ ↑ X (or Xₙ ↓ X, Xₙ ≤ g ∈ L¹) then E(Xₙ|F) ↑ E(X|F) (a.s.) (or E(Xₙ|F) ↓ E(X|F) (a.s.)) (Beppo Levi);
2. If Xₙ ≥ g ∈ L¹ (resp. Xₙ ≤ g ∈ L¹) then E(liminfₙ Xₙ|F) ≤ liminfₙ E(Xₙ|F) (resp. E(limsupₙ Xₙ|F) ≥ limsupₙ E(Xₙ|F)) (Fatou);
3. If Xₙ → X (a.s.) and |Xₙ| ≤ g ∈ L¹, then a.s.-lim E(Xₙ|F) = E(X|F) (Dominated convergence, Lebesgue).


Proof. Let Yₙ = E(Xₙ|F). Due to monotonicity, (Yₙ) is almost surely increasing. Let Y be its supremum, which is a.s. the same as its limit. The claim is that Y = E(X|F). According to (1.7) what we have to do is to check the measurability (obvious) and the fact that A ∈ F ⇒ E(X1_A) = E(Y1_A). But E(X1_A) = E(lim Xₙ1_A) = lim E(Xₙ1_A) (usual Beppo Levi) = lim E(Yₙ1_A) (by (1.7)) = E(lim Yₙ1_A) (again Beppo Levi) = E(Y1_A). That checks claim 1.
As about 2., the proof is the same as in the usual case (monotonicity and conditioned Beppo Levi): E(liminfₙ Xₙ|F) = E(supₙ inf_k Xₙ₊ₖ|F) = E(supₙ Yₙ|F) (with Yₙ = inf_k Xₙ₊ₖ, an increasing sequence) = E(lim Yₙ|F) = lim E(Yₙ|F) (conditioned Beppo Levi) = supₙ E(inf_k Xₙ₊ₖ|F) ≤ supₙ inf_k E(Xₙ₊ₖ|F) (monotonicity) = liminfₙ E(Xₙ|F).
The conditioned Lebesgue theorem poses no problems: as X = liminfₙ Xₙ = limsupₙ Xₙ, we apply the conditioned Fatou lemma:
limsupₙ E(Xₙ|F) ≤ E(limsupₙ Xₙ|F) = E(X|F) = E(liminfₙ Xₙ|F) ≤ liminfₙ E(Xₙ|F), meaning that limsupₙ E(Xₙ|F) = liminfₙ E(Xₙ|F) = E(X|F).

Property 9. The F-measurable functions behave as constants. Precisely, the property runs as follows: if X ∈ Lᵖ and Y ∈ L^q is F-measurable, with 1/p + 1/q = 1, p, q ≥ 1, then E(XY|F) = Y·E(X|F). Remark that if F is trivial then Y is a constant.
Proof. The condition X ∈ Lᵖ and Y ∈ L^q is put for convenience; what we want is that XY ∈ L¹.
The proof will be standard. Let Z = Y·E(X|F). Our claim means that Z is F-measurable (obvious) and that A ∈ F ⇒ E(XY1_A) = E(Z1_A).
Step 1. Y = 1_B, B ∈ F. Then E(Z1_A) = E(Y·E(X|F)1_A) = E(E(X|F)1_A1_B) = E(E(X|F)1_{A∩B}) = E(X1_{A∩B}) (as A, B ∈ F ⇒ A∩B ∈ F, too!) = E(X1_A1_B) = E(XY1_A), so in this case we are done.
Step 2. Y is simple, i.e. Y = Σ_{i=1}^{n} b_i·1_{B_i}, B_i ∈ F. Then E(Z1_A) = E(Y·E(X|F)1_A) = Σ_{i=1}^{n} b_i·E(1_A1_{B_i}E(X|F)) = Σ_{i=1}^{n} b_i·E(X1_A1_{B_i}) (by Step 1!) = E(X1_A·Σ_{i=1}^{n} b_i1_{B_i}) (by linearity) = E(XY1_A), finishing the proof in this case, too.
Step 3. Y is nonnegative. Then Y is the limit of a nondecreasing sequence of simple functions Yₙ. We have: E(Z1_A) = E(Y·E(X|F)1_A) = E(Y·E(X⁺|F)1_A) − E(Y·E(X⁻|F)1_A) = E(limₙ Yₙ·E(X⁺|F)1_A) − E(limₙ Yₙ·E(X⁻|F)1_A) = limₙ E(Yₙ·E(X⁺|F)1_A) − limₙ E(Yₙ·E(X⁻|F)1_A) (Beppo Levi!) = limₙ E(E(X⁺Yₙ1_A|F)) − limₙ E(E(X⁻Yₙ1_A|F)) (Step 2! Yₙ1_A is simple!) = limₙ E(X⁺Yₙ1_A) − limₙ E(X⁻Yₙ1_A) (Property 3!) = E(X⁺·limₙ Yₙ1_A) − E(X⁻·limₙ Yₙ1_A) (Beppo Levi again!) = E(X⁺Y1_A) − E(X⁻Y1_A) = E((X⁺ − X⁻)Y1_A) = E(XY1_A).
Step 4. Y is arbitrary. Then Y = Y⁺ − Y⁻, hence E(Z1_A) = E(Y·E(X|F)1_A) = E(Y⁺·E(X|F)1_A) − E(Y⁻·E(X|F)1_A) = E(E(XY⁺1_A|F)) − E(E(XY⁻1_A|F)) (by Step 3! Y⁺1_A and Y⁻1_A are nonnegative) = E(XY⁺1_A) − E(XY⁻1_A) (Property 3) = E(X(Y⁺ − Y⁻)1_A) = E(XY1_A).


Property 10. Optimality. Let X ∈ L². Consider the function D : L²(Ω,F,P) → [0,∞) given by D(Y) = ‖X − Y‖₂. Then D is convex and has a unique (a.s.) point of minimum, which is exactly Y = E(X|F). Moreover, the following Pythagoras rule holds:
‖X − Y‖₂² = ‖X − E(X|F)‖₂² + ‖E(X|F) − Y‖₂².
As a consequence the mapping E_F : L² → L²(Ω,F,P) given by E_F(X) = E(X|F) is the orthogonal projector from the Hilbert space L² to the Hilbert subspace L²(Ω,F,P).
Proof. Let Z = E(X|F). Then ‖X − Y‖₂² = E(X − Y)² = E((X − Z) + (Z − Y))² = E((X − Z)²) + E((Z − Y)²) + 2E((X − Z)(Z − Y)). The last term is equal to 2E(E((X − Z)(Z − Y)|F)) (Property 3) = 2E((Z − Y)·E(X − Z|F)) (Property 9: Z − Y is F-measurable) = 2E((Z − Y)(E(X|F) − Z)) = 2E((Z − Y)(Z − Z)) = 0. It means that ‖X − Y‖₂² = ‖X − Z‖₂² + ‖Z − Y‖₂².
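A numerical sketch of the optimality property (Python; assumptions: a finite uniform sample space and F generated by a two-atom partition): the atom-wise mean minimizes the L² distance among F-measurable candidates, and the Pythagoras rule holds exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=10)
atoms = [np.arange(0, 6), np.arange(6, 10)]   # partition generating F

Z = np.empty_like(X)
for a in atoms:
    Z[a] = X[a].mean()                        # Z = E(X|F)

def dist2(Y):                                 # squared L2 distance, uniform P
    return np.mean((X - Y) ** 2)

Y = Z.copy(); Y[atoms[0]] += 0.3              # some other F-measurable candidate
print(dist2(Z) < dist2(Y))                    # True: Z minimizes D
print(np.isclose(dist2(Y), dist2(Z) + np.mean((Z - Y) ** 2)))  # Pythagoras rule
```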
Property 11. Conditioning and independence. If X is independent of F, then E(X|F) = EX. It is not true in general that E(X|F) = EX ⇒ X is independent of F. However, if P(B|F) = const, then P(B|F) = P(B) and B is independent of F.
Proof. Let X be independent of F and Y = EX. The task is to prove that Y fulfils the conditions (1.7). As measurability is obvious, let A ∈ F (hence A is independent of X, i.e. X and 1_A are independent). Then E(X1_A) = EX·E1_A = EX·P(A) = E(EX·1_A) = E(Y1_A), checking the first claim. As about the converse, it cannot be true, since it is enough to choose X = 1_A − 1_B with P(A) = P(B) = p and F = σ(Δ) where Δ = (Δ_j)_{j∈J} is an (at most) countable partition of Ω. Then EX = 0 and E(X|F) = P(A|F) − P(B|F) = Σ_{j∈J}(P(A|Δ_j) − P(B|Δ_j))·1_{Δ_j}. If we choose A and B such that P(A∩Δ_j) = P(B∩Δ_j) ≠ pP(Δ_j), that would be an example that it is possible that E(X|F) = EX = 0 but X not independent of F, since P(X = 1, Δ_j) = P(A∩Δ_j) ≠ P(X = 1)P(Δ_j).
However, suppose that P(B|F) = c where c is a constant. By (1.7) this means that E(1_A1_B) = E(c1_A) ∀A ∈ F, or that P(A∩B) = cP(A) ∀A ∈ F. If A = Ω one finds the constant c = P(B) and discovers that the definition relation (1.7) means that P(A∩B) = P(A)P(B) ∀A ∈ F; in other words, that B is independent of F.

Property 12. Regression. If F = σ(Y) = Y⁻¹(B) where (E,B) is a measurable space and Y : Ω → E is measurable, then the conditioned expectation E(X|σ(Y)) is denoted by E(X|Y) and is called the regression function of X given Y. The property is that E(X|Y) = h(Y) where h : E → ℝ is some measurable function.
Proof. It has nothing to do with conditioned expectation, but with the following fact, called the universality property: let (E,B) be a measurable space and Y : Ω → E be any mapping. Endow Ω with the σ-algebra σ(Y). Let Z : Ω → ℝ be σ(Y)-measurable. Then there must exist a measurable function h : E → ℝ such that Z = h∘Y. The proof is standard: if Z = 1_A then A ∈ σ(Y) ⇒ A = Y⁻¹(B) for some B ∈ B, hence Z = 1_{Y⁻¹(B)} = 1_B∘Y. It means that in this case h = 1_B. The next step is when Z is simple: Z = Σ_{1≤i≤n} a_i·1_{A_i} with A_i ∈ σ(Y) ⇒ A_i = Y⁻¹(B_i) for some B_i ∈ B. Then h = Σ_{1≤i≤n} a_i·1_{B_i}. If Z is arbitrary, then it is a limit of simple functions Zₙ = hₙ∘Y. It is enough to put h = liminfₙ hₙ. In our very case the only fact that matters is that the regression function E(X|Y) must be σ(Y)-measurable.
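A sketch of the regression function in the discrete case (Python; the data-generating model X = 2Y + noise is a hypothetical illustration): h(y) = E(X|Y = y) is estimated by averaging X over each level of Y, and E(X|Y) = h(Y) satisfies E(E(X|Y)) = EX.

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.integers(0, 3, size=10_000)           # discrete Y with values 0, 1, 2
X = 2.0 * Y + rng.normal(size=Y.size)         # X depends on Y plus noise

h = {y: X[Y == y].mean() for y in np.unique(Y)}   # h(y) = E(X|Y=y), here ~ 2y
print(h)

EXY = np.array([h[y] for y in Y])             # the random variable E(X|Y) = h(Y)
print(np.isclose(EXY.mean(), X.mean()))       # E(E(X|Y)) = EX (projectivity)
```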

Property 13. Strict Jensen inequality. If f is twice differentiable and strictly convex, then E(f(X)|F) = f(E(X|F)) ⇒ X = E(X|F). As a consequence, if E(f(X)) = E(f(E(X|F))), then X = E(X|F).
Proof. The assertion holds for any strictly convex function, but we shall prove it in the particular case when f is twice differentiable. Recall that a function f is said to be strictly convex iff the equality f(px + (1−p)y) = pf(x) + (1−p)f(y) with 0 ≤ p ≤ 1 is possible only if p ∈ {0,1} or x = y. Or, equivalently, the graph of f contains no segment of line.
Let then f be strictly convex and twice differentiable. Then
(2.1) f(x) = f(a) + f′(a)(x − a) + f″(θ(x))·(x − a)²/2
for some θ(x) lying somewhere between a and x. Remark that the mapping x ↦ f″(θ(x)), being a ratio of two continuous functions, is continuous itself and thus measurable. Now replace in (2.1) x with X and a with E(X|F). We get
(2.2) f(X) = f(a) + f′(a)(X − a) + f″(θ(X))·(X − a)²/2
Apply in (2.2) the conditional expectation. Then
(2.3) E(f(X)|F) = f(a) + f′(a)·E(X − a|F) + E(f″(θ(X))·(X − a)²/2 | F)
We applied the fact that f(a) and f′(a) are already F-measurable, together with Property 9. Taking into account that E(X − a|F) = a − a = 0 it follows that
(2.4) E(f(X)|F) = f(a) + E(f″(θ(X))·(X − a)²/2 | F)
If E(f(X)|F) = f(a) = f(E(X|F)), then it means that E(f″(θ(X))·(X − a)²/2 | F) = 0. But f is convex, thus f″ ≥ 0. Being strictly convex, the set on which f″ = 0 contains no interval. But if Y ≥ 0 and E(Y|F) = 0, then Y = 0 a.s. Thus f″(θ(X))·(X − a)²/2 = 0 a.s. Let A = {ω : f″(θ(X(ω))) = 0} and B = {ω : (X(ω) − a)²/2 = 0}. We know that P(A∪B) = 1. If ω ∈ A then f(X(ω)) = f(a) + f′(a)(X(ω) − a). Well, that may happen only if X(ω) = a, else on the interval joining a and X(ω) the function f would be linear, which we denied. So in this case X(ω) = E(X|F)(ω). If ω ∈ B there is no problem either: X(ω) = a. So X = E(X|F) a.s. The second assertion is stronger, but it comes from the fact that E(f(X)) = E(f(E(X|F))) ⇔ E(E(f(X)|F)) = E(f(E(X|F))); as E(f(X)|F) ≥ f(E(X|F)) and the two have the same expectation, E(f(X)|F) = f(E(X|F)) (if we know that U ≥ V and EU = EV then U = V, too!) ⇒ X = E(X|F).
Property 14. The "interior" and "adherence" of a set in a σ-algebra.
Let F ⊂ K be a sub-σ-algebra and let A ∈ K. Define
(2.5) (Ā)_F = {P(A|F) > 0} and (Å)_F = {P(A|F) = 1}
Call (Ā)_F the "adherence" and (Å)_F the "interior" of the set A in the σ-algebra F. (Remark the quotation marks!) Remark also that these sets are defined only (a.s.), their definition depending on which version one uses for the conditional expectation!
Then
(2.6) (Å)_F ⊂ A ⊂ (Ā)_F (a.s.) and (Å)_F, (Ā)_F ∈ F.
(2.7) If C ⊂ A (a.s.), C ∈ F, then C ⊂ (Å)_F (a.s.)
(2.8) If A ⊂ B (a.s.), B ∈ F, then (Ā)_F ⊂ B (a.s.)
Notice that properties (2.7) and (2.8) are similar to the properties of the usual interior and adherence of a set in a topological space. Except that the inclusions are understood to hold only a.s., namely C ⊂ B means that P(C \ B) = 0.

Proof. We prove first (2.6). Let C = (Å)_F, B = (Ā)_F. As B, C ∈ F and 0 ≤ P(A|F) ≤ 1, it follows that 1_C = E(1_C|F) ≤ P(A|F) (= E(1_A|F)!) ≤ 1_B = E(1_B|F). From E(1_A − 1_C|F) ≥ 0 we get E((1_A − 1_C)1_Δ) ≥ 0 ∀Δ ∈ F (by the definition (1.7)!) ⇔ P(A∩Δ) − P(C∩Δ) ≥ 0 ∀Δ ∈ F. If we choose Δ = C it follows that P(A∩C) ≥ P(C) ⇒ P(A∩C) = P(C) ⇒ P(C \ A) = 0 ⇔ C ⊂ A (a.s.). On the other hand E(1_B − 1_A|F) ≥ 0 ⇒ P(B∩Δ) − P(A∩Δ) ≥ 0 ∀Δ ∈ F. If we choose Δ = Bᶜ it follows that P(B∩Bᶜ) ≥ P(A∩Bᶜ) ⇒ P(A \ B) = 0 ⇔ A ⊂ B (a.s.).
Now suppose that A ⊂ B (a.s.), B ∈ F. Then 1_A ≤ 1_B (a.s.) ⇒ E(1_A|F) ≤ E(1_B|F) = 1_B (a.s.) ⇒ {E(1_A|F) > 0} ⊂ {1_B > 0} ⇔ (Ā)_F ⊂ B (a.s.). The same method if C ⊂ A (a.s.), C ∈ F: then 1_C ≤ 1_A (a.s.) ⇒ 1_C = E(1_C|F) ≤ E(1_A|F) ⇒ {1_C = 1} ⊂ {E(1_A|F) = 1} ⇔ C ⊂ (Å)_F (a.s.).
Example. If F = σ(Δ), where Δ = (Δ_j)_{j∈J} is an at most countable partition of Ω, then (Ā)_F is the union of all the atoms Δ_j having the property that P(A∩Δ_j) > 0, and (Å)_F is the union of all the atoms Δ_j such that P(Δ_j \ A) = 0.
Property 15. Strict contractivity. If 1 < p < ∞, then ‖E(X|F)‖_p = ‖X‖_p ⇔ X = E(X|F).
If p ∈ {1,∞} this is not true, but the following conditions hold:
(2.9) ‖E(X|F)‖₁ = ‖X‖₁ ⇔ E(X⁺|F)·E(X⁻|F) = 0 ⇔ ({X > 0}‾)_F ∩ ({X < 0}‾)_F = ∅ (a.s.)
(2.10) ‖E(X|F)‖_∞ = ‖X‖_∞ ⇔ ‖P(|X| > ‖X‖_∞ − ε | F)‖_∞ = 1 ∀ε > 0.

Proof. Case 1. p ∈ (1,∞). The function f(x) = |x|ᵖ is strictly convex and ‖X‖_p^p = E(f(X)), ‖E(X|F)‖_p^p = E(f(E(X|F))). The assertion is a consequence of Property 13 (Strict Jensen inequality).
Case 2. p = 1. ‖E(X|F)‖₁ = ‖X‖₁ means that E(|E(X|F)|) = E|X| = E(E(|X| |F)) (we applied Property 3). Using the convexity of the function f(x) = |x| it follows that |E(X|F)| ≤ E(|X| |F). As these two functions have the same expectation, the only explanation is that |E(X|F)| = E(|X| |F) ⇔ |Y − Z| = Y + Z, where Y = E(X⁺|F) ≥ 0 and Z = E(X⁻|F) ≥ 0. That happens iff Y = 0 or Z = 0 ⇔ YZ = 0.
Let us prove the second equivalence. Let B = {E(X⁺|F) > 0} and C = ({X > 0}‾)_F. We claim that B = C. Indeed, both these sets belong to F. Due to the definition (1.7) we have that E(X⁺1_B) = E(E(X⁺|F)1_B) = E(E(X⁺|F)) (since always EY = E(Y1_{Y≠0})!) = E(X⁺). But X⁺1_B ≤ X⁺ and they have the same expectation ⇒ X⁺1_B = X⁺ (a.s.) ⇒ {X⁺ ≠ 0} ⊂ B ⇒ {X > 0} ⊂ B ⇒ ({X > 0}‾)_F ⊂ B (by (2.8)) ⇔ C ⊂ B. For the converse inclusion, remark that E(X⁺|F)·1_C = E(X⁺1_C|F) (Property 9!) = E(X1_{X>0}1_C|F) (as X⁺ = X1_{X>0}!) = E(X1_{X>0}|F) (as {X > 0} ⊂ C!) = E(X⁺|F). Meaning that {E(X⁺|F) > 0} ⊂ C ⇔ B ⊂ C. In the same way one checks that the sets {E(X⁻|F) > 0} and ({X < 0}‾)_F coincide. Now it is clear that YZ = 0 ⇔ {Y ≠ 0} ∩ {Z ≠ 0} = ∅. Conversely, if ({X > 0}‾)_F ∩ ({X < 0}‾)_F = ∅ it follows that {Y > 0} ∩ {Z > 0} = ∅ (a.s.) ⇒ |Y − Z| = Y + Z, proving our equivalences (2.9).
Example. If X = 1_A − 1_B, then ‖X‖₁ = P(A) + P(B) and ‖E(X|F)‖₁ = E(|P(A|F) − P(B|F)|). These two quantities coincide iff (Ā)_F ∩ (B̄)_F = ∅ (a.s.).
Case 3. p = ∞. Let M = ‖X‖_∞. As ‖X‖_∞ = ‖|X|‖_∞ we may as well suppose that X ≥ 0. We already know that ‖E(X|F)‖_∞ ≤ M. Let ε > 0. Then X ≤ M − ε + ε·1_{X>M−ε} ⇒ E(X|F) ≤ M − ε + ε·P(X > M − ε | F) ⇒ ‖E(X|F)‖_∞ ≤ M − ε + ε·‖P(X > M − ε | F)‖_∞. If ‖E(X|F)‖_∞ = M, then M ≤ M − ε + ε·‖P(X > M − ε | F)‖_∞ ⇒ ‖P(X > M − ε | F)‖_∞ ≥ 1 ⇒ ‖P(X > M − ε | F)‖_∞ = 1, proving the implication "⇒". For the other implication remark that X ≥ (M − ε)·1_{X>M−ε} ⇒ E(X|F) ≥ (M − ε)·P(X > M − ε | F) ⇒ ‖E(X|F)‖_∞ ≥ (M − ε)·‖P(X > M − ε | F)‖_∞ = M − ε for any ε > 0. Meaning that ‖E(X|F)‖_∞ = M.
Example. Let Ω = [1,∞), K = B([1,∞)), F = σ(Δ) with Δ = {[n, n+1)}_{n≥1}, P = ρλ with ρ(x) = 1/x². Let A_k = ∪_{n=k}^{∞} [n + ε_n, n+1) where ε_n < 1 and ε_n ↓ 0 as n → ∞, k ≥ 1. Then
P(A_k|F) = Σ_{n=k}^{∞} [P(A_k∩Δ_n)/P(Δ_n)]·1_{Δ_n} = Σ_{n=k}^{∞} [n(1 − ε_n)/(n + ε_n)]·1_{Δ_n}
and n(1 − ε_n)/(n + ε_n) → 1, so ‖P(A_k|F)‖_∞ = 1 although P(A_k|F)(ω) < 1 for every ω. Notice that (Å_k)_F = ∅, (Ā_k)_F = [k,∞) (a.s.) and, if X is the indicator of A_k, then {X > M − ε} = {X = 1} = A_k has void "interior". Still, ‖P(X > M − ε | F)‖_∞ = ‖P(A_k|F)‖_∞ = 1 > 0.

3. Regular conditioned distribution of a random variable.

Let X : Ω → E be a measurable function, where (E,E) is a measurable space. Let F ⊂ K be a sub-σ-algebra. Then we know that the conditioned distribution of X given F is the mapping B ↦ (P∘X⁻¹)(B|F) from E to the set of the F-measurable random variables assuming values between 0 and 1. This mapping is somewhat similar to a distribution in the following sense: if (Bₙ)ₙ is a sequence of disjoint sets from E, then
(3.1) (P∘X⁻¹)(∪_{n=1}^{∞} Bₙ | F) = Σ_{n=1}^{∞} (P∘X⁻¹)(Bₙ | F) (a.s.)
The reason is the following: (P∘X⁻¹)(∪ₙ Bₙ | F) = P(X⁻¹(∪ₙ Bₙ) | F) (by definition!) = E(1_{X⁻¹(∪ₙBₙ)} | F) (again by the definition of the conditioned probability) = E(1_{∪ₙBₙ}(X) | F) (since 1_{X⁻¹(B)} = 1_B(X)!) = E(Σₙ 1_{Bₙ}(X) | F) (as the sets are disjoint!) = Σₙ E(1_{Bₙ}(X) | F) (a.s.) (by Property 8.1, conditioned Beppo Levi!) = Σₙ E(1_{X⁻¹(Bₙ)} | F) = Σₙ P(X⁻¹(Bₙ) | F) = Σₙ (P∘X⁻¹)(Bₙ | F).

The trouble is that the equality (3.1) holds only almost surely. That is, the set of those ω having the property that (P∘X⁻¹)(∪ₙ Bₙ | F)(ω) ≠ Σₙ (P∘X⁻¹)(Bₙ | F)(ω) is negligible. We would like to find a negligible set N such that if ω ∉ N then (P∘X⁻¹)(∪ₙ Bₙ | F)(ω) = Σₙ (P∘X⁻¹)(Bₙ | F)(ω) for all sequences of disjoint sets (Bₙ)ₙ. In that case P∘X⁻¹(·|F)(ω) would be a real probability on (E,E) for all ω ∉ N. That is the regular conditioned distribution of X given F. To be precise:
Definition. Let (E,E) be a measurable space and X : Ω → E be a measurable function. A function Q : Ω × E → [0,1] having the properties
(i). Q(·,B) is a version for P(X⁻¹(B)|F)(·) ∀B ∈ E;
(ii). B ↦ Q(ω,B) is a probability on (E,E) ∀ω ∈ Ω
is called the regular conditioned distribution of X given F. Another name for this object could be: a regular version for the conditioned distribution of X given F.
At a first glance it is not at all obvious why such a regular version should exist at all.
We shall prove the following rather remarkable fact:

Proposition 3.1. If (E,E) = (ℝ,B(ℝ)) then a regular version for P∘X⁻¹(·|F) exists for any sub-σ-algebra F.
Proof. Let ℚ be the set of rational numbers.
Let us define the function G : ℚ × Ω → [0,1] by G(x,ω) = P(X ≤ x|F)(ω) = E(1_{(−∞,x]}(X)|F)(ω). (We choose arbitrary versions for P(X ≤ x|F)!) Let x < y be rationals and let A_{x,y} = {ω : G(x,ω) > G(y,ω)}. Due to the monotonicity of the conditional expectation (Property 5) all the sets A_{x,y} are negligible. Let then x ∈ ℚ be arbitrary and define the sets B_x = {ω : limₙ G(x + 1/n, ω) ≠ G(x,ω)}. As 1_{(−∞,x]} = limₙ 1_{(−∞,x+1/n]}, the conditioned Beppo Levi theorem (Property 8.1) says that P(X ≤ x|F) = limₙ P(X ≤ x + 1/n|F) (a.s.), i.e. the sets B_x are negligible, too. Let further C := {ω : lim_{x→−∞,x∈ℚ} G(x,ω) ≠ 0} and D := {ω : lim_{x→+∞,x∈ℚ} G(x,ω) ≠ 1}. Again by Beppo Levi, the sets C and D are negligible.
Let N be the union of all these sets: N = ∪_{x<y} A_{x,y} ∪ ∪_{x∈ℚ} B_x ∪ C ∪ D ∈ F. Being a countable union of negligible sets, N is negligible itself. Let Ω₀ = Ω \ N. Then P(Ω₀) = 1 and
(3.2) ω ∈ Ω₀ ⇒ x ↦ G(x,ω) is non-decreasing on ℚ, G(x,ω) = limₙ G(x + 1/n, ω), lim_{x→−∞} G(x,ω) = 0 and lim_{x→+∞} G(x,ω) = 1.
Let us define a new function F : ℝ × Ω → [0,1] by
(3.3) F(x,ω) = inf{G(y,ω) : y ∈ ℚ, y > x} if ω ∈ Ω₀, and F(x,ω) = 1_{[0,∞)}(x) if ω ∉ Ω₀.
We claim that
(i). x ↦ F(x,ω) is a distribution function for any ω;
(ii). ω ↦ F(x,ω) is F-measurable for any x;
(iii). F(x,·) = P(X ≤ x|F) (a.s.) for any x ∈ ℝ.
Let us check (i). For ω ∉ Ω₀ there is nothing to prove: in that case F(x,ω) = 1_{[0,∞)}(x). Suppose that ω ∈ Ω₀. Clearly F is non-decreasing. If x ∈ ℚ, then by (3.2) we see that F(x,ω) = G(x,ω). So F(−∞,ω) = 0, F(+∞,ω) = 1. The only problem is to prove that F(·,ω) is right-continuous. Suppose that ω ∈ Ω₀ is fixed; we shall not write it, to simplify the writing. Then lim_{y↓x} F(y) = inf{F(y) : y ∈ (x,∞)} (as F is non-decreasing!) = inf{inf{G(a) : a ∈ ℚ∩(y,∞)} : y ∈ (x,∞)} = inf{G(a) : a ∈ ℚ ∩ ∪_{y>x}(y,∞)} (as for any function G and any family of sets (A_ι)_{ι∈I} the equality inf{inf{G(x) : x ∈ A_ι} : ι ∈ I} = inf{G(x) : x ∈ ∪_ι A_ι} obviously holds — check it as an amusing exercise!) = inf{G(a) : a ∈ ℚ∩(x,∞)} = F(x). So F is right-continuous. As the functions ω ↦ G(a,ω) are F-measurable, it follows that F(x,·) is F-measurable, too.
Now we shall check (iii). Actually we shall prove more. Let μ(·,ω) be the probability measure on (ℝ,B(ℝ)) whose distribution function is F(·,ω), i.e. μ((−∞,x],ω) = F(x,ω) ∀x ∈ ℝ. Let us denote by C the family of sets B fulfilling the relation
(3.4) the set N_B := {ω : μ(B,ω) ≠ E(1_B(X)|F)(ω)} is negligible.
The claim is that
(i). C contains the family M = {(−∞,a] : a ∈ ℚ} (this is clear: μ((−∞,a],ω) = F(a,ω) = G(a,ω) = E(1_{(−∞,a]}(X)|F)(ω) ∀ω ∈ Ω₀!);
(ii). C is a λ-system. (Indeed, if B ∈ C then μ(B,·) = E(1_B(X)|F) (a.s.). On the other hand μ(Bᶜ,·) = 1 − μ(B,·) = 1 − E(1_B(X)|F) (a.s.) = E(1 − 1_B(X)|F) (a.s.) = E(1_{Bᶜ}(X)|F) (a.s.) ⇒ Bᶜ ∈ C. If the Bₙ ∈ C are disjoint then μ(∪ₙ Bₙ,·) = Σₙ μ(Bₙ,·) (as the μ(·,ω) are probabilities) = Σₙ E(1_{Bₙ}(X)|F) (a.s.) = E(Σₙ 1_{Bₙ}(X)|F) (a.s.) (by Property 8.1, conditioned Beppo Levi) = E(1_{∪ₙBₙ}(X)|F) (a.s.) ⇒ ∪_{n=1}^{∞} Bₙ ∈ C.)
From (i) and (ii) it follows that C contains the σ-algebra generated by M (M is stable under intersection, so the π–λ theorem applies). It happens that this coincides with B(ℝ). The conclusion is: μ(B,·) = E(1_B(X)|F) (a.s.) ∀B ∈ B(ℝ). Or, in another notation, μ(B,·) = P∘X⁻¹(B|F) (a.s.).
Therefore μ is a regular version for P∘X⁻¹(·|F).

The utility of the regular conditioned distribution is given by

Proposition 3.2. The transport formula.
Let (E,E) be a measurable space, X : Ω → E be a measurable function and F ⊂ K be a σ-algebra. Suppose that X admits a regular version μ for its conditioned distribution (P∘X⁻¹)(·|F). Let f : E → ℝ be measurable such that f(X) ∈ L¹. Then
(3.5) E(f(X)|F) = ∫f dμ (a.s.)
Proof. It is standard. Let us denote the regular version of (P∘X⁻¹)(B|F)(ω) by μ(B,ω). To avoid confusion, we shall denote the integral with respect to this family of measures by ∫f(x)μ(dx,ω). If we write ∫f dμ we shall understand the random variable (∫f dμ)(ω) := ∫f(x)μ(dx,ω).
Step 1. f is an indicator. So let f = 1_B, B ∈ E. Then E(f(X)|F) = E(1_B(X)|F) = P(X⁻¹(B)|F) (a.s.) = μ(B,·) = ∫1_B dμ.
Step 2. f is simple. Then f = Σ_{i=1}^{n} a_i·1_{B_i}, hence E(f(X)|F) = E(Σ_i a_i·1_{B_i}(X)|F) = Σ_i a_i·E(1_{B_i}(X)|F) (a.s.) (by Property 4, linearity) = Σ_i a_i·∫1_{B_i} dμ = ∫Σ_i a_i·1_{B_i} dμ = ∫f dμ ⇒ E(f(X)|F) = ∫f dμ (a.s.)
Step 3. f is nonnegative. Then f = lim↑ fₙ, with fₙ ≥ 0 simple. It means that E(f(X)|F) = E(lim fₙ(X)|F) = lim E(fₙ(X)|F) (a.s.) (by Property 8.1) = lim ∫fₙ dμ (by Step 2!) = ∫lim fₙ dμ (usual Beppo Levi, for every fixed ω) = ∫f dμ ⇒ E(f(X)|F) = ∫f dμ (a.s.)
Step 4. f is arbitrary. Then f = f⁺ − f⁻ where f⁺, f⁻ are the positive and negative parts of f. It follows that E(f(X)|F) = E(f⁺(X)|F) − E(f⁻(X)|F) (a.s.) (linearity) = ∫f⁺ dμ − ∫f⁻ dμ (a.s.) (by Step 3) = ∫f dμ ⇒ E(f(X)|F) = ∫f dμ (a.s.)
fd (a.s.).

Corollary 3.3. Conditioned expectation and variance. Let X : Ω → ℝ be a random variable from L² and F ⊂ K be a σ-algebra. Let μ be a regular version for its conditioned distribution, μ = (P∘X⁻¹)(·|F). We know that μ exists due to Proposition 3.1.
Then the conditioned expectation is given by
(3.6) E(X|F)(ω) = ∫x μ(dx,ω) (a.s.), E(X²|F)(ω) = ∫x² μ(dx,ω)
and the conditioned variance E((X − E(X|F))²|F) is given by
(3.7) Var(X|F)(ω) = ∫x² μ(dx,ω) − (∫x μ(dx,ω))²
Proof. These are easy consequences of the transport formula, the first relation with the functions f(x) = x and f(x) = x². For the second one notice that E((X − E(X|F))²|F) = E(X² − 2X·E(X|F) + E(X|F)²|F) = E(X²|F) − 2E(X|F)·E(X|F) + E(X|F)² (by Property 9!) = E(X²|F) − E(X|F)².
Now we shall busy ourselves to find more or less practical formulae to compute the regular conditioned distributions.

Corollary 3.4. If X is a real r.v., (E,E) is any measurable space and Y : Ω → E is measurable, then a regular version for (P∘X⁻¹)(·|σ(Y)) exists. It is denoted by (P∘X⁻¹)(·|Y) and has the form (P∘X⁻¹)(B|Y)(ω) = γ(B, Y(ω)) where γ : B(ℝ) × E → [0,1] has the properties
(i). B ↦ γ(B,y) is a probability on (ℝ,B(ℝ)) ∀y ∈ Range(Y) (if Range(Y) ∈ E then γ may be chosen such that (i) holds for any y ∈ E!);
(ii). y ↦ γ(B,y) is E-measurable ∀B ∈ B(ℝ).
Proof. Let F = σ(Y). According to Proposition 3.1 a regular version for (P∘X⁻¹)(·|F) exists. Denote it by μ. According to the definition, μ fulfills the following assumptions:
— B ↦ μ(B,ω) is a probability on (ℝ,B(ℝ));
— ω ↦ μ(B,ω) is F-measurable ∀B ∈ B(ℝ);
— the set N_B = {ω : μ(B,ω) ≠ P(X⁻¹(B)|F)(ω)} is negligible ∀B ∈ B(ℝ).
As F = σ(Y), by Property 12, μ(B,·) must be of the form μ(B,ω) = h_B(Y(ω)) where h_B : E → ℝ is E-measurable, and this measurability explains the claim (ii). Let us denote h_B(Y(ω)) by γ(B, Y(ω)). Then B ↦ γ(B,y) is a probability on (ℝ,B(ℝ)) ∀y ∈ Range(Y). Indeed, let y = Y(ω) ∈ Range(Y) and let (Bₙ)ₙ be a sequence of disjoint Borel sets. Then γ(∪ₙ Bₙ, y) = γ(∪ₙ Bₙ, Y(ω)) = μ(∪ₙ Bₙ, ω) = Σₙ μ(Bₙ,ω) (as μ(·,ω) is a probability) = Σₙ γ(Bₙ, Y(ω)) = Σₙ γ(Bₙ,y).
The problem is that B ↦ γ(B,y) may not be a probability when y ∉ Range(Y). If we know that Range(Y) ∈ E, that will not be a problem: we may define, for instance, γ*(B,y) to be equal to γ(B,y) if y ∈ Range(Y) and to δ₀(B) if y ∉ Range(Y). In that way we obtain a probability on (ℝ,B(ℝ)) and the measurability is preserved due to the following fact: if f : E → ℝ is measurable and A ∈ E, then g := f·1_A + c·1_{Aᶜ} is measurable, too, no matter the constant c. In our case f = γ(B,·) and c = δ₀(B) = 1_B(0).
In some cases we can find more useful formulae. For instance, when F is given by an at most countable partition (Δ_i)_{i∈I}. In that case a regular conditioned distribution exists for X : Ω → E, with (E,E) any measurable space.

Proposition 3.5. Let (E,E) be a measurable space, X : Ω → E be measurable and F be given by an at most countable partition (Δ_i)_{i∈I}. Then a regular conditioned distribution of X given F exists and it is given by the formula
(3.8) (P∘X⁻¹)(·|F) = Σ_{i∈I₀} (P_{Δ_i}∘X⁻¹)·1_{Δ_i} + μ*·1_Δ
where I₀ = {i ∈ I : P(Δ_i) ≠ 0}, P_{Δ_i} is the conditioned probability given Δ_i, i.e. P_{Δ_i}(A) = P(A∩Δ_i)/P(Δ_i) as defined in Section 1, μ* is an arbitrary probability on (E,E) and Δ is the union of the negligible atoms Δ_i. Of course Δ is negligible itself. If there are no negligible atoms, this second term of (3.8) vanishes.
Proof. Let B ∈ E. Then (P∘X⁻¹)(B|F) = P(X⁻¹(B)|F) = Σ_{i∈I₀} P(X⁻¹(B)|Δ_i)·1_{Δ_i} = Σ_{i∈I₀} P_{Δ_i}(X⁻¹(B))·1_{Δ_i} = Σ_{i∈I₀} (P_{Δ_i}∘X⁻¹)(B)·1_{Δ_i}. Let μ(B,ω) = Σ_{i∈I₀} (P_{Δ_i}∘X⁻¹)(B)·1_{Δ_i}(ω) + μ*(B)·1_Δ(ω). The F-measurability of the function ω ↦ μ(B,ω) is obvious; the fact that for any given ω the function B ↦ μ(B,ω) = (P_{Δ_i}∘X⁻¹)(B) (with Δ_i the unique set containing ω) is a probability is clear, too, due to Definition (1.1). Finally, μ(B,·) coincides with (P∘X⁻¹)(B|F) (a.s.).

Corollary 3.5. If (E,E), (F,F) are measurable spaces, X : Ω → E, and Y : Ω → F is discrete (thus F contains the singletons), then
(3.9) (P∘X⁻¹)(·|Y) = Σ_{y∈I₀} (P_{{Y=y}}∘X⁻¹)·1_{Y=y} + μ*·1_Δ
where I₀ = {y ∈ F : P(Y = y) > 0} and Δ = {ω : Y(ω) = y, y ∈ Range(Y) \ I₀} is negligible.
Proof. According to our hypothesis, I₀ is at most countable. Then we have
P(X ∈ B|Y) = P(X ∈ B|σ(Y)) = Σ_{y∈I₀} P(X ∈ B|Y = y)·1_{Y=y} = Σ_{y∈I₀} (P_{{Y=y}}∘X⁻¹)(B)·1_{Y=y}
We could leave the formula as it is, but if ω belongs to the negligible set {Y(ω) = y : y ∈ Range(Y) \ I₀}, then P(X ∈ B|Y)(ω) = 0 ∀B, and that would not be a probability. To have a regular version, we have to add a fictive probability μ* on the set Δ.

Corollary 3.6. The discrete case.
Suppose that the vector (X,Y) is discrete. It means that I := {(x,y) : P(X = x, Y = y) ≠ 0} is at most countable and P((X,Y) ∉ I) = 0. Let p(x,y) = P(X = x, Y = y), hence P∘(X,Y)⁻¹ = Σ_{(x,y)∈I} p(x,y)·δ_{(x,y)}. Then X is discrete, too. Let I₁ = pr₁(I) and I₂ = pr₂(I). Of course I₁ and I₂ are at most countable, and I ⊂ I₁×I₂. Then
(3.10) (P∘X⁻¹)(·|Y) = Σ_{x∈I₁} [p(x,Y)/p₂(Y)]·δ_x
(3.11) (P∘Y⁻¹)(·|X) = Σ_{y∈I₂} [p(X,y)/p₁(X)]·δ_y
where p₁(x) = Σ_{y∈I₂} p(x,y) and p₂(y) = Σ_{x∈I₁} p(x,y).
Proof. Remark that the distribution of X is P∘X⁻¹ = Σ_{x∈I₁} p₁(x)·δ_x and the distribution of Y is P∘Y⁻¹ = Σ_{y∈I₂} p₂(y)·δ_y, where p₁(x) = P(X = x) = Σ_{y∈I₂} P(X = x, Y = y) = Σ_{y∈I₂} p(x,y) and p₂(y) = P(Y = y) = Σ_{x∈I₁} P(X = x, Y = y) = Σ_{x∈I₁} p(x,y). Thus
P(X = x|Y) = Σ_{y∈I₂} P(X = x|Y = y)·1_{Y=y} = Σ_{y∈I₂} [P(X = x, Y = y)/P(Y = y)]·1_{Y=y} = Σ_{y∈I₂} [p(x,y)/p₂(y)]·1_{Y=y}
hence we can write P(X = x|Y) = p(x,Y)/p₂(Y) ∀x ∈ I₁. This is a discrete distribution which can be written in the shorter form (P∘X⁻¹)(·|Y) = Σ_{x∈I₁} [p(x,Y)/p₂(Y)]·δ_x, proving (3.10). The equality (3.11) has the same proof.
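Formulas (3.10) and (3.11) amount to normalizing the columns and rows of the joint pmf. A minimal Python sketch (the joint table is a hypothetical example):

```python
import numpy as np

# joint pmf p(x,y) on I1 x I2
p = np.array([[0.10, 0.20, 0.05],
              [0.15, 0.30, 0.20]])   # rows: x in {0,1}; columns: y in {0,1,2}
p1 = p.sum(axis=1)                   # marginal of X
p2 = p.sum(axis=0)                   # marginal of Y

p_X_given_Y = p / p2                 # column y is the pmf x -> p(x,y)/p2(y), cf. (3.10)
p_Y_given_X = (p.T / p1).T           # row x is the pmf y -> p(x,y)/p1(x), cf. (3.11)

print(p_X_given_Y.sum(axis=0))       # each column sums to 1
print(p_Y_given_X.sum(axis=1))       # each row sums to 1
```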


Remark. In statistics one prefers the notation p_X, p_Y, p_{X|Y} and p_{Y|X} instead of p₁, p₂, P(X = x|Y = y) and P(Y = y|X = x).
A remarkable fact is that an analog of (3.10) and (3.11) exists in the absolutely continuous case. We shall prove that in the special case when X, Y are real random variables and the vector (X,Y) is absolutely continuous, meaning that P∘(X,Y)⁻¹ = ρλ², λ being the Lebesgue measure.
Proposition 3.7.
(3.12) (P∘X⁻¹)(·|Y)(ω) = ρ_{1|2}(·,ω)λ
(3.13) (P∘Y⁻¹)(·|X)(ω) = ρ_{2|1}(·,ω)λ
where ρ_{1|2}(x,ω) = ρ(x,Y(ω))/ρ₂(Y(ω)), ρ_{2|1}(y,ω) = ρ(X(ω),y)/ρ₁(X(ω)), ρ₁(x) = ∫ρ(x,y)dλ(y) and ρ₂(y) = ∫ρ(x,y)dλ(x).

Remark. In statistics one uses the notations ρ_X instead of ρ₁, ρ_Y instead of ρ₂, ρ_{X|Y=y} instead of ρ_{1|2} and ρ_{Y|X=x} instead of ρ_{2|1}. One also uses the notation P(X ∈ A|Y = y) instead of P(X ∈ A|Y)(ω), which can be very misleading for a beginner, because it has no immediate meaning.

Proof.
It is easy to see that ρ₁ and ρ₂ are the densities of X and Y. (For instance, P(X ∈ A) = P((X,Y) ∈ A×ℝ) = ∫1_{A×ℝ}ρ dλ² = ∫1_A(x)(∫ρ(x,y)dλ(y))dλ(x) = ∫1_A ρ₁ dλ ∀A ∈ B(ℝ) ⇒ P∘X⁻¹ = ρ₁λ.)
We shall prove (3.12). The task is to check that (P∘X⁻¹)(A|Y)(ω) = (ρ_{1|2}(·,ω)λ)(A) for almost all ω. Or, to check that E(1_A(X)|Y) = ∫1_A(x)·[ρ(x,Y)/ρ₂(Y)]dλ(x) (a.s.). As the measurability is ensured by the Fubini–Tonelli theorem, it follows that, according to (1.7), we have to check only that E(1_A(X)1_C) = E(∫1_A(x)·[ρ(x,Y)/ρ₂(Y)]dλ(x)·1_C) ∀C ∈ σ(Y). As any C with this property is of the form C = Y⁻¹(B) for some B ∈ B(ℝ), the task is to prove that
(3.14) E(1_A(X)1_B(Y)) = E(∫1_A(x)·[ρ(x,Y)/ρ₂(Y)]dλ(x)·1_B(Y))
But E(∫1_A(x)·[ρ(x,Y)/ρ₂(Y)]dλ(x)·1_B(Y)) = ∫(∫1_A(x)·[ρ(x,y)/ρ₂(y)]dλ(x)·1_B(y))d(P∘Y⁻¹)(y) (by the transport formula) = ∫(∫1_A(x)·[ρ(x,y)/ρ₂(y)]dλ(x)·1_B(y))d(ρ₂λ)(y) = ∫∫1_A(x)ρ(x,y)1_B(y)dλ(x)dλ(y) (by Fubini!) = ∫1_{A×B}·ρ dλ² = ∫1_{A×B} d(P∘(X,Y)⁻¹) = ∫1_{A×B}(X,Y)dP (by the transport formula) = E(1_A(X)1_B(Y)), hence (3.14) follows. The equality (3.13) has a similar proof.


Remark. The statistical notation has its own reason. After all, the formulae (3.12) and (3.13) come from the natural feeling that something that holds in the discrete case must also hold somehow in the absolutely continuous setting. Namely, if P(X ∈ A|Y = y) should have a sense at all, it should be lim_{ε→0} P(X ∈ A | y − ε < Y < y + ε). Sometimes this is true and coincides with ∫1_A(x)ρ_{1|2}(x,y)dλ(x), and that is a motivation for the notation ρ_{X|Y=y}. Precisely:

Proposition 3.8. If ρ and ρ₂ are continuous, then
(3.15) lim_{ε→0} P(X ∈ A | y − ε < Y < y + ε) = ∫1_A(x)ρ_{1|2}(x,y)dλ(x)

Proof. lim_{ε→0} P(X ∈ A | y − ε < Y < y + ε) = lim_{ε→0} P(X ∈ A, y − ε < Y < y + ε)/P(y − ε < Y < y + ε)
= lim_{ε→0} [∫1_A(u)·1_{(y−ε,y+ε)}(v)·ρ(u,v)dλ²(u,v)] / [∫ρ₂(v)·1_{(y−ε,y+ε)}(v)dλ(v)]
= lim_{ε→0} [∫_{y−ε}^{y+ε}(∫1_A(u)ρ(u,v)dλ(u))dv] / [∫_{y−ε}^{y+ε}ρ₂(v)dv]
(we used the fact that for continuous functions the Lebesgue and the Riemann integrals coincide, and the fact that if the function v ↦ ∫ρ(u,v)dλ(u) is continuous, then v ↦ A(v) := ∫1_A(u)ρ(u,v)dλ(u) is continuous, too). It follows that
lim_{ε→0} P(X ∈ A | y − ε < Y < y + ε) = A(y)/ρ₂(y) (one applies l'Hôpital's rule!) = ∫1_A(u)·[ρ(u,y)/ρ₂(y)]dλ(u) = ∫1_A(x)ρ_{1|2}(x,y)dλ(x).

Transition Probabilities
1. Definitions and notations.
Let (E,E) and (F,F) be two measurable spaces. A function Q : E×F → [0,1] is called a transition probability from E to F if
(i). x ↦ Q(x,B) is E-measurable ∀B ∈ F and
(ii). B ↦ Q(x,B) is a probability on (F,F) ∀x ∈ E.
Thus we can imagine Q as a family (Q_x)_x of probabilities on (F,F) indexed by the set E. That is the way they do it in statistics: they denote Q by (P_θ)_θ. We shall denote by Q(x) the probability defined by Q(x)(B) = Q(x,B).
We shall write in short "Let E →Q F" instead of "Let Q be a transition probability from E to F".
Example 1. The regular conditioned distribution of a random variable X by a sub-σ-algebra F, denoted by P∘X⁻¹(·|F), is a transition probability from (Ω,F) to (ℝ,B(ℝ)) (see Conditioning, Section 3). Indeed, if we put Q(ω,B) = P(X ∈ B|F)(ω) = P∘X⁻¹(B|F)(ω), then (i) and (ii) are fulfilled by the very construction of Q.
Example 2. A particular case is the Q(y,B) defined by Q(Y(ω),B) = P(X ∈ B|Y)(ω) (the regular version!), where X and Y are two real random variables. This time Q is a transition probability from (ℝ,B(ℝ)) to itself.
Example 3. If F is at most countable and F = P(F) (all the subsets of F!) then all the transition probabilities from E to F are of the form
(1.1) Q(x) = Σ_{y∈F} q(x,y)·δ_y
where the mappings x ↦ q(x,y) are measurable and Σ_{y∈F} q(x,y) = 1 ∀x ∈ E. Indeed, if we denote Q(x,{y}) by q(x,y), then 1 = Q(x,F) = Σ_{y∈F} Q(x,{y}) = Σ_{y∈F} q(x,y). Moreover, by (i) these mappings should be measurable.
Example 4. If E is at most countable and E = P(E) then there are no measurability problems and all families (Q(x))_{x∈E} of probabilities on F are transition probabilities.
Example 5. If both E and F are at most countable, then a transition probability is simply a (possibly infinite) matrix Q = (q(x,y))_{x∈E,y∈F} with the property that Σ_{y∈F} q(x,y) = 1 ∀x ∈ E. That is called a stochastic matrix. If E, F are even finite, this is an ordinary matrix with the sum of the entries on every line equal to 1. We can think of a stochastic matrix as being a collection of stochastic vectors, that is, of nonnegative vectors with the sum of the components equal to 1.
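In code, Example 5 is just a nonnegative matrix with unit row sums (Python sketch; the entries are arbitrary illustrative values):

```python
import numpy as np

# a transition probability from E = {0,1,2} to F = {0,1}: a 3x2 stochastic matrix
Q = np.array([[0.2, 0.8],
              [0.5, 0.5],
              [1.0, 0.0]])
assert (Q >= 0).all() and np.allclose(Q.sum(axis=1), 1.0)  # each row Q(x) is a probability
print(Q.sum(axis=1))
```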

2. The product between a probability and a transition probability.

Let (E,E) and (F,F) be two measurable spaces and E →Q F. Let also μ be a probability (or, more generally, a signed bounded measure) on (E,E). Then we denote by μ⊗Q the function defined on E⊗F by the relation
(2.1) μ⊗Q(C) = ∫Q(x, C(x,·))dμ(x)
Here C(x,·) = {y : (x,y) ∈ C} is the section in C made at x.
We shall also use the notation
(2.2) μQ(B) = μ⊗Q(E×B) = ∫Q(x,B)dμ(x)

Proposition 2.1.
(i). If μ is a bounded signed measure on (E,E), then μ⊗Q is a bounded signed measure on E⊗F. If μ is a probability, then μ⊗Q is a probability, too. If f : E×F → ℝ is measurable (nonnegative or bounded) then
(2.3) ∫f d(μ⊗Q) = ∫(∫f(x,y)dQ(x)(y))dμ(x)
Remark. The meaning of (2.3) is that firstly we integrate f(x,·) with respect to the measure Q(x) and then we integrate the resulting function with respect to the measure μ. The notation from (2.3) is awkward; that is why one writes ∫∫f(x,·)dQ dμ(x) instead. The most accepted notation is, however, ∫∫f(x,y)Q(x,dy)dμ(x). So (2.3) written in the standard form becomes
(2.4) ∫f d(μ⊗Q) = ∫∫f(x,y)Q(x,dy)dμ(x)
(ii). If μ is a bounded signed measure on (E,E), then μQ is a bounded signed measure on F. If μ is a probability, then μQ is a probability, too. If f : F → ℝ is measurable (nonnegative or bounded) then
(2.5) ∫f d(μQ) = ∫∫f(y)Q(x,dy)dμ(x)
Proof. It is easy. Firstly, both μ⊗Q and μQ are measures because of the Beppo Levi theorem. Indeed, if the Cₙ are disjoint, then μ⊗Q(∪_{n=1}^{∞} Cₙ) = ∫Q(x,(∪ₙCₙ)(x,·))dμ(x) = ∫Q(x,∪ₙ(Cₙ(x,·)))dμ(x) = ∫Σₙ Q(x,Cₙ(x,·))dμ(x) = Σₙ∫Q(x,Cₙ(x,·))dμ(x) (by Beppo Levi!) = Σₙ μ⊗Q(Cₙ). Thus μ⊗Q is a measure. Moreover μ⊗Q(E×F) = ∫Q(x,F)dμ(x) = ∫1 dμ(x) = μ(E); so, if μ(E) = 1, then μ⊗Q(E×F) = 1 too. As about the formula (2.4), its proof is standard, into the usual steps: indicator, simple function, nonnegative function, any. The same holds for (2.5).
Remark 2.1. Suppose that F is countable. Then Q has the form (1.1), and (2.1) and (2.2) become
(2.6) μ⊗Q(A×{y}) = ∫q(x,y)·1_A(x)dμ(x)
(2.7) μQ({y}) = ∫q(x,y)dμ(x)
If, moreover, E is at most countable, too, then μ = Σ_{x∈E} p(x)·δ_x, therefore (2.6) and (2.7) become
(2.8) μ⊗Q({(x,y)}) = p(x)q(x,y)
(2.9) μQ({y}) = Σ_{x∈E} p(x)q(x,y)
The relation (2.9) motivates the notation μQ. For, if we think of μ as being the row vector (p(x))_{x∈E} and of Q as being the matrix (q(x,y))_{x∈E,y∈F}, then μQ is the usual product between μ and Q: μQ({y}) is the entry (μQ)_y. That is why, when dealing with the at most countable case, it goes without saying that μ is a row vector and Q a stochastic matrix.
Remark 2.2. If μ = δ_x, then obviously μQ = Q(x). Therefore
(2.10) δ_x Q = Q(x)
If we are in the at most countable case, the probabilities δ_x correspond to the canonical basis e_x; the meaning of (2.10) is that the product between e_x and Q is the row (Q_{x,y})_y.
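Relation (2.9) in one line of Python (numbers are illustrative):

```python
import numpy as np

mu = np.array([0.5, 0.3, 0.2])        # probability (row vector) on E = {0,1,2}
Q = np.array([[0.2, 0.8],
              [0.5, 0.5],
              [1.0, 0.0]])            # stochastic matrix: transition from E to F
muQ = mu @ Q                           # (2.9): the distribution muQ on F
print(muQ, muQ.sum())                  # a probability on F = {0,1}
```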

Let M(E,E) denote the set of all the bounded signed measures on the measurable space (E,E), Prob(E,E) be the set of all the probabilities on that space and let Bo(E,E) denote the set of all the bounded measurable functions f : E → ℝ.
Notice that M(E,E) is a Banach space with respect to the variation norm defined as ‖μ‖ = μ⁺(E) + μ⁻(E), where μ = μ⁺ − μ⁻ is the Hahn–Jordan decomposition of μ. Recall that μ⁺ is defined by μ⁺(A) = μ(A∩H) where H is the Hahn–Jordan set of μ, that is a set (almost surely defined) with the property that μ(H) = sup{μ(A) : A ∈ E}. In this Banach space the set Prob(E,E) is closed and convex.
On the other hand Bo(E,E) is a Banach space too, with the uniform norm ‖f‖ = sup{|f(x)| : x ∈ E}. The connection between these two spaces is given by
Lemma 2.2.
(i). ‖μ‖_{M(E,E)} = sup{|∫f dμ| : f ∈ Bo(E,E), ‖f‖ = 1}
(ii). f ∈ Bo(E,E) ⇒ ‖f‖ = sup{|∫f dμ| : μ ∈ M(E,E), ‖μ‖ = 1}
(iii). |∫f dμ| ≤ ‖f‖·‖μ‖
It means that the mapping (μ,f) ↦ ⟨μ,f⟩ := ∫f dμ is a duality. These spaces form a dual pair.
Proof. Let H be the Hahn–Jordan set of μ. Then μ⁺(E) = μ(H) and μ⁻(E) = −μ(Hᶜ). So ‖μ‖ = μ(H) − μ(Hᶜ) = ∫f dμ where f = 1_H − 1_{Hᶜ}. As ‖f‖ = 1, ‖μ‖ ≤ sup{|∫f dμ| : f ∈ Bo(E,E), ‖f‖ = 1}. On the other hand |∫f dμ| = |∫f dμ⁺ − ∫f dμ⁻| ≤ |∫f dμ⁺| + |∫f dμ⁻| ≤ ‖f‖μ⁺(E) + ‖f‖μ⁻(E) = ‖f‖(μ⁺(E) + μ⁻(E)) = ‖f‖·‖μ‖, hence ‖f‖ = 1 ⇒ |∫f dμ| ≤ ‖μ‖, so (i) and (iii) hold. As about (ii), it is even simpler: (iii) implies that ‖f‖ ≥ sup{|∫f dμ| : μ ∈ M(E,E), ‖μ‖ = 1}, and if (xₙ)ₙ is a sequence of points from E such that ‖f‖ = limₙ|f(xₙ)|, then ‖f‖ = limₙ|∫f dδ_{xₙ}|, proving the converse inequality.


Let now (E,E) and (F,F) be two measurable spaces and E →Q F. Consider the mappings T : M(E,E) → M(F,F) and T′ : Bo(F,F) → Bo(E,E) defined by
(2.11) T(μ) = μQ
(2.12) T′(f) = Qf, defined by Qf(x) = ∫f dQ(x) = ∫f(y)Q(x,dy)

Proposition 2.3. Both T and T′ are linear operators; ‖T‖ = ‖T′‖ = 1 and T′ is the adjoint of T in the sense of the duality ⟨·,·⟩. That is,
(2.13) ⟨T(μ),f⟩ = ⟨μ,T′(f)⟩ or, explicitly, ∫f dT(μ) = ∫T′(f)dμ ∀f ∈ Bo(F,F), μ ∈ M(E,E)
Proof. ∫f dT(μ) = ∫f d(μQ) = ∫∫f(y)Q(x,dy)dμ(x) = ∫T′(f)(x)dμ(x). The linearity is obvious. Moreover ‖T‖ = sup{‖T(μ)‖ : ‖μ‖ = 1} = sup{|∫f d(μQ)| : ‖μ‖ = 1, ‖f‖ = 1} ≤ 1 (by Lemma 2.2(iii)). But if μ is a probability, then ‖μ‖ = 1 and ‖T(μ)‖ = 1 as T(μ) is a probability, too; hence ‖T‖ = 1, and similarly ‖T′‖ = 1.
Remark 2.3. If F is at most countable, then by (1.1) Q(x) = Σ_{y∈F} q(x,y)·δ_y, hence
(2.15) Qf(x) = Σ_{y∈F} q(x,y)·f(y)
We can visualize f as being a column vector and Q as being a matrix. Clearly (2.15) is the product between the matrix Q and the vector f. That motivates the notation. So, from now on, it goes without saying that in the at most countable case the measures are row vectors and the functions are column vectors.

3. Contractivity properties of a transition probability.

Let (E,E) and (F,F) be two measurable spaces and E →Q F. In the previous section we have defined the operator T(μ) = μQ.
We shall accept that the first space has the property that the singletons {x} belong to E. As a consequence the Dirac probabilities δ_x and δ_{x′} have the property that x ≠ x′ ⇒ ‖δ_x − δ_{x′}‖ = 2. Indeed, if μ = δ_x − δ_{x′}, then μ⁺ = δ_x and μ⁻ = δ_{x′} ⇒ ‖μ‖ = μ⁺(E) + μ⁻(E) = 1 + 1 = 2.
(This may not be true if there are singletons {x} ∉ E, since in that case it is possible that there exists x′ ∈ E such that any set containing x′ contains x, too. It means that ‖δ_x − δ_{x′}‖ = 0.)
Let us define the quantity
(3.1) γ(Q) = (1/2)·sup{‖Q(x) − Q(x′)‖ : x, x′ ∈ E} = sup{‖(δ_x − δ_{x′})Q‖ / ‖δ_x − δ_{x′}‖ : x ≠ x′}
This is the contraction coefficient of Dobrushin. Remark that, as Q(x) and Q(x′) are probabilities, ‖Q(x)‖ = ‖Q(x′)‖ = 1, hence γ(Q) ≤ (1/2)·sup(‖Q(x)‖ + ‖Q(x′)‖) = 1. It means that the contraction coefficient has the property 0 ≤ γ(Q) ≤ 1.

Proposition 3.1. The following inequality holds for any μ ∈ M(E,E):
(3.2) ‖μQ‖ ≤ γ(Q)·‖μ‖ + (1 − γ(Q))·|μ(E)|
Proof. Let us fix some notations. Let H be the Jordan set of μ, K be its complementary, m be the variation of μ, m = |μ| = μ⁺ + μ⁻, a = μ⁺(E) = m(H), b = μ⁻(E) = m(K). Then
(3.3) μ = (1_H − 1_K)m, a + b = ‖μ‖, a − b = μ(E)
Taking into account Lemma 2.2(i), one sees that the task is to prove that
(3.4) f ∈ Bo(F,F), ‖f‖ = 1 ⇒ |∫f d(μQ)| ≤ γ(Q)(a + b) + (1 − γ(Q))|a − b|
If b = 0, μ is a usual (nonnegative) measure, then ‖μ‖ = μ(E), hence (3.2) becomes ‖μQ‖ ≤ ‖μ‖, and this is true because of Proposition 2.3 (namely ‖T‖ = 1!). The same if a = 0; now ‖μ‖ = μ⁻(E) = −μ(E) and (3.2) becomes again ‖μQ‖ ≤ ‖μ‖.
So we shall suppose that a ≠ 0, b ≠ 0 and, moreover, that a ≥ b (if not, replace μ with −μ and (3.2) remains the same!).
Then, as |a − b| = a − b, we get γ(Q)(a + b) + (1 − γ(Q))|a − b| = γ(Q)(a + b) + (1 − γ(Q))(a − b) = 2bγ(Q) + a − b, hence (3.4) becomes
(3.5) f ∈ Bo(F,F), ‖f‖ = 1 ⇒ |∫f d(μQ)| ≤ 2bγ(Q) + a − b
Now ∫f d(μQ) = ∫Qf dμ = ∫_H Qf dm − ∫_K Qf dm. Since m(K) = b and m(H) = a,
∫_H Qf dm = (1/b)·∫∫ Qf(x)·1_{H×K}(x,x′) d(m⊗m)(x,x′) and ∫_K Qf dm = (1/a)·∫∫ Qf(x′)·1_{H×K}(x,x′) d(m⊗m)(x,x′), hence
∫f d(μQ) = (1/(ab))·∫∫ [a·Qf(x) − b·Qf(x′)]·1_{H×K}(x,x′) d(m⊗m)(x,x′).
As (m⊗m)(H×K) = ab, it follows that
|∫f d(μQ)| ≤ sup_{x,x′∈E} |a·∫f(y)Q(x,dy) − b·∫f(y)Q(x′,dy)| = sup_{x,x′∈E} |∫f d(aQ(x) − bQ(x′))| ≤ sup_{x,x′∈E} ‖aQ(x) − bQ(x′)‖ (as ‖f‖ = 1; see Lemma 2.2(iii)).
But aQ(x) − bQ(x′) = b(Q(x) − Q(x′)) + (a − b)Q(x), so ‖aQ(x) − bQ(x′)‖ ≤ b‖Q(x) − Q(x′)‖ + (a − b)‖Q(x)‖ = 2b·(1/2)‖Q(x) − Q(x′)‖ + a − b. It follows that sup_{x,x′∈E} ‖aQ(x) − bQ(x′)‖ ≤ 2bγ(Q) + a − b, which is exactly (3.5).


Corollary 3.2. Let T₀ be the restriction of T to the Banach subspace M₀(E,E) of the measures with the property that μ(E) = 0. Then Range(T₀) ⊂ M₀(F,F) and ‖T₀‖ = γ(Q). As a consequence, if μ₁, μ₂ are probabilities on (E,E), then ‖μ₁Q − μ₂Q‖ ≤ 2γ(Q).
Proof. The first assertion is immediate: (T₀μ)(F) = μQ(F) = ∫Q(x,F)dμ(x) = μ(E) = 0. For the second one, remark that if μ(E) = 0, then (3.2) becomes
(3.6) ‖μQ‖ ≤ γ(Q)·‖μ‖
Now, according to the definition of the norm of an operator, ‖T₀‖ = sup{‖T₀μ‖/‖μ‖ : μ ∈ M₀(E,E), μ ≠ 0} ≤ γ(Q). The converse inequality is obvious since γ(Q) = (1/2)·sup{‖(δ_x − δ_{x′})Q‖ : x, x′ ∈ E} = sup{‖T₀μ‖/‖μ‖ : μ ∈ X} ≤ ‖T₀‖, where X = {(δ_x − δ_{x′})/2 : x ≠ x′ ∈ E} ⊂ M₀(E,E). The last claim comes from the fact that ‖μ₁Q − μ₂Q‖ = ‖(μ₁ − μ₂)Q‖ ≤ γ(Q)·‖μ₁ − μ₂‖ (since (μ₁ − μ₂)(E) = 1 − 1 = 0!) ≤ γ(Q)(‖μ₁‖ + ‖μ₂‖) = γ(Q)(1 + 1) = 2γ(Q).
If F is at most countable, then the coefficient γ(Q) is computable. Indeed, if Q(x) = Σ_y q(x,y)·δ_y and Q(x′) = Σ_y q(x′,y)·δ_y, then
‖Q(x) − Q(x′)‖ = Σ_{y∈F} |q(x,y) − q(x′,y)|.
This is a consequence of the fact that if ν = fπ for a σ-finite measure π, then ‖ν‖ = ∫|f|dπ; in our case π = card, the counting measure, which is σ-finite since F is at most countable. If E is at most countable, too, then we have the following consequence:
Corollary 3.3. Suppose that E and F are at most countable. Then μ is a stochastic vector (μ(x))_{x∈E} and
(3.7) γ(Q) = (1/2)·sup{Σ_{y∈F} |q(x,y) − q(x′,y)| : x, x′ ∈ E}
In this case (3.2) becomes
(3.8) Σ_y |Σ_x μ(x)q(x,y)| ≤ γ(Q)·Σ_x |μ(x)| + (1 − γ(Q))·|Σ_x μ(x)|
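Formula (3.7) is straightforward to evaluate for a finite stochastic matrix (Python sketch; the matrix entries are illustrative, and the last line is a numerical check of the submultiplicativity that Corollary 4.3 below proves):

```python
import numpy as np
from itertools import combinations

def dobrushin(Q):
    # gamma(Q) = (1/2) max over row pairs of the L1 distance between rows, cf. (3.7)
    return max(0.5 * np.abs(Q[i] - Q[j]).sum()
               for i, j in combinations(range(Q.shape[0]), 2))

Q = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
print(dobrushin(Q))                             # in [0,1]; < 1 means Q is scrambling
print(dobrushin(Q @ Q) <= dobrushin(Q) ** 2)    # gamma(Q^2) <= gamma(Q)^2 -> True
```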

4. The product between transition probabilities.

Since a transition probability is a kind of matrix, sometimes it is possible to multiply two of them. Suppose now that we have three measurable spaces (E_j,E_j), j = 1, 2, 3, and two transition probabilities E₁ →Q₁ E₂, E₂ →Q₂ E₃.
Then we may construct two other transition probabilities, denoted by Q₁⊗Q₂ and Q₁Q₂. The first one is a transition probability from E₁ to E₂×E₃ and the second one is from E₁ to E₃. Here are the definitions:
(4.1) Q₁⊗Q₂(x₁, A₂×A₃) = ∫Q₂(x₂,A₃)·1_{A₂}(x₂)·Q₁(x₁,dx₂)
(4.2) Q₁Q₂(x₁, A₃) = Q₁⊗Q₂(x₁, E₂×A₃) = ∫Q₂(x₂,A₃)·Q₁(x₁,dx₂)

Proposition 4.1.
(i). If f : E₂×E₃ → ℝ is bounded or nonnegative then
(4.3) ∫f d(Q₁⊗Q₂)(x₁) = ∫∫f(x₂,x₃)Q₂(x₂,dx₃)Q₁(x₁,dx₂) ( = ((Q₁⊗Q₂)f)(x₁) )
(ii). If f : E₃ → ℝ is bounded or nonnegative then
(4.4) ∫f d(Q₁Q₂)(x₁) = ∫∫f(x₃)Q₂(x₂,dx₃)Q₁(x₁,dx₂) ( = ((Q₁Q₂)f)(x₁) )
Proof. Standard; the four steps.

Remark. If the spaces E_j are at most countable then we deal with stochastic matrices: Q₁ = (q₁(x₁,x₂))_{x₁∈E₁,x₂∈E₂}, Q₂ = (q₂(x₂,x₃))_{x₂∈E₂,x₃∈E₃}, and (4.1), (4.2) become
(4.5) Q₁⊗Q₂(x₁, {x₂}×{x₃}) = q₁(x₁,x₂)q₂(x₂,x₃)
(4.6) Q₁Q₂(x₁, {x₃}) = Σ_{x₂∈E₂} q₁(x₁,x₂)q₂(x₂,x₃)
(4.7) ((Q₁⊗Q₂)f)(x₁) = Σ_{x₂∈E₂, x₃∈E₃} f(x₂,x₃)q₁(x₁,x₂)q₂(x₂,x₃)
(4.8) ((Q₁Q₂)f)(x₁) = Σ_{x₂∈E₂, x₃∈E₃} f(x₃)q₁(x₁,x₂)q₂(x₂,x₃)
The relation (4.6) is interesting: it is the usual product of the stochastic matrices Q₁ and Q₂. The equality (4.5) has no obvious analog among the matrix operations. It is easy to see that this product is associative.

Proposition 4.2. The associativity. Let μ be a bounded signed measure on E₁. Then
(4.9) (μQ₁)Q₂ = μ(Q₁Q₂)
(4.10) Q₁(Q₂f) = (Q₁Q₂)f
If (E₄,E₄) is another measurable space and E₃ →Q₃ E₄ then
(4.11) (Q₁Q₂)Q₃ = Q₁(Q₂Q₃)
Proof. Let f : E₃ → ℝ be bounded or nonnegative. Then ∫f d[(μQ₁)Q₂] = ∫g(x₂)d(μQ₁)(x₂) (with g(x₂) = ∫f(x₃)Q₂(x₂,dx₃) = Q₂f(x₂)) = ∫∫g(x₂)Q₁(x₁,dx₂)dμ(x₁) = ∫∫(∫f(x₃)Q₂(x₂,dx₃))Q₁(x₁,dx₂)dμ(x₁). On the other hand ∫f d[μ(Q₁Q₂)] = ∫∫f(x₃)(Q₁Q₂)(x₁,dx₃)dμ(x₁) = ∫∫∫f(x₃)Q₂(x₂,dx₃)Q₁(x₁,dx₂)dμ(x₁) (by (4.4)), so both quantities coincide. As about (4.11), one gets (Q₁Q₂)Q₃(x) = δ_x(Q₁Q₂)Q₃ = (δ_xQ₁Q₂)Q₃ and [Q₁(Q₂Q₃)](x) = (δ_xQ₁)(Q₂Q₃) = (δ_xQ₁Q₂)Q₃, which is the same.
Remark. If all the spaces are at most countable, then (4.9) and (4.10) are the usual products between a row vector and a matrix (this is (4.9)) or between a matrix and a column vector (this is (4.10)).
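Relation (4.6) in code (Python sketch, illustrative entries): the product of two stochastic matrices is again stochastic.

```python
import numpy as np

Q1 = np.array([[0.7, 0.3],
               [0.4, 0.6]])           # transition from E1 (2 states) to E2 (2 states)
Q2 = np.array([[0.5, 0.25, 0.25],
               [0.1, 0.80, 0.10]])    # transition from E2 to E3 (3 states)

Q = Q1 @ Q2                           # (4.6): Q1Q2 is the matrix product
print(Q)
print(Q.sum(axis=1))                  # rows still sum to 1: again a stochastic matrix
```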
Corollary 4.3. The Dobrushin contraction coefficient is submultiplicative. The following inequality holds:
(4.12) γ(Q₁Q₂) ≤ γ(Q₁)·γ(Q₂)
Proof. Let T₁ : M₀(E₁,E₁) → M₀(E₂,E₂) and T₂ : M₀(E₂,E₂) → M₀(E₃,E₃) be defined as T₁(μ) = μQ₁ and T₂(μ) = μQ₂. Then we know from Corollary 3.2 that γ(Q₁) = ‖T₁‖ and γ(Q₂) = ‖T₂‖. Notice that T₂T₁(μ) = T₁(μ)Q₂ = (μQ₁)Q₂ = μ(Q₁Q₂). It means that γ(Q₁Q₂) = ‖T₂T₁‖ ≤ ‖T₂‖·‖T₁‖ = γ(Q₁)·γ(Q₂).
Suppose now that the (E_j,E_j)_j are measurable spaces and that the Q_j are transition probabilities from E_j to E_{j+1}. Because of the associativity, the product Q₁Q₂⋯Qₙ is well defined. If all these spaces coincide and Q_i = Q, then this product will be denoted by Qⁿ.
The fact that γ is submultiplicative has important consequences.

5. Invariant measures. Convergence to a stable matrix

Definition. A transition probability Q is called scrambling if γ(Qᵏ) < 1 for some k ≥ 1. A probability π is called invariant if πQ = π.
Proposition 5.1. If Q is scrambling, then the sequence Qⁿ(x) converges to the same invariant probability π. Moreover, this probability is unique and the convergence is uniform in x.
Proof. We shall prove that the sequence Qⁿ(x) is Cauchy in norm. Let us write n = k·c(n) + r(n) where c(n) = [n/k]. Let also α = γ(Qᵏ). Then ‖Qⁿ⁺ᵐ(x) − Qⁿ(x)‖ = ‖δ_xQᵐQⁿ − δ_xQⁿ‖ = ‖(Qᵐ(x) − δ_x)Qⁿ‖ ≤ ‖Qᵐ(x) − δ_x‖·γ(Qⁿ) (by Corollary 3.2) ≤ 2γ(Qⁿ) ≤ 2[γ(Qᵏ)]^{c(n)} = 2α^{c(n)} (by Corollary 4.3) < ε if n is great enough. As M(E,E) is a Banach space, Qⁿ(x) must converge to some probability π(x). Then π(x)Q = (limₙ Qⁿ(x))Q = (limₙ δ_xQⁿ)Q = limₙ δ_xQⁿ⁺¹ (by the continuity of T) = limₙ Qⁿ⁺¹(x) = π(x). So π(x) is invariant.
Now suppose that π and π′ are both invariant. Then π = πQ = πQ² = πQ³ = ⋯, hence ‖π − π′‖ = ‖πQⁿ − π′Qⁿ‖ = ‖(π − π′)Qⁿ‖ ≤ 2γ(Qⁿ) ≤ 2α^{c(n)} → 0. Therefore ‖π − π′‖ = 0 ⇔ π = π′.
It follows that Qⁿ(x) → π where π is the unique invariant probability. Moreover we have the estimation ‖π − Qⁿ(x)‖ = ‖πQⁿ − δ_xQⁿ‖ ≤ 2γ(Qⁿ), which points out the uniformity of the convergence.
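Proposition 5.1 is exactly what power iteration on a stochastic matrix exhibits (Python sketch; the matrix is the illustrative scrambling example used above):

```python
import numpy as np

Q = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

mu = np.array([1.0, 0.0, 0.0])    # start at delta_x
for _ in range(200):
    mu = mu @ Q                    # Q^n(x) as n grows
print("limit:", mu)
print("pi Q :", mu @ Q)            # the limit is invariant: pi Q = pi
```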


Disintegration of the probabilities on product spaces

1. Regular conditioned distributions. Standard Borel spaces

Let (Ω,K,P) be a probability space. Recall the following result from the lesson Conditioning:

Proposition 3.1. If X is a real random variable (thus X : (Ω,K) → (ℝ,B(ℝ)) is measurable) then a regular version for P∘X⁻¹(·|F) exists for any sub-σ-algebra F of K.
We are interested in replacing (ℝ,B(ℝ)) with more general spaces: at least with ℝⁿ instead of ℝ.
So instead of being a real random variable, X is a measurable mapping from (Ω,K) to some measurable space (E,E).
To begin with: what happens if E ⊂ ℝ? What is the meaning of "measurable"?
Now the σ-algebra on E is the trace of B(ℝ) on E. Meaning that A ∈ E iff A = E∩B for some Borel set B.
Or, more formally, E = i⁻¹(B(ℝ)) where i : E → ℝ is the so-called canonical embedding of E into ℝ: simply i(x) = x ∀x ∈ E.
We can look of course at X as being a real random variable. Formally, replace X with Y = i∘X and clearly Y : Ω → ℝ.
Let F be a sub-σ-algebra of K. Then we know that a regular version for P∘Y⁻¹(·|F) exists. In other words there exists a transition probability Q from (Ω,F) to (ℝ,B(ℝ)) such that
(1.1) P(Y ∈ B|F)(ω) = E(1_B(Y)|F)(ω) = Q(ω,B) for almost all ω, ∀B ∈ B(ℝ)
What is wrong with this Q?
We would like to have a transition probability Q* from (Ω,F) to (E,E) such that
(1.2) P(X ∈ A|F)(ω) = E(1_A(X)|F)(ω) = Q*(ω,A) for all A ∈ E and almost all ω
If B₁ and B₂ are two Borel sets such that A = E∩B₁ = E∩B₂ (= i⁻¹(B₁) = i⁻¹(B₂)!) then P(X⁻¹(A)|F) = P(X⁻¹(i⁻¹(B₁))|F) = P((i∘X)⁻¹(B₁)|F) = P(Y⁻¹(B₁)|F) = Q(·,B₁) (a.s.) and P(X⁻¹(A)|F) = P(X⁻¹(i⁻¹(B₂))|F) = P((i∘X)⁻¹(B₂)|F) = P(Y⁻¹(B₂)|F) = Q(·,B₂) (a.s.), hence
(1.3) E∩B₁ = E∩B₂ ⇒ Q(·,B₁) = Q(·,B₂) (a.s.)

Seemingly, it makes sense to define


(1.4)
Q*(,A) = Q(,B ) if A = EB
This definition makes sense because of (1.3).
The trouble is that we are not able to infer anymore that Q*(.) is a
probability. For, if (An)n are disjoint we cannot infer that (Bn)n are disjoint,
too!
There is a happy case.
Namely, if E is a Borel set itself.
For, in that case we could take B = A since in this happy case

E = {A

E A

B()}. Indeed, A E iff A = EB for some Borel set B. But EB is itself


a Borel set. Meaning that A E iff A E and A is a Borel set.
Replacing i with some other function we arrive at the following result:

Proposition 1.1. Suppose that the measurable space (E,𝓔) has the following property: there exists a mapping i : E → ℝ such that

(1.4)  𝓔 = i⁻¹(B(ℝ))  and  i(E) ∈ B(ℝ).

Let X : Ω → E be measurable and let F be a sub-σ-algebra of K. Then X has a regular conditioned distribution with respect to F. Namely, if Q is a regular conditioned distribution of the real random variable Y := i∘X with respect to F, then

(1.5)  Q*(ω,A) := Q(ω, i(A))

is a regular conditioned distribution of X with respect to the same σ-algebra.

Proof. First we should check that (1.5) makes sense, meaning, firstly, that A ∈ 𝓔 ⇒ i(A) ∈ B(ℝ). But A ∈ 𝓔 ⇒ ∃ B ∈ B(ℝ) such that A = i⁻¹(B). So i(A) = i(i⁻¹(B)) = B∩i(E) ∈ B(ℝ).
Next we should check that A ↦ Q*(ω,A) is a probability. Let (A_n)_n be a sequence of disjoint sets from 𝓔. We claim that the sets (i(A_n))_n are disjoint, too. Indeed, the A_n are of the form i⁻¹(B_n) with B_n Borel sets. Replacing, if need be, B_n with the new Borel sets B_n∩i(E), we may assume that (B_n)_n are disjoint as well. Then the sets i(A_n) = i(i⁻¹(B_n)) = B_n∩i(E) are disjoint. It follows that

  Q*(ω, ∪_{n≥1}A_n) = Q(ω, i(∪_{n≥1}A_n)) = Q(ω, ∪_{n≥1}i(A_n)) = Σ_{n≥1} Q(ω, i(A_n)) = Σ_{n≥1} Q*(ω, A_n).

The measurability of ω ↦ Q*(ω,A) is no problem, so the only remaining thing to check is that Q*(·,A) = P(X ∈ A|F). But recall that A = i⁻¹(B) for some Borel set B, hence Q*(ω,A) = Q(ω,i(A)) = P(i(X) ∈ i(A)|F)(ω) = P(i(X) ∈ i(i⁻¹(B))|F)(ω) = P(i(X) ∈ B∩i(E)|F)(ω) = P(X ∈ i⁻¹(B∩i(E))|F)(ω) = P(X ∈ i⁻¹(B)|F)(ω) = P(X ∈ A|F)(ω).
A situation when Proposition 1.1 holds is if E is standard Borel.

Definition. A measurable space (E,𝓔) is called standard Borel if there exists an isomorphism between (E,𝓔) and (B,B(B)), where B is a Borel set of ℝ. An isomorphism is a mapping i : E → B which is one to one, onto, measurable and such that A ∈ 𝓔 ⇒ i(A) ∈ B(B). In other words, both i and i⁻¹ are measurable.

Corollary 1.2. If (E,𝓔) is standard Borel, then any random variable X : Ω → E has a regular conditioned distribution with respect to any sub-σ-algebra F of K.

Proof. Let i be an isomorphism between (E,𝓔) and (B,B(B)). The only not-that-obvious thing is that 𝓔 = i⁻¹(B(ℝ)). But A ∈ 𝓔 ⇒ i(A) ∈ B(B) ⊂ B(ℝ) ⇒ A ∈ i⁻¹(B(B)) ⊂ i⁻¹(B(ℝ)), hence 𝓔 ⊂ i⁻¹(B(ℝ)). The other inclusion means simply that i is measurable.
Example 1. Any Borel subset E of ℝ is standard Borel, but that is no big deal.

Example 2. E = (0,1)² is standard Borel.

This may be a bit surprising! Let p ≥ 2 be a counting basis (for instance p = 10 or p = 2). Then any x ∈ (0,1) can be written as x = Σ_{n≥1} d_n(x)/pⁿ, where the digits d_n(x) are integers from 0 to p−1. Imposing the condition that any x of the form x = kp⁻ⁿ be written with a finite set of digits (that is, denying the possibility of expansions of the form x = 0.c_1…c_n aaaa… with a = p−1), this expansion is unique. Now consider the mapping i : (0,1)² → (0,1) defined by

(1.6)  i(x,y) = d_1(x)/p + d_1(y)/p² + d_2(x)/p³ + d_2(y)/p⁴ + d_3(x)/p⁵ + d_3(y)/p⁶ + …

(on the odd positions the digits of x and on the even ones the digits of y). This function is one to one and measurable (since all the functions d_n are measurable). It is true that i is not onto, because in Range(i) there are no numbers z of the form z = 0.ac_2ac_4ac_6… with a = p−1, since we denied that possibility. However, the function j : (0,1) → (0,1]² defined by

(1.7)  j(z) = ( d_1(z)/p + d_3(z)/p² + d_5(z)/p³ + … ,  d_2(z)/p + d_4(z)/p² + d_6(z)/p³ + … )

has the obvious property that j(i(x,y)) = (x,y) ∀ x,y ∈ (0,1), and it is measurable. This fact ensures the measurability of i⁻¹ : B := Range(i) → (0,1)², because of the following equality:

(1.8)  (i⁻¹)⁻¹(C) = i(C) = j⁻¹(C)∩Range(i).

Indeed, z ∈ i(C) ⇒ z = i(u), u ∈ C ⇒ j(z) = j(i(u)) = u ∈ C ⇒ z ∈ j⁻¹(C)∩Range(i). Conversely, z ∈ j⁻¹(C)∩Range(i) ⇒ j(z) ∈ C and z = i(u) for some u ∈ (0,1)² ⇒ j(i(u)) ∈ C ⇒ u ∈ C, z = i(u) ⇒ z ∈ i(C). So the only problem is to check that Range(i) is a Borel set. But that is easy: its complement is the set of all numbers x with the property that, from some n on, all the odd (even) positions carry the digit a = p−1. Meaning that

  (0,1) \ Range(i) = ∪_{n≥1} (O_n ∪ E_n)

where O_n = {x ∈ (0,1) | d_j(x) = p−1 ∀ j ≥ n, j odd} and E_n = {x ∈ (0,1) | d_j(x) = p−1 ∀ j ≥ n, j even}. And all these sets are Borel sets. For instance, E_n = ∩_{j≥n, j even} {x ∈ (0,1) | d_j(x) = p−1} is the intersection of a countable family of sets, all of them being finite unions of intervals.
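As an illustration, here is a small numerical sketch of the maps i and j from Example 2, with real numbers replaced by finite base-p expansions (so everything is approximate and all helper names are ours, chosen for the sketch):

def digits(x, p=10, n=12):
    """First n base-p digits d_1(x), ..., d_n(x) of x in (0,1)."""
    ds = []
    for _ in range(n):
        x *= p
        d = int(x)
        ds.append(d)
        x -= d
    return ds

def i_map(x, y, p=10, n=12):
    """Interleave digits: odd positions from x, even positions from y, as in (1.6)."""
    dx, dy = digits(x, p, n), digits(y, p, n)
    z = 0.0
    for k in range(n):
        z += dx[k] / p ** (2 * k + 1)   # positions 1, 3, 5, ...
        z += dy[k] / p ** (2 * k + 2)   # positions 2, 4, 6, ...
    return z

def j_map(z, p=10, n=24):
    """De-interleave, as in (1.7): j(i(x, y)) recovers (x, y) up to truncation error."""
    dz = digits(z, p, n)
    x = sum(dz[2 * k] / p ** (k + 1) for k in range(n // 2))
    y = sum(dz[2 * k + 1] / p ** (k + 1) for k in range(n // 2))
    return x, y

print(i_map(0.25, 0.375))          # 0.235705...: digits 2,5,0,... and 3,7,5,... interleaved
print(j_map(i_map(0.25, 0.375)))   # approximately (0.25, 0.375)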
This phenomenon is more general. Namely:

Proposition 1.3. If (E_j,𝓔_j), j = 1,2, are standard Borel spaces, then (E_1×E_2, 𝓔_1⊗𝓔_2) is standard Borel, too.

Proof. Let B_j, j = 1,2, be Borel sets on the line isomorphic with E_j and let f_j : E_j → B_j be the isomorphisms. Then f = (f_1,f_2) : E_1×E_2 → B_1×B_2 is an isomorphism, too. Let then i be the canonical embedding of B_1×B_2 into ℝ², let h : ℝ² → (0,1)² be an isomorphism (for instance h(x,y) = (h₀(x),h₀(y)) with h₀(x) = e⁻ˣ/(1+e⁻ˣ), the usual logistic-type function) and let φ : (0,1)² → Range(φ) ⊂ (0,1) be the isomorphism from Example 2. The composition ψ := φ∘h∘i∘f is then an isomorphism from E_1×E_2 to Range(ψ).

2. The disintegration of a probability on a product of two spaces

Let (E_j,𝓔_j) be measurable spaces. Let X = (X_1,X_2) : Ω → E_1×E_2 be measurable.

Proposition 2.1. Suppose that the second space (E_2,𝓔_2) is standard Borel. Let μ = P∘X_1⁻¹ and let Q be a transition probability from E_1 to E_2 such that P(X_2 ∈ B_2|X_1)(ω) = Q(X_1(ω),B_2) (a.s.) for all B_2 ∈ 𝓔_2. Then P∘X⁻¹ = μ⊗Q. Or, to serve as a rule of thumb:

(2.1)  P∘(X_1,X_2)⁻¹ = P∘X_1⁻¹ ⊗ (P∘X_2⁻¹|X_1)  (the regular version)

Proof. Recall from the lesson Conditioning that μ⊗Q is the probability measure on the product space with the property that

(2.2)  ∫ f d(μ⊗Q) = ∫∫ f(x,y) Q(x,dy) dμ(x).

Recall also that P(X_2 ∈ B_2|X_1) means actually P(X_2 ∈ B_2|F), where F = σ(X_1) := X_1⁻¹(𝓔_1). Then X_2 has a regular conditioned distribution of the form P(X_2 ∈ B_2|F) = Q*(ω,B_2), where Q* is a transition probability from (Ω,σ(X_1)) to (E_2,𝓔_2), because of Corollary 1.2. The fact that Q* is of the form Q*(ω,B) = Q(X_1(ω),B) for some other transition probability Q comes from the universality property studied at the lesson Conditioning.
Now all we have to do is to check that the equality

(2.3)  Ef(X) = ∫ f d(μ⊗Q)

holds for every measurable bounded f.

Step 1. Let f be of the special form f(x,y) = f_1(x)f_2(y). Then Ef(X) = E(f_1(X_1)f_2(X_2)) = E(E(f_1(X_1)f_2(X_2)|X_1)) (by Property 3 from Conditioning) = E(f_1(X_1)E(f_2(X_2)|X_1)) (by Property 9) = E(f_1(X_1)∫f_2(y)Q(X_1,dy)) (this is the transport formula, Proposition 3.2 from Conditioning) = ∫ f_1(x)(∫f_2(y)Q(x,dy)) dP∘X_1⁻¹(x) (now this is the usual transport formula) = ∫ f_1(x)(∫f_2(y)Q(x,dy)) dμ(x) = ∫∫ f_1(x)f_2(y) Q(x,dy) dμ(x) = ∫∫ f(x,y) Q(x,dy) dμ(x) = ∫ f d(μ⊗Q) (by (2.2)!).

So our claim holds in this case.

Step 2. Let f = 1_C, C ∈ 𝓔_1⊗𝓔_2. We want to check (2.3) in this case. Let 𝓒 = {C ∈ 𝓔_1⊗𝓔_2 | (2.3) holds for f = 1_C}. According to the first step, 𝓒 contains all the rectangles C = B_1×B_2, B_j ∈ 𝓔_j. On the other hand, 𝓒 is a λ-system (you check that, it is easy!), hence, by the π-λ theorem, 𝓒 contains the σ-algebra generated by the rectangles. Well, this is exactly 𝓔_1⊗𝓔_2, because the intersection of two rectangles is a rectangle itself (the rectangles form a π-system).

Step 3. f = Σ_{i∈I} c_i 1_{C_i}, I finite (that is, f is simple). Then Ef(X) = Σ_{i∈I} c_i E(1_{C_i}(X)) = Σ_{i∈I} c_i ∫ 1_{C_i} d(μ⊗Q) = ∫ f d(μ⊗Q).

Step 4. f ≥ 0. Apply Beppo Levi.

Step 5. f = f⁺ − f⁻.

Corollary 2.2. The disintegration theorem. Let (E_j,𝓔_j) be measurable spaces. Let P be a probability on the product space (E_1×E_2, 𝓔_1⊗𝓔_2). Suppose that the second space (E_2,𝓔_2) is standard Borel. Then P disintegrates as P = μ⊗Q, where μ is a probability on E_1 and Q is a transition probability from E_1 to E_2.

Proof. Consider the probability space (E_1×E_2, 𝓔_1⊗𝓔_2, P) and the random variables X_1 = pr_1 (the projection on E_1), X_2 = pr_2 (the projection on E_2). Then P = P∘X⁻¹. Apply Proposition 2.1.

Corollary 2.3. Special cases. Let (E_j,𝓔_j) be standard Borel spaces. Let P be a probability on the product space (E_1×E_2, 𝓔_1⊗𝓔_2). Then P disintegrates as P = μ⊗Q, where μ is a probability on E_1 and Q is a transition probability from E_1 to E_2. As a consequence, any probability in the plane disintegrates.

3. The disintegration of a probability on a product of n spaces

Let now (E_j,𝓔_j)_{1≤j≤n} be n standard Borel spaces and let X = (X_j)_{1≤j≤n} be a random vector X : Ω → E, where E is the product space E = E_1×E_2×…×E_n endowed with the product σ-algebra 𝓔 = 𝓔_1⊗𝓔_2⊗…⊗𝓔_n. Then E is standard Borel itself, according to Proposition 1.3 (induction!). If we think of E as being the product of the two spaces E_1×E_2×…×E_{n−1} and E_n and apply Proposition 2.1, we may write

(3.1)  P∘X⁻¹ = P∘(X_1,…,X_{n−1})⁻¹ ⊗ Q_{n−1}

where Q_{n−1} is a transition probability from E_1×E_2×…×E_{n−1} to E_n which characterizes the conditioned distribution of X_n given (X_1,…,X_{n−1}). Precisely,

(3.2)  P(X_n ∈ B_n|X_1,X_2,…,X_{n−1}) = Q_{n−1}(X_1,…,X_{n−1};B_n) (a.s.) ∀ B_n ∈ 𝓔_n.

So we have, applying (2.1), the equality

(3.3)  P∘X⁻¹ = P∘(X_1,…,X_{n−1})⁻¹ ⊗ (P∘X_n⁻¹|(X_1,…,X_{n−1})).

Repeating this we get the rule of thumb

(3.4)  P∘X⁻¹ = P∘X_1⁻¹ ⊗ (P∘X_2⁻¹|X_1) ⊗ … ⊗ (P∘X_n⁻¹|(X_1,…,X_{n−1}))

where one takes the regular versions for the conditioned distributions. If we denote by Q_i these conditioned distributions (the precise meaning is: Q_i(X_1,…,X_i;B_{i+1}) = P(X_{i+1} ∈ B_{i+1}|X_1,X_2,…,X_i) (a.s.), i = 1,2,…,n−1) and we denote by μ the distribution of X_1, then one can write the not very precise relation (3.4) as

(3.5)  P∘X⁻¹ = μ⊗Q_1⊗…⊗Q_{n−1}.

This product is to be understood as being computed in the prescribed order; we have no associativity rule yet.

If all the spaces are discrete (meaning that the E_j are at most countable and 𝓔_j = 𝒫(E_j), an obvious standard Borel space), then (3.4) says nothing more than the well known multiplication rule

(3.6)  P(X_1=x_1,…,X_n=x_n) = P(X_1=x_1)P(X_2=x_2|X_1=x_1)…P(X_n=x_n|X_1=x_1,…,X_{n−1}=x_{n−1})

(of course, if the right member makes sense) and (3.5) says the same thing using transition probabilities:

(3.7)  P(X_1=x_1,…,X_n=x_n) = p(x_1)q_1(x_1;x_2)q_2(x_1,x_2;x_3)…q_{n−1}(x_1,x_2,…,x_{n−1};x_n)

where p(x_1) = μ({x_1}) and q_i(x_1,x_2,…,x_i;x_{i+1}) = Q_i(x_1,x_2,…,x_i;{x_{i+1}}) = P(X_{i+1}=x_{i+1}|X_1=x_1,…,X_i=x_i).
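For the discrete case, the multiplication rule (3.7) is easy to compute directly. A small sketch with made-up numbers (the dictionaries stand for p, q_1, q_2):

p = {0: 0.5, 1: 0.5}                                   # law of X_1
q1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}        # q1(x1; x2)
q2 = {(0, 0): {0: 1.0}, (0, 1): {0: 0.5, 1: 0.5},      # q2(x1, x2; x3)
      (1, 0): {1: 1.0}, (1, 1): {0: 0.3, 1: 0.7}}

def joint(x1, x2, x3):
    """P(X1=x1, X2=x2, X3=x3) = p(x1) q1(x1;x2) q2(x1,x2;x3), as in (3.7)."""
    return p[x1] * q1[x1].get(x2, 0.0) * q2[(x1, x2)].get(x3, 0.0)

# The joint probabilities sum to 1, as they must:
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)           # 1.0 (up to rounding)
print(joint(0, 1, 1))  # 0.5 * 0.1 * 0.5 = 0.025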
We want to define the associativity of the product (3.5). To do that, the first step is to define the precise meaning of Q_1⊗Q_2. So, now n = 3. We can look at the product E_1×E_2×E_3 as being in fact E_1×(E_2×E_3). If we apply Proposition 2.1 for the standard Borel space E_2×E_3 and Proposition 2.1 from the lesson Transition probabilities, we obtain

(3.8)  P∘X⁻¹ = μ⊗Q ⇔ Ef(X) = ∫∫ f(x,y,z) Q(x,d(y,z)) dμ(x) if f is measurable and bounded,

where Q is a transition probability from E_1 to E_2×E_3 with the property that

(3.9)  P((X_2,X_3) ∈ C|X_1) = Q(X_1,C) (a.s.) ∀ C ∈ 𝓔_2⊗𝓔_3.

Comparing (3.8) to (3.5) written as

(3.10)  P∘X⁻¹ = (μ⊗Q_1)⊗Q_2 ⇔ Ef(X) = ∫∫∫ f(x,y,z) Q_2(x,y;dz) Q_1(x;dy) dμ(x)  (same f),

which should hold for any μ (ε_x included), we see that we may define Q, the product of Q_1 with Q_2, by the relation

(3.11)  (Q_1⊗Q_2)(x,C) = ∫∫ 1_C(y,z) Q_2(x,y;dz) Q_1(x,dy).

This product makes sense for any transition probabilities Q_1 from E_1 to E_2 and Q_2 from E_1×E_2 to E_3. The result is a transition probability from E_1 to E_2×E_3. An elementary calculation points out that Q_1⊗Q_2(x,·) is indeed a probability on E_2×E_3, since

  (Q_1⊗Q_2)(x;E_2×E_3) = ∫∫ 1_{E_2×E_3}(y,z) Q_2(x,y;dz) Q_1(x,dy) = ∫ Q_2(x,y;E_3) Q_1(x,dy) = 1.

Example. In the discrete case (3.11) becomes

(3.12)  (Q_1⊗Q_2)(x;y,z) = q_1(x;y)q_2(x,y;z).

We arrived at the following result:

Proposition 3.1. The associativity. If μ is a probability on E_1, Q_1 is a transition probability from E_1 to E_2 and Q_2 is a transition probability from E_1×E_2 to E_3, then

(3.13)  (μ⊗Q_1)⊗Q_2 = μ⊗(Q_1⊗Q_2)

where the product Q_1⊗Q_2 is defined by (3.11). Moreover, if Q_3 is another transition probability from E_1×E_2×E_3 to E_4, then

(3.14)  (Q_1⊗Q_2)⊗Q_3 = Q_1⊗(Q_2⊗Q_3).

Proof. As (3.13) was already proven (the very definition of the product ensures the first associativity), we shall prove (3.14). Both sides are transition probabilities from E_1 to E_2×E_3×E_4. Let f : E_2×E_3×E_4 → ℝ be measurable and bounded and let Q = Q_1⊗Q_2. This is a transition probability from E_1 to E_2×E_3. So

  ∫ f d[(Q_1⊗Q_2)⊗Q_3](x) = ∫ f d[Q⊗Q_3](x) = ∫∫ f(y,z) Q_3(x,y;dz) Q(x,dy)

(according to the very definition! Notice that here x ∈ E_1, y ∈ E_2×E_3 and z ∈ E_4)

  = ∫∫∫ f(y_1,y_2,z) Q_3(x,y_1,y_2;dz) Q_2(x,y_1;dy_2) Q_1(x,dy_1).

On the other hand, let Q* = Q_2⊗Q_3. This is a transition probability from E_1×E_2 to E_3×E_4. Therefore

  ∫ f d[Q_1⊗(Q_2⊗Q_3)](x) = ∫ f d[Q_1⊗Q*](x) = ∫∫ f(y,z) Q*(x,y;dz) Q_1(x,dy)

(here x ∈ E_1, y ∈ E_2, z ∈ E_3×E_4)

  = ∫∫∫ f(y,z_1,z_2) Q_3(x,y,z_1;dz_2) Q_2(x,y;dz_1) Q_1(x,dy).

It is the same integral. With more natural notations, both of them can be written as

(3.15)  ∫ f d[Q_1⊗Q_2⊗Q_3](x_1) = ∫∫∫ f(x_2,x_3,x_4) Q_3(x_1,x_2,x_3;dx_4) Q_2(x_1,x_2;dx_3) Q_1(x_1,dx_2).

As in the lesson about transition probabilities, we can define the usual product between Q_1 and Q_2 by

(3.16)  Q_1Q_2(x,B_3) := (Q_1⊗Q_2)(x,E_2×B_3) = ∫∫ 1_{B_3}(z) Q_2(x,y;dz) Q_1(x,dy) = ∫ Q_2(x,y;B_3) Q_1(x,dy).

This is a transition probability from E_1 to E_3.

Proposition 3.2. The usual product is associative, too. Namely, the following equalities hold:

(3.17)  (μ⊗Q_1)Q_2 = μ(Q_1Q_2)
(3.18)  (Q_1Q_2)Q_3 = Q_1(Q_2Q_3)

Proof. [(μ⊗Q_1)Q_2](B_3) = [(μ⊗Q_1)⊗Q_2](E_2×B_3)? No: by the definition of the product of a measure with a kernel, [(μ⊗Q_1)Q_2](B_3) = ∫ Q_2(x_1,x_2;B_3) d(μ⊗Q_1)(x_1,x_2) = ∫∫ Q_2(x_1,x_2;B_3) Q_1(x_1,dx_2) dμ(x_1), while [μ(Q_1Q_2)](B_3) = ∫ Q_1Q_2(x_1,B_3) dμ(x_1); applying (3.16), one sees that the result is the same.

As for (3.18), the proof is the same: [(Q_1Q_2)Q_3](x,B_4) = [(Q_1⊗Q_2)Q_3](x,E_3×B_4) = Q_1⊗Q_2⊗Q_3(x,E_2×E_3×B_4) and [Q_1(Q_2Q_3)](x,B_4) = [Q_1⊗(Q_2Q_3)](x,E_2×B_4) = Q_1⊗Q_2⊗Q_3(x,E_2×E_3×B_4).
Here is the meaning of the usual product:

Proposition 3.3. Using the above notations,

(3.19)  P(X_3 ∈ B_3|X_1) = Q_1Q_2(X_1,B_3)  and  P∘X_3⁻¹ = μQ_1Q_2.

Proof. Using (3.9) one gets P(X_3 ∈ B_3|X_1) = P((X_2,X_3) ∈ E_2×B_3|X_1) = Q(X_1,E_2×B_3) = (Q_1⊗Q_2)(X_1,E_2×B_3) = Q_1Q_2(X_1,B_3). Using the transport formula, we see that the equality

(3.20)  E(f(X_3)|X_1) = ∫ f dQ_1Q_2(X_1) := ∫∫ f(z) Q_2(X_1,y;dz) Q_1(X_1,dy)

should hold for any bounded measurable f : E_3 → ℝ. Then E(f(X_3)) = E(E(f(X_3)|X_1)) = E(∫ f dQ_1Q_2(X_1)) = ∫∫∫ f(z) Q_2(X_1,y;dz) Q_1(X_1,dy) dP = ∫∫∫ f(z) Q_2(x_1,y;dz) Q_1(x_1,dy) dμ(x_1). As this equality holds for indicator functions, one gets P(X_3 ∈ B_3) = E1_{B_3}(X_3) = ∫∫ Q_2(x_1,y;B_3) Q_1(x_1,dy) dμ(x_1) = (μQ_1)Q_2(B_3) = μ(Q_1Q_2)(B_3) by associativity.

Example. In the discrete case one gets Q_1Q_2(x,{z}) = Σ_{y∈E_2} q_1(x,y)q_2(x,y;z).
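A numerical sketch of this discrete formula (shapes and values illustrative): the usual product of a kernel q_1 from E_1 to E_2 and a kernel q_2 from E_1×E_2 to E_3 is again a stochastic matrix, from E_1 to E_3.

import numpy as np

n1, n2, n3 = 2, 3, 2
rng = np.random.default_rng(0)

q1 = rng.random((n1, n2)); q1 /= q1.sum(axis=1, keepdims=True)       # E1 -> E2
q2 = rng.random((n1, n2, n3)); q2 /= q2.sum(axis=2, keepdims=True)   # E1 x E2 -> E3

q1q2 = np.einsum('xy,xyz->xz', q1, q2)   # sum over y, i.e. Q1Q2(x,{z})

print(q1q2.sum(axis=1))   # each row sums to 1: a transition probability E1 -> E3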

Here are two generalizations of the above discussion:

Proposition 3.4. Let f : E_1×…×E_n×E_{n+1} → ℝ be bounded and measurable. Then

(3.21)  E(f(X_1,…,X_{n+1})|X_1,…,X_n) = ∫ f(X_1,…,X_n,x_{n+1}) Q_n(X_1,…,X_n;dx_{n+1}) =: (Q_nf)(X_1,…,X_n).

Proof. Step 1. f(x_1,…,x_{n+1}) = f_1(x_1)…f_n(x_n)f_{n+1}(x_{n+1}). Then E(f(X_1,…,X_{n+1})|X_1,…,X_n) = f_1(X_1)…f_n(X_n)E(f_{n+1}(X_{n+1})|X_1,…,X_n) = f_1(X_1)…f_n(X_n)∫f_{n+1}(x_{n+1}) Q_n(X_1,…,X_n;dx_{n+1}) = ∫ f(X_1,…,X_n,x_{n+1}) Q_n(X_1,…,X_n;dx_{n+1}); so (3.21) holds. Step 2. f = 1_C, C ∈ 𝓔_1⊗…⊗𝓔_{n+1}. The set of those C for which (3.21) holds is a λ-system which contains the rectangles B_1×…×B_{n+1}. Step 3. f is simple. Etc.

Proposition 3.5. Let (E_n,𝓔_n)_{n≥1} be a sequence of standard Borel spaces and let X = (X_n)_{n≥1} be a sequence of random variables X_n : Ω → E_n. Let μ = P∘X_1⁻¹. Then there exists a sequence of transition probabilities from E_1×E_2×…×E_n to E_{n+1}, denoted by Q_n, such that

(3.22)  P∘(X_1,X_2,…,X_n)⁻¹ = μ⊗Q_1⊗Q_2⊗…⊗Q_{n−1}.

According to Proposition 3.1 (the associativity), the right-hand term of (3.22) is well defined. Moreover,

(3.23)  P∘X_n⁻¹ = μQ_1Q_2…Q_{n−1}

and

(3.24)  P(X_{n+k} ∈ B_{n+k}|X_1,X_2,…,X_n) = (Q_nQ_{n+1}…Q_{n+k−1})(X_1,…,X_n;B_{n+k}).

Proof. Induction. The only subtlety is in (3.24). For k = 1, P(X_{n+1} ∈ B_{n+1}|X_1,X_2,…,X_n) = Q_n(X_1,X_2,…,X_n;B_{n+1}) by the very definition of Q_n. For k = 2, P(X_{n+2} ∈ B_{n+2}|X_1,X_2,…,X_n) = E(1_{B_{n+2}}(X_{n+2})|X_1,X_2,…,X_n) = E(E(1_{B_{n+2}}(X_{n+2})|X_1,X_2,…,X_n,X_{n+1})|X_1,X_2,…,X_n) = E(Q_{n+1}(X_1,…,X_{n+1};B_{n+2})|X_1,X_2,…,X_n) = ∫ Q_{n+1}(X_1,…,X_n,x_{n+1};B_{n+2}) Q_n(X_1,…,X_n;dx_{n+1}) = (Q_nQ_{n+1})(X_1,…,X_n;B_{n+2}), hence (3.24) holds in this case, too. Apply Proposition 3.4 many times.


The Normal Distribution

1. One-dimensional normal distribution

Let us recall some elementary facts.

Definition. Let X be a real random variable. We say that X is standard normally distributed if P∘X⁻¹ = ρ_{0,1}·λ, where λ is the Lebesgue measure on the real line and

  ρ_{0,1}(x) = (1/√(2π)) e^{−x²/2}.

We denote that by X ~ N(0,1). The distribution function of N(0,1) is denoted by Φ. Thus


(1.1)  Φ(x) = P(X ≤ x) = N(0,1)((−∞,x]) = (1/√(2π)) ∫_{−∞}^{x} e^{−u²/2} du.

There exists no explicit formula for Φ, but it can be computed numerically. Due to the symmetry of the density ρ_{0,1}, it is easy to see that Φ(−x) = 1 − Φ(x) and Φ(0) = 0.5; therefore for any x > 0 we get

  Φ(x) = 0.5 + (1/√(2π)) ∫_0^x e^{−u²/2} du

and the last integral can be easily approximated by Simpson's formula, for instance.
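A sketch of that remark: Φ(x) for x > 0, approximated by applying composite Simpson's rule to ∫_0^x e^{−u²/2}du.

import math

def phi(x, n=200):
    """Standard normal CDF for x > 0 via composite Simpson's rule (n even)."""
    f = lambda u: math.exp(-u * u / 2)
    h = x / n
    s = f(0) + f(x)
    s += 4 * sum(f((2 * k + 1) * h) for k in range(n // 2))      # odd nodes
    s += 2 * sum(f(2 * k * h) for k in range(1, n // 2))         # even interior nodes
    return 0.5 + s * h / (3 * math.sqrt(2 * math.pi))

print(phi(1.96))   # about 0.9750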


The characteristic function of a standard normal r.v. X is X(t) = EeitX :=

t2
2

N(0,1)(t)= e , its expectation is EX = -iX(0)= 0 , its second order moment is EX2


= - X(0) = 1, hence the variance V(X) = EX2 (EX)2 = 1. Thats why one also
reads N(0,1) as the normal distribution with expectation 0 and variance 1
Let now Y N(0,1) , >0 and . Let X = Y + . Then the distribution

function of X is FX(x)= P (X x) = P(Y

) = (

) . Thus the density

of X is
X(x) = FX(x) =

1
) =

0,1(

1
) =
e
2

( x )2
2 2

. We denote this

density with , 2 and the distribution of X with N(,2). Due to obvious reasons we
read this distribution as the normal with expectation and dispersion . Its
characteristic function is
(1.2)

2.

X(t) = Ee

it(+Y)

itX

= Ee

it

= e Ee

itY

= e

it

t 2 2
2

Multidimensional normal distribution

Let X : Ω → ℝⁿ be a random vector. The components of X will be denoted by X_j, 1 ≤ j ≤ n. The vector will be considered a column one. Its transpose will be denoted by X′. So, if t ∈ ℝⁿ is a column vector, t′ will be a row one with the same components. With these notations the scalar product <s,t> becomes s′t. The euclidean norm of t will be denoted by |t|. Thus |t| = √(Σ_{j=1}^n t_j²).

We say that X ∈ L^p if all the components X_j ∈ L^p, 1 ≤ p ≤ ∞.

The expectation EX is the vector (EX_j)_{1≤j≤n}. This vector has the following optimality property:

Proposition 2.1. Let us consider the function f : ℝⁿ → ℝ given by

(2.1)  f(t) = ‖X − t‖₂² := Σ_{j=1}^n E(X_j − t_j)².

Then f(t) ≥ f(EX). In other words, EX is the best constant which approximates X if the optimum criterion is L².

Proof. We see that f(t) = Σ_{j=1}^n (t_j² − 2t_jEX_j + EX_j²) = Σ_{j=1}^n (t_j − EX_j)² + Σ_{j=1}^n Var(X_j).
The analog of the variance is the matrix C = Cov(X) with the entries c_{i,j} = Cov(X_i,X_j), where

(2.2)  Cov(X_i,X_j) = EX_iX_j − EX_iEX_j.

The reason is:

Proposition 2.2. Let X be a random vector from L², C its covariance matrix and t ∈ ℝⁿ. Then

(2.3)  Var(t′X) = t′Ct.

Proof. Var(t′X) = E(t′X)² − (E(t′X))² = Σ_{1≤i,j≤n} t_it_jE(X_iX_j) − Σ_{1≤i,j≤n} t_it_jE(X_i)E(X_j) = Σ_{1≤i,j≤n} c_{i,j}t_it_j = t′Ct.

Remark 2.1. Any covariance matrix C is symmetric and non-negatively defined, since according to (2.3), t′Ct ≥ 0 ∀ t ∈ ℝⁿ. We shall see that for any non-negatively defined matrix C there exists a random vector X having C as covariance matrix.

Remark 2.2. We know that, if X is a random variable, then Var(a + bX) = b²Var(X). The n-dimensional analog is

  Cov(a + AX) = ACov(X)A′.

Indeed, Cov(a + AX) = Cov(AX) (the constants don't matter) and (Cov(AX))_{i,j} = E((AX)_i(AX)_j) − E((AX)_i)E((AX)_j) = E((Σ_r a_{i,r}X_r)(Σ_s a_{j,s}X_s)) − (Σ_r a_{i,r}EX_r)(Σ_s a_{j,s}EX_s) = Σ_{1≤r,s≤n} a_{i,r}a_{j,s}(E(X_rX_s) − E(X_r)E(X_s)) = Σ_{1≤r,s≤n} a_{i,r}a_{j,s}(Cov(X))_{r,s} = (ACov(X)A′)_{i,j}.

Now we are in position to define the normally distributed vectors.

Definition. Let X_1,…,X_n be i.i.d. and standard normal. Then we say that X ~ N(0,I_n). Here 0 is the vector 0 ∈ ℝⁿ.

Remark that X ~ N(0,I_n) ⇔ P∘X⁻¹ = ⊗_{1≤j≤n} N(0,1) = (ρ_{0,1}⊗ρ_{0,1}⊗…⊗ρ_{0,1})·λⁿ, hence the density of X is

(2.4)  ρ_{0,I_n}(x) = (1/√(2π))ⁿ e^{−(x_1²+x_2²+…+x_n²)/2} = (2π)^{−n/2} e^{−|x|²/2}.

The characteristic function of N(0,I_n) is

(2.5)  φ_{N(0,I_n)}(t) = Ee^{it′X} = Π_{j=1}^n Ee^{it_jX_j} (due to the independence) = e^{−|t|²/2}.

Remark 2.3. Due to the unicity theorem for characteristic functions, (2.5) may be considered an alternative definition of N(0,I_n): X ~ N(0,I_n) ⇔ φ_X(t) = e^{−|t|²/2}.

Let now Y ~ N(0,I_k) and let A be an n×k matrix. Let μ ∈ ℝⁿ. Consider the vector

(2.6)  X = μ + AY.

Its expectation is μ and, applying Remark 2.2, its covariance is C = Cov(X) = ACov(Y)A′ = AA′ (since clearly Cov(Y) = I_k). Its characteristic function is

  φ_X(t) = Ee^{it′X} = Ee^{it′(μ+AY)} = e^{it′μ}Ee^{it′AY} = e^{it′μ}Ee^{i(A′t)′Y} = e^{it′μ}φ_Y(A′t) = e^{it′μ}e^{−(A′t)′(A′t)/2} = e^{it′μ − t′AA′t/2} = e^{it′μ − t′Ct/2}.

The first interesting fact is that φ_X depends on C rather than on A. The second one is that C can be any non-negatively defined n×n matrix. Indeed, as one knows from linear algebra, any non-negatively defined matrix C can be written as C = ODO′, where O is an orthogonal matrix and D a diagonal one, with all the elements d_{j,j} non-negative. Let A = OΔO′ with Δ the diagonal matrix with δ_{j,j} = √(d_{j,j}). Then Δ² = D, hence AA′ = OΔO′(OΔO′)′ = OΔ(O′O)ΔO′ = OΔΔO′ = ODO′ = C. That is why the following definition makes sense:
Definition. Let X be an n-dimensional random vector. We say that X is normally distributed with expectation μ and covariance C (and denote that by X ~ N(μ,C)!) if its characteristic function is

(2.7)  φ_X(t) = φ_{N(μ,C)}(t) = e^{it′μ − t′Ct/2}.

Remark 2.4. Due to the above considerations, an equivalent definition would be: X ~ N(μ,C) iff X can be written as X = μ + AY for some n×k matrix A such that C = AA′ and with Y ~ N(0,I_k).
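Remark 2.4 is also a sampling recipe; here is a sketch (values illustrative): factor C = AA′, here through the eigendecomposition C = ODO′ used above, and set X = μ + AY with Y ~ N(0,I_k).

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])          # symmetric, non-negatively defined

w, O = np.linalg.eigh(C)            # C = O diag(w) O'
A = O @ np.diag(np.sqrt(w)) @ O.T   # then AA' = C

Y = rng.standard_normal((2, 100_000))
X = mu[:, None] + A @ Y

print(X.mean(axis=1))   # close to mu
print(np.cov(X))        # close to C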
A normal vector is not always absolutely continuous. But if det(C) > 0, this is indeed the case: it has a density.

Proposition 2.3. Suppose that the covariance C = Cov(X) is invertible and X ~ N(μ,C). Then X has the density

(2.8)  ρ_{μ,C}(x) = det(C)^{−1/2}(2π)^{−n/2} e^{−(x−μ)′C⁻¹(x−μ)/2}.

Proof. Let A be such that X = μ + AY, C = AA′. We choose A to be square and invertible. Then det(C) = det(AA′) = det(A)det(A′) = det²(A). Let f : ℝⁿ → ℝ be measurable and bounded. Then

  Ef(X) = Ef(μ+AY) = ∫ f(μ+AY) dP = ∫ f(μ+Ay) dP∘Y⁻¹(y) = ∫ f(μ+Ay)(2π)^{−n/2} e^{−|y|²/2} dλⁿ(y).

Let us make the bijective change of variable x = μ+Ay ⇔ y = A⁻¹(x−μ). Computing the Jacobian, one sees that dλⁿ(y) = |D(y)/D(x)| dλⁿ(x) = |det(A)|⁻¹ dλⁿ(x). It means that

  Ef(X) = ∫ f(x)(2π)^{−n/2} e^{−|A⁻¹(x−μ)|²/2} |det(A)|⁻¹ dλⁿ(x)
        = det(C)^{−1/2} ∫ f(x)(2π)^{−n/2} e^{−(x−μ)′A′⁻¹A⁻¹(x−μ)/2} dλⁿ(x)  (as det C = det(A)²)
        = det(C)^{−1/2} ∫ f(x)(2π)^{−n/2} e^{−(x−μ)′C⁻¹(x−μ)/2} dλⁿ(x)  (as A′⁻¹A⁻¹ = (AA′)⁻¹)
        = ∫ f ρ_{μ,C}(x) dλⁿ(x).

On the other hand, by the transport formula, Ef(X) = ∫ f dP∘X⁻¹. It means that P∘X⁻¹ = ρ_{μ,C}·λⁿ.

3. Properties of the normal distribution

Property 3.1. Invariance with respect to affine transformations. If X is normally distributed, then a + AX is normally distributed, too. Precisely, if X is n-dimensional, A is an m×n matrix and a ∈ ℝᵐ, then

(3.1)  X ~ N(μ,C) ⇒ a + AX ~ N(a+Aμ, ACA′).

Proof. Let Y ~ N(0,I_k) and B be an n×k matrix such that BB′ = C and X = μ + BY. It means that Z = a + AX = a + A(μ+BY) = a + Aμ + ABY. By Remark 2.4, Z ~ N(a+Aμ, AB(AB)′) and AB(AB)′ = ABB′A′ = ACA′.

Corollary 3.2.
(i) X ~ N(μ,C), t ∈ ℝⁿ ⇒ t′X ~ N(t′μ, t′Ct). Any linear combination of the components of a normal random vector is also normal.
(ii) X ~ N(μ,C), 1 ≤ j ≤ n ⇒ X_j ~ N(μ_j, c_{j,j}). The components of a normal random vector are also normal.
(iii) Let X ~ N(μ,C) and let σ ∈ S_n be a permutation. Let X(σ) be defined by (X(σ))_j = X_{σ(j)}. Then X(σ) is also normally distributed. By permuting the components of a normal random vector we get another normal random vector.
(iv) Let X ~ N(μ,C) and J ⊂ {1,2,…,n}. Let X_J be the vector with |J| components obtained from X by deleting the components j ∉ J. Then X_J ~ N(μ_J, C_J), where μ_J is the vector obtained from μ by deleting the components j ∉ J and C_J is the matrix obtained from C by deleting the entries c_{i,j} with (i,j) ∉ J×J. Deleting components of a normal random vector preserves the normality.

Proof. All these facts are simple consequences of (3.1): (i) is the case m = 1, a = 0; (ii) is a particular case of (i) for t = e_j = (0,…,0,1,0,…,0)′ (here 1 is on the j-th position); (iii) is the particular case when A = A_σ is a permutation matrix, namely a_{i,j} = 0 iff i ≠ σ(j) and a_{i,j} = 1 iff i = σ(j). Finally, (iv) is the particular case when A is a deleting matrix, namely a |J|×n matrix defined as follows: suppose that |J| = k and that J = {j(1)<j(2)<…<j(k)}. Then a_{1,j(1)} = a_{2,j(2)} = … = a_{k,j(k)} = 1 and a_{r,s} = 0 elsewhere. The reader is invited to check the details.
It is interesting that (i) has a converse.

Property 3.2. Let X be an n-dimensional random vector. Suppose that t′X is normal for any t ∈ ℝⁿ. Then X is normal itself. If any linear combination of the components of a vector is normal, then the vector is normal itself.

Proof. If t = e_j then t′X = X_j. According to our assumptions, X_j is normal ∀ 1 ≤ j ≤ n. It follows that X_j ∈ L² ∀ j ⇒ X ∈ L² ⇒ X_iX_j ∈ L¹ ∀ i,j. Let μ = EX and C = Cov(X). Then Et′X = t′EX = t′μ and Var(t′X) = t′Ct (by (2.3)). It follows that t′X ~ N(t′μ, t′Ct). By (1.2) its characteristic function is

  φ_{t′X}(s) = Ee^{is(t′X)} = e^{is(t′μ) − s²t′Ct/2}.

Replacing s with 1 we get φ_{t′X}(1) = Ee^{i(t′X)} = φ_X(t) = e^{it′μ − t′Ct/2}. But according to (2.7), this is the characteristic function of a normal distribution.
Maybe the most important property is:

Property 3.3. In a normal random vector, non-correlation implies independence. The precise setting is the following: let X be an n-dimensional normal random vector. Let J ⊂ {1,2,…,n}. Suppose that ∀ i ∈ J, j ∈ Jᶜ, X_i and X_j are not correlated, i.e. r(X_i,X_j) = 0. Then X_J is independent of X_{Jᶜ}.

Proof. Due to (iii) from Corollary 3.2, we may assume that J = {1,2,…,k}, hence Jᶜ = {k+1,…,n}. If i ∈ J, j ∈ Jᶜ, then Cov(X_i,X_j) = r(X_i,X_j)σ(X_i)σ(X_j) = 0. Let Y = X_J and Z = X_{Jᶜ}. We can write then X = (Y,Z). From (iv), Corollary 3.2, we know that Y and Z are normally distributed: the first one is Y ~ N(μ_J,C_J) and Z ~ N(μ_K,C_K) with K = Jᶜ. Moreover, as i ∈ J, j ∈ K ⇒ Cov(X_i,X_j) = 0, it follows that C has the block form

  C = ( C_J  0
        0    C_K ).

Let t ∈ ℝⁿ. Write t = (t_J,t_K). It is easy to see that t′Ct = t_J′C_Jt_J + t_K′C_Kt_K. From (2.7) it follows that

  φ_X(t) = e^{it′μ − t′Ct/2} = e^{it_J′μ_J − t_J′C_Jt_J/2} e^{it_K′μ_K − t_K′C_Kt_K/2}.

Thus φ_{(Y,Z)}(t_J,t_K) = φ_Y(t_J)φ_Z(t_K), or, otherwise written, φ_{(Y,Z)} = φ_Y⊗φ_Z. The unicity theorem says that if two distributions have the same characteristic function, they must coincide. It means that P∘(Y,Z)⁻¹ = (P∘Y⁻¹)⊗(P∘Z⁻¹), i.e. Y and Z are independent.
Property 3.4. Convolution of normal distributions is normal. Precisely,

  X_1 ~ N(μ_1,C_1), X_2 ~ N(μ_2,C_2), X_1 independent of X_2 ⇒ X_1 + X_2 ~ N(μ_1+μ_2, C_1+C_2).

(Here it is understood that X_1 and X_2 have the same dimension!)

Proof. It is easy. According to (2.7), φ_{X_1}(t) = e^{it′μ_1 − t′C_1t/2} and φ_{X_2}(t) = e^{it′μ_2 − t′C_2t/2}. It follows that

  φ_{X_1+X_2}(t) = φ_{X_1}(t)φ_{X_2}(t) = e^{it′μ_1 − t′C_1t/2} e^{it′μ_2 − t′C_2t/2} = e^{it′(μ_1+μ_2) − t′(C_1+C_2)t/2}.
Corollary 3.5. x̄ is independent of s². Let (X_j)_{1≤j≤n} be i.i.d., X_j ~ N(μ,σ²). Let

  x̄ = x̄_n = (X_1 + X_2 + … + X_n)/n

be their average (from the law of large numbers we know that x̄_n → μ; in statistics one calls x̄_n an estimator of μ) and let

  s² := s_n²(X) = ((X_1−x̄)² + (X_2−x̄)² + … + (X_n−x̄)²)/(n−1) = (Σ_{j=1}^n X_j² − nx̄²)/(n−1)

(by the same law of large numbers, s_n² → σ²). Then x̄_n is independent of s_n².

Proof. Let us firstly suppose that X_j ~ N(0,1). Let us consider the n×n matrix

  A = ( 1/√n          1/√n          1/√n        …  1/√n
        1/√(1·2)     −1/√(1·2)      0           …  0
        1/√(2·3)      1/√(2·3)     −2/√(2·3)    …  0
        1/√(3·4)      1/√(3·4)      1/√(3·4)    …  0
        …
        1/√(n(n−1))   1/√(n(n−1))   …              −(n−1)/√(n(n−1)) )

(the k-th row, k ≥ 2, has the entry 1/√(k(k−1)) on the first k−1 positions, −(k−1)/√(k(k−1)) on the k-th position, and 0 elsewhere). The reader is invited to check that A is orthogonal, that is, that AA′ = I_n. Let X = (X_j)_{1≤j≤n} and Y = AX. By (3.1), Y ~ N(0,AI_nA′) = N(0,I_n). Thus the Y_j are all independent, according to Property 3.3. So Y_1, Y_2², Y_3², …, Y_n² are independent, too. But Y_1 = √n·x̄_n. On the other hand,

  Y_2² + Y_3² + … + Y_n² = Σ_{j=1}^n Y_j² − Y_1² = Σ_{j=1}^n (AX)_j² − nx̄² = X′A′AX − nx̄² = X′X − nx̄² (since A′A = I_n!) = Σ_{j=1}^n X_j² − nx̄² = (n−1)s².

It follows that x̄_n is independent of (n−1)s², hence the assertion of the corollary is proven in this case.

In the general case, X_j = μ + σY_j with Y_j independent and standard normal. Then x̄_n = μ + σȳ_n and s_n²(X) = σ²s_n²(Y). We know that ȳ_n is independent of s_n²(Y), therefore f(ȳ_n) is independent of g(s_n²(Y)) for any functions f and g. As a consequence, x̄_n is independent of s_n²(X).
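A quick numerical check of the orthogonal matrix used in this proof (the construction below follows the rows described above): AA′ = I_n, Y_1 = √n·x̄, and Y_2² + … + Y_n² = (n−1)s².

import numpy as np

n = 5
A = np.zeros((n, n))
A[0, :] = 1 / np.sqrt(n)
for k in range(2, n + 1):                     # row k, 2 <= k <= n
    A[k - 1, :k - 1] = 1 / np.sqrt(k * (k - 1))
    A[k - 1, k - 1] = -(k - 1) / np.sqrt(k * (k - 1))

print(np.allclose(A @ A.T, np.eye(n)))        # True: A is orthogonal

rng = np.random.default_rng(1)
X = rng.standard_normal(n)
Y = A @ X
print(np.isclose(Y[0], np.sqrt(n) * X.mean()))                  # True
print(np.isclose((Y[1:] ** 2).sum(), (n - 1) * X.var(ddof=1)))  # True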

4. Conditioning inside the normal distribution

Let X = (Y,Z) be an (m+n)-dimensional normally distributed vector. Thus Y = (Y_j)_{1≤j≤m} and Z = (Z_j)_{1≤j≤n}. We intend to prove that the regular conditioned distribution (see the lesson Conditioning, §3) P∘Y⁻¹(·|σ(Z)) is also normal.

First suppose that EX = 0. Let H be the Hilbert space spanned in L² by (Z_j)_{1≤j≤n}. Recall that the scalar product is defined by <U,V> = EUV. Thus

(4.1)  H = {Σ_{j=1}^n α_jZ_j | α_j ∈ ℝ, 1 ≤ j ≤ n}.

Let U ∈ L². We shall denote the orthogonal projection of U onto H by U*. Hence

(i)  U* = Σ_{j=1}^n α_jZ_j for some α = (α_j)_{1≤j≤n} ∈ ℝⁿ;
(ii)  U − U* ⊥ Z_j ∀ 1 ≤ j ≤ n.

We shall suppose that all the variables Z_j are linearly independent (viewed as vectors in the Hilbert space L²), i.e. the equality Σ_{j=1}^n α_jZ_j = 0 holds iff α = 0. In that case, U* can be computed as follows: write (ii) as <U−U*, Z_j> = 0 ∀ 1 ≤ j ≤ n. Replacing U* from (i), we get the following system of n equations with n unknowns α_1,…,α_n (the so-called normal equations):

(4.2)  Σ_{j=1}^n <Z_j,Z_k> α_j = <U,Z_k>, 1 ≤ k ≤ n.

The matrix G = (<Z_j,Z_k>)_{1≤j,k≤n} is called the Gram matrix. Remark that this matrix is invertible, since if t ∈ ℝⁿ then t′Gt = Σ_{1≤j,k≤n} <Z_j,Z_k>t_jt_k = ‖Σ_{j=1}^n t_jZ_j‖₂² ≥ 0 and, as the Z_j were supposed to be independent, the equality is possible iff t = 0. Thus the matrix G is positively defined, hence invertible; therefore (4.2) has the unique solution α = G⁻¹b(U) with b(U) = (<U,Z_k>)_{1≤k≤n}. Therefore the projection U* is U* = α′Z = (G⁻¹b(U))′Z = b(U)′G⁻¹Z (G′ = G!).
Proposition 4.1. Suppose that all the variables Z_j are linearly independent. Then the conditioned distribution P∘Y⁻¹(·|σ(Z)) is also normal. Precisely,

(4.3)  P∘Y⁻¹(·|σ(Z)) = N(Y*,C)

where Y* is the vector (Y*_j)_{1≤j≤m} = (b(Y_j)′G⁻¹Z)_{1≤j≤m} and c_{i,j} = Cov(Y_i−Y*_i, Y_j−Y*_j) = <Y_i−Y*_i, Y_j−Y*_j>.

Proof. We shall compute the conditioned characteristic function φ_{Y|Z}(s) = E(e^{is′Y}|Z). Let us consider the vector (Y−Y*,Z). It is normally distributed, too, because it is of the form AX for some matrix A. As Cov(Y_j−Y*_j, Z_k) = E(Z_k(Y_j−Y*_j)) = <Z_k, Y_j−Y*_j> = 0 ∀ 1 ≤ j ≤ m, 1 ≤ k ≤ n, Property 3.3 says that Y−Y* is independent of Z. Therefore

  E(e^{is′Y}|Z) = E(e^{is′(Y−Y*)+is′Y*}|Z) = e^{is′Y*}E(e^{is′(Y−Y*)}|Z) (by Property 11, lesson Conditioning) = e^{is′Y*}E(e^{is′(Y−Y*)}) (by Property 9, lesson Conditioning).

Now Y−Y* is normally distributed, by Corollary 3.2(iv), and its expectation is E(Y−Y*) = 0. Then φ_{Y−Y*}(s) = e^{−s′Cs/2}, where C is the covariance matrix of Y−Y*. We discovered that φ_{Y|Z}(s) = e^{is′Y* − s′Cs/2}. For every ω this is the characteristic function of N(Y*(ω),C).

Remark. As a consequence, the regression function E(Y|Z) coincides with Y*. Indeed, by the transport formula 3.5, lesson Conditioning, E(Y|Z) is the integral with respect to P∘Y⁻¹(·|σ(Z)), i.e. with respect to N(Y*,C). And that is exactly Y*. It follows that the regression function is linear in Z. Remark also that the conditioned covariance matrix C does not depend on Z.

The restriction that all the Z_j be linearly independent is not serious and may be removed.

Corollary 4.2. If X = (Y,Z) is normally distributed, then the regular conditioned distribution P∘Y⁻¹(·|σ(Z)) is also normal.

Proof. Let k be the dimension of H. Choose k r.v.'s among the Z_j's which form a basis in H. Denote them by Z_{j_1}, Z_{j_2},…,Z_{j_k}. Then the other Z_j are linear combinations of these k random variables, thus the σ-algebra σ(Z) is generated only by them. Let Z⁰ be the vector (Z_{j_1}, Z_{j_2},…,Z_{j_k}). It follows that P∘Y⁻¹(·|σ(Z)) = P∘Y⁻¹(·|σ(Z⁰)), and this is normal.


Now we shall remove the assumption that EX = 0.


Corollary 4.3. If X = (Y,Z) ~ N(μ,C), then P∘Y⁻¹(·|σ(Z)) is normal, too.

Proof. Let us center the vector X. Namely, let X⁰ = X − μ, Y⁰ = Y − μ_Y and Z⁰ = Z − μ_Z, where μ_Y = EY and μ_Z = EZ. Then Z = Z⁰ + μ_Z and Y = Y⁰ + μ_Y. From Proposition 4.1 we already know that P∘(Y⁰)⁻¹(·|σ(Z⁰)) = N(Y⁰*,C⁰), where Y⁰* is the projection of Y⁰ onto H and C⁰ is some covariance matrix. But σ(Z) = σ(Z⁰), therefore P∘(Y⁰)⁻¹(·|σ(Z)) = N(Y⁰*,C⁰). It means that P∘(Y⁰ + μ_Y)⁻¹(·|σ(Z)) = N(μ_Y + Y⁰*, C⁰).
Maybe it is illuminating to study the case n = 2. Let us first begin with the case EX = 0. The covariance matrix is

  C = ( c_{1,1}  c_{1,2}
        c_{2,1}  c_{2,2} )

with c_{i,j} = EX_iX_j. Then c_{1,1} = EX_1² = σ_1², c_{1,2} = c_{2,1} = rσ_1σ_2, where r is the correlation coefficient between X_1 and X_2 (r = EX_1X_2/(σ_1σ_2)), and c_{2,2} = EX_2² = σ_2². Remark that X_j ~ N(0,σ_j²), j = 1,2, and

  det(C) = det( σ_1²     rσ_1σ_2
                rσ_1σ_2  σ_2²    ) = σ_1²σ_2²(1−r²),

with the inverse

  C⁻¹ = (1/(σ_1²σ_2²(1−r²))) · (  σ_2²     −rσ_1σ_2
                                 −rσ_1σ_2   σ_1²    ).

Then the characteristic function is

  φ_X(s) = e^{−s′Cs/2} = e^{−(σ_1²s_1² + 2rσ_1σ_2s_1s_2 + σ_2²s_2²)/2}

and from (2.8) the density is

(4.4)  ρ_{0,C}(x) = (1/(2πσ_1σ_2√(1−r²))) · exp( −(x_1²/σ_1² − 2rx_1x_2/(σ_1σ_2) + x_2²/σ_2²) / (2(1−r²)) ).

In this case the projection of X_1 onto H is very simple: X_1* = aX_2 with a chosen such that <X_1−aX_2, X_2> = 0 ⇔ rσ_1σ_2 = aσ_2² ⇔ a = rσ_1/σ_2. The covariance matrix from (4.3) becomes a positive number, Var(X_1 − X_1*) = E(X_1 − X_1*)² = σ_1² − 2arσ_1σ_2 + a²σ_2² = σ_1²(1−r²), thus

(4.5)  P∘X_1⁻¹(·|σ(X_2)) = N((rσ_1/σ_2)X_2, σ_1²(1−r²)).

In the same way we see that

(4.6)  P∘X_2⁻¹(·|σ(X_1)) = N((rσ_2/σ_1)X_1, σ_2²(1−r²)).

If EX = (μ_1,μ_2), then, taking into account that X_j and X_j − μ_j generate the same σ-algebra, the formulae (4.4)-(4.6) become

(4.7)  ρ_{μ,C}(x) = (1/(2πσ_1σ_2√(1−r²))) · exp( −((x_1−μ_1)²/σ_1² − 2r(x_1−μ_1)(x_2−μ_2)/(σ_1σ_2) + (x_2−μ_2)²/σ_2²) / (2(1−r²)) )

(4.8)  P∘X_1⁻¹(·|σ(X_2)) = N(μ_1 + (rσ_1/σ_2)(X_2−μ_2), σ_1²(1−r²))

(4.9)  P∘X_2⁻¹(·|σ(X_1)) = N(μ_2 + (rσ_2/σ_1)(X_1−μ_1), σ_2²(1−r²)).
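A simulation sketch of (4.8), with illustrative parameters: among many draws of (X_1,X_2), those with X_2 near a fixed value z have X_1 approximately distributed as N(μ_1 + (rσ_1/σ_2)(z−μ_2), σ_1²(1−r²)).

import numpy as np

mu1, mu2, s1, s2, r = 1.0, -1.0, 2.0, 1.5, 0.6
C = np.array([[s1**2, r*s1*s2], [r*s1*s2, s2**2]])

rng = np.random.default_rng(2)
X = rng.multivariate_normal([mu1, mu2], C, size=500_000)

z = 0.5
sel = np.abs(X[:, 1] - z) < 0.05          # condition on X2 being near z
cond_mean = mu1 + r * s1 / s2 * (z - mu2) # formula (4.8)
cond_var = s1**2 * (1 - r**2)
print(X[sel, 0].mean(), cond_mean)        # both about 2.2
print(X[sel, 0].var(), cond_var)          # both about 2.56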

5. The multidimensional central limit theorem

The one-dimensional central limit theorem states that if (X_n)_n is a sequence of i.i.d. random variables from L² with EX_1 = a and σ(X_1) = σ, then s_n := (X_1 + X_2 + … + X_n − na)/√n converges in distribution to N(0,σ²). The multidimensional analog is:

Theorem 5.1. Let (X_n)_n be a sequence of i.i.d. random k-dimensional vectors. Let a = EX_1 and C = Cov(X_1). Then

(5.1)  s_n := (X_1 + X_2 + … + X_n − na)/√n → N(0,C) in distribution.

Proof. We shall apply the convergence theorem for characteristic functions. Let Y_n = X_n − a, let φ be the characteristic function of Y_1 and φ_n the characteristic function of s_n. Thus φ(t) = Ee^{it′(X_1−a)} and φ_n(t) = Ee^{it′s_n} = φⁿ(t/√n). We shall prove that φ_n(t) → φ_{N(0,C)}(t).

Let Z_n = t′Y_n. Then the random variables Z_n are i.i.d., from L², with EZ_n = t′EY_n = 0 and Var(Z_n) = t′Ct. Using the usual CLT, (Z_1 + Z_2 + … + Z_n)/√n converges in distribution to N(0, t′Ct). Let ψ_n be the characteristic function of (Z_1 + Z_2 + … + Z_n)/√n. It is easy to see that ψ_n(1) = φ_n(t). But ψ_n(1) → φ_{N(0,t′Ct)}(1) = e^{−t′Ct/2} = φ_{N(0,C)}(t), hence φ_n(t) → φ_{N(0,C)}(t).
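A small simulation sketch of Theorem 5.1 (the exponential population is just an illustrative choice; its components are independent with mean 1 and variance 1, so a = (1,1) and C = I_2):

import numpy as np

rng = np.random.default_rng(3)
k, n, reps = 2, 2_000, 20_000
X = rng.exponential(1.0, size=(reps, n, k))      # reps independent samples of n vectors
s = (X.sum(axis=1) - n * 1.0) / np.sqrt(n)       # one realization of s_n per repetition

print(s.mean(axis=0))    # close to (0, 0)
print(np.cov(s.T))       # close to the identity matrix C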

Corollary 5.2. Let X, Y be two i.i.d. random vectors from L² with the property that P∘X⁻¹ = P∘((X+Y)/√2)⁻¹. Then P∘X⁻¹ = N(0,C) for some covariance matrix C.

Proof. If X and (X+Y)/√2 have the same distribution, then EX = E(X+Y)/√2 = 2EX/√2 = √2·EX, hence EX = 0. Now let (X_n)_n be a sequence of i.i.d. random vectors having the same distribution as X. It is easy to prove by induction that (X_1 + X_2 + … + X_{2ⁿ})/√(2ⁿ) has the same distribution as X. (Indeed, for n = 1 it is our very assumption. Suppose it holds for n, check it for n+1. The vectors (X_1 + … + X_{2ⁿ})/√(2ⁿ) and (X_{2ⁿ+1} + … + X_{2ⁿ+2ⁿ})/√(2ⁿ) are i.i.d. and both have the distribution of X. Then

  (1/√2)·( (X_1 + … + X_{2ⁿ})/√(2ⁿ) + (X_{2ⁿ+1} + … + X_{2^{n+1}})/√(2ⁿ) ) = (X_1 + X_2 + … + X_{2^{n+1}})/√(2^{n+1})

must have the same distribution as X.) But s_n := (X_1 + X_2 + … + X_{2ⁿ})/√(2ⁿ) converges in distribution to N(0,C), where C = Cov(X). As the distribution of s_n does not change, being P∘X⁻¹, it means that P∘X⁻¹ = N(0,C).

Another intrinsic characterization of the normal distribution is the following:

Proposition 5.3. Let X and Y be two i.i.d. random vectors. Suppose that X+Y and X−Y are again i.i.d. Then X ~ N(0,C) for some covariance matrix C = Cov(X).

Proof. Let k be the dimension of X and let t ∈ ℝᵏ. Then t′X and t′Y are again i.i.d. As X+Y and X−Y are i.i.d., it follows that t′X + t′Y and t′X − t′Y are i.i.d. That is why we shall prove our claim first in the one-dimensional case. So now k = 1.

Let φ be the characteristic function of X. As X+Y and X−Y are independent, φ_{X+Y,X−Y}(s,t) = φ_{X+Y}(s)φ_{X−Y}(t), i.e. Ee^{is(X+Y)+it(X−Y)} = Ee^{is(X+Y)}Ee^{it(X−Y)}. The left member equals Ee^{iX(s+t)+iY(s−t)} = φ(s+t)φ(s−t), which is the same as

(5.2)  φ(s+t)φ(s−t) = φ²(s)φ(t)φ(−t) ∀ s,t.

On the other hand, X+Y and X−Y have the same distribution. It means that they have the same characteristic function. As φ_{X+Y}(t) = φ_X(t)φ_Y(t) = φ²(t) and φ_{X−Y}(t) = φ_X(t)φ_Y(−t) = φ(t)φ(−t), we infer that φ(t) = φ(−t) = φ̄(t) ∀ t. It follows that φ(t) ∈ ℝ ∀ t, hence (5.2) becomes

(5.3)  φ(s+t)φ(s−t) = φ²(s)φ²(t) ∀ s,t.

If s = t, (5.3) becomes φ(2s)φ(0) = φ⁴(s) ∀ s ⇒ φ(2s) = φ⁴(s) ≥ 0 ∀ s ⇒ φ(s) ≥ 0 ∀ s. Thus φ is non-negative and φ(t) = φ(−t) ∀ t.

Let h = log φ. Then (5.3) becomes

(5.4)  h(s+t) + h(s−t) = 2(h(s)+h(t)) ∀ s,t.

If in (5.4) we let t = 0, we get 2h(s) = 2(h(s) + h(0)) ⇒ h(0) = 0. If in (5.4) we let s = 0, we get h(t) + h(−t) = 2(h(t) + h(0)) = 2h(t) ⇒ h(t) = h(−t).

Finally, replacing h with κh, we see that (5.4) remains the same. That is why we shall accept that h(1) = 1. By induction, one checks that h(n) = n² for every positive integer n. Indeed, for n = 0 or n = 1 this is true. Suppose it holds up to n, and check it for n+1: letting s = n, t = 1 in (5.4) we get

(5.5)  h(n+1) + h(n−1) = 2(h(n)+h(1)) ⇒ h(n+1) + (n−1)² = 2n² + 2 ⇒ h(n+1) = (n+1)².

It follows that h(x) = x² for every integer x. Let now s = t. Then (5.4) becomes h(2t) = 4h(t). If 2t is an integer, we see that (2t)² = 4h(t) ⇒ h(t) = t². So the claim holds for halves of integers. Repeating the reasoning, the claim h(x) = x² holds for any number of the form x = m2⁻ⁿ, m, n integers. But the numbers of this form are dense, so by continuity the claim holds for any x. Remembering the constant κ, we get

(5.6)  h(x) = κx² ∀ x.

On the other hand, φ ≤ 1 ⇒ h ≤ 0 ⇒ κ ≤ 0 ⇒ κ = −σ²/2 for some nonnegative σ. The conclusion is that

(5.7)  φ(t) = exp(−σ²t²/2) for some σ ≥ 0.

Otherwise written, P∘X⁻¹ = N(0,σ²).

The proof for an arbitrary k runs as follows: let t ∈ ℝᵏ. Then t′X and t′Y are again i.i.d. Moreover, t′X + t′Y and t′X − t′Y are i.i.d., so t′X ~ N(0,σ²(t)). As t′X is in L² for any t, it follows that X is in L² itself. As Et′X = 0 ∀ t ∈ ℝᵏ, EX = 0. Let C be the covariance of X. Then Var(t′X) = t′Ct. But we know that t′X is normally distributed, hence t′X ~ N(0,t′Ct) ∀ t ∈ ℝᵏ. From Property 3.2 we infer that X ~ N(0,C).


II. STATISTICS


Basic Concepts


1. Populations, Samples and Statistics

By random samples ξ_1, ξ_2, …, ξ_n, … we mean random variables that are independent and taken from the same population ξ, i.e., random samples are independent and identically distributed random variables.

A function of random samples is called a statistic. The commonly used statistics are listed as follows:

Sample original moments of order k:

  A_n^(k) = (1/n) Σ_{i=1}^n ξ_i^k,  k = 1,2,…

Especially, the sample moment of order one, A_n^(1) = (1/n) Σ_{i=1}^n ξ_i = ξ̄, is also called the sample mean.

Sample central moments of order k:

  B_n^(k) = (1/n) Σ_{i=1}^n (ξ_i − ξ̄)^k,  k = 1,2,…

Sample variance:

  S_n² = (1/(n−1)) Σ_{i=1}^n (ξ_i − ξ̄)².

Note that the sample variance S_n² is different from the sample central moment of second order, B_n^(2) = (1/n) Σ_{i=1}^n (ξ_i − ξ̄)².
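A short sketch computing these statistics for a synthetic sample (NumPy used for convenience; the population parameters are illustrative):

import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(10.0, 2.0, size=1000)      # random samples from N(10, 4)
n = x.size

mean = x.sum() / n                         # sample mean
b2 = ((x - mean) ** 2).sum() / n           # sample central moment of order 2
s2 = ((x - mean) ** 2).sum() / (n - 1)     # sample variance (note the n - 1)

print(mean, b2, s2)    # about 10, and two slightly different estimates of 4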

Theorem. Let ξ_1, ξ_2, …, ξ_n, … be the random samples taken from the population ξ. Then for all positive integers k,

  P( lim_{n→+∞} (1/n) Σ_{i=1}^n ξ_i^k = E[ξ^k] ) = 1.

Proof. Note that ξ_1^k, ξ_2^k, …, ξ_n^k, … are independent and of the same distribution as that of ξ^k; it follows from the strong law of large numbers that

  P( lim_{n→+∞} (1/n) Σ_{i=1}^n ξ_i^k = E[ξ^k] ) = 1.  #

Remark: The theorem shows that the sample average approximates the statistical average.


2. Sample Distributions

The distribution of a statistic is called a sample distribution.

2.1. χ² (Chi-Square) Distribution

Definition. A continuous random variable is said to be χ² (chi-square) distributed with n degrees of freedom if its density function is as follows:

  f(x) = x^{n/2−1} e^{−x/2} / (2^{n/2} Γ(n/2)) for x > 0, and f(x) = 0 otherwise.

Remark 1: The degree of freedom n is the only parameter of the χ² distribution.

Remark 2: For all 0 < α < 1, the value χ_α²(n), called the upward percentage point, is defined by ∫_{χ_α²(n)}^{+∞} f(x)dx = α. The upward percentage point can be obtained by looking up the probability table concerned.

Theorem. If the random variable ξ has a χ²-distribution with n degrees of freedom, then

  μ = E[ξ] = n,  σ² = E(ξ − Eξ)² = 2n.

Theorem. If the random variables ξ_1, ξ_2, …, ξ_n are independent and of the same standard normal distribution N(0,1), then the random variable χ² = ξ_1² + ξ_2² + … + ξ_n² is distributed in accordance with the χ² (chi-square) distribution with n degrees of freedom.


Theorem. If the random variables ξ_1², ξ_2², …, ξ_k² are independent and possess χ²-distributions with n_1, n_2, …, n_k degrees of freedom respectively, then the random variable Σ_{i=1}^k ξ_i² possesses the χ²-distribution with Σ_{i=1}^k n_i degrees of freedom.

2.2. t (Student) Distribution

Definition. A continuous random variable is said to possess the so-called t (Student) distribution with n degrees of freedom if its density function is as follows:

  f(x) = Γ((n+1)/2) / (√(nπ) Γ(n/2)) · (1 + x²/n)^{−(n+1)/2},  −∞ < x < +∞.

Remark 1: The degree of freedom n is the only parameter of the t (Student) distribution.

Remark 2: For all 0 < α < 1, the value t_α(n), called the upward percentage point, is defined by ∫_{t_α(n)}^{+∞} f(x)dx = α. The upward percentage point can be obtained by looking up the probability table concerned.

Theorem. If the random variable ξ has a t-distribution with n degrees of freedom, then

  μ = E[ξ] = 0,  σ² = E(ξ − Eξ)² = n/(n−2) for n > 2.

Theorem. If the random variable ξ is distributed in accordance with the standard normal distribution N(0,1), the random variable η in accordance with the χ²-distribution with n degrees of freedom, and if ξ and η are independent of each other, then the random variable ξ/√(η/n) is distributed in accordance with the t-distribution with n degrees of freedom.

2.3. F-Distribution

Definition. A continuous random variable is said to possess the so-called F-distribution with m and n degrees of freedom if its density function is as follows:

  f(x) = Γ((n+m)/2)/(Γ(m/2)Γ(n/2)) · (m/n)^{m/2} x^{m/2−1} (1 + mx/n)^{−(n+m)/2} for x > 0, and f(x) = 0 otherwise.

Remark 1: The degrees of freedom m and n are the only two parameters of the F-distribution.

Remark 2: For all 0 < α < 1, the value F_α(m,n), called the upward percentage point, is defined by ∫_{F_α(m,n)}^{+∞} f(x)dx = α. The upward percentage point can be obtained by looking up the probability table concerned.

Theorem. If the random variable ξ ~ F(n,m), then 1/ξ ~ F(m,n).

Hint: It follows from the theorem that 1/F_{1−α}(n,m) = F_α(m,n). In fact,

  1−α = P{ξ > F_{1−α}(n,m)} = P{1/ξ < 1/F_{1−α}(n,m)} = 1 − P{1/ξ ≥ 1/F_{1−α}(n,m)}
  ⇒ P{1/ξ ≥ 1/F_{1−α}(n,m)} = α ⇒ 1/F_{1−α}(n,m) = F_α(m,n).

Theorem. If the random variable ξ has an F-distribution with m and n degrees of freedom, then

  μ = E[ξ] = n/(n−2) for n > 2,  σ² = E(ξ − Eξ)² = 2n²(m+n−2)/(m(n−2)²(n−4)) for n > 4.

Theorem. If the random variable ξ possesses a χ²-distribution with m degrees of freedom, the random variable η a χ²-distribution with n degrees of freedom, and if ξ and η are independent of each other, then the random variable (ξ/m)/(η/n) possesses the F-distribution with m and n degrees of freedom.
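The upward percentage points above are survival-function inverses, so they can be computed instead of looked up in tables. A sketch using scipy.stats (the values of α, n, m are illustrative):

from scipy import stats

alpha, n, m = 0.05, 10, 5

print(stats.chi2.isf(alpha, n))    # chi2_alpha(n)
print(stats.t.isf(alpha, n))       # t_alpha(n)
print(stats.f.isf(alpha, m, n))    # F_alpha(m, n)

# The identity 1/F_{1-alpha}(n, m) = F_alpha(m, n) from the hint:
print(1 / stats.f.isf(1 - alpha, n, m), stats.f.isf(alpha, m, n))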


3. Normal Populations

Theorem. Let ξ_1, ξ_2, …, ξ_n be the random samples taken from a normal population N(μ,σ²), with ξ̄ = (1/n) Σ_{i=1}^n ξ_i the sample mean and S² = (1/(n−1)) Σ_{i=1}^n (ξ_i − ξ̄)² the sample variance. Then:

(1) ξ̄ and S² are independent of each other;

(2) ξ̄ ~ N(μ, σ²/n),  (n−1)S²/σ² ~ χ²(n−1),  (ξ̄ − μ)/(S/√n) ~ t(n−1).

Theorem. Let ξ_1, ξ_2, …, ξ_n and η_1, η_2, …, η_m be the random samples taken from two independent normal populations N(μ_1,σ_1²) and N(μ_2,σ_2²) respectively, and let

  ξ̄ = (1/n) Σ_{i=1}^n ξ_i,  η̄ = (1/m) Σ_{i=1}^m η_i,
  S_1² = (1/(n−1)) Σ_{i=1}^n (ξ_i − ξ̄)²,  S_2² = (1/(m−1)) Σ_{i=1}^m (η_i − η̄)²,
  S² = ((n−1)S_1² + (m−1)S_2²)/(n+m−2).

Then:

(1)  ((ξ̄ − η̄) − (μ_1 − μ_2)) / √(σ_1²/n + σ_2²/m) ~ N(0,1);

(2)  (S_1²/σ_1²)/(S_2²/σ_2²) = ((n−1)S_1²/σ_1²)/(n−1) ÷ ((m−1)S_2²/σ_2²)/(m−1) ~ F(n−1, m−1);

(3)  ((ξ̄ − η̄) − (μ_1 − μ_2)) / (S√(1/n + 1/m)) ~ t(n+m−2)  if σ_1 = σ_2 = σ.

Parameter Estimation


1. Point Estimation

1.1. Point Estimators

Let ξ_1, ξ_2, …, ξ_n be the random samples taken from the same population characterized by a random variable ξ, and let θ be an unknown parameter appearing in the distribution of ξ. By point estimation we mean the attempt to look for a statistic g(ξ_1, ξ_2, …, ξ_n) to estimate the unknown parameter θ.

Unbiased estimators. An estimator g(ξ_1, ξ_2, …, ξ_n) for a parameter θ is said to be unbiased if E[g(ξ_1, ξ_2, …, ξ_n)] = θ.

Consistent estimators. An estimator g(ξ_1, ξ_2, …, ξ_n) for a parameter θ is said to be consistent if for all ε > 0, lim_{n→+∞} P{|g(ξ_1, ξ_2, …, ξ_n) − θ| ≥ ε} = 0.

Mean square consistent estimators. An estimator g(ξ_1, ξ_2, …, ξ_n) for a parameter θ is said to be mean square consistent if lim_{n→+∞} E[(g(ξ_1, ξ_2, …, ξ_n) − θ)²] = 0.

Efficient estimators. An unbiased estimator g_1(ξ_1, ξ_2, …, ξ_n) for a parameter θ is said to be more efficient than another unbiased estimator g_2(ξ_1, ξ_2, …, ξ_n) if

  E[(g_1(ξ_1, ξ_2, …, ξ_n) − θ)²] ≤ E[(g_2(ξ_1, ξ_2, …, ξ_n) − θ)²].
1.2. Method of Moments (MOM)

Assume that random samples ξ_1, ξ_2, …, ξ_n are taken from a population characterized by a random variable ξ. If the distribution of the population has m unknown parameters θ_1, θ_2, …, θ_m, let

  (1/n) Σ_{i=1}^n ξ_i^k = E_{(θ_1,θ_2,…,θ_m)}[ξ^k],  k = 1,2,…,m.

This is a system of m equations with m unknowns, the solution to which gives the so-called MOM estimators of θ_1, θ_2, …, θ_m.

Remark: The method of moments is motivated by the law of large numbers: the sample moments (1/n) Σ_{i=1}^n ξ_i^k approximate the theoretical moments E_{(θ_1,θ_2,…,θ_m)}[ξ^k], so equating the two and solving for the parameters should give reasonable estimates.
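A sketch of MOM for a N(μ,σ²) population: match the first two sample moments to E[ξ] = μ and E[ξ²] = σ² + μ² and solve (the true parameters below are illustrative).

import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(3.0, 1.5, size=5000)

m1 = x.mean()                # (1/n) sum of xi
m2 = (x ** 2).mean()         # (1/n) sum of xi^2

mu_hat = m1                  # from m1 = mu
sigma2_hat = m2 - m1 ** 2    # from m2 = sigma^2 + mu^2

print(mu_hat, sigma2_hat)    # about 3.0 and 2.25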

1.3. Maximum Likelihood Estimation (MLE)

Assume that random samples ξ_1, ξ_2, …, ξ_n are taken from the same population characterized by a random variable ξ. If the distribution of the population has m unknown parameters θ_1, θ_2, …, θ_m, one can define the likelihood function as follows:

If the random variable ξ is continuous and f_{(θ_1,θ_2,…,θ_m)}(x) is its probability density function, then the likelihood function is defined as

  L_{(θ_1,θ_2,…,θ_m)}(ξ_1, ξ_2, …, ξ_n) = Π_{i=1}^n f_{(θ_1,θ_2,…,θ_m)}(ξ_i).

If the random variable ξ is discrete and p_{(θ_1,θ_2,…,θ_m)}(x) = P_{(θ_1,θ_2,…,θ_m)}{ξ = x}, then the likelihood function is defined as

  L_{(θ_1,θ_2,…,θ_m)}(ξ_1, ξ_2, …, ξ_n) = Π_{i=1}^n p_{(θ_1,θ_2,…,θ_m)}(ξ_i).

The MLE estimators θ̂_1, θ̂_2, …, θ̂_m are ones such that

  (θ̂_1, θ̂_2, …, θ̂_m) = argmax L_{(θ_1,θ_2,…,θ_m)}(ξ_1, ξ_2, …, ξ_n).

Remark: It is clear that the resulting estimators θ̂_1, θ̂_2, …, θ̂_m are functions of ξ_1, ξ_2, …, ξ_n.

In practice, if the derivatives of the likelihood function L_{(θ_1,θ_2,…,θ_m)} with respect to the unknown parameters exist, one can obtain the MLE estimators from the solution to the following equations:

  ∂ ln L_{(θ_1,θ_2,…,θ_m)}(ξ_1, ξ_2, …, ξ_n) / ∂θ_i = 0,  i = 1,2,…,m.
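A sketch of MLE done numerically, for an exponential population f_λ(x) = λe^{−λx} (chosen for illustration; its MLE has the closed form 1/ξ̄, which the optimizer should reproduce):

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(6)
x = rng.exponential(scale=2.0, size=2000)    # true lambda = 0.5

def neg_log_lik(lam):
    # -ln L(lambda) = -(n ln(lambda) - lambda * sum(x))
    return -(np.log(lam) * x.size - lam * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method='bounded')
print(res.x, 1 / x.mean())    # both about 0.5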


2. Interval Estimation

Definition. Let ξ_1, ξ_2, …, ξ_n be the random samples taken from the same population and θ an unknown parameter appearing in the population distribution. If for all 0 < α < 1 (usually small enough) one can determine two statistics a(ξ_1, ξ_2, …, ξ_n) and b(ξ_1, ξ_2, …, ξ_n) such that

  P{a(ξ_1, ξ_2, …, ξ_n) < θ < b(ξ_1, ξ_2, …, ξ_n)} = 1 − α,

the interval (a(ξ_1, ξ_2, …, ξ_n), b(ξ_1, ξ_2, …, ξ_n)) is then called the confidence interval for the unknown parameter θ, with confidence coefficient 1 − α.

Remark: In practice, one can also consider one-tailed confidence intervals:

  P{−∞ < θ < b(ξ_1, ξ_2, …, ξ_n)} = 1 − α  or  P{a(ξ_1, ξ_2, …, ξ_n) < θ < +∞} = 1 − α.

Example. Suppose ξ_1, ξ_2, …, ξ_n are the random samples taken from a normal population N(μ,σ²).

(1) (Estimation of μ) If the variance σ² is known, it follows from (ξ̄−μ)/(σ/√n) ~ N(0,1) that for all 0 < α < 1,

  P{ |ξ̄−μ|/(σ/√n) < z_{α/2} } = 1 − α ⇔ P{ ξ̄ − z_{α/2}σ/√n < μ < ξ̄ + z_{α/2}σ/√n } = 1 − α.

(2) (Estimation of μ) If the variance σ² is unknown, it follows from (ξ̄−μ)/(S/√n) ~ t(n−1) that for all 0 < α < 1,

  P{ |ξ̄−μ|/(S/√n) < t_{α/2}(n−1) } = 1 − α ⇔ P{ ξ̄ − t_{α/2}(n−1)S/√n < μ < ξ̄ + t_{α/2}(n−1)S/√n } = 1 − α.

(3) (Estimation of σ²) It follows from (n−1)S²/σ² ~ χ²(n−1) that for all 0 < α < 1,

  P{ χ²_{1−α/2}(n−1) < (n−1)S²/σ² < χ²_{α/2}(n−1) } = 1 − α ⇔ P{ (n−1)S²/χ²_{α/2}(n−1) < σ² < (n−1)S²/χ²_{1−α/2}(n−1) } = 1 − α.

Remark: z_{α/2}, t_{α/2}(n−1), χ²_{1−α/2}(n−1) and χ²_{α/2}(n−1) are the upward percentage points of the corresponding distributions.
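A sketch computing the two-sided intervals (2) and (3) above for an illustrative sample:

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(10.0, 2.0, size=30)
n, alpha = x.size, 0.05
xbar, s = x.mean(), x.std(ddof=1)

# (2) confidence interval for mu, sigma unknown:
t = stats.t.isf(alpha / 2, n - 1)
print(xbar - t * s / np.sqrt(n), xbar + t * s / np.sqrt(n))

# (3) confidence interval for sigma^2:
lo = (n - 1) * s**2 / stats.chi2.isf(alpha / 2, n - 1)
hi = (n - 1) * s**2 / stats.chi2.isf(1 - alpha / 2, n - 1)
print(lo, hi)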

Example. Suppose the random samples ξ_1, ξ_2, …, ξ_n are taken from a normal population N(μ_1,σ_1²), the random samples η_1, η_2, …, η_m are taken from another normal population N(μ_2,σ_2²), and the two populations are independent of each other.

(1) (Estimation of μ_1 − μ_2) If the variances σ_1² and σ_2² are known, it follows from

  ((ξ̄−η̄) − (μ_1−μ_2)) / √(σ_1²/n + σ_2²/m) ~ N(0,1)

that for all 0 < α < 1,

  P{ |(ξ̄−η̄) − (μ_1−μ_2)| / √(σ_1²/n + σ_2²/m) < z_{α/2} } = 1 − α
  ⇔ P{ (ξ̄−η̄) − z_{α/2}√(σ_1²/n + σ_2²/m) < μ_1−μ_2 < (ξ̄−η̄) + z_{α/2}√(σ_1²/n + σ_2²/m) } = 1 − α.

(2) (Estimation of μ_1 − μ_2) If the variance σ_1² = σ_2² = σ² is unknown, it follows from

  ((ξ̄−η̄) − (μ_1−μ_2)) / (S√(1/n + 1/m)) ~ t(n+m−2),  where S² = ((n−1)S_1² + (m−1)S_2²)/(n+m−2),

that for all 0 < α < 1,

  P{ |(ξ̄−η̄) − (μ_1−μ_2)| / (S√(1/n + 1/m)) < t_{α/2}(n+m−2) } = 1 − α
  ⇔ P{ (ξ̄−η̄) − t_{α/2}(n+m−2)S√(1/n + 1/m) < μ_1−μ_2 < (ξ̄−η̄) + t_{α/2}(n+m−2)S√(1/n + 1/m) } = 1 − α.

(3) (Estimation of σ_1²/σ_2²) It follows from (n−1)S_1²/σ_1² ~ χ²(n−1) and (m−1)S_2²/σ_2² ~ χ²(m−1) that

  (S_1²/σ_1²)/(S_2²/σ_2²) = ((n−1)S_1²/σ_1²)/(n−1) ÷ ((m−1)S_2²/σ_2²)/(m−1) ~ F(n−1, m−1),

which leads to

  P{ F_{1−α/2}(n−1, m−1) < (S_1²/σ_1²)/(S_2²/σ_2²) < F_{α/2}(n−1, m−1) } = 1 − α
  ⇔ P{ (S_1²/S_2²)·1/F_{α/2}(n−1, m−1) < σ_1²/σ_2² < (S_1²/S_2²)·1/F_{1−α/2}(n−1, m−1) } = 1 − α.

Tests of Hypotheses

A statistical hypothesis H_0 is an assumption about the unknown parameters appearing in a population distribution, or about the population distribution itself. A number of random samples ξ_1, ξ_2, …, ξ_n taken from the population are then used to make the probability P{H_0 is rejected | H_0 is true} as small as possible. This is realized in practice by setting up the following equation:

  P{H_0 is rejected | H_0 is true} = α.

Typically, α = 0.05, α = 0.01, or the like.

1. Parameters from a Normal Population

Test of the hypothesis H_0: μ = μ_0 against the alternative H_1: μ ≠ μ_0 for the mean of a normal distribution with known variance σ².

If the hypothesis H_0: μ = μ_0 is true, then (ξ̄−μ_0)/(σ/√n) ~ N(0,1), which leads to

  P{H_0 is rejected | H_0 is true} = P{H_1 is accepted | H_0 is true} = P{ |ξ̄−μ_0|/(σ/√n) ≥ z_{α/2} } = α:

  if |ξ̄−μ_0|/(σ/√n) < z_{α/2}, then accept H_0: μ = μ_0; otherwise H_1: μ ≠ μ_0.

Test of the hypothesis H_0: μ = μ_0 against the alternative H_1: μ < μ_0 (μ > μ_0) for the mean of a normal distribution with known variance σ².

If the hypothesis H_0: μ = μ_0 is true, then (ξ̄−μ_0)/(σ/√n) ~ N(0,1), which leads to

  P{H_0 is rejected | H_0 is true} = P{H_1: μ < μ_0 is accepted | H_0 is true} = P{ (ξ̄−μ_0)/(σ/√n) ≤ −z_α } = α,
  P{H_0 is rejected | H_0 is true} = P{H_1: μ > μ_0 is accepted | H_0 is true} = P{ (ξ̄−μ_0)/(σ/√n) ≥ z_α } = α:

  if (ξ̄−μ_0)/(σ/√n) > −z_α, then accept H_0: μ = μ_0; otherwise H_1: μ < μ_0;
  if (ξ̄−μ_0)/(σ/√n) < z_α, then accept H_0: μ = μ_0; otherwise H_1: μ > μ_0.
Test of the hypothesis H_0: μ = μ_0 against the alternative H_1: μ ≠ μ_0 for the mean of a normal distribution with unknown variance.

If the hypothesis H_0: μ = μ_0 is true, then (ξ̄−μ_0)/(S/√n) ~ t(n−1), which leads to

  P{H_0 is rejected | H_0 is true} = P{H_1 is accepted | H_0 is true} = P{ |ξ̄−μ_0|/(S/√n) ≥ t_{α/2}(n−1) } = α:

  if |ξ̄−μ_0|/(S/√n) < t_{α/2}(n−1), then accept H_0: μ = μ_0; otherwise H_1: μ ≠ μ_0.

Test of the hypothesis H_0: μ = μ_0 against the alternative H_1: μ < μ_0 (μ > μ_0) for the mean of a normal distribution with unknown variance.

If the hypothesis H_0: μ = μ_0 is true, then (ξ̄−μ_0)/(S/√n) ~ t(n−1), which leads to

  P{H_0 is rejected | H_0 is true} = P{H_1: μ < μ_0 is accepted | H_0 is true} = P{ (ξ̄−μ_0)/(S/√n) ≤ −t_α(n−1) } = α,
  P{H_0 is rejected | H_0 is true} = P{H_1: μ > μ_0 is accepted | H_0 is true} = P{ (ξ̄−μ_0)/(S/√n) ≥ t_α(n−1) } = α:

  if (ξ̄−μ_0)/(S/√n) > −t_α(n−1), then accept H_0: μ = μ_0; otherwise H_1: μ < μ_0;
  if (ξ̄−μ_0)/(S/√n) < t_α(n−1), then accept H_0: μ = μ_0; otherwise H_1: μ > μ_0.
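A sketch of the two-sided t-test above (data and μ_0 illustrative); scipy's built-in one-sample test computes the same statistic:

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x = rng.normal(10.5, 2.0, size=25)    # the data; H0 says mu0 = 10
mu0, alpha = 10.0, 0.05

n = x.size
tstat = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
tcrit = stats.t.isf(alpha / 2, n - 1)

print(abs(tstat) < tcrit)             # True -> accept H0, False -> accept H1
print(stats.ttest_1samp(x, mu0))      # the same test, done by scipy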

Test of the hypothesis H_0: σ = σ_0 against the alternative H_1: σ ≠ σ_0 for the variance of a normal distribution.

If the hypothesis H_0: σ = σ_0 is true, then (n−1)S²/σ_0² ~ χ²(n−1), which leads to

  P{H_0 is rejected | H_0 is true} = P{ ((n−1)S²/σ_0² ≤ χ²_{1−α/2}(n−1)) or ((n−1)S²/σ_0² ≥ χ²_{α/2}(n−1)) } = α:

  if χ²_{1−α/2}(n−1) < (n−1)S²/σ_0² < χ²_{α/2}(n−1), then accept H_0; otherwise H_1.

Test of the hypothesis H_0: σ = σ_0 against the alternative H_1: σ < σ_0 (σ > σ_0) for the variance of a normal distribution.

If the hypothesis H_0: σ = σ_0 is true, then (n−1)S²/σ_0² ~ χ²(n−1), which leads to

  P{H_0 is rejected | H_0 is true} = P{H_1: σ > σ_0 is accepted | H_0 is true} = P{ (n−1)S²/σ_0² ≥ χ²_α(n−1) } = α,
  P{H_0 is rejected | H_0 is true} = P{H_1: σ < σ_0 is accepted | H_0 is true} = P{ (n−1)S²/σ_0² ≤ χ²_{1−α}(n−1) } = α:

  if (n−1)S²/σ_0² < χ²_α(n−1), then accept H_0: σ = σ_0; otherwise H_1: σ > σ_0;
  if (n−1)S²/σ_0² > χ²_{1−α}(n−1), then accept H_0: σ = σ_0; otherwise H_1: σ < σ_0.

2. Parameters from two Independent Normal Populations


Test of the hypothesis H_0: μ_1 = μ_2 against the alternative H_1: μ_1 ≠ μ_2 for the means of two independent normal distributions with unknown variances σ_1 = σ_2.

If the hypothesis H_0: μ_1 = μ_2 is true, then (ξ̄−η̄)/(S√(1/n_1 + 1/n_2)) ~ t(n_1+n_2−2), where S² = ((n_1−1)S_1² + (n_2−1)S_2²)/(n_1+n_2−2), which leads to

  P{H_0 is rejected | H_0 is true} = P{ |ξ̄−η̄|/(S√(1/n_1 + 1/n_2)) ≥ t_{α/2}(n_1+n_2−2) } = α:

  if |ξ̄−η̄|/(S√(1/n_1 + 1/n_2)) < t_{α/2}(n_1+n_2−2), then accept H_0; otherwise H_1.

Test of the hypothesis H_0: μ_1 = μ_2 against the alternative H_1: μ_1 < μ_2 (μ_1 > μ_2) for the means of two independent normal distributions with unknown variances σ_1 = σ_2.

If the hypothesis H_0: μ_1 = μ_2 is true, then (ξ̄−η̄)/(S√(1/n_1 + 1/n_2)) ~ t(n_1+n_2−2), where S² = ((n_1−1)S_1² + (n_2−1)S_2²)/(n_1+n_2−2), which leads to

  P{H_0 is rejected | H_0 is true} = P{H_1: μ_1 < μ_2 is accepted | H_0 is true} = P{ (ξ̄−η̄)/(S√(1/n_1 + 1/n_2)) ≤ −t_α(n_1+n_2−2) } = α,
  P{H_0 is rejected | H_0 is true} = P{H_1: μ_1 > μ_2 is accepted | H_0 is true} = P{ (ξ̄−η̄)/(S√(1/n_1 + 1/n_2)) ≥ t_α(n_1+n_2−2) } = α:

  if (ξ̄−η̄)/(S√(1/n_1 + 1/n_2)) > −t_α(n_1+n_2−2), then accept H_0; otherwise H_1: μ_1 < μ_2;
  if (ξ̄−η̄)/(S√(1/n_1 + 1/n_2)) < t_α(n_1+n_2−2), then accept H_0; otherwise H_1: μ_1 > μ_2.

Test of the hypothesis H_0: σ_1 = σ_2 against the alternative H_1: σ_1 ≠ σ_2 for the variances of two independent normal distributions.

If the hypothesis H_0: σ_1 = σ_2 is true, then S_1²/S_2² = (S_1²/σ_1²)/(S_2²/σ_2²) ~ F(n_1−1, n_2−1), which leads to

  P{H_0 is rejected | H_0 is true} = P{ (S_1²/S_2² ≤ F_{1−α/2}(n_1−1, n_2−1)) or (S_1²/S_2² ≥ F_{α/2}(n_1−1, n_2−1)) } = α:

  if F_{1−α/2}(n_1−1, n_2−1) < S_1²/S_2² < F_{α/2}(n_1−1, n_2−1), then accept H_0; otherwise H_1.

Test of the hypothesis H_0: σ_1 = σ_2 against the alternative H_1: σ_1 < σ_2 (σ_1 > σ_2) for the variances of two independent normal distributions.

If the hypothesis H_0: σ_1 = σ_2 is true, then S_1²/S_2² = (S_1²/σ_1²)/(S_2²/σ_2²) ~ F(n_1−1, n_2−1), which leads to

  P{H_0 is rejected | H_0 is true} = P{H_1: σ_1 < σ_2 is accepted | H_0 is true} = P{ S_1²/S_2² ≤ F_{1−α}(n_1−1, n_2−1) } = α,
  P{H_0 is rejected | H_0 is true} = P{H_1: σ_1 > σ_2 is accepted | H_0 is true} = P{ S_1²/S_2² ≥ F_α(n_1−1, n_2−1) } = α:

  if S_1²/S_2² ≥ F_{1−α}(n_1−1, n_2−1), then accept H_0; otherwise H_1: σ_1 < σ_2;
  if S_1²/S_2² ≤ F_α(n_1−1, n_2−1), then accept H_0; otherwise H_1: σ_1 > σ_2.
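A sketch of the two-sided F-test above, with illustrative samples:

import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
x = rng.normal(0.0, 1.0, size=20)
y = rng.normal(0.0, 1.5, size=25)
alpha = 0.05

F = x.var(ddof=1) / y.var(ddof=1)
lo = stats.f.isf(1 - alpha / 2, x.size - 1, y.size - 1)   # F_{1-alpha/2}
hi = stats.f.isf(alpha / 2, x.size - 1, y.size - 1)       # F_{alpha/2}
print(lo < F < hi)    # True -> accept H0, False -> accept H1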

III. RANDOM PROCESSES


Introduction


1. Definition

Definition. Let T be an index set. If for all t ∈ T, ξ_t is a random variable over the same probability space, then the collection of random variables {ξ_t | t ∈ T} is called a random process.

Remark 1: {ξ_t | t ∈ T} is called a discrete-time (parameter) random process if T is a countable (finite or denumerably infinite) set. {ξ_t | t ∈ T} is called a continuous-time random process if T is a continuum.

Remark 2: The set of all possible values the random variables of a process may take is called the state space of the process. The state space may be a continuum or a countable set.

Remark 3: There are four possible combinations for time and state of a random process: continuous-time and continuous-state, continuous-time and discrete-state, discrete-time and continuous-state, and discrete-time and discrete-state.

Definition. A random process {ξ_t | −∞ < t < +∞} is said to be periodic with period T if for all t, P{ξ_{t+T} = ξ_t} = 1.


2. Family of Finite-Dimensional Distributions


A random process { t t T} is often characterized by the joint distributions of every possible
collection of finite random variables t1 , t 2 , L , t n taken from the process:

F(x 1 , t 1 ; x 2 , t 2 ; L ; x n , t n ) = P t1 < x 1 ; t 2 < x 2 ; L ; t n < x n

All these joint distributions constitute the family of finite-dimensional distributions of the
process.

The Properties of the family of finite-Dimensional Distributions:


(1) Symmetry
F(x 1 , t 1 ; x 2 , t 2 ; L ; x n , t n ) = F(x Q (1) , t Q (1) ; x Q (2 ) , t Q (2 ) ; L ; x Q (n ) , t Q (n ) )
where (Q(1), Q(2 ), L , Q(n )) is a permutation of (1,2, L , n ) .

(2) Consistency
F(x 1 , t 1 ; x 2 , t 2 ; L ; x n , t n ) = F(x 1 , t 1 ; x 2 , t 2 ; L ; x n , t n ; x n +1 = +, t n +1 ; L ; x n + m = +, t n + m )

Kolmogorov Theorem If a family of finite-dimensional distributions satisfies the symmetry
and consistency conditions described above, then there must be a random process such that the
family is its family of finite-dimensional distributions.

Two random processes {ξ_t | t ∈ T} and {η_t | t ∈ T'} are jointly characterized by the joint
distributions of every possible collection of finitely many random variables taken from the two
processes respectively:

F(x1, t1; …; xn, tn; y1, t'1; …; ym, t'm) = P{ξ_t1 < x1; …; ξ_tn < xn; η_t'1 < y1; …; η_t'm < ym}

Two random processes {ξ_t | t ∈ T} and {η_t | t ∈ T'} are said to be independent if

F(x1, t1; …; xn, tn; y1, t'1; …; ym, t'm) = F(x1, t1; …; xn, tn) F(y1, t'1; …; ym, t'm)


3. Mathematical Expectations
Definition Let {ξ_t | t ∈ T} be a random process, then

The mean value of the process is defined as

μ_t = E[ξ_t].

The variance of the process is defined as

σ_t² = E[(ξ_t − μ_t)²]

The correlation function of the process is defined as

R(t1, t2) = E[ξ_t1 ξ_t2]

The covariance of the process is defined as

cov(t1, t2) = E[(ξ_t1 − μ_t1)(ξ_t2 − μ_t2)] = R(t1, t2) − μ_t1 μ_t2

Definition A random process {ξ_t | t ∈ T} is said to be weakly stationary if for all t ∈ T and
t + τ ∈ T, E[ξ_t ξ_{t+τ}] = R(τ), i.e., E[ξ_t ξ_{t+τ}] is independent of the choice of t.

Definition Two random processes {ξ_t | t ∈ T} and {η_t | t ∈ T} are said to be uncorrelated if for
all t1 ∈ T and t2 ∈ T, R_ξη(t1, t2) = E[ξ_t1 η_t2] = 0.


4. Examples
4.1. Processes with Independent, Stationary or Orthogonal
Increments
Definition (Independent Increments) A random process {ξ_t | t ∈ T} is said to have
independent increments if for all t1 < t2 < … < tn ∈ T, the increments
ξ_t2 − ξ_t1, ξ_t3 − ξ_t2, …, ξ_tn − ξ_{tn−1} are independent of each other.

Example Let {ξ_t | a ≤ t < +∞} be a random process with independent increments and
P{ξ_a = Const.} = 1, then for all a ≤ t1 < t2, cov(t1, t2) = σ_t1².

Proof:
For all a ≤ t < +∞, let η_t = ξ_t − E[ξ_t]; then the process {η_t | a ≤ t < +∞} is one with
independent increments, mean zero and P{η_a = 0} = 1. Thus we have

cov(t2, t1) = E[(ξ_t2 − E[ξ_t2])(ξ_t1 − E[ξ_t1])] = E[η_t2 η_t1] = E[((η_t2 − η_t1) + η_t1) η_t1]
= E[(η_t2 − η_t1)(η_t1 − η_a)] + E[η_t1²] = E[η_t2 − η_t1] E[η_t1 − η_a] + E[η_t1²]
= E[η_t1²] = E[(ξ_t1 − E[ξ_t1])²] = σ_t1²  #

Remark: cov(t1, t2) = σ²_{min{t1, t2}}

Definition (Stationary Increments) A random process {ξ_t | t ∈ T} is said to have stationary
increments if for all t < t + τ ∈ T, the distribution of the increment ξ_{t+τ} − ξ_t has nothing to do
with t.

Definition (Orthogonal Increments) A zero-mean random process {ξ_t | t ∈ T} is said to have
orthogonal increments if for all t1 < t2 ≤ t3 < t4 ∈ T, E[(ξ_t2 − ξ_t1)(ξ_t4 − ξ_t3)] = 0.

Remark: ξ and η are said to be orthogonal if E[ξη] = 0


4.2. Normal Processes


Definition A random process {ξ_t | t ∈ T} is said to be a normal/Gaussian process if all its
finite-dimensional distributions are normal/Gaussian.


Markov Processes (1)


1. General Properties
Definition A random process {ξ_t | t ∈ T} is called a Markov process if for all
t1 < t2 < … < tk < t_{k+1} ∈ T, its conditional distributions satisfy

F(x_{k+1}, t_{k+1} | x_k, t_k; x_{k−1}, t_{k−1}; …; x_1, t_1)
= P{ξ_{t_{k+1}} < x_{k+1} | ξ_{t_k} = x_k; ξ_{t_{k−1}} = x_{k−1}; …; ξ_{t_1} = x_1}
= P{ξ_{t_{k+1}} < x_{k+1} | ξ_{t_k} = x_k} = F(x_{k+1}, t_{k+1} | x_k, t_k)

Remark 1: The definition of a Markov process means that the future depends only on the
present and has nothing to do with the past (given the present, history tells nothing more about the future).

Remark 2: A Markov process is called a Markov chain if its state space is discrete.

Definition A Markov process {ξ_t | t ∈ T} is said to be homogenous if for all t < t + τ ∈ T, the
conditional distribution F(y, t + τ | x, t) = P{ξ_{t+τ} < y | ξ_t = x} is independent of time t, i.e.,
F(y, t + τ | x, t) = F(y, τ | x).

Theorem Let {ξ_t | t ≥ 0} be an IID random process, i.e., for all 0 ≤ t1 < t2 < … < tk < tk + τ, the
random variables ξ_t1, ξ_t2, …, ξ_tk, ξ_{tk+τ} are independent and identically distributed; then the
process is a homogenous Markov process.

Proof:

P{ξ_{tk+τ} < x | ξ_tk = xk; …; ξ_t1 = x1} = P{ξ_{tk+τ} < x; ξ_tk = xk; …; ξ_t1 = x1} / P{ξ_tk = xk; …; ξ_t1 = x1}

(by independence) = P{ξ_{tk+τ} < x} P{ξ_tk = xk} … P{ξ_t1 = x1} / (P{ξ_tk = xk} … P{ξ_t1 = x1})

= P{ξ_{tk+τ} < x} = P{ξ_{tk+τ} < x} P{ξ_tk = xk} / P{ξ_tk = xk}
(by independence) = P{ξ_{tk+τ} < x; ξ_tk = xk} / P{ξ_tk = xk} = P{ξ_{tk+τ} < x | ξ_tk = xk}

This shows that the process is a Markov process. Furthermore, for all 0 ≤ t < t + τ,

P{ξ_{t+τ} < x | ξ_t = y} (independence) = P{ξ_{t+τ} < x} (identical distribution) = P{ξ_τ < x} = P{ξ_τ < x | ξ_0 = y}

This shows that the Markov process is homogenous. #

Remark: From the proof of the theorem, one can see

Independence ⇒ Markov; Identical distribution ⇒ Homogeneity

Theorem Let {ξ_t | t ≥ 0} be a random process with ξ_0 = 0.

(1) If the increments of {ξ_t | t ≥ 0} are independent, the process is a Markov process.
(2) If the increments of {ξ_t | t ≥ 0} are both independent and stationary, then the process is a
homogenous Markov process.
Proof:

P{ξ_{tk+τ} < x | ξ_tk = xk; …; ξ_t2 = x2; ξ_t1 = x1}
= P{ξ_{tk+τ} < x; ξ_tk = xk; …; ξ_t2 = x2; ξ_t1 = x1} / P{ξ_tk = xk; …; ξ_t2 = x2; ξ_t1 = x1}
= P{ξ_{tk+τ} − ξ_tk < x − xk; ξ_tk − ξ_{tk−1} = xk − x_{k−1}; …; ξ_t2 − ξ_t1 = x2 − x1; ξ_t1 = x1}
  / P{ξ_tk − ξ_{tk−1} = xk − x_{k−1}; …; ξ_t2 − ξ_t1 = x2 − x1; ξ_t1 = x1}

(independent increments)
= P{ξ_{tk+τ} − ξ_tk < x − xk} P{ξ_tk − ξ_{tk−1} = xk − x_{k−1}} … P{ξ_t2 − ξ_t1 = x2 − x1} P{ξ_t1 = x1}
  / (P{ξ_tk − ξ_{tk−1} = xk − x_{k−1}} … P{ξ_t2 − ξ_t1 = x2 − x1} P{ξ_t1 = x1})

= P{ξ_{tk+τ} − ξ_tk < x − xk} (independent increments) = P{ξ_{tk+τ} − ξ_tk < x − xk; ξ_tk = xk} / P{ξ_tk = xk}
= P{ξ_{tk+τ} < x; ξ_tk = xk} / P{ξ_tk = xk} = P{ξ_{tk+τ} < x | ξ_tk = xk}

This shows that the process is a Markov process. Furthermore, for all t ≥ 0,

P{ξ_{tk+τ} < x | ξ_tk = xk} = P{ξ_{tk+τ} − ξ_tk < x − xk} (stationary increments) = P{ξ_{t+τ} − ξ_t < x − xk}

(independent increments) = P{ξ_{t+τ} − ξ_t < x − xk; ξ_t = xk} / P{ξ_t = xk}
= P{ξ_{t+τ} < x; ξ_t = xk} / P{ξ_t = xk} = P{ξ_{t+τ} < x | ξ_t = xk}

This shows that the Markov process is homogenous. #

Remark: In the probability space (Ω, F, P), we have

{ω | ω ∈ Ω, ξ(ω) = x, η(ω) = y} = {ω | ω ∈ Ω, ξ(ω) − η(ω) = x − y, η(ω) = y}


2. Discrete-Time Markov Chains

For a discrete-time Markov chain, the conditional probability P{ξ_{n+m} = y | ξ_n = x} is often
called its m-step transition probability.

Definition A discrete-time Markov chain {ξ_n | n ∈ T} is said to be homogenous if its transition
probability P{ξ_{n+m} = y | ξ_n = x} is independent of n.

Remark: From now on, discrete-time Markov chains appearing in this section are all
assumed to be homogenous.

2.1. Transition Probabilities

For a homogenous Markov chain, its k-step transition probability is often denoted as
p_xy^(k) = P{ξ_{n+k} = y | ξ_n = x}, where k is a non-negative integer. Note that

p_xy^(0) = P{ξ_n = y | ξ_n = x} = 1 if x = y, 0 if x ≠ y.

Chapman-Kolmogorov Theorem Let {ξ_n | n = 0, 1, …} be a homogenous Markov chain, then

p_xy^(n+k) = Σ_z p_xz^(n) p_zy^(k)

Proof:

p_xy^(n+k) = P{ξ_{n+k+m} = y | ξ_m = x} = P{ξ_{n+k+m} = y; ξ_m = x} / P{ξ_m = x}
= Σ_z P{ξ_{n+k+m} = y; ξ_{k+m} = z; ξ_m = x} / P{ξ_m = x}
= Σ_z P{ξ_{n+k+m} = y | ξ_{k+m} = z; ξ_m = x} P{ξ_{k+m} = z; ξ_m = x} / P{ξ_m = x}
= Σ_z P{ξ_{n+k+m} = y | ξ_{k+m} = z} P{ξ_{k+m} = z | ξ_m = x} = Σ_z p_xz^(k) p_zy^(n)  #

Remark: From the Chapman-Kolmogorov theorem, one can conclude that k-step transition
probabilities can be derived from the one-step transition probabilities. In fact,

p_xy^(2) = Σ_z p_xz p_zy, p_xy^(3) = Σ_z p_xz^(2) p_zy, …, p_xy^(k) = Σ_z p_xz^(k−1) p_zy, …


Example If we let

P = (p_xy) = [p_00 p_01 … p_0n …; p_10 p_11 … p_1n …; …; p_n0 p_n1 … p_nn …; …]

P^(m) = (p_xy^(m)) = [p_00^(m) p_01^(m) … p_0n^(m) …; p_10^(m) p_11^(m) … p_1n^(m) …; …; p_n0^(m) p_n1^(m) … p_nn^(m) …; …]

be the one-step transition matrix and m-step transition matrix of the chain, respectively, then
the theorem can be expressed in matrix form

P^(m) = P^m

In fact, from the Chapman-Kolmogorov theorem we have

p_xy^(2) = Σ_z p_xz p_zy ⇒ P^(2) = P²

p_xy^(3) = Σ_z p_xz^(2) p_zy ⇒ P^(3) = P^(2) P = P³

In this way we obtain that

P^(m) = P^m

Theorem Let {ξ_n | n = 0, 1, …} be a homogenous Markov chain, then the distribution of ξ_n can
be expressed as

p_x^(n) = P{ξ_n = x} = Σ_y P{ξ_n = x; ξ_0 = y} = Σ_y P{ξ_n = x | ξ_0 = y} P{ξ_0 = y} = Σ_y p_y p_yx^(n)

where p_y = P{ξ_0 = y} is the initial probability.

Remark 1: Recalling that k-step transition probabilities can be derived from one-step transition
probabilities, the theorem shows that the distribution of ξ_n is determined by the one-step
transition probabilities together with the initial probabilities.

Remark 2: If we let

p = (p_0, p_1, …, p_k, …), p^(n) = (p_0^(n), p_1^(n), …, p_k^(n), …)

be the initial probability vector and the probability vector at the moment n, respectively, then
the theorem can be expressed in matrix form

p^(n) = p P^(n) = p P^n
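The matrix identities P^(m) = P^m and p^(n) = p P^n are easy to check numerically. A minimal sketch, with a made-up 3-state chain and initial distribution:

```python
import numpy as np

# One-step transition matrix of a toy 3-state homogenous chain (rows sum to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
p0 = np.array([1.0, 0.0, 0.0])       # initial probability vector

Pm = np.linalg.matrix_power(P, 5)    # m-step transition matrix P^(5) = P^5
pn = p0 @ Pm                         # distribution at time 5: p^(5) = p P^5
print(Pm)
print(pn, pn.sum())                  # a probability vector (sums to 1)
```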

Theorem Let {ξ_n | n = 0, 1, …} be a homogenous Markov chain, then the joint distribution of
ξ_n1, ξ_n2, …, ξ_nk, ξ_{n_{k+1}} can be expressed as

P{ξ_{n_{k+1}} = x_{k+1}; ξ_{n_k} = x_k; …; ξ_{n_1} = x_1}
= P{ξ_{n_{k+1}} = x_{k+1} | ξ_{n_k} = x_k; …; ξ_{n_1} = x_1} P{ξ_{n_k} = x_k; …; ξ_{n_1} = x_1}
= P{ξ_{n_{k+1}} = x_{k+1} | ξ_{n_k} = x_k} P{ξ_{n_k} = x_k; …; ξ_{n_1} = x_1}
= p_{x_k x_{k+1}}^(n_{k+1} − n_k) P{ξ_{n_k} = x_k; …; ξ_{n_1} = x_1}
= … = P{ξ_{n_1} = x_1} p_{x_1 x_2}^(n_2 − n_1) … p_{x_{k−1} x_k}^(n_k − n_{k−1}) p_{x_k x_{k+1}}^(n_{k+1} − n_k)

Remark: Again, the joint distribution of ξ_n1, ξ_n2, …, ξ_nk, ξ_{n_{k+1}} is determined by the one-step
transition probabilities together with the initial probabilities.

2.2. Classification of States

2.2.1. Communication
Definition A state y is said to be accessible from a state x if there is a nonnegative integer n
such that p_xy^(n) > 0, often denoted by x → y. Two states x and y are said to communicate with
each other if they are accessible from one another, often denoted by x ↔ y.

Theorem Communication is an equivalence relation, i.e.,

(1) (Reflexivity) for all states x, x ↔ x
(2) (Symmetry) for any two states x and y, if x ↔ y, then y ↔ x
(3) (Transitivity) for any three states x, y and z, if x ↔ y and y ↔ z, then x ↔ z

Hint:

p_xx^(0) = P{ξ_0 = x | ξ_0 = x} = P{ξ_0 = x; ξ_0 = x} / P{ξ_0 = x} = 1 > 0 (Reflexivity)

x ↔ y ⇒ p_xy^(n) > 0 and p_yx^(k) > 0 ⇒ y ↔ x (Symmetry)


p_xy^(n) > 0, p_yz^(k) > 0 ⇒ p_xz^(n+k) = Σ_t p_xt^(n) p_tz^(k) ≥ p_xy^(n) p_yz^(k) > 0 (Transitivity, by the C-K equation)

Remark: Since communication is an equivalence relation, one can divide the state space into
disjoint equivalence classes; the states in the same equivalence class communicate with each
other, while states belonging to different equivalence classes can't.

Definition A homogenous Markov chain is said to be irreducible if any two states of the
chain can communicate with each other.

2.2.2. Recurrence
Let

f_xy^(k) = P{ξ_{n+k} = y; ξ_{n+k−1} ≠ y; …; ξ_{n+1} ≠ y | ξ_n = x}, k ≥ 1

be the probability that a homogenous Markov chain starting from the state x reaches the
state y for the first time after k steps. Furthermore, let f_xy = Σ_{k=1}^{+∞} f_xy^(k); f_xy is then the probability
that the chain starting from the state x reaches the state y for the first time after some
finite number of steps.

Remark 1: Note that for all positive integers k, 0 ≤ f_xy^(k) ≤ f_xy ≤ 1.

Remark 2: It follows from the definition of f_xy^(k) that for all n ≥ 1, p_xy^(n) = Σ_{k=1}^{n} f_xy^(k) p_yy^(n−k).

Definition A state x of a homogenous Markov chain is said to be recurrent if, after starting
from it, the probability of returning to it after some finite steps is one, i.e., f xx = 1 . A state that
is not recurrent is said to be transient.


Example Let

        a ( 1/2  1/2   0    0  )
        b ( 1/2  1/2   0    0  )
P =     c ( 1/4  1/4  1/4  1/4 )
        d (  0    0    0    1  )

be the one-step transition probability matrix of a Markov chain, then the states a, b and d are
recurrent, while c is transient.

Theorem A state x of a homogenous Markov chain is recurrent if and only if

Σ_{n=1}^{+∞} p_xx^(n) = +∞.

Proof:
For 1 ≤ n ≤ N,

Σ_{n=1}^{N} p_xx^(n) = Σ_{n=1}^{N} Σ_{k=1}^{n} f_xx^(k) p_xx^(n−k) = Σ_{k=1}^{N} f_xx^(k) Σ_{n=k}^{N} p_xx^(n−k) = Σ_{k=1}^{N} f_xx^(k) Σ_{t=0}^{N−k} p_xx^(t)

(1) Suppose Σ_{n=1}^{+∞} p_xx^(n) = +∞. Then

Σ_{n=1}^{N} p_xx^(n) = Σ_{k=1}^{N} f_xx^(k) Σ_{t=0}^{N−k} p_xx^(t) ≤ Σ_{k=1}^{N} f_xx^(k) Σ_{t=0}^{N} p_xx^(t)

⇒ Σ_{k=1}^{N} f_xx^(k) ≥ Σ_{n=1}^{N} p_xx^(n) / (1 + Σ_{t=1}^{N} p_xx^(t))

⇒ 1 ≥ f_xx = lim_{N→+∞} Σ_{k=1}^{N} f_xx^(k) ≥ lim_{N→+∞} Σ_{n=1}^{N} p_xx^(n) / (1 + Σ_{t=1}^{N} p_xx^(t)) = 1
(because Σ_{n=1}^{+∞} p_xx^(n) = +∞)

⇒ f_xx = 1

This implies that x is a recurrent state.

(2) Suppose f_xx = 1; we now prove that Σ_{n=1}^{+∞} p_xx^(n) = +∞. By reduction to absurdity, we first
assume that Σ_{n=1}^{+∞} p_xx^(n) < +∞. Then, for all 1 ≤ N' ≤ N,

Σ_{n=1}^{N} p_xx^(n) ≥ Σ_{k=1}^{N'} f_xx^(k) Σ_{t=0}^{N−k} p_xx^(t) ≥ Σ_{k=1}^{N'} f_xx^(k) Σ_{t=0}^{N−N'} p_xx^(t)

⇒ Σ_{k=1}^{N'} f_xx^(k) ≤ Σ_{n=1}^{N} p_xx^(n) / (1 + Σ_{t=1}^{N−N'} p_xx^(t))

Letting N → +∞ and then N' → +∞,

1 = f_xx = lim_{N'→+∞} Σ_{k=1}^{N'} f_xx^(k) ≤ Σ_{n=1}^{+∞} p_xx^(n) / (1 + Σ_{t=1}^{+∞} p_xx^(t)) < 1

This absurd result shows that the assumption Σ_{n=1}^{+∞} p_xx^(n) < +∞ is not true. #

Remark: If a state x is recurrent, the chain will return to x infinitely many times. If a state x is
transient, the chain will leave x forever after returning to x finitely many times.
Therefore, if the state space of a chain is finite, at least one of its states must be recurrent.

Theorem If x is recurrent and x → y, then

(1) y → x, i.e., x ↔ y
(2) y is also recurrent

Proof:
The conclusion y → x is self-evident, otherwise x would not be recurrent. Furthermore,

x ↔ y ⇒ p_xy^(n) > 0, p_yx^(k) > 0

p_yy^(k+m+n) = Σ_z Σ_w p_yz^(k) p_zw^(m) p_wy^(n) ≥ p_yx^(k) p_xx^(m) p_xy^(n)

⇒ Σ_{m=1}^{+∞} p_yy^(k+m+n) ≥ p_yx^(k) p_xy^(n) Σ_{m=1}^{+∞} p_xx^(m) = +∞ (since x is recurrent)

This implies that y is recurrent. #

Remark: Although a transient state can reach a recurrent state, a recurrent state cannot reach
a transient state.

Theorem If a homogenous Markov chain with finite state space is irreducible, then all its
states are recurrent.

Proof:
Recall that a homogenous Markov chain with finite state space must have at least one recurrent
state x. For any other state y, it follows from the irreducibility of the chain that x and y
communicate with each other and therefore y must also be recurrent. #

2.2.3. Decomposition of a State Space

Definition Let S be the state space of a homogenous Markov chain and A ⊂ S; A is said to be
closed if the states in A cannot reach the states outside A, i.e., for all x ∈ A, y ∉ A and
n ≥ 1, p_xy^(n) = 0.
Remark: The fact that A is closed does not exclude the possibility of a state outside A
reaching a state inside A.

Theorem Let R be the set of all recurrent states of a homogenous Markov chain, then
(1) R is closed.
(2) If a binary relation ~ is defined on R such that for all x, y ∈ R, x ~ y ⇔ x ↔ y,
then the relation is an equivalence relation.
Hint: As we have proven in the preceding subsection, a recurrent state can't reach a transient
state. Thus R is closed.
Remark 1: Since the communication relation ~ on R is an equivalence relation, R can be
divided into disjoint equivalence classes R = R1 + R2 + …. It is clear that each of the equivalence
classes is also closed.

Remark 2: The state space S of a homogenous Markov chain can be decomposed as

S = T + R = T + R1 + R2 + …

where T is the set of all transient states of the chain.


Example Let

        a ( 1/2  1/2   0    0  )
        b ( 1/2  1/2   0    0  )
P =     c ( 1/4  1/4  1/4  1/4 )
        d (  0    0    0    1  )

be the one-step transition probability matrix of a Markov chain, then the states a, b and d are
recurrent, while c is transient. The state space S = {a, b, c, d} can be decomposed as

S = T + R1 + R2

where T = {c}, R1 = {a, b} and R2 = {d}.

2.2.4. Periodicity and Ergodicity

Definition Let x be a recurrent state of a homogenous Markov chain and T_x the number of
steps after which the state x returns to itself for the first time, then

(1) the state x is said to be null recurrent if E[T_x] = Σ_{k=1}^{+∞} k P{T_x = k} = Σ_{k=1}^{+∞} k f_xx^(k) = +∞.

(2) the state x is said to be positive recurrent if it is not null recurrent.

Definition A state x of a homogenous Markov chain is said to have period T > 1 if p_xx^(n) = 0
whenever n ≠ kT, and T is the largest positive integer with this property. A state that is not
periodic is said to be aperiodic.

Remark: One should distinguish between the periodicity of a random process and that
of a state of the process.

Definition A state of a homogenous Markov chain is said to be ergodic if it is both positive
recurrent and aperiodic.

2.3. Stationary & Limit Distributions

2.3.1. Stationary Distributions
Definition Let p_ij be the one-step transition probability of a homogenous Markov chain; a
discrete distribution {π_i} is called a stationary distribution of the chain if

Σ_i π_i p_ij = π_j.

Remark: if π_i ≥ 0 and Σ_i π_i = 1, then {π_i} is said to be a discrete distribution.

2.3.2. Limit Distributions

Definition A homogenous Markov chain is said to be ergodic if lim_{n→+∞} p_xy^(n) = π_y ≥ 0 and
Σ_y π_y = 1.

Remark 1: {π_y} is often said to be the chain's limit distribution.

Remark 2: lim_{n→+∞} p_xy^(n) = π_y means that p_xy^(n) is independent of the starting state x when n is
large enough.

2.3.3. The Relation between Stationary Distributions and Limit Distributions
Definition A homogenous Markov chain is said to be regular if there is a positive integer n
such that for all states x and y of the chain, p_xy^(n) > 0.
Remark: For a finite state space, regularity is equivalent to irreducibility together with
aperiodicity; for an infinite state space, regularity implies irreducibility, but irreducibility
does not necessarily imply regularity.

Theorem (Ergodic Theorem) If a finite-state homogenous Markov chain is regular, then the
chain is ergodic and its limit distribution is also its stationary distribution.
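The theorem can be illustrated numerically. A minimal sketch, assuming an arbitrary 3-state regular chain: the stationary distribution is obtained as the normalized left eigenvector of P for eigenvalue 1, and the rows of P^n are seen to converge to it.

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Left eigenvector of P for eigenvalue 1, normalized to a probability vector.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

print(pi)
print(np.linalg.matrix_power(P, 50))   # every row approaches pi
```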

2.4. Examples: Simple Random Walks

By the simple random walk of a particle on a line, one means that at each moment the
particle moves either one step forwards with probability p or one step backwards
with probability q = 1 − p.

Let {ξ_n | n = 0, 1, …} be a random process such that ξ_n indicates the location of the particle at
the moment n; we will then address the following issues:
Is the process a homogeneous Markov chain?
Let η_1, η_2, …, η_m, … be the random variables such that {η_m = 1} indicates the event that
the particle moves one step forwards at the moment m and {η_m = −1} the event that the
particle moves one step backwards at the moment m; then ξ_n = Σ_{m=1}^{n} η_m + k_0, where k_0 is
the initial location of the particle. Note that η_1, η_2, …, η_m, … are independent and
identically distributed with P{η_m = k} = p if k = 1 and (1 − p) = q if k = −1, for all m. It can
then be easily proven that the process {ξ_n | n = 0, 1, …} is one with independent and stationary
increments and therefore a homogenous Markov chain.
P{ξ_n = k} = ?

P{ξ_n = k} = P{Σ_{m=1}^{n} η_m + k_0 = k} = P{Σ_{m=1}^{n} η_m = k − k_0}
= C_n^{(n+k−k_0)/2} p^{(n+k−k_0)/2} q^{(n−k+k_0)/2}

P{ξ_{n+1} = j | ξ_n = i} = ?

P{ξ_{n+1} = j | ξ_n = i} = p if j = i + 1; q if j = i − 1; 0 otherwise; n ≥ 0.
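The binomial formula for P{ξ_n = k} can be checked by Monte Carlo simulation; a minimal sketch with arbitrary p, n, k_0:

```python
import numpy as np
from math import comb

# Monte Carlo check of P{xi_n = k} for the simple random walk started at k0.
rng = np.random.default_rng(0)
p, n, k0 = 0.6, 10, 0
steps = rng.choice([1, -1], p=[p, 1 - p], size=(100_000, n))
xi_n = k0 + steps.sum(axis=1)

k = 2                                   # n + k - k0 must be even, otherwise P = 0
a = (n + k - k0) // 2                   # number of forward steps needed
exact = comb(n, a) * p**a * (1 - p)**(n - a)
print((xi_n == k).mean(), exact)        # empirical vs exact probability
```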

Appendix Eigenvalue Diagonalization


Definition Let A be an n × n matrix; if there is a nonzero number λ and a nonzero vector x
such that Ax = λx, then λ is called an eigenvalue of A and x an eigenvector with respect to
λ.

Remark:
Ax = λx ⇔ (A − λI)x = 0 ⇔ |A − λI| = 0

There are at most n different eigenvalues for an n × n matrix.

Theorem If an n × n matrix A has n linearly-independent eigenvectors x_1, x_2, …, x_n, then A
can be diagonalized as

X⁻¹AX = Λ = diag(λ_1, λ_2, …, λ_n)
where X = (x_1, x_2, …, x_n).
Remark:

AX = XΛ ⇔ A = XΛX⁻¹ ⇒ Aⁿ = XΛⁿX⁻¹

Example Let A = ( 1−a  a ; b  1−b ), where 0 < a, b < 1.
(1) The eigenvalues and eigenvectors of A are given as follows:

|A − λI| = (1 − a − λ)(1 − b − λ) − ab = (1 − λ)² − (a + b)(1 − λ) = 0

⇒ λ_1 = 1, λ_2 = 1 − a − b

Ax = λ_1 x ⇒ x_1 = (1, 1)ᵀ; Ax = λ_2 x ⇒ x_2 = (a, −b)ᵀ

X = (x_1, x_2) = ( 1  a ; 1  −b ), X⁻¹ = 1/(a + b) ( b  a ; 1  −1 )

(2) It follows from A = X diag(1, 1 − a − b) X⁻¹ that

Aⁿ = X ( 1  0 ; 0  (1−a−b)ⁿ ) X⁻¹ = 1/(a + b) ( b  a ; b  a ) + (1 − a − b)ⁿ/(a + b) ( a  −a ; −b  b )

→ 1/(a + b) ( b  a ; b  a ) as n → +∞
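The closed form for Aⁿ and its limit can be verified numerically; a minimal sketch with arbitrary values a = 0.3, b = 0.2:

```python
import numpy as np

a, b = 0.3, 0.2
A = np.array([[1 - a, a],
              [b, 1 - b]])

w, X = np.linalg.eig(A)                 # eigenvalues 1 and 1 - a - b
print(np.sort(w))                       # [1 - a - b, 1]

An = np.linalg.matrix_power(A, 100)     # A^n for large n
limit = np.array([[b, a], [b, a]]) / (a + b)
print(An)
print(limit)                            # A^n approaches the limit matrix
```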

Markov Processes (2)


1. Continuous-Time Markov Chains

For a continuous-time Markov chain {ξ_t | t ∈ T}, the conditional probability P{ξ_{t+τ} = y | ξ_t = x}
is often called its transition probability.

Definition A continuous-time Markov chain {ξ_t | t ∈ T} is homogenous if its transition
probability P{ξ_{t+τ} = y | ξ_t = x} is independent of t.
Remark: In this section continuous-time Markov chains are always assumed to be
homogenous.

Theorem (Chapman-Kolmogorov Equation) Let {ξ_t | t ∈ T} be a homogenous continuous-time
Markov chain and p_ij(τ) = P{ξ_{t+τ} = j | ξ_t = i}, then

p_ij(τ + ν) = P{ξ_{t+τ+ν} = j | ξ_t = i} = P{ξ_{t+τ+ν} = j; ξ_t = i} / P{ξ_t = i}
= Σ_k P{ξ_{t+τ+ν} = j; ξ_{t+τ} = k; ξ_t = i} / P{ξ_t = i}
= Σ_k P{ξ_{t+τ+ν} = j | ξ_{t+τ} = k; ξ_t = i} P{ξ_{t+τ} = k | ξ_t = i} = Σ_k p_ik(τ) p_kj(ν)

1.1. Transition Rates

Definition A homogenous continuous-time Markov chain {ξ_t | t ∈ T} is said to be
random-continuous if

lim_{τ→0+} p_ij(τ) = lim_{τ→0+} P{ξ_{t+τ} = j | ξ_t = i} = 1 if i = j, 0 if i ≠ j; i.e., lim_{τ→0+} p_ij(τ) = δ_ij.

Remark: Random continuity means that the chain cannot change from one state to another in
no time. From now on, homogenous continuous-time Markov chains in this section are all
assumed to be random-continuous.

Theorem For a continuous-time Markov chain,

(1) q_ij = lim_{τ→0+} p_ij(τ)/τ < +∞, where i ≠ j

(2) q_ii = lim_{τ→0+} [p_ii(τ) − 1]/τ > −∞

Remark 1: q_ij is called the transition rate from state i to state j, which plays the same role as
that of the one-step transition probability in the case of discrete-time Markov chains.
Remark 2: q_ij can be uniformly expressed as

q_ij = p'_ij(0) = lim_{τ→0+} [p_ij(τ) − p_ij(0)]/τ
     = lim_{τ→0+} [p_ii(τ) − 1]/τ if i = j; lim_{τ→0+} p_ij(τ)/τ if i ≠ j

Definition If for all i, Σ_j q_ij = 0, the chain is said to be conservative.

Remark 1: If Σ_j q_ij = 0, then q_ii = −Σ_{j≠i} q_ij.

Remark 2: It can be proven that finite-state Markov chains are conservative. In fact,

Σ_j q_ij = Σ_j lim_{τ→0+} [p_ij(τ) − δ_ij]/τ = lim_{τ→0+} Σ_j [p_ij(τ) − δ_ij]/τ = lim_{τ→0+} (1 − 1)/τ = 0

1.2. Kolmogorov Forward and Backward Equations

Theorem For a finite-state Markov chain, we have
(1) Kolmogorov's forward equation

dp_ij(τ)/dτ = Σ_k p_ik(τ) q_kj, τ ≥ 0

(2) Kolmogorov's backward equation

dp_ij(τ)/dτ = Σ_k q_ik p_kj(τ), τ ≥ 0

Proof:

dp_ij(τ)/dτ = lim_{Δτ→0} [p_ij(τ + Δτ) − p_ij(τ)]/Δτ = lim_{Δτ→0} [Σ_k p_ik(τ) p_kj(Δτ) − p_ij(τ)]/Δτ
= Σ_k p_ik(τ) lim_{Δτ→0} [p_kj(Δτ) − δ_kj]/Δτ = Σ_k p_ik(τ) q_kj

dp_ij(τ)/dτ = lim_{Δτ→0} [p_ij(τ + Δτ) − p_ij(τ)]/Δτ = lim_{Δτ→0} [Σ_k p_ik(Δτ) p_kj(τ) − p_ij(τ)]/Δτ
= Σ_k lim_{Δτ→0} {[p_ik(Δτ) − δ_ik]/Δτ} p_kj(τ) = Σ_k q_ik p_kj(τ)  #

Remark: The Kolmogorov equations are ordinary differential equations of the first order, which
can be solved as long as the transition rates q_ij and initial transition probabilities p_ij(0)
are given. Note that p_ij(0) = δ_ij if the process is random-continuous.

Example (Two-State Markov Chain) Consider a two-state Markov chain {ξ_t | t ∈ T} that
spends an exponential time with rate λ in state 0 before going to state 1, where it spends
another exponential time with rate μ before returning to state 0. Then, what are the
transition probabilities p_ij(τ) = ?, where i, j = 0, 1.

Solution:

Transition rates
Suppose the chain has stayed at the state 0 for some time t and let τ₀ be the exponential
holding time in state 0; then, by the memoryless property,

p_01(Δt) = P{ξ_{t+Δt} = 1 | ξ_t = 0} = P{t ≤ τ₀ < t + Δt | τ₀ ≥ t} = λΔt + o(Δt)

q_01 = lim_{Δt→0+} p_01(Δt)/Δt = lim_{Δt→0+} [λΔt + o(Δt)]/Δt = λ, q_00 = −q_01 = −λ

Suppose the chain has stayed at the state 1 for some time t; similarly,

p_10(Δt) = P{ξ_{t+Δt} = 0 | ξ_t = 1} = μΔt + o(Δt)

q_10 = lim_{Δt→0+} p_10(Δt)/Δt = μ, q_11 = −q_10 = −μ

Kolmogorov forward equations

p'_i0(τ) = Σ_k p_ik(τ) q_k0 = p_i0(τ) q_00 + p_i1(τ) q_10 = −λ p_i0(τ) + μ p_i1(τ)
        = −(λ + μ) p_i0(τ) + μ [p_i0(τ) + p_i1(τ)]

p'_i1(τ) = Σ_k p_ik(τ) q_k1 = p_i0(τ) q_01 + p_i1(τ) q_11 = λ p_i0(τ) − μ p_i1(τ)
        = −(λ + μ) p_i1(τ) + λ [p_i0(τ) + p_i1(τ)]

From the first equation and p_i0(τ) + p_i1(τ) = 1, we have

p'_i0(τ) + (λ + μ) p_i0(τ) = μ ⇒ [p'_i0(τ) + (λ + μ) p_i0(τ)] e^{(λ+μ)τ} = μ e^{(λ+μ)τ}

⇒ d/dτ [e^{(λ+μ)τ} p_i0(τ)] = μ e^{(λ+μ)τ} ⇒ p_i0(τ) = μ/(λ + μ) + C e^{−(λ+μ)τ}

With p_00(0) = 1: C = λ/(λ + μ), so p_00(τ) = μ/(λ + μ) + λ/(λ + μ) e^{−(λ+μ)τ}
With p_10(0) = 0: C = −μ/(λ + μ), so p_10(τ) = μ/(λ + μ) − μ/(λ + μ) e^{−(λ+μ)τ}

From the second equation and p_i0(τ) + p_i1(τ) = 1, we have

p'_i1(τ) + (λ + μ) p_i1(τ) = λ ⇒ p_i1(τ) = λ/(λ + μ) + C e^{−(λ+μ)τ}

With p_01(0) = 0: p_01(τ) = λ/(λ + μ) − λ/(λ + μ) e^{−(λ+μ)τ}
With p_11(0) = 1: p_11(τ) = λ/(λ + μ) + μ/(λ + μ) e^{−(λ+μ)τ}  #
1.3. Fokker-Planck Equations

Theorem (Fokker-Planck Equation) Let {ξ_t | t ≥ 0} be a finite-state Markov chain and
p_i(t) = P{ξ_t = i}, then

dp_j(t)/dt = Σ_k p_k(t) q_kj

Proof:

dp_j(t)/dt = d/dt Σ_i P{ξ_t = j; ξ_0 = i} = d/dt Σ_i p_i(0) p_ij(t) = Σ_i p_i(0) dp_ij(t)/dt

(Kolmogorov forward equation) = Σ_i p_i(0) Σ_k p_ik(t) q_kj = Σ_k [Σ_i p_i(0) p_ik(t)] q_kj = Σ_k p_k(t) q_kj  #

Remark: Again, the Fokker-Planck equations are ordinary differential equations of the first
order and can be solved as long as the transition rates q_ij as well as the initial
probabilities p_j(0) are given.


1.4. Ergodicity
Definition A Markov chain {ξ_t | t ∈ T} is said to be ergodic if for all possible states i and j,
lim_{τ→+∞} p_ij(τ) = π_j and Σ_j π_j = 1.

Remark 1: For a finite-state Markov chain, the requirement Σ_j π_j = 1 is not needed. In fact,
0 ≤ π_j ≤ 1 and from Σ_j p_ij(τ) = 1 we have

Σ_j π_j = Σ_j lim_{τ→+∞} p_ij(τ) = lim_{τ→+∞} Σ_j p_ij(τ) = 1

This means that {π_j} is a discrete distribution, which we often call the limiting probabilities
of the chain.

Remark 2: For an infinite-state Markov chain {ξ_t | t ∈ T}, Σ_j π_j = 1 is a necessary condition
for the chain to be ergodic.

Theorem For a finite-state Markov chain, if it is regular, i.e., there is a time period τ such
that for all possible states i and j, p_ij(τ) > 0, then it is ergodic.

Remark: If a finite-state Markov chain is irreducible, i.e., any two states of the chain can
communicate with each other, then it is regular and therefore ergodic.

Theorem If a finite-state Markov chain is ergodic, then

lim_{t→+∞} p_j(t) = lim_{t→+∞} Σ_i p_i p_ij(t) = Σ_i p_i lim_{t→+∞} p_ij(t) = Σ_i p_i π_j = π_j < +∞

Remark: π_j = lim_{τ→+∞} p_ij(τ) = lim_{t→+∞} p_j(t)

Theorem If a finite-state Markov chain is ergodic, its Kolmogorov forward equations
reduce to linear equations when time τ is large enough.

Hint: In fact, since

lim_{τ→+∞} p'_ij(τ) = lim_{τ→+∞} lim_{Δτ→0} [p_ij(τ + Δτ) − p_ij(τ)]/Δτ
= lim_{Δτ→0} lim_{τ→+∞} [p_ij(τ + Δτ) − p_ij(τ)]/Δτ = lim_{Δτ→0} (π_j − π_j)/Δτ = 0

we have

p'_ij(τ) = Σ_k p_ik(τ) q_kj → Σ_k π_k q_kj = 0 as τ → +∞

Theorem If a finite-state Markov chain is ergodic, its Fokker-Planck equations reduce to
linear equations when time t is large enough.

Hint:

p'_j(t) = Σ_k p_k(t) q_kj → Σ_k π_k q_kj = 0 as t → +∞

Remark: When the chain is ergodic, its Kolmogorov forward equations and Fokker-Planck
equations approximate to the same system of linear equations.

1.5. Birth and Death Processes

Definition A conservative Markov chain {ξ_t | t ∈ T} is said to be a birth and death process if
its transition rates q_ij = 0 for all |i − j| > 1.

Remark: The transition rates λ_i = q_{i,i+1} are often called birth rates and μ_i = q_{i,i−1} death
rates. It follows from Σ_j q_ij = 0 that q_ii = −(λ_i + μ_i).

Example For a birth and death process,

(1) its Kolmogorov forward and backward differential equations become

p'_ij(τ) = Σ_k p_ik(τ) q_kj = p_{i,j−1}(τ) q_{j−1,j} + p_ij(τ) q_jj + p_{i,j+1}(τ) q_{j+1,j}
        = λ_{j−1} p_{i,j−1}(τ) − (λ_j + μ_j) p_ij(τ) + μ_{j+1} p_{i,j+1}(τ)

p'_ij(τ) = Σ_k q_ik p_kj(τ) = q_{i,i−1} p_{i−1,j}(τ) + q_ii p_ij(τ) + q_{i,i+1} p_{i+1,j}(τ)
        = μ_i p_{i−1,j}(τ) − (λ_i + μ_i) p_ij(τ) + λ_i p_{i+1,j}(τ)

If the process is ergodic, from the forward equation, we have

lim_{τ→+∞} p'_ij(τ) = λ_{j−1} lim p_{i,j−1}(τ) − (λ_j + μ_j) lim p_ij(τ) + μ_{j+1} lim p_{i,j+1}(τ)

⇒ λ_{j−1} π_{j−1} − (λ_j + μ_j) π_j + μ_{j+1} π_{j+1} = 0

(2) its Fokker-Planck equations become

p'_j(t) = Σ_k p_k(t) q_kj = λ_{j−1} p_{j−1}(t) − (λ_j + μ_j) p_j(t) + μ_{j+1} p_{j+1}(t)

If the process is ergodic, we also have

lim_{t→+∞} p'_j(t) = λ_{j−1} lim p_{j−1}(t) − (λ_j + μ_j) lim p_j(t) + μ_{j+1} lim p_{j+1}(t)

⇒ λ_{j−1} π_{j−1} − (λ_j + μ_j) π_j + μ_{j+1} π_{j+1} = 0

Example If a birth and death process with states 0, 1, …, m is ergodic, it follows from the
Fokker-Planck equations that

−λ_0 π_0 + μ_1 π_1 = 0
λ_{j−1} π_{j−1} − (λ_j + μ_j) π_j + μ_{j+1} π_{j+1} = 0, j = 1, …, m − 1
λ_{m−1} π_{m−1} − μ_m π_m = 0

⇒ −λ_j π_j + μ_{j+1} π_{j+1} = −λ_{j−1} π_{j−1} + μ_j π_j = … = −λ_0 π_0 + μ_1 π_1 = 0, j = 0, 1, …, m − 1

⇒ π_{j+1} = (λ_j/μ_{j+1}) π_j = (λ_j λ_{j−1})/(μ_{j+1} μ_j) π_{j−1} = … = [Π_{i=0}^{j} λ_i/μ_{i+1}] π_0, j = 0, 1, …, m − 1

Together with Σ_j π_j = 1,

π_0 = 1 / (1 + Σ_{j=0}^{m−1} Π_{i=0}^{j} λ_i/μ_{i+1})

π_{j+1} = [Π_{i=0}^{j} λ_i/μ_{i+1}] / (1 + Σ_{j=0}^{m−1} Π_{i=0}^{j} λ_i/μ_{i+1}), j = 0, 1, …, m − 1
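The stationary distribution formula above is a one-liner with cumulative products; a minimal sketch with made-up rates for a process on states 0, …, 3:

```python
import numpy as np

lam = np.array([1.0, 2.0, 1.5])          # birth rates lambda_0 .. lambda_{m-1}
mu = np.array([2.0, 1.0, 3.0])           # death rates mu_1 .. mu_m

ratios = np.cumprod(lam / mu)            # prod_{i<=j} lambda_i / mu_{i+1}
pi0 = 1.0 / (1.0 + ratios.sum())
pi = np.concatenate(([pi0], pi0 * ratios))
print(pi, pi.sum())                      # a probability vector (sums to 1)
```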

1.6. Poisson Processes

1.6.1. Definition
Definition A random process {ξ_t | t ≥ 0} is said to be a counting process if it satisfies the
following conditions:
(1) for all t, ξ_t ≥ 0 and is integer-valued
(2) for all 0 ≤ s < t, ξ_s ≤ ξ_t

Remark: A counting process is a continuous-time and discrete-state process, which is often
used to represent the total number of events that have occurred up to time t, i.e., within the
interval [0, t].

Definition A counting process {ξ_t | t ≥ 0} is said to be a Poisson process having rate λ > 0 if
it satisfies the following conditions:
(1) ξ_0 = 0
(2) the process has independent increments
(3) for all t ≥ 0 and τ ≥ 0, P{ξ_{t+τ} − ξ_t = n} = (λτ)ⁿ/n! e^{−λτ}, n = 0, 1, 2, …

Remark: It immediately follows from condition (3) that the increments of a Poisson
process are stationary.

Theorem If {ξ_t | t ≥ 0} is a Poisson process, then

P{ξ_{t+τ} − ξ_t = 1} = λτ e^{−λτ} = λτ [1 + Σ_{k=1}^{+∞} (−λτ)ᵏ/k!] = λτ + o(τ)

P{ξ_{t+τ} − ξ_t ≥ 2} = 1 − P{ξ_{t+τ} − ξ_t = 0} − P{ξ_{t+τ} − ξ_t = 1} = 1 − e^{−λτ} − λτ e^{−λτ}
= Σ_{k=2}^{+∞} (−1)ᵏ (λτ)ᵏ/k! + o(τ) = o(τ)

Theorem A counting process {ξ_t | t ≥ 0} is a Poisson process having rate λ > 0 if and only if it
satisfies the following conditions:
(1) ξ_0 = 0
(2) the process has independent and stationary increments
(3) for all t, P{ξ_Δt = 1} = λΔt + o(Δt), P{ξ_Δt ≥ 2} = o(Δt)

Proof:


If {ξ_t | t ≥ 0} is a Poisson process, the conditions (1)-(3) are clearly satisfied. We now prove
that the conditions (1)-(3) are sufficient for {ξ_t | t ≥ 0} to be a Poisson process. For
convenience, we denote by P_n(t) = P{ξ_t = n} the probability of occurrence of n events within
the interval [0, t].
From

P_0(t + h) = P{ξ_{t+h} = 0} = P{ξ_t = 0; ξ_{t+h} − ξ_t = 0}
(independent increments) = P{ξ_t = 0} P{ξ_{t+h} − ξ_t = 0}
(stationary increments) = P_0(t) P{ξ_h − ξ_0 = 0}
(condition (3)) = P_0(t) [1 − λh + o(h)]

one can have

[P_0(t + h) − P_0(t)]/h = −λ P_0(t) + o(h)/h ⇒ P'_0(t) = lim_{h→0} [P_0(t + h) − P_0(t)]/h = −λ P_0(t)

⇒ P_0(t) = C e^{−λt}; with P_0(0) = P{ξ_0 = 0} = 1, C = 1 ⇒ P_0(t) = e^{−λt}

For n ≥ 1,

P_n(t + h) = P{ξ_{t+h} = n}
= P{ξ_t = n; ξ_{t+h} − ξ_t = 0} + P{ξ_t = n − 1; ξ_{t+h} − ξ_t = 1} + Σ_{k=2}^{n} P{ξ_t = n − k; ξ_{t+h} − ξ_t = k}

(independent increments)
= P{ξ_t = n} P{ξ_{t+h} − ξ_t = 0} + P{ξ_t = n − 1} P{ξ_{t+h} − ξ_t = 1} + Σ_{k=2}^{n} P{ξ_t = n − k} P{ξ_{t+h} − ξ_t = k}

(stationary increments)
= P_n(t) P_0(h) + P_{n−1}(t) P_1(h) + Σ_{k=2}^{n} P_{n−k}(t) P_k(h)

(condition (3)) = (1 − λh) P_n(t) + λh P_{n−1}(t) + o(h)

one can have

[P_n(t + h) − P_n(t)]/h = −λ P_n(t) + λ P_{n−1}(t) + o(h)/h ⇒ P'_n(t) = −λ P_n(t) + λ P_{n−1}(t)

⇒ e^{λt} [P'_n(t) + λ P_n(t)] = λ e^{λt} P_{n−1}(t) ⇒ d/dt [e^{λt} P_n(t)] = λ e^{λt} P_{n−1}(t)

when n = 1,

d/dt [e^{λt} P_1(t)] = λ e^{λt} P_0(t) = λ ⇒ P_1(t) = (λt + C) e^{−λt};
with P_1(0) = P{ξ_0 = 1} = 0, C = 0 ⇒ P_1(t) = λt e^{−λt}

when n = 2,

d/dt [e^{λt} P_2(t)] = λ e^{λt} P_1(t) = λ²t ⇒ P_2(t) = ((λt)²/2! + C) e^{−λt};
with P_2(0) = P{ξ_0 = 2} = 0, C = 0 ⇒ P_2(t) = (λt)²/2! e^{−λt}

In this way, one can obtain that

P_n(t) = P{ξ_t = n} = (λt)ⁿ/n! e^{−λt}  #

Remark: If the increments are not stationary, the resulting process is called a nonhomogenous
Poisson process.

1.6.2. Properties
Example (Statistical Averages) Let {ξ_t | t ≥ 0} be a Poisson process, then
(1) the mean value and variance are

E[ξ_t] = E[ξ_t − ξ_0] = λt, D[ξ_t] = D[ξ_t − ξ_0] = λt

This implies that {ξ_t | t ≥ 0} is not a weakly stationary process.
(2) the correlation function is

E[ξ_{t+τ} ξ_t] = E[(ξ_{t+τ} − ξ_t) ξ_t] + E[ξ_t²] = E[ξ_{t+τ} − ξ_t] E[ξ_t] + D[ξ_t] + (E[ξ_t])²
= λτ · λt + λt + (λt)² = λt (λτ + λt + 1)

Theorem (Markov Property) A Poisson process is a homogenous Markov chain.

Hint: A Poisson process is one having independent and stationary increments with ξ_0 = 0.

Example (Transition Probabilities and Transition Rates) Let {ξ_t | t ≥ 0} be a Poisson
process, then

p_ij(τ) = P{ξ_{t+τ} = j | ξ_t = i} = P{ξ_{t+τ} = j; ξ_t = i} / P{ξ_t = i} = P{ξ_{t+τ} − ξ_t = j − i; ξ_t = i} / P{ξ_t = i}

(independent increments) = P{ξ_{t+τ} − ξ_t = j − i} P{ξ_t = i} / P{ξ_t = i}
= (λτ)^{j−i}/(j − i)! e^{−λτ} if j ≥ i; 0 otherwise

q_ij = lim_{τ→0+} [p_ij(τ) − δ_ij]/τ =
  lim_{τ→0+} (e^{−λτ} − 1)/τ = −λ, j = i
  lim_{τ→0+} λ e^{−λτ} = λ, j = i + 1
  lim_{τ→0+} λ^{j−i} τ^{j−i−1} e^{−λτ}/(j − i)! = 0, j ≥ i + 2
  0, j < i

Thus a Poisson process is a birth and death process with birth rates λ_i = λ and death rates μ_i = 0.
1.6.3. Examples
Example (Exponential Interarrivals) Let { t t 0} be a Poisson process representing the
total number of events that have occurred within the interval [0, t ] , Wn a continuous random
variable representing the time of occurrence of the n th event, n 1 , and Tn = Wn Wn 1 the
interval time between the occurrence of the n th event and that of the ( n 1) event, n 2 ,
th

then
+

(t )k

k =n

k!

FWn (t ) = P{Wn < t} = P{Wn t} = P{ t n} =


f Wn (t ) =

dFWn (t )
dt

e t

k
n 1
d + (t ) t
t ( t )

=
e = e
(n 1)!
dt k = n k!

P{Tn = Wn Wn 1 > } = P{ t + t = 0} = e
f Tn () =

d
d
P{Tn } = [1 P{Tn > }] = e
d
d
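The exponential-interarrival characterization gives a direct way to simulate a Poisson process: cumulative sums of Exp(λ) gaps are the event times W_n. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, T = 2.0, 1000.0

gaps = rng.exponential(1 / lam, size=int(3 * lam * T))  # interarrival times T_n
arrivals = np.cumsum(gaps)                              # event times W_n
arrivals = arrivals[arrivals <= T]

print(len(arrivals) / T)        # empirical event rate, close to lam
print(gaps.mean(), 1 / lam)     # mean interarrival time, close to 1/lam
```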

Example (The M/M/n Queue) Let {η_t | t ≥ 0} be a Poisson process having rate λ representing
the number of customers arriving at an n-server service station. Each customer, upon arrival,
goes directly into service if any of the servers are free, and if not, joins the queue. When a
server finishes serving a customer, the customer leaves the station, and the next customer in
the queue, if there is anyone waiting in the queue, enters the service. The service time for a
customer is assumed to be an exponentially distributed random variable having mean 1/μ and
independent of the service times for other customers. Now let {ξ_t | t ≥ 0} be a random process
representing the number of customers in the station at time t; is it a birth and death process?
Solution:

p_ij(τ) = P{ξ_{t+τ} = j | ξ_t = i} =
  λτ + o(τ), j = i + 1
  iμτ + o(τ), j = i − 1, 1 ≤ i ≤ n
  nμτ + o(τ), j = i − 1, i > n
  o(τ), |j − i| > 1

q_ij = lim_{τ→0+} p_ij(τ)/τ =
  λ, j = i + 1
  iμ, j = i − 1, 1 ≤ i ≤ n
  nμ, j = i − 1, i > n
  0, |j − i| > 1

Thus it is a birth and death process.

Remark: M/M/n represents that the interarrival times and service times are both exponentially
distributed and there are n servers in the system.


Appendix Queuing Theory


A queue is represented as A/B/c/K/m/Z, where
A and B represent the interarrival times and service times respectively and may be
G --- the interarrival or service times are identically distributed in accordance with
the distribution G
GI --- the interarrival or service times are independent and identically distributed in
accordance with the distribution G
M --- the interarrival or service times are exponentially distributed
c represents the number of identical servers
K represents the system capacity. K = +∞ is assumed to be the default value.
m represents the number in the source, i.e., the number of customers allowed to
come. m = +∞ is assumed to be the default value.
Z represents the queue discipline and may be
FCFS/FIFO --- first come/in, first served
LIFO --- last in, first out
RSS --- random (default value)
PRI --- priority service
The first three parameters are indispensable, while the last three parameters are optional. When
the last three parameters are not present, they are assumed to take on the default values.

Queuing theory often addresses the following questions:

The average number of customers in the system

The average number of customers waiting in the queue

The average time it takes for a customer to spend in the system

The average time it takes for a customer to wait in the queue

Example (The M/M/1 Queue) Let {ξ_t | t ≥ 0} be a random process such that {ξ_t = k}
represents the event that there are k customers in the system, k = 0, 1, 2, ….

Suppose the average arrival rate of customers to the system and the average service rate are
λ and μ (> λ) respectively, then the transition rates are given by

q_ij = lim_{τ→0+} P{ξ_{t+τ} = j | ξ_t = i}/τ =
  λ, j = i + 1
  μ, j = i − 1
  0, |j − i| > 1

Thus, the process is a birth and death process. It follows from the Fokker-Planck equations that

p'_0(t) = −λ p_0(t) + μ p_1(t)
p'_j(t) = λ p_{j−1}(t) − (λ + μ) p_j(t) + μ p_{j+1}(t), j ≥ 1

Letting t → +∞,

0 = −λ p_0 + μ p_1
0 = λ p_{j−1} − (λ + μ) p_j + μ p_{j+1}

⇒ p_1 = (λ/μ) p_0, λ p_{j−1} − μ p_j = λ p_j − μ p_{j+1} ⇒ p_j = (λ/μ)^j p_0

From Σ_{j=0}^{+∞} p_j = 1: p_0 = 1 − λ/μ, p_j = (1 − λ/μ)(λ/μ)^j

The average number of customers in the system is then given by

L = Σ_{k=0}^{+∞} k p_k = Σ_{k=0}^{+∞} k (1 − λ/μ)(λ/μ)^k = (λ/μ)/(1 − λ/μ) = λ/(μ − λ)

The average number of customers in the queue is then given by

L_Q = Σ_{k=1}^{+∞} (k − 1) p_k = L − (1 − p_0) = (λ/μ)²/(1 − λ/μ) = λ²/(μ(μ − λ))
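The M/M/1 formulas for L and L_Q can be checked directly against the geometric stationary distribution; a minimal sketch with arbitrary λ < μ:

```python
lam, mu = 2.0, 5.0                    # arrival and service rates, lam < mu
rho = lam / mu

p = lambda j: (1 - rho) * rho**j      # stationary probability of j customers
L = lam / (mu - lam)                  # average number in the system
LQ = lam**2 / (mu * (mu - lam))       # average number in the queue

print(L, sum(j * p(j) for j in range(1000)))   # identical up to truncation
print(LQ, L - (1 - p(0)))
```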

2. Continuous-Time and Continuous-State Markov Processes
2.1. Basic Ideas
Theorem A continuous-time and continuous-state random process {ξ_t | t ∈ T} is a Markov
process if and only if for all t1 < t2 < … < tn < t ∈ T, its conditional density functions satisfy

f_{ξ_t | ξ_tn ξ_{tn−1} … ξ_t1}(y | x_n, x_{n−1}, …, x_1) = f_{ξ_t | ξ_tn}(y | x_n)

Remark 1: The conditional density function f_{ξ_{t+τ} | ξ_t}(y | x) is often called the transition
density function.
Remark 2: A continuous-time and continuous-state Markov process {ξ_t | t ∈ T} is
homogenous if and only if its transition density function f_{ξ_{t+τ} | ξ_t}(y | x) is independent of
the initial time t.

Remark 3:

F_{ξ_{t+τ} | ξ_t}(y | x) = P{ξ_{t+τ} < y | ξ_t = x} = ∫_{−∞}^{y} f_{ξ_{t+τ} | ξ_t}(ρ | x) dρ

Theorem (Chapman-Kolmogorov Theorem) For a continuous-time and continuous-state
Markov process, its transition density functions satisfy

f_{ξ_{t+τ+ν} | ξ_t}(y | x) = ∫_{−∞}^{+∞} f_{ξ_{t+τ} | ξ_t}(z | x) f_{ξ_{t+τ+ν} | ξ_{t+τ}}(y | z) dz

Proof:

f_{ξ_{t+τ+ν} | ξ_t}(y | x) = f_{ξ_t ξ_{t+τ+ν}}(x, y) / f_{ξ_t}(x) = ∫_{−∞}^{+∞} f_{ξ_t ξ_{t+τ} ξ_{t+τ+ν}}(x, z, y) dz / f_{ξ_t}(x)
= ∫_{−∞}^{+∞} f_{ξ_{t+τ+ν} | ξ_t ξ_{t+τ}}(y | x, z) f_{ξ_t ξ_{t+τ}}(x, z) dz / f_{ξ_t}(x)
= ∫_{−∞}^{+∞} f_{ξ_{t+τ+ν} | ξ_{t+τ}}(y | z) f_{ξ_{t+τ} | ξ_t}(z | x) dz  #

2.2. Wiener Processes

Definition A continuous-time and continuous-state random process {ξ_t | t ≥ 0} is said to be a
Wiener process or Brownian motion process if it satisfies the following conditions:
(1) ξ_0 = 0
(2) the process has independent increments
(3) for all t ≥ 0 and τ > 0, the increment ξ_{t+τ} − ξ_t possesses the normal distribution
N(0, σ²τ), where σ > 0
Remark 1: If σ = 1, the process is called a standard Wiener process.
Remark 2: Condition (3) implies that a Wiener process is a process with stationary
increments.

Theorem Wiener processes are homogenous Markov processes.

Hint: The increments of a Wiener process are both independent and stationary.

Theorem Wiener processes {ξ_t | t ≥ 0} are normal processes.

Proof:
For all 0 ≤ t1 < t2 < … < tn and all numbers λ1, λ2, …, λn,

Σ_{i=1}^{n} λ_i ξ_{t_i} = λ_n (ξ_{t_n} − ξ_{t_{n−1}}) + (λ_n + λ_{n−1}) ξ_{t_{n−1}} + Σ_{i=1}^{n−2} λ_i ξ_{t_i} = …
= a linear combination of the independent increments ξ_{t_n} − ξ_{t_{n−1}}, …, ξ_{t_2} − ξ_{t_1}, ξ_{t_1} − ξ_0

Since the increments are independent normal variables, so is the random variable Σ_{i=1}^{n} λ_i ξ_{t_i},
which implies that the joint distribution of the random variables ξ_t1, ξ_t2, …, ξ_tn is normal. #

Example (Statistical Averages)

E[ξ_t] = E[ξ_t − ξ_0] = 0,

D[ξ_t] = D[ξ_t − ξ_0] = E[ξ_t²] = σ²t

E[ξ_{t+τ} ξ_t] = E[(ξ_{t+τ} − ξ_t) ξ_t] + E[ξ_t²] = E[ξ_t²] = σ²t = σ² min{t + τ, t}

Remark: Wiener processes {ξ_t | t ≥ 0} are not weakly stationary.
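These statistical averages can be checked by simulating many Wiener paths from independent N(0, σ²Δt) increments; a minimal sketch with arbitrary σ:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, T, n_steps, n_paths = 1.5, 1.0, 1000, 20000
dt = T / n_steps

increments = rng.normal(0.0, sigma * np.sqrt(dt), size=(n_paths, n_steps))
xi = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(increments, axis=1)], axis=1)

t, tau = 0.4, 0.3
i, j = int(t / dt), int((t + tau) / dt)
print(xi[:, i].var(), sigma**2 * t)                 # D[xi_t] = sigma^2 t
print((xi[:, j] * xi[:, i]).mean(), sigma**2 * t)   # E[xi_{t+tau} xi_t] = sigma^2 t
```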

Example Let {ξ_t | t ≥ 0} be a Wiener process; what is its transition density function
f_{ξ_{t+τ} | ξ_t}(y | x) = ?

Solution:

F_{ξ_{t+τ} ξ_t}(y, x) = P{ξ_{t+τ} < y; ξ_t < x} = P{(ξ_{t+τ} − ξ_t) + ξ_t < y; ξ_t < x}

(with U = ξ_{t+τ} − ξ_t, V = ξ_t) = P{U + V < y; V < x} = ∫∫_{u+v<y, v<x} f_UV(u, v) du dv

(with s = u + v, t = v) = ∫_{−∞}^{y} ∫_{−∞}^{x} f_UV(s − t, t) dt ds

Recall that U = ξ_{t+τ} − ξ_t ~ N(0, σ²τ), V = ξ_t ~ N(0, σ²t), and U and V are independent; we
have

f_{ξ_{t+τ} ξ_t}(y, x) = ∂²F_{ξ_{t+τ} ξ_t}(y, x)/∂y∂x = f_UV(y − x, x) = f_U(y − x) f_V(x)
= [1/√(2πσ²τ)] e^{−(y−x)²/(2σ²τ)} · [1/√(2πσ²t)] e^{−x²/(2σ²t)}

f_{ξ_{t+τ} | ξ_t}(y | x) = f_{ξ_{t+τ} ξ_t}(y, x) / f_{ξ_t}(x) = [1/√(2πσ²τ)] e^{−(y−x)²/(2σ²τ)} = N(x, σ²τ)  #

Remark: The problem can be solved in another way. Recall that

(X, Y) ~ N(μ1, μ2, σ1², σ2², ρ) ⇒ f_{Y|X}(y | x) = N(ρ (σ2/σ1)(x − μ1) + μ2, σ2²(1 − ρ²))

Since

ξ_{t+τ} = ξ_{t+τ} − ξ_0 ~ N(0, σ²(t + τ)), ξ_t = ξ_t − ξ_0 ~ N(0, σ²t)

ρ = E[ξ_{t+τ} ξ_t] / √(D[ξ_{t+τ}] D[ξ_t]) = σ²t / √(σ²(t + τ) σ²t) = √(t/(t + τ))

then the joint distribution of (ξ_{t+τ}, ξ_t) is

N(0, 0, σ²(t + τ), σ²t, √(t/(t + τ)))

which leads to the conditional distribution

f_{ξ_{t+τ} | ξ_t}(y | x) = N(ρ √(D[ξ_{t+τ}]/D[ξ_t]) x, D[ξ_{t+τ}](1 − ρ²)) = N(x, σ²τ)

Problems 3-1 (18)

Hidden Markov Models


1. Definition of Hidden Markov Models

A Hidden Markov Model (HMM) consists of two random processes: one is a homogenous
Markov process {Q_t | t = 1, 2, …} and the other is the observation process {O_t | t = 1, 2, …}.

There are three sets of parameters λ = {π, A, B} featuring the HMM:

(1) The initial probability:

π = {π_i | π_i = P{Q_1 = i}, i = 1, …, N}

(2) The transition probability:

A = [a_11 … a_1N; …; a_N1 … a_NN], where a_ij = P{Q_{t+1} = j | Q_t = i}, 1 ≤ i, j ≤ N

(3) The conditioned/state-based observation probability:
If O_t is a discrete random variable, then

B = {b_i(j) | b_i(j) = P{O_t = j | Q_t = i}, i = 1, …, N, j = 1, …, M}

If O_t is a continuous random variable, then

B = {b_i(o) | b_i(o) = p(o | Q_t = i), i = 1, …, N}


2. Assumptions in the Theory of HMMs

For the sake of mathematical and computational tractability, the following assumptions are
made in the theory of HMMs.

Assumption 1: The t-th state, given the (t−1)-th state, is independent of the previous states:

P{Q_t = q_t | O_{t−1} = o_{t−1}, …, O_1 = o_1; Q_{t−1} = q_{t−1}, …, Q_1 = q_1} = P{Q_t = q_t | Q_{t−1} = q_{t−1}}

Assumption 2: The t-th output, given the t-th state, is independent of other outputs and states:

P{O_t = o_t | O_T = o_T, …, O_1 = o_1; Q_T = q_T, …, Q_1 = q_1} = P{O_t = o_t | Q_t = q_t}

Example

p(o_T, …, o_1 | q_T, …, q_1) = p(o_T, …, o_1; q_T, …, q_1) / p(q_T, …, q_1)
= p(o_T | o_{T−1}, …, o_1; q_T, …, q_1) p(o_{T−1}, …, o_1; q_T, …, q_1) / p(q_T, …, q_1)
= p(o_T | q_T) p(o_{T−1} | o_{T−2}, …, o_1; q_T, …, q_1) p(o_{T−2}, …, o_1; q_T, …, q_1) / p(q_T, …, q_1)
= p(o_T | q_T) p(o_{T−1} | q_{T−1}) p(o_{T−2}, …, o_1; q_T, …, q_1) / p(q_T, …, q_1)
= … = Π_{t=1}^{T} p(o_t | q_t) = Π_{t=1}^{T} b_{q_t}(o_t)

p(q_T, …, q_1) = p(q_T | q_{T−1}, …, q_1) p(q_{T−1}, …, q_1) = p(q_T | q_{T−1}) p(q_{T−1}, …, q_1)
= … = p(q_1) Π_{t=2}^{T} p(q_t | q_{t−1}) = π_{q_1} Π_{t=2}^{T} a_{q_{t−1} q_t}


3. Three Basic Problems of HMMs

Once we have an HMM, there are three problems of interest.

3.1. The Evaluation Problem

Given an HMM λ and an observation sequence o_T, …, o_1, what is the probability that the
observations are generated, p(o_T, …, o_1 | λ) = ? We can calculate the probability by using
simple probabilistic arguments:

p(o_T, …, o_1) = Σ_{q_T, …, q_1} p(o_T, …, o_1 | q_T, …, q_1) p(q_T, …, q_1)
= Σ_{q_T, …, q_1} [Π_{t=1}^{T} b_{q_t}(o_t)] π_{q_1} Π_{t=2}^{T} a_{q_{t−1} q_t}

But this calculation involves a number of operations of the order of N^T. This is very large
even if the length of the sequence, T, is moderate. Therefore we have to look for other
methods for this calculation.

3.2. The Decoding Problem

Given an HMM λ and an observation sequence o_T, …, o_1, what is the most likely state
sequence q*_T, …, q*_1 that produced the observations? I.e.,

(q*_T, …, q*_1) = arg max_{q_T, …, q_1} p(q_T, …, q_1 | o_T, …, o_1)

Note that

p(q_T, …, q_1 | o_T, …, o_1) = p(q_T, …, q_1; o_T, …, o_1) / p(o_T, …, o_1)

we have

arg max_{q_T, …, q_1} p(q_T, …, q_1 | o_T, …, o_1) = arg max_{q_T, …, q_1} p(q_T, …, q_1; o_T, …, o_1)

The problem arg max_{q_T, …, q_1} p(q_T, …, q_1; o_T, …, o_1) can be solved by the Viterbi algorithm.

3.3. The Learning Problem

Given an HMM and an observation sequence o_T, …, o_1, how should we adjust the model
parameters λ = (π, A, B) so as to maximize P{O_T = o_T, …, O_1 = o_1 | λ}?

λ* = arg max_λ p(o_T, …, o_1 | λ)


4. The Forward/Backward Algorithm and its Application to the Evaluation Problem

Given an HMM λ = {π, A, B} and an observation sequence o_T, …, o_1, what is the probability
p(o_T, …, o_1) = ?

We first define the so-called forward variable as follows:

α_t(q_t) = p(o_t, …, o_1, q_t)

It is easy to see that the following recursive relationship holds:

α_1(q_1) = p(o_1, q_1) = p(o_1 | q_1) p(q_1) = π_{q_1} b_{q_1}(o_1)

α_{t+1}(q_{t+1}) = p(o_{t+1}, …, o_1, q_{t+1}) = p(o_{t+1} | o_t, …, o_1, q_{t+1}) p(o_t, …, o_1, q_{t+1})
= p(o_{t+1} | q_{t+1}) Σ_{q_t} p(o_t, …, o_1, q_{t+1}, q_t)
= b_{q_{t+1}}(o_{t+1}) Σ_{q_t} p(q_{t+1} | o_t, …, o_1, q_t) p(o_t, …, o_1, q_t)
= b_{q_{t+1}}(o_{t+1}) Σ_{q_t} p(q_{t+1} | q_t) α_t(q_t) = b_{q_{t+1}}(o_{t+1}) Σ_{q_t} a_{q_t q_{t+1}} α_t(q_t)

p(o_T, …, o_1) = Σ_{q_T} p(o_T, …, o_1, q_T) = Σ_{q_T} α_T(q_T)

The complexity of this method, known as the forward algorithm, is proportional to N²T,
which is linear in T, whereas the direct calculation mentioned earlier had an exponential
complexity.

In a similar way we can define the backward variable β_t(q_t) as follows:

β_t(q_t) = p(o_T, …, o_{t+1} | q_t)

As in the case of α_t(q_t), there is a recursive relationship which can be used to calculate
β_t(q_t) efficiently:

β_T(q_T) = 1

β_t(q_t) = p(o_T, …, o_{t+1} | q_t) = Σ_{q_{t+1}} p(o_T, …, o_{t+1}, q_{t+1} | q_t)
= Σ_{q_{t+1}} p(o_T, …, o_{t+2} | o_{t+1}, q_{t+1}, q_t) p(o_{t+1}, q_{t+1} | q_t)
= Σ_{q_{t+1}} p(o_T, …, o_{t+2} | q_{t+1}) p(o_{t+1} | q_{t+1}) p(q_{t+1} | q_t)
= Σ_{q_{t+1}} β_{t+1}(q_{t+1}) b_{q_{t+1}}(o_{t+1}) a_{q_t q_{t+1}}

p(o_T, …, o_1) = Σ_{q_1} p(o_T, …, o_1, q_1) = Σ_{q_1} p(o_T, …, o_2 | o_1, q_1) p(o_1, q_1)
= Σ_{q_1} p(o_T, …, o_2 | q_1) p(o_1 | q_1) p(q_1) = Σ_{q_1} β_1(q_1) b_{q_1}(o_1) π_{q_1}

Further we can see that

p(o_T, …, o_{t+1}, o_t, …, o_1, q_t) = p(o_T, …, o_{t+1} | o_t, …, o_1, q_t) p(o_t, …, o_1, q_t)
= p(o_T, …, o_{t+1} | q_t) p(o_t, …, o_1, q_t) = β_t(q_t) α_t(q_t)

Therefore this gives another way to calculate p(o_T, …, o_1), by using both forward and
backward variables, as given in the following equation:

p(o_T, …, o_1) = Σ_{q_t} p(o_T, …, o_1, q_t) = Σ_{q_t} α_t(q_t) β_t(q_t)

This equation is very useful, especially in deriving the formulas required for gradient-based
training.

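The forward recursion is only a few lines of linear algebra. A minimal sketch with a made-up 2-state, 3-symbol model (the parameters are for illustration only):

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],        # b_i(j) = P{O_t = j | Q_t = i}
              [0.1, 0.3, 0.6]])
obs = [0, 2, 1]                       # observed symbols o_1, o_2, o_3

alpha = pi * B[:, obs[0]]             # alpha_1(q) = pi_q b_q(o_1)
for o in obs[1:]:
    alpha = B[:, o] * (alpha @ A)     # alpha_{t+1} = b(o_{t+1}) * sum_q alpha_t a
print(alpha.sum())                    # p(o_1, ..., o_T)
```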

5. Viterbi Algorithm and its Application to the Decoding Problem

In this case we want to find a state sequence q*_T, …, q*_1 for a given sequence of observations
o_T, …, o_1 such that

(q*_T, …, q*_1) = arg max_{q_T, …, q_1} p(q_T, …, q_1 | o_T, …, o_1)

or equivalently

(q*_T, …, q*_1) = arg max_{q_T, …, q_1} p(o_T, …, o_1; q_T, …, q_1)

A natural way to solve this problem is to calculate over all possible state sequences to find
the most likely state sequence. But sometimes this method does not give a physically
meaningful state sequence. Therefore we go for another method which has no such problems.
In this method, commonly known as the Viterbi algorithm, the whole state sequence with the
maximum likelihood is found. In order to facilitate the computation we define an auxiliary
variable,

δ_t(q_t) = max_{q_{t−1}, …, q_1} p(o_t, …, o_1; q_t, q_{t−1}, …, q_1)

then we have

δ_{t+1}(q_{t+1}) = max_{q_t, …, q_1} p(o_{t+1}, …, o_1; q_{t+1}, q_t, …, q_1)
= max_{q_t, …, q_1} p(o_{t+1} | o_t, …, o_1; q_{t+1}, q_t, …, q_1) p(o_t, …, o_1; q_{t+1}, q_t, …, q_1)
= max_{q_t, …, q_1} p(o_{t+1} | q_{t+1}) p(o_t, …, o_1; q_{t+1}, q_t, …, q_1)
= b_{q_{t+1}}(o_{t+1}) max_{q_t, …, q_1} p(q_{t+1} | o_t, …, o_1; q_t, …, q_1) p(o_t, …, o_1; q_t, …, q_1)
= b_{q_{t+1}}(o_{t+1}) max_{q_t, …, q_1} p(q_{t+1} | q_t) p(o_t, …, o_1; q_t, …, q_1)
= b_{q_{t+1}}(o_{t+1}) max_{q_t} a_{q_t q_{t+1}} max_{q_{t−1}, …, q_1} p(o_t, …, o_1; q_t, …, q_1)
= b_{q_{t+1}}(o_{t+1}) max_{q_t} a_{q_t q_{t+1}} δ_t(q_t)

which gives the highest probability that the partial observation sequence and state sequence up
to the moment t can have, when the current state is q_{t+1}. Note that

δ_2(q_2) = max_{q_1} p(o_2, o_1; q_2, q_1) = max_{q_1} p(o_2 | q_2) p(o_1; q_2, q_1)
= b_{q_2}(o_2) max_{q_1} p(q_2 | o_1; q_1) p(o_1; q_1) = b_{q_2}(o_2) max_{q_1} a_{q_1 q_2} b_{q_1}(o_1) π_{q_1}

So the procedure to find the most likely state sequence starts from the following calculation:

max_{q_T, …, q_1} p(o_T, …, o_1; q_T, …, q_1) = max_{q_T} max_{q_{T−1}, …, q_1} p(o_T, …, o_1; q_T, …, q_1) = max_{q_T} δ_T(q_T)
= max_{q_T} b_{q_T}(o_T) max_{q_{T−1}} a_{q_{T−1} q_T} δ_{T−1}(q_{T−1}) = …

This whole algorithm can be interpreted as a search in a graph whose nodes are formed by the
states of the HMM at each time instant t.

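A minimal Viterbi sketch, reusing the toy (π, A, B) convention from the forward-algorithm example; backpointers recover the most likely state path.

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 2, 1]

delta = pi * B[:, obs[0]]                  # delta_1(q) = pi_q b_q(o_1)
back = []
for o in obs[1:]:
    scores = delta[:, None] * A            # scores[q, q'] = delta_t(q) a_{q q'}
    back.append(scores.argmax(axis=0))     # best predecessor of each state q'
    delta = B[:, o] * scores.max(axis=0)   # delta_{t+1}(q')

path = [int(delta.argmax())]               # best final state
for bp in reversed(back):                  # backtrack through the pointers
    path.append(int(bp[path[-1]]))
print(list(reversed(path)), delta.max())   # q*_1..q*_T and its likelihood
```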

6. Baum-Welch Algorithm and its Application to the Learning Problem
Generally, the learning problem is how to adjust the HMM parameters so that the given set of
observations (called the training set) is represented by the model in the best way for the
intended application. Thus it would be clear that the quantity we wish to optimize during
the learning process can be different from application to application. In other words there may
be several optimization criteria for learning, out of which a suitable one is selected depending
on the application.
There are two main optimization criteria for the learning problem: Maximum Likelihood
(ML) and Maximum Mutual Information (MMI). The solutions to the learning problem under
each of those criteria are described below.

6.1. Maximum Likelihood (ML) Criterion

In ML we try to maximize the probability of a given sequence of observations o_T, …, o_1,
given an HMM λ = (π, A, B). This probability is the total likelihood of the observations and
can be expressed mathematically as

L(λ) = p(o_T, …, o_1 | λ)

Then the ML criterion can be given as

λ* = arg max_λ L(λ)

However, there is no known way to analytically solve for the model λ = (π, A, B) which
maximizes the quantity L(λ). But we can choose model parameters such that L(λ) is locally
maximized, using an iterative procedure, like the Baum-Welch method or a gradient-based
method, which are described below.
6.2. Baum-Welch Algorithm


To describe the Baum-Welch algorithm (also known as the Forward-Backward algorithm), we
need to define two more auxiliary variables, in addition to the forward and backward variables
defined in the previous section. These variables can however be expressed in terms of the
forward and backward variables.

The first of those variables is defined as the probability of being in state q_t at t and in state
q_{t+1} at t + 1. Formally,

ξ_t(q_{t+1}, q_t) = P{Q_{t+1} = q_{t+1}, Q_t = q_t | o_T, …, o_1}

ξ_t(q_{t+1}, q_t) can be derived from the forward and backward variables:

ξ_t(q_{t+1}, q_t) = p(q_{t+1}, q_t | o_T, …, o_1) = p(o_T, …, o_1, q_{t+1}, q_t) / p(o_T, …, o_1)
= p(o_T, …, o_{t+1}, q_{t+1} | o_t, …, o_1, q_t) p(o_t, …, o_1, q_t) / p(o_T, …, o_1)
= p(o_T, …, o_{t+1}, q_{t+1} | q_t) α_t(q_t) / p(o_T, …, o_1)
= p(o_T, …, o_{t+2} | o_{t+1}, q_{t+1}, q_t) p(o_{t+1}, q_{t+1} | q_t) α_t(q_t) / p(o_T, …, o_1)
= p(o_T, …, o_{t+2} | q_{t+1}) p(o_{t+1} | q_{t+1}) p(q_{t+1} | q_t) α_t(q_t) / p(o_T, …, o_1)
= β_{t+1}(q_{t+1}) b_{q_{t+1}}(o_{t+1}) a_{q_t q_{t+1}} α_t(q_t) / p(o_T, …, o_1)

The second variable is the a posteriori probability

γ_t(q_t) = P{Q_t = q_t | o_T, …, o_1}

that is, the probability of being in state q_t at t, given the observation sequence and the model.
γ_t(q_t) can also be derived from the forward and backward variables:

γ_t(q_t) = p(o_T, …, o_1, q_t) / p(o_T, …, o_1)
= p(o_T, …, o_{t+1} | o_t, …, o_1, q_t) p(o_t, …, o_1, q_t) / p(o_T, …, o_1)
= p(o_T, …, o_{t+1} | q_t) p(o_t, …, o_1, q_t) / p(o_T, …, o_1) = α_t(q_t) β_t(q_t) / p(o_T, …, o_1)

One can see that the relationship between γ_t(q_t) and ξ_t(q_{t+1}, q_t) is given by

γ_t(q_t) = p(o_T, …, o_1, q_t) / p(o_T, …, o_1) = Σ_{q_{t+1}} p(o_T, …, o_1, q_{t+1}, q_t) / p(o_T, …, o_1)
= Σ_{q_{t+1}} ξ_t(q_{t+1}, q_t)

Now it is possible to describe the Baum-Welch learning process, where the parameters of the
HMM are updated in such a way as to maximize the quantity p(o_1, o_2, …, o_T). Assuming a
starting model λ = (π, A, B), we first calculate the forward and backward variables α and β
using the recursions, and then ξ and γ. The next step is to update the HMM parameters
according to the following equations, known as re-estimation formulas.
π̂_q = γ_1(q), â_{qq'} = Σ_{t=1}^{T−1} ξ_t(q', q) / Σ_{t=1}^{T−1} γ_t(q), b̂_q(o) = Σ_{1≤t≤T, o_t=o} γ_t(q) / Σ_{t=1}^{T} γ_t(q)

For a training set of R observation sequences o^(1), …, o^(R), with

γ_t^(r)(q) = P{Q_t = q | o_T^(r), …, o_1^(r)}, ξ_t^(r)(q', q) = P{Q_{t+1} = q', Q_t = q | o_T^(r), …, o_1^(r)}

the re-estimation formulas become

π̂_q = (1/R) Σ_{r=1}^{R} γ_1^(r)(q), â_{qq'} = Σ_{r=1}^{R} Σ_{t=1}^{T−1} ξ_t^(r)(q', q) / Σ_{r=1}^{R} Σ_{t=1}^{T−1} γ_t^(r)(q)

b̂_q(k) = Σ_{r=1}^{R} Σ_{1≤t≤T, o_t^(r)=k} γ_t^(r)(q) / Σ_{r=1}^{R} Σ_{t=1}^{T} γ_t^(r)(q)


Second-Order Processes and Random Analysis


1. Second-Order Random Variables and Hilbert Spaces

Theorem Let H be the collection of all second-order random variables defined on a
probability space (Ω, F, P), then
(1) H is a linear space
(2) for all ξ, η ∈ H, let ⟨ξ, η⟩ = E[ξη̄]; then (H, ⟨·,·⟩) is a Hilbert space.

Hint:

E[|C1ξ + C2η|²] ≤ |C1|² E[|ξ|²] + |C2|² E[|η|²] + 2|C1||C2| E[|ξη|]
(Cauchy-Schwarz inequality) ≤ |C1|² E[|ξ|²] + |C2|² E[|η|²] + 2|C1||C2| √(E[|ξ|²] E[|η|²]) < +∞

⇒ H is a linear space

P{ξ = 0} = 1 ⇔ E[|ξ|²] = 0, ⟨ξ, η⟩ = E[ξη̄] = conj(E[ηξ̄]) = conj(⟨η, ξ⟩)

⇒ H is an inner product space

In measure theory, one can prove that a Cauchy sequence in H is convergent

⇒ H is a complete inner product space, i.e., a Hilbert space

Remark 1: ‖ξ‖ = √⟨ξ, ξ⟩ = √(E[|ξ|²]) is then a norm.

Remark 2: Since

lim_{n→+∞} ξ_n = ξ_0 is defined as lim_{n→+∞} ‖ξ_n − ξ_0‖ = lim_{n→+∞} √⟨ξ_n − ξ_0, ξ_n − ξ_0⟩
= lim_{n→+∞} √(E[|ξ_n − ξ_0|²]) = 0

the convergence in H is often called mean square convergence.

2. Second-Order Random Processes

Definition A random process {ξ_t | t ∈ T} is called a second-order random process if for all
t ∈ T, ξ_t is a second-order random variable, i.e., E[|ξ_t|²] < +∞.

Theorem Let {ξ_t | t ∈ T} be a second-order random process and Γ(t1, t2) = ⟨ξ_t1, ξ_t2⟩; then
for all t1, t2, …, tn ∈ T, the matrix

Γ = ( Γ(t1, t1)  Γ(t1, t2)  …  Γ(t1, tn) ;
      Γ(t2, t1)  Γ(t2, t2)  …  Γ(t2, tn) ;
      …
      Γ(tn, t1)  Γ(tn, t2)  …  Γ(tn, tn) )

is nonnegative definite.
Proof:
For all numbers λ1, λ2, …, λn,

Σ_{i=1}^{n} Σ_{j=1}^{n} λ_i λ̄_j Γ(t_i, t_j) = Σ_i Σ_j λ_i λ̄_j ⟨ξ_{t_i}, ξ_{t_j}⟩
= E[(Σ_i λ_i ξ_{t_i})(conj(Σ_j λ_j ξ_{t_j}))] = E[|Σ_i λ_i ξ_{t_i}|²] ≥ 0  #

2.1. Orthogonal Increment Random Processes

Definition A second-order random process {ξ_t | t ∈ T} is called an orthogonal increment
random process if for all t1 < t2 ≤ t3 < t4 ∈ T, ⟨ξ_t2 − ξ_t1, ξ_t4 − ξ_t3⟩ = 0.

Example Let {ξ_t | t ∈ T} be an orthogonal increment random process with T = [a, +∞) and
ξ_a = 0, then
(1) For all a ≤ t1 ≤ t2, we have
⟨ξ_t1, ξ_t2 − ξ_t1⟩ = ⟨ξ_t1 − ξ_a, ξ_t2 − ξ_t1⟩ = 0
(2) For all t1 ≤ t2 ∈ T, we have
⟨ξ_t1, ξ_t2⟩ = ⟨ξ_t1, ξ_t2 − ξ_t1 + ξ_t1⟩ = ⟨ξ_t1, ξ_t2 − ξ_t1⟩ + ⟨ξ_t1, ξ_t1⟩ = ⟨ξ_t1, ξ_t1⟩ = ‖ξ_t1‖²
(3) For all t1 ≤ t2 ∈ T, we have

‖ξ_t2 − ξ_t1‖² = ⟨ξ_t2 − ξ_t1, ξ_t2 − ξ_t1⟩ = ⟨ξ_t2, ξ_t2⟩ − ⟨ξ_t1, ξ_t2⟩ − ⟨ξ_t2, ξ_t1⟩ + ⟨ξ_t1, ξ_t1⟩
= ‖ξ_t2‖² − ‖ξ_t1‖²

3. Random Analysis
3.1. Limits
Definition Let {ξ_t | t ∈ (a, b)} be a second-order random process and ξ a second-order random
variable; lim_{t→t0} ξ_t = ξ is then defined as lim_{t→t0} ‖ξ_t − ξ‖ = 0, where t0 ∈ (a, b).

Theorem lim_{t→t0} ξ_t exists ⇔ the limit lim_{t→t0, s→t0} ⟨ξ_t, ξ_s⟩ exists.

3.2. Continuity
Definition A second-order random process { t t T} is said to be continuous at the point

t 0 T if given any > 0 , there will be > 0 such that for all t T with t t 0 < ,
t t 0 < .

Remark 1: If t 0 (a , b ) = T , t is said to be continuous at t 0 if lim t t 0 = 0 .


tt0

Remark 2: lim t t 0 = 0 is often denoted by lim t = t 0 .


tt 0

tt0

[ ]

Theorem If lim t = t 0 , then lim E[ t ] = E t 0 .


tt 0

tt 0

Proof:

[ ]

E[ t ] E t 0 = E t t 0 E t t 0 E t t 0

Theorem If lim t = t 0 , lim s = s0 , then


tt 0

s s 0

lim

t t 0 ,s s 0

t , s t 0 , s0
= t t 0 , s s 0 + t t 0 , s0 + t 0 , s s 0
t t 0 , s s0 + t t 0 , s0 + t 0 , s s0
-176-

= t t 0 t
0
t 0

t , s = t 0 , s0 .

Proof:

A.BENHARI

t t 0 s s0 + t t 0 s 0 + t 0 s s0 t
0
t 0 ,s s 0

3.3. Derivatives
Definition The second-order random variable ξ is said to be the derivative of a second-order
random process {ξ_t | t ∈ T} at the point t0 ∈ T if given any ε > 0, there is δ > 0 such
that for all t ∈ T with |t − t0| < δ,

‖(ξ_t − ξ_t0)/(t − t0) − ξ‖ < ε.

Remark: If t0 ∈ (a, b) = T, ξ is said to be the derivative of ξ_t at the point t0 if

lim_{t→t0} (ξ_t − ξ_t0)/(t − t0) = ξ, i.e., lim_{t→t0} ‖(ξ_t − ξ_t0)/(t − t0) − ξ‖ = 0. The derivative is
often denoted by ξ'(t0).

Theorem Let {ξ_t | a < t < b} be a second-order random process, R(t, s) the correlation
function of ξ_t and t0 ∈ (a, b); ξ_t has a derivative at the point t0 if R(t, s) is second-order
differentiable at the point (t0, t0), i.e., ∂²R(t, s)/∂t∂s not only exists, but is also continuous at
the point (t0, t0).

Proof:
Recall that

lim_{t→t0} (ξ_t − ξ_t0)/(t − t0) exists ⇔ the limit lim_{t→t0, s→t0} ⟨(ξ_t − ξ_t0)/(t − t0), (ξ_s − ξ_t0)/(s − t0)⟩ exists.

From the continuity of ∂²R(t, s)/∂t∂s, it follows that

lim_{t→t0, s→t0} ⟨(ξ_t − ξ_t0)/(t − t0), (ξ_s − ξ_t0)/(s − t0)⟩
= lim_{t→t0, s→t0} {[R(t, s) − R(t0, s)] − [R(t, t0) − R(t0, t0)]} / [(t − t0)(s − t0)]
= lim_{t→t0, s→t0} [∂R(t0 + θ(t − t0), s)/∂t − ∂R(t0 + θ(t − t0), t0)/∂t] / (s − t0), 0 < θ < 1
= lim_{t→t0, s→t0} ∂²R(t0 + θ(t − t0), t0 + θ'(s − t0))/∂t∂s, 0 < θ' < 1
= ∂²R(t0, t0)/∂t∂s

This shows that the limit lim_{t→t0, s→t0} ⟨(ξ_t − ξ_t0)/(t − t0), (ξ_s − ξ_t0)/(s − t0)⟩ exists. #

Remark: Let η_t = ξ'_t, then

R_η(t, s) = E[η_t η_s] = E[lim_{h→0} (ξ_{t+h} − ξ_t)/h · lim_{k→0} (ξ_{s+k} − ξ_s)/k]
= lim_{h→0, k→0} E[(ξ_{t+h} − ξ_t)/h · (ξ_{s+k} − ξ_s)/k] = ∂²R(t, s)/∂t∂s

3.4. Integrals
Definition Let {ξ_t | a ≤ t ≤ b} be a second-order random process and

a = t0 < t1 < … < tn = b, t_{i−1} ≤ τ_i ≤ t_i, Δt_i = t_i − t_{i−1}, i = 1, 2, …, n;

a random variable η is said to be the integral of ξ_t over [a, b] if

lim_{max Δt_i → 0} ‖Σ_{i=1}^{n} ξ_{τ_i} Δt_i − η‖ = 0

The integral is often denoted by η = ∫_a^b ξ_t dt.

Stationary Processes


1. Strictly Stationary Processes

Definition A random process {ξ_t | t ∈ T} is called a strictly stationary process if for all
t1, t2, …, tn ∈ T and all τ such that t1 + τ, t2 + τ, …, tn + τ ∈ T,

P{ξ_{t1+τ} < x1; ξ_{t2+τ} < x2; …; ξ_{tn+τ} < xn} = P{ξ_t1 < x1; ξ_t2 < x2; …; ξ_tn < xn}

or, expressed in the form of distribution functions,

F(x1, t1 + τ; x2, t2 + τ; …; xn, tn + τ) = F(x1, t1; x2, t2; …; xn, tn)

Example Let {ξ_t | t ∈ T} be a strictly stationary process with finite second-order moments, then
(1) for all t ∈ T, since F(x; t) = F(x; 0), we have

E[ξ_t] = ∫ x dF(x; t) = ∫ x dF(x; 0) = m = Const.

E[(ξ_t − m)²] = ∫ (x − m)² dF(x; t) = ∫ (x − m)² dF(x; 0) = σ² = Const.

(2) for all t1, t2 ∈ T, since F(x, t1; y, t2) = F(x, 0; y, t2 − t1), we have

E[ξ_t2 ξ_t1] = ∫∫ xy dF(x, t1; y, t2) = ∫∫ xy dF(x, 0; y, t2 − t1) = R(t2 − t1)

2. Weakly Stationary Processes

2.1. Definition
Definition A second-order process {ξ(t), t ∈ T} is called a weakly stationary process if
(1) for all t ∈ T, E[ξ(t)] = m = Const.
(2) for all t1, t2 ∈ T, E[ξ(t2)ξ̄(t1)] = R(t2 − t1)

Remark: A strictly stationary process with finite second-order moment must also be weakly
stationary.

Definition Two weakly stationary processes {ξ(t), t ∈ T} and {η(t), t ∈ T} are said to be jointly
stationary if for all t1, t2 ∈ T, E[ξ(t2)η̄(t1)] = Rξη(t2 − t1).

2.2. Properties of Correlation/Covariance Functions

Theorem Let {ξ(t), t ∈ T} be a weakly stationary process and R(τ) = E[ξ(t+τ)ξ̄(t)], then
(1) R(0) = E[|ξ(t)|²] ≥ 0
(2) (Conjugate Symmetry) R(−τ) = E[ξ(t−τ)ξ̄(t)] = conj(E[ξ(t)ξ̄(t−τ)]) = R̄(τ)
(3) |R(τ)| = |E[ξ(t+τ)ξ̄(t)]| ≤ E[|ξ(t+τ)ξ̄(t)|] ≤ (Schwartz inequality) √(E[|ξ(t+τ)|²]·E[|ξ(t)|²]) = R(0)
(4) (Nonnegative Definite) for all numbers λ1, λ2, …, λn and all t1, …, tn ∈ T,
Σ_{i=1}^{n} Σ_{j=1}^{n} λi λ̄j R(ti − tj) = Σi Σj λi λ̄j E[ξ(ti)ξ̄(tj)] = E[ (Σi λi ξ(ti))·conj(Σj λj ξ(tj)) ] = E[ |Σ_{i=1}^{n} λi ξ(ti)|² ] ≥ 0,
i.e., the n×n matrix with entries R(ti − tj) is nonnegative definite.

Remark: Cauchy-Schwarz inequality: |E[ξη̄]|² ≤ E[|ξ|²]·E[|η|²]

Theorem Let {ξ(t), t ∈ T} and {η(t), t ∈ T} be two jointly stationary processes and
Rξη(τ) = E[ξ(t+τ)η̄(t)], then
(1) Rξη(−τ) = E[ξ(t−τ)η̄(t)] = conj(E[η(t)ξ̄(t−τ)]) = conj(Rηξ(τ))
(2) |Rξη(τ)| = |E[ξ(t+τ)η̄(t)]| ≤ E[|ξ(t+τ)η̄(t)|] ≤ (Schwartz inequality) √(E[|ξ(t+τ)|²]·E[|η(t)|²]) = √(Rξ(0)·Rη(0))

2.3. Periodicity
Theorem (Periodicity) Let {ξ(t), −∞ < t < +∞} be a weakly stationary process; ξ(t) is periodic
with period T if and only if its correlation function R(τ) is periodic with period T.

Hint:
E[|ξ(t+T) − ξ(t)|²] = E[|ξ(t+T)|²] + E[|ξ(t)|²] − 2E[ξ(t+T)ξ(t)] = 2[R(0) − R(T)]

2.4. Random Analysis

For a weakly stationary process, the questions of random analysis (whether the process
is continuous, differentiable or integrable) all depend on its correlation function.

Theorem Let {ξ(t), a < t < b} be a weakly stationary process and R(τ) its correlation function;
ξ(t) has derivatives within the open interval (a,b) if R″(τ) is present and continuous at the
point τ = 0.

Remark: Let η(t) = ξ′(t), then
E[η(t)] = E[ lim_{h→0} (ξ(t+h) − ξ(t))/h ] = lim_{h→0} (E[ξ(t+h)] − E[ξ(t)])/h = 0
Rη(t,s) = E[η(t)η̄(s)] = ∂²R(t,s)/∂t∂s = ∂²R(t−s)/∂t∂s = −R″(t−s), i.e., Rη(τ) = −R″(τ)
This shows that η(t) = ξ′(t) is also weakly stationary.

2.5. Ergodicity (Statistical Average = Time Average)

Definition Let {ξ(t), −∞ < t < +∞} be a weakly stationary random process and
m = E[ξ(t)], R(τ) = E[ξ(t+τ)ξ̄(t)] (statistical averages)
⟨ξ(t)⟩ = lim_{T→+∞} (1/2T) ∫_{−T}^{T} ξ(t) dt, ⟨ξ(t+τ)ξ̄(t)⟩ = lim_{T→+∞} (1/2T) ∫_{−T}^{T} ξ(t+τ)ξ̄(t) dt (time averages)
(1) the mean of ξ(t) is said to be ergodic if P{⟨ξ(t)⟩ = m} = 1
(2) the correlation function of ξ(t) is said to be ergodic if P{⟨ξ(t+τ)ξ̄(t)⟩ = R(τ)} = 1
(3) ξ(t) is said to be ergodic if both its mean and its correlation function are ergodic

Remark: Ergodicity means that the statistical average is equal to the time average.

Theorem The mean of a weakly stationary random process {ξ(t), −∞ < t < +∞} is ergodic if
and only if
lim_{T→+∞} (1/T) ∫_0^{2T} (1 − q/2T) C(q) dq = 0
where C(τ) = R(τ) − |m|².

Proof:
Note that
E[⟨ξ(t)⟩] = E[ lim_{T→+∞} (1/2T) ∫_{−T}^{T} ξ(t) dt ] = lim_{T→+∞} (1/2T) ∫_{−T}^{T} E[ξ(t)] dt = m
so that
P{⟨ξ(t)⟩ = m} = 1 ⇔ 0 = E[|⟨ξ(t)⟩ − m|²] = D[⟨ξ(t)⟩]
Now
D[⟨ξ(t)⟩] = lim_{T→+∞} (1/4T²) ∫_{−T}^{T} ∫_{−T}^{T} E[(ξ(t) − m)(ξ̄(s) − m̄)] dt ds = lim_{T→+∞} (1/4T²) ∫_{−T}^{T} ∫_{−T}^{T} C(t − s) dt ds
With the change of variables p = t + s, q = t − s (the square −T ≤ t, s ≤ T becomes the region
−2T < p + q < 2T, −2T < p − q < 2T, i.e. |p| < 2T − |q|, and dt ds = dp dq / 2):
= lim_{T→+∞} (1/8T²) [ ∫_{−2T}^{0} C(q) dq ∫_{−2T−q}^{2T+q} dp + ∫_{0}^{2T} C(q) dq ∫_{−2T+q}^{2T−q} dp ]
= lim_{T→+∞} (1/2T²) ∫_0^{2T} (2T − q) C(q) dq = lim_{T→+∞} (1/T) ∫_0^{2T} (1 − q/2T) C(q) dq #
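As a numerical illustration (a minimal sketch, not from the original text; the AR(1)-type recursion and all parameter values are illustrative assumptions), a sequence whose covariance C(q) dies out geometrically satisfies the condition above, so the time average of one long path drifts toward the ensemble mean m:

import numpy as np

# Mean ergodicity: time average of a single path approaches the ensemble mean.
rng = np.random.default_rng(1)
a, m, n = 0.8, 2.0, 200_000

x = np.empty(n)
x[0] = m
for k in range(1, n):                       # X(k) - m = a (X(k-1) - m) + noise
    x[k] = m + a * (x[k - 1] - m) + rng.normal(0.0, 1.0)

time_avg = np.cumsum(x) / np.arange(1, n + 1)
print(time_avg[999], time_avg[-1])          # both drift toward m = 2.0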

Theorem The correlation function of a weakly stationary random process {ξ(t), −∞ < t < +∞}
is ergodic if and only if
lim_{T→+∞} (1/T) ∫_0^{2T} (1 − q/2T) [Bτ(q) − |R(τ)|²] dq = 0
where Bτ(q) = E[η(t+q)η̄(t)] = E[ξ(t+q+τ)ξ̄(t+q)·conj(ξ(t+τ)ξ̄(t))] is the correlation function of
η(t) = ξ(t+τ)ξ̄(t).

Proof:
Let η(t) = ξ(t+τ)ξ̄(t), then
E[η(t)] = E[ξ(t+τ)ξ̄(t)] = R(τ)
E[η(t)η̄(s)] = ∫∫∫∫ xyzw f(x, y, z, w; t+τ, t, s+τ, s) dx dy dz dw
= ∫∫∫∫ xyzw f(x, y, z, w; t−s+τ, t−s, τ, 0) dx dy dz dw = Bτ(t − s)
This shows that η(t) is at least weakly stationary. It follows from the preceding theorem that
P{⟨η(t)⟩ = E[η(t)]} = P{⟨ξ(t+τ)ξ̄(t)⟩ = R(τ)} = 1 ⇔ lim_{T→+∞} (1/T) ∫_0^{2T} (1 − q/2T) [Bτ(q) − |R(τ)|²] dq = 0 #

2.6. Spectrum Analysis & White Noise

Definition Let {ξ(t), −∞ < t < +∞} be a random process; the spectrum of ξ(t) is defined as
S(ω) = lim_{T→+∞} E[|F(ω, T)|²] / (2T)
where F(ω, T) = ∫_{−T}^{T} ξ(t) e^{−jωt} dt is the Fourier transform of ξ(t). Note that F(ω, T) is also a
random process.

Theorem (Wiener-Khintchine Theorem) Let {ξ(t), −∞ < t < +∞} be a weakly stationary
random process, R(τ) the correlation function and S(ω) the spectrum of ξ(t), then
S(ω) = ∫_{−∞}^{+∞} R(τ) e^{−jωτ} dτ, R(τ) = (1/2π) ∫_{−∞}^{+∞} S(ω) e^{jωτ} dω

Example S(ω) is a real-valued function.

Proof:
S̄(ω) = ∫ R̄(τ) e^{jωτ} dτ = ∫ R(−τ) e^{jωτ} dτ = ∫ R(τ) e^{−jωτ} dτ = S(ω) #

Definition (White Noise) A weakly stationary process {ξ(t), −∞ < t < +∞} is said to be a white
noise process if its spectrum is flat, i.e., S(ω) = σ² (Const.)
Remark: Since
∫_{−∞}^{+∞} δ(τ) e^{−jωτ} dτ = 1, (1/2π) ∫_{−∞}^{+∞} e^{jωτ} dω = δ(τ)
we have
R(τ) = (1/2π) ∫_{−∞}^{+∞} S(ω) e^{jωτ} dω = (1/2π) ∫_{−∞}^{+∞} σ² e^{jωτ} dω = σ²δ(τ)
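Here is a minimal discrete-time sketch of this relation (the sample sizes and σ² = 1.5 are illustrative assumptions): the averaged periodogram |F(ω)|²/N of simulated white-noise paths is approximately flat at level σ², in line with the definition above.

import numpy as np

# Averaged periodogram of white noise: flat spectrum at level σ².
rng = np.random.default_rng(2)
sigma2, n, n_paths = 1.5, 1024, 400

acc = np.zeros(n)
for _ in range(n_paths):
    x = rng.normal(0.0, np.sqrt(sigma2), n)    # one white-noise path
    acc += np.abs(np.fft.fft(x)) ** 2 / n      # periodogram |F(ω)|²/N
S = acc / n_paths
print(S.mean(), S.std())                       # mean ≈ 1.5, small spread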

3. Discrete Time Sequence Analysis: Auto-Regressive and Moving-Average (ARMA) Models
3.1. Definition
Definition Let x(n) be a zero-mean white noise, i.e., E[x(n)] = 0, E[x(n+m)x(n)] = σx²δ(m),
then
(1) a random sequence y(n) is said to be in accordance with an auto-regressive (AR) model
of order K if it can be expressed as
y(n) + Σ_{k=1}^{K} ak y(n−k) = b0 x(n)
(2) a random sequence y(n) is said to be in accordance with a moving-average (MA) model
of order M if it can be expressed as
y(n) = Σ_{m=0}^{M} bm x(n−m)
(3) a random sequence y(n) is said to be in accordance with an auto-regressive and
moving-average (ARMA) model of order (K, M) if it can be expressed as
y(n) + Σ_{k=1}^{K} ak y(n−k) = Σ_{m=0}^{M} bm x(n−m)

Remark: The power spectrum of white noise:
S(e^{jω}) = Σ_{m=−∞}^{+∞} R(m) e^{−jωm} = σx² Σ_{m=−∞}^{+∞} δ(m) e^{−jωm} = σx²
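A minimal simulation sketch of the three model classes just defined (all coefficient values are illustrative assumptions, chosen inside the unit circle so the recursions are stable):

import numpy as np

rng = np.random.default_rng(3)
n = 10_000
x = rng.normal(0.0, 1.0, n)                  # zero-mean white noise

# AR(1):  y(n) - 0.7 y(n-1) = x(n)
y_ar = np.zeros(n)
for k in range(1, n):
    y_ar[k] = 0.7 * y_ar[k - 1] + x[k]

# MA(2):  y(n) = x(n) + 0.5 x(n-1) + 0.2 x(n-2)
y_ma = np.convolve(x, [1.0, 0.5, 0.2])[:n]

# ARMA(1,1):  y(n) - 0.7 y(n-1) = x(n) + 0.5 x(n-1)
y_arma = np.zeros(n)
for k in range(1, n):
    y_arma[k] = 0.7 * y_arma[k - 1] + x[k] + 0.5 * x[k - 1]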

3.2. Transition Functions

Definition (Transition Functions) Given an ARMA model
y(n) + Σ_{k=1}^{K} ak y(n−k) = Σ_{m=0}^{M} bm x(n−m)
let
H(z) = [ Σ_{m=0}^{M} bm z^{−m} ] / [ 1 + Σ_{k=1}^{K} ak z^{−k} ]
and zmax the largest pole of H(z); if |zmax| < 1, then the model is said to be causal and stable
and H(z) is called the transition function of the model.

Remark 1: From now on, the ARMA models we encounter in this lecture are all assumed to
be causal and stable, unless declared otherwise.
Remark 2: If H(z) is the transition function of an ARMA model, then h(n) = Z⁻¹[H(z)] is
called the impulse response of the model. It can easily be proven that
h(n) = 0 for n < 0 (causal) and Σ_{n=0}^{∞} |h(n)| < +∞ (stable)
Remark 3: For AR models,
H(z) = b0 / (1 + Σ_{k=1}^{K} ak z^{−k}) ⇒ h(n) is of infinite duration (Infinite Impulse Response, IIR)
For MA models,
H(z) = Σ_{m=0}^{M} bm z^{−m} ⇒ h(n) is of finite duration (Finite Impulse Response, FIR)
Remark 4: h(n) can also be solved from the difference equation
h(n) + Σ_{k=1}^{K} ak h(n−k) = Σ_{m=0}^{M} bm δ(n−m), h(n) = 0 for all n < 0

Example What are the impulse responses for the following models?
(1) AR(1):
y(n) − a y(n−1) = x(n), |a| < 1
H(z) = 1/(1 − a z⁻¹), |z| > |a| ⇒ h(n) = Z⁻¹[1/(1 − a z⁻¹)] = aⁿ u(n)
(2) AR(2):
y(n) − a1 y(n−1) − a2 y(n−2) = x(n)
h(n) − a1 h(n−1) − a2 h(n−2) = δ(n), h(n) = 0 for all n < 0
⇒ h(0) = 1, h(1) = a1, h(n) = a1 h(n−1) + a2 h(n−2), n ≥ 2
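A short sketch of this recursion (the coefficients a1 = 0.5, a2 = 0.3 are illustrative assumptions): the impulse response is obtained by feeding δ(n) through the difference equation of Remark 4.

import numpy as np

a1, a2, n_taps = 0.5, 0.3, 20

h = np.zeros(n_taps)
for n in range(n_taps):
    h[n] = 1.0 if n == 0 else 0.0           # δ(n) input
    if n >= 1:
        h[n] += a1 * h[n - 1]
    if n >= 2:
        h[n] += a2 * h[n - 2]

print(h[:4])                                # 1, a1, a1²+a2, ...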

Definition y(n) is said to be the stationary solution/output of an ARMA model if it is given
by y(n) = Σ_{k=0}^{+∞} h(k) x(n−k), where h(n) is the impulse response of the model.

3.3. Mathematical Expectations

Theorem Assume that y(n) is the stationary solution of an ARMA model and h(n) the
impulse response of the model. It follows from y(n) = Σ_{k=0}^{+∞} h(k) x(n−k) that
mean value:
my = E[y(n)] = E[ Σ_{k=0}^{+∞} h(k) x(n−k) ] = Σ_{k=0}^{+∞} h(k) E[x(n−k)] = 0
correlation function:
Ry(m) = E[y(n+m)y(n)] = E[ (Σ_{p=0}^{+∞} h(p) x(n+m−p))·(Σ_{q=0}^{+∞} h(q) x(n−q)) ]
= Σ_{q=0}^{+∞} Σ_{p=0}^{+∞} h(p)h(q) E[x(n−q)x(n+m−p)] = σx² Σ_{q=0}^{+∞} Σ_{p=0}^{+∞} h(p)h(q) δ(q − p + m)
= σx² Σ_{q=0}^{+∞} h(q)h(q+m) = σx² Rh(m)
where Rh(m) = Σ_{q=0}^{+∞} h(q)h(q+m).
variance:
σy² = Ry(0) = σx² Rh(0) = σx² Σ_{n=0}^{+∞} h²(n)
correlation coefficient or standard correlation function:
ρy(m) = Ry(m)/σy² = Rh(m)/Rh(0)
spectrum:
Sy(e^{jω}) = Σ_{m=−∞}^{+∞} Ry(m) e^{−jωm} = σx² Σ_{m=−∞}^{+∞} Σ_{k=0}^{+∞} h(k)h(k+m) e^{−jωm}
= σx² Σ_{k=0}^{+∞} h(k) e^{jωk} · Σ_{n=0}^{+∞} h(n) e^{−jωn}  (substituting n = k + m)
= σx² |H(e^{jω})|² = σx² |Σ_{m=0}^{M} bm e^{−jωm}|² / |1 + Σ_{k=1}^{K} ak e^{−jωk}|²

Remark: It is clear that y(n) is also a zero-mean weakly stationary random process.

Example For AR(1),
y(n) − a y(n−1) = x(n), |a| < 1 ⇒ h(n) = aⁿ u(n)
Rh(m) = Σ_{q=0}^{+∞} a^q a^{q+m} = a^{|m|} / (1 − a²)
ρy(m) = Rh(m)/Rh(0) = a^{|m|}

Remark: ρy(m) is said to tail off if ρy(m) → 0 as m → +∞.
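A quick numerical check (the path length and a = 0.7 are illustrative assumptions): the sample correlation coefficient of a simulated AR(1) path tails off like a^m, matching ρy(m) = a^{|m|} above.

import numpy as np

rng = np.random.default_rng(4)
a, n = 0.7, 200_000

y = np.zeros(n)
for k in range(1, n):
    y[k] = a * y[k - 1] + rng.normal()

for m in range(5):
    rho = np.mean(y[m:] * y[:n - m]) / np.mean(y * y)
    print(m, rho, a**m)                     # sample vs. theoretical a^m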

3.4. Parameter Estimation

Theorem For an ARMA model y(n) + Σ_{k=1}^{K} ak y(n−k) = Σ_{m=0}^{M} bm x(n−m), if n > m, then
E[x(n)y(m)] = E[ x(n) Σ_{k=0}^{+∞} h(k) x(m−k) ] = Σ_{k=0}^{+∞} h(k) E[x(n)x(m−k)] = σx² Σ_{k=0}^{+∞} h(k) δ(n−m+k) = 0

Remark: The theorem is straightforward because of the causality of the model. Causality
states that the output of the model depends only on the input at present
and in the past and has nothing to do with future input.

3.4.1. Estimation of AR parameters

Example (Auto-Regressive Weights) Given an AR(K) model:
y(n) − Σ_{k=1}^{K} ak y(n−k) = b0 x(n)
then for i = 1, 2, …, K, we have
y(n)y(n−i) − Σ_{k=1}^{K} ak y(n−k)y(n−i) = b0 x(n)y(n−i)
E[y(n)y(n−i)] − Σ_{k=1}^{K} ak E[y(n−k)y(n−i)] = b0 E[x(n)y(n−i)] = 0
⇒ Ry(i) = Σ_{k=1}^{K} ak Ry(i−k)
⇒ ρy(i) = Σ_{k=1}^{K} ak ρy(i−k), where ρy(i) = Ry(i)/Ry(0)
The above equations can be expressed in matrix form (using ρy(−j) = ρy(j)):
[ ρy(0)    ρy(1)    …  ρy(K−1) ] [a1]   [ρy(1)]
[ ρy(1)    ρy(0)    …  ρy(K−2) ] [a2] = [ρy(2)]
[   ⋮        ⋮      ⋱     ⋮    ] [ ⋮]   [  ⋮  ]
[ ρy(K−1)  ρy(K−2)  …  ρy(0)   ] [aK]   [ρy(K)]
The parameters a1, a2, …, aK can then be derived from the solution to the above equations.
Remark: In practice, Ry(i) = E[y(n)y(n−i)] is replaced by the sample average
Ry(i) ≈ (1/N) Σ_k y(k)y(k−i), i = 1, 2, …, K, computed from an observed stretch of the sequence.

Example (Variance of White Noise) Given an AR(K) model:
y(n) − Σ_{k=1}^{K} ak y(n−k) = x(n)
we have
σx² = E[x²(n)] = E[ (y(n) − Σ_{k=1}^{K} ak y(n−k))² ]
= Ry(0) − 2 Σ_{k=1}^{K} ak Ry(k) + Σ_{p=1}^{K} Σ_{q=1}^{K} ap aq Ry(p−q)
= Ry(0) − 2 Σ_{k=1}^{K} ak Ry(k) + Σ_{p=1}^{K} ap Ry(p)   (since Σ_{q=1}^{K} aq Ry(p−q) = Ry(p))
= Ry(0) − Σ_{k=1}^{K} ak Ry(k)
The variance σx² can be obtained after the parameters a1, a2, …, aK have been estimated.
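The two examples above combine into one estimation sketch (the true parameter values below are illustrative assumptions): simulate an AR(2) sequence, build the Yule-Walker system from sample correlations, solve for the weights, then recover σx² = Ry(0) − Σ ak Ry(k).

import numpy as np

rng = np.random.default_rng(5)
a1_true, a2_true, n = 0.5, 0.3, 200_000

y = np.zeros(n)
for k in range(2, n):                       # AR(2) with unit-variance noise
    y[k] = a1_true * y[k - 1] + a2_true * y[k - 2] + rng.normal()

def R(i):                                   # sample correlation Ry(i)
    return np.mean(y[i:] * y[:n - i])

K = 2
A = np.array([[R(abs(i - k)) for k in range(K)] for i in range(K)])
b = np.array([R(i + 1) for i in range(K)])
a_hat = np.linalg.solve(A, b)               # Yule-Walker solution
sigma2_hat = R(0) - a_hat @ b               # Ry(0) - Σ ak Ry(k)
print(a_hat, sigma2_hat)                    # ≈ [0.5, 0.3], ≈ 1.0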

3.4.2. Estimation of MA parameters

Example (Moving Average Weights) Given an MA(M) model:
y(n) = Σ_{m=0}^{M} bm x(n−m)
for i = 0, 1, …, M, we have
y(n)y(n−i) = Σ_{m=0}^{M} Σ_{k=0}^{M} bm bk x(n−m)x(n−i−k)
⇒ Ry(i) = σx² Σ_{m=0}^{M} Σ_{k=0}^{M} bm bk δ(k + i − m) = σx² Σ_{k=0}^{M−i} bk b(k+i)
Setting b̃k = bk/b0 and σ̃x² = σx² b0², this reads
Ry(i) = σ̃x² Σ_{k=0}^{M−i} b̃k b̃(k+i), i = 0, 1, …, M
Thus the unknowns σ̃x², b̃1, …, b̃M can be derived from the solutions to the above M+1
(nonlinear) equations.

3.4.3. Estimation of ARMA parameters

Example Given an ARMA(K, M) model:
y(n) − Σ_{k=1}^{K} ak y(n−k) = Σ_{m=0}^{M} bm x(n−m)
for i = 1, 2, …, K, we have
y(n)y(n−M−i) − Σ_{k=1}^{K} ak y(n−k)y(n−M−i) = Σ_{m=0}^{M} bm x(n−m)y(n−M−i)
E[y(n)y(n−M−i)] − Σ_{k=1}^{K} ak E[y(n−k)y(n−M−i)] = Σ_{m=0}^{M} bm E[x(n−m)y(n−M−i)] = 0
⇒ Ry(M+i) = Σ_{k=1}^{K} ak Ry(M+i−k)
⇒ ρy(M+i) = Σ_{k=1}^{K} ak ρy(M+i−k)
The above equations can be expressed in matrix form:
[ ρy(M)      ρy(M−1)    …  ρy(M+1−K) ] [a1]   [ρy(M+1)]
[ ρy(M+1)    ρy(M)      …  ρy(M+2−K) ] [a2] = [ρy(M+2)]
[    ⋮          ⋮       ⋱     ⋮      ] [ ⋮]   [   ⋮    ]
[ ρy(M+K−1)  ρy(M+K−2)  …  ρy(M)     ] [aK]   [ρy(M+K)]
The parameters a1, a2, …, aK can then be derived from the solutions to the above equations.

Example Given an ARMA(K, M) model:
y(n) − Σ_{k=1}^{K} ak y(n−k) = Σ_{m=0}^{M} bm x(n−m)
if we let
g(n) = y(n) − Σ_{k=1}^{K} ak y(n−k)
the ARMA(K, M) model reduces to an MA(M) model
g(n) = Σ_{m=0}^{M} bm x(n−m)
The unknowns σ̃x² (= σx² b0²), b̃1 = b1/b0, …, b̃M = bM/b0 can then be derived from the
solutions to the following equations:
Rg(i) = σ̃x² Σ_{k=0}^{M−i} b̃k b̃(k+i), i = 0, 1, …, M

4. Problems
(1) An IID process must be strictly stationary.
In fact, let {ξ(t), t ∈ T} be an IID process, then
P{ξ(t1+τ) < x1, ξ(t2+τ) < x2, …, ξ(tn+τ) < xn}
= (independence) P{ξ(t1+τ) < x1}·P{ξ(t2+τ) < x2} ⋯ P{ξ(tn+τ) < xn}
= (identical distribution) P{ξ(t1) < x1}·P{ξ(t2) < x2} ⋯ P{ξ(tn) < xn}
= (independence) P{ξ(t1) < x1, ξ(t2) < x2, …, ξ(tn) < xn} #

(2) If {ξ(n), n = 1, 2, …} is a discrete random process with E[ξ(n)] = 0, E[ξ²(n)] = σ² and
E[ξ(n)ξ(m)] = 0 (when n ≠ m), then
E[ξ(n)ξ(m)] = σ² if n = m and 0 if n ≠ m, i.e., E[ξ(n)ξ(m)] = σ²δ(n − m)
This implies that the process {ξ(n), n = 1, 2, …} is a weakly stationary process. #

(3) Let φ be a random variable possessing a uniform distribution over the interval (0, 2π) and
{ξ(t) = a cos(ωt + φ), −∞ < t < +∞}, then
for all −∞ < t < +∞, we have
E[ξ(t)] = (1/2π) ∫_0^{2π} a cos(ωt + x) dx = (a/2π) ∫_{ωt}^{ωt+2π} cos(y) dy = 0   (substituting y = ωt + x)
for all −∞ < t1 ≤ t2 < +∞, we have
E[ξ(t2)ξ(t1)] = (a²/2π) ∫_0^{2π} cos(ωt1 + x) cos(ωt2 + x) dx = (a²/2) cos(ω(t2 − t1))
This implies that the process {ξ(t) = a cos(ωt + φ), −∞ < t < +∞} is weakly stationary. #

Remark: cos α cos β = [cos(α + β) + cos(α − β)] / 2
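A Monte Carlo sketch of this example (a, ω and the sample size are illustrative assumptions): averaging over the random phase shows that the mean is 0 for every t and the correlation depends only on the lag τ.

import numpy as np

rng = np.random.default_rng(6)
a, w, n = 2.0, 1.5, 200_000
phi = rng.uniform(0.0, 2.0 * np.pi, n)

def xi(t):
    return a * np.cos(w * t + phi)

tau = 0.8
for t in (0.0, 1.0, 5.0):
    print(t, xi(t).mean(),                          # ≈ 0, independent of t
          np.mean(xi(t + tau) * xi(t)),             # ≈ (a²/2)·cos(ωτ)
          0.5 * a**2 * np.cos(w * tau))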

(4) Let s(t) be a periodic function with period T, φ be a random variable possessing the
uniform distribution on the interval (0, T) and {ξ(t) = s(t + φ), −∞ < t < +∞}, then
for all −∞ < t < +∞, we have
E[ξ(t)] = ∫ s(t + x) f(x) dx = (1/T) ∫_0^{T} s(t + x) dx = (1/T) ∫_t^{t+T} s(y) dy   (y = t + x)
= (1/T) ∫_0^{T} s(y) dy = const.   (by periodicity)
for all −∞ < t1 ≤ t2 < +∞, we have
E[ξ(t2)ξ(t1)] = ∫ s(t1 + x) s(t2 + x) f(x) dx = (1/T) ∫_0^{T} s(t1 + x) s(t2 + x) dx
= (1/T) ∫_{t1}^{t1+T} s(y) s(t2 − t1 + y) dy   (y = t1 + x)
= (1/T) ∫_0^{T} s(y) s(t2 − t1 + y) dy = R(t2 − t1)   (by periodicity)
This implies that the process {ξ(t) = s(t + φ), −∞ < t < +∞} is weakly stationary. #

(5) Let {ξ(t), −∞ < t < +∞} be a random process such that for all −∞ < t < +∞,
P{ξ(t) = k} = 1/2 if k = I, 1/2 if k = −I, 0 otherwise
Furthermore, for all τ > 0, if we denote by Ak the event that the process changes its value k
times within the period [t, t + τ), then
P{Ak} = ((λτ)^k / k!) e^{−λτ}, where λ > 0, k = 0, 1, 2, …
Thus,
for all −∞ < t < +∞, we have
E[ξ(t)] = I·(1/2) + (−I)·(1/2) = 0
for all −∞ < t1 < t2 < +∞, we have
E[ξ(t2)ξ(t1)] = I²(P{A0} + P{A2} + ⋯ + P{A2n} + ⋯) − I²(P{A1} + P{A3} + ⋯ + P{A2n+1} + ⋯)
= I² e^{−λτ} Σ_{k=0}^{∞} (−λτ)^k / k! = I² e^{−2λτ}, where τ = t2 − t1
Note that the above result can also be applied to the case of t2 = t1.
This implies that the process is weakly stationary. #

(6) If {ξ(t), −∞ < t < +∞} is a periodic random process with period T, then its covariance
function R(τ) = E[ξ(t+τ)ξ(t)] is also a periodic function with period T.

Proof:
Since the process is periodic with period T, i.e., P{ξ(t+T) = ξ(t)} = 1, we have
E[|ξ(t+T) − ξ(t)|²] = 0.
From the Cauchy-Schwarz inequality, we then have
|E[(ξ(t+τ+T) − ξ(t+τ))ξ(t)]| ≤ √(E[|ξ(t+τ+T) − ξ(t+τ)|²]·E[|ξ(t)|²]) = 0 ⇒ E[(ξ(t+τ+T) − ξ(t+τ))ξ(t)] = 0
Hence
R(τ + T) − R(τ) = E[(ξ(t+τ+T) − ξ(t+τ))ξ(t)] = 0 ⇒ R(τ + T) = R(τ) #


Martingales


1. Simple properties

Definitions. Let (Ω, K, P) be a probability space. A filtration is any increasing
sequence of sub-σ-algebras of K. We shall denote it by (Fn)n≥1. Usually one adds
to the filtration its tail σ-field, that is, the σ-algebra F∞ defined by F∞ = σ(∪_{n=1}^{∞} Fn).
Let X := (Xn)n be a sequence of random variables. We call X adapted if Xn
is Fn-measurable for any positive integer n. The system (Ω, K, P, (Fn)n) is called
a stochastic basis.
Example. If we define Fn := σ(X1, X2, …, Xn), then X is clearly adapted. This
filtration is called the natural filtration given by X.
Definitions. Let X be an adapted sequence. Suppose that Xn ∈ L¹ for any n. Then X
is called
a supermartingale if E(Xn+1|Fn) ≤ Xn ∀n;
a martingale if E(Xn+1|Fn) = Xn ∀n;
a submartingale if E(Xn+1|Fn) ≥ Xn ∀n;
a semimartingale if X is either a supermartingale, a martingale or a
submartingale.
Remark. If one does not specify the filtration, it is understood that one has in
mind the natural filtration. Also notice that a martingale is both a sub- and a
supermartingale and conversely, if X is both a sub- and a supermartingale, it is a
martingale.
Remark. In the literature the concept of semimartingale is slightly different.
However, we shall use it only in this sense.

Examples.
1. Let ξn be a sequence of i.i.d. r.v. from L¹ and let a = Eξ1. Let Fn =
σ(ξ1, ξ2, …, ξn) and Xn = ξ1 + ξ2 + ⋯ + ξn. Then a ≤ 0 ⇒ X is a supermartingale, a =
0 ⇒ X is a martingale and a ≥ 0 ⇒ X is a submartingale. If we think of ξn as
being the gain of a player at the nth game, then Xn is the gain of the player
after n games. So we can understand a martingale as the gain in a fair game:
supermartingale = the game is unfavorable to the player and submartingale =
the game is favorable to the player.
Proof. E(Xn+1|Fn) = E(Xn + ξn+1|Fn) = E(Xn|Fn) + E(ξn+1|Fn) = Xn + E(ξn+1|Fn) (as Xn is Fn-measurable) = Xn + Eξn+1 (as ξn+1 is independent of Fn) ⇒ E(Xn+1|Fn) = Xn + a.
2. Let ξn be a sequence of non-negative i.i.d. r.v. from L¹ and let a = Eξ1. Let Fn
= σ(ξ1, ξ2, …, ξn) and Xn = ξ1ξ2⋯ξn. Then a ≤ 1 ⇒ X is a supermartingale, a =
1 ⇒ X is a martingale and a ≥ 1 ⇒ X is a submartingale.
Proof. Similar. E(Xn+1|Fn) = E(Xn·ξn+1|Fn) = Xn·E(ξn+1|Fn) (as Xn is Fn-measurable) =
Xn·Eξn+1 (as ξn+1 is independent of Fn) ⇒ E(Xn+1|Fn) = aXn.
3. Let (Fn)n be a filtration and f ∈ L¹. Let Xn = E(f|Fn). Then Xn is a martingale.
The random variable X∞ = E(f|F∞) is called the tail of X. Martingales of this
form are called martingales with tail.
Proof. E(Xn+1|Fn) = E(E(f|Fn+1)|Fn) = E(f|Fn) (as Fn ⊂ Fn+1) = Xn.
4. A concrete example. Let Ω = (0,1], K = B((0,1]), P = the Lebesgue measure and
Xn = n·1_{(0,1/n]}. Check that this is a non-negative martingale converging to 0 a.s.
but not in L¹.
5. Another concrete example. Let ξn be i.i.d. with the distribution (δ(−1)+δ(1))/2. Let
Fn = σ(ξ1, …, ξn). Let Bn ∈ Fn be such that P(Bn) → 0 as n → ∞ but P(limsup Bn)
= 1. Define the sequence Xn by recurrence as follows: X1 = 1 and Xn+1 =
Xn(1+ξn+1) + ξn+1·1_{Bn} for n ≥ 1. Then Xn converges in probability to 0 but P(limsup Xn =
liminf Xn) = 0. That is, Xn diverges almost surely.
Proof. Remark that ξn+1(ω) = −1 and ω ∈ Bn ⇒ Xn+1(ω) = 0, hence Xn+1(ω) ≠ 0 ⇒ ξn+1(ω)
= 1, or Xn(ω) ≠ 0 and ω ∉ Bn. That is, {Xn+1 ≠ 0} ⊂ {ξn+1 = 1, Xn ≠ 0} ∪ Bn ⇒ P(Xn+1 ≠ 0)
≤ P({ξn+1 = 1, Xn ≠ 0} ∪ Bn) ≤ P(ξn+1 = 1, Xn ≠ 0) + P(Bn) = P(Xn ≠ 0)·P(ξn+1 = 1) + P(Bn) =
P(Xn ≠ 0)/2 + P(Bn).
Let pn = P(Xn ≠ 0) and qn = P(Bn). So pn+1 ≤ pn/2 + qn ∀n and qn → 0. Applying the
recurrence many times we see that pn+1 ≤ 2⁻¹pn + qn ≤ 2⁻²p(n−1) + 2⁻¹q(n−1) + qn ≤ ⋯ ≤
2⁻ⁿp1 + 2⁻ⁿ(q1 + 2q2 + ⋯ + 2^{n−1}qn). As 2⁻ⁿp1 → 0 and, by Cesaro-Stolz,
lim (q1 + 2q2 + ⋯ + 2^{n−1}qn)/2ⁿ = lim 2^{n−1}qn/(2ⁿ − 2^{n−1}) = lim qn = 0, it means that P(Xn ≠ 0)
→ 0. Now suppose that Xn(ω) → a for some a ≠ 0. Then Xn+1(ω) − Xn(ω) → 0. But
from the recurrence relation we infer that Xn+1 − Xn = ξn+1(Xn + 1_{Bn}). So, if Xn+1 −
Xn = ±(Xn + 1_{Bn}) (as ξn = ±1) converges to 0, then Xn(ω) + 1_{Bn}(ω) → 0, too,
which is the same as the claim that 1_{Bn}(ω) → −a, meaning that 1_{Bn}(ω)
has a limit. But we know that P(liminf Bn) ≤ lim P(Bn) = 0 and P(limsup Bn) = 1,
i.e. the sequence 1_{Bn} diverges a.s. Therefore P(Xn converges to a finite nonzero limit)
= 0. Suppose that Xn(ω) → ∞. That would imply that Xn(ω) > 0 for any n
great enough. But P(Xn+k > 0 ∀k) ≤ P(Xn+j ≠ 0) ∀j and that converges to 0. Meaning
that P(lim Xn = ∞ or −∞) = 0. We infer that Xn diverges a.s.
The fact that Xn is a martingale is obvious, since E(Xn+1|Fn) = Xn·E(1+ξn+1|Fn) +
1_{Bn}·E(ξn+1|Fn) (as Xn is Fn-measurable and Bn ∈ Fn) = Xn·E(1+ξn+1) + 1_{Bn}·E(ξn+1) = Xn
(as Eξn+1 = 0). On the other hand, remark that E(|Xn+1| |Fn) = |Xn| + 1_{Bn} ≥ |Xn|, which
points out that |Xn| is a submartingale with the property that E|Xn| = 1 + Σ_{j=1}^{n−1} qj.
Here are some simple properties of these sequences.

Property 1.1. If X is a submartingale, the sequence (EXn)n is non-decreasing; if X is a martingale, the sequence (EXn)n is constant; and if X is a
supermartingale, the sequence (EXn)n is non-increasing. Moreover, if m < n then
E(Xn|Fm) ≤ Xm (for supermartingales), = Xm (for martingales) and ≥ Xm for
submartingales.
The proof is simple and left as an exercise.
Property 1.2. If X, Y are martingales (sub-, super-) and a, b ≥ 0, then
aX + bY is the same. That is, the sub- (super-) martingales form a positive cone.
Moreover, if X, Y are martingales, then aX + bY is a martingale ∀a, b, meaning that
the set of all the martingales of some stochastic basis is a vector space.
Moreover, X is a supermartingale ⇔ −X is a submartingale.
The proof is obvious and left to the reader.
Property 1.3. If X is a martingale and f is a convex function such that
f(Xn) ∈ L¹ ∀n, then the sequence Yn = f(Xn) is a submartingale. If f is concave
and f(Xn) ∈ L¹ ∀n, then the sequence Yn = f(Xn) is a supermartingale. As a
consequence, if X is a martingale, then (|Xn|)n, ((Xn)⁺)n and (Xn²)n are submartingales.
Proof. It is Jensen's inequality for conditioned expectations. Suppose f is convex.
Then E(Yn+1|Fn) = E(f(Xn+1)|Fn) ≥ f(E(Xn+1|Fn)) = f(Xn) = Yn.
Property 1.4. The Doob-Meyer decomposition. The submartingales are
actually sums of martingales and increasing sequences. Any submartingale X
can be written as X = M + A where M is a martingale and A is non-decreasing (An ≤
An+1 a.s.) and predictable (i.e. An+1 is Fn-measurable).
Proof. Let us define the sequence An by the following recurrence: A1 = 0, A2 =
E(X2|F1) − X1, A3 = A2 + E(X3|F2) − X2, …, An+1 = An + E(Xn+1|Fn) − Xn. As X is a
submartingale, A is indeed non-decreasing. By the definition, An+1 is Fn-measurable. Let Mn = Xn − An. As Mn+1 = Mn + Xn+1 − E(Xn+1|Fn) it follows that Mn is
indeed a martingale.
Property 1.5. Martingale transforms. Let X = (Xn)n≥1 and B = (Bn)n≥0 be
adapted sequences of r.v. such that Bn(Xn+1 − Xn) ∈ L¹ (that happens for instance if
Bn ∈ L∞ and Xn ∈ L¹ ∀n). Remark that, unlike X, B starts from 0. We shall agree
that B0 is a constant in order to be measurable with respect to any σ-algebra.
Let us define a new sequence denoted by B·X by the recurrence (B·X)1 = B0X1 and, for
n ≥ 1, (B·X)n+1 = (B·X)n + Bn(Xn+1 − Xn). (Or, directly, (B·X)n = B0X1 + B1(X2 − X1) + B2(X3 −
X2) + ⋯ + B(n−1)(Xn − X(n−1)) for n ≥ 2.) Call the sequence B·X the transform of X by B.
Then
(i) If X is a martingale, B·X is a martingale, too;
(ii) If X is a submartingale and Bn ≥ 0 ∀n, then B·X is a submartingale,
too; if Bn ≤ 0 ∀n, B·X is a supermartingale.
(iii) If Bn = c is a constant sequence, c ∈ L∞(F1), then B·X = cX.
Proof. E((B·X)n+1|Fn) = E((B·X)n + Bn(Xn+1 − Xn)|Fn) = (B·X)n + Bn·E(Xn+1 − Xn|Fn).
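A minimal simulation of a martingale transform (the betting rule and path counts are illustrative assumptions): with fair ±1 increments, even the classical "double after every loss" strategy is predictable, so by (i) the transform stays a martingale and E(B·X)n = 0.

import numpy as np

rng = np.random.default_rng(7)
n_steps, n_paths = 20, 100_000

xi = rng.choice([-1.0, 1.0], size=(n_paths, n_steps))    # martingale increments
bets = np.ones((n_paths, n_steps))                       # B0 = 1
for k in range(1, n_steps):
    lost = xi[:, k - 1] < 0
    bets[:, k] = np.where(lost, 2 * bets[:, k - 1], 1.0) # Bk uses only the past

transform = (bets * xi).cumsum(axis=1)      # (B·X)n = Σ B(k-1)·(Xk − X(k-1))
print(transform[:, -1].mean())              # ≈ 0: you cannot beat a fair game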

2. Stopping times

In the theory of martingales the concept of stopping time is crucial.

Definitions. Let (Ω, K, P, (Fn)n) be a stochastic basis. A random variable τ: Ω →
N ∪ {∞} is called a stopping time iff {τ = n} ∈ Fn ∀n. If τ is a stopping time
one denotes by Fτ the family of sets A ∈ K with the property that A ∩ {τ = n} ∈
Fn ∀n. Remark that Fτ is a new σ-algebra, called the σ-field of the events
happened before τ (the anterior σ-algebra). Let now X be a sequence of random
variables. Let ξ ∈ L¹(F∞) be arbitrary. We define Xτ by the relation
(2.1) Xτ(ω) = X(τ(ω))(ω) if τ(ω) < ∞, and Xτ(ω) = ξ(ω) if τ(ω) = ∞.
Remark that, while there exists an ambiguity in the definition of Xτ on the set {τ =
∞}, if τ < ∞ there is no imprecision.
Property 2.1. Examples of stopping times and properties of Fτ.
(i) Any constant is a stopping time.
(ii) If τ = k = constant, then Fτ = Fk, meaning that the definition of Fτ
is natural.
(iii) If X is adapted and B ∈ B(R), then τB defined as τB = inf {n | Xn ∈ B} is
a stopping time. (We adopt the convention that inf ∅ = ∞.) This stopping time is
called the hitting time of B.
(iv) If τ is a stopping time and A ∈ Fτ then τA is again a stopping time, where
τA = τ·1A + ∞·1_{Ω\A}.
(v) If σ and τ are stopping times and σ ≤ τ, then Fσ ⊂ Fτ.
(vi) A ∈ Fσ ⇒ A ∩ {σ ≤ τ} ∈ Fτ, A ∩ {σ = τ} ∈ Fσ ∩ Fτ.
(vii) {σ ≤ τ} ∈ Fσ ∩ Fτ, {σ = τ} ∈ Fσ ∩ Fτ.
(viii) Fσ ∩ Fτ = F(σ∧τ), σ(Fσ ∪ Fτ) = F(σ∨τ).

Proof. (i) and (ii) are obvious. For (iii) remark that {τB = n} = {X1 ∉ B, X2 ∉ B,
…, X(n−1) ∉ B, Xn ∈ B} ∈ Fn since X is adapted.
(iv) It is easy: {τA = n} = {τ = n} ∩ A ∈ Fn due to the definition of Fτ.
(v) It is also immediate: A ∈ Fσ ⇒ A ∩ {σ = k} ∈ Fk, so A ∩ {τ = n} =
∪_{k=1}^{n} (A ∩ {σ = k}) ∩ {τ = n} (since σ ≤ τ implies {τ = n} ⊂ {σ ≤ n}) = ∪_{k=1}^{n} Bk ∩ {τ = n} (with Bk =
A ∩ {σ = k} ∈ Fk ⊂ Fn) ∈ Fn.
(vi) Let A ∈ Fσ. To prove that A ∩ {σ ≤ τ} ∈ Fτ
we have to check that A ∩ {σ ≤ τ} ∩ {τ = n} ∈ Fn ∀n. But A ∩ {σ ≤ τ} ∩ {τ = n} = A ∩ {σ
≤ n} ∩ {τ = n} belongs to Fn since A ∈ Fσ ⇒ A ∩ {σ ≤ n} ∈ Fn, and τ is a stopping time
⇒ {τ = n} ∈ Fn. As for the set A ∩ {σ = τ}, it belongs both to Fτ (as A ∩ {σ = τ} ∩ {τ = n}
= (A ∩ {σ = n}) ∩ {τ = n}) and to Fσ (as A ∩ {σ = τ} ∩ {σ = n} = (A ∩ {σ = n}) ∩ {τ = n}).
(vii) That {σ ≤ τ} ∈ Fτ is an easy consequence of (vi) (just set A = Ω). To
check that {σ ≤ τ} ∈ Fσ, let n be arbitrary. Then {σ ≤ τ} ∩ {σ = n} = {σ = n} ∩ {τ ≥ n} =
{σ = n} \ ({σ = n} ∩ {τ < n}) ∈ Fn as {σ = n} ∈ Fn and {τ < n} ∈ Fn. Thus {σ ≤ τ} ∈ Fσ ∩
Fτ. As for {σ = τ}, it is even easier: {σ = τ} ∩ {σ = n} = {σ = n} ∩ {τ = n} ∈ Fn.
(viii) As σ∧τ is a stopping time and σ∧τ ≤ σ, σ∧τ ≤ τ, it follows that F(σ∧τ) ⊂ Fσ
∩ Fτ. Conversely, if A ∈ Fσ ∩ Fτ, then A ∩ {σ∧τ ≤ n} = (A ∩ {σ ≤ n}) ∪
(A ∩ {τ ≤ n}) ∈ Fn, hence A ∈ F(σ∧τ). As both σ ≤ σ∨τ and τ ≤ σ∨τ,
Fσ ∪ Fτ ⊂ F(σ∨τ) ⇒ σ(Fσ ∪ Fτ) ⊂ F(σ∨τ). Conversely, A ∈ F(σ∨τ) ⇒ A = (A ∩ {τ ≤ σ}) ∪ (A ∩ {σ ≤ τ}).
The first set is in Fσ and the second one in Fτ (by (vi)), hence their union is in σ(Fσ ∪
Fτ).

Property 2.2. If X is adapted, then Xτ is Fτ-measurable.

Proof. Let B be a Borel subset of R. Then Xτ⁻¹(B) = {Xτ(ω) ∈ B} =
∪_{n=1}^{∞} {Xτ ∈ B, τ = n} ∪ {Xτ ∈ B, τ = ∞} = ∪_{n=1}^{∞} (Xn⁻¹(B) ∩ {τ = n}) ∪ (ξ⁻¹(B) ∩ {τ = ∞}).
We have to check that Xτ⁻¹(B) ∈ Fτ, meaning
that Xτ⁻¹(B) ∩ {τ = n} ∈ Fn ∀n. But the above computation shows that Xτ⁻¹(B) ∩ {τ = n} =
Xn⁻¹(B) ∩ {τ = n}; as Xn is Fn-measurable, Xn⁻¹(B) ∈ Fn, hence, by the very definition
of a stopping time, {τ = n} ∈ Fn ⇒ Xτ⁻¹(B) ∩ {τ = n} ∈ Fn for finite n. If n = ∞, it is
the same.
Property 2.3. A formula to compute E(f|Fτ). The following equality
holds. If f ∈ L¹ then
(2.2) E(f|Fτ) = Σ_{n=1}^{∞} E(f|Fn)·1_{τ=n} + E(f|F∞)·1_{τ=∞}

Proof. Let Y be the right-hand term of (2.2). By the same reasoning as before, Y
is Fτ-measurable. Let A ∈ Fτ. The task is to prove that E(f1A) = E(Y1A). But
E(Y1A) = E( Σ_{n=1}^{∞} E(f|Fn)·1_{{τ=n}∩A} + E(f|F∞)·1_{{τ=∞}∩A} )
= E( Σ_{n=1}^{∞} E(f·1_{{τ=n}∩A}|Fn) + E(f·1_{{τ=∞}∩A}|F∞) )
= Σ_{n=1}^{∞} E(E(f·1_{{τ=n}∩A}|Fn)) + E(E(f·1_{{τ=∞}∩A}|F∞)) = Σ_{n=1}^{∞} E(f·1_{{τ=n}∩A}) + E(f·1_{{τ=∞}∩A})
= E(f·1_{{τ<∞}∩A}) + E(f·1_{{τ=∞}∩A}) = E(f1A).
Notice that we have commuted the sum with the expectation due to Lebesgue's
dominated convergence theorem. Indeed, if gn = Σ_{k=1}^{n} E(f·1_{{τ=k}∩A}|Fk), then |gn| ≤
Σ_{k=1}^{n} E(|f|·1_{{τ=k}∩A}|Fk) (Jensen's inequality for the convex function s ↦ |s|!) ≤ g, where g =
Σ_{n=1}^{∞} E(|f|·1_{τ=n}|Fn) (by Beppo-Levi!), and g ∈ L¹ since Eg = Σ_{n=1}^{∞} E(|f|·1_{τ=n}) = E(|f|·1_{τ<∞}) (again by
Beppo-Levi!) ≤ E(|f|) < ∞.


Property 2.4. A stopped martingale (sub-, super-) is again a martingale
(sub-, super-). Precisely, if τ is a stopping time and X is a sequence of
random variables, the sequence Y defined by
(2.3) Yn = X(τ∧n)
is called the stopped sequence of X at τ.
The claim is that by stopping a martingale (submartingale, supermartingale) one
gets another martingale (submartingale, supermartingale) with respect to the same
filtration.
Proof. Let τ be a stopping time and Bn = 1_{τ > n} = 1_{n < τ} for n ≥ 1 and B0 = 1. Due to
the definition of a stopping time, B is an adapted sequence. Let X be an adapted
sequence. Then (B·X)n = X(τ∧n). Indeed, if τ(ω) = n, n ≥ 2, then Bk(ω) = 1 if k < n
and = 0 if k ≥ n. Let k ≤ n. Then (B·X)k(ω) = (B0X1 + B1(X2 − X1) + B2(X3 − X2) + ⋯ +
B(k−1)(Xk − X(k−1)))(ω) = (X1 + (X2 − X1) + (X3 − X2) + ⋯ + (Xk − X(k−1)))(ω) = Xk(ω). If k >
n, then (B·X)k(ω) = (X1 + B1(X2 − X1) + ⋯ + B(n−1)(Xn − X(n−1)) + Bn(Xn+1 − Xn) +
⋯ + B(k−1)(Xk − X(k−1)))(ω) = (X1 + (X2 − X1) + ⋯ + (Xn − X(n−1)) + 0·(Xn+1 − Xn) + ⋯ + 0·(Xk −
X(k−1)))(ω) = Xn(ω). If n = 1, then (B·X)1 = B0X1 = X1 = X(τ∧1) holds in this case,
too.
So this property is a consequence of Property 1.5.


Property 2.5. Optional stopping. If σ, τ are bounded stopping times and σ ≤ τ,
then
(2.4) E(Xτ|Fσ) ≤ Xσ if X is a supermartingale
(2.5) E(Xτ|Fσ) = Xσ if X is a martingale and
(2.6) E(Xτ|Fσ) ≥ Xσ if X is a submartingale
Proof. Let A ∈ Fσ. Consider the stopping times σA and τA defined in
Property 2.1 (iv). Let Bn = 1_{{σ ≤ n < τ}∩A} = 1_{σA ≤ n < τA} = 1_{n < τA} − 1_{n < σA}, B0 = 0. Suppose that X is a
supermartingale. Then
(2.7) (B·X)n = (X(τA∧n) − X(σA∧n))·1A
is again a supermartingale, according to Property 2.4. It means that
(2.8) E((B·X)n) ≤ E((B·X)1) = E(B0X1) = 0
since B0 = 0. We assumed that σ and τ are bounded. Let n → ∞. From (2.7) we see
that (B·X)n = (Xτ − Xσ)·1A for n large enough, and (2.8) implies that
(2.9) E((Xτ − Xσ)·1A) ≤ 0 ∀A ∈ Fσ.
Let Y = E(Xτ − Xσ|Fσ). By the definition of the conditioned expectation, E((Xτ −
Xσ)·1A) = E(Y1A) ∀A ∈ Fσ. But Y is itself Fσ-measurable, hence from (2.9) Y ≤ 0.
Meaning that E(Xτ − Xσ|Fσ) ≤ 0, which further implies E(Xτ|Fσ) − E(Xσ|Fσ) ≤ 0 ⇒
E(Xτ|Fσ) ≤ Xσ, as by Property 2.2 we know that Xσ is Fσ-measurable. Notice that
as τ is finite, we do not need an extra random variable to define Xτ. We have
proved the inequality (2.4). The proof holds also for (2.5) and (2.6), changing the
hypothesis that X is a supermartingale to martingale and submartingale.
Corollary 2.6. Let (τn)n≥1 be an increasing sequence of bounded stopping
times. Let Yn = X(τn) and Gn = F(τn). Suppose that X is a supermartingale (martingale,
submartingale). Then Y is a supermartingale (martingale, submartingale) too, with
respect to the new filtration (Gn)n≥1.
Corollary 2.7. Let X be a supermartingale (martingale, submartingale)
and τ be a bounded stopping time. Then EX1 ≥ EXτ (EX1 = EXτ, EX1 ≤ EXτ).
Proof. Of course, since 1 ≤ τ. Apply Property 2.5 with σ = 1.
Counterexample. If τ is finite but not bounded, that may not be true.
For example, take X to be the martingale from Example 4. Let An = (1/(n+1), 1/n]. Then F1 is
trivial and for n ≥ 2, Fn is the σ-algebra generated by the sets A1, …, A(n−1). Let τ =
Σ_{n=1}^{∞} (n+1)·1_{An}. As An ∈ Fn+1, τ is a stopping time and Xτ = 0. Therefore it is not
true that EXτ = EX1.


But sometimes it is true.
Definition. Let τ be a finite stopping time. Then τ is called regular if
X(τ∧n) → Xτ in L¹ as n → ∞.
Corollary 2.8. Suppose that σ, τ are regular stopping times and σ ≤ τ.
Then the assertions (2.4)-(2.6) still hold.

Proof. We shall prove only (2.4), the other two assertions have the same
proof. Of course X(τ∧n) ∈ L¹ (since |X(τ∧n)| ≤ Σ_{j=1}^{n} |Xj| ∈ L¹) and, as ‖Xτ − X(τ∧n)‖₁ → 0,
it means that Xτ is in L¹, too. The same holds for Xσ. But we know that E(X(τ∧n)|F(σ∧n))
≤ X(σ∧n) for any n. Recalling the definition of the conditioned expectation, that
means that E(X(τ∧n)·1A) ≤ E(X(σ∧n)·1A) ∀A ∈ F(σ∧n), n fixed. As F(σ∧n) ⊂ F(σ∧(n+k)) for k ≥ 0, it
follows that E(X(τ∧(n+k))·1A) ≤ E(X(σ∧(n+k))·1A) ∀A ∈ F(σ∧n), n fixed, for any k ≥ 1. Letting k →
∞ and keeping in mind that fn → f in L¹ ⇒ E(fn1A) → E(f1A) ∀A, it follows that
E(Xτ1A) ≤ E(Xσ1A) ∀A ∈ F(σ∧n), n fixed. Let A = ∪_{n=1}^{∞} F(σ∧n). Then A is an algebra of
sets from Fσ and σ(A) = Fσ (since A ∈ Fσ ⇒ A = ∪_{n=1}^{∞} A ∩ {σ ≤ n} and the sets
A ∩ {σ ≤ n} belong both to Fσ (from Property 2.1(vi)) and to Fn ⇒ A ∩ {σ ≤ n} ∈ Fσ
∩ Fn = F(σ∧n)). Moreover, we checked that E(Xτ1A) ≤ E(Xσ1A) ∀A ∈ A ⇒ E(Xτ1A)
≤ E(Xσ1A) ∀A ∈ σ(A) ⇒ E(Xτ1A) ≤ E(Xσ1A) ∀A ∈ Fσ, which of course is the same as
the claim (2.4).

We shall give some sufficient conditions to ensure the regularity of a
stopping time.
For the semimartingales of the form
(2.10) Xn = ξ1 + ξ2 + ⋯ + ξn, where (ξn)n are i.i.d. from L¹,
there is a simple condition.
Proposition 2.9. The Wald condition. Any stopping time τ with finite
expectation Eτ is regular for the semimartingale defined by (2.10). As a
consequence, if Eξ1 = 0, then EXτ = 0.
Proof. We shall prove that E|Xτ − X(τ∧n)| → 0. But E|Xτ − X(τ∧n)| = E|(Xn+1 −
Xn)·1_{τ=n+1} + (Xn+2 − Xn)·1_{τ=n+2} + ⋯| = E|ξn+1·1_{τ=n+1} + (ξn+1 + ξn+2)·1_{τ=n+2} + (ξn+1 + ξn+2 + ξn+3)·1_{τ=n+3} +
⋯| = E|ξn+1·1_{τ>n} + ξn+2·1_{τ>n+1} + ξn+3·1_{τ>n+2} + ⋯| ≤ E( Σ_{k=0}^{∞} |ξ(n+k+1)|·1_{τ>n+k} ).
Now E(|ξ(n+k+1)|·1_{τ>n+k}) = E(E(|ξ(n+k+1)|·1_{τ>n+k}|F(n+k))) = E(E(|ξ(n+k+1)| |F(n+k))·1_{τ>n+k}) (since τ is a
stopping time!) = E(E|ξ(n+k+1)|·1_{τ>n+k}) (as ξ(n+k+1) is independent of F(n+k)) = a·P(τ > n+k),
with a = E|ξ1| (as the ξn are identically distributed). Therefore E|Xτ − X(τ∧n)| ≤
Σ_{k=0}^{∞} a·P(τ > n+k). But Eτ = Σ_{k=0}^{∞} P(τ > k) < ∞ implies that limn Σ_{k=0}^{∞} a·P(τ > n+k) = 0.
Therefore τ is regular.

Corollary 2.10. Wald's identities. Let X be defined by (2.10) and τ be a
stopping time such that Eτ < ∞. Then
(2.11) EXτ = Eξ1·Eτ
and, if ξn ∈ L², then
(2.12) E((Xτ − τa)²) = (Eτ)·Var(ξ1), where a = Eξ1.
Proof. Let a = Eξ1. Then Yn = Xn − na is a martingale (of course, with
respect to its natural filtration!). As τ is regular, EYτ = 0 ⇒ E(Xτ − τa) = 0,
proving (2.11). For the second assertion, let σ² = Var(ξ1) and Zn = Yn² − nσ². Then Z
is a martingale. Indeed, E(Zn+1|Fn) = E(Yn² + 2(ξn+1 − a)Yn + (ξn+1 − a)² − nσ² − σ²|Fn) = Zn +
E(2(ξn+1 − a)Yn + (ξn+1 − a)² − σ²|Fn) = Zn + 2Yn·E(ξn+1 − a|Fn) + E((ξn+1 − a)²|Fn) − σ² = Zn
+ 2Yn·E(ξn+1 − a) + E(ξn+1 − a)² − σ² (since ξn+1 is independent of Fn!) = Zn (as E(ξn+1 − a)²
= σ²!). Moreover, EZn = 0. If we could prove that τ is regular for Z, then EZτ = 0 ⇒
E((Xτ − τa)² − τ·Var(ξ1)) = 0, which is exactly (2.12).
It means that the task is to prove that τ is regular for Z.
The trick is to prove that Y(τ∧n) → Yτ in L² as n → ∞. If so, that would
imply the convergence in L¹ of Y(τ∧n)² to Yτ² by Holder's inequality (notice that ‖f² −
g²‖₁ = E(|f−g|·|f+g|) ≤ ‖f−g‖₂·‖f+g‖₂). Let ηn = ξn − a. Notice that now Eηn = 0.
Then ‖Yτ − Y(τ∧n)‖₂² = E(Yτ − Y(τ∧n))² = E(ηn+1·1_{τ=n+1} + (ηn+1 + ηn+2)·1_{τ=n+2} + (ηn+1 + ηn+2 + ηn+3)·1_{τ=n+3}
+ ⋯)² = E(ηn+1·1_{τ>n} + ηn+2·1_{τ>n+1} + ηn+3·1_{τ>n+2} + ⋯)². Let yj = ηj·1_{τ>j−1}, considered in
the Hilbert space L², and Sn = y1 + y2 + ⋯ + yn.
Notice that i ≠ j ⇒ yi ⊥ yj.
(Indeed, if, say, i < j then ⟨yi, yj⟩ = E(ηiηj·1_{τ>i−1}·1_{τ>j−1}) = E(ηiηj·1_{τ>j−1}) =
E(E(ηiηj·1_{τ>j−1}|F(j−1))) = E(ηi·1_{τ>j−1}·E(ηj|F(j−1))) (as ηi and 1_{τ>j−1} are F(j−1)-measurable) =
E(ηi·1_{τ>j−1}·Eηj) (as ηj is independent of F(j−1)) = 0.)
On the other hand, the sequence Sn = Σ_{j=1}^{n} yj is convergent in the Hilbert space
L² to some limit y, because it is Cauchy and L² is complete: ‖S(n+k) − Sn‖₂² = ‖yn+1‖₂² +
⋯ + ‖yn+k‖₂² (due to orthogonality) = σ²(P(τ>n) + P(τ>n+1) + ⋯ + P(τ>n+k−1))
(since ‖ym‖₂² = E(ηm²·1_{τ>m−1}) = E(E(ηm²·1_{τ>m−1}|F(m−1))) = E(1_{τ>m−1}·E(ηm²)) = σ²·P(τ > m−1)!!)
≤ σ² Σ_{k≥n} P(τ > k) < ε if n is great enough, because Eτ = Σ_{k=1}^{∞} P(τ > k−1) < ∞.
After all, the conclusion is that ‖Yτ − Y(τ∧n)‖₂² = ‖y − Sn‖₂² → 0 as n → ∞,
meaning that Y(τ∧n)² → Yτ² in L¹ ⇒ Y(τ∧n)² − (τ∧n)σ² → Yτ² − τσ² in L¹ ⇒ Z(τ∧n) → Zτ
in L¹. So τ is regular for Z.
Remark. In statistics one uses Wald's identities in a slightly different
case: τ is a counting variable which is independent of the ξ's. We can see that case
as a particular one of ours as follows: let us extend the natural filtration with
the σ-algebra generated by τ. So Fn = σ(ξ1, ξ2, …, ξn, τ). Then X remains a
semimartingale with respect to the new filtration because E(Xn+1|Fn) = E(Xn + ξn+1|Fn)
= Xn + E(ξn+1|Fn) and ξn+1 is independent of Fn (the associativity of the
independence: if F1 (here σ(ξ1, …, ξn)), F2 (here σ(τ)) and F3 (here σ(ξn+1)) are
independent, then σ(F1 ∪ F2) is independent of F3).

Remark. One should not believe that automatically any stopping time with
finite expectation is regular. For instance, if Xn = n² (this is a submartingale!)
and τ is such that Eτ < ∞ but Eτ² = ∞, then Xτ = τ² is not even in L¹, in spite of
the fact that the Xn, being constants, are in L¹. So X(τ∧n) cannot converge in L¹!

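A Monte Carlo check of (2.11) and (2.12) (the step law and barriers are illustrative assumptions): take i.i.d. ±1 steps ξn with mean a and let τ be the exit time of a bounded interval, which has finite expectation.

import numpy as np

rng = np.random.default_rng(8)
p = 0.55                                    # P(ξ = +1); a = 2p - 1
a, var = 2 * p - 1, 1 - (2 * p - 1) ** 2    # Eξ1 and Var(ξ1)
n_paths = 20_000

X_tau, tau = np.empty(n_paths), np.empty(n_paths)
for i in range(n_paths):
    x, n = 0, 0
    while -20 < x < 20:                     # τ = first exit time of (-20, 20)
        x += 1 if rng.random() < p else -1
        n += 1
    X_tau[i], tau[i] = x, n

print(X_tau.mean(), a * tau.mean())                       # (2.11) EXτ = Eξ1·Eτ
print(((X_tau - a * tau) ** 2).mean(), var * tau.mean())  # (2.12)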

3. An application: the ruin problem.

There are two players, A and B, playing a game. The first one has a
capital of a euro, the second one b euro (a, b positive integers). If A wins, he
gains 1 euro; if B wins, he loses 1 euro. They decide to play the game until the
ruin, i.e. until one of them loses all his money. Let τ be the ruin time, that is,
the number of games after which the game stops. We want to find the probability
that A wins and the expectation of τ.
Suppose that the probability that A wins a game is p. Let q be the
probability of a draw and r the probability that B wins. To avoid trivialities we
accept that p, r ≠ 0. Let ξn be the gain of A at the nth game. So ξn takes the
values −1, 0, 1 with probabilities r, q, p.
Thus
(3.1) μ := Eξ1 = p − r
and, as Eξ1² = p + r,
(3.2) σ² := Var(ξ1) = p + r − (p − r)² = p(1 − p) + r(1 − r) + 2pr.
We accept that the ξ's are independent. Let Xn = ξ1 + ⋯ + ξn. This is the gain
of the first player after playing n games.
The game stops the first time when Xn = b (in this case B is ruined) or
Xn = −a (now A has lost all his money). So τ = inf {n | Xn = b or Xn = −a}.
Let (Fn)n be the natural filtration.
Remark first that τ < ∞ a.s. That is, P(−a < Xn < b for any n) = 0.
Indeed, if μ ≠ 0 the law of large numbers says that Xn/n → μ a.s. ⇒
Xn/n → μ in probability. So P(n(μ−ε) ≤ Xn ≤ n(μ+ε)) → 1 as n → ∞ for any ε
> 0. We infer that Xn → ∞ if μ > 0 and Xn → −∞ if μ < 0. In both cases P(−a <
Xn < b for any n) = 0.
If μ = 0, the Central Limit Theorem asserts that Xn/(σ√n) → N(0,1) in
distribution. Therefore P(−a < Xn < b for any n) ≤ P(−a < Xn < b) = P(−a/(σ√n) <
Xn/(σ√n) < b/(σ√n)) ≤ P(−ε < Xn/(σ√n) < ε) (for n great enough) → N(0,1)((−ε,ε)) for
any ε > 0. As the normal distribution is absolutely continuous, the quantity
N(0,1)((−ε,ε)) can be made arbitrarily small. So P(τ = ∞) = P(−a < Xn < b for any
n) = 0 in this case, too.
Why is Eτ < ∞?
There exists a direct proof, but it is pretty sophisticated. Here is an
indirect one.
Let Yn = Xn − nμ. Then (Yn)n is a martingale and EYn = 0. Then E(Y(τ∧n)) = 0,
since any bounded stopping time (in our case τ∧n) is regular. It means that E(X(τ∧n))
= μ·E(τ∧n) ∀n. But the right-hand term converges to μ·Eτ, by Beppo-Levi. The left-hand
one is bounded between −a and b, since −a ≤ X(τ∧n) ≤ b, hence the limit EXτ =
E(a.s. lim X(τ∧n)) is finite, and so is Eτ.
The trick holds if μ ≠ 0.
If μ = 0 (this happens iff p = r!), let us consider the martingale Zn = Xn² −
nσ². It also has null expectation: EZn = 0. Meaning that E(X(τ∧n)²) = σ²·E(τ∧n). The
argument is the same, because the sequence (X(τ∧n)²)n is bounded between 0 and a² ∨ b².
Then the result is
(3.3) Eτ = EXτ² / σ²
Let us consider first the case μ ≠ 0. We know that
(3.4) Eτ = (EXτ)/μ = (EXτ)/(p − r)
The only problem is to compute EXτ. Notice that Xτ = b·1A − a·1B where A is the
event "A wins" and B means "B wins". Thus
(3.5) EXτ = b·P(A) − a·P(B).
Let us consider the new sequence Un = t^{Xn}, t > 0. Then E(Un+1|Fn) = E(t^{Xn+1}|Fn)
= E(t^{Xn}·t^{ξn+1}|Fn) = t^{Xn}·E(t^{ξn+1}|Fn) (as t^{Xn} is Fn-measurable) =
Un·E(t^{ξn+1}) (since t^{ξn+1} is independent of Fn) = Un(pt + q + r/t). Choose t ≠ 1 such that
pt + q + r/t = 1 ⇔ pt + r/t = p + r ⇔ t = r/p. Then Un is a martingale and EUn = 1 ⇒
EU(τ∧n) = 1 by Corollary 2.7. Therefore E t^{X(τ∧n)} = 1 for any n. As X(τ∧n) → Xτ a.s. and the
sequence X(τ∧n) is bounded, the sequence (t^{X(τ∧n)})n is bounded, too, and converges a.s. to
Uτ. By Lebesgue's domination principle, U(τ∧n) converges in L¹ to Uτ, hence EUτ =
limn EU(τ∧n) = 1. But EUτ = t^b·P(A) + t^{−a}·P(B) = 1 ⇒ P(A)(t^b − t^{−a}) = 1 − t^{−a}. Therefore
we find
(3.6) P(A wins) = (1 − t^{−a})/(t^b − t^{−a}) = (t^a − 1)/(t^{a+b} − 1), with t = r/p,
which, replaced in (3.5) and (3.4), gives us the possibility to compute Eτ.
In the case μ = 0 we have p = r. Now Xn is a martingale itself, hence EXτ = 0,
as τ is regular. Replacing in (3.5) we see that
(3.7) P(A) = P(A wins) = a/(a+b)
which implies that EXτ² = b²·a/(a+b) + a²·b/(a+b) = ab, which, replaced in (3.3), gives us
Eτ = ab/σ², or
(3.8) Eτ = ab/(2p)
Notice that if there are no draws, Eτ = ab; the win probabilities do not change.
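A Monte Carlo sketch of the fair, draw-free case p = r = 1/2 (the capitals and run count are illustrative assumptions), checking (3.7) and Eτ = ab:

import numpy as np

rng = np.random.default_rng(9)
a, b, n_runs = 3, 7, 20_000

wins, lengths = 0, np.empty(n_runs)
for i in range(n_runs):
    x, n = 0, 0
    while -a < x < b:                       # play until one player is ruined
        x += rng.choice((-1, 1))
        n += 1
    wins += (x == b)                        # A wins when the gain reaches b
    lengths[i] = n

print(wins / n_runs, a / (a + b))           # ≈ 0.3   (3.7)
print(lengths.mean(), a * b)                # ≈ 21    Eτ = ab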


Convergence of martingales
1. Maximal inequalities

Let (Ω, K, P, (Fn)n≥1) be a stochastic basis and X = (Xn)n be an adapted
sequence of random variables. The random variable X* := sup{|Xn|; n ≥ 1} is
called the maximal variable of X. A maximal inequality is any inequality
concerning X*.
We shall also denote by X*n the random variable max(|X1|, |X2|, …, |Xn|). Thus
X* = limn X*n = supn X*n.
There are many ways to organize the material: we adopted that of Jacques
Neveu (Martingales à temps discret, Masson 1972).
We start with a result concerning the combination of two supermartingales.
Proposition 1.1. Let (Xn)n and (Yn)n be two supermartingales and let τ be a
stopping time. Suppose that
(1.1) τ < ∞ ⇒ Xτ ≥ Yτ. Define Zn = Xn·1_{n < τ} + Yn·1_{n ≥ τ}.
Then Z is again a supermartingale.
Proof. The task is to prove that E(Zn+1|Fn) ≤ Zn.
But Zn = Xn·1_{n < τ} + Yn·1_{n ≥ τ} ≥ 1_{n < τ}·E(Xn+1|Fn) + 1_{n ≥ τ}·E(Yn+1|Fn) (as X and Y are
supermartingales!) = E(Xn+1·1_{n < τ}|Fn) + E(Yn+1·1_{n ≥ τ}|Fn) (since τ is a stopping
time, both sets are in Fn!) = E(Xn+1·1_{n < τ} + Yn+1·1_{n ≥ τ}|Fn) = E(Xn+1·1_{n+1 < τ} + Xn+1·1_{τ = n+1} +
Yn+1·1_{n ≥ τ}|Fn) ≥ E(Xn+1·1_{n+1 < τ} + Yn+1·1_{τ = n+1} + Yn+1·1_{n ≥ τ}|Fn) (since Xτ ≥ Yτ, hence τ = n+1 ⇒
Xn+1 ≥ Yn+1!) = E(Xn+1·1_{n+1 < τ} + Yn+1·1_{n+1 ≥ τ}|Fn) = E(Zn+1|Fn).

Corollary 1.2. Maximal inequality for non-negative supermartingales.
The following inequality holds if X is a non-negative supermartingale:
(1.2) P(X* > a) ≤ EX1/a
Proof. Let us consider the stopping time
(1.3) τ = inf {n | Xn > a} (convention: inf ∅ = ∞!)
Remark the obvious fact that X* > a ⇔ τ < ∞.
In the previous proposition we take Xn to be our supermartingale X itself
and Yn = a (any constant is of course a martingale). The condition (1.1) is
fulfilled since τ < ∞ ⇒ Xτ > a. It means that Zn = Xn·1_{n < τ} + a·1_{n ≥ τ} is a
supermartingale, hence EZn ≤ EZ1 = E(X1·1_{τ > 1} + a·1_{τ = 1}) ≤ EX1 (since τ = 1 ⇒ Xτ = X1 > a).
As a·1_{τ ≤ n} ≤ Zn, it means that a·P(τ ≤ n) ≤ EZn ⇒ P(τ ≤ n) ≤ EZn/a ≤ EX1/a. Therefore P(τ <
∞) = P(∪n {τ ≤ n}) = limn P(τ ≤ n) (since the sets increase!) ≤ EX1/a. As a
consequence, P(X* > a) ≤ EX1/a.

Corollary 1.3. If X is a non-negative supermartingale, then X* < ∞ a.s.

Proof. P(X* = ∞) ≤ P(X* > a) ≤ EX1/a → 0 as a → ∞.
It follows that for almost all ω the sequence (Xn(ω))n is bounded.
We shall prove now a maximal inequality for the submartingales.
Proposition 1.4. Let X be a submartingale. Then
(1.4) P(X* > a) ≤ supn{E|Xn|}/a
(1.5) a·P(X*n > a) ≤ E(|Xn|·1_{X*n > a})
Proof. Let m = supn E|Xn|, let a > 0 and let Yn = |Xn|. Then Y is another
submartingale, by Jensen's inequality, hence m = limn E|Xn|. Let
(1.6) τ = inf {n | Yn > a} (inf ∅ := ∞!)
Then the stopped sequence (Y(τ∧n))n remains a submartingale (any bounded
stopping time is regular!) and Y(τ∧n) ≥ a·1_{τ ≤ n} + Yn·1_{τ > n}. (Indeed, by the very
definition of τ, τ < ∞ ⇒ Yτ > a!)
It follows that a·1_{τ ≤ n} ≤ Y(τ∧n) ⇒ a·P(τ ≤ n) ≤ EY(τ∧n) ≤ EYn ≤ m (the stopping
theorem applied to the pair of regular stopping times τ∧n and n!). It means that
P(τ ≤ n) ≤ m/a for any n, hence P(τ < ∞) ≤ m/a. But clearly {τ < ∞} = {X* > a}.
The second inequality comes from the remark that τ ≤ n ⇒ X*n > a. So
a·1_{τ ≤ n} ≤ Y(τ∧n)·1_{τ ≤ n} ⇒ a·P(τ ≤ n) ≤ E(Y(τ∧n)·1_{τ ≤ n}) ≤ E(Yn·1_{τ ≤ n}) (as τ∧n ≤ n ⇒ Y(τ∧n) ≤
E(Yn|F(τ∧n)) by the stopping theorem ⇒ E(Y(τ∧n)·1A) ≤ E(Yn·1A) ∀A ∈ F(τ∧n); our A is {τ ≤
n}!). Recalling that {τ ≤ n} = {X*n > a} we discover that a·P(X*n > a) ≤ E(Yn·1_{X*n > a})
= E(|Xn|·1_{X*n > a}), which is exactly (1.5).
We shall prove now another kind of maximal inequalities, concerned with
‖X*‖: the so-called Doob inequalities.


Proposition 1.5. Let X be a martingale.
(i) Suppose that Xn ∈ Lᵖ ∀n for some 1 < p < ∞. Let q = p/(p−1) be the
Holder conjugate of p. Then
(1.7) ‖X*‖p ≤ q·supn ‖Xn‖p
(ii) If the Xn are only in L¹, then
(1.8) ‖X*‖₁ ≤ (e/(e−1))·(1 + supn E(|Xn|·ln⁺|Xn|))
Proof.
(i) Recall the following trick when dealing with non-negative random
variables: if f: [0,∞) → R is differentiable and X > 0, then Ef(X) = f(0) +
∫_0^∞ f′(t)·P(X > t) dt.
If f(x) = xᵖ the above formula becomes EXᵖ = ∫_0^∞ p·t^{p−1}·P(X > t) dt.
Now write (1.5) as t·P(X*n > t) ≤ E(Yn·1_{X*n > t}) (with Yn = |Xn|) and multiply it by p·t^{p−2}. We
obtain p·t^{p−1}·P(X*n > t) ≤ p·t^{p−2}·E(Yn·1_{X*n > t}). Integrating, one gets
E(X*nᵖ) = ∫_0^∞ p·t^{p−1}·P(X*n > t) dt ≤ ∫_0^∞ p·t^{p−2}·E(Yn·1_{X*n > t}) dt
= (p/(p−1)) ∫ Yn ( ∫_0^{X*n} (p−1)·t^{p−2} dt ) dP (we applied Fubini, the integrand being non-negative)
= q ∫ Yn·(X*n)^{p−1} dP = q·E(Yn·(X*n)^{p−1}) ≤ q·‖Yn‖p·‖(X*n)^{p−1}‖q (Holder!).
But ‖(X*n)^{p−1}‖q = ( ∫ (X*n)^{(p−1)q} dP )^{1/q} = ( ∫ (X*n)ᵖ dP )^{1/q} = (E(X*nᵖ))^{1/q}, hence we
obtained the inequality E(X*nᵖ) ≤ q·‖Yn‖p·(E(X*nᵖ))^{1/q}, or
(1.9) ‖X*n‖p = (E(X*nᵖ))^{1−1/q} ≤ q·‖Yn‖p = q·‖Xn‖p ∀n.
As a consequence, ‖X*n‖p ≤ q·supk ‖Xk‖p ∀n. But (X*n)n is an increasing
sequence of non-negative random variables. By Beppo-Levi we see that ‖X*‖p =
limn ‖X*n‖p ≤ q·supk ‖Xk‖p.
(ii) Look again at (1.5), written as P(X*n > t) ≤ (1/t)·E(Yn·1_{X*n > t}). Integrate that
from 1 to ∞:
∫_1^∞ P(X*n > t) dt ≤ ∫_1^∞ (1/t)·E(Yn·1_{X*n > t}) dt = ∫ Yn ( ∫_1^∞ (1/t)·1_{(0,X*n)}(t) dt ) dP = ∫ Yn·ln⁺(X*n) dP
(since ∫_1^b dt/t = ln b if b ≥ 1 and the inner integral vanishes elsewhere). In short,
(1.10) ∫_1^∞ P(X*n > t) dt ≤ E(Yn·ln⁺(X*n))
Now look at the right-hand term of (1.10). The integrand is of the form
a·ln⁺b. As a·lnb = a·ln(a·(b/a)) = a·lna + a·ln(b/a), and x > 0 ⇒ lnx ≤ x/e, it follows that a·lnb ≤
a·lna + a·(b/(ae)) = a·lna + b/e. The inequality holds with x·lnx replaced by x·ln⁺x: if
b > 1, then a·ln⁺b = a·lnb ≤ a·lna + b/e ≤ a·ln⁺a + b/e, and if b ≤ 1, then a·ln⁺b = 0 ≤
a·ln⁺a + b/e. We got the elementary inequality
(1.11) a·ln⁺b ≤ a·ln⁺a + b/e, ∀a, b ≥ 0
Using (1.11) in (1.10) one gets ∫_1^∞ P(X*n > t) dt ≤ E(Yn·ln⁺Yn) + EX*n/e.
Now we are close enough to (1.8) because EX*n = ∫_0^∞ P(X*n > t) dt ≤ 1 +
∫_1^∞ P(X*n > t) dt ≤ 1 + E(Yn·ln⁺Yn) + EX*n/e, implying that (1 − e⁻¹)·EX*n ≤ 1 + E(Yn·ln⁺Yn) ∀n.
Remark that the sequence (E(Yn·ln⁺Yn))n is non-decreasing, since (Yn·ln⁺Yn)n is a
submartingale due to the convexity of the function x ↦ x·ln⁺x and Jensen's inequality. Be as
it may, it is clear now that (1 − e⁻¹)·EX*n ≤ 1 + supk E(Yk·ln⁺Yk), which implies (1.8)
letting n → ∞.
Remark. If sup ‖Xn‖p < ∞, we say that X is bounded in Lᵖ. Doob's
inequalities point out that if p > 1 and X is bounded in Lᵖ then X* is in Lᵖ.
However, this doesn't hold for p = 1: if X is bounded in L¹, X* may not be in L¹. A
counterexample is the martingale from Example 4 of the previous lesson. If we want X*
to be in L¹, it means that we want X to be bounded in L·ln⁺L, meaning the condition from
(1.8).

2.

Almost sure convergence of semimartingales

We begin with the convergence of the non-negative supermartingales.


If X is a non-negative supermartingale, we know from Corollary 1.3 that X* <
a.s, that is, the sequence (Xn)n is bounded a.s. . So lim inf Xn - , lim sup Xn
+. In this case the fact that (Xn())n diverges is the same with the following
claim:
(2.1) There exist a,b rationale numbers, 0 < a < b such that the set {n Xn() <
a and Xn+k() > b for some k > 0} is infinite
Indeed, (Xn())n diverges : = lim inf Xn() < lim sup Xn() := , 0
< < ., then some subsequence of (Xn())n converges to and other subsequence
converges to ; so for any rationales a,b such that < a < b < the first
subsequence is smaller than a and the second is greater than b.
Let us fix a,b Q+, a < b and consider the following sequence of random
variables:
1() = inf { n Xn() < a}; 2() = inf { n > 1() Xn() > b} ..
2n-1() = inf { n > 2n-2() Xn() < a}; 2n() = inf { n > 2n-1() Xn() >
b}
(always with the convention inf = !) . Then it is easy to see that n are
stopping times. Indeed, it is an induction: 1 is a stopping time and {k+1 = n} =
{ k = j,Xj+1 B , , Xn-1 B, XnB} Fn (since the first set is Fj Fn),

U
j< n

where B = (b,) if k is odd and B = (-,a) if k is even.


Let a,b() = max{n 2k() < }. Then a,b means the number of times the
sequence X() crossed the interval (a,b) (the number of upcrossings)
The idea of the proof (belonging to Dubins) is that the sequence X() is
convergent iff a,b() is finite for any a,b Q+.
Notice the crucial fact that

2k() <
(2.2)
a,b() k
Lemma 2.1. The bounded sequence Xn is convergent iff a,b < a.s.

A.BENHARI

-210-

a,b Q+, a < b.


Proof. Let E = {(Xn())n is divergent}. Then E a,b Q+, a <
b such that a,b() = . In other words E =
U { a,b = } . Clearly P(E) = 0
a ,bQ + ,a <b

P(a,b = ) = 0 a < b, a,b Q+.


Proposition 2.2 (Dubins' inequality)
(2.3) P(Λ(a,b) ≥ k) ≤ (a/b)^k

Proof.
Let k be fixed and define the sequence Z of random variables as follows:
Zn(ω) = 1 if n < τ1(ω)
= Xn/a if τ1(ω) ≤ n < τ2(ω) (notice that X(τ1)/a < 1!)
= X(τ2)/a if τ2(ω) ≤ n < τ3(ω) (notice that X(τ2)/a > b/a!)
= (b/a)·(Xn/a) if τ3(ω) ≤ n < τ4(ω) (notice that (b/a)·(X(τ3)/a) < b/a < X(τ2)/a!)
= (b/a)·(X(τ4)/a) if τ4(ω) ≤ n < τ5(ω) (notice that (b/a)·(X(τ4)/a) > (b/a)²!)
…
= (b/a)^{k−1}·(Xn/a) if τ(2k−1)(ω) ≤ n < τ(2k)(ω) (notice that (b/a)^{k−1}·(X(τ(2k−1))/a) < (b/a)^{k−1}!)
= (b/a)^{k−1}·(X(τ(2k))/a) if τ(2k)(ω) ≤ n (notice that (b/a)^{k−1}·(X(τ(2k))/a) > (b/a)^k!)
Because the constant sequences and the sequences Y(j)n = (b/a)^{j−1}·(Xn/a) are
non-negative supermartingales, and we took care that at each combining moment τj
the jump be downward, we can apply Proposition 1.1 repeatedly, with the result
that Z is a non-negative supermartingale. Moreover, Zn ≥ (b/a)^k·1_{τ(2k) ≤ n}. Therefore
(b/a)^k·P(τ(2k) ≤ n) ≤ EZn ≤ EZ1 ≤ 1. We obtain the inequality P(τ(2k) ≤ n) ≤ (a/b)^k
∀n. Letting n → ∞, we get P(τ(2k) < ∞) ≤ (a/b)^k, which, corroborated with (2.2),
gives us (2.3).
Corollary 2.3. Any non-negative supermartingale X converges a.s. to a
random variable X∞ such that E(X∞|Fn) ≤ Xn. In words, we can add to X its tail X∞
such that (X, X∞) remains a supermartingale.

Proof. From (2.3) we infer that P(Λ(a,b) = ∞) = 0 ∀a < b positive rationals,
which, together with Lemma 2.1, implies the first assertion. The second one comes
from Fatou's lemma (see the lesson about conditioning!): E(X∞|Fn) =
E(liminfk Xn+k|Fn) ≤ liminfk E(Xn+k|Fn) ≤ Xn.
Remarks. 1. Example 4 points out that we cannot automatically replace
"non-negative supermartingale" with "non-negative martingale" to get a similar
result for martingales. In that example X∞ = 0 while EXn = 1. So (X, X∞), while a
supermartingale, is not a martingale.
2. Changing signs one gets a similar result for non-positive submartingales.
3. Example 5 points out that not all martingales converge. Rather the
contrary: if the ξn are i.i.d. such that Eξn = 0, then the martingale Xn = ξ1 + ⋯ + ξn
never converges, except in the trivial case ξn = 0. Use the CLT to check that!
We study now the convergence of the submartingales.
We study now the convergence of the submartingales.

Proposition 2.4. Let X be a submartingale with the property that supn E(Xn)⁺
< ∞. Then Xn converges a.s. to some X∞ ∈ L¹.
Proof. Let Yn = (Xn)⁺. As x ↦ x⁺ is convex and non-decreasing, Y is another
submartingale. Let Zp = E(Yp|Fn), p ≥ n. Then Zp+1 = E(Yp+1|Fn) = E(E(Yp+1|Fp)|Fn) ≥
E(Yp|Fn), hence (Zp)p≥n is non-decreasing. Let Mn = limp Zp.
We claim that (Mn)n is a non-negative martingale. First of all, EMn =
E(limp Zp) = limp E(Zp) (Beppo-Levi) = limp E(Yp) = supp E(Xp)⁺ < ∞ (as Y is a
submartingale). Therefore Mn ∈ L¹. Next, E(Mn+1|Fn) = E(limp E(Yp|Fn+1)|Fn) =
limp E(E(Yp|Fn+1)|Fn) (conditioned Beppo-Levi!) = limp E(Yp|Fn) = Mn. Thus M is a
martingale. Being non-negative, it has an a.s. limit, M∞, by Corollary 2.3.
Let Un = Mn − Xn.
Then U is a supermartingale and Un ≥ 0 (clearly, since Un = limp E(Yp|Fn)
− Xn = limp E(Yp − Xn|Fn) = limp E((Xp)⁺ − Xn|Fn) ≥ limp E(Xp − Xn|Fn) ≥ 0;
keep in mind that X is a submartingale!).
By Corollary 2.3, U has an a.s. limit, too. Denote it by U∞ ∈ L¹.
It follows that X = M − U is a difference between two convergent sequences.
As both M∞ and U∞ are finite, the meaning is that X has a limit itself, X∞ ∈ L¹.

Corollary 2.5. If X is a martingale, supn E(Xn)⁺ < ∞ is
equivalent to supn E(|Xn|) < ∞. In that case X has an
almost sure limit, X∞.
Proof. |x| = 2x⁺ − x ⇒ E(|Xn|) = 2E(Xn)⁺ − EXn. But EXn is a constant,
say a. Therefore supn E|Xn| = 2·supn E(Xn)⁺ − a.
Here is a very interesting consequence of this theory, a consequence that deals with
random walks.
Corollary 2.6. Let ξ = (ξn)n be i.i.d. r.v. from L∞. Let Sn = ξ1 + ⋯ + ξn, S0 = 0
and let m = Eξ1. Let a ∈ R and τ = τa be the hitting time of (a,∞), that is, τ =
inf {n | Sn > a}. Suppose that the ξn are not constants.
Then m ≥ 0 ⇒ τ < ∞ (a.s.).
The same holds for the hitting time of the interval (−∞,a).
Proof. If m > 0, it is simple. The sequence Sn converges a.s. to ∞ due
to the LLN (Sn/n → m > 0 ⇒ Sn → ∞!). The problem is if m = 0. In that case
let Xn = a − Sn. Then X is a martingale and EXn = a. If a < 0, τ = 0 and there is
nothing to prove. So we shall suppose that a ≥ 0. In this case X0 = a ≥ 0 and
(2.4) τ = inf{n | Xn < 0}.
Here is how we shall use the boundedness of the steps ξn. Let M = ‖ξ1‖∞.
Then |Xn − X(n−1)| ≤ M a.s.
The stopping theorem tells us that Y = (X(τ∧n))n is another martingale,
since every bounded stopping time (we mean τ∧n!) is regular. But Yn ≥ −M, since
for n < τ, Yn = Xn ≥ 0 (from (2.4)), and for n ≥ τ, Yn = Xτ = (Xτ − X(τ−1)) + X(τ−1) ≥ −M
+ 0 = −M. So Yn + M is another martingale, this time non-negative. By Corollary
2.3, Yn + M should converge a.s. Subtracting M, it follows that Yn → f for some f
∈ L¹. So X(τ∧n) → f ⇒ S(τ∧n) → a − f. Let E = {τ = ∞}. If ω ∈ E,
then a − f(ω) = lim Sn(ω), meaning that Sn(ω) is convergent.
Well, the sequence Sn diverges a.s.
Here is why: if (Sn)n were convergent, then it would be Cauchy. Thus |Sn+k −
Sn| < ε for great n. Hence |S(n+k) − Sn| < ε, |S(n+2k) − S(n+k)| < ε, |S(n+3k) − S(n+2k)| < ε,
…. But if the ξn are not constants, there exists a k such that P(|S(n+k) − Sn| < ε) = q < 1.
Then, as the above differences are i.i.d., P(|S(n+k) − Sn| < ε, |S(n+2k) − S(n+k)| < ε, |S(n+3k) −
S(n+2k)| < ε, …) = q·q·q·… = 0. So P({(Sn(ω))n is Cauchy}) = 0.
The only conclusion is that P(E) = 0.

3. Uniform integrability and the convergence of semimartingales in L¹
We want to establish conditions such that a martingale X converges to X∞
in L¹. In that case we shall call X a martingale with tail.
Proposition 3.1.
If X is a martingale and Xn → X∞ in L¹, then Xn = E(X∞|Fn).
Proof. From the definition of the conditioned expectation we see that
the claim is that E(Xn1A) = E(X∞1A) for any A ∈ Fn. But Xn+k → X∞ in L¹ as k → ∞ ⇒
E(Xn+k·1A) → E(X∞1A) as k → ∞. And E(Xn+k·1A) = E(E(Xn+k·1A|Fn)) = E(1A·E(Xn+k|Fn)) = E(1A·Xn).
Proposition 3.2. Conversely, if Xn = E(f|Fn) then Xn → E(f|F∞) both a.s.
and in L¹.
Proof. Let Z = E(f|F∞).
Suppose first that f ≥ 0. Then Xn is a non-negative martingale. According to
Corollary 2.3, X converges a.s. to some X∞ from L¹.
Step 1. If f is even bounded, f ≤ M, then Xn ≤ M too; hence X∞ ≤ M ⇒ |X∞
− Xn| ≤ 2M. By Lebesgue's domination criterion E|X∞ − Xn| → 0, thus Xn → X∞ in L¹.
Moreover, if A ∈ Fn then E(Xn+k·1A) → E(X∞1A), thus E(X∞1A) = limk E(E(Xn+k·1A|Fn)) =
limk E(1A·E(Xn+k|Fn)) = E(1A·Xn) (since X is a martingale!). It means that E(X∞|Fn)
= Xn. But E(Z|Fn) = E(E(f|F∞)|Fn) = E(f|Fn) = Xn. Therefore Z and X∞ are both
from L¹(F∞) and E(Z|Fn) = E(X∞|Fn) ∀n. As F∞ is generated by the union of all the Fn
and that union is an algebra, it follows that Z = X∞. We proved the claim if f is
bounded and non-negative.
Step 2. If f ≥ 0, let fa = f∧a. Let a be great enough such that ‖f − fa‖₁ < ε
for a given arbitrary ε. Then ‖E(f|F∞) − E(f|Fn)‖₁ ≤ ‖E(f|F∞) − E(fa|F∞)‖₁ +
‖E(fa|F∞) − E(fa|Fn)‖₁ + ‖E(fa|Fn) − E(f|Fn)‖₁ ≤ ‖f − fa‖₁ + ‖E(fa|F∞) −
E(fa|Fn)‖₁ + ‖fa − f‖₁ (due to the contractivity of the conditioned expectation,
see the lesson!) ≤ 2ε + ‖E(fa|F∞) − E(fa|Fn)‖₁. According to step 1, the second
term converges to 0 (as fa is bounded and non-negative). It follows that
limsupn ‖E(f|F∞) − E(f|Fn)‖₁ ≤ 2ε ⇒ E(f|Fn) → E(f|F∞) in L¹.
Step 3. f arbitrary. We write f = f⁺ − f⁻. Then E(f⁺|Fn) → E(f⁺|F∞) both a.s. and
in L¹, and the same holds for E(f⁻|Fn) → E(f⁻|F∞). Subtracting the two relations we
infer that E(f|Fn) → E(f|F∞) both a.s. and in L¹.
Remark. The result of Propositions 3.1 and 3.2 is that even if all the
martingales bounded in L¹ converge a.s., only the martingales of the form Xn =
E(f|Fn) have a tail, that is, converge to their a.s. limit in L¹.
Definition. Let X = (Xn)n be a sequence of random variables from L¹. We say
that X is uniformly integrable iff for any ε > 0 there exists an a = a(ε) such that
E(|Xn|·1_{|Xn|>a}) < ε ∀n.
Notice that we can write the condition from
the definition also as E(|Xn − φa(Xn)|) < ε ∀n, where φa(x) = (x∧a)∨(−a), or as
E(|Xn| − |Xn|∧a) < ε ∀n.

Proposition 3.3. If X is uniformly integrable, then X is bounded in L¹.

Proof. Let ε > 0 and a as in the definition. Then E|Xn| = E(|Xn|∧a + (|Xn| −
|Xn|∧a)) ≤ a + ε ∀n.

The importance of this concept is given by

Proposition 3.4. Let X be a sequence of r.v. from L¹. Suppose that Xn → X a.s.
Then Xn → X in L¹ iff X is uniformly integrable.

Proof. "⇒". Let ε > 0. Let a be such that ‖X − φa(X)‖₁ < ε/3. Let n(ε) be
such that n > n(ε) ⇒ ‖X − Xn‖₁ < ε/3. Then n > n(ε) ⇒ ‖Xn − φa(Xn)‖₁ ≤ ‖Xn
− X‖₁ + ‖X − φa(X)‖₁ + ‖φa(X) − φa(Xn)‖₁ ≤ ε/3 + ε/3 + ‖Xn − X‖₁ ≤ 3ε/3 = ε.
For n ≤ n(ε) let bn > 0 be such that ‖Xn − φ(bn)(Xn)‖₁ < ε. Finally, let A
= max{a, b1, b2, …, b(n(ε))}. Then E(|Xn| − |Xn|∧A) < ε ∀n.
"⇐". Let ε > 0 and a as in the definition of uniform integrability; from
Fatou we infer that X is in L¹, too, as E|X| = E(liminfn |Xn|) ≤ liminfn E(|Xn|) < ∞
(according to Proposition 3.3!). Let then a be chosen such that ‖X − φa(X)‖₁ < ε and
‖φa(Xn) − Xn‖₁ < ε ∀n.
Then
‖X − Xn‖₁ ≤ ‖X − φa(X)‖₁ + ‖φa(X) − φa(Xn)‖₁ + ‖φa(Xn) − Xn‖₁ = I
+ II + III. The first term is ‖X − φa(X)‖₁ < ε; the last one is ‖φa(Xn) −
Xn‖₁ < ε; as for the term II, Xn → X ⇒ φa(Xn) → φa(X) since φa is
continuous. But the sequence (φa(Xn))n is dominated by a, therefore ‖φa(X) −
φa(Xn)‖₁ → 0 as n → ∞ by Lebesgue's domination principle.
The conclusion is that limsupn ‖X − Xn‖₁ ≤ 2ε. And ε is arbitrary.
Corroborating with Propositions 3.1 and 3.2 we arrive at the following
conclusion:

Corollary 3.5. The only martingales with tail are the

uniform integrables ones.

How can we decide if a martingale is uniformly integrable?


Here is a very useful criterion.

Proposition 3.6. (The criterion of de la Vallée Poussin)

X is uniformly integrable ⇔ there exists a nondecreasing function
φ: [0,∞) → [0,∞) with the property that φ(t)/t → ∞ as t → ∞ such that
sup{E(φ(|Xn|)) : n} < ∞.
We can say that uniform integrability = boundedness in some φ which grows faster than x
at infinity. Actually we shall see that this function may be chosen to be even
convex.
Proof. '⇒': We shall first establish an auxiliary result:

Lemma 3.7. Let (an)n be an increasing sequence of positive integers.
Let σ(m) = #{n : an ≤ m} (thus σ(0) = 0 and σ(am) = m). The sequence
(σ(m))m is obviously non-decreasing and σ(∞) = ∞. Let

(3.1) φ(x) = ∫₀ˣ ( Σ_{m=0}^∞ σ(m)·1_{[m,m+1)}(t) ) dt.

Then
(3.2) φ is non-decreasing and convex;
(3.3) lim_{x→∞} φ(x)/x = ∞;
(3.4) if Y ≥ 0 is a random variable, then E(φ(Y)) ≤ Σ_{m=1}^∞ σ(m)·P(Y ≥ m).

Proof of the Lemma. As the sequence (σ(m))m is non-decreasing and nonnegative, the function
g(t) := Σ_{m=0}^∞ σ(m)·1_{[m,m+1)}(t) is also non-decreasing and nonnegative.
As φ(x) = ∫₀ˣ g(t) dt, φ is clearly convex and non-decreasing. Then the
function x ↦ φ(x)/x is non-decreasing, thus lim_{x→∞} φ(x)/x = lim_{m→∞} φ(m+1)/(m+1)
(here m is an integer!) = lim_{m→∞} (σ(1) + σ(2) + … + σ(m))/(m+1) = lim_{m→∞} σ(m)
(by Stolz-Cesàro!) = ∞. We have proved the claims (3.2) and (3.3).

As about the last one,

E(φ(Y)) = Σ_{m=0}^∞ E(φ(Y)·1_{m ≤ Y < m+1}) ≤ Σ_{m=0}^∞ E(φ(m+1)·1_{m ≤ Y < m+1}) (as φ is non-decreasing)
= Σ_{m=0}^∞ φ(m+1)·P(m ≤ Y < m+1) = Σ_{m=0}^∞ φ(m+1)·(P(Y ≥ m) − P(Y ≥ m+1))
= Σ_{m=0}^∞ φ(m+1)·P(Y ≥ m) − Σ_{m=1}^∞ φ(m)·P(Y ≥ m) = Σ_{m=1}^∞ (φ(m+1) − φ(m))·P(Y ≥ m) (as φ(1) = 0!)
= Σ_{m=1}^∞ σ(m)·P(Y ≥ m) (since ∫_m^{m+1} g(t) dt = σ(m)).

The proof of the Lemma is complete. ∎
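
Remark (numerical illustration). The construction of φ in Lemma 3.7 is completely explicit and easy to reproduce. A short Python sketch of ours (the choice an = 2^n is arbitrary):

a_seq = [2 ** n for n in range(1, 20)]        # an increasing sequence of positive integers

def sigma(m):
    # sigma(m) = #{n : a_n <= m}
    return sum(1 for a in a_seq if a <= m)

def phi(x):
    # phi(x) = integral over [0, x] of the step function t -> sigma(floor(t)), as in (3.1)
    k, frac = int(x), x - int(x)
    return sum(sigma(m) for m in range(k)) + frac * sigma(k)

for x in [10.0, 100.0, 1000.0, 10000.0]:
    print(x, phi(x) / x)    # the ratio phi(x)/x is non-decreasing and tends to infinity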


Continue with the proof of '⇒'. Let (an)n be increasing positive integers such that
E(|Xk|·1_{|Xk| ≥ an}) < 2^{−n} for any k (they exist by uniform integrability). Let σ(m) and φ be constructed as in the previous
Lemma. Let Y be one of the random variables |Xk|. Remark that, according to the
construction of the numbers an, we have

2^{−n} ≥ E(Y·1_{Y ≥ an}) = Σ_{m=an}^∞ E(Y·1_{m ≤ Y < m+1}) ≥ Σ_{m=an}^∞ m·P(m ≤ Y < m+1)
= an·P(an ≤ Y < an + 1) + (an + 1)·P(an + 1 ≤ Y < an + 2) + (an + 2)·P(an + 2 ≤ Y < an + 3) + …
= an·(P(an ≤ Y < an + 1) + P(an + 1 ≤ Y < an + 2) + P(an + 2 ≤ Y < an + 3) + …) + P(an + 1 ≤ Y < an + 2) + 2·P(an + 2 ≤ Y < an + 3) + 3·P(an + 3 ≤ Y < an + 4) + …
= an·P(Y ≥ an) + P(Y ≥ an + 1) + P(Y ≥ an + 2) + … ≥ Σ_{m=an}^∞ P(Y ≥ m) (since an ≥ 1!), or

(3.5) Σ_{m=an}^∞ P(Y ≥ m) ≤ 2^{−n}.

Well, the claim is that E(φ(Y)) ≤ 1.

Indeed, according to the previous Lemma, E(φ(Y)) ≤ Σ_{m=1}^∞ σ(m)·P(Y ≥ m). But a bit of
attention points out that, since σ(m) = Σ_{n≥1} 1_{an ≤ m},

Σ_{m=1}^∞ σ(m)·P(Y ≥ m) = Σ_{n≥1} Σ_{m=an}^∞ P(Y ≥ m) ≤ Σ_{n≥1} 2^{−n} = 1.

Therefore we found a φ such that sup{E(φ(|Xn|)) : n} ≤ 1. ∎


Proof of '⇐'. This is the easy implication. Let ε > 0 be arbitrary. We want to discover
a t such that E(Y·1_{Y ≥ t}) ≤ ε if Y = |Xk| for any k. Let A be such that E(φ(|Xk|))
≤ A ∀k and let t > 0 be such that y ≥ t ⇒ φ(y)/y ≥ A/ε, that is, y ≤ (ε/A)·φ(y). We can find
such a t because of the property φ(t)/t → ∞ as t → ∞, which we assumed.
Let then Y be one of the random variables |Xk|. Then

E(Y·1_{Y ≥ t}) ≤ E((ε/A)·φ(Y)·1_{Y ≥ t}) ≤ (ε/A)·E(φ(Y)) ≤ (ε/A)·A = ε. ∎
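
Remark (worked illustration). For φ(x) = x² the criterion is quantitative: on the event {|X| > a} one has |X| ≤ X²/a, hence E(|X|·1_{|X| > a}) ≤ sup_n E(Xn²)/a → 0 as a → ∞, so any family bounded in L² is uniformly integrable. A quick numerical check of this bound, in a Python sketch of ours (the Gaussian family is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(2)
xs = [s * rng.normal(size=100_000) for s in np.linspace(0.5, 1.0, 20)]   # an L2-bounded family

for a in [2.0, 5.0, 10.0]:
    worst = max(np.mean(np.abs(x) * (np.abs(x) > a)) for x in xs)
    bound = max(np.mean(x ** 2) for x in xs) / a
    print(a, worst, bound)        # the tail term is indeed dominated by sup E(X^2)/a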
Corollary 3.8. If a martingale X is bounded in Lᵖ (p > 1) or in L·ln⁺L then it is
uniformly integrable. Bounded in L·ln⁺L means that sup{E(|Xn|·ln⁺|Xn|) : n} < ∞. In this
case it has a tail.

Proof. We choose φ(x) = xᵖ, p > 1, or φ(x) = x·ln⁺x. ∎

Remark. Example 4 points out that if X is not bounded in L·ln⁺L then X
may not be uniformly integrable. Indeed, if Xn = n·1_{(0,1/n)}, then E(Xn·ln⁺Xn) = ln n → ∞ as
n → ∞. This martingale is not bounded in L·ln⁺L.
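
Remark (numerical check). For Xn = n·1_{(0,1/n)} everything is computable in closed form, and a short Python sketch of ours makes the failure of uniform integrability visible: E(Xn·ln⁺Xn) = ln n while E(Xn·1_{Xn > a}) = 1 whenever n > a.

import math

def e_ln_plus(n):
    # E(Xn ln+ Xn): the variable equals n on a set of measure 1/n
    return (1.0 / n) * (n * math.log(n))      # = ln n -> infinity

def tail(n, a):
    # E(Xn 1{Xn > a}) = n * P(U < 1/n) = 1 whenever n > a, else 0
    return 1.0 if n > a else 0.0

for a in [10, 100, 1000]:
    print(a, max(tail(n, a) for n in range(1, 10 * a)), e_ln_plus(10 * a))
# sup_n E(Xn 1{Xn > a}) remains 1 for every a: no uniform integrability.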


Now we establish the connection between uniform integrability and the
regularity of stopping times.

Proposition 3.9. If X is a uniformly integrable martingale, then every
stopping time is regular. As a consequence E(Xτ|Fσ) = X_{σ∧τ} for any
stopping times σ, τ. In particular EXτ = EX₁ for any τ.
Proof. First remark that any uniformly integrable martingale is
bounded in L¹ hence it has an almost sure limit X∞ which is also an L¹-limit.
Therefore Xτ makes sense even on the set {τ = ∞}. So we can assume that Xn = E(f|Fn)
for some f ∈ L¹(F∞) (actually we can put f = X∞!). Then Xτ = E(f|Fτ) (indeed,
E(f|Fτ) = Σ_{1≤n≤∞} E(f|Fn)·1_{τ=n} = Σ_{1≤n≤∞} Xn·1_{τ=n} = Xτ). We shall prove that the family
{E(f|Fτ) : τ stopping time} is uniformly integrable. Let φ be increasing and convex
such that E(φ(|f|)) < ∞, φ(t)/t → ∞ if t → ∞ (such a φ exists according to the
theorem of Vallée Poussin: any finite set of random variables is uniformly
integrable!). Then φ(|E(f|Fτ)|) ≤ E(φ(|f|)|Fτ) (Jensen!) ⇒ E(φ(|Xτ|)) =
E(φ(|E(f|Fτ)|)) ≤ E(E(φ(|f|)|Fτ)) = E(φ(|f|)) < ∞.
Therefore the family {E(f|Fτ) : τ stopping time} is uniformly integrable.
But X_{τ∧n} → Xτ a.s. According to Proposition 3.4 it must converge in L¹, too; it
means that τ is a regular stopping time. For the rest, see the previous lesson
(stopping theorems). ∎

4. Singular martingales. Exponential martingales.


A singular martingale is a nonnegative martingale which converges to 0 a.s.
We shall construct here a family of such kind of martingales.
Let (ξn)n be a sequence of bounded i.i.d. random variables and let Sn = ξ1 + … + ξn.
The sequence (Sn)n is called a random walk. If Eξ1 = 0, then (Sn)n is a martingale.
Let L(t) = E(e^{tξ1}) be the moment generating function of ξ1. (Notice that L(−t) is
the Laplace transform of ξ1.) As ξ1 is bounded, L makes sense for any t and is a
convex function. Moreover, L(t) > 0 hence the function ψ(t) = ln(L(t)) makes sense,
too. Notice also that L is indefinitely differentiable, since we can apply
Lebesgue's theorem and
(4.1) L⁽ⁿ⁾(t) = E(ξ1ⁿ·e^{tξ1}).
We claim that the function ψ is convex, too. Indeed, ψ″(t) = (L″(t)L(t) −
(L′(t))²)/L²(t). We check that ψ″ > 0 ⇔ L″L > (L′)² ⇔ (E(ξ1·e^{tξ1}))² < E(ξ1²·e^{tξ1})·E(e^{tξ1}).
To get the result, apply Schwarz's inequality (E(fg))² ≤ E(f²)·E(g²) for f = ξ1·e^{tξ1/2}, g =
e^{tξ1/2}. Moreover, the equality is possible only if f/g = constant a.s. ⇔ ξ1 =
constant. Meaning that if ξ1 is not a constant, then ψ is strictly convex.
Let now Xn = e^{tSn − nψ(t)}. Thus Xn+1 = Xn·e^{tξ_{n+1} − ψ(t)} ⇒ E(Xn+1|Fn) = Xn·E(e^{tξ_{n+1} − ψ(t)}) (as ξ_{n+1} is
independent of Fn!) = Xn·L(t)·e^{−ψ(t)} (as ξ_{n+1} has the same distribution as ξ1!) = Xn (as
e^{−ψ(t)} = e^{−ln L(t)} = 1/L(t)!). Thus X = (Xn)n is a positive martingale and EXn = 1.
Proposition 4.1. The martingale X is singular (for t ≠ 0).

Proof. From the law of large numbers Sn/n → Eξ1 ⇒ tSn − nψ(t) = n(t·Sn/n −
ψ(t)) → ∞ if tEξ1 > ψ(t) and → −∞ if tEξ1 < ψ(t). The only problem is if tEξ1 =
ψ(t) ⇔ tEξ1 = ln(L(t)) ⇔ L(t) = e^{tEξ1} ⇔ E(e^{tξ1}) = e^{tEξ1}. But Jensen's inequality
for the convex function x ↦ e^{tx} points out that E(e^{tξ1}) ≥ e^{tEξ1} and, as this
function is strictly convex for t ≠ 0, the equality may happen iff ξ1 is constant a.s.,
which we denied.
After all, the conclusion is that tSn − nψ(t) → −∞ ⇒ Xn → 0. ∎
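
Remark (simulation). The dichotomy "EXn = 1 for every n, yet Xn → 0 a.s." is easy to watch in a Python sketch of ours (symmetric ±1 steps and t = 1 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
t, n_steps, n_paths = 1.0, 500, 20_000

# Symmetric +-1 steps: E(xi1) = 0, L(t) = cosh(t), psi(t) = ln(cosh(t))
xi = rng.choice([-1.0, 1.0], size=(n_paths, n_steps))
S = np.cumsum(xi, axis=1)
psi = np.log(np.cosh(t))
X = np.exp(t * S - psi * np.arange(1, n_steps + 1))    # the exponential martingale

print(X[:, 9].mean())       # at n = 10 the Monte Carlo mean is close to E(X_10) = 1
print(np.median(X[:, -1]))  # at n = 500 the typical path has already collapsed to ~ 0
# (E(X_500) is still 1 in theory, but that mass sits on paths far too rare to be
#  sampled: this is precisely the singularity.)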

Definition. Such martingales are called exponential martingales.
They are of some interest in studying random walks.


Proposition 4.2. Let τa be the hitting moment of (a,∞) by S, a ≥ 0. If Eξ1 ≥ 0 and ξ1 ∈ L∞, then τa is regular
with respect to the martingale Xn = e^{tSn − nψ(t)} if t ≥ 0.
As a consequence, E(X_{τa}) = 1.

Proof. This stopping time is finite a.s. by Corollary 2.7. It means that
X_{τa∧n} → X_{τa} (a.s.). But notice that S_{τa∧n} ≤ a + ‖ξ1‖∞ (before τa the walk does not
exceed a, and the overshoot at τa is at most ‖ξ1‖∞). Thus, if t ≥ 0, X_{τa∧n} ≤ e^{t(a+‖ξ1‖∞) − (τa∧n)ψ(t)} ≤ e^{t(a+‖ξ1‖∞)} (since
ψ(t) = ln E(e^{tξ1}) ≥ ln e^{tEξ1} (by Jensen!) = tEξ1 ≥ 0!) so we can apply Lebesgue's
domination criterion to infer that X_{τa∧n} → X_{τa} in L¹, too.
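
Remark (simulation). The consequence E(X_{τa}) = 1 can itself be checked by simulation; here is a Python sketch of ours (p = 0.6, t = 0.5, a = 2 are arbitrary choices):

import math, random

random.seed(1)
p, q, t, a, n_runs = 0.6, 0.4, 0.5, 2, 100_000
psi = math.log(p * math.exp(t) + q * math.exp(-t))

def x_at_tau():
    # run the +-1 walk until it reaches level a, then return X_tau = exp(t*S_tau - tau*psi)
    s, n = 0, 0
    while s < a:
        s += 1 if random.random() < p else -1
        n += 1
    return math.exp(t * s - n * psi)

print(sum(x_at_tau() for _ in range(n_runs)) / n_runs)   # close to E(X_tau) = 1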
There is a case when this fact is enough to find the distribution of τa.
Suppose that ξn takes the values 1 and −1 with probabilities p and q = 1 − p, where p ≥ ½.
This is the simplest random walk: the probability of a step to the right is p and the probability of a step to the
left is q. Suppose a is a positive integer. Then S_{τa} = a. As the above
proposition tells us that E(e^{tS_{τa} − τa·ψ(t)}) = 1, it means that E(e^{ta − τa·ψ(t)}) = 1 ∀t ≥ 0,
that is, E(e^{−τa·ψ(t)}) = e^{−at} ∀t ≥ 0. Let us denote ψ(t) by u ≥ 0. In our case ψ(t)
= ln(pe^t + qe^{−t}), so the substitution reads
(4.2) pe^t + qe^{−t} = e^u.
The idea is to find the positive t = t(u) from the equation (4.2) in order
to find the Laplace transform of τa,
(4.3) L(u) = E(e^{−uτa}) = e^{−a·t(u)}.
A bit of calculus points out that
(4.4) t = t(u) = ln( (e^u + √(e^{2u} − 4pq)) / (2p) ),
which, replaced in (4.3), gives us
(4.5) L(u) = ( (e^u + √(e^{2u} − 4pq)) / (2p) )^{−a} = ( (e^u − √(e^{2u} − 4pq)) / (2q) )^{a}.
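
Remark (numerical check). Formulas (4.4) and (4.5) are easy to verify numerically; a Python sketch of ours (p = 0.6, u = 0.3, a = 3 are arbitrary choices):

import math

p, q, u, a = 0.6, 0.4, 0.3, 3
root = math.sqrt(math.exp(2 * u) - 4 * p * q)

t = math.log((math.exp(u) + root) / (2 * p))             # formula (4.4)
print(p * math.exp(t) + q * math.exp(-t), math.exp(u))   # (4.2): both sides agree

L1 = ((math.exp(u) + root) / (2 * p)) ** (-a)            # the two forms of (4.5)
L2 = ((math.exp(u) - root) / (2 * q)) ** a
print(L1, L2)                                            # they coincide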

Remark that the Laplace transform is the a-th power of another Laplace
transform, which means that τa is distributed as the sum of a i.i.d. random variables. That
should not be very surprising, because in order to reach the level a the random
walk S should reach successively the levels 1, 2, …, a−1!
If one expands (4.5) in series one discovers the moments of τa. In order to
find the distribution of τa it is more convenient to deal instead with the
generating function g(x) = E(x^{τa}). We want x to be in [0,1]. We can do that replacing
e^{−u} by x (since u ≥ 0 ⇔ 0 < x ≤ 1!). Then we obtain

(4.5′) g(x) = ( (1 − √(1 − 4pqx²)) / (2qx) )^a.

Recall now the Maclaurin expansion of 1 − √(1 − x):

(4.6) 1 − √(1 − x) = Σ_{n=1}^∞ [ C(2n−1, n−1) / ((2n−1)·2^{2n−1}) ]·x^n = x/2 + x²/8 + x³/16 + 5x⁴/128 + 7x⁵/256 + …

(here C(m, k) denotes the binomial coefficient),
and replace it in (4.5′). One gets

(4.7) g(x) = ( (1/(2qx))·Σ_{n=1}^∞ [ C(2n−1, n−1) / ((2n−1)·2^{2n−1}) ]·(4pqx²)^n )^a
= ( Σ_{n=1}^∞ [ C(2n−1, n−1) / (2n−1) ]·p^n·q^{n−1}·x^{2n−1} )^a
= ( px + p²qx³ + 2p³q²x⁵ + 5p⁴q³x⁷ + 14p⁵q⁴x⁹ + 42p⁶q⁵x¹¹ + … )^a,

which gives the distribution of τa if one could effectively do the computations.


For a = 1, anyway, the result is that

(4.8) P∘τ1^{−1} = Σ_{n=1}^∞ [ C(2n−1, n−1) / (2n−1) ]·p^n·q^{n−1}·δ_{2n−1}

(δ_k being the Dirac measure at k), i.e. P(τ1 = 2n−1) = C(2n−1, n−1)·p^n·q^{n−1}/(2n−1), n ≥ 1.
For p = q = ½ this becomes P∘τ1^{−1} = Σ_{n=1}^∞ [ C(2n−1, n−1) / ((2n−1)·2^{2n−1}) ]·δ_{2n−1}.
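
Remark (simulation). The distribution (4.8) can be confronted with a direct simulation of the hitting time; a Python sketch of ours (p = 0.6 and 200,000 runs are arbitrary choices):

import math, random

random.seed(0)
p, n_runs = 0.6, 200_000
q = 1 - p

def tau1():
    # first time the +-1 walk with P(step = +1) = p reaches level 1
    s, n = 0, 0
    while s < 1:
        s += 1 if random.random() < p else -1
        n += 1
    return n

counts = {}
for _ in range(n_runs):
    k = tau1()
    counts[k] = counts.get(k, 0) + 1

for n in range(1, 5):                  # compare with (4.8) for tau1 = 1, 3, 5, 7
    k = 2 * n - 1
    exact = math.comb(2 * n - 1, n - 1) * p ** n * q ** (n - 1) / (2 * n - 1)
    print(k, counts.get(k, 0) / n_runs, exact)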
Remark. Notice that if p > ½ then Eτa = a/(2p−1) < ∞, while for p = ½, Eτa = ∞.