
Probability and Statistics: Basic Thinking in Statistics

Jongsig Bae

Preface
Statistics is a mathematical science pertaining to the collection, analysis,
interpretation or explanation, and presentation of data. It also provides
tools for prediction and forecasting based on data. It is applicable to a wide
variety of academic disciplines, from the natural and social sciences to the
humanities, government and business.
Statistical methods can be used to summarize or describe a collection of
data; this is called descriptive statistics. In addition, patterns in the data
may be modeled in a way that accounts for randomness and uncertainty in
the observations, and are then used to draw inferences about the process
or population being studied; this is called inferential statistics. Descriptive,
predictive, and inferential statistics comprise applied statistics.
There is also a discipline called mathematical statistics, which is concerned with the theoretical basis of the subject. Moreover, there is a branch
of statistics called exact statistics that is based on exact probability statements.
The word statistics can either be singular or plural. In its singular form,
statistics refers to the mathematical science discussed in this article. In its
plural form, statistics is the plural of the word statistic, which refers to a
quantity (such as a mean) calculated from a set of data.
February, 2016

Contents

1 Panorama in Statistics: Examples
  1.1 Brownian motion
      1.1.1 Brownian motion as a natural phenomenon
      1.1.2 Brownian motion as a stochastic process
      1.1.3 Exercises
  1.2 Stock pricing problem
      1.2.1 A geometric Brownian motion
      1.2.2 Black-Scholes option pricing formula
  1.3 Medical data and limiting theorems
      1.3.1 Complete data and the classical model
      1.3.2 Incomplete data and the Kaplan-Meier model

2 Basic thinking in Probability
  2.1 Sample space and event
      2.1.1 Probability and statistics
      2.1.2 Sample space
      2.1.3 Event
      2.1.4 Set theory and events
      2.1.5 Binomial theorem
      2.1.6 Exercises
  2.2 Definitions of Probability
      2.2.1 Classical definition of probability
      2.2.2 Frequency probability
      2.2.3 Bayesian probability
      2.2.4 Axioms of probability
      2.2.5 Exercises
  2.3 Conditional probability and independence
      2.3.1 Conditional probability
      2.3.2 Independence of events
      2.3.3 Bayes theorem
      2.3.4 Exercises
  2.4 Distribution of random variable
      2.4.1 Discrete random variable
      2.4.2 Exercises
  2.5 Expectation
      2.5.1 Properties of expectation
      2.5.2 Moment of random variable
      2.5.3 Exercises
  2.6 Absolutely continuous random variable
      2.6.1 Distribution function
      2.6.2 Riemann integral
      2.6.3 Absolutely continuous random variables
      2.6.4 Exercises
  2.7 Bivariate random variables
      2.7.1 Marginal and conditional distribution
      2.7.2 Conditional distribution
      2.7.3 Covariance and correlation coefficient
      2.7.4 Stochastic Independence
      2.7.5 Bivariate normal distribution
  2.8 Convergence of random variables
      2.8.1 Convergence of random variables
      2.8.2 The central limit theorem
      2.8.3 Exercises

3 Basic thinking in Statistics
  3.1 Sampling Theory
      3.1.1 Descriptive statistics
      3.1.2 Sampling theory
      3.1.3 Order statistics
      3.1.4 The χ2, t, and F distributions
      3.1.5 Exercises
  3.2 Estimation in statistical inference
      3.2.1 Point estimation
      3.2.2 Interval estimation
      3.2.3 Exercises
  3.3 Testing in statistical inference
      3.3.1 Testing statistical hypothesis
      3.3.2 Formulating the hypotheses
      3.3.3 Test criterion and rejection region
      3.3.4 Two types of error and their probabilities
      3.3.5 Performing a test
      3.3.6 The p-value

4 The uniform distribution

Chapter 1
Panorama in Statistics: Examples

1.1 Brownian motion

We introduce and compare two concepts of Brownian motion: Brownian motion as a natural phenomenon and Brownian motion as a probabilistic model.

Example 1 When you throw a stone in the air you will find a trace that we call a parabola. We call this a parabola as a natural phenomenon. We model it with the mathematical equation

y = ax^2 + bx + c

that we call a parabola as a mathematical model.

Before defining Brownian motion as a mathematical (or probabilistic) model, we consider Brownian motion as a natural phenomenon by observing a description from the internet.

1.1.1 Brownian motion as a natural phenomenon

Internet 1 Brownian motion, named after the Scottish botanist Robert Brown, is the random movement of microscopic particles suspended in a liquid or gas, caused by collisions with molecules of the surrounding medium. Also called Brownian movement.

More specifically, we name this Brownian motion as a natural phenomenon. We now introduce Brownian motion as a stochastic process. Before doing so, we first introduce the concept of the normal distribution.

1.1.2 Brownian motion as a stochastic process

We consider the following function

g(z) = e^{-z^2},  z ∈ IR,

that we call a Gaussian curve. Here, e = 2.71828....

One can prove the following.

Theorem 1

∫_{-∞}^{∞} e^{-z^2} dz = √π.

Here, π is the area of the unit circle. That is, π = 3.14159....

By changing a variable we get

∫_{-∞}^{∞} (1/√(2π)) e^{-z^2/2} dz = 1.

Now, if we let

φ(z) = (1/√(2π)) e^{-z^2/2},  z ∈ IR,

then

1. φ ≥ 0

2. ∫_{-∞}^{∞} φ(z) dz = 1

so the function φ can be a probability density function.


Now, we consider a continuous type variable X. If the continuous type variable X has a function f with the following properties

1. f ≥ 0

2. ∫_{-∞}^{∞} f(x) dx = 1

3. P(a ≤ X ≤ b) = ∫_a^b f(x) dx

then the variable X is called a random variable having the probability density function f.
We now have the following.

Definition 1 If the random variable Z has the probability density function

φ(z) = (1/√(2π)) e^{-z^2/2},  z ∈ IR,

then Z is said to have a standard normal distribution, and this is denoted Z ∼ N(0, 1).

If μ ∈ IR and σ > 0, then using the transformation x = σz + μ we get the following identity:

∫_{-∞}^{∞} (1/(√(2π)σ)) e^{-(x-μ)^2/(2σ^2)} dx = 1.

Definition 2 If the random variable X has the probability density function

f(x) = (1/(√(2π)σ)) e^{-(x-μ)^2/(2σ^2)},  x ∈ IR,

then X is said to have a normal distribution, and this is denoted X ∼ N(μ, σ^2).


One can prove the following facts.

Theorem 2 Let X ∼ N(μ, σ^2). Then

1. f is symmetric with respect to x = μ.

2. The graph of f is bell shaped.

3. f has the maximum value 1/(√(2π)σ) at x = μ.

4. X has the mean μ and the variance σ^2.

5. Define M(t) := Ee^{tX}; then we get

M(t) = e^{μt + σ^2 t^2/2},  −∞ < t < ∞.

We call the function M the moment generating function of X.
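Parts 3 and 4 of Theorem 2 can be checked numerically. The following Python sketch (ours, not part of the original text; function names are our own) approximates ∫f, ∫xf, and ∫x²f for the N(μ, σ²) density with a midpoint Riemann sum, recovering 1, μ, and μ² + σ².

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) as in Definition 2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def riemann(f, lo, hi, n=200_000):
    """Midpoint Riemann sum of f over [lo, hi]."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

mu, sigma = 1.5, 2.0
lo, hi = mu - 12 * sigma, mu + 12 * sigma  # tails beyond 12 sigma are negligible

total  = riemann(lambda x: normal_pdf(x, mu, sigma), lo, hi)
mean   = riemann(lambda x: x * normal_pdf(x, mu, sigma), lo, hi)
second = riemann(lambda x: x * x * normal_pdf(x, mu, sigma), lo, hi)

print(total, mean, second)  # approximately 1, mu, mu^2 + sigma^2
```

The second moment μ² + σ² is consistent with part 4, since Var(X) = EX² − (EX)².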

A collection of random variables is called a stochastic process. We first look at some examples of a discrete time stochastic process {X(t) : t ∈ T}.

Example 2 (Discrete time stochastic process)

1. If T = IN, then {X_n : n ∈ IN} = {X_1, X_2, . . .} is a discrete time stochastic process.

2. If T = ZZ, then {X_n : n ∈ ZZ} = {. . . , X_{−1}, X_0, X_1, . . .} is a discrete time stochastic process.
If the random variables in a discrete time stochastic process are independent, then we call the stochastic process an independent trial.

We secondly look at examples of a continuous time stochastic process {X(t) : t ∈ T}.

Example 3 (Continuous time stochastic process)

1. If T = [0, 1], then {X(t) : 0 ≤ t ≤ 1} is a continuous time stochastic process.

2. If T = [0, ∞), then {X(t) : 0 ≤ t < ∞} is a continuous time stochastic process.

3. If T = IR, then {X(t) : t ∈ IR} is a continuous time stochastic process.

Consider a continuous time stochastic process {X(t) : 0 ≤ t < ∞}. If the random variables

X(t_1) − X(t_0), X(t_2) − X(t_1), . . . , X(t_k) − X(t_{k−1})

are independent for all t_0 < t_1 < · · · < t_k, then {X(t) : 0 ≤ t < ∞} is said to have independent increments. If the random variables

X(t + s) − X(t) and X(s) − X(0)

have the same distribution for all s and t, then {X(t) : 0 ≤ t < ∞} is said to have stationary increments.
We now define a Brownian motion as a probabilistic model.

Definition 3 A continuous time stochastic process {X(t) : t ≥ 0} is called a Brownian motion if the following properties are satisfied:

1. X(0) = 0,

2. {X(t) : t ≥ 0} has stationary and independent increments, and

3. for fixed s > 0, the random variable X(s) has a normal distribution with mean μs and variance σ^2 s.

We denote this by {X(t) : t ≥ 0} ∼ BM(μ, σ^2).

Remark 1 Brownian motion is often called a Wiener process.

Theorem 3 (Existence of a Brownian motion) There exists a Brownian motion as a continuous time stochastic process.

Project 1 Think of modeling a heart beat.

Definition 4 Let {X(t) : t ≥ 0} ∼ BM(μ, σ^2) be a Brownian motion. If μ = 0 and σ^2 = 1, then it is called a standard Brownian motion and denoted by {B(t) : t ≥ 0}. In this case, we have

X(t) = σB(t) + μt.
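The defining properties suggest a direct way to simulate a BM(μ, σ²) path on a time grid: start from X(0) = 0 and add independent N(μΔt, σ²Δt) increments. A minimal Python sketch (ours; names and parameter values are illustrative):

```python
import math
import random

def brownian_path(mu, sigma, T=1.0, n=1000, seed=0):
    """Sample X(t_k) at t_k = k*T/n for X ~ BM(mu, sigma^2),
    using stationary independent N(mu*dt, sigma^2*dt) increments."""
    rng = random.Random(seed)
    dt = T / n
    x = [0.0]  # property 1: X(0) = 0
    for _ in range(n):
        x.append(x[-1] + rng.gauss(mu * dt, sigma * math.sqrt(dt)))
    return x

path = brownian_path(mu=0.5, sigma=1.0)
print(path[0], len(path))  # 0.0 1001
```

By property 3, the endpoint path[-1] is one draw from N(μT, σ²T).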
We now define a geometric Brownian motion.

Definition 5 Let {Y(t) : t ≥ 0} ∼ BM(μ, σ^2) be a Brownian motion. Then the stochastic process {X(t) : t ≥ 0} defined by

X(t) = e^{Y(t)}

is called a geometric Brownian motion.

1.1.3 Exercises

1. Use polar coordinates to prove

∫_{-∞}^{∞} e^{-z^2} dz = √π.

2. Prove that

∫_{-∞}^{∞} (1/√(2π)) e^{-z^2/2} dz = 1.

3. Let μ ∈ IR and σ > 0. Prove, using x = σz + μ, that

∫_{-∞}^{∞} (1/(√(2π)σ)) e^{-(x-μ)^2/(2σ^2)} dx = 1.

4. Prove that

∫_{-∞}^{∞} x (1/(√(2π)σ)) e^{-(x-μ)^2/(2σ^2)} dx = μ.

5. Prove that

∫_{-∞}^{∞} x^2 (1/(√(2π)σ)) e^{-(x-μ)^2/(2σ^2)} dx = μ^2 + σ^2.

6. Prove that

∫_{-∞}^{∞} e^{tx} (1/√(2π)) e^{-x^2/2} dx = e^{t^2/2}

for all t.

7. Prove that

∫_{-∞}^{∞} e^{tx} (1/(√(2π)σ)) e^{-(x-μ)^2/(2σ^2)} dx = e^{μt + σ^2 t^2/2}

for all t.

8. Let c(z) = 1/(1 + z^2). Prove that

∫_{-∞}^{∞} c(z) dz = π.

9. Evaluate, if it exists, the following integral:

∫_{-∞}^{∞} x/(1 + x^2) dx.
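Exercise 8 can be sanity-checked numerically before proving it. A small Python sketch (ours) integrates 1/(1 + z²) over a long symmetric interval with a midpoint rule; the truncated tails contribute roughly 2/R, so the result falls just short of π.

```python
import math

def c(z):
    """The integrand of Exercise 8."""
    return 1.0 / (1.0 + z * z)

# Midpoint Riemann sum over [-R, R]; each tail contributes about 1/R.
R, n = 1000.0, 1_000_000
h = 2 * R / n
integral = sum(c(-R + (i + 0.5) * h) for i in range(n)) * h

print(integral)  # close to pi
```

The exact antiderivative arctan explains the tail size: π − 2·arctan(R) ≈ 2/R for large R.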

1.2 Stock pricing problem

1.2.1 A geometric Brownian motion

A geometric Brownian motion (GBM) (occasionally, exponential Brownian motion) is a continuous-time stochastic process in which the logarithm of the randomly varying quantity follows a Brownian motion, or a Wiener process. It is applicable to mathematical modeling of some phenomena in financial markets. It is used particularly in the field of option pricing because a quantity that follows a GBM may take any positive value, and only the fractional changes of the random variate are significant. This is a reasonable approximation of stock price dynamics.

A stochastic process S_t is said to follow a GBM if it satisfies the following stochastic differential equation:

dS_t = μ S_t dt + σ S_t dW_t,

where W_t is a Wiener process or Brownian motion and μ (the percentage drift) and σ (the percentage volatility) are constants.

For an arbitrary initial value S_0 the equation has the analytic solution

S_t = S_0 exp((μ − σ^2/2) t + σ W_t),

which is a log-normally distributed random variable with the expected value E(S_t) = e^{μt} S_0 and the variance V(S_t) = e^{2μt} S_0^2 (e^{σ^2 t} − 1).

The correctness of the solution can be verified using Itô's lemma. The random variable log(S_t/S_0) is normally distributed with the mean (μ − σ^2/2)t and the variance σ^2 t, which reflects the fact that increments of a GBM are normal relative to the current price, which is why the process has the name geometric.
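The stated mean E(S_t) = e^{μt} S_0 is easy to check by Monte Carlo using the closed-form solution. A Python sketch (ours; the parameter values are arbitrary):

```python
import math
import random

def sample_gbm(s0, mu, sigma, t, rng):
    """One draw of S_t = S_0 exp((mu - sigma^2/2) t + sigma W_t), W_t ~ N(0, t)."""
    w = rng.gauss(0.0, math.sqrt(t))
    return s0 * math.exp((mu - 0.5 * sigma ** 2) * t + sigma * w)

rng = random.Random(42)
s0, mu, sigma, t = 1.0, 0.1, 0.2, 1.0
n = 50_000
mean_st = sum(sample_gbm(s0, mu, sigma, t, rng) for _ in range(n)) / n

print(mean_st)  # close to e^{mu t} * S_0 = e^{0.1}
```

Note that the drift of log(S_t/S_0) is (μ − σ²/2)t, yet the mean of S_t itself is e^{μt}S_0; the σ²/2 correction is exactly the log-normal mean effect.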

1.2.2 Black-Scholes option pricing formula

In 1973, Fischer Black and Myron Scholes published their groundbreaking paper on the pricing of options and corporate liabilities. Not only did this specify the first successful option pricing formula, but it also described a general framework for pricing other derivative instruments. That paper launched the field of financial engineering. Black and Scholes had a very hard time getting that paper published. Eventually, it took the intercession of Eugene Fama and Merton Miller to get it accepted by the Journal of Political Economy. In the meantime, Black and Scholes had published in the Journal of Finance a more accessible (1972) paper that cited the as-yet unpublished (1973) option pricing formula in an empirical analysis of current options trading.

The Black-Scholes (1973) option pricing formula prices European put or call options on a stock that does not pay a dividend or make other distributions. The formula assumes the underlying stock price follows a geometric Brownian motion with constant volatility. It is historically significant as the original option pricing formula published by Black and Scholes in their landmark (1973) paper.
In this subsection we briefly give the value of a call price c.

Suppose the present price of a stock is X(0) = x_0, and let X(t) denote its price at time t. Suppose we are interested in the stock over the time interval 0 to T. Assume that the discount factor is α (equivalently, the interest rate is 100α% compounded continuously), and so the present value of the stock price at time t is e^{−αt} X(t).

We can regard the evolution of the price of the stock over time as our experiment, and the outcome of the experiment is the value of the function X(t), 0 ≤ t ≤ T. Suppose the option gives one the right to buy one share of the stock at time t for a price K. Suppose that X(t) is a geometric Brownian motion process. That is,

X(t) = x_0 e^{Y(t)},     (1.1)

where {Y(t), t ≥ 0} is a Brownian motion process with drift coefficient μ and variance parameter σ^2. By computing a conditional expectation we have that, for s < t,

E[X(t) | X(u), 0 ≤ u ≤ s] = X(s) e^{(t−s)(μ + σ^2/2)}.

If we choose μ and σ^2 so that

μ + σ^2/2 = α,     (1.2)

then it is known that the price of an option to purchase a share of the stock at time t for a fixed price K is given by

c = E[e^{−αt} (X(t) − K)^+].
By observing that X(t) = x_0 e^{Y(t)}, where Y(t) is normal with mean μt and variance σ^2 t, we see that

c e^{αt} = ∫_{−∞}^{∞} (x_0 e^y − K)^+ (1/√(2πtσ^2)) e^{−(y−μt)^2/(2σ^2 t)} dy

         = ∫_{log(K/x_0)}^{∞} (x_0 e^y − K) (1/√(2πtσ^2)) e^{−(y−μt)^2/(2σ^2 t)} dy.

Making the change of variable w = (y − μt)/(σ√t) yields

c e^{αt} = x_0 e^{μt} (1/√(2π)) ∫_a^{∞} e^{σ√t w} e^{−w^2/2} dw − K (1/√(2π)) ∫_a^{∞} e^{−w^2/2} dw,

where

a = (log(K/x_0) − μt)/(σ√t).

Now,

(1/√(2π)) ∫_a^{∞} e^{σ√t w − w^2/2} dw = e^{σ^2 t/2} (1/√(2π)) ∫_a^{∞} e^{−(w−σ√t)^2/2} dw

  = e^{σ^2 t/2} P{N(σ√t, 1) ≥ a}

  = e^{σ^2 t/2} P{N(0, 1) ≥ a − σ√t}

  = e^{σ^2 t/2} P{N(0, 1) ≤ −(a − σ√t)}

  = e^{σ^2 t/2} Φ(σ√t − a),

where Φ is the standard normal distribution function. Thus, we see that

c e^{αt} = x_0 e^{μt + σ^2 t/2} Φ(σ√t − a) − K Φ(−a).

Using that

μ + σ^2/2 = α

and letting b = −a, we can write this as

c = x_0 Φ(σ√t + b) − K e^{−αt} Φ(b),

where

b = (αt − σ^2 t/2 − log(K/x_0))/(σ√t).
Remark 2 The Black-Scholes formula tells us that the price of an option is determined by

1. the initial price x_0,

2. the option exercise time t,

3. the option exercise price K,

4. the interest rate α, and

5. the variance σ^2 of the stock.
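The final formula is straightforward to implement. A Python sketch (ours; function names are our own), using math.erf for the standard normal distribution function Φ:

```python
import math

def Phi(x):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def call_price(x0, K, alpha, sigma, t):
    """c = x0*Phi(sigma*sqrt(t) + b) - K*e^{-alpha*t}*Phi(b), with
    b = (alpha*t - sigma^2*t/2 - log(K/x0)) / (sigma*sqrt(t))."""
    b = (alpha * t - 0.5 * sigma ** 2 * t - math.log(K / x0)) / (sigma * math.sqrt(t))
    return x0 * Phi(sigma * math.sqrt(t) + b) - K * math.exp(-alpha * t) * Phi(b)

c = call_price(x0=100.0, K=100.0, alpha=0.05, sigma=0.2, t=1.0)
print(c)  # about 10.45 for this at-the-money one-year call
```

Note that, per Remark 2, the drift μ of the stock does not appear: only x_0, t, K, α, and σ² enter the price.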

1.3 Medical data and limiting theorems

In classical probability theory, one is often interested in establishing three famous limit theorems: the law of large numbers (LLN), the central limit theorem (CLT), and the law of the iterated logarithm (LIL). These topics have their own importance in classical probability theory. The results also have good applications in parametric statistical inference. In recent years, under the topic of empirical process theory, authors have been interested in the uniform analogues of the three theorems: the uniform LLN of Glivenko-Cantelli type, the uniform CLT of Donsker type, and the uniform LIL of Strassen type. When the underlying data are IID and complete, these topics are well developed up to the generality of function-indexed processes and usefully applied in nonparametric statistical inference. See, for example, DeHardt for the uniform LLN, and Ossiander for the uniform CLT and the uniform LIL.

In survival analysis, authors deal with statistical inference problems on the basis of the incomplete data of the random censorship model. Since Kaplan and Meier introduced the model in 1958, this topic has been studied in various directions. The Kaplan-Meier estimator has been generalized and refined up to a Kaplan-Meier integral, and some limit theorems for this integral have been obtained. See, for example, Stute and Wang for the LLN and Stute for the CLT.

In this section, we introduce the three famous limit theorems of probability theory. Describing rigorous medical data requires advanced mathematics, so we briefly introduce the Kaplan-Meier model and some limit theorems.

We try to infer an unknown distribution function based on the complete data of a random sample that is obtained by repeating an experiment. In a medical experiment, it is often impossible to obtain complete data, so it is inevitable to try to infer the unknown distribution function based on incomplete data. Since the Kaplan-Meier model (1958), inference based on incomplete data has continued to develop in probability and statistics.

A statistical inference is an inference on the unknown distribution function based on random sampling, in the form of estimation, testing, and prediction. The classical limit theorems, the law of large numbers, the central limit theorem, and the law of the iterated logarithm, have evolved into empirical process theory. The empirical process versions of the law of large numbers, the central limit theorem, and the law of the iterated logarithm are the Glivenko-Cantelli theorem (1933), the Donsker theorem (1951), and the Strassen theorem (1964), respectively.

1.3.1 Complete data and the classical model

Uncertainty is often expressed as an unknown distribution function. We try to infer the distribution function by using the complete data that are obtained from an experiment.

In the classical model, we are given a random sample

X_1, . . . , X_n

in order to infer the distribution function F(x). We often use the empirical distribution function defined as follows:

F_n(x) = (1/n) Σ_{i=1}^{n} 1_{(−∞,x]}(X_i).     (1.3)

Here, 1_A(x) is the indicator function of the set A.

We have the following limit theorems on the empirical distribution function F_n(x) and the distribution function F(x).

Fact 1

1. (Law of large numbers) For fixed x, F_n(x) converges to F(x) strongly. That is,

P(F_n(x) → F(x)) = 1.

2. (Central limit theorem) For fixed x, √n (F_n(x) − F(x)) converges in distribution to a normal distribution. That is,

lim_{n→∞} P( √n (F_n(x) − F(x)) / √(F(x) − F^2(x)) ≤ z ) = ∫_{−∞}^{z} (1/√(2π)) e^{−t^2/2} dt.

3. (Law of the iterated logarithm) For fixed x, the set of limit points of the sequence

√n (F_n(x) − F(x)) / √(2 ln ln n)

is bounded.
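The empirical distribution function (1.3) is a one-line computation, and the law of large numbers in Fact 1 can be illustrated by simulation. A Python sketch (ours; the uniform example is our own choice):

```python
import random

def edf(sample, x):
    """Empirical distribution function F_n(x) of (1.3)."""
    return sum(1 for xi in sample if xi <= x) / len(sample)

# A small deterministic example: two of three observations are <= 2.5.
print(edf([3.0, 1.0, 2.0], 2.5))  # 2/3

# LLN illustration: for Uniform(0,1) data, F(0.5) = 0.5.
rng = random.Random(1)
sample = [rng.random() for _ in range(10_000)]
print(edf(sample, 0.5))  # close to 0.5
```

With n = 10,000 the CLT in Fact 1 puts the typical deviation near √(0.25/n) = 0.005, so the simulated value should sit very close to 0.5.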

1.3.2 Incomplete data and the Kaplan-Meier model

We consider the random censorship model, where one observes the incomplete data {(Z_i, δ_i)}. The {Z_i} are independent copies of Z, whose distribution is H. The {(Z_i, δ_i)} are obtained by the equations Z_i = min(X_i, Y_i) and δ_i = 1{X_i ≤ Y_i}, where the {Y_i} are independent copies of the censoring random variable Y with distribution G, which is also assumed to be independent of F, the distribution of the IID random variables {X_i} of original interest in the statistical inference.

This model was introduced by Kaplan and Meier in 1958, and the Kaplan-Meier estimator given below is known as one of the reasonable estimators of the underlying distribution F in the random censorship model.

The Kaplan-Meier estimator F_n is defined as follows:

F_n(x) = 1 − Π_{i=1}^{n} [1 − δ_{[i:n]}/(n − i + 1)]^{1_{(−∞,x]}(Z_{(i)})},     (1.4)

where

Z_{(1)}, . . . , Z_{(n)}

are the order statistics of

Z_1, . . . , Z_n

and the δ_{[i:n]} are the concomitants of the Z_{(i)}. That is, if Z_{(i)} = Z_j, then δ_{[i:n]} = δ_j.

We need more notation to describe the limit theorems. Let F{a} = F(a) − F(a−) denote the jump size of F at a, and let A be the set of all atoms of H, which is empty when H is continuous. Let τ_H = inf{x : H(x) = 1} denote the least upper bound of the support of H. Consider the sub-distribution function F̃ defined by

F̃(x) = F(x) 1{x < τ_H} + [F(τ_H−) + 1{τ_H ∈ A} F{τ_H}] 1{x ≥ τ_H}.     (1.5)

We now introduce the three limit theorems for the medical data of the Kaplan-Meier model.

Fact 2

1. (Law of large numbers) For fixed x, F_n(x) converges to F̃(x) strongly. That is,

P(F_n(x) → F̃(x)) = 1.

2. (Central limit theorem) For fixed x, √n (F_n(x) − F̃(x)) converges in distribution to a normal distribution. That is,

lim_{n→∞} P( √n (F_n(x) − F̃(x)) / √(F̃(x) − F̃^2(x)) ≤ z ) = ∫_{−∞}^{z} (1/√(2π)) e^{−t^2/2} dt.

3. (Law of the iterated logarithm) For fixed x, the set of limit points of the sequence

√n (F_n(x) − F̃(x)) / √(2 ln ln n)

is bounded.

Remark 3 We omit the detailed mathematical conditions.
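The product in (1.4) can be computed directly from the observed pairs (Z_i, δ_i). A Python sketch (ours; it orders tied observations by sort order, a detail the text does not specify). With no censoring, every δ_i = 1 and the product telescopes, so the estimator reduces to the empirical distribution function:

```python
def kaplan_meier(z, delta, x):
    """Kaplan-Meier estimate F_n(x) of (1.4).
    z[i] = min(X_i, Y_i); delta[i] = 1 if uncensored (X_i <= Y_i), else 0."""
    n = len(z)
    # Sort the data; the sorted delta values are the concomitants delta_{[i:n]}.
    pairs = sorted(zip(z, delta))
    surv = 1.0
    for i, (zi, di) in enumerate(pairs, start=1):
        if zi <= x:  # only factors with Z_(i) <= x enter the product
            surv *= 1.0 - di / (n - i + 1)
    return 1.0 - surv

# No censoring: F_3(2.5) equals the empirical distribution function value 2/3.
print(kaplan_meier([3.0, 1.0, 2.0], [1, 1, 1], 2.5))
```

Censored observations (δ = 0) contribute a factor of 1 but still shrink the risk set denominator n − i + 1, which is how the estimator uses the incomplete data.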

Chapter 2
Basic thinking in Probability

2.1 Sample space and event

2.1.1 Probability and statistics

Why are you studying probability? We believe that many problems in real life depend on uncertainty, and they can sometimes be answered by probabilistic ideas. Uncertainty is part of our daily experience.

A probability is a number that expresses the degree to which an occurrence is certain or uncertain. Probability is the branch of mathematics that seeks to study uncertainty in a systematic way.

A statistic is a body of data consisting of meaningful numbers; the name originates from state arithmetic. Statistics is the branch of science that collects data and draws inferences from them.

1. The quantum theory of physics depicts a Universe whose fundamental structure is random.

2. Modern models of inheritance, too, are based on a probabilistic understanding of how genes are transmitted from parents to children. Genetic counseling relies on this work.

3. Probability arguments underpin recent modeling of the AIDS epidemic.

4. Have you ever applied for credit?

5. Court cases are increasingly being influenced by forensic evidence based on probabilities.
We begin by listing some experiments.

1. Flip a coin.

2. Record the amount of money needed to buy a gift for your parents.

3. Record the sex of a baby born at a local maternity hospital.

4. Interview 100 shoppers in a busy shopping center; record the number who have heard of a new brand of refrigerator.

5. Test electrical components consecutively as they come off a production line; record the number tested until the first defective component is discovered.

6. Measure the length of a fish.

7. Record a stock price increment.

8. Investigate credits in a statistics class.

9. Operate on a patient to remove a stomach cancer; record the amount of time (months) that goes by until the patient relapses.

10. Record the birth weights (kilograms) of twins.

An outcome is a result of an experiment. An outcome is not always a number. Some outcomes in the above experiments are numbers; others are not. For example, in flipping a coin the outcomes consist of head and tail. The outcomes of recording the birth weights of twins are ordered pairs of numbers.

The experiments in the above examples are called random experiments in the sense that the number of possible outcomes is at least two. On the other hand, a deterministic experiment is an experiment whose result is deterministic, so the number of outcomes in a deterministic experiment is only one. Because we always know the result of a deterministic experiment in advance, it is no longer a goal of study in probability. The following is an example of a deterministic experiment.

Example 4

1. Record the age of a baby born at a local maternity hospital.

2.1.2 Sample space

The set of all possible outcomes of an experiment is known as the sample space of the random experiment and is denoted by Ω.

Example 5

1. In flipping a coin, the sample space is

Ω = {H, T}.

The outcome will be H if the coin lands heads and T if it lands tails.

2. In recording the amount of money needed to buy a gift for your parents,

Ω = [0, ∞).

3. In recording the sex of a baby born at a local maternity hospital,

Ω = {G, B},

where the outcome G means that the child is a girl and B means that it is a boy.

4. In interviewing 100 shoppers in a busy shopping center and recording the number who have heard of a new brand of refrigerator,

Ω = {0, 1, . . . , 100}.

5. In recording the number tested until the first defective component is discovered,

Ω = {0, 1, . . .}.

6. In measuring the length of a fish,

Ω = [0, ∞).

7. In recording a stock price increment,

Ω = (−∞, ∞),

because the stock price increment can be any real number: positive, zero, or negative.

8. In investigating credits in a statistics class,

Ω = {A+, A, B+, B, C+, C, D+, D, F}.

9. In recording the amount of time (months) that goes by until the patient relapses after an operation to remove a stomach cancer,

Ω = [0, ∞).

10. In recording the birth weights (kilograms) of twins,

Ω = [0, ∞) × [0, ∞).

2.1.3 Event

An event is a part of the outcomes of a random experiment. It is a subset of the sample space. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in A, then we say that A has occurred.

An event that consists of a single element is called an elementary event. That is, if ω is a single element of Ω, then {ω} is an elementary event, which is a subset of Ω. If an event has at least two elements, then it is called a compound event. The sample space Ω itself is called the total event, and it is an event that always occurs. The empty set ∅, which is a subset of Ω, is called the null event, and it is an event that never occurs.

Some examples of events are as follows.
Example 6

1. In flipping a coin, if A is

A = {H} ⊂ Ω = {H, T},

then A is the event that the outcome is head.

2. In recording the amount of money needed to buy a gift for your parents, if A is

A = [0, 50000] ⊂ Ω = [0, ∞),

then A is the event that the amount of money is at most 50000.

3. In recording the sex of a baby born at a local maternity hospital, if A is

A = {G} ⊂ Ω = {G, B},

then A is the event that the baby is a girl.

4. In interviewing 100 shoppers in a busy shopping center and recording the number who have heard of a new brand of refrigerator, if A is

A = {1, . . . , 100} ⊂ Ω = {0, 1, . . . , 100},

then A is the event that at least one of the interviewees has heard of the new brand of refrigerator.

5. In recording the number tested until the first defective component is discovered, if A is

A = {100, 101, . . .} ⊂ Ω = {0, 1, . . .},

then A is the event that the defective component is not discovered before 100 tests.

6. In measuring the length of a fish, if A is

A = [30, ∞) ⊂ Ω = [0, ∞),

then A is the event that the length of the fish is at least 30.

7. In recording a stock price increment, if A is

A = [−30, 30] ⊂ Ω = (−∞, ∞),

then A is the event that the stock price increment is at most 30 in absolute value.

8. In investigating credits in a statistics class, if A is

A = {A+, A, B+, B} ⊂ Ω = {A+, A, B+, B, C+, C, D+, D, F},

then A is the event that the credit is at least B.

9. In recording the amount of time (months) that goes by until the patient relapses after an operation to remove a stomach cancer, if A is

A = [120, ∞) ⊂ Ω = [0, ∞),

then A is the event that the patient does not relapse before 10 years.

10. In recording the birth weights (kilograms) of twins, if A is

A = [4, ∞) × [4, ∞) ⊂ Ω = [0, ∞) × [0, ∞),

then A is the event that the weights of both twins are at least 4.

2.1.4 Set theory and events

We review basic set theory. Set theory is a fundamental tool introduced by George Boole (1815-1864) and Georg Cantor (1845-1918).

By a set we mean any collection into a whole of definite, distinct objects, which are called the elements of the set, of our perception or of our thought. If the object x is an element of the set S, we say that x belongs to S, or S contains x, and this is denoted

x ∈ S.

If x is not an element of S, then this is denoted

x ∉ S.

Consider a set that has no element at all. Such a set is called an empty set and is denoted ∅.

The elements of a set, also called its members, can be anything: numbers, people, letters of the alphabet, other sets, and so on. Sets are conventionally denoted with capital letters.

There are two ways of describing, or specifying the members of, a set. One way is by intensional definition, using a rule or semantic description. In this case the set is denoted

{x ∈ S : p(x)}.

This is called set-builder form.

The second way is by extension, that is, listing each member of the set. An extensional definition is notated by enclosing the list of members in braces:

{x_1, x_2, . . .}.

This is called tabular form. The order in which the elements of a set are listed in an extensional definition is irrelevant, as are any repetitions in the list. One often has the choice of specifying a set intensionally or extensionally.

If every member of set A is also a member of set B, then A is said to be a subset of B, written A ⊆ B (also pronounced A is contained in B). Equivalently, we can write B ⊇ A, read as B is a superset of A, B includes A, or B contains A. The relationship between sets established by ⊆ is called inclusion or containment.

If A is a subset of, but not equal to, B, then A is called a proper subset of B, written A ⊂ B (A is a proper subset of B) or B ⊃ A (B is a proper superset of A). The empty set is a subset of every set, and every set is a subset of itself.

The statement that sets A and B are equal means that they have precisely the same members (i.e., every member of A is also a member of B and vice versa). An obvious but very handy identity, which can often be used to show that two seemingly different sets are equal, is: A = B if and only if A ⊆ B and B ⊆ A.

The set

A ∪ B = {x : x ∈ A or x ∈ B}

is called the union of A and B.

The set

A ∩ B = {x : x ∈ A and x ∈ B}

is called the intersection of A and B.

The set

A \ B = {x : x ∈ A and x ∉ B}

is called the difference of A and B.
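Python's built-in set type implements these operations directly, which makes the definitions easy to experiment with. A sketch (ours; the particular sets are arbitrary):

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}
U = set(range(10))  # a universal set in the restricted sense

print(A | B)   # union: {1, 2, 3, 4, 5}
print(A & B)   # intersection: {3, 4}
print(A - B)   # difference A \ B: {1, 2}
print(A <= U)  # subset test A ⊆ U: True

# The complement relative to U, and the identity A \ B = A ∩ B^C:
complement_B = U - B
print(A - B == A & complement_B)  # True
```

The last line checks, on this example, the complement identity discussed below.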
What is a universal set? Does there exist the universal set U of all sets? If there existed such a U, then the set

R = {A ∈ U : A ∉ A}

would be well-defined. The problem arises when it is considered whether R is an element of itself. If R is an element of R, then according to the definition R is not an element of R. If R is not an element of R, then R has to be an element of R, again by its very definition. The statements "R is an element of R" and "R is not an element of R" cannot both be true; thus the contradiction. This is known as Russell's paradox. We conclude that there does not exist the set of all sets. When we say that U is a universal set, it is the set of all elements in a restricted sense. So, for example, the sample space Ω is the universal set of all possible outcomes in an experiment. The set IR is the universal set of all real numbers.

Now let U be a given universal set. The set

A^C = U \ A

is called the complement of A. It is clear that

A \ B = A ∩ B^C.
The cardinality of a set is the number of members of the set. For example,
1. The cardinality of the empty set is zero.
2. The cardinality of {x1, x2, . . . , xn} is n, which is a finite number.
Some sets have infinite cardinality. The set ℕ of natural numbers, for
instance, is infinite. Some infinite cardinalities are greater than others. For
instance, the set of real numbers has greater cardinality than the set of
natural numbers. However, it can be shown that the cardinality of (which is
to say, the number of points on) a straight line is the same as the cardinality
of any segment of that line, of an entire plane, and indeed of any Euclidean
space.
If a set A has a proper subset that is in one-to-one correspondence with A
itself, then A is called an infinite set. If a set A is not an infinite set, then it
is a finite set. For example, the set of natural numbers
ℕ = {1, 2, 3, . . .}
has a proper subset of even numbers
ℕ_E = {2, 4, 6, . . .}
that is in one-to-one correspondence with it (n ↔ 2n). So ℕ is an infinite set.
On the other hand, the set
A = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
cannot have any proper subset that is in one-to-one correspondence with A.
So A cannot be an infinite set; it is a finite set.
If an infinite set A is in one-to-one correspondence with
ℕ = {1, 2, 3, 4, . . .},
then we say that A is a countable set. By definition, a countable set can be
represented as
{x1, x2, x3, . . .}.
By definition, ℕ is a countable set. It is also known that the set ℤ of
integers and the set ℚ of rational numbers are countable sets.
If an infinite set is not countable, then it is called an uncountable set.
Consider
[0, 1] = {x ∈ ℝ : 0 ≤ x ≤ 1},
where ℝ is the set of real numbers. It is known that the set [0, 1] is an
uncountable set; this was proved by Cantor. Therefore the set ℝ of real numbers
and the set
ℚ^c = ℝ \ ℚ
of irrational numbers are uncountable. We observe that
ℕ ⊂ ℤ ⊂ ℚ ⊂ ℝ.
Because ℚ is a countable set, any infinite subset of ℚ is a countable set. The
rationals are a densely ordered set: between any two rationals there sits
another one, in fact infinitely many others.
A new set can be constructed by associating every element of one set with
every element of another set. The Cartesian product of two sets A and B,
denoted by A × B, is the set of all ordered pairs (a, b) such that a is a member
of A and b is a member of B.
Example 7
1. {1, 2} × {a, b} = {(1, a), (1, b), (2, a), (2, b)}.
2. {1, 2, g} × {r, w, g}
= {(1, r), (1, w), (1, g), (2, r), (2, w), (2, g), (g, r), (g, w), (g, g)}.
3. {1, 2} × {1, 2} = {(1, 1), (1, 2), (2, 1), (2, 2)}.
4. A × B × C = (A × B) × C.
5. ℝ² := ℝ × ℝ = {(x, y) : x, y ∈ ℝ}.
6. ℝ³ := ℝ × ℝ × ℝ = {(x, y, z) : x, y, z ∈ ℝ}.
In general, ℝⁿ is called the Euclidean space.
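The Cartesian products in Example 7 can be checked in a few lines of Python (an illustrative sketch, not part of the text); `itertools.product` enumerates exactly the ordered pairs in the definition:

```python
from itertools import product

A = {1, 2}
B = {"a", "b"}

# A x B: all ordered pairs (a, b) with a in A and b in B
cartesian = set(product(A, B))
print(sorted(cartesian))  # [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]

# |A x B| = |A| * |B|
assert len(cartesian) == len(A) * len(B)

# A x A, as in item 3 of Example 7
print(sorted(product({1, 2}, repeat=2)))  # [(1, 1), (1, 2), (2, 1), (2, 2)]
```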
In some random experiments the outcome is a sequence of numbers obtained
by repeating a basic experiment countably many times. In this case, outcomes
are sequences of numbers. For example, if the sample space of the basic
experiment is {0, 1} and we repeat this experiment countably many times,
then we could take the new sample space Ω = {0, 1}^ℕ, the space of entire
sequences ω = (ω1, ω2, . . .), where each ωi is 0 or 1. The set of outcomes of a
sequence of independent trials is such an example.
In some random experiments the outcome is the path or trajectory followed
by a system over an interval of time. In this case, outcomes are functions. For
example, if the system is observed over [0, ∞) and its path is continuous, we
could take Ω = C[0, ∞), the vector space of continuous real-valued functions
on [0, ∞). The sample paths of a Brownian motion are such an example.
We now understand the sample space and events as the universal set and
subsets.

2.1.5 Binomial theorem

We define C(n, r) ("n choose r"), for 0 ≤ r ≤ n, by

C(n, r) := n! / ((n − r)! r!)
and say that C(n, r) represents the number of possible combinations of n objects
taken r at a time. Notice that

C(n, n − r) = n! / (r! (n − r)!) = n! / ((n − r)! r!) = C(n, r).

We have the identity

C(n, r) = C(n − 1, r − 1) + C(n − 1, r), 1 ≤ r ≤ n.

By induction one can prove

Theorem 4 (The binomial theorem) For n ∈ ℕ,

(x + y)^n = Σ_{k=0}^{n} C(n, k) x^k y^(n−k).   (2.1)

Here, C(n, k) is called the binomial coefficient.
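Pascal's rule and the binomial theorem above are easy to check numerically; the sketch below (illustrative only) uses Python's `math.comb` for C(n, r):

```python
from math import comb

n = 10
# Pascal's rule: C(n, r) = C(n-1, r-1) + C(n-1, r) for 1 <= r <= n
for r in range(1, n + 1):
    assert comb(n, r) == comb(n - 1, r - 1) + comb(n - 1, r)

# Symmetry: C(n, r) = C(n, n - r)
for r in range(n + 1):
    assert comb(n, r) == comb(n, n - r)

# Binomial theorem (2.1) at x = 3, y = 2
x, y = 3, 2
lhs = (x + y) ** n
rhs = sum(comb(n, k) * x**k * y ** (n - k) for k in range(n + 1))
assert lhs == rhs
print(lhs)  # 9765625
```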

2.1.6 Exercises

1. Find 5 random experiments.
2. Determine the sample space in the above random experiments.
3. Describe events in the above random experiments.
4. Find 5 deterministic experiments.
5. Prove that ∅ ⊆ A for any set A.
6. Prove that A ⊆ A for any set A.
7. Describe the set that contains the solution of the equation x − 1 = 0.
8. Describe the set that contains the solution of the equation x + 1 = 0.
9. Describe the set that contains the solution of the equation 2x − 1 = 0.
10. Describe the set that contains the solution of the equation x² − 2 = 0.
11. Describe the set that contains the solution of the system of equations
x − y = 0, x + y = 2.
12. Prove that the set ℚ of rationals is a countable set.
13. Prove that the set [0, 1] is an uncountable set.
14. Use induction to prove the binomial theorem.
15. Prove that for p ∈ [0, 1],
Σ_{x=0}^{n} C(n, x) p^x (1 − p)^(n−x) = 1.
16. Prove that
Σ_{x=0}^{n} C(n, x)² = C(2n, n).

2.2 Definitions of Probability

In the previous sections, we defined the sample space as the set of all possible
outcomes in a stochastic experiment, and introduced the event as a subset
of the sample space. In this section we give a formal definition of the
probability of an event within a sample space.

2.2.1 Classical definition of probability

The probability of an event is the ratio of the number of cases favorable to it
to the number of all cases possible, when nothing leads us to expect that any
one of these cases should occur more than any other, which renders them,
for us, equally possible. This definition is essentially a consequence of the
principle of indifference. If elementary events are assigned equal probabilities,
then the probability of a disjunction of elementary events is just the number
of events in the disjunction divided by the total number of elementary events.
The classical definition of probability was called into question by several
writers of the nineteenth century, including John Venn and George Boole.
The frequentist definition of probability became widely accepted as a result
of their criticism, and especially through the works of R. A. Fisher. The
classical definition enjoyed a revival of sorts due to the general interest in
Bayesian probability.
Example 8 Toss a die, and record the face that lands uppermost. The sample
space is Ω = {1, 2, 3, 4, 5, 6}. The die is described as fair if these six
outcomes are equally likely. Consider the event A of even numbers, that is,
A = {2, 4, 6}. In this case, the classical probability of A is given by

P(A) = (number of outcomes in A) / (number of outcomes in Ω) = 3/6 = 0.5.

2.2.2 Frequency probability

Frequency probability is the interpretation of probability that defines an
event's probability as the limit of its relative frequency in a large number
of trials. The development of the frequentist account was motivated by the
problems and paradoxes of the previously dominant viewpoint, the classical
interpretation. The shift from the classical view to the frequentist view
represents a paradigm shift in the progression of statistical thought. This
school is often associated with the names of Jerzy Neyman and Egon Pearson,
who described the logic of statistical hypothesis testing. Other influential
figures of the frequentist school include John Venn, R. A. Fisher, and Richard
von Mises.
Frequentists talk about probabilities only when dealing with well-defined
random experiments. The set of all possible outcomes of a random experiment
is called the sample space of the experiment. An event is defined as
a particular subset of the sample space that you want to consider. For any
event only one of two possibilities can happen: it occurs or it does not occur.
The relative frequency of occurrence of an event, in a number of repetitions
of the experiment, is a measure of the probability of that event.
A further and more controversial claim is that in the long run, as the
number of trials approaches infinity, the relative frequency will converge
exactly to the probability.
One objection to this is that we can only ever observe a finite sequence,
and thus the extrapolation to the infinite involves unwarranted metaphysical
assumptions. This conflicts with the standard claim that the frequency
interpretation is somehow more objective than other theories of probability.
Example 9 Toss a coin over and over again, and record the number of times
the face H lands uppermost.
In the first, the basic experiment is repeated 100 times. The event {H}
occurs on 48 out of the 100 replicates. The relative frequency of this event is
just the proportion of replicates on which {H} occurs. That is,

RF_100({H}) = 48/100 = .48

The notation RF_100({H}) is defined as the relative frequency of the event
{H} in the 100 replicates.
In the second, the basic experiment is repeated 1000 times. The event
{H} occurs on 495 out of the 1000 replicates. The relative frequency of this
event is

RF_1000({H}) = 495/1000 = .495

In the third, the basic experiment is repeated 10000 times. The event
{H} occurs on 4995 out of the 10000 replicates. The relative frequency of
this event is

RF_10000({H}) = 4995/10000 = .4995

In the above example, we see intuitively that the limit of RF_n({H}) is
0.5, the classical probability of {H}.
Now we are in a position to give a formal definition of the probability of
an event. The probability of the event A, written P(A), is defined to be the
long-run relative frequency of A. That is,

P(A) = lim_{n→∞} RF_n(A),

where RF_n(A) is defined as the relative frequency of the event A in the n
replicates, and the meaning of the limit will be discussed in a later section.
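The convergence of RF_n(A) can be illustrated by simulation (an illustrative sketch; the sample sizes and the seed are arbitrary choices, not from the text):

```python
import random

random.seed(1)

def relative_frequency(n):
    """RF_n({H}): proportion of heads in n fair coin tosses."""
    heads = sum(random.random() < 0.5 for _ in range(n))
    return heads / n

for n in (100, 1000, 10000, 100000):
    print(n, relative_frequency(n))

# For large n the relative frequency is close to the classical probability 0.5.
assert abs(relative_frequency(100000) - 0.5) < 0.01
```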

2.2.3 Bayesian probability

Bayesian probability interprets the concept of probability as a measure of a
state of knowledge, and not as a frequency as in orthodox statistics. Broadly
speaking, there are two views on Bayesian probability that interpret the
state-of-knowledge concept in different ways. For the objectivist school,
the rules of Bayesian statistics can be justified by desiderata of rationality
and consistency and interpreted as an extension of Aristotelian logic [2][1].
For the subjectivist school, the state of knowledge corresponds to a personal
belief [3]. Many modern machine learning methods are based on objectivist
Bayesian principles [4]. One of the crucial features of the Bayesian view is
that a probability can be assigned to a hypothesis, which is not possible
under the frequentist view, where a hypothesis can only be rejected or not
rejected.

2.2.4 Axioms of probability

This section shows how to define probabilities relating to general experiments.
In probability theory, the probability P of some event A, denoted P(A),
is defined in such a way that P satisfies the Kolmogorov axioms, named after
Andrey Kolmogorov.

First axiom
The probability of an event is a non-negative real number:
P(A) ≥ 0 for all A ∈ F,
where F denotes the space of events in a sample space Ω.

Second axiom
This is the assumption of unit measure: the probability that some elementary
event in the entire sample space will occur is 1. More specifically,
there are no elementary events outside the sample space:
P(Ω) = 1.
This is often overlooked in some mistaken probability calculations; if you
cannot precisely define the whole sample space, then the probability of any
subset cannot be defined either.

Third axiom
This is the assumption of σ-additivity:
any countable sequence of pairwise disjoint events A1, A2, . . . satisfies

P(⋃_{n=1}^{∞} A_n) = Σ_{n=1}^{∞} P(A_n).

Some authors consider merely finitely additive probability spaces, in which
case one just needs an algebra of sets, rather than a σ-algebra.
We summarize the above axioms in the following

Definition 6 A probability P on the space (Ω, F) is a function P : F →
[0, ∞) satisfying
1. P(A) ≥ 0 for all A ∈ F;
2. P(Ω) = 1;
3. for any countable sequence of pairwise disjoint events A1, A2, . . .,
P(⋃_{n=1}^{∞} A_n) = Σ_{n=1}^{∞} P(A_n).
We call P(A) the probability of the event A.


Example 10 Let RF_n(A) denote the relative frequency of the event A in n
replicates. Consider the frequency probability given by
P(A) = lim_{n→∞} RF_n(A).
Show that P satisfies the Kolmogorov axioms of probability.

Solution. Let n(A) denote the number of times that the event A occurs. Then:
1. Given A ⊆ Ω, we have RF_n(A) = n(A)/n. So we have
0 ≤ P(A) ≤ 1.
2. Since RF_n(Ω) = n/n = 1, we get
P(Ω) = 1.
3. Given pairwise disjoint events A1, A2, . . ., let n_i denote the frequency
of the event A_i, where
n_1 + n_2 + · · · ≤ n.
Then the frequency of the event
A1 ∪ A2 ∪ · · ·
is exactly n_1 + n_2 + · · ·. Hence

RF_n(A1 ∪ A2 ∪ · · ·) = (n_1 + n_2 + · · ·)/n = n_1/n + n_2/n + · · ·
 = RF_n(A1) + RF_n(A2) + · · ·.

Taking the limit in the identity
RF_n(A1 ∪ A2 ∪ · · ·) = RF_n(A1) + RF_n(A2) + · · ·
we get
P(⋃_{i=1}^{∞} A_i) = Σ_{i=1}^{∞} P(A_i).

From the Kolmogorov axioms one can deduce some useful rules for calculating
probabilities.

Theorem 5
P(A^c) = 1 − P(A).

Proof. Notice that
A ∪ A^c = Ω,
and from the second axiom we get
P(A ∪ A^c) = P(Ω) = 1.
Since A and A^c are disjoint, the third axiom gives
P(A ∪ A^c) = P(A) + P(A^c).
Hence
P(A) + P(A^c) = 1,
that is,
P(A^c) = 1 − P(A).

Theorem 6 The probability of the empty event is zero. That is,
P(∅) = 0.

Proof. Since
∅ = Ω^c,
we get
P(∅) = 1 − P(Ω) = 1 − 1 = 0.

Theorem 7 If A ⊆ B then
P(B) = P(B \ A) + P(A).
Hence P(A) ≤ P(B).

Proof. Since
B = A ∪ (B ∩ A^c)
and A and B ∩ A^c are disjoint, the third axiom gives
P(B) = P(B \ A) + P(A).
Now, since P(B \ A) ≥ 0, we get
P(A) ≤ P(B).
Theorem 8 For any two events A and B,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
In particular, if A and B are disjoint then
P(A ∪ B) = P(A) + P(B).

Proof. The event A ∪ B consists of three disjoint events. That is,
A ∪ B = C ∪ D ∪ E,
where
C = A ∩ B,
D = A ∩ B^c,
E = A^c ∩ B.
The third axiom gives
P(A ∪ B) = P(C) + P(D) + P(E)
and
P(A) = P(C) + P(D),  P(B) = P(C) + P(E).
Hence
P(A) + P(B) = [P(C) + P(D)] + [P(C) + P(E)]
 = [P(C) + P(D) + P(E)] + P(C)
 = P(A ∪ B) + P(A ∩ B).
Hence
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
If A and B are disjoint then P(A ∩ B) = P(∅) = 0. Therefore we get
P(A ∪ B) = P(A) + P(B).
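Theorem 8 can be verified by brute-force enumeration on a small equally-likely sample space (an illustrative sketch reusing the fair-die setting of Example 8; the event B here is our own choice):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}          # fair die: each outcome has probability 1/6

def P(event):
    """Classical probability: |event| / |omega|."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}   # even numbers
B = {4, 5, 6}   # numbers at least 4

# Theorem 8: P(A u B) = P(A) + P(B) - P(A n B)
assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A | B))  # 2/3
```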

We state the following results without proof.

Theorem 9
1. For any countable events A1, A2, . . .,
P(⋃_{n=1}^{∞} A_n) ≤ Σ_{n=1}^{∞} P(A_n).
2. If A_n ⊆ A_{n+1} for all n ∈ ℕ, then
P(⋃_{n=1}^{∞} A_n) = lim_{n→∞} P(A_n).
3. If A_n ⊇ A_{n+1} for all n ∈ ℕ, then
P(⋂_{n=1}^{∞} A_n) = lim_{n→∞} P(A_n).

2.2.5 Exercises

1. For any countable events A1, A2, . . ., prove that
(⋃_{i=1}^{∞} A_i)^c = ⋂_{i=1}^{∞} A_i^c,  (⋂_{i=1}^{∞} A_i)^c = ⋃_{i=1}^{∞} A_i^c.
The identities are known as de Morgan's laws.

2. Prove that
P(A1 ∪ A2) ≤ P(A1) + P(A2)
for any two events A1, A2.

3. Prove that
P(⋃_{i=1}^{n} A_i) ≤ Σ_{i=1}^{n} P(A_i)
for any finite events A1, A2, . . . , An.

4. For any countable events A1, A2, . . ., define
B1 := A1,
B2 := A2 \ A1,
...
B_n := A_n \ (A1 ∪ · · · ∪ A_{n−1}),
...
Then prove that
(a) {B_n}_{n=1}^{∞} are pairwise disjoint,
(b) B_n ⊆ A_n,
(c) ⋃_{k=1}^{n} B_k = ⋃_{k=1}^{n} A_k for all n, and
(d) ⋃_{n=1}^{∞} B_n = ⋃_{n=1}^{∞} A_n.

5. Prove that
P(⋃_{i=1}^{∞} A_i) ≤ Σ_{i=1}^{∞} P(A_i)
for any countable events A1, A2, . . .. This inequality is known as
Boole's inequality.

2.3 Conditional probability and independence

In this section, we consider the concepts of conditional probability and
independence of events.

2.3.1 Conditional probability

Conditional probability is the probability of some event B, given the occurrence
of some other event A. Conditional probability is written P(B|A), and
is read "the probability of B, given A".
We give the following

Definition 7 Given a probability space (Ω, F, P) and two events A, B ∈ F
with P(A) > 0, the conditional probability of B given A is defined by

P(B|A) = P(B ∩ A) / P(A).   (2.2)

If P(A) = 0 then P(B|A) is undefined, or at any rate irrelevant; but see
the regular conditional probability that will be given in an advanced probability
class.
If we multiply by P(A) in (2.2) then we get the identity
P(B ∩ A) = P(A) P(B|A).
This general result is known as the multiplication theorem of probability.

Theorem 10 The conditional probability P(·|A) satisfies the Kolmogorov
axioms of probability. That is,
1. 0 ≤ P(B|A) ≤ 1 for B ⊆ Ω;
2. P(Ω|A) = 1;
3. for any countable sequence of pairwise disjoint events B1, B2, . . .,
P(⋃_{i=1}^{∞} B_i | A) = Σ_{i=1}^{∞} P(B_i|A).

Proof.
1. Let B ⊆ Ω. Then
0 ≤ P(B|A) = P(B ∩ A)/P(A) ≤ 1.
2.
P(Ω|A) = P(Ω ∩ A)/P(A) = P(A)/P(A) = 1.
3. For any countable sequence of pairwise disjoint events B1, B2, . . .,

P(⋃_{n=1}^{∞} B_n | A) = P((⋃_{n=1}^{∞} B_n) ∩ A) / P(A)
 = P(⋃_{n=1}^{∞} (B_n ∩ A)) / P(A)
 = Σ_{n=1}^{∞} P(B_n ∩ A) / P(A)
 = Σ_{n=1}^{∞} P(B_n|A).

We have used the distributive law
(⋃_{n=1}^{∞} B_n) ∩ A = ⋃_{n=1}^{∞} (B_n ∩ A)
in deriving the above identity.
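As a concrete check of Definition 7 and the multiplication theorem, one can enumerate the 36 equally likely outcomes of two dice (an illustrative sketch; the events chosen here are our own examples):

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def P(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def P_cond(B, A):
    """P(B|A) = P(B n A) / P(A), for P(A) > 0."""
    return P(lambda w: B(w) and A(w)) / P(A)

A = lambda w: w[0] + w[1] == 7      # the sum is 7
B = lambda w: w[0] == 3             # the first die shows 3

print(P_cond(B, A))  # 1/6

# Multiplication theorem: P(B n A) = P(A) P(B|A)
assert P(lambda w: A(w) and B(w)) == P(A) * P_cond(B, A)
```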

2.3.2 Independence of events

Consider two events A and B. From the conditional probability
P(B|A) = P(B ∩ A)/P(A)
we obtain the multiplication theorem
P(B ∩ A) = P(A) P(B|A).
Now, if P(A) > 0, then the identity
P(B|A) = P(B)
says that the occurrence or non-occurrence of A does not change the
probability of B. That is, if P(A) > 0 then
P(B|A) = P(B)
if and only if
P(B ∩ A) = P(B) P(A).
We now give the formal definition of the independence of the events A
and B.

Definition 8 Two events A and B are (mutually) independent if
P(A ∩ B) = P(A) P(B).

Theorem 11 If A and B are independent then A and B^c are independent.

Proof. Suppose that A and B are independent. Then from the identity
A = (A ∩ B) ∪ (A ∩ B^c)
we get
P(A) = P(A ∩ B) + P(A ∩ B^c)
 = P(A) P(B) + P(A ∩ B^c).
That is, we have
P(A ∩ B^c) = P(A)[1 − P(B)]
 = P(A) P(B^c).
So A and B^c are independent.

We now consider the independence of three or more events.

Definition 9 We say that three events A, B, and C are mutually independent if
P(A ∩ B ∩ C) = P(A) P(B) P(C),
P(A ∩ B) = P(A) P(B),
P(B ∩ C) = P(B) P(C),
P(C ∩ A) = P(C) P(A).

Definition 10 We say that three events A, B, and C are pairwise independent if
P(A ∩ B) = P(A) P(B),
P(B ∩ C) = P(B) P(C),
P(C ∩ A) = P(C) P(A).

Example 11
1. If A, B, C are mutually independent then they are pairwise independent.
2. The following example shows that pairwise independence does not imply
mutual independence in general.
Suppose A, B, C have the following property:
P(A) = P(B) = P(C) = 1/2,
P(A ∩ B) = P(B ∩ C) = P(C ∩ A) = 1/4,
P(A ∩ B ∩ C) = 1/4.
In this case, we have
P(A ∩ B ∩ C) = 1/4 ≠ 1/8 = P(A) P(B) P(C).
Therefore A, B, C are pairwise independent, yet they are not mutually
independent.
One can take the uniform probability 1/4 on the sample space Ω = {1, 2, 3, 4}
and take the events A = {1, 2}, B = {1, 3}, C = {1, 4}.
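The counterexample in Example 11(2) is easy to verify by enumeration (illustrative sketch):

```python
from fractions import Fraction

omega = {1, 2, 3, 4}                 # uniform probability 1/4 on each point

def P(event):
    return Fraction(len(event), len(omega))

A, B, C = {1, 2}, {1, 3}, {1, 4}

# Pairwise independent: P(X n Y) = P(X) P(Y) for each pair
assert P(A & B) == P(A) * P(B)
assert P(B & C) == P(B) * P(C)
assert P(C & A) == P(C) * P(A)

# But not mutually independent: P(A n B n C) = 1/4, while P(A)P(B)P(C) = 1/8
assert P(A & B & C) == Fraction(1, 4)
assert P(A) * P(B) * P(C) == Fraction(1, 8)
```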

3. The relation
P(A ∩ B ∩ C) = P(A) P(B) P(C)
does not imply the mutual independence of A, B, C.
Consider the sample space
Ω = {(i, j) : i, j = 1, 2, 3, 4, 5, 6}
in tossing two dice. Let
A = {i is 1, 2, or 3},
B = {i is 3, 4, or 5},
C = {i + j is 9}.
In this case,
A ∩ B ∩ C = {(3, 6)},
A ∩ B = {(3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6)},
B ∩ C = {(3, 6), (4, 5), (5, 4)},
C ∩ A = {(3, 6)},
and
P(A ∩ B ∩ C) = 1/36 = P(A) P(B) P(C).
However, A, B, C are not pairwise independent.
In general, we have the following

Definition 11 The events A1, A2, . . . , An are mutually independent if
P(A_{i1} ∩ · · · ∩ A_{im}) = P(A_{i1}) · · · P(A_{im})
for all 1 ≤ i1 < · · · < im ≤ n. They are pairwise independent if every pair of
events is independent.
We now consider the independence of a sequence of events A1, A2, A3, . . ..

Definition 12 A sequence of events A1, A2, A3, . . . is mutually independent
if every finite sub-collection of events is mutually independent. It is
pairwise independent if every pair of events is independent.

Consider a family of events in a sample space Ω.
Example 12
1. {∅, Ω},
2. {∅, A, A^c, Ω}, and
3. {B : B ⊆ Ω}
are meaningful families of events in probability and statistics.

Definition 13 Two families of events A and B are independent if
P(A ∩ B) = P(A) P(B)
for any A ∈ A and any B ∈ B.
Finally, we define conditional independence.

Definition 14 Consider A, B, C with P(C) > 0. We say that A and B are
conditionally independent given C if
P(A ∩ B | C) = P(A|C) P(B|C).

2.3.3 Bayes theorem

Is there any relation between P(A|B) and P(B|A)? It is very instructive to
develop the relations systematically. Actually, the relation between P(A|B)
and P(B|A) is given by the so-called Bayes theorem:

P(B|A) = P(A|B) P(B)/P(A).

In other words, one can only assume that P(A|B) is approximately equal
to P(B|A) if the prior probabilities P(A) and P(B) are also approximately
equal.
We begin by introducing the theorem of total probability.

Definition 15 A family of events {A1, A2, . . .} is called a partition of the
sample space Ω if
1. ⋃_{n=1}^{∞} A_n = Ω,
2. A_n ∩ A_m = ∅ for n ≠ m, and
3. P(A_n) > 0 for all n.
Theorem 12 (The theorem of total probability) Let {A_n} be a partition of
the sample space Ω. Given A ⊆ Ω we have

P(A) = Σ_{n=1}^{∞} P(A_n) P(A|A_n).   (2.3)

Proof. By the distributive law we get
A = A ∩ Ω = A ∩ (⋃_{n=1}^{∞} A_n) = ⋃_{n=1}^{∞} (A ∩ A_n).
Notice that the events {A ∩ A_n} are disjoint. So the third axiom and the
multiplication theorem give

P(A) = Σ_{n=1}^{∞} P(A ∩ A_n) = Σ_{n=1}^{∞} P(A_n) P(A|A_n).

Example 13 Suppose we plan a sample survey to discover what proportion
of students have used a legal drug (for example, a brand of a specific
toothpaste). Suppose r students said "Yes" among the n who answered the
question about using the legal drug. How do we find out the proportion of
students who have used the legal drug?

Solution. Let p be the probability that a randomly selected student has
used the legal drug. Then one can estimate p by
p̂ = r/n
by the relative frequency definition of probability.
We have supposed that the students told the truth to the question about
using the legal drug. What if we asked the question about an illegal drug? If
asked the question outright, many students might lie to avoid embarrassment.
The following example shows the role played by the theorem of total probability.

Example 14 Suppose we plan a sample survey to discover what proportion
of students have used an illegal drug. We prepare three question cards. The
first card says "Have you ever used an illegal drug?" The second card instructs
the student to "Answer No" and the third card instructs the student to "Answer
Yes". We intend to show each respondent all three cards, mix them,
allow the respondent to choose one at random, and then to respond appropriately.
In this way we hope to persuade virtually everyone to respond truthfully.
Suppose r students said "Yes" among the n who answered. How do we find
out the proportion of students who have used the illegal drug?
Solution. Let p be the probability that a randomly selected student has
used the illegal drug. Consider the three events
A1 = the event of choosing the first card,
A2 = the event of choosing the second card,
A3 = the event of choosing the third card.
Then A1, A2, A3 are disjoint, A1 ∪ A2 ∪ A3 = Ω, and
P(A1) = P(A2) = P(A3) = 1/3.
Hence {A1, A2, A3} is a partition of Ω. Let B be the event "Respondent
answers Yes". Then, by the theorem of total probability, we get

P(B) = P(B|A1) P(A1) + P(B|A2) P(A2) + P(B|A3) P(A3)
 = p · (1/3) + 0 · (1/3) + 1 · (1/3)
 = (p + 1)/3.

Because r students said "Yes" among the n who answered the question about
using the illegal drug, by the relative frequency definition of probability we
see that
P̂(B) = r/n.
That is,
(p + 1)/3 ≈ r/n.
Hence, one can estimate p by
p̂ = 3r/n − 1.
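A small simulation shows that the randomized-response estimator p̂ = 3r/n − 1 recovers the true proportion (illustrative sketch; the true p, the sample size, and the seed are our own arbitrary choices):

```python
import random

random.seed(7)

def survey(p_true, n):
    """Simulate n randomized responses; return the estimate 3r/n - 1."""
    r = 0
    for _ in range(n):
        card = random.choice([1, 2, 3])
        if card == 1:                       # real question, answered truthfully
            r += random.random() < p_true
        elif card == 3:                     # card forces "Yes"
            r += 1
        # card 2 forces "No": contributes nothing to r
    return 3 * r / n - 1

estimate = survey(p_true=0.20, n=100000)
print(round(estimate, 3))
assert abs(estimate - 0.20) < 0.03
```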

Theorem 13 (Bayes theorem) Let {A_n} be a partition of the sample space
Ω. Then for any A with P(A) > 0 we have

P(A_n|A) = P(A_n) P(A|A_n) / P(A) = P(A_n) P(A|A_n) / Σ_{i=1}^{∞} P(A_i) P(A|A_i).

That is,
P(A_n|A) ∝ P(A_n) P(A|A_n), n = 1, 2, . . ..

Proof. The multiplication theorem of probability gives
P(A ∩ A_n) = P(A|A_n) P(A_n)
and
P(A ∩ A_n) = P(A_n ∩ A) = P(A_n|A) P(A).
Hence
P(A_n|A) P(A) = P(A|A_n) P(A_n),
that is,
P(A_n|A) = P(A_n) P(A|A_n) / P(A).
Now apply the theorem of total probability to get

P(A_n|A) = P(A_n) P(A|A_n) / P(A) = P(A_n) P(A|A_n) / Σ_{i=1}^{∞} P(A_i) P(A|A_i).

Remark 4 Notice that an intuitive way of expressing Bayes theorem is
P(B|A) = P(A|B) P(B)/P(A).
For example, take B = {cause} and A = {effect}. Then we get
P(cause|effect) = P(effect|cause) P(cause)/P(effect).
In this case, P(cause|effect) is the probability that is not easy to find, yet
P(effect|cause) is the probability that is relatively easy to find.
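The posterior formula in Theorem 13 can be sketched as a short function over a finite partition (illustrative; the three-part partition and its numbers are invented for the example):

```python
def bayes_posterior(priors, likelihoods):
    """Given P(A_n) and P(A|A_n) over a finite partition,
    return P(A_n|A) = P(A_n) P(A|A_n) / sum_i P(A_i) P(A|A_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)          # P(A), by the theorem of total probability
    return [j / total for j in joint]

# Hypothetical 3-part partition
priors = [0.5, 0.3, 0.2]        # P(A_n)
likelihoods = [0.1, 0.4, 0.8]   # P(A | A_n)

posterior = bayes_posterior(priors, likelihoods)
print([round(p, 3) for p in posterior])

# The posterior probabilities form a distribution over the partition
assert abs(sum(posterior) - 1.0) < 1e-12
```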

2.3.4 Exercises

1. Prove that
P(A ∪ B) = 1 − (1 − P(A))(1 − P(B))
for any independent events A and B.

2. Prove that
P(⋃_{i=1}^{n} A_i) = 1 − Π_{i=1}^{n} (1 − P(A_i))
for any independent events A1, A2, . . . , An.

3. Prove that if P(A_n) = 1 for all n ∈ ℕ then
P(⋂_{n=1}^{∞} A_n) = 1.

4. Prove that
P(A ∩ B) ≥ P(A) + P(B) − 1
for any events A and B. This is known as Bonferroni's inequality.

5. Define
A△B = (A \ B) ∪ (B \ A)
for any events A and B. We say that A△B is the symmetric difference
of A and B. Prove that
P(A△B) = P(A) + P(B) − 2P(A ∩ B).

2.4 Distribution of a random variable

Intuitively, a random variable is thought of as a function mapping the sample
space of a random process to the real numbers. In this section we study
random variables and their distributions.

Example 15 Consider the experiment of recording the number of babies born
tomorrow in a hospital. In this case the sample space is given by
Ω = {0, 1, 2, . . .}
and the outcomes are numbers.

Example 16 Consider the experiment of recording the grade received after
taking Basic Thinking in Probability. In this case the sample space is given by
Ω = {A+, A, B+, B, C+, C, D+, D, F}
and the outcomes are not numbers. Applying the following rule:
A+ → 95, A → 90, B+ → 85, B → 80, C+ → 75, C → 70, D+ → 65, D → 60, F → 0,
we get the numbers
{95, 90, 85, 80, 75, 70, 65, 60, 0}.

Example 17 Consider the experiment of recording the height of a tree in a
forest. In this case the sample space is given by
Ω = [0, ∞)
and the outcomes are numbers.
Definition 16 A random variable X on a probability space (Ω, F, P) is a
real-valued function defined on the sample space Ω.

Consider a random variable X : Ω → ℝ. Let ω be an element of Ω.
Then the real number X(ω) is called a realization of the random variable X.
We usually denote the random variable by a capital letter X, and the realization
by a lower-case x.

Example 18
1. Consider the experiment of recording the number of babies born tomorrow
in a hospital. Define the random variable X by the identity function
X(ω) = ω. Then the range of X is
R_X = {0, 1, 2, . . .}.
2. Consider the experiment of recording the grade received after taking Basic
Thinking in Probability. Define the random variable Y by the following
rule:
Y(A+) = 95, Y(A) = 90, Y(B+) = 85, Y(B) = 80, Y(C+) = 75,
Y(C) = 70, Y(D+) = 65, Y(D) = 60, Y(F) = 0.
Then the range of Y is
R_Y = {95, 90, 85, 80, 75, 70, 65, 60, 0}.
3. Consider the experiment of recording the height of a tree in a forest.
Define the random variable Z by the identity function Z(ω) = ω.
Then the range of Z is
R_Z = [0, ∞).

In the above example, the range R_X is a countable set and R_Y is a finite
set. In this case the random variables X and Y are called discrete random
variables. The range of Z is
R_Z = [0, ∞),
and in this case the random variable Z is not discrete. We treat this kind of
random variable in a later section.

2.4.1 Discrete random variable

Definition 17 A random variable is discrete if the range of the random
variable is finite or countable.

Consider a discrete random variable X whose range is R_X. Then the range
of X is a finite set
R_X = {x1, x2, . . . , xn}
or a countable set
R_X = {x1, x2, . . .}.
We assume that
R_X = {x1, x2, . . .}
without loss of generality. Define
f(x) = P(X = x)
for all x ∈ ℝ. Notice that P is the probability on (Ω, F) and {X = x} is an
event, since
{X = x} = {ω ∈ Ω : X(ω) = x} ∈ F.
If x ∈ R_X^c then
f(x) = 0,
and if x ∈ R_X then
f(x) > 0.
We now define the probability mass function of a discrete random variable.

Definition 18 Consider a discrete random variable X with range R_X. The
function f_X defined by
f_X(x) = P(X = x)
for x ∈ ℝ is called the probability mass function of X. In this case, if x ∈ R_X^c
then f_X(x) = 0, and if x ∈ R_X then f_X(x) > 0.

Proposition 1 Let f_X be the probability mass function of the discrete random
variable X. Then
1. f_X(x) ≥ 0 for all x ∈ ℝ,
2. R_X = {x : f_X(x) > 0} is finite or countable, and
3. if R_X = {x : f_X(x) > 0} = {x1, x2, . . .}, then Σ_{i=1}^{∞} f_X(x_i) = 1.

Proof. Parts 1 and 2 are clear from the definition. In order to prove part 3,
we observe that {X = x_i}, i = 1, 2, . . ., is a partition of Ω. Hence we get

Σ_{i=1}^{∞} f_X(x_i) = Σ_{i=1}^{∞} P(X = x_i) = P(⋃_{i=1}^{∞} {X = x_i}) = P(Ω) = 1.

Definition 19 A function f defined on ℝ is called a probability mass function if
1. f(x) ≥ 0 for all x ∈ ℝ,
2. R = {x : f(x) > 0} is finite or countable, and
3. if R = {x : f(x) > 0} = {x1, x2, . . .}, then Σ_{i=1}^{∞} f(x_i) = 1.

We have observed that if a discrete random variable X is given, then we
can generate a probability mass function. Conversely, if a probability mass
function is given, then one can generate a discrete random variable.

Proposition 2 Let f be a probability mass function on ℝ. Then there exist
a probability space (Ω, F, P) and a discrete random variable X defined on it
such that the probability mass function f_X of X is equal to f.

Proof. Write R = {x : f(x) > 0} = {x1, x2, . . .}. Define the sample space by
Ω := {x1, x2, . . .},
the event space by
F := 2^Ω,
and the probability P by
P({ω}) = f(x_i) for ω = x_i.
Then (Ω, F, P) is a probability space. Now define the random variable X by
X(ω) = x_i for ω = x_i.
Then
{ω : X(ω) = x_i} = {x_i}
and
P(X = x_i) = P({x_i}) = f(x_i).
The proof is completed.
Example 19 Consider a random variable X with probability mass function
given by the table:

x          0    1    2
P(X = x)  1/4  1/2  1/4

Then
f_X(0) + f_X(1) + f_X(2) = 1.
We introduce some discrete random variables.

Bernoulli random variable
In the theory of probability and statistics, a Bernoulli trial is an experiment
whose outcome is random and can be either of two possible outcomes,
"success" and "failure".
In practice it refers to a single experiment which can have one of two
possible outcomes. These events can be phrased as yes-or-no questions:
1. Did the coin land heads?
2. Was the newborn child a girl?
3. Were a person's eyes green?
4. Did a mosquito die after the area was sprayed with insecticide?
5. Did a potential customer decide to buy a product?
6. Did a citizen vote for a specific candidate?
7. Did an employee vote pro-union?
Therefore "success" and "failure" are labels for outcomes, and should not be
construed literally. Examples of Bernoulli trials include:
1. Flipping a coin. In this context, obverse ("heads") conventionally denotes
success and reverse ("tails") denotes failure. A fair coin has probability
of success 0.5 by definition.
2. Rolling a die, where a six is "success" and everything else a "failure".
3. In conducting a political opinion poll, choosing a voter at random to
ascertain whether that voter will vote "yes" in an upcoming referendum.
In a Bernoulli trial, the sample space can be written as
Ω := {S, F},
where S stands for a success and F stands for a failure. Let p ∈ [0, 1]. Then
the probability P is given by
P(S) = p,  P(F) = 1 − p.
Then a random variable X can be defined on this sample space, that is,
a function X : Ω → ℝ. In this case the random variable is given by
X(ω) = 1 if ω = S, and X(ω) = 0 if ω = F.
The range of X is {0, 1} and the probability mass function is given by
f(0) = P(X = 0) = 1 − p,  f(1) = P(X = 1) = p.
That is,
f(x) = P(X = x) = p^x (1 − p)^(1−x), x = 0, 1.
Clearly,
f(0) + f(1) = 1.
We call this random variable X the Bernoulli random variable, denoted by
X ~ B(1, p).
Binomial random variable
We perform a Bernoulli trial n times independently. The sample space is
Ω := {ω : ω = (ω1, . . . , ωn), ωi = S or F}
and
P(ω) = p^(number of S's) (1 − p)^(number of F's).
Define a random variable X as the number of successes in the n trials. Then
the range of X is
{0, 1, 2, . . . , n}
and the probability mass function is given by

f(x) = P(X = x) = C(n, x) p^x (1 − p)^(n−x),  x = 0, 1, . . . , n.

Using the binomial theorem, we see that

Σ_{x=0}^{n} f(x) = Σ_{x=0}^{n} C(n, x) p^x (1 − p)^(n−x) = (p + (1 − p))^n = 1.

We call this random variable X the binomial random variable, denoted by
X ~ B(n, p).
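A minimal sketch of the binomial probability mass function (the function name is ours; `math.comb` supplies C(n, x)):

```python
from math import comb, isclose

def binomial_pmf(x, n, p):
    """f(x) = C(n, x) p^x (1 - p)^(n - x), for x = 0, 1, ..., n."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.3
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

# The pmf sums to 1, as the binomial theorem guarantees
assert isclose(sum(pmf), 1.0)
print(round(binomial_pmf(3, n, p), 4))  # P(X = 3) = 0.2668
```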
Geometric random variable

We perform Bernoulli trials independently until the first success occurs.
Then the sample space is
$$\Omega := \{S, FS, FFS, FFFS, \ldots\}$$
and
$$P(\omega) = p\,(1-p)^{\text{number of failures}}.$$
Define a random variable $X$ as the number of failures until the first success.
Then the range of $X$ is
$$\{0, 1, 2, \ldots\}$$
and the probability mass function is given by
$$f(x) = P(X = x) = p\,(1-p)^x, \quad x = 0, 1, 2, \ldots.$$
Using a geometric series, we get
$$\sum_{x=0}^{\infty} f(x) = p \sum_{x=0}^{\infty} (1-p)^x = 1.$$
We call this random variable $X$ the geometric random variable and
write $X \sim G(p)$.
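Since the geometric pmf sums to 1 only through an infinite series, a numerical check truncates the sum; the tail $(1-p)^N$ is negligible for moderate $N$. A sketch in standard-library Python (the function name is ours):

```python
def geometric_pmf(x, p):
    # P(X = x) = p * (1 - p)^x, x = 0, 1, 2, ...
    return p * (1 - p)**x

p = 0.25
# Partial sum of the first 200 terms; the remaining tail is (1 - p)^200, tiny.
partial = sum(geometric_pmf(x, p) for x in range(200))
```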

Poisson random variable

We now introduce the Poisson random variable in a different way. Consider
the following function from calculus:
$$f(x) = e^{-\lambda} \frac{\lambda^x}{x!}, \quad x = 0, 1, \ldots, \quad \lambda > 0.$$
Then $f$ has the following properties:
1. $f \geq 0$;
2. $\{x : f(x) > 0\} = \{0, 1, 2, \ldots\}$;
3. $\sum_{x=0}^{\infty} f(x) = 1$.
Then by Proposition 2, there exist a probability space $(\Omega, \mathcal{F}, P)$ and a discrete
random variable $X$ defined on it such that the probability mass function $f_X$
of $X$ is equal to $f$.
We call this random variable $X$ the Poisson random variable and
write $X \sim \mathcal{P}(\lambda)$.
That is, if $X \sim \mathcal{P}(\lambda)$, then the probability mass function is, for $\lambda > 0$,
$$f(x) = P(X = x) = e^{-\lambda} \frac{\lambda^x}{x!}, \quad x = 0, 1, \ldots,$$
and by Taylor's theorem we get
$$\sum_{x=0}^{\infty} f(x) = e^{-\lambda} \sum_{x=0}^{\infty} \frac{\lambda^x}{x!} = e^{-\lambda} e^{\lambda} = 1.$$

We have the following approximation theorem of the binomial random
variable by the Poisson random variable.

Theorem 14 Suppose that $X \sim B(n, p)$ with $n$ large enough and $p$ small
enough so that $\lambda = np$. Then $X$ is approximately $\mathcal{P}(\lambda)$.

Proof. Suppose $X \sim B(n, p)$ and let $\lambda = np$. Then
$$P(X = x) = \frac{n!}{(n-x)!\,x!}\, p^x (1-p)^{n-x}
= \frac{n!}{(n-x)!\,x!} \left(\frac{\lambda}{n}\right)^x \left(1 - \frac{\lambda}{n}\right)^{n-x}
= \frac{n(n-1)\cdots(n-x+1)}{n^x}\, \frac{\lambda^x}{x!}\, \frac{(1 - \lambda/n)^n}{(1 - \lambda/n)^x}.$$
Since $n$ is large and $p$ is small with $\lambda = np$, we get
$$\left(1 - \frac{\lambda}{n}\right)^n \approx e^{-\lambda}, \qquad
\frac{n(n-1)\cdots(n-x+1)}{n^x} \approx 1, \qquad
\left(1 - \frac{\lambda}{n}\right)^x \approx 1.$$
Hence
$$P(X = x) \approx e^{-\lambda} \frac{\lambda^x}{x!}.$$
2.4.2 Exercises

1. Suppose that there are two defective items among ten items.
   (a) Suppose that one item is drawn.
       i. The probability that the drawn item is defective is given by
          $$\frac{\binom{2}{1}\binom{8}{0}}{\binom{10}{1}} = \frac{2}{10}.$$
       ii. The probability that the drawn item is non-defective is given by
          $$\frac{\binom{2}{0}\binom{8}{1}}{\binom{10}{1}} = \frac{8}{10}.$$
   (b) Suppose that two items are drawn. Then we are interested in the
       following probabilities.
       i. The probability that both drawn items are non-defective is given by
          $$\frac{\binom{2}{0}\binom{8}{2}}{\binom{10}{2}} = \frac{28}{45}.$$
       ii. The probability that one is defective and the other is non-defective is given by
          $$\frac{\binom{2}{1}\binom{8}{1}}{\binom{10}{2}} = \frac{16}{45}.$$
       iii. The probability that both drawn items are defective is given by
          $$\frac{\binom{2}{2}\binom{8}{0}}{\binom{10}{2}} = \frac{1}{45}.$$
       iv. Observe that
          $$\frac{\binom{2}{0}\binom{8}{2}}{\binom{10}{2}} + \frac{\binom{2}{1}\binom{8}{1}}{\binom{10}{2}} + \frac{\binom{2}{2}\binom{8}{0}}{\binom{10}{2}} = \frac{28}{45} + \frac{16}{45} + \frac{1}{45} = 1.$$
   (c) In general, we consider the experiment of drawing $n$ items from
       $N$ items of which $M$ are defective and $N - M$ are non-defective.
       Then the sample space is
       $$\Omega = \{\,x \text{ among } M \text{ and } n - x \text{ among } N - M : x = 0, 1, \ldots, n\,\},$$
       where $0 \leq x \leq M$ and $0 \leq n - x \leq N - M$, and
       $$P(x \text{ among } M \text{ and } n - x \text{ among } N - M) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}.$$
       Define the random variable $X$ as the number of defective items.
       Then the range of $X$ is
       $$\{0, 1, 2, \ldots, \min\{n, M\}\}$$
       and the probability mass function is
       $$f(x) = P(X = x) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}, \quad x = 0, 1, \ldots, \min\{n, M\}.$$
       Prove that
       $$\sum_{x=0}^{\min\{n,M\}} f(x) = \sum_{x=0}^{\min\{n,M\}} \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}} = 1.$$
       We call this random variable $X$ the hypergeometric random variable and write $X \sim H(M, N, n)$.
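The computations in parts (a) and (b) follow directly from the hypergeometric pmf. A sketch in standard-library Python (the helper name is ours) reproduces the values $28/45$, $16/45$, and $1/45$ and checks that they sum to 1:

```python
import math

def hypergeom_pmf(x, M, N, n):
    # P(X = x) = C(M, x) * C(N - M, n - x) / C(N, n)
    return math.comb(M, x) * math.comb(N - M, n - x) / math.comb(N, n)

# Two defective items (M = 2) among N = 10, drawing n = 2
probs = [hypergeom_pmf(x, 2, 10, 2) for x in range(3)]
```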
2. Consider the sequence
   $$a_n = \left(1 + \frac{1}{n}\right)^n$$
   of rational numbers.
   (a) Prove the sequence $a_n$ is increasing. That is, prove that $a_n \leq a_{n+1}$
       for all $n \in \mathbb{N}$.
   (b) Prove that $a_n \leq 3$ for all $n \in \mathbb{N}$.
   (c) State the completeness of the real numbers, and prove that the
       sequence $a_n$ has a least upper bound and that this bound is the limit.
   (d) Write
       $$e := \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n.$$
       Then compare $e$ with the number
       $$1 + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} + \cdots.$$
3. Consider the harmonic series
   $$\sum_{n=1}^{\infty} \frac{1}{n}.$$
   Let $s_n$ denote the $n$th partial sum of the sequence $\{\frac{1}{n} : n \in \mathbb{N}\}$.
   (a) Prove that $s_{2^n} > 1 + \frac{n}{2}$ for all $n \in \mathbb{N}$.
   (b) Prove that the infinite series $\sum_{n=1}^{\infty} \frac{1}{n}$ is divergent.
   (c) Prove that the infinite integral $\int_1^{\infty} \frac{1}{x^p}\,dx$ is convergent for all $p > 1$.
   (d) Prove that the infinite series $\sum_{n=1}^{\infty} \frac{1}{n^p}$ is convergent for all $p > 1$.

2.5 Expectation

We note that
$$1 + 2 = 3$$
is given by the axioms of the real numbers. We know that the operation $+$ is
a binary operation. Then what is the meaning of
$$1 + 2 + 3 = 6?$$
The basic meaning is to first compute
$$1 + 2 = 3$$
and then compute
$$3 + 3 = 6.$$
That is,
$$(1 + 2) + 3 = 6.$$
Now, applying the commutative law and the associative law of real numbers,
we see that
$$(1+2)+3 = 1+(2+3) = (1+3)+2 = 1+(3+2) = (2+1)+3 = 2+(1+3) = (2+3)+1 = 2+(3+1) = 6.$$
In this case we may write any of the above as
$$1 + 2 + 3$$
and the result is 6. Similarly, we can give meaning to the expression
$$a_1 + a_2 + a_3 + a_4,$$
and by induction we can define, for $n \in \mathbb{N}$,
$$a_1 + a_2 + \cdots + a_n.$$
Write
$$\sum_{i=1}^{n} a_i := a_1 + a_2 + \cdots + a_n.$$
The index $i$ in the symbol $\sum_{i=1}^{n} a_i$ is a dummy index, and each of
$$\sum_{i=1}^{n} a_i = \sum_{k=1}^{n} a_k = \sum_{l=1}^{n} a_l = \sum_{x=1}^{n} a_x = \sum_{y=1}^{n} a_y$$
is equal to
$$a_1 + a_2 + \cdots + a_n.$$
Given a sequence $\{a_n\}_{n=1}^{\infty}$, we can define the infinite series as
$$\sum_{n=1}^{\infty} a_n := \lim_{n \to \infty} \sum_{k=1}^{n} a_k = a_1 + a_2 + \cdots,$$
as in calculus.
Consider the following game of chance. In order to play the game you
must pay $a$ dollars. Suppose that the random variable $X$ takes the values
$x_1, x_2, \ldots, x_r$ and that you receive $X$ dollars as the result of the game. Should
we play the game? It is difficult to answer the question if we play the game
only once. If one plays the game $n$ times, then before the $n$ games you have to
pay $na$ dollars, and after $n$ games you receive $X_1 + X_2 + \cdots + X_n$ dollars, where
$X_1, X_2, \ldots, X_n$ are independent and identically distributed random variables
with the common probability mass function $f$. Now, write $N_n(x_i)$ for the
number of occurrences of $x_i$ among the $n$ games. Then we have
$$X_1 + X_2 + \cdots + X_n = \sum_{i=1}^{r} x_i N_n(x_i).$$
Dividing by $n$ we get
$$\frac{X_1 + X_2 + \cdots + X_n}{n} = \sum_{i=1}^{r} x_i \frac{N_n(x_i)}{n}.$$
If $n$ is large, then $N_n(x_i)/n$ is approximately equal to $f(x_i)$, and
the right-hand side is approximately equal to
$$\mu := \sum_{i=1}^{r} x_i f(x_i).$$
If $\mu > a$ then you earn money, if $\mu < a$ you lose money, and if $\mu = a$ you
break even.
We now call the number
$$\mu := \sum_{i=1}^{r} x_i f(x_i)$$
the expectation of $X$.
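The frequency argument above, that $N_n(x_i)/n$ approaches $f(x_i)$, can be seen in a simulation. The sketch below (standard-library Python, with a fixed seed so the run is reproducible) plays a fair-die "game" many times; the sample mean approaches the expectation $\mu = 3.5$:

```python
import random

random.seed(0)  # fixed seed for reproducibility
n = 100_000
rolls = [random.randint(1, 6) for _ in range(n)]
sample_mean = sum(rolls) / n  # should be close to mu = 3.5
```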
Suppose a discrete random variable $X$ has the probability mass function
$f$. Then we can think of the expectation of $X$ as
$$\sum_{j=1}^{\infty} x_j f(x_j). \tag{2.4}$$

Definition 20 A discrete random variable $X$ takes values $x_1, x_2, \ldots$ and has
the probability mass function $f$. If
$$\sum_j |x_j| f(x_j) < \infty,$$
then
$$EX := \sum_{j=1}^{\infty} x_j f(x_j). \tag{2.5}$$
We say that the random variable $X$ has the (finite) expectation; if
$\sum_j |x_j| f(x_j) = \infty$ then the expectation is not defined.


Example 20
1. (Binomial random variable). If $X \sim B(n, p)$ then the probability mass
   function is
   $$f(x) = P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n,$$
   and
   $$EX = np.$$
   Proof. Using
   $$x \binom{n}{x} = n \binom{n-1}{x-1}$$
   we get
   $$EX := \sum_{x=0}^{n} x f(x)
   = \sum_{x=0}^{n} x \binom{n}{x} p^x (1-p)^{n-x}
   = np \sum_{x=1}^{n} \binom{n-1}{x-1} p^{x-1} (1-p)^{n-x}
   = np \sum_{y=0}^{n-1} \binom{n-1}{y} p^y (1-p)^{n-1-y}
   = np\,(p + (1-p))^{n-1} = np.$$
2. (Geometric random variable). If $X \sim G(p)$ then the probability mass
   function is
   $$f(x) = P(X = x) = p q^x, \quad x = 0, 1, 2, \ldots, \quad q = 1 - p,$$
   and
   $$EX = \frac{q}{p}.$$
   Proof. Using the geometric series we see that
   $$EX := \sum_{x=0}^{\infty} x p q^x
   = pq \sum_{x=0}^{\infty} x q^{x-1}
   = pq \sum_{x=0}^{\infty} \frac{d}{dq} q^x
   = pq \frac{d}{dq} \sum_{x=0}^{\infty} q^x
   = pq \frac{d}{dq} \frac{1}{1-q}
   = pq \frac{1}{(1-q)^2} = pq \frac{1}{p^2} = \frac{q}{p}.$$
3. (Poisson random variable). If $X \sim \mathcal{P}(\lambda)$, $\lambda > 0$, then the probability
   mass function is
   $$f(x) = e^{-\lambda} \frac{\lambda^x}{x!}, \quad x = 0, 1, \ldots,$$
   and
   $$EX = \lambda.$$
   Proof. Using the Taylor series we get
   $$EX := \sum_{x=0}^{\infty} x\, e^{-\lambda} \frac{\lambda^x}{x!}
   = \lambda e^{-\lambda} \sum_{j=1}^{\infty} \frac{\lambda^{j-1}}{(j-1)!}
   = \lambda e^{-\lambda} \sum_{j=0}^{\infty} \frac{\lambda^j}{j!}
   = \lambda e^{-\lambda} e^{\lambda} = \lambda.$$
4. Suppose the random variable $X$ has the probability mass function
   $$f(x) = \frac{1}{x(x+1)}, \quad x = 1, 2, \ldots.$$
   Since
   $$\sum_{x=1}^{\infty} f(x) = \sum_{x=1}^{\infty} \left(\frac{1}{x} - \frac{1}{x+1}\right) = (1 - 1/2) + (1/2 - 1/3) + \cdots = 1,$$
   we see that $f$ is indeed a probability mass function. On the other
   hand, since
   $$\sum_{x=1}^{\infty} |x| f(x) = \sum_{x=1}^{\infty} \frac{1}{1+x} = \infty,$$
   $X$ does not have the expectation.

2.5.1 Properties of expectation

We introduce the properties of expectation without proof. The following
result is known as the law of the unconscious statistician.
Theorem 15 Suppose that the discrete random variable $X$ has the probability
mass function $f$, and $\varphi$ is a real-valued function defined on $\mathbb{R}$. Suppose
that
$$\sum_x |\varphi(x)| f(x) < \infty.$$
Then the random variable $Z = \varphi(X)$ has the expectation
$$EZ = \sum_x \varphi(x) f(x).$$
Theorem 16 Suppose that the random variables $X$ and $Y$ have expectations.
1. If $c$ is a constant and $P(X = c) = 1$, then $EX = c$.
2. If $c$ is a constant, then $cX$ has an expectation and $E(cX) = cEX$.
3. The random variable $X + Y$ has an expectation and $E(X + Y) = EX + EY$.
4. If $P(X \leq Y) = 1$ then $EX \leq EY$; in this case, $EX = EY$ if and only
   if $P(X = Y) = 1$.
5. $|EX| \leq E|X|$.

Theorem 17 If $P(|X| \leq M) = 1$ for a constant $M$, then $X$ has an expectation and $|EX| \leq M$.
Definition 21 Random variables $X_1, \ldots, X_n$ are independent if
$$P(X_1 \in A_1, \ldots, X_n \in A_n) = P(X_1 \in A_1) \cdots P(X_n \in A_n)$$
for all subsets $A_1, \ldots, A_n$ of $\mathbb{R}$.

Theorem 18 Suppose that $X$ and $Y$ are independent and have expectations. Then $XY$ has an expectation and
$$E(XY) = EX \cdot EY.$$

Example 21 Suppose that $X$ is a symmetric random variable and let $Y = X^2$. Then
$$E(XY) = EX \cdot EY.$$
However, $X$ and $Y$ are not independent.

Theorem 19 Suppose that the discrete random variable $X$ takes nonnegative integer values. Then $X$ has an expectation if and only if the
series $\sum_{x=0}^{\infty} P(X > x)$ converges. In this case,
$$EX = \sum_{x=0}^{\infty} P(X > x).$$

2.5.2 Moments of a random variable

Let $X$ be a random variable and let $EX$ be its expectation. Then $X - EX$ is
a random variable. We say that $X - EX$ is the deviation
of $X$, $|X - EX|$ is the absolute deviation of $X$, and $(X - EX)^2$ is the squared
deviation of $X$. We now define the variance and standard deviation of $X$
using the squared deviation.

Definition 22 Suppose the discrete random variable $X$ has the probability
mass function $f$. Let $\mu = EX$. We define the variance of $X$ as
$$E(X - \mu)^2 := \sum_x (x - \mu)^2 f(x),$$
denoted $V(X)$. If the series $\sum_x (x - \mu)^2 f(x)$ converges then the variance $V(X)$ is defined and is a nonnegative number. In this case,
$$\sqrt{V(X)}$$
is called the standard deviation and is denoted $sd(X)$ or $\sigma(X)$.

We have the following.

Theorem 20
$$V(X) = EX^2 - [EX]^2.$$
Proof. Let $\mu = EX$. Then
$$V(X) = E(X - \mu)^2 = \sum_x (x - \mu)^2 f(x)
= \sum_x (x^2 - 2\mu x + \mu^2) f(x)
= \sum_x x^2 f(x) - 2\mu \sum_x x f(x) + \mu^2 \sum_x f(x)
= EX^2 - 2\,EX \cdot EX + [EX]^2
= EX^2 - [EX]^2.$$

It is often convenient to compute $EX(X-1)$, instead of $EX^2$, in getting
the variance $V(X)$ of $X$.

Theorem 21
$$V(X) = EX(X-1) - [EX]^2 + EX.$$
Proof.
$$EX(X-1) - [EX]^2 + EX = EX^2 - EX - [EX]^2 + EX = EX^2 - [EX]^2 = V(X).$$

Example 22 (Binomial random variable). If $X \sim B(n, p)$ then
$$f(x) = P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n,$$
$$EX(X-1) = n(n-1)p^2,$$
and
$$V(X) = np(1-p).$$
Proof. Using
$$x(x-1) \binom{n}{x} = n(n-1) \binom{n-2}{x-2}$$
we get
$$EX(X-1) := \sum_{x=0}^{n} x(x-1) f(x)
= \sum_{x=2}^{n} x(x-1) \binom{n}{x} p^x (1-p)^{n-x}
= n(n-1)p^2 \sum_{x=2}^{n} \binom{n-2}{x-2} p^{x-2} (1-p)^{n-x}
= n(n-1)p^2 \sum_{y=0}^{n-2} \binom{n-2}{y} p^y (1-p)^{n-2-y}
= n(n-1)p^2 (p + (1-p))^{n-2}
= n(n-1)p^2.$$
Hence
$$V(X) = EX(X-1) - [EX]^2 + EX = n(n-1)p^2 - (np)^2 + np = np(1-p).$$

Example 23 (Geometric random variable). If $X \sim G(p)$ then
$$f(x) = P(X = x) = (1-p)^x p, \quad x = 0, 1, 2, \ldots, \quad q = 1 - p,$$
$$EX(X-1) = \frac{2q^2}{p^2},$$
and
$$V(X) = \frac{q}{p^2}.$$
Proof. Let $q = 1 - p$. Using the geometric series, we get
$$EX(X-1) := \sum_{x=0}^{\infty} x(x-1) p q^x
= pq^2 \sum_{x=0}^{\infty} x(x-1) q^{x-2}
= pq^2 \sum_{x=0}^{\infty} \frac{d^2}{dq^2} q^x
= pq^2 \frac{d^2}{dq^2} \sum_{x=0}^{\infty} q^x
= pq^2 \frac{d^2}{dq^2} \frac{1}{1-q}
= pq^2 \frac{2}{(1-q)^3} = pq^2 \frac{2}{p^3} = \frac{2q^2}{p^2}.$$
Hence
$$V(X) = EX(X-1) - [EX]^2 + EX = \frac{2q^2}{p^2} - \frac{q^2}{p^2} + \frac{q}{p} = \frac{q^2 + qp}{p^2} = \frac{q}{p^2}.$$

Example 24 (Poisson random variable). If $X \sim \mathcal{P}(\lambda)$, $\lambda > 0$, then
$$f(x) = e^{-\lambda} \frac{\lambda^x}{x!}, \quad x = 0, 1, \ldots,$$
$$EX(X-1) = \lambda^2,$$
and
$$V(X) = \lambda.$$
Proof. Using the Taylor series, we get
$$EX(X-1) := \sum_{x=0}^{\infty} x(x-1)\, e^{-\lambda} \frac{\lambda^x}{x!}
= e^{-\lambda} \sum_{x=2}^{\infty} \frac{\lambda^x}{(x-2)!}
= \lambda^2 e^{-\lambda} \sum_{x=0}^{\infty} \frac{\lambda^x}{x!}
= \lambda^2 e^{-\lambda} e^{\lambda} = \lambda^2.$$
Hence $V(X) = \lambda^2 - \lambda^2 + \lambda = \lambda$.
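Example 24's conclusions $EX = \lambda$ and $V(X) = \lambda$ can be checked numerically by truncating the series that define the moments. A sketch in standard-library Python (the truncation point 60 is our choice; it keeps the tail negligible for $\lambda = 2.5$):

```python
import math

lam = 2.5
def pmf(x):
    return math.exp(-lam) * lam**x / math.factorial(x)

xs = range(60)  # truncated support; the tail mass beyond 60 is negligible here
EX = sum(x * pmf(x) for x in xs)
EX2 = sum(x * x * pmf(x) for x in xs)
var = EX2 - EX**2  # Theorem 20: V(X) = EX^2 - (EX)^2
```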

It is often convenient to use the moment generating function in computing
the mean and the variance. In general, the $r$th moment of $X$ is defined as
$$EX^r, \quad r = 1, 2, \ldots.$$
Let $\mu = EX$ be the first moment of $X$. Then the $r$th central moment of $X$
is defined as
$$E(X - \mu)^r, \quad r = 1, 2, \ldots.$$
We now define the moment generating function of a discrete random
variable $X$. Notice that for each $t$, $e^{tX}$ is also a random variable.

Definition 23 Suppose that the discrete random variable $X$ has the probability mass function $f$. Then the moment generating function of $X$ is defined
as
$$M_X(t) := Ee^{tX} = \sum_x e^{tx} f(x), \quad t \in \mathbb{R}.$$
If the series $\sum_x e^{tx} f(x)$ converges then the moment generating function $M_X$
is defined.

Given a random variable $X$, one can generate the moments using the
moment generating function. Using the Taylor series we see that
$$e^z = \sum_{n=0}^{\infty} \frac{z^n}{n!} = 1 + \frac{z}{1!} + \frac{z^2}{2!} + \frac{z^3}{3!} + \cdots.$$
Now,
$$M_X(t) = Ee^{tX} = \sum_x e^{tx} f(x)
= \sum_x \left(1 + \frac{xt}{1!} + \frac{(xt)^2}{2!} + \frac{(xt)^3}{3!} + \cdots\right) f(x)
= \sum_x f(x) + t \sum_x x f(x) + \frac{t^2}{2!} \sum_x x^2 f(x) + \cdots
= 1 + t\,EX + \frac{t^2}{2!} EX^2 + \cdots.$$
By differentiating with respect to $t$ we get
$$M_X'(t) = EX + t\,EX^2 + \frac{t^2}{2!} EX^3 + \cdots$$
$$M_X''(t) = EX^2 + t\,EX^3 + \cdots$$
$$\vdots$$
$$M_X^{(r)}(t) = EX^r + t\,EX^{r+1} + \cdots$$
Letting $t = 0$ we get the formula
$$M_X^{(r)}(0) = EX^r, \quad r = 1, 2, \ldots.$$
Example 25 (Binomial random variable). If $X \sim B(n, p)$, then the moment generating function of $X$ is
$$M_X(t) = Ee^{tX}
= \sum_{x=0}^{n} e^{tx} \binom{n}{x} p^x (1-p)^{n-x}
= \sum_{x=0}^{n} \binom{n}{x} (pe^t)^x (1-p)^{n-x}
= (1 - p + pe^t)^n.$$
Now, since
$$M_X'(t) = n(pe^t + 1 - p)^{n-1} pe^t,$$
$$M_X''(t) = n(n-1)(pe^t + 1 - p)^{n-2} (pe^t)^2 + n(pe^t + 1 - p)^{n-1} pe^t,$$
we get
$$EX = M_X'(0) = np, \qquad EX^2 = M_X''(0) = n(n-1)p^2 + np,$$
and
$$V(X) = EX^2 - [EX]^2 = n^2p^2 - np^2 + np - (np)^2 = np(1-p).$$
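The closed form $M_X(t) = (1 - p + pe^t)^n$ and the relation $M_X'(0) = EX$ can both be checked numerically. The sketch below (standard-library Python; parameters are ours) compares the closed form with the defining sum and estimates the derivative at 0 by a central difference:

```python
import math

n, p = 8, 0.4

def mgf_sum(t):
    # Defining series: sum of e^{tx} f(x) over the support of B(n, p)
    return sum(math.exp(t * x) * math.comb(n, x) * p**x * (1 - p)**(n - x)
               for x in range(n + 1))

def mgf_closed(t):
    return (1 - p + p * math.exp(t))**n

gap = abs(mgf_sum(0.3) - mgf_closed(0.3))
h = 1e-5
deriv_at_0 = (mgf_closed(h) - mgf_closed(-h)) / (2 * h)  # approximates EX = np = 3.2
```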
Example 26 (Geometric random variable). If $X \sim G(p)$ then the moment
generating function of $X$ is
$$M_X(t) = Ee^{tX} = \sum_{x=0}^{\infty} e^{tx} q^x p = p \sum_{x=0}^{\infty} (qe^t)^x = \frac{p}{1 - qe^t}, \quad qe^t < 1,$$
and
$$EX = M_X'(t)\big|_{t=0} = \frac{pqe^t}{(1 - qe^t)^2}\bigg|_{t=0} = \frac{q}{p},$$
$$V(X) = M_X''(t)\big|_{t=0} - \left(\frac{q}{p}\right)^2 = \frac{q}{p^2}.$$

Example 27 (Poisson random variable). If $X \sim \mathcal{P}(\lambda)$, $\lambda > 0$, then the
moment generating function of $X$ is
$$M_X(t) = Ee^{tX} = \sum_{x=0}^{\infty} e^{tx} e^{-\lambda} \frac{\lambda^x}{x!} = e^{-\lambda} \sum_{x=0}^{\infty} \frac{(\lambda e^t)^x}{x!} = \exp[\lambda(e^t - 1)],$$
and
$$EX = M_X'(t)\big|_{t=0} = \lambda, \qquad V(X) = M_X''(t)\big|_{t=0} - \lambda^2 = \lambda.$$

2.5.3 Exercises

1. Find the limit of the sequence
   $$\sum_{k=1}^{n} \left(\frac{k}{n}\right)^2 \frac{1}{n}.$$
2. Find the limit of the sequence
   $$\sum_{k=1}^{n} \cos\left(\frac{k}{n}\right) \frac{1}{n}.$$
3. State the denseness of the rational numbers and give a sequence of rational
   numbers $a_1, a_2, a_3, \ldots$ that converges to an irrational number.
4. Prove the law of exponents
   $$2^{e+\pi} = 2^e\, 2^{\pi}$$
   for real numbers.
5. Given $n$, $z_{\alpha/2}$, $\hat{p}$, solve the quadratic inequality
   $$\left(1 + \frac{z_{\alpha/2}^2}{n}\right) p^2 - \left(2\hat{p} + \frac{z_{\alpha/2}^2}{n}\right) p + \hat{p}^2 < 0$$
   for $p$.
6. We recall the following identity:
   $$\sum_{y=0}^{\infty} \binom{y+r-1}{r-1} (1-p)^y = p^{-r} \quad \text{for } 0 < p < 1.$$
   Consider the function $f$ defined by
   $$f(x) = \binom{x-1}{r-1} p^r (1-p)^{x-r}, \quad x = r, r+1, \ldots.$$
   (a) Prove that there exists a random variable $X$ whose probability
       mass function is $f$.
   (b) Find $EX$.
   (c) Find $V(X)$.
   (d) Find the moment generating function $M_X$.
81

2.6
2.6.1

Absolutely continuous random variable


Distribution function

The cumulative distribution function (CDF), also called the probability distribution
function or just the distribution function, completely describes the probability
distribution of a real-valued random variable $X$. For every real number $x$,
the CDF of $X$ is given by
$$x \mapsto F(x) = P(X \leq x), \quad x \in \mathbb{R}, \tag{2.6}$$
where the right-hand side represents the probability that the random variable
$X$ takes on a value less than or equal to $x$. The probability that $X$ lies in
the interval $(a, b]$ is therefore
$$F(b) - F(a)$$
if $a < b$.
The CDF of $X$ can be defined in terms of the probability density function
as follows:
$$F(x) = \int_{-\infty}^{x} f(t)\,dt \quad \text{for some function } f : \mathbb{R} \to [0, \infty).$$
Note that in the definition above, the "less than or equal" sign, $\leq$, is a convention,
but it is a universally used one, and is important for discrete distributions.
The proper use of tables of the binomial and Poisson distributions depends upon this convention.
Theorem 22 Let $F$ be the cumulative distribution function of a random
variable $X$. Then
1. $F$ is (not necessarily strictly) monotone increasing. That is, if $x < y$
   then $F(x) \leq F(y)$.
2. $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$.
3. $F$ is right-continuous:
   $$\lim_{y \downarrow x} F(y) = F(x).$$
Proof.
1. If $x < y$ then $\{X \leq x\} \subset \{X \leq y\}$. Hence $P(X \leq x) \leq P(X \leq y)$.
2. Observe the following fact:
   $\lim_{x \to \infty} f(x) = \ell$ if and only if $f(x_n) \to \ell$ for every sequence
   $(x_n)$ with $x_n \to \infty$.
   Now, let $y_n \uparrow \infty$. Since $\{X \leq y_n\}$ is increasing, we get
   $$\bigcup_{n=1}^{\infty} \{X \leq y_n\} = \{X < \infty\}.$$
   Hence
   $$\lim_{n \to \infty} P(X \leq y_n) = P(X < \infty) = 1.$$
   Similarly, we get
   $$F(-\infty) := \lim_{y \to -\infty} F(y) = 0.$$
3. Let $y_n \downarrow x$. Since $\{X \leq y_n\} \downarrow \{X \leq x\}$, we get
   $$\bigcap_{n=1}^{\infty} \{X \leq y_n\} = \{X \leq x\}.$$
   Hence
   $$\lim_{y_n \downarrow x} F(y_n) = \lim_{n \to \infty} P(X \leq y_n) = P\left(\bigcap_{n=1}^{\infty} \{X \leq y_n\}\right) = P(X \leq x) = F(x).$$

Every probability question about $X$ can be answered in terms of the distribution
function $F$.
If $X$ is a discrete random variable, then it attains values $x_1, x_2, \ldots$ with
probability $p_i = P(X = x_i)$, and the CDF of $X$ will be discontinuous at the
points $x_i$ and constant in between:
$$F(x) = P(X \leq x) = \sum_{x_i \leq x} P(X = x_i) = \sum_{x_i \leq x} p(x_i).$$
If the CDF $F$ of $X$ is continuous, then $X$ is a continuous random variable; if furthermore $F$ is absolutely
continuous, then there exists a Lebesgue-integrable function $f(x)$ such that
$$F(b) - F(a) = P(a \leq X \leq b) = \int_a^b f(x)\,dx$$
for all real numbers $a$ and $b$. (The first of the two equalities displayed above
would not be correct in general if we had not said that the distribution is
continuous. Continuity of the distribution implies that $P(X = a) = P(X = b) = 0$, so the difference between $<$ and $\leq$ ceases to be important in this
context.) The function $f$ is equal to the derivative of $F$ almost everywhere,
and it is called the probability density function of the distribution of $X$.
Example 28
1. As an example, suppose $X$ is uniformly distributed on
   the unit interval $[0, 1]$. Then the CDF of $X$ is given by
   $$F(x) = \begin{cases} 0 & x < 0 \\ x & 0 \leq x \leq 1 \\ 1 & 1 < x. \end{cases}$$
2. As another example, suppose $X$ takes only the discrete values 0 and
   1, with equal probability. Then the CDF of $X$ is given by
   $$F(x) = \begin{cases} 0 & x < 0 \\ 1/2 & 0 \leq x < 1 \\ 1 & 1 \leq x. \end{cases}$$

2.6.2 Riemann integral

Recall that the distribution function $F$ of an absolutely continuous random
variable $X$ can be written as
$$F(x) = \int_{-\infty}^{x} f(t)\,dt$$
for a real-valued function $f$ defined on $\mathbb{R}$; $f$ is called the probability
density function of $X$.
In this section we review the Riemann integral
$$\int_a^b f(x)\,dx$$
and the infinite integral
$$\int_a^{\infty} f(x)\,dx.$$

Consider a function $f$ defined on the interval $[a, b]$. We partition $[a, b]$
into
$$a = x_0 < x_1 < \cdots < x_{n-1} < x_n = b.$$
Given any $\xi_i$ in the interval $[x_{i-1}, x_i]$, $i = 1, 2, \ldots, n$, we evaluate
$f(\xi_i)$ and form the sum
$$\sum_{i=1}^{n} f(\xi_i)(x_i - x_{i-1}). \tag{2.7}$$
The sum given in (2.7) is called a Riemann sum of $f$ on $[a, b]$. Letting $n \to \infty$
(with the mesh of the partition tending to zero) we can consider the limit
$$\lim_{n \to \infty} \sum_{i=1}^{n} f(\xi_i)(x_i - x_{i-1}). \tag{2.8}$$
The limit in (2.8) may or may not exist. We denote the limit by
$$\int_a^b f(x)\,dx.$$
That is,
$$\int_a^b f(x)\,dx := \lim_{n \to \infty} \sum_{i=1}^{n} f(\xi_i)(x_i - x_{i-1}). \tag{2.9}$$
In particular, if the limit exists, we say that $f$ is Riemann integrable on $[a, b]$.
We have the following on the Riemann integral.

Theorem 23 If $f$ is continuous on $[a, b]$ then it is Riemann integrable.
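Definition (2.9) suggests a direct numerical scheme: evaluate $f$ at a point of each subinterval and sum. A sketch in standard-library Python using midpoints (the function and parameters are illustrative, not from the text):

```python
def riemann_sum(f, a, b, n):
    # Riemann sum over n equal subintervals, taking xi_i as the midpoint
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# Integral of x^2 over [0, 1]; the exact value is 1/3
approx = riemann_sum(lambda x: x * x, 0.0, 1.0, 10_000)
```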

Theorem 24 Suppose that $f$ and $g$ are Riemann integrable on $[a, b]$. Let $c$
be a constant. Then
1. $f + g$ is Riemann integrable and
   $$\int_a^b (f(x) + g(x))\,dx = \int_a^b f(x)\,dx + \int_a^b g(x)\,dx.$$
2. $cf$ is Riemann integrable and
   $$\int_a^b cf(x)\,dx = c \int_a^b f(x)\,dx.$$
3. If $f \leq g$ on $[a, b]$ then
   $$\int_a^b f(x)\,dx \leq \int_a^b g(x)\,dx.$$
   In particular, if $f \geq 0$ on $[a, b]$ then
   $$\int_a^b f(x)\,dx \geq 0.$$

Theorem 25
$$\left|\int_a^b f(x)\,dx\right| \leq \int_a^b |f(x)|\,dx.$$

Theorem 26 If $f$ is Riemann integrable on $[a, b]$ and $a \leq c \leq b$, then
$$\int_a^b f(x)\,dx = \int_a^c f(x)\,dx + \int_c^b f(x)\,dx.$$

If $a$ and $b$ are fixed numbers and $f$ is Riemann integrable on $[a, b]$, then
$$\int_a^b f(x)\,dx$$
is a number. In this case $x$ is a dummy variable, and each of the expressions
$$\int_a^b f(t)\,dt = \int_a^b f(y)\,dy = \int_a^b f(z)\,dz = \int_a^b f(\theta)\,d\theta$$
is equal to the common number $\int_a^b f(x)\,dx$.
On the other hand, if $a$ is a fixed number and $x$ is a running variable, then
the Riemann integral
$$\int_a^x f(t)\,dt$$
is no longer a fixed number but a function of $x$.


Next we define an integral on an infinite interval $[a, \infty)$:
$$\int_a^{\infty} f(x)\,dx.$$
For any $b$ in the infinite interval $[a, \infty)$, the integral
$$\int_a^b f(x)\,dx$$
is well defined. Letting $b \to \infty$, we can consider the limit
$$\lim_{b \to \infty} \int_a^b f(x)\,dx$$
and write it as
$$\int_a^{\infty} f(x)\,dx,$$
without regard to the existence of the limit. We call this integral an infinite
integral. In particular, if the limit exists, we say that the infinite integral
converges, and diverges otherwise. Similarly, we can define the infinite integrals
$$\int_{-\infty}^{b} f(x)\,dx \quad \text{and} \quad \int_{-\infty}^{\infty} f(x)\,dx.$$

2.6.3 Absolutely continuous random variables

Definition 24 A real-valued function $f$ defined on $\mathbb{R}$ is called a probability
density function if it satisfies
1. $f \geq 0$;
2. $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
Hereafter, the integral $\int_{-\infty}^{\infty} f(x)\,dx$ is simply denoted by $\int f(x)\,dx$ if it is
clear in the context.

Definition 25 We say that $X$ has the uniform distribution on $(a, b)$, denoted $X \sim U(a, b)$, if the probability density function is
$$f(x) = \begin{cases} \frac{1}{b-a} & a \leq x \leq b \\ 0 & \text{otherwise.} \end{cases}$$
The gamma function and the gamma distribution

We introduce the gamma function. By integration by parts and induction we get the identity
$$\int_0^{\infty} x^{n-1} e^{-x}\,dx = (n-1)!.$$
Now for $\alpha > 0$ we define the gamma function by
$$\Gamma(\alpha) := \int_0^{\infty} x^{\alpha-1} e^{-x}\,dx.$$
Then we have

Theorem 27
1. $\Gamma(1) = 1$,
2. $\Gamma(\alpha + 1) = \alpha\,\Gamma(\alpha)$, and
3. $\Gamma(n + 1) = n!$, $n = 1, 2, 3, \ldots$.
Proof.
1. Using the fundamental theorem of calculus, we know that
   $$\int_0^b e^{-x}\,dx = \left[-e^{-x}\right]_0^b = -e^{-b} + 1.$$
   Now, we have
   $$\Gamma(1) = \int_0^{\infty} e^{-x}\,dx = \lim_{b \to \infty} \int_0^b e^{-x}\,dx = 1.$$
2. By integration by parts, we see that
   $$\Gamma(\alpha + 1) = \int_0^{\infty} x^{\alpha} e^{-x}\,dx = \left[-x^{\alpha} e^{-x}\right]_0^{\infty} + \alpha \int_0^{\infty} x^{\alpha-1} e^{-x}\,dx = \alpha\,\Gamma(\alpha).$$
3. For $n \in \mathbb{N}$, use induction to see that
   $$\Gamma(n + 1) = n\,\Gamma(n) = \cdots = n!\,\Gamma(1) = n!.$$
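Theorem 27's identities can be checked against the gamma function in Python's standard library, `math.gamma`; a quick sketch:

```python
import math

# Gamma(1) = 1
g1 = math.gamma(1.0)

# Gamma(a + 1) = a * Gamma(a), checked at a non-integer a
a = 2.7
recurrence_gap = abs(math.gamma(a + 1) - a * math.gamma(a))

# Gamma(n + 1) = n!; here n = 5, so the value is 120
g6 = math.gamma(6.0)
```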

Now, for $\lambda > 0$ use the identity
$$\int_0^{\infty} \lambda e^{-\lambda x}\,dx = 1$$
to see that
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & x > 0 \\ 0 & \text{otherwise} \end{cases}$$
is a probability density function.
Use the gamma function to see that
$$f(x) = \begin{cases} \frac{1}{\Gamma(\alpha)} x^{\alpha-1} e^{-x} & x > 0 \\ 0 & \text{otherwise} \end{cases}$$
is also a probability density function.


We introduce the exponential distribution and the gamma distribution.

Definition 26
1. Let $\lambda > 0$. We say that $X$ has an exponential distribution, denoted
   $X \sim E(\lambda)$, if the probability density function of $X$ is
   $$f(x) = \begin{cases} \lambda e^{-\lambda x} & x > 0 \\ 0 & \text{otherwise.} \end{cases}$$
2. Let $\alpha > 0$. We say that $X$ has a gamma distribution, denoted
   $X \sim \Gamma(\alpha, 1)$, if the probability density function of $X$ is
   $$f(x) = \begin{cases} \frac{1}{\Gamma(\alpha)} x^{\alpha-1} e^{-x} & x > 0 \\ 0 & \text{otherwise.} \end{cases}$$
3. Let $\alpha > 0$ and $\beta > 0$. We say that $X$ has a gamma distribution,
   denoted $X \sim \Gamma(\alpha, \beta)$, if the probability density function of $X$ is
   $$f(x) = \begin{cases} \frac{1}{\Gamma(\alpha)\beta^{\alpha}} x^{\alpha-1} e^{-x/\beta} & x > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Remark 5
1. We observe that
   $$E(\lambda) = \Gamma\left(1, \frac{1}{\lambda}\right).$$
2. By changing variables, we see that
   $$f(x) = \begin{cases} \frac{1}{\Gamma(\alpha)\beta^{\alpha}} x^{\alpha-1} e^{-x/\beta} & x > 0 \\ 0 & \text{otherwise} \end{cases}$$
   is a probability density function.
3. If the probability density function of a random variable $X$ is given by
   $$f(x) = \begin{cases} \frac{1}{\Gamma(r/2)\,2^{r/2}}\, x^{r/2 - 1} e^{-x/2} & x > 0 \\ 0 & \text{otherwise,} \end{cases}$$
   then we say that $X$ has a chi-squared distribution, and we write $X \sim \chi^2(r)$.
   In this case, we see that
   $$\chi^2(r) = \Gamma\left(\frac{r}{2}, 2\right).$$
Suppose that $X \sim f(x)$; that is, a continuous random variable $X$ has
the probability density function $f(x)$. Then the expectation of the random
variable $\varphi(X)$ is defined as
$$E\varphi(X) := \int \varphi(x) f(x)\,dx.$$
In particular, the moment generating function of $X$ is defined by
$$M_X(t) = Ee^{tX} := \int e^{tx} f(x)\,dx.$$
We introduce a method of evaluating an integral, the so-called gamma
integration, without using the fundamental theorem of calculus. Suppose for
example that $X \sim \Gamma(\alpha, 1)$. We try to evaluate the following integral:
$$EX = \int_0^{\infty} x\,\frac{1}{\Gamma(\alpha)} x^{\alpha-1} e^{-x}\,dx.$$

Observe that
$$EX = \int_0^{\infty} x\,\frac{1}{\Gamma(\alpha)} x^{\alpha-1} e^{-x}\,dx
= \frac{1}{\Gamma(\alpha)} \int_0^{\infty} x^{\alpha+1-1} e^{-x}\,dx
= \frac{\Gamma(\alpha+1)}{\Gamma(\alpha)} \int_0^{\infty} \frac{1}{\Gamma(\alpha+1)} x^{\alpha+1-1} e^{-x}\,dx.$$
Since
$$\frac{1}{\Gamma(\alpha+1)} x^{\alpha+1-1} e^{-x}$$
is the probability density function of $\Gamma(\alpha+1, 1)$, we see that
$$\int_0^{\infty} \frac{1}{\Gamma(\alpha+1)} x^{\alpha+1-1} e^{-x}\,dx = 1.$$
So we get
$$EX = \frac{\Gamma(\alpha+1)}{\Gamma(\alpha)} = \alpha.$$

By gamma integrations, we get

Theorem 28 If $X \sim \Gamma(\alpha, \beta)$ then
1. $EX = \alpha\beta$,
2. $Var(X) = \alpha\beta^2$,
3. $M_X(t) = (1 - \beta t)^{-\alpha}$, $1 - \beta t > 0$.

Remark 6 We observe that
1. $X \sim \Gamma(\alpha, \beta) \iff M_X(t) = (1 - \beta t)^{-\alpha}$, $1 - \beta t > 0$;
2. $X \sim E(\lambda) \iff M_X(t) = \left(1 - \frac{t}{\lambda}\right)^{-1}$, $1 - \frac{t}{\lambda} > 0$;
3. $X \sim \chi^2(r) \iff M_X(t) = (1 - 2t)^{-r/2}$, $1 - 2t > 0$;
4. $X \sim \Gamma(\alpha, \beta) \iff X \sim \beta Z$, $Z \sim \Gamma(\alpha, 1)$.
By using the moment generating function we can prove the following

Theorem 29 Suppose $X_1, \ldots, X_n$ are independent random variables.
1. If $X_i \sim \Gamma(\alpha_i, \beta)$ $(i = 1, 2, \ldots, n)$ then $Y = \sum_{i=1}^n X_i \sim \Gamma\left(\sum_{i=1}^n \alpha_i, \beta\right)$.
2. If $X_i \sim E(\lambda) = \Gamma(1, \frac{1}{\lambda})$ $(i = 1, 2, \ldots, n)$ then $Y = \sum_{i=1}^n X_i \sim \Gamma(n, \frac{1}{\lambda})$.
3. If $X_i \sim \chi^2(r_i) = \Gamma(\frac{r_i}{2}, 2)$ $(i = 1, 2, \ldots, n)$ then $Y = \sum_{i=1}^n X_i \sim \chi^2\left(\sum_{i=1}^n r_i\right)$.
4. $X \sim \Gamma(\alpha, \beta) \iff X \sim \beta Z$, $Z \sim \Gamma(\alpha, 1)$.
5. If $X \sim \Gamma(\alpha, \beta)$ then $cX \sim c\beta Z$, $Z \sim \Gamma(\alpha, 1)$. So $cX \sim \Gamma(\alpha, c\beta)$.
6. If $X \sim \Gamma(\alpha, \beta)$ then $\frac{2}{\beta} X \sim \Gamma(\alpha, 2) = \chi^2(2\alpha)$.
7. If $X \sim E(\lambda)$ then $2\lambda X \sim \chi^2(2)$.
Example 29 Let $N_t$ denote the number of occurrences of an event during the
period $(0, t)$ for $t > 0$. Suppose that $N_t \sim \mathcal{P}(\lambda t)$ $(\lambda > 0)$.
1. If $W$ is the waiting time until the first event occurs, then $W \sim E(\lambda) = \Gamma(1, \frac{1}{\lambda})$.
2. If $W_k$ is the waiting time until the $k$th event occurs, then $W_k \sim \Gamma(k, \frac{1}{\lambda})$.
Proof.
1. Since $N_t \sim \mathcal{P}(\lambda t)$, we see that
   $$P(W > t) = P(N_t = 0) = \begin{cases} e^{-\lambda t} & t > 0 \\ 1 & \text{otherwise.} \end{cases}$$
   So
   $$\frac{d}{dt} P(W \leq t) = \frac{d}{dt}\left(1 - e^{-\lambda t}\right) = \begin{cases} \lambda e^{-\lambda t} & t > 0 \\ 0 & \text{otherwise.} \end{cases}$$
   So $W \sim E(\lambda)$.
2. Since $N_t \sim \mathcal{P}(\lambda t)$, we see that
   $$P(W_k > t) = P(N_t \leq k-1) = \sum_{x=0}^{k-1} \frac{(\lambda t)^x e^{-\lambda t}}{x!}.$$
   So
   $$\frac{d}{dt} P(W_k \leq t)
   = -\frac{d}{dt} \sum_{x=0}^{k-1} \frac{(\lambda t)^x e^{-\lambda t}}{x!}
   = -\sum_{x=0}^{k-1} \frac{\lambda^x}{x!}\left[x t^{x-1} e^{-\lambda t} - \lambda t^x e^{-\lambda t}\right]
   = -\sum_{x=1}^{k-1} \frac{\lambda^x t^{x-1} e^{-\lambda t}}{(x-1)!} + \sum_{x=0}^{k-1} \frac{\lambda^{x+1} t^x e^{-\lambda t}}{x!}
   = \frac{\lambda^k t^{k-1} e^{-\lambda t}}{(k-1)!}
   = \frac{\lambda^k}{\Gamma(k)}\, t^{k-1} e^{-\lambda t}.$$
   So $W_k \sim \Gamma(k, \frac{1}{\lambda})$.

The normal distribution

We have considered the following

Definition 27 Let $\mu \in \mathbb{R}$, $\sigma^2 \in (0, \infty)$. Then the random variable $X$ is
said to have a normal distribution, denoted $X \sim N(\mu, \sigma^2)$, if $X$ has the probability density function
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}, \quad x \in \mathbb{R}.$$
We prove that $f$ is indeed a probability density function.
Proof.
1. It is clear that $f(x) > 0$ for all $x \in \mathbb{R}$.
2. Let
   $$I = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2}\,dz.$$
   Then $I$ is a positive constant and we get
   $$I^2 = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2}\,dz \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}w^2}\,dw
   = \frac{1}{2\pi} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-\frac{1}{2}(z^2+w^2)}\,dz\,dw.$$
   Now let $z = r\cos\theta$ and $w = r\sin\theta$. Then
   $$I^2 = \frac{1}{2\pi} \int_0^{2\pi}\int_0^{\infty} e^{-\frac{1}{2}r^2}\, r\,dr\,d\theta
   = \frac{1}{2\pi} \int_0^{2\pi} d\theta \int_0^{\infty} e^{-\frac{1}{2}r^2}\, r\,dr
   = \int_0^{\infty} e^{-t}\,dt = 1.$$
   Since $I > 0$, this gives $I = 1$. Hence
   $$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2}\,dz = 1,$$
   and if we let $z = \frac{x-\mu}{\sigma}$ then we get
   $$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx = 1.$$
Theorem 30 $X \sim N(\mu, \sigma^2) \iff Z = \frac{X-\mu}{\sigma} \sim N(0, 1)$; that is, $X = \sigma Z + \mu$ with
$Z \sim N(0, 1)$.
Proof. ($\Rightarrow$) Let $Z = \frac{X-\mu}{\sigma}$. Then
$$P(Z \leq z) = P\left(\frac{X-\mu}{\sigma} \leq z\right)
= P(X \leq \sigma z + \mu)
= \int_{-\infty}^{\sigma z + \mu} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx
= \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}t^2}\,dt.$$
Hence $Z$ has the probability density function
$$f_Z(z) = \frac{\partial}{\partial z} P(Z \leq z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}.$$
That is, $Z \sim N(0, 1)$.
($\Leftarrow$) Conversely, let $Z \sim N(0, 1)$ and let $X = \sigma Z + \mu$. Then
$$P(X \leq x) = P(\sigma Z + \mu \leq x)
= P\left(Z \leq \frac{x-\mu}{\sigma}\right)
= \int_{-\infty}^{\frac{x-\mu}{\sigma}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\,dz
= \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(t-\mu)^2}{2\sigma^2}}\,dt,$$
and
$$f_X(x) = \frac{\partial}{\partial x} P(X \leq x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$
Hence $X \sim N(\mu, \sigma^2)$.
Theorem 31 Suppose $X \sim N(\mu, \sigma^2)$, i.e.
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}, \quad x \in \mathbb{R},$$
so that $X = \sigma Z + \mu$, or $Z = \frac{X-\mu}{\sigma} \sim N(0, 1)$. Then
$$M_X(t) = \exp\left(\mu t + \frac{1}{2}\sigma^2 t^2\right).$$
Proof. If $Z \sim N(0, 1)$ then the moment generating function of $Z$ is
$$M_Z(t) = Ee^{tZ}
= \int_{-\infty}^{\infty} e^{tz} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\,dz
= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2 + tz}\,dz
= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2}(z-t)^2 + \frac{1}{2}t^2\right\} dz
= \exp\left\{\frac{1}{2}t^2\right\} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2}(z-t)^2\right\} dz
= \exp\left\{\frac{1}{2}t^2\right\},$$
since we know that
$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2}(z-t)^2\right\} dz = 1.$$
Now, observe that $X \sim N(\mu, \sigma^2) \iff X = \sigma Z + \mu$, $Z \sim N(0, 1)$. Hence
we get
$$M_X(t) = Ee^{tX} = E\exp\{(\sigma Z + \mu)t\}
= E\exp\{\sigma t Z\} \exp\{\mu t\}
= \exp\{\mu t\}\, M_Z(\sigma t)
= \exp\{\mu t\} \exp\left\{\frac{1}{2}(\sigma t)^2\right\}
= \exp\left\{\mu t + \frac{1}{2}\sigma^2 t^2\right\}.$$
Theorem 32 Suppose $X \sim N(\mu, \sigma^2)$. Then
1. $EX = \mu$,
2. $Var(X) = \sigma^2$,
3. for $Z \sim N(0, 1)$,
   $$EZ^m = \begin{cases} \frac{(2k)!}{k!\,2^k} & m = 2k,\ k = 1, 2, \ldots \\ 0 & m = 2k+1. \end{cases}$$
Proof.
$X \sim N(\mu, \sigma^2) \iff X = \sigma Z + \mu$, $Z \sim N(0, 1)$.
Then
$$EX = E(\sigma Z + \mu) = \sigma E(Z) + \mu, \qquad V(X) = V(\sigma Z + \mu) = \sigma^2 V(Z).$$
On the other hand,
$$EZ = \int_{-\infty}^{\infty} z\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\,dz
= \frac{1}{\sqrt{2\pi}} \left[\int_{-\infty}^{0} z\, e^{-\frac{1}{2}z^2}\,dz + \int_0^{\infty} z\, e^{-\frac{1}{2}z^2}\,dz\right]
= \frac{1}{\sqrt{2\pi}} \left[-\int_0^{\infty} e^{-w}\,dw + \int_0^{\infty} e^{-w}\,dw\right] = 0,$$
and
$$V(Z) = \int_{-\infty}^{\infty} z^2\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\,dz
= \frac{2}{\sqrt{2\pi}} \int_0^{\infty} z^2\, e^{-\frac{1}{2}z^2}\,dz
= \frac{2}{\sqrt{2\pi}} \int_0^{\infty} z\, d\left(-e^{-\frac{1}{2}z^2}\right)
= \frac{2}{\sqrt{2\pi}} \left(\left[-z\, e^{-\frac{1}{2}z^2}\right]_0^{\infty} + \int_0^{\infty} e^{-\frac{1}{2}z^2}\,dz\right)
= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\,dz = 1.$$
Hence $EX = \mu$ and $V(X) = \sigma^2$.
For the moments, if $Z \sim N(0, 1)$ then
$$M_Z(t) = \exp\left\{\frac{1}{2}t^2\right\}
= \sum_{k=0}^{\infty} \frac{1}{k!}\left(\frac{1}{2}t^2\right)^k
= \sum_{k=0}^{\infty} \frac{1}{k!\,2^k}\, t^{2k}
= \sum_{k=0}^{\infty} \frac{(2k)!}{k!\,2^k} \cdot \frac{t^{2k}}{(2k)!}.$$
Comparing coefficients with $M_Z(t) = \sum_m EZ^m\, t^m/m!$ gives
$$EZ^m = \begin{cases} \frac{(2k)!}{k!\,2^k} & m = 2k,\ k = 1, 2, \ldots \\ 0 & m = 2k+1. \end{cases}$$

Remark 7 Suppose that $Z \sim N(0, 1)$. Then one can evaluate the probability
$$\Phi(z) = P(Z \leq z)$$
and form a normal probability table.

Example 30 If $X \sim N(\mu, \sigma^2)$, then
$$P(a < X < b) = P\left(\frac{a - \mu}{\sigma} < \frac{X - \mu}{\sigma} < \frac{b - \mu}{\sigma}\right)
= \Phi\left(\frac{b - \mu}{\sigma}\right) - \Phi\left(\frac{a - \mu}{\sigma}\right).$$

Example 31 The graph of the probability density function of $N(\mu, \sigma^2)$
1. is symmetric with respect to $\mu$,
2. has its maximum value at $\mu$, and
3. is bell shaped.
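The probabilities in Example 30 can be computed without a table: $\Phi$ is expressible through the error function, available in Python's standard library as `math.erf`. A sketch computing $P(\mu - \sigma < X < \mu + \sigma)$ (the parameter values are ours):

```python
import math

def Phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 10.0, 2.0
# P(mu - sigma < X < mu + sigma) = Phi(1) - Phi(-1), roughly 0.6827
p = Phi((12.0 - mu) / sigma) - Phi((8.0 - mu) / sigma)
```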
One can use the moment generating function of the normal distribution
to get the following results.

Theorem 33
1. If $X \sim N(\mu, \sigma^2)$, then $aX + b \sim N(a\mu + b, a^2\sigma^2)$.
2. If $X_1, \ldots, X_n$ are independent and $X_i \sim N(\mu_i, \sigma_i^2)$ for $i = 1, 2, \ldots, n$,
   then $Y = a_1 X_1 + \cdots + a_n X_n \sim N(a_1\mu_1 + \cdots + a_n\mu_n,\ a_1^2\sigma_1^2 + \cdots + a_n^2\sigma_n^2)$.
Remark 8 Using the gamma function and a property of the normal distribution we can get the following identity:
$$\Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}.$$
Indeed, substituting $t = \frac{1}{2}z^2$ (so $dt = z\,dz$),
$$\Gamma\left(\frac{1}{2}\right) = \int_0^{\infty} t^{\frac{1}{2}-1} e^{-t}\,dt
= \int_0^{\infty} \frac{\sqrt{2}}{z}\, e^{-\frac{1}{2}z^2}\, z\,dz
= \sqrt{2} \int_0^{\infty} e^{-\frac{1}{2}z^2}\,dz
= \frac{\sqrt{2}}{2} \int_{-\infty}^{\infty} e^{-\frac{1}{2}z^2}\,dz
= \frac{\sqrt{2}}{2}\,\sqrt{2\pi}
= \sqrt{\pi}.$$
Now using the above identity we have the following
Theorem 34 If $Z \sim N(0, 1)$, then $Z^2 \sim \chi^2(1)$.
Proof. Let $\Phi$ be the distribution function of $Z \sim N(0, 1)$, and let $Y = Z^2$.
Then $Y$ has the distribution function
$$P(Y \leq y) = P(Z^2 \leq y) = P(-\sqrt{y} \leq Z \leq \sqrt{y}) = \Phi(\sqrt{y}) - \Phi(-\sqrt{y}) = 2\Phi(\sqrt{y}) - 1.$$
Hence $Y$ has the probability density function
$$g(y) = \frac{\partial}{\partial y} P(Y \leq y)
= \frac{\partial}{\partial y}\left[2\Phi(\sqrt{y}) - 1\right]
= \frac{1}{\sqrt{y}}\, \Phi'(\sqrt{y})
= \frac{1}{\sqrt{y}}\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y}
= \frac{1}{\sqrt{2\pi}}\, y^{-\frac{1}{2}} e^{-\frac{1}{2}y}.$$
Now using $\Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}$, we see that the probability density function of $Y$ is
$$g(y) = \frac{1}{\sqrt{2\pi}}\, y^{-\frac{1}{2}} e^{-\frac{1}{2}y}
= \frac{1}{\Gamma(\frac{1}{2})\, 2^{\frac{1}{2}}}\, y^{\frac{1}{2}-1} e^{-\frac{1}{2}y}, \quad y > 0.$$
That is, $Y = Z^2 \sim \Gamma\left(\frac{1}{2}, 2\right) = \chi^2(1)$.

Corollary 1 If $Z_1, \ldots, Z_n$ are independent and identically distributed $N(0, 1)$,
then
$$Z_1^2 + \cdots + Z_n^2 \sim \chi^2(n).$$
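Corollary 1 can be illustrated by simulation: sums of squared standard normals should have mean $n$, the mean of $\chi^2(n)$ (Theorem 28 with $\alpha = n/2$, $\beta = 2$). A sketch in standard-library Python with a fixed seed for reproducibility:

```python
import random

random.seed(1)
n, reps = 5, 20_000
# Each total is Z_1^2 + ... + Z_n^2 for independent standard normals
totals = [sum(random.gauss(0.0, 1.0)**2 for _ in range(n)) for _ in range(reps)]
mean = sum(totals) / reps  # should be close to n = 5
```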

2.6.4 Exercises

1. Consider the following distribution function $F$:
   $$F(x) = \begin{cases} 0 & x < 0 \\ 1/2 & 0 \leq x < 1 \\ 2/3 & 1 \leq x < 2 \\ 11/12 & 2 \leq x < 3 \\ 1 & 3 \leq x. \end{cases}$$
   Then,
   (a) Find $P(X < 3)$.
   (b) Find $P(X = 1)$.
   (c) Find $P(X > \frac{1}{2})$.
   (d) Find $P(2 < X \leq 4)$.
2. Consider a distribution function $F$ of a general random variable.
   (a) If the distribution function $F$ is continuous and strictly increasing,
       then $F^{-1}(y)$, $y \in [0, 1]$, is the unique real number $x$ such that
       $F(x) = y$.
   (b) Unfortunately, the distribution function does not, in general, have an inverse. One may define, for $y \in [0, 1]$,
       $$F^{-1}(y) = \inf_{x \in \mathbb{R}} \{x : F(x) \geq y\}.$$
   (c) The median is $F^{-1}(0.5)$.
   (d) Put $\theta = F^{-1}(0.95)$. Then we call $\theta$ the 95th percentile.
   (e) The inverse of the distribution function is called the quantile function.
3. The inverse of the distribution function can be used to translate results
   obtained for the uniform distribution to other distributions. Some useful properties of the inverse distribution function are:
   (a) $F^{-1}$ is nondecreasing;
   (b) $F^{-1}(F(x)) \leq x$;
   (c) $F(F^{-1}(y)) \geq y$;
   (d) $F^{-1}(y) \leq x$ if and only if $y \leq F(x)$;
   (e) if $Y$ has a $U[0, 1]$ distribution then $F^{-1}(Y)$ is distributed as $F$;
   (f) if $\{X_\alpha\}$ is a collection of independent $F$-distributed random variables defined on the same sample space, then there exist random
       variables $Y_\alpha$ such that $Y_\alpha$ is distributed as $U[0, 1]$ and
       $$F^{-1}(Y_\alpha) = X_\alpha$$
       with probability 1 for all $\alpha$.
4. It is known that there is a one-to-one correspondence between a random
   variable $X$ and a distribution function $F$. That is, one can prove the
   following:
   (a) Given a random variable $X$, define the distribution function
       $$F_X(x) = P(X \leq x).$$
       Then $F_X$ is monotone, $F_X(\infty) := \lim_{y \to \infty} F_X(y) = 1$ and
       $F_X(-\infty) := \lim_{y \to -\infty} F_X(y) = 0$, and $F_X$ is right continuous. See Theorem 22.
   (b) Conversely, given a distribution function $F$ satisfying
       i. $F$ is monotone,
       ii. $\lim_{y \to \infty} F(y) = 1$ and $\lim_{y \to -\infty} F(y) = 0$, and
       iii. $F$ is right continuous,
       one can define a probability space $(\Omega, P)$ and a random variable $X$ whose distribution function $F_X$ is equal to $F$. This will be
       proved in an advanced probability theory course.
5. Evaluate $\int x^2\,dx$.
6. Evaluate $\int \cos x\,dx$.

7. Prove that if $X \sim \Gamma(\alpha, \beta)$ then $EX = \alpha\beta$.
8. Prove that if $X \sim \Gamma(\alpha, \beta)$ then $V(X) = \alpha\beta^2$.
9. Prove that if $X \sim \Gamma(\alpha, \beta)$ then
   $$M_X(t) = (1 - \beta t)^{-\alpha}, \quad 1 - \beta t > 0.$$
10. We now introduce the beta distribution.
    (a) Evaluate $\int_0^1 x(1-x)\,dx$.
    (b) Evaluate $\int_0^1 x^{99}(1-x)\,dx$.
    (c) Evaluate $\int_0^1 x(1-x)^{99}\,dx$.
    (d) For $a > 0$, $b > 0$, define the beta function
        $$B(a, b) := \int_0^1 x^{a-1}(1-x)^{b-1}\,dx.$$
    (e) Prove that
        $$B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.$$
    (f) For $a > 0$, $b > 0$, prove that the function $f$ defined by
        $$f(x) = \begin{cases} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, x^{a-1}(1-x)^{b-1} & 0 < x < 1 \\ 0 & \text{otherwise} \end{cases}$$
        is a probability density function of a random variable. The random
        variable $X$ is said to have a beta distribution, denoted $X \sim \text{Beta}(a, b)$.
    (g) Find $EX$ for $X \sim \text{Beta}(a, b)$.
    (h) Find $V(X)$ for $X \sim \text{Beta}(a, b)$.
11. Find
    $$\arg\min_b E(X - b)^2,$$
    where $X$ is a continuous random variable.
12. Find
    $$\arg\min_b E|X - b|,$$
    where $X$ is a continuous random variable.

2.7 Bivariate random variables

It is natural that two or more random variables are involved when we try to
apply a probabilistic model to a real-life phenomenon. We study
the properties that arise when we model using two random variables.

2.7.1 Marginal and conditional distributions

Consider the joint distribution function of two random variables $X$ and
$Y$ defined by
$$F_{X,Y}(x, y) = P(X \leq x, Y \leq y), \quad (x, y) \in \mathbb{R}^2.$$
We suppose that $X$ and $Y$ are absolutely continuous random variables in
the sense that there exists a function $f : \mathbb{R}^2 \to [0, \infty)$ such that
$$F_{X,Y}(x, y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f(s, t)\,dt\,ds.$$
The function $f(x, y)$ is called the joint probability density function of $X$
and $Y$. In this case we write $(X, Y) \sim f(x, y)$.
Example 32 Determine the constant c so that $f(x,y) = c\,1_{(0<x<y<1)}$ is the joint probability density function of the random variables X and Y.

Solution. Notice that the function f is nonnegative. We determine c so that
$$\int\!\!\int f(x,y)\,dx\,dy = 1.$$
Now
$$\int\!\!\int f(x,y)\,dx\,dy = c\int\!\!\int 1_{(0<x<y<1)}\,dx\,dy = c\int_0^1\!\!\int_0^y dx\,dy = c\int_0^1 y\,dy = \frac{c}{2}.$$
Therefore c = 2.
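As a sanity check (an addition, not from the text), one can confirm numerically that with c = 2 the density integrates to 1; a midpoint Riemann sum in plain Python:

```python
# Midpoint Riemann sum of f(x, y) = 2 * 1_{(0 < x < y < 1)} over the unit square.
n = 400
h = 1.0 / n
total = 0.0
for i in range(n):
    x = (i + 0.5) * h
    for j in range(n):
        y = (j + 0.5) * h
        if x < y:            # the indicator 1_{(0 < x < y < 1)}
            total += 2.0 * h * h
print(total)   # close to 1 (up to the grid resolution)
```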

Example 33 Find the constant c so that $f(x,y) = c\,e^{-(x^2+y^2)/2}$, $(x,y) \in \mathbb{R}^2$, is the joint probability density function of the random variables X and Y.

Solution. Notice that the function f is nonnegative. We determine c so that
$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,dx\,dy = 1.$$
Here the region of integration is the whole plane. Now
$$\int\!\!\int f(x,y)\,dx\,dy = c\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2+y^2)/2}\,dx\,dy = c\int_{-\infty}^{\infty} e^{-x^2/2}\,dx \int_{-\infty}^{\infty} e^{-y^2/2}\,dy.$$
Using the identity
$$\int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \sqrt{2\pi},$$
we see that
$$\int\!\!\int f(x,y)\,dx\,dy = c\cdot 2\pi.$$
So $c = \dfrac{1}{2\pi}$.

Remark 9 We say that (X, Y) has a standard bivariate normal distribution if the joint probability density function of X and Y is
$$f(x,y) = \frac{1}{2\pi}e^{-(x^2+y^2)/2}, \quad (x,y) \in \mathbb{R}^2.$$

Theorem 35 Suppose that $(X,Y) \sim f(x,y)$. Define
$$f_X(x) = \int_{-\infty}^{\infty} f(x,y)\,dy.$$
Then $f_X$ is the probability density function of X.

Proof. Clearly, $f_X \ge 0$. Moreover,
$$P(X \le x) = P(X \le x, -\infty < Y < \infty) = \int_{-\infty}^{x}\left[\int_{-\infty}^{\infty} f(s,t)\,dt\right]ds = \int_{-\infty}^{x} f_X(s)\,ds.$$
Therefore
$$\frac{d}{dx}P(X \le x) = f_X(x).$$

Definition 28 Suppose that $(X,Y) \sim f(x,y)$. Then the probability density function
$$f_X(x) = \int_{-\infty}^{\infty} f(x,y)\,dy$$
is called the marginal probability density function of X. Similarly,
$$f_Y(y) = \int_{-\infty}^{\infty} f(x,y)\,dx$$
is called the marginal probability density function of Y.


Example 34 Suppose that $(X,Y) \sim f(x,y) = 2\cdot 1_{(0<x<y<1)}$. Find $f_X$ and $f_Y$.

Solution.
$$f_X(x) = \int_{-\infty}^{\infty} f(x,y)\,dy = \int_x^1 2\,dy = 2(1-x)1_{(0<x<1)},$$
$$f_Y(y) = \int_{-\infty}^{\infty} f(x,y)\,dx = \int_0^y 2\,dx = 2y\,1_{(0<y<1)}.$$
Example 35 Suppose that $(X,Y) \sim f(x,y)$ where
$$f(x,y) = \frac{1}{2\pi}e^{-(x^2+y^2)/2}, \quad (x,y) \in \mathbb{R}^2.$$
Find the marginal probability density functions of X and Y.

Solution.
$$f_X(x) = \int_{-\infty}^{\infty} f(x,y)\,dy = \int_{-\infty}^{\infty}\frac{1}{2\pi}e^{-(x^2+y^2)/2}\,dy = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-y^2/2}\,dy = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}, \quad x \in \mathbb{R}.$$
That is, $X \sim N(0,1)$. Similarly, $Y \sim N(0,1)$.

Remark 10
$$P(a < X < b) = P(a < X < b, -\infty < Y < \infty) = \int_a^b\left[\int_{-\infty}^{\infty} f(x,y)\,dy\right]dx = \int_a^b f_X(x)\,dx,$$
where $f_X$ is the pdf of X.

2.7.2 Conditional distribution

Theorem 36 Suppose that $(X,Y) \sim f(x,y)$. Define
$$f(y|x) = \frac{f(x,y)}{f_X(x)}, \quad f_X(x) > 0.$$
Then $f(y|x)$ is a probability density function. That is,

1. $f(y|x) \ge 0$ for all y,

2. $\displaystyle\int_{-\infty}^{\infty} f(y|x)\,dy = \int_{-\infty}^{\infty}\frac{f(x,y)}{f_X(x)}\,dy = \frac{\int_{-\infty}^{\infty} f(x,y)\,dy}{f_X(x)} = 1.$

Definition 29 Suppose that $(X,Y) \sim f(x,y)$. Then the probability density function
$$f(y|x) = \frac{f(x,y)}{f_X(x)}, \quad f_X(x) > 0,$$
is called the conditional probability density function of Y given X = x.

Definition 30 Suppose that $(X,Y) \sim f(x,y)$. Then
$$F(y|x) = \int_{-\infty}^{y} f(t|x)\,dt$$
is called the conditional distribution function of Y given X = x. In general,
$$P(Y \in A \mid X = x) := \int_A f(t|x)\,dt.$$

Example 36 Suppose that $(X,Y) \sim f(x,y) = 2\cdot 1_{(0<x<y<1)}$.

1. Find $f(y|x)$.

2. Find $P\left(Y > \frac{1}{2} \mid X = x\right)$.

Solution. We know that
$$f_X(x) = 2(1-x)1_{(0<x<1)}.$$
Therefore, the conditional probability density function of Y given X = x is
$$f(y|x) = \frac{f(x,y)}{f_X(x)} = \frac{1}{1-x}\,1_{(x,1)}(y), \quad 0 < x < 1.$$
Therefore
$$P\left(Y > \tfrac{1}{2} \,\middle|\, X = x\right) = \begin{cases} \displaystyle\int_x^1 \frac{1}{1-x}\,dy = 1, & \tfrac{1}{2} \le x < 1 \\[8pt] \displaystyle\int_{1/2}^1 \frac{1}{1-x}\,dy = \frac{1/2}{1-x}, & 0 < x < \tfrac{1}{2}. \end{cases}$$

Now, we define the conditional expectation.

Definition 31 Suppose that $(X,Y) \sim f(x,y)$ and $u(X,Y)$ is a function of the two random variables X and Y. Then
$$E[u(X,Y) \mid X = x] := \int_{-\infty}^{\infty} u(x,y)f(y|x)\,dy$$
is called the conditional expectation of $u(X,Y)$ given X = x. In this case $E[u(X,Y)|X=x]$ is a function, say $\varphi(x)$, of x. If
$$\varphi(x) = E[u(X,Y)|X=x] = \int_{-\infty}^{\infty} u(x,y)f(y|x)\,dy,$$
then the random variable
$$E[u(X,Y)|X] = \varphi(X)$$
is called the conditional expectation of $u(X,Y)$ given X.
Theorem 37
1. $E(aY + b \mid X = x) = aE(Y|X=x) + b$
2. $Y \ge 0 \Rightarrow E(Y|X=x) \ge 0$
3. $E[E(Y|X)] = E(Y)$
4. $E[g(X)Y \mid X = x] = g(x)E(Y|X=x)$

Proof. Parts 1 and 2 are clear from the definition. To prove 3, notice that
$$E(Y|X=x) = \int_{-\infty}^{\infty} y f(y|x)\,dy.$$
Then
$$E[E(Y|X)] := \int E(Y|X=x)f_X(x)\,dx = \int\left(\int y f(y|x)\,dy\right)f_X(x)\,dx = \int\!\!\int y f(y|x)f_X(x)\,dy\,dx = \int\!\!\int y f(x,y)\,dy\,dx = E(Y).$$
Since
$$E[g(X)Y\mid X=x] := \int g(x)y f(y|x)\,dy = g(x)\int y f(y|x)\,dy = g(x)E[Y|X=x],$$
we get 4.

We now specify the conditional mean and conditional variance.

Definition 32
1.
$$E(Y|X=x) = \int_{-\infty}^{\infty} y f(y|x)\,dy$$
is called the conditional mean of Y given X = x.

2.
$$V(Y|X=x) = \int_{-\infty}^{\infty}\{y - E(Y|X=x)\}^2 f(y|x)\,dy = E[\{Y - E(Y|X)\}^2 \mid X = x]$$
is called the conditional variance of Y given X = x.
Theorem 38
$$V(Y|X=x) = E(Y^2|X=x) - (E(Y|X=x))^2.$$
Proof. Let $g(x) = E(Y|X=x)$. Then
$$V(Y|X=x) := \int (y-g(x))^2 f(y|x)\,dy = \int y^2 f(y|x)\,dy - 2g(x)\int y f(y|x)\,dy + g(x)^2\int f(y|x)\,dy = E(Y^2|X=x) - (g(x))^2.$$
Example 37 Suppose that $(X,Y) \sim f(x,y) = 2\cdot 1_{(0<x<y<1)}$. Find $E[Y|X=x]$ and $V(Y|X=x)$.

Solution. Notice that
$$f(y|x) = \begin{cases} \dfrac{1}{1-x}, & x < y < 1,\ 0 < x < 1 \\ 0, & \text{otherwise.} \end{cases}$$
Therefore
$$E[Y|X=x] = \int y f(y|x)\,dy = \int_x^1 \frac{y}{1-x}\,dy = \frac{1}{1-x}\cdot\frac{1-x^2}{2} = \frac{1+x}{2}, \quad 0 < x < 1,$$
and
$$E[Y^2|X=x] = \int_x^1 \frac{y^2}{1-x}\,dy = \frac{1+x+x^2}{3}.$$
Therefore
$$V(Y|X=x) = \frac{1+x+x^2}{3} - \left(\frac{1+x}{2}\right)^2 = \frac{(1-x)^2}{12}, \quad 0 < x < 1.$$
Remark 11 We can extend the concepts of conditional distribution to more than two random variables. For example, suppose that $f(x_1,\ldots,x_n)$ is the joint probability density function of $(X_1,\ldots,X_n)$. Then we can define a marginal probability density function of $X_1$ as
$$f_1(x_1) = \int\cdots\int f(x_1, x_2, \ldots, x_n)\,dx_2\cdots dx_n$$
and a conditional probability density function of $(X_2,\ldots,X_n)$ given $X_1 = x_1$ as
$$f(x_2,\ldots,x_n \mid X_1 = x_1) = \frac{f(x_1, x_2, \ldots, x_n)}{f_1(x_1)}.$$
Theorem 39
1. $E(Y + Z \mid X = x) = E(Y|X=x) + E(Z|X=x)$
2. $E(Y - h(X))^2 \ge E[(Y - E(Y|X))^2]$ for every function h
3. $V(Y) = E[V(Y|X)] + V[E(Y|X)]$

Proof. 1.
$$E(Y+Z\mid X=x) := \int\!\!\int (y+z)f(y,z|x)\,dy\,dz = \int y\left[\int f(y,z|x)\,dz\right]dy + \int z\left[\int f(y,z|x)\,dy\right]dz = \int y f(y|x)\,dy + \int z f(z|x)\,dz = E[Y|X=x] + E[Z|X=x].$$

2. Let $g(X) = E[Y|X]$. Then
$$E[(Y-h(X))^2] = E\{E[(Y-h(X))^2\mid X]\} = E\{E[(Y-g(X)+g(X)-h(X))^2\mid X]\}$$
$$= E\{E[(Y-g(X))^2\mid X]\} + E\{E[(g(X)-h(X))^2\mid X]\} + 2E\{(g(X)-h(X))\,E[Y-g(X)\mid X]\}$$
$$= E[V(Y|X)] + E[(g(X)-h(X))^2],$$
since $E[Y - g(X)\mid X] = 0$. Therefore
$$(*)\qquad E[(Y-h(X))^2] = E[V(Y|X)] + E[(g(X)-h(X))^2] \ge E[V(Y|X)] = E[(Y-g(X))^2],$$
that is,
$$E(Y-h(X))^2 \ge E[(Y-E(Y|X))^2] \quad\text{for all } h.$$

3. Taking $h(X) = \mu_2 = E(Y)$ in $(*)$ and noting $E[g(X)] = E(Y) = \mu_2$, we get
$$V(Y) = E[(Y-\mu_2)^2] = E[V(Y|X)] + E[(g(X)-\mu_2)^2] = E[V(Y|X)] + V(g(X)) = E[V(Y|X)] + V[E(Y|X)].$$

Example 38 Suppose that $(X,Y) \sim f(x,y) = 2\cdot 1_{(0<x<y<1)}$.

1. Find $V(Y)$.
2. Find $E[V(Y|X)]$.
3. Find $V[E(Y|X)]$.

Solution. We know that
$$f_Y(y) = \int f(x,y)\,dx = 2y\,1_{(0,1)}(y), \qquad f_X(x) = \int f(x,y)\,dy = 2(1-x)1_{(0,1)}(x).$$

1.
$$V(Y) = \int_0^1 y^2\,2y\,dy - \left(\int_0^1 y\,2y\,dy\right)^2 = \frac{1}{2} - \left(\frac{2}{3}\right)^2 = \frac{1}{18}.$$

2. Since $V(Y|X) = \dfrac{(1-X)^2}{12}$,
$$E[V(Y|X)] = \frac{1}{12}E[(1-X)^2] = \frac{1}{12}\int_0^1 2(1-x)^3\,dx = \frac{1}{24}.$$

3. Since $E(Y|X) = \dfrac{1+X}{2}$,
$$V[E(Y|X)] = V\!\left(\frac{1+X}{2}\right) = \frac{1}{4}V(X) = \frac{1}{4}\cdot\frac{1}{18} = \frac{1}{72}.$$

4. Therefore
$$E[V(Y|X)] + V[E(Y|X)] = \frac{1}{24} + \frac{1}{72} = \frac{1}{18} = V(Y).$$
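The three quantities 1/18, 1/24 and 1/72 in Example 38 can be reproduced by simulation (an addition, not from the text), again sampling the triangle density as $(\min(U,V), \max(U,V))$:

```python
import random

# Monte Carlo check of V(Y) = E[V(Y|X)] + V[E(Y|X)] for f(x, y) = 2 on 0 < x < y < 1.
random.seed(2)
N = 300_000
ys, vx_terms, ex_terms = [], [], []
for _ in range(N):
    u, v = random.random(), random.random()
    x, y = min(u, v), max(u, v)          # (X, Y) ~ f
    ys.append(y)
    vx_terms.append((1 - x) ** 2 / 12)   # V(Y|X) = (1 - X)^2 / 12
    ex_terms.append((1 + x) / 2)         # E(Y|X) = (1 + X) / 2

mean = lambda a: sum(a) / len(a)
var = lambda a: mean([t * t for t in a]) - mean(a) ** 2
v_y = var(ys)          # ~ 1/18
e_v = mean(vx_terms)   # ~ 1/24
v_e = var(ex_terms)    # ~ 1/72
print(v_y, e_v, v_e, e_v + v_e)
```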

2.7.3 Covariance and correlation coefficient

Suppose that X has mean $\mu_1 = E(X)$ and variance $\sigma_1^2 = V(X)$, and Y has mean $\mu_2 = E(Y)$ and variance $\sigma_2^2 = V(Y)$.

Definition 33 The covariance of X and Y is defined as
$$\mathrm{Cov}(X,Y) = E[(X-\mu_1)(Y-\mu_2)]$$
and the correlation coefficient of X and Y is defined as
$$\rho := \mathrm{Corr}(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sigma_1\sigma_2}$$
for $\sigma_1 > 0$, $\sigma_2 > 0$.


Remark 12 We have the following Cauchy-Schwarz inequality:
$$(E(XY))^2 \le E(X^2)E(Y^2).$$
Proof. We note that
$$h(t) = E(Y - tX)^2 = E(Y^2) - 2tE(XY) + t^2 E(X^2) \ge 0$$
for all t. Then the discriminant satisfies
$$D/4 = [E(XY)]^2 - E(X^2)E(Y^2) \le 0.$$

Properties 1
1. $\mathrm{Cov}(aX+b, cY+d) = ac\,\mathrm{Cov}(X,Y)$
2. $\mathrm{Cov}(X,Y) = E(XY) - E(X)E(Y)$
3. $-1 \le \rho \le 1$; moreover,
$$\rho = 1 \iff P(Y = aX + b) = 1 \text{ with } a > 0, \qquad \rho = -1 \iff P(Y = aX + b) = 1 \text{ with } a < 0.$$
4. $V(X+Y) = V(X) + 2\,\mathrm{Cov}(X,Y) + V(Y)$

Proof.
1.
$$\mathrm{Cov}(aX+b, cY+d) = E[\{aX+b-E(aX+b)\}\{cY+d-E(cY+d)\}] = E[ac(X-E(X))(Y-E(Y))] = ac\,\mathrm{Cov}(X,Y).$$
2.
$$\mathrm{Cov}(X,Y) = E[(X-\mu_1)(Y-\mu_2)] = E[XY - \mu_2 X - \mu_1 Y + \mu_1\mu_2] = E(XY) - \mu_2 E(X) - \mu_1 E(Y) + \mu_1\mu_2 = E(XY) - \mu_1\mu_2.$$
3. Let
$$h(t) = E\{(Y-\mu_2) - t(X-\mu_1)\}^2 \ge 0.$$
By the Cauchy-Schwarz inequality applied to $X-\mu_1$ and $Y-\mu_2$, we get $|\rho| \le 1$. Now, putting $t = \rho\dfrac{\sigma_2}{\sigma_1}$,
$$h\!\left(\rho\frac{\sigma_2}{\sigma_1}\right) = E(Y-\mu_2)^2 - 2t\,E(X-\mu_1)(Y-\mu_2) + t^2 E(X-\mu_1)^2 = \sigma_2^2 - 2\rho\frac{\sigma_2}{\sigma_1}\,\rho\sigma_1\sigma_2 + \rho^2\frac{\sigma_2^2}{\sigma_1^2}\,\sigma_1^2 = \sigma_2^2(1-\rho^2).$$
Therefore,
$$\rho^2 = 1 \iff E\left[\left\{(Y-\mu_2) - \rho\frac{\sigma_2}{\sigma_1}(X-\mu_1)\right\}^2\right] = 0 \iff P\left[Y - \mu_2 - \rho\frac{\sigma_2}{\sigma_1}(X-\mu_1) = 0\right] = 1.$$

Remark 13 In some cases, we can carry out a correlation analysis between the random variables X and Y by computing a correlation coefficient. In other cases, we need to analyze the relation
$$E(Y|X=x)$$
between the random variables X and Y. The relation $E(Y|X=x)$, viewed as a function of x, is called a regression curve, and analyzing the regression curve is called regression analysis.

Definition 34 The regression curve of Y given X = x is defined as $E(Y|X=x)$. In particular, the line
$$E(Y|X=x) = a + bx$$
is called the regression line of Y on x.
Theorem 40 Consider the regression line $E(Y|X=x) = a + bx$ of Y on x. Then,

1. $E(Y|X=x) = \mu_2 + \rho\dfrac{\sigma_2}{\sigma_1}(x - \mu_1)$

2. $E[V(Y|X)] = \sigma_2^2(1-\rho^2)$. Suppose, further, that $V(Y|x)$ is constant with respect to x. Then
$$V(Y|x) = \sigma_2^2(1-\rho^2) \begin{cases} \approx \sigma_2^2 & \text{if } \rho^2 \approx 0 \\ \approx 0 & \text{if } \rho^2 \approx 1. \end{cases}$$

Proof.
1. Since $E(Y|X=x) = a + bx$ and $E[E(Y|X)] = E(Y)$, we get
$$\mu_2 = a + b\mu_1.$$
Now, notice that
$$\rho\sigma_1\sigma_2 + \mu_1\mu_2 = E(XY) = E(E(XY|X)) = E(XE(Y|X)) = E(X(a+bX)) = E[aX + bX^2] = a\mu_1 + b(\sigma_1^2 + \mu_1^2).$$
Substituting $a = \mu_2 - b\mu_1$ gives
$$\rho\sigma_1\sigma_2 + \mu_1\mu_2 = \mu_1\mu_2 - b\mu_1^2 + b\sigma_1^2 + b\mu_1^2, \quad\text{so}\quad b = \rho\frac{\sigma_2}{\sigma_1}.$$
Therefore
$$a = \mu_2 - \rho\frac{\sigma_2}{\sigma_1}\mu_1, \quad\text{that is,}\quad E(Y|x) = \mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1).$$
2. Since $V(Y) = E[V(Y|X)] + V[E(Y|X)]$, we get
$$E[V(Y|X)] = V(Y) - V[E(Y|X)] = \sigma_2^2 - V(a+bX) = \sigma_2^2 - b^2 V(X) = \sigma_2^2 - \rho^2\frac{\sigma_2^2}{\sigma_1^2}\,\sigma_1^2 = \sigma_2^2(1-\rho^2).$$

We now define the joint moment generating function.

Definition 35
1. $M(t_1, t_2) = E[e^{t_1 X + t_2 Y}]$ is called the joint moment generating function of X and Y if it exists for $|t_1| < h_1$, $|t_2| < h_2$, for some $h_1, h_2 > 0$.
2.
$$M(t_1, \ldots, t_n) = E\left[\exp\left(\sum_{i=1}^n t_i X_i\right)\right] = E(e^{t^T X})$$
is called the joint moment generating function of $X_1, \ldots, X_n$ if it exists for $|t_i| < h_i$ for some $h_i > 0$ ($i = 1, \ldots, n$).

Properties 2 Let $M(t_1, t_2)$ be the mgf of (X, Y).
1. Uniqueness of the mgf: if the joint mgf exists, it determines the joint distribution uniquely.
2. $M(t_1, 0) = E(e^{t_1 X}) = M_X(t_1)$ is the mgf of X, and $M(0, t_2) = E(e^{t_2 Y}) = M_Y(t_2)$ is the mgf of Y.
3. (Joint moments)
$$\frac{\partial^{k+m}}{\partial t_1^k\,\partial t_2^m} M(t_1, t_2)\Big|_{t_1=t_2=0} = E(X^k Y^m).$$

2.7.4 Stochastic Independence

(1) Independence between events.

Definition 36
1. $A_1, A_2$ are independent $\iff P(A_1 \cap A_2) = P(A_1)P(A_2)$ $\iff P(A_1|A_2) = P(A_1)$ when $P(A_2) > 0$.
2. $A_1, A_2, \ldots, A_n$ are independent $\iff$
$$P(A_{i_1} \cap \cdots \cap A_{i_m}) = P(A_{i_1})\cdots P(A_{i_m})$$
for all $1 \le i_1 < \cdots < i_m \le n$.

(2) Independence between two classes of events.

Definition 37 $\{A_1, \ldots, A_m\}$ and $\{B_1, \ldots, B_n\}$ are independent $\iff$
$$P(A_i \cap B_j) = P(A_i)P(B_j), \quad i = 1, \ldots, m,\ j = 1, \ldots, n.$$

(3) Independence between two discrete random variables.

Definition 38 Let X be a discrete random variable with values in $A = \{x_1, \ldots\}$ and let Y be a discrete random variable with values in $B = \{y_1, \ldots\}$. Then X, Y are independent
$\iff \{(X = x_1), (X = x_2), \ldots\}$ and $\{(Y = y_1), (Y = y_2), \ldots\}$ are independent
$\iff P[(X = x_i) \cap (Y = y_j)] = P[X = x_i]P[Y = y_j]$ for all i, j
$\iff f(x,y) = f_X(x)f_Y(y)$ for all x, y.
(4) Independence between two random variables.

Definition 39 Suppose $f(x,y)$ is the joint probability density function of X and Y. Then X and Y are independent $\iff$
$$f(x,y) = f_X(x)f_Y(y) \quad\text{for all } x, y.$$

Theorem 41 Suppose $f(x,y)$ is the joint probability density function of X and Y. Then X and Y are (stochastically) independent $\iff f(x,y) = g(x)h(y)$ for all x, y, for some functions g, h.

Proof. ($\Rightarrow$) Suppose X and Y are independent. Then $f(x,y) = f_X(x)f_Y(y)$ for all x, y by definition, and we may take $g = f_X$, $h = f_Y$.

($\Leftarrow$) Suppose that $f(x,y) = g(x)h(y)$ for all x, y. Then
$$f_X(x) = \int f(x,y)\,dy = g(x)\int h(y)\,dy = c_1 g(x)$$
with $c_1 = \int h(y)\,dy$. Similarly, $f_Y(y) = c_2 h(y)$ with $c_2 = \int g(x)\,dx$. Then
$$f_X(x)f_Y(y) = c_1 c_2\,g(x)h(y) = c_1 c_2\,f(x,y).$$
Since
$$c_1 c_2 = \int g(x)\,dx \int h(y)\,dy = \int\!\!\int g(x)h(y)\,dx\,dy = \int\!\!\int f(x,y)\,dx\,dy = 1,$$
we conclude that $f(x,y) = f_X(x)f_Y(y)$ for all x, y.

Example 39
1. Let $(X,Y) \sim f(x,y)$ where
$$f(x,y) = \begin{cases} e^{-x-y}, & x > 0,\ y > 0 \\ 0, & \text{otherwise.} \end{cases}$$
Then $f(x,y) = e^{-x}1_{(x>0)}\cdot e^{-y}1_{(y>0)}$, so X and Y are independent.

2. Let $(X,Y) \sim f(x,y)$ where
$$f(x,y) = \begin{cases} 2e^{-x-y}, & 0 < x < y < \infty \\ 0, & \text{otherwise.} \end{cases}$$
Here $f(x,y) = 2e^{-x}e^{-y}1_{(0<x<y<\infty)}$, and the indicator does not factor into a function of x times a function of y. So X and Y are not independent.

Theorem 42 Suppose that X and Y are independent. Then
$$E[u(X)v(Y)] = E[u(X)]E[v(Y)],$$
provided the expectations exist.

Proof. Using independence and the Fubini theorem, we see that
$$E[u(X)v(Y)] := \int\!\!\int u(x)v(y)f(x,y)\,dx\,dy = \int\!\!\int u(x)v(y)f_X(x)f_Y(y)\,dx\,dy = \int v(y)f_Y(y)\left[\int u(x)f_X(x)\,dx\right]dy = E[u(X)]\int v(y)f_Y(y)\,dy = E[u(X)]E[v(Y)].$$

By taking $u(x) = 1_{(a,b)}(x)$, $v(y) = 1_{(c,d)}(y)$, we get

Theorem 43 Suppose that X and Y are independent. Then
$$P(a < X < b,\ c < Y < d) = P(a < X < b)\,P(c < Y < d).$$
Remark 14
1. Suppose that X and Y are independent. Then $\mathrm{Cov}(X,Y) = 0$; that is, $\mathrm{Corr}(X,Y) = 0$.
2. In general, $\mathrm{Corr}(X,Y) = 0$ does not imply the independence of X and Y.

Proof.
1.
$$\mathrm{Cov}(X,Y) = E(XY) - E(X)E(Y) = E(X)E(Y) - E(X)E(Y) = 0.$$
2. Let $X \sim U(-1,1)$ and $Y = X^2$. Then X and Y are dependent, yet $\mathrm{Cov}(X,Y) = 0$, because
$$\mathrm{Cov}(X,Y) = E(XY) - E(X)E(Y) = E(X^3) - 0\cdot E(Y) = 0.$$
On the other hand,
$$P\left[-\tfrac{1}{2} < X < \tfrac{1}{2},\ X^2 > \tfrac{1}{4}\right] = 0 \ne \tfrac{1}{2}\cdot\tfrac{1}{2} = P\left[-\tfrac{1}{2} < X < \tfrac{1}{2}\right]P\left[X^2 > \tfrac{1}{4}\right].$$
Therefore X and Y are not independent.
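The counterexample in Remark 14.2 is easy to see numerically (an addition, not from the text):

```python
import random

# X ~ U(-1, 1) and Y = X^2 are uncorrelated yet clearly dependent.
random.seed(3)
N = 400_000
xs = [random.uniform(-1, 1) for _ in range(N)]
ys = [x * x for x in xs]

mean = lambda a: sum(a) / len(a)
cov = mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)
print(cov)   # near 0, since E(X^3) = 0

# Dependence: the joint event is impossible, the product of marginals is not.
p_joint = sum(1 for x in xs if -0.5 < x < 0.5 and x * x > 0.25) / N
p_prod = (sum(1 for x in xs if -0.5 < x < 0.5) / N) * \
         (sum(1 for x in xs if x * x > 0.25) / N)
print(p_joint, p_prod)   # 0 versus roughly 1/4
```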

Theorem 44 X and Y are independent $\iff$
$$M(t_1, t_2) = M(t_1, 0)\,M(0, t_2) = M_X(t_1)M_Y(t_2),$$
wherever the mgfs are defined.

Proof. ($\Rightarrow$)
$$M(t_1, t_2) = E[\exp(t_1 X + t_2 Y)] = E[e^{t_1 X}e^{t_2 Y}] = E[e^{t_1 X}]E[e^{t_2 Y}] \ (\text{independence}) = M_X(t_1)M_Y(t_2).$$
($\Leftarrow$) Suppose $M(t_1, t_2) = M_X(t_1)M_Y(t_2)$. Then
$$\int\!\!\int e^{t_1 x + t_2 y} f(x,y)\,dx\,dy = \int\!\!\int e^{t_1 x}e^{t_2 y} f_X(x)f_Y(y)\,dx\,dy.$$
So, by the uniqueness of the mgf, we get $f(x,y) = f_X(x)f_Y(y)$ for all x, y, and X and Y are independent.

(5) Multivariate case.

Definition 40 $X_1, \ldots, X_n$ are independent $\iff$
$$f(x_1, \ldots, x_n) = f_1(x_1)f_2(x_2)\cdots f_n(x_n)$$
for all $x_1, \ldots, x_n$, where the $f_i$ are the marginal pdfs and $f(x_1, \ldots, x_n)$ is the joint pdf.

Remark 15 Pairwise independence does not imply mutual independence.

Example 40 Consider $(X_1, X_2, X_3)$ with joint pmf
$$f(x) = \begin{cases} \frac{1}{4}, & (x_1, x_2, x_3) \in C \\ 0, & \text{otherwise,} \end{cases}$$
where
$$C = \{(1,0,0), (0,1,0), (0,0,1), (1,1,1)\}.$$
Each pair $(X_i, X_j)$ is independent, but $P(X_1 = X_2 = X_3 = 1) = \frac{1}{4} \ne \frac{1}{8}$, so $X_1, X_2, X_3$ are not mutually independent.
Example 41 Let $X_1, X_2, X_3$ be IID with pdf
$$f(x) = 2x\,1_{(0,1)}(x).$$

1. What is the distribution of $Y = \max(X_1, X_2, X_3)$?

Solution (distribution function method).
$$G(y) = P(Y \le y) = P(X_1 \le y, X_2 \le y, X_3 \le y) = P(X_1 \le y)P(X_2 \le y)P(X_3 \le y) = [P(X_1 \le y)]^3 = \left[\int_0^y 2x\,dx\right]^3.$$
That is,
$$G(y) = \begin{cases} 0, & y \le 0 \\ y^6, & 0 < y < 1 \\ 1, & y \ge 1. \end{cases}$$
So the pdf of Y is
$$g(y) = \begin{cases} 6y^5, & 0 < y < 1 \\ 0, & \text{otherwise.} \end{cases}$$

2. What is the distribution of $Z = \min(X_1, X_2, X_3)$?

Solution (distribution function method).
$$G(z) = P(Z \le z) = P(\min(X_1, X_2, X_3) \le z) = 1 - P(\min(X_1, X_2, X_3) > z) = 1 - [P(X_1 > z)]^3.$$
That is,
$$G(z) = \begin{cases} 0, & z \le 0 \\ 1 - (1 - z^2)^3, & 0 < z < 1 \\ 1, & z \ge 1. \end{cases}$$
So the pdf of Z is
$$g(z) = \begin{cases} 6z(1-z^2)^2, & 0 < z < 1 \\ 0, & \text{otherwise.} \end{cases}$$
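The max and min distributions in Example 41 can be checked by simulation (an addition, not from the text). Since the CDF here is $F(x) = x^2$, one can sample $X_i = \sqrt{U}$ with $U \sim U(0,1)$:

```python
import random

# Check P(max <= 0.8) ~ 0.8^6 and P(min <= 0.3) ~ 1 - (1 - 0.3^2)^3
# for three IID draws from the pdf 2x on (0, 1).
random.seed(4)
N = 300_000
count_y = count_z = 0
y0, z0 = 0.8, 0.3
for _ in range(N):
    xs = [random.random() ** 0.5 for _ in range(3)]  # inverse transform of F(x) = x^2
    count_y += max(xs) <= y0
    count_z += min(xs) <= z0
print(count_y / N)   # ~ 0.8^6 = 0.262144
print(count_z / N)   # ~ 1 - (1 - 0.09)^3 = 0.246...
```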

2.7.5 Bivariate normal distribution

Definition 41 We say that (X, Y) has the bivariate normal distribution, denoted $(X,Y)^T \sim N_2(\mu, \Sigma)$ with
$$\mu = (\mu_1, \mu_2)^T \quad\text{and}\quad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix},$$
if
$$f(x,y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\left[-\frac{1}{2(1-\rho^2)}Q(x,y)\right],$$
where
$$Q(x,y) = \left(\frac{x-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x-\mu_1}{\sigma_1}\right)\left(\frac{y-\mu_2}{\sigma_2}\right) + \left(\frac{y-\mu_2}{\sigma_2}\right)^2.$$

Result 1 f is a joint probability density function of (X, Y). That is,
1. $f(x,y) \ge 0$,
2. $\int\!\!\int f(x,y)\,dx\,dy = 1$.

Proof. Observe that, with the substitution $w = \dfrac{y-\mu_2}{\sigma_2}$,
$$f_X(x) = \int_{-\infty}^{\infty} f(x,y)\,dy = \frac{\exp\left[-\frac{1}{2(1-\rho^2)}\left(\frac{x-\mu_1}{\sigma_1}\right)^2\right]}{2\pi\sigma_1\sqrt{1-\rho^2}}\int_{-\infty}^{\infty}\exp\left[-\frac{1}{2(1-\rho^2)}\left(w^2 - 2\rho\frac{x-\mu_1}{\sigma_1}w\right)\right]dw.$$
Completing the square in w,
$$f_X(x) = \frac{\exp\left[-\frac{(x-\mu_1)^2}{2\sigma_1^2}\right]}{2\pi\sigma_1\sqrt{1-\rho^2}}\int_{-\infty}^{\infty}\exp\left[-\frac{1}{2(1-\rho^2)}\left(w - \rho\frac{x-\mu_1}{\sigma_1}\right)^2\right]dw = \frac{1}{\sqrt{2\pi}\,\sigma_1}\exp\left[-\frac{1}{2}\left(\frac{x-\mu_1}{\sigma_1}\right)^2\right].$$
Therefore $X \sim N(\mu_1, \sigma_1^2)$, and
$$\int\!\!\int f(x,y)\,dx\,dy = \int f_X(x)\,dx = 1.$$

Result 2 Suppose that $(X,Y)^T \sim N_2(\mu, \Sigma)$. Then
1. $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$;
2. $Y|_{X=x} \sim N\!\left(\mu_2 + \rho\dfrac{\sigma_2}{\sigma_1}(x-\mu_1),\ \sigma_2^2(1-\rho^2)\right)$.

Solution. Computing
$$f(y|x) = \frac{f(x,y)}{f_X(x)}$$
gives the result.

Remark 16 Suppose that $(X,Y)^T \sim N_2(\mu, \Sigma)$. Then it is known that there exists a symmetric matrix V such that $V^2 = \Sigma$.

Theorem 45 Suppose that $(X,Y)^T \sim N_2(\mu, \Sigma)$. Then

1. The mgf is
$$M_{(X,Y)}(t_1, t_2) = \exp\left\{\mu_1 t_1 + \mu_2 t_2 + \frac{1}{2}\left(\sigma_1^2 t_1^2 + 2\rho\sigma_1\sigma_2 t_1 t_2 + \sigma_2^2 t_2^2\right)\right\}.$$

2. In matrix form,
$$f(x) = |2\pi\Sigma|^{-1/2}\exp\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right],$$
where
$$|\Sigma| = \sigma_1^2\sigma_2^2(1-\rho^2), \qquad \Sigma^{-1} = \frac{1}{1-\rho^2}\begin{pmatrix} \frac{1}{\sigma_1^2} & \frac{-\rho}{\sigma_1\sigma_2} \\ \frac{-\rho}{\sigma_1\sigma_2} & \frac{1}{\sigma_2^2} \end{pmatrix},$$
and the mgf is
$$M(t) = \exp\left(\mu^T t + \frac{1}{2}t^T\Sigma t\right).$$
Moreover, $X \stackrel{d}{=} VZ + \mu$, where $Z \sim N_2(0, I_2)$ and V is the symmetric matrix with $V^2 = \Sigma$.
Proof.
$$M_{(X,Y)}(t_1, t_2) = E\exp\{t_1 X + t_2 Y\} = E[E[\exp\{t_1 X + t_2 Y\}\mid X]].$$
Since $Y|_{X=x}$ is normal, we get
$$E[\exp\{t_1 X + t_2 Y\}\mid X=x] = e^{t_1 x}\,E[e^{t_2 Y}\mid X=x] = e^{t_1 x}\exp\left(E(Y|X=x)\,t_2 + \frac{1}{2}t_2^2\,V(Y|X=x)\right).$$
That is,
$$E[\exp\{t_1 X + t_2 Y\}\mid X=x] = e^{t_1 x}\exp\left(\left\{\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x-\mu_1)\right\}t_2 + \frac{1}{2}t_2^2(1-\rho^2)\sigma_2^2\right) = \exp\left\{(\sigma_1 t_1 + \rho\sigma_2 t_2)\frac{x-\mu_1}{\sigma_1}\right\}\exp\left\{t_1\mu_1 + t_2\mu_2 + \frac{1}{2}(1-\rho^2)\sigma_2^2 t_2^2\right\}.$$
Therefore, since $(X-\mu_1)/\sigma_1 \sim N(0,1)$,
$$M_{(X,Y)}(t_1, t_2) = E\left[\exp\left\{(\sigma_1 t_1 + \rho\sigma_2 t_2)\frac{X-\mu_1}{\sigma_1}\right\}\right]\exp\left\{t_1\mu_1 + t_2\mu_2 + \frac{1}{2}(1-\rho^2)\sigma_2^2 t_2^2\right\} = M_Z(\sigma_1 t_1 + \rho\sigma_2 t_2)\exp\left\{t_1\mu_1 + t_2\mu_2 + \frac{1}{2}(1-\rho^2)\sigma_2^2 t_2^2\right\}$$
with $Z \sim N(0,1)$, so
$$M_{(X,Y)}(t_1, t_2) = \exp\left\{\frac{1}{2}(\sigma_1 t_1 + \rho\sigma_2 t_2)^2\right\}\exp\left\{t_1\mu_1 + t_2\mu_2 + \frac{1}{2}(1-\rho^2)\sigma_2^2 t_2^2\right\} = \exp\left\{t_1\mu_1 + t_2\mu_2 + \frac{1}{2}\left(\sigma_1^2 t_1^2 + 2\rho\sigma_1\sigma_2 t_1 t_2 + \sigma_2^2 t_2^2\right)\right\} = \exp\left\{\mu^T t + \frac{1}{2}t^T\Sigma t\right\}.$$
For the matrix form, since $X \stackrel{d}{=} VZ + \mu$,
$$M_X(t) = E\exp(t^T X) = E\exp(t^T VZ + t^T\mu) = \exp(t^T\mu)\,M_Z(V^T t) = \exp(t^T\mu)\exp\left\{\frac{1}{2}(V^T t)^T(V^T t)\right\} = \exp\left(t^T\mu + \frac{1}{2}t^T\Sigma t\right).$$

Result 3 Suppose that $(X,Y)^T \sim N_2(\mu, \Sigma)$. Then
$$(W_1, W_2) = \left(\frac{X-\mu_1}{\sigma_1}, \frac{Y-\mu_2}{\sigma_2}\right) \sim N_2\left(0, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right).$$

Solution.
$$M_{(W_1,W_2)}(t_1, t_2) = E\exp(t_1 W_1 + t_2 W_2) = E\exp\left(t_1\frac{X-\mu_1}{\sigma_1} + t_2\frac{Y-\mu_2}{\sigma_2}\right) = \exp\left(-\frac{t_1\mu_1}{\sigma_1} - \frac{t_2\mu_2}{\sigma_2}\right)M_{(X,Y)}\left(\frac{t_1}{\sigma_1}, \frac{t_2}{\sigma_2}\right).$$
Substituting the mgf from Theorem 45,
$$M_{(W_1,W_2)}(t_1, t_2) = \exp\left[\frac{1}{2}\left(t_1^2 + 2\rho t_1 t_2 + t_2^2\right)\right] = \exp\left[\frac{1}{2}t^T\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}t\right],$$
and this is the mgf of $N_2\left(0, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right)$.

Result 4 Suppose that $(X,Y)^T \sim N_2(\mu, \Sigma)$. Then
$$EX = \mu_1, \quad EY = \mu_2, \quad V(X) = \sigma_1^2, \quad V(Y) = \sigma_2^2,$$
$$\mathrm{Cov}(X,Y) = \rho\sigma_1\sigma_2, \quad \mathrm{Corr}(X,Y) = \rho.$$

Solution.
1.
$$E(XY) = \frac{\partial^2}{\partial t_1\,\partial t_2}M_{(X,Y)}(t_1, t_2)\Big|_{t_1=t_2=0} = \mu_1\mu_2 + \rho\sigma_1\sigma_2.$$
2.
$$E\left[\frac{X-\mu_1}{\sigma_1}\cdot\frac{Y-\mu_2}{\sigma_2}\right] = \frac{\partial^2}{\partial t_1\,\partial t_2}\exp\left[\frac{1}{2}\left(t_1^2 + 2\rho t_1 t_2 + t_2^2\right)\right]\Big|_{t_1=t_2=0} = \rho.$$

Result 5 Suppose that $(X,Y)^T \sim N_2(\mu, \Sigma)$. If $\rho = \mathrm{Corr}(X,Y) = 0$, then X and Y are independent.

Solution. Let $\rho = 0$. Then
$$M_{(X,Y)}(t_1, t_2) = \exp\left\{t_1\mu_1 + t_2\mu_2 + \frac{1}{2}\left(\sigma_1^2 t_1^2 + 2\rho\sigma_1\sigma_2 t_1 t_2 + \sigma_2^2 t_2^2\right)\right\} = \exp\left\{t_1\mu_1 + \frac{1}{2}\sigma_1^2 t_1^2\right\}\exp\left\{t_2\mu_2 + \frac{1}{2}\sigma_2^2 t_2^2\right\} = M_X(t_1)M_Y(t_2).$$
Therefore X and Y are independent.
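A bivariate normal pair can be sampled directly from the conditional structure in Result 2, and Result 4's covariance formula checked by simulation (an addition, not from the text; the parameter values are my choice):

```python
import random

# Draw X ~ N(mu1, s1^2), then Y | X = x ~ N(mu2 + rho*(s2/s1)*(x - mu1),
# s2^2*(1 - rho^2)); check Cov(X, Y) ~ rho*s1*s2.
random.seed(5)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 2.0, 0.5, 0.6
N = 300_000
xs, ys = [], []
for _ in range(N):
    x = random.gauss(mu1, s1)
    y = random.gauss(mu2 + rho * (s2 / s1) * (x - mu1),
                     s2 * (1 - rho ** 2) ** 0.5)
    xs.append(x)
    ys.append(y)

mean = lambda a: sum(a) / len(a)
cov = mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)
print(cov)   # ~ rho * s1 * s2 = 0.6
```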

2.8 Convergence of random variables

2.8.1 Convergence of random variables

Consider a sequence $\{x_n\}$ of real numbers. We first recall the definition of the limit of the sequence $\{x_n\}$.

Definition 42 A sequence $\{x_n\}$ of real numbers converges to a real number x if, given $\epsilon > 0$, there exists $N \in \mathbb{N}$ such that if $n \ge N$ then $|x_n - x| < \epsilon$. We denote this convergence by
$$\lim_{n\to\infty} x_n = x \quad\text{or}\quad x_n \to x,\ n \to \infty.$$

Now, we consider a sequence $\{X_n\}$ of random variables and a random variable X. What is the meaning of $X_n \to X$, $n \to \infty$? Recall that a random variable X is a function defined on the sample space $\Omega$. One can consider the pointwise convergence
$$X_n(\omega) \to X(\omega), \quad \omega \in \Omega.$$
However, this convergence is known to be not very useful in statistics and probability. We introduce a few modes of convergence of a sequence of random variables that are commonly used in statistics and probability.
Definition 43
1. A sequence $\{X_n\}$ of random variables converges with probability one to a random variable X if
$$P(X_n(\omega) \to X(\omega)) = 1.$$
We denote this convergence by $X_n \to_{wp1} X$ or $X_n \to_{a.s.} X$.

2. A sequence $\{X_n\}$ of random variables converges in probability to a random variable X if for each $\epsilon > 0$
$$\lim_{n\to\infty} P(|X_n - X| > \epsilon) = 0.$$
We denote this convergence by $X_n \to_P X$.

Remark 17
1. Convergence with probability one is also called strong convergence.
2. Convergence in probability is also called weak convergence.

Theorem 46
1. If $X_n$ converges strongly, then it converges weakly.
2. The converse of the above statement is not valid in general.
We introduce the law of large numbers without proof.

Theorem 47 Let $X_1, \ldots, X_n$ be IID with mean $\mu$. Consider the sample mean
$$\bar{X}_n = \frac{X_1 + \cdots + X_n}{n}.$$
Then
1. $\bar{X}_n$ converges strongly to $\mu$. This is known as the strong law of large numbers.
2. $\bar{X}_n$ converges weakly to $\mu$. This is known as the weak law of large numbers.

We next introduce convergence in distribution.

Definition 44 Suppose that the random variables $X_n$ have distribution functions $F_n$ and the random variable X has distribution function F. The sequence $\{X_n\}$ converges in distribution to X if
$$F_n(x) \to F(x)$$
for every x at which F is continuous. We denote this convergence by $X_n \to_D X$.
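The weak law of large numbers (Theorem 47) can be observed numerically; a small simulation sketch (an addition, not from the text; the Exp(1) population with $\mu = 1$ is my choice):

```python
import random

# Sample means of IID Exp(1) variables concentrate around mu = 1.
random.seed(6)

def sample_mean(n):
    return sum(random.expovariate(1.0) for _ in range(n)) / n

for n in (10, 100, 10_000):
    print(n, sample_mean(n))

# Estimate P(|X_bar_n - mu| > 0.1) for n = 10_000 by repetition.
reps = 200
p = sum(abs(sample_mean(10_000) - 1.0) > 0.1 for _ in range(reps)) / reps
print(p)   # essentially 0 at this n
```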

Theorem 48
1. If $X_n$ converges weakly to X, then $X_n$ converges in distribution to X.
2. The converse of the above statement is not true in general. However, if X is a constant, then the converse is also true. That is,
$$X_n \to_P c \iff X_n \to_D c.$$

Example 42 Let $Z_1, \ldots, Z_n$ be IID $N(0,1)$. Consider the sample mean
$$\bar{Z}_n := \frac{Z_1 + \cdots + Z_n}{n}.$$
Then $\bar{Z}_n \to_D 0$, and therefore $\bar{Z}_n \to_P 0$.

Solution. Note that $\bar{Z}_n \sim N(0, \frac{1}{n})$. Let $F_n$ be the distribution function of $\bar{Z}_n$. Then, with the substitution $\sqrt{n}\,t = v$,
$$F_n(x) = \int_{-\infty}^{x}\frac{1}{\sqrt{2\pi n^{-1}}}e^{-\frac{nt^2}{2}}\,dt = \int_{-\infty}^{\sqrt{n}x}\frac{1}{\sqrt{2\pi}}e^{-\frac{v^2}{2}}\,dv.$$
Therefore
$$F_n(x) \to \begin{cases} 0, & x < 0 \\ \frac{1}{2}, & x = 0 \\ 1, & x > 0. \end{cases}$$
That is, if
$$F(x) = \begin{cases} 0, & x < 0 \\ 1, & x \ge 0, \end{cases}$$
then for $x \ne 0$ we get $F_n(x) \to F(x)$. Since F is not continuous at 0, this suffices:
$$\bar{Z}_n \to_D 0.$$
2.8.2 The central limit theorem

We introduce the central limit theorem.

Theorem 49 Let $X_1, \ldots, X_n$ be IID $(\mu, \sigma^2)$. Then
$$Y_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \to_D Z \sim N(0,1).$$
That is,
$$\sqrt{n}(\bar{X}_n - \mu) \to_D N(0, \sigma^2).$$

Remark 18 The proof of the central limit theorem needs more advanced tools. The following theorem is useful.

Theorem 50 (Slutsky theorem) Let $X_n \to_D X$ and $V_n \to_P c$. Then
1. $X_n + V_n \to_D X + c$
2. $X_n V_n \to_D cX$

Example 43 Let $Z_1, \ldots, Z_n$ be IID $N(0,1)$ and let
$$V_n = Z_1^2 + \cdots + Z_n^2.$$
Let $Z \sim N(0,1)$, with $V_n$ and Z independent. Let
$$T_n = \frac{Z}{\sqrt{V_n/n}}.$$
Then $T_n \to_D N(0,1)$.

Proof. By the weak law of large numbers,
$$\frac{V_n}{n} \to_P EZ_1^2 = 1.$$
Therefore
$$\sqrt{V_n/n} \to_P 1.$$
By the Slutsky theorem, we get
$$T_n = \frac{Z}{\sqrt{V_n/n}} \to_D N(0,1).$$
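Theorem 49 can be illustrated by simulation (an addition, not from the text; the uniform population with $\mu = 1/2$, $\sigma^2 = 1/12$ is my choice):

```python
import random

# Y_n = sqrt(n)(X_bar - mu)/sigma for X_i ~ U(0, 1) should be roughly
# N(0, 1), so P(Y_n <= 1.96) should be close to 0.975.
random.seed(7)
n, reps = 48, 20_000
sigma = (1 / 12) ** 0.5
count = 0
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    y = (n ** 0.5) * (xbar - 0.5) / sigma
    count += y <= 1.96
print(count / reps)   # ~ 0.975
```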

2.8.3 Exercises

1. Suppose that $X_1, \ldots, X_n$ are independent and identically $N(\mu, 1)$, $\mu \in \mathbb{R}$, distributed random variables.

(a) Prove that $\bar{X}_n \to_P \mu$.

(b) Prove that $\sqrt{n}(\bar{X}_n - \mu) \to_D Z \sim N(0,1)$.

(c) Suppose that h is differentiable and $h'$ is continuous at $\mu$. Then prove that there exists $x^*$ between x and $\mu$ such that
$$h(x) - h(\mu) = h'(x^*)(x - \mu).$$

(d) For $Y_n$ satisfying $|Y_n - \mu| \le |\bar{X}_n - \mu|$, prove that
$$\sqrt{n}(h(\bar{X}_n) - h(\mu)) = h'(Y_n)\,\sqrt{n}(\bar{X}_n - \mu).$$

(e) Prove that $Y_n \to_P \mu$.

(f) Prove that $h'(Y_n) \to_P h'(\mu)$.

(g) Prove that
$$\sqrt{n}(h(\bar{X}_n) - h(\mu)) \to_D N(0, [h'(\mu)]^2).$$
This is called the delta method.

2. Suppose that $X_1, \ldots, X_n$ are IID $N(\mu, \sigma^2)$, $(\mu, \sigma^2) \in \mathbb{R}\times(0,\infty)$. Then find a function h satisfying
$$\sqrt{n}\left(h(S_n^2) - h(\sigma^2)\right) \to_D N(0,1).$$

3. Suppose that $X_1, \ldots, X_n$ are IID $B(1,p)$, $0 < p < 1$. Let
$$\hat{p} = \frac{\sum_{i=1}^n X_i}{n}.$$
Prove that
$$\sqrt{4n}\left(\arcsin\sqrt{\hat{p}} - \arcsin\sqrt{p}\right) \to_D N(0,1).$$
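The variance-stabilizing claim in Exercise 3 can be checked by simulation (an addition, not from the text): the standard deviation of $\sqrt{4n}(\arcsin\sqrt{\hat p} - \arcsin\sqrt{p})$ should be near 1 regardless of p:

```python
import math
import random

# Empirical standard deviation of the arcsine-transformed, scaled estimator
# for several values of p; each should be near 1.
random.seed(8)
n, reps = 400, 3_000
sds = {}
for p in (0.1, 0.5, 0.9):
    vals = []
    for _ in range(reps):
        p_hat = sum(random.random() < p for _ in range(n)) / n
        vals.append(math.sqrt(4 * n) *
                    (math.asin(math.sqrt(p_hat)) - math.asin(math.sqrt(p))))
    m = sum(vals) / reps
    sds[p] = (sum((v - m) ** 2 for v in vals) / reps) ** 0.5
print(sds)   # each value near 1
```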

Chapter 3

Basic thinking in Statistics

3.1 Sampling Theory

3.1.1 Descriptive statistics

Statistics can be classified as descriptive statistics and inferential statistics.

Suppose that the data of observed values after completing a statistical experiment are given. Then we can consider a frequency distribution, a histogram, a pie chart, or a stem-and-leaf display. We can also compute statistics such as the sample mean, sample median, sample variance, and sample correlation coefficient to analyze the data. This kind of process is known as descriptive statistics.

We mainly consider inferential statistics.

3.1.2 Sampling theory

Consider the random variables $X_1, \ldots, X_n$ having the joint probability density function
$$f(x_1, \ldots, x_n; \theta)$$
that intrinsically contains an unknown parameter $\theta$.

In statistical inference, one tries to infer the unknown parameter $\theta$ based on the random variables $X_1, \ldots, X_n$ that can be obtained as a model of a statistical experiment.

Example 44 Suppose that $X_1, \ldots, X_n$ is a sequence of independent and identically $B(1,p)$, $0 < p < 1$, distributed random variables. Then the joint probability density function is
$$f(x_1, \ldots, x_n; p) = f(x_1; p)\cdots f(x_n; p) = p^{x_1}(1-p)^{1-x_1}\cdots p^{x_n}(1-p)^{1-x_n}.$$
Suppose that we want to determine $E(X_1 + \cdots + X_n)$. Then we see that
$$E(X_1 + \cdots + X_n) = \sum_{x_1=0}^{1}\cdots\sum_{x_n=0}^{1}(x_1 + \cdots + x_n)\,p^{x_1}(1-p)^{1-x_1}\cdots p^{x_n}(1-p)^{1-x_n} = np,$$
and this is not a known number unless p is a known number.

Example 45 Suppose that $X_1, \ldots, X_n$ is a sequence of independent and identically $N(\mu, 1)$, $\mu \in \mathbb{R}$, distributed random variables. Then the joint probability density function is
$$f(x_1, \ldots, x_n; \mu) = f(x_1;\mu)\cdots f(x_n;\mu) = \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{(x_1-\mu)^2}{2}\right\}\cdots\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{(x_n-\mu)^2}{2}\right\}.$$
Suppose that we want to determine $E(X_1 + \cdots + X_n)$. Then we see that
$$E(X_1 + \cdots + X_n) = \int\cdots\int (x_1 + \cdots + x_n)f(x_1, \ldots, x_n; \mu)\,dx_1\cdots dx_n = n\mu,$$
and this is not a known number unless $\mu$ is a known number.

The unknown constant that we wish to know, such as p or $\mu$ in the previous examples, is called a parameter. The only fact that we know about p is that p is a constant that belongs to the interval [0,1], or (0,1). Similarly, the only fact that we know about $\mu$ is that $\mu$ is a constant that belongs to the interval $(-\infty, \infty)$. The set of allowable values containing the parameter is called a parameter space. In general, if $\theta$ is a parameter and $\Theta$ is the allowable set containing $\theta$, then $\Theta$ is called a parameter space.

We need the definition of a statistic based on the random variables $X_1, \ldots, X_n$ whose joint probability density function
$$f(x_1, \ldots, x_n; \theta)$$
is determined by a parameter $\theta$.

Definition 45 Suppose that $X_1, \ldots, X_n$ are random variables with joint probability density function $f(x_1, \ldots, x_n; \theta)$ determined by a parameter $\theta$. Then we say that $T(X_1, \ldots, X_n)$ is a statistic if it is a function of $X_1, \ldots, X_n$ which does not depend on the unknown parameter $\theta$. That is, a statistic is an observable random variable which is used for statistical inference about the unknown parameter $\theta$.

We know the meaning of saying that $X_1, \ldots, X_n$ is IID $f(x)$: that is, $X_1, \ldots, X_n$ is a sequence of independent and identically $f(x) = f(x;\theta)$, $\theta \in \Theta$, distributed random variables. In this case the joint probability density function is given by
$$f(x_1, \ldots, x_n; \theta) = f(x_1;\theta)\cdots f(x_n;\theta).$$
We use the following definition of a random sample in statistical inference.

Definition 46 We say that $X_1, \ldots, X_n$ is a random sample from a population with probability density function $f(x;\theta)$ if and only if $X_1, \ldots, X_n$ is IID $f(x;\theta)$.
Example 46 Let $X_1, \ldots, X_n$ be a random sample from a population with mean $\mu$ and variance $\sigma^2$. Then
$$\bar{X} = \bar{X}_n := \frac{X_1 + \cdots + X_n}{n}$$
is called the sample mean and
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$$
is the sample variance. These are statistics that will be used for statistical inference about $\mu$ and $\sigma^2$. The statistic
$$S_n^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2,$$
which is asymptotically equivalent to the sample variance $S^2$, is also used to infer the variance $\sigma^2$.

Remark 19 We have encountered the problem of finding the distribution of a transformed random variable $T(X_1, \ldots, X_n)$ given the distribution of $(X_1, \ldots, X_n)$. We have solved the problem by computing the distribution function or the moment generating function of $T(X_1, \ldots, X_n)$.

Example 47 Let $X_1, \ldots, X_n$ be a random sample from $N(\mu, \sigma^2)$ with both $\mu$ and $\sigma^2$ unknown.
1. $\bar{X}_n$ is a statistic, and $\bar{X}_n \sim N(\mu, \frac{\sigma^2}{n})$.
2. $\dfrac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}$ is not a statistic, since it depends on the unknown $\mu$ and $\sigma$; nevertheless $\dfrac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$.

We introduce the order statistics.

3.1.3 Order statistics

Definition 47 Let $X_1, \ldots, X_n$ be a random sample. Then the order statistic based on $X_1, \ldots, X_n$ is the statistic
$$X_{(1)} \le \cdots \le X_{(n)}$$
obtained by arranging the $X_i$ in order of magnitude.
which are obtained by arranging Xi in order of magnitude.

Note that if the $X_i$ are of continuous type then, with probability 1, $X_{(1)} < \cdots < X_{(n)}$. Assume that the $X_i$ are continuous for the rest of the section.

Result 6 Let $X_1, \ldots, X_n$ be a random sample from a distribution with probability density function $f(x)1(x \in A)$. Then the joint probability density function of $Y_1 = X_{(1)}, \ldots, Y_n = X_{(n)}$ is
$$g(y_1, \ldots, y_n) = n!\,f(y_1)\cdots f(y_n), \quad y_1 < \cdots < y_n,\ y_i \in A.$$

Result 7 (Marginal distribution of $X_{(k)}$, $1 \le k \le n$.) Under the conditions in Result 6, the probability density function of $X_{(k)}$ is
$$g_k(y) = \begin{cases} \dfrac{n!}{(k-1)!(n-k)!}[F(y)]^{k-1}[1-F(y)]^{n-k}f(y), & y \in A \\ 0, & \text{otherwise,} \end{cases}$$
where $F(y)$ is the distribution function of $X_1$.

Result 8 (Distribution of $(X_{(i)}, X_{(j)})$, $1 \le i < j \le n$.) Under the conditions in Result 6, the joint pdf of $(X_{(i)}, X_{(j)})$ is
$$g_{ij}(x,y) = \frac{n!}{(i-1)!(j-i-1)!(n-j)!}[F(x)]^{i-1}[F(y)-F(x)]^{j-i-1}[1-F(y)]^{n-j}f(x)f(y),$$
for $x < y$, $x \in A$, $y \in A$.
We review the probability integral transform.

Theorem 51 (Probability Integral Transformation [PIT])
1. Let X have the distribution function F which is continuous and strictly increasing. Then $F(X) \sim U(0,1)$.
2. Let $U \sim U(0,1)$ and let X have distribution function F which is continuous and strictly increasing. Then $F^{-1}(U) \stackrel{d}{=} X$.

Proof.
1. Let $Y = F(X)$. Then, for $0 < y < 1$,
$$G_Y(y) := P[F(X) \le y] = P[X \le F^{-1}(y)] = F(F^{-1}(y)) = y.$$
Therefore $Y = F(X) \sim U(0,1)$.
2. Let $Y = F^{-1}(U)$. Then
$$F_Y(y) := P[F^{-1}(U) \le y] = P[U \le F(y)] = F(y).$$
Therefore $Y = F^{-1}(U) \stackrel{d}{=} X$.

Remark 20
1. We can drop the assumption that F is strictly increasing by defining $F^{-1}(y) := \inf\{x : F(x) \ge y\}$.
2. The second part of the theorem is used to simulate random variables with distribution function F, whereas the first part is used in the study of order statistics.
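Remark 20.2 is exactly the inverse transform sampling method; a minimal sketch (an addition, not from the text) for the E(1) distribution, whose CDF $F(x) = 1 - e^{-x}$ inverts to $F^{-1}(u) = -\ln(1-u)$:

```python
import math
import random

# Simulate Exp(1) draws from U(0,1) via F^{-1}(u) = -ln(1 - u), then compare
# the empirical CDF with the exact CDF F(x) = 1 - e^{-x}.
random.seed(9)
xs = [-math.log(1 - random.random()) for _ in range(200_000)]
for x0 in (0.5, 1.0, 2.0):
    emp = sum(x <= x0 for x in xs) / len(xs)
    print(x0, emp, 1 - math.exp(-x0))   # empirical vs exact
```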
Example 48 Let $U_1, \ldots, U_n$ be a random sample from $U(0,1)$. Then $U_{(k)} \sim \mathrm{Beta}(k, n-k+1)$, $1 \le k \le n$.

Proof. The order statistic $U_{(k)}$ has probability density function
$$g_k(y) = \frac{n!}{(k-1)!(n-k)!}\,y^{k-1}(1-y)^{n-k}, \quad 0 < y < 1.$$
This is the probability density function of $\mathrm{Beta}(k, n-k+1)$.
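A quick numerical check of Example 48 (an addition, not from the text): the mean of $\mathrm{Beta}(k, n-k+1)$ is $k/(n+1)$, so the sorted values of a uniform sample should average to $k/(n+1)$:

```python
import random

# Average each order statistic of a U(0,1) sample of size n = 5 over many
# repetitions; expect means k/(n+1) = 1/6, 2/6, ..., 5/6.
random.seed(10)
n, reps = 5, 100_000
sums = [0.0] * n
for _ in range(reps):
    u = sorted(random.random() for _ in range(n))
    for k in range(n):
        sums[k] += u[k]
means = [s / reps for s in sums]
print(means)   # ~ [1/6, 2/6, 3/6, 4/6, 5/6]
```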

Example 49 Let $X_1, \ldots, X_n$ be a random sample from a distribution with distribution function F. Suppose that F is continuous and strictly increasing. Then $F(X_{(k)}) \sim \mathrm{Beta}(k, n-k+1)$.

Proof. Note that $U_i = F(X_i)$, $i = 1, \ldots, n$, are IID $U(0,1)$ by the PIT. Moreover, F is increasing, hence it preserves order: $U_{(j)} = F(X_{(j)})$, $1 \le j \le n$. Therefore $F(X_{(k)}) \sim \mathrm{Beta}(k, n-k+1)$.

It is known that if $X \sim U(0,1)$ then the distribution of $Y = -\ln(1-X)$ is E(1). Conversely, if $Y \sim E(1)$ then the distribution of $X = 1 - e^{-Y}$ is $U(0,1)$. See the Exercises below.

Example 50 Let $U_{(1)} < \cdots < U_{(n)}$ be the order statistics based on a random sample $U_1, \ldots, U_n$ from $U(0,1)$. Then, prove that
$$\left[\frac{U_{(1)}}{U_{(2)}}\right], \left[\frac{U_{(2)}}{U_{(3)}}\right]^2, \ldots, \left[\frac{U_{(n-1)}}{U_{(n)}}\right]^{n-1}, \left[U_{(n)}\right]^n$$
are IID $U(0,1)$.

Solution. Note that

1. $V_i = 1 - U_i$ are IID $U(0,1)$.

2. $X_i = -\ln(1-V_i) = -\ln U_i$ are IID E(1).

3. $X_{(i)} = -\ln(1-V_{(i)}) = -\ln U_{(n-i+1)}$ are the order statistics of E(1).

4.
$$Z_1 = nX_{(1)} = -n\ln U_{(n)},$$
$$Z_2 = (n-1)(X_{(2)} - X_{(1)}) = -(n-1)\ln\frac{U_{(n-1)}}{U_{(n)}},$$
$$\vdots$$
$$Z_n = X_{(n)} - X_{(n-1)} = -\ln\frac{U_{(1)}}{U_{(2)}}$$
are IID E(1).

5. Therefore
$$e^{-Z_n} = \frac{U_{(1)}}{U_{(2)}}, \ \ldots, \ e^{-Z_2} = \left[\frac{U_{(n-1)}}{U_{(n)}}\right]^{n-1}, \ e^{-Z_1} = \left[U_{(n)}\right]^n$$
are IID $U(0,1)$.

3.1.4 The $\chi^2$, t, and F distributions

Definition 48 Let $Z_1, \ldots, Z_n$ be IID $N(0,1)$. Then $\chi^2$ has a chi-square distribution with n degrees of freedom if
$$\chi^2 \stackrel{d}{=} Z_1^2 + \cdots + Z_n^2.$$
Denote $\chi^2 \sim \chi^2(n)$.

Remark 21 Note that the random variable $\chi^2$ has the probability density function
$$f(x) = \frac{1}{\Gamma(\frac{n}{2})\,2^{n/2}}\,x^{\frac{n}{2}-1}e^{-\frac{x}{2}}\,1(x > 0),$$
and hence $\chi^2(n) \stackrel{d}{=} \Gamma(\frac{n}{2}, 2)$.

Theorem 52 (Fundamental results of normal sampling) Suppose that $X_1, \ldots, X_n$ are IID $N(\mu, \sigma^2)$. Write
$$\bar{X} = \frac{X_1 + \cdots + X_n}{n}, \qquad S^2 = \frac{(X_1-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n-1}.$$
Then,
1. $\bar{X} \sim N(\mu, \frac{\sigma^2}{n})$, or $\dfrac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1)$;
2. $\dfrac{(n-1)S^2}{\sigma^2} = \dfrac{\sum_{i=1}^n (X_i-\bar{X})^2}{\sigma^2} \sim \chi^2(n-1)$;
3. $\bar{X}$ and $S^2$ are independent.

Definition 49 Suppose that $Z \sim N(0,1)$, $V \sim \chi^2(n)$, and Z, V are independent. Then T has the (Student's) t distribution with n degrees of freedom if
$$T \stackrel{d}{=} \frac{Z}{\sqrt{V/n}},$$
denoted $T \sim t(n)$.

Remark 22
1. If $T \sim t(n)$, then T has the probability density function
$$f(x) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)\sqrt{n\pi}}\left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}}, \quad x \in \mathbb{R}.$$
2. If $T \sim t(n)$ then $-T \sim t(n)$. That is, the distribution of T is symmetric with respect to 0.
Example 51 Let $X_1, \ldots, X_n$ be IID $N(\mu, \sigma^2)$, $(\mu, \sigma^2) \in \mathbb{R}\times(0,\infty)$. Let
$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2.$$
Then we know that
$$\frac{\bar{X}_n - \mu}{S/\sqrt{n}} \sim t(n-1).$$
Therefore we have
$$P\left(\bar{X}_n - t_{\alpha/2}(n-1)\frac{S}{\sqrt{n}} < \mu < \bar{X}_n + t_{\alpha/2}(n-1)\frac{S}{\sqrt{n}}\right) = 1-\alpha,$$
where $t_{\alpha/2}(n-1)$ is the $\alpha/2$th upper quantile of $t(n-1)$.

Definition 50 Suppose that $U \sim \chi^2(m)$, $V \sim \chi^2(n)$, and U, V are independent. Then F has the F distribution with degrees of freedom m and n if
$$F \stackrel{d}{=} \frac{U/m}{V/n},$$
denoted $F \sim F(m,n)$.
Remark 23
1. If $F \sim F(m,n)$ then F has the probability density function
$$f(x) = \frac{\Gamma\left(\frac{m+n}{2}\right)\left(\frac{m}{n}\right)^{m/2}}{\Gamma\left(\frac{m}{2}\right)\Gamma\left(\frac{n}{2}\right)}\,x^{\frac{m}{2}-1}\left(1 + \frac{mx}{n}\right)^{-\frac{m+n}{2}}, \quad x > 0.$$
2. If $T \sim t(n)$ then $T^2 \sim F(1,n)$.
3. If $F \sim F(m,n)$ then $1/F \sim F(n,m)$.
Example 52 Let $X_1, \ldots, X_m$ be IID $N(\mu_1, \sigma_1^2)$, $(\mu_1, \sigma_1^2) \in \mathbb{R}\times(0,\infty)$, and let $Y_1, \ldots, Y_n$ be IID $N(\mu_2, \sigma_2^2)$, $(\mu_2, \sigma_2^2) \in \mathbb{R}\times(0,\infty)$. Suppose that $(X_1, \ldots, X_m)$ and $(Y_1, \ldots, Y_n)$ are independent. Let
$$S_1^2 = \frac{1}{m-1}\sum_{i=1}^m (X_i - \bar{X}_m)^2, \qquad S_2^2 = \frac{1}{n-1}\sum_{i=1}^n (Y_i - \bar{Y}_n)^2.$$
Then we know that
$$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} = \frac{\sum_{i=1}^m\left(\frac{X_i-\bar{X}}{\sigma_1}\right)^2/(m-1)}{\sum_{i=1}^n\left(\frac{Y_i-\bar{Y}}{\sigma_2}\right)^2/(n-1)} \sim F(m-1, n-1).$$
Therefore we get
$$P\left(\frac{S_2^2}{S_1^2}\,F_{1-\alpha/2}(m-1, n-1) < \frac{\sigma_2^2}{\sigma_1^2} < \frac{S_2^2}{S_1^2}\,F_{\alpha/2}(m-1, n-1)\right) = 1-\alpha,$$
where $F_\alpha(m-1, n-1)$ is the $\alpha$th upper quantile of $F(m-1, n-1)$ and $F_{1-\alpha/2}(m,n) = 1/F_{\alpha/2}(n,m)$.

3.1.5  Exercises

1. If X ∼ U(0, 1), use the distribution function method to prove that the distribution of Y = −ln(1 − X) is E(1). Conversely, if Y ∼ E(1), use the distribution function method to prove that the distribution of X = 1 − e^(−Y) is U(0, 1).

2. Suppose that X_1, X_2, X_3 have the continuous probability density function f(x)1(x ∈ A). Use the change of variable method to prove that Y_1 = X_(1), Y_2 = X_(2), Y_3 = X_(3) have the joint probability density function
    g(y_1, y_2, y_3) = 3! f(y_1) f(y_2) f(y_3),   y_1 < y_2 < y_3, y_i ∈ A.

3. Suppose that X_1, X_2, X_3 have the continuous probability density function f(x)1(x ∈ A) and the distribution function F. Prove that Y_2 = X_(2) has the marginal probability density function
    g_2(y) = (3!/(1!1!)) F(y)[1 − F(y)] f(y)  for y ∈ A,  and 0 otherwise.

4. Let X_(1) < ... < X_(n) be the order statistics from E(λ), λ > 0. Then nX_(1), (n−1)(X_(2) − X_(1)), ..., (X_(n) − X_(n−1)) are IID E(λ). Use the change of variable technique to prove this result.
Hint. Note that:
(a) If we let
    T :  Z_1 = nX_(1),  Z_2 = (n−1)(X_(2) − X_(1)),  ...,  Z_n = (X_(n) − X_(n−1)),
(b) then the inverse transform is given by
    T^(−1) :  X_(1) = (1/n)Z_1,
              X_(2) = (1/(n−1))Z_2 + (1/n)Z_1,
              ...
              X_(n) = Z_n + (1/2)Z_{n−1} + ⋯ + (1/(n−1))Z_2 + (1/n)Z_1.
(c) Then T is 1-1 from A := {(x_1, ..., x_n) : 0 < x_1 < x_2 < ⋯ < x_n} onto B := {(z_1, ..., z_n) : z_1 > 0, z_2 > 0, ..., z_n > 0}, and the Jacobian matrix of T^(−1) is lower triangular with diagonal entries 1/n, 1/(n−1), ..., 1/2, 1, so that
    |J_{T^(−1)}| = 1/n!.
(d) The joint probability density function of (X_(1), ..., X_(n)) is
    f(x_1, ..., x_n) = n! λ^n exp[−λ Σ_{i=1}^n x_i] I(0 < x_1 < x_2 < ⋯ < x_n).
(e) Therefore, the joint probability density function of (Z_1, ..., Z_n) is
    g(z_1, ..., z_n) = n! λ^n exp[−λ Σ_{i=1}^n x_i] I(z_1 > 0, ..., z_n > 0) |J_{T^(−1)}|
                     = λ^n exp[−λ Σ_{i=1}^n z_i] Π_{i=1}^n I(z_i > 0)
                     = Π_{i=1}^n [λ exp(−λ z_i) I(z_i > 0)],
since Σ x_i = Σ z_i under T; hence Z_1, ..., Z_n are IID E(λ).

5. Suppose U_1, ..., U_n is a random sample from U(0, 1). Find the probability density function of the range R = U_(n) − U_(1).

6. Suppose U_1, ..., U_n is a random sample from U(0, 1). Find the joint probability density function of
    X_1 = U_(1), X_2 = U_(2) − U_(1), ..., X_n = U_(n) − U_(n−1), X_{n+1} = 1 − U_(n).

7. Let X_1, ..., X_n be a random sample from a distribution with the distribution function F. Suppose that F is continuous (and strictly increasing). Let Y_1 = min_{1≤i≤n} X_i and Y_n = max_{1≤i≤n} X_i.
(a) What are the joint and marginal distributions of Z_1 = F(Y_1) and Z_n = F(Y_n)?
(b) Prove that Z_1/Z_n and Z_n are independent.
3.2  Estimation in statistical inference

3.2.1  Point estimation

We consider the problems in point estimation.
Suppose that the unknown parameter θ belongs to the parameter space Θ. Suppose that
    X_1 = x_1, . . . , X_n = x_n
are observed values of the random sample from the probability density function f(x; θ), θ ∈ Θ. We try to determine θ in terms of a statistic u(X_1, . . . , X_n).
We introduce the concepts of the point estimator and the point estimate.

Definition 51 Let the observed values
    X_1 = x_1, . . . , X_n = x_n
of the random sample from the probability density function f(x; θ), θ ∈ Θ, be given.
1. We say that the function
    θ̂(X_1, . . . , X_n) = u(X_1, . . . , X_n)
of the random variables (X_1, . . . , X_n) is a point estimator of θ. In this case the point estimator θ̂(X_1, . . . , X_n) is a random variable.
2. We say that the function
    θ̂(x_1, . . . , x_n) = u(x_1, . . . , x_n)
of the observed values (x_1, . . . , x_n) is a point estimate of θ. In this case the point estimate θ̂(x_1, . . . , x_n) is a numerical value.
3. If we use θ̂ to infer the unknown parameter θ, then θ̂ can be an estimator or an estimate. In this case we have to understand the meaning from the context.
Example 53 Suppose that X_1, . . . , X_n are a random sample from B(1, p), 0 < p < 1. In this case, p is the population success ratio.
1. The function of random variables
    p̂ = p̂(X_1, . . . , X_n) = (1/n) Σ_{i=1}^n X_i
is a point estimator of p and is the sample success ratio.
2. Suppose that n = 3 and the observed values are 0, 1, 1. In this case,
    p̂ = p̂(0, 1, 1) = (0 + 1 + 1)/3 = 2/3
is a point estimate of p.
Example 54 Suppose that X_1, . . . , X_n are a random sample from N(μ, 1), μ ∈ ℝ. In this case, μ is the population mean and the population median.
Firstly, we use the sample mean.
1. The function of random variables
    μ̂ = μ̂(X_1, . . . , X_n) = (1/n) Σ_{i=1}^n X_i
is a point estimator of μ and it is a sample mean.
2. Suppose that n = 3 and the observed values are −3, −2, 5. In this case,
    μ̂ = μ̂(−3, −2, 5) = (−3 − 2 + 5)/3 = 0
is a point estimate of μ.
Secondly, we use the sample median.
1. The function of random variables
    μ̂ = μ̂(X_1, . . . , X_n) = median_{1≤i≤n} X_i
is a point estimator of μ and it is a sample median.
2. Suppose that n = 3 and the observed values are −3, −2, 5. In this case,
    μ̂ = μ̂(−3, −2, 5) = −2
is a point estimate of μ.
There are many other estimators to infer an unknown parameter. Classical point estimators include the method of moment estimator, the maximum likelihood estimator, and the Bayesian estimator.
We introduce the method of moment estimator and the maximum likelihood estimator in this section.

Method of moment estimator

Definition 52 (Method of moment estimator) Let θ = g(m_1, m_2, . . . , m_k) be a function of the population moments m_r = EX^r, r ≥ 1. Then the method of moment estimator of θ is
    θ̂_MME := g(X̄, X̄_2, . . . , X̄_k),
where X̄_r = (1/n) Σ_{i=1}^n X_i^r are the sample moments.

Properties 3
1. θ̂_n,MME →_P θ if g is continuous.
2. √n(θ̂_n,MME − θ) →_D N(0, σ²) if g is differentiable.
3. It may not be asymptotically efficient.
Example 55 Suppose that X_1, . . . , X_n are a random sample from B(1, p), p ∈ [0, 1].
1. Note that p = E_p X_1 is the population mean. Therefore the sample mean
    p̂_MME = (1/n) Σ_{i=1}^n X_i = X̄_n
is the method of moment estimator of p.
2. Note that p(1 − p) = Var_p X_1 is the population variance. Therefore the sample variance
    [p(1−p)]^_MME = (1/n) Σ_{i=1}^n (X_i − X̄_n)²
is the method of moment estimator of p(1 − p). In this case, by the property
    X² = X
of the Bernoulli random variable, we see that
    [p(1−p)]^_MME = (1/n) Σ_{i=1}^n (X_i − X̄_n)²
                  = (1/n) Σ_{i=1}^n X_i² − X̄_n²
                  = X̄_n − X̄_n²
                  = X̄_n (1 − X̄_n)
                  = p̂_MME (1 − p̂_MME).
3. Note that p² = (E_p X_1)² is a function of the population mean. Therefore the function of the sample moment
    [p²]^_MME = ( (1/n) Σ_{i=1}^n X_i )² = (p̂_MME)²
is the method of moment estimator of p².

Example 56 The Gamma distribution Γ(α, β) has the probability density function
    (1/(Γ(α) β^α)) x^(α−1) e^(−x/β),
and the population mean and the population variance are given by
    μ = αβ  and  σ² = αβ².
Therefore we have
    α = μ²/σ²  and  β = σ²/μ.
Hence
    α̂_MME = X̄² / [ (1/n) Σ_{i=1}^n (X_i − X̄)² ]
and
    β̂_MME = [ (1/n) Σ_{i=1}^n (X_i − X̄)² ] / X̄.
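A quick numerical sketch of these estimators (illustrative only; the true values α = 3 and β = 2 are chosen arbitrarily so the estimates can be checked against them):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha_true, beta_true = 3.0, 2.0     # shape and scale of the Gamma
x = rng.gamma(shape=alpha_true, scale=beta_true, size=200000)

xbar = x.mean()
s2 = ((x - xbar) ** 2).mean()        # (1/n) sum (x_i - xbar)^2

# Method of moment estimators from Example 56:
#   alpha_hat = xbar^2 / s2,  beta_hat = s2 / xbar.
alpha_hat = xbar ** 2 / s2
beta_hat = s2 / xbar
print(alpha_hat, beta_hat)
```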

Maximum likelihood estimator

Let X_1, . . . , X_n be a random sample from a population with the probability density function f(·, θ), θ ∈ Θ ⊂ ℝ.
1. The idea of maximum likelihood originated with R. A. Fisher: if
    Π_{i=1}^n f(x_i; θ_1) > Π_{i=1}^n f(x_i; θ_2)
for a given data x = (x_1, . . . , x_n)^T, the model f(·, θ_1) is more likely to be the true model than the model f(·, θ_2).
2. Likelihood function and log likelihood function:
    L(θ) or L(θ; x) := Π_{i=1}^n f(x_i; θ),
as a function of θ, and
    l(θ) := log L(θ) = Σ_{i=1}^n log f(x_i; θ).
3. The maximum likelihood estimator is
    θ̂_MLE := arg max_θ L(θ; x), if it exists.
(a) It may not exist.
(b) It may not be unique even when it exists.
4. Likelihood equation: θ̂_MLE is a solution of
    l̇(θ) = 0,  where l̇(θ) := (∂/∂θ) l(θ),
provided that l(θ) is smooth.
(a) One needs to check whether it is indeed a global maximizer.
(b) That l̈(θ) < 0 for all θ is one sufficient condition for the global maximizer to be unique (strictly concave function).
Example 57 (MLE of Bernoulli(θ)) Let X_1, . . . , X_n be a random sample from B(1, θ), 0 < θ < 1.
    l(θ) = (Σ_{i=1}^n x_i) log θ + (n − Σ_{i=1}^n x_i) log(1 − θ),
    l̇(θ) = (Σ_{i=1}^n x_i)/θ − (n − Σ_{i=1}^n x_i)/(1 − θ),
    l̈(θ) = −(Σ_{i=1}^n x_i)/θ² − (n − Σ_{i=1}^n x_i)/(1 − θ)² < 0.
Solving l̇(θ) = 0 gives θ = (1/n) Σ_{i=1}^n x_i. Therefore
    θ̂_MLE = (1/n) Σ_{i=1}^n X_i.
Example 58 (MLE of Logistic(θ, 1)) Let X_1, . . . , X_n be a random sample from Logistic(θ, 1) with pdf
    f(x; θ) = exp(−(x − θ)) / (1 + exp(−(x − θ)))²,   −∞ < x < ∞,
where −∞ < θ < ∞.
    l(θ) = nθ − Σ_{i=1}^n x_i − 2 Σ_{i=1}^n log(1 + exp(−(x_i − θ))),
    l̇(θ) = n − 2 Σ_{i=1}^n exp(−(x_i − θ)) / (1 + exp(−(x_i − θ))),
    l̈(θ) = −2 Σ_{i=1}^n exp(−(x_i − θ)) / (1 + exp(−(x_i − θ)))² < 0.
Therefore θ̂_MLE is the unique solution of
    Σ_{i=1}^n exp(−(x_i − θ)) / (1 + exp(−(x_i − θ))) = n/2.
How to solve it?
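One standard answer, consistent with l̈(θ) < 0 above, is Newton's method: since the score l̇ is strictly decreasing, the likelihood equation has a unique root, and the iteration θ ← θ − l̇(θ)/l̈(θ) converges to it. A minimal sketch (the data values are hypothetical):

```python
import math

def score(theta, xs):
    # l-dot(theta) = n - 2 * sum exp(-(x-theta))/(1+exp(-(x-theta)))
    #              = n - 2 * sum 1/(1+exp(x-theta))
    return len(xs) - 2.0 * sum(1.0 / (1.0 + math.exp(x - theta)) for x in xs)

def score_deriv(theta, xs):
    # l-double-dot(theta) = -2 * sum p(1-p), p = 1/(1+exp(x-theta)); always < 0
    total = 0.0
    for x in xs:
        p = 1.0 / (1.0 + math.exp(x - theta))
        total += p * (1.0 - p)
    return -2.0 * total

# Hypothetical observed sample (illustrative only).
xs = [-1.2, -0.3, 0.4, 0.8, 0.9, 1.5, 2.1]
theta = xs[len(xs) // 2]          # start Newton at the sample median
for _ in range(50):
    theta -= score(theta, xs) / score_deriv(theta, xs)

theta_mle = theta
print(theta_mle)
```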


Theorem 53 (MLE and Reparametrization) For η = g(θ) with g being one to one and onto,
    η̂_MLE = g(θ̂_MLE).
Proof.
    L(η, x) = Π_{i=1}^n f(x_i; θ) = Π_{i=1}^n f(x_i; g^(−1)(η)) := L(g^(−1)(η), x)   (abused notation).
Then
    max_η L(g^(−1)(η), x) = max_θ L(θ, x) = L(θ̂_MLE, x) = L(g^(−1)(g(θ̂_MLE)), x).
Therefore
    η̂_MLE = arg max_η L(g^(−1)(η), x) = g(θ̂_MLE).
Example 59 Let X_1, . . . , X_n be a random sample from Exp(β), β > 0.
    L(β) = Π_{i=1}^n (1/β) e^(−x_i/β).
With λ = 1/β,
    L(λ) = Π_{i=1}^n λ e^(−λ x_i),
    l(λ) = n log λ − λ Σ_{i=1}^n x_i,
    l̇(λ) = n/λ − Σ_{i=1}^n x_i,
    l̈(λ) = −n/λ² < 0.
Therefore
    λ̂_MLE = n / Σ_{i=1}^n x_i.
So, by the reparametrization theorem,
    β̂_MLE = 1/λ̂_MLE = (1/n) Σ_{i=1}^n x_i.
Note that, as a function of β, l(β) is not a strictly concave function.


Unbiasedness and consistency

We introduce the unbiasedness and consistency of an estimator.

Definition 53 Let θ̂_n = θ̂(X_1, . . . , X_n) be an estimator of the unknown parameter θ.
1. We say that θ̂_n is an unbiased estimator of θ if
    E_θ(θ̂_n) = θ for all θ ∈ Θ.
2. We say that θ̂_n is an asymptotically unbiased estimator of θ if
    lim_{n→∞} E_θ(θ̂_n) = θ for all θ ∈ Θ.
3. We say that θ̂_n is a consistent estimator of θ if
    lim_{n→∞} P_θ( |θ̂_n − θ| > ε ) = 0 for all ε > 0.

We review the weak law of large numbers, which is useful in verifying the consistency of an estimator.
Theorem 54 Let X_1, . . . , X_n be a random sample from a population with the mean μ. Then the sample mean
    X̄_n = (1/n) Σ_{i=1}^n X_i
converges weakly to the population mean μ. That is, for ε > 0,
    lim_{n→∞} P( |X̄_n − μ| > ε ) = 0.
Example 60 Suppose that X_1, . . . , X_n is a random sample from B(1, p), p ∈ [0, 1]. Then the estimator
    p̂ = (1/n) Σ_{i=1}^n X_i
is
1. a method of moment estimator of p,
2. a maximum likelihood estimator of p,
3. an unbiased estimator of p, and
4. a consistent estimator of p.

Example 61 Suppose that X_1, . . . , X_n is a random sample from N(μ, 1), μ ∈ ℝ. Then the estimator
    μ̂ = (1/n) Σ_{i=1}^n X_i
is
1. a method of moment estimator of μ,
2. a maximum likelihood estimator of μ,
3. an unbiased estimator of μ, and
4. a consistent estimator of μ.

Example 62 Suppose that X_1, . . . , X_n is a random sample from N(μ, σ²), μ ∈ ℝ and σ² > 0. Then the estimator
    μ̂ = (1/n) Σ_{i=1}^n X_i
is
1. a method of moment estimator of μ,
2. a maximum likelihood estimator of μ,
3. an unbiased estimator of μ, and
4. a consistent estimator of μ.
The estimator
    S_n² = (1/n) Σ_{i=1}^n (X_i − X̄_n)²
is
1. a method of moment estimator of σ²,
2. a maximum likelihood estimator of σ²,
3. a consistent estimator of σ², and
4. not an unbiased estimator of σ². Indeed, we have
    S_n² = (1/n) Σ_{i=1}^n [X_i − μ]² − [X̄_n − μ]²,
so
    E S_n² = σ² − σ²/n = ((n − 1)/n) σ².
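The bias computation E S_n² = ((n − 1)/n)σ² can be confirmed by simulation; the sketch below (with arbitrarily chosen σ² = 4 and n = 5) compares the n-divisor and (n − 1)-divisor variance estimators.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2, n, reps = 4.0, 5, 200000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
sn2 = x.var(axis=1, ddof=0)   # divides by n     -> biased
s2 = x.var(axis=1, ddof=1)    # divides by n - 1 -> unbiased

# Theory: E[Sn^2] = (n-1)/n * sigma^2 = 3.2, while E[S^2] = sigma^2 = 4.
print(sn2.mean(), s2.mean())
```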

3.2.2  Interval estimation

Interval estimation of the population mean

Suppose that X_1, . . . , X_n is a random sample from N(μ, σ²), μ ∈ ℝ and σ² > 0. Then from the fundamental results of normal sampling we know that
    X̄_n ∼ N(μ, σ²/n).
That is,
    (X̄_n − μ)/(σ/√n) ∼ N(0, 1).
Suppose that the population variance σ² is a known number. Let z_{α/2} be the α/2-th upper quantile of the standard normal distribution. Then,
    P( −z_{α/2} < (X̄_n − μ)/(σ/√n) < z_{α/2} ) = 1 − α.
Therefore we get
    P( X̄_n − z_{α/2} σ/√n < μ < X̄_n + z_{α/2} σ/√n ) = 1 − α.
In this case the random interval (σ is known)
    ( X̄_n − z_{α/2} σ/√n , X̄_n + z_{α/2} σ/√n )
is called the 100(1 − α)% confidence interval for μ.
Once we are given the observed value X̄_n = x̄, the interval
    ( x̄ − z_{α/2} σ/√n , x̄ + z_{α/2} σ/√n )
is no more random but a fixed interval. For example, when α = 0.05, we have z_{α/2} = 1.96. In this case the interpretation that
    P( x̄ − 1.96 σ/√n < μ < x̄ + 1.96 σ/√n ) = 0.95
is wrong. This probability is 0 or 1. If
    P( x̄ − 1.96 σ/√n < μ < x̄ + 1.96 σ/√n ) = 1,
then the observed interval
    ( x̄ − 1.96 σ/√n , x̄ + 1.96 σ/√n )
surely contains the parameter μ, and if
    P( x̄ − 1.96 σ/√n < μ < x̄ + 1.96 σ/√n ) = 0,
then the observed interval
    ( x̄ − 1.96 σ/√n , x̄ + 1.96 σ/√n )
does not contain the parameter μ. The 95% confidence interval means that about 95% of the observed intervals do contain the unknown parameter μ.
In general we get the following.
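This frequency interpretation can be checked by simulation: if we repeatedly draw samples and form the interval, about 95% of the realized intervals should contain μ. A sketch (with arbitrarily chosen μ = 50, σ = 5, n = 25):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, reps = 50.0, 5.0, 25, 10000
z = 1.96                      # z_{0.025}, upper 2.5% point of N(0, 1)

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
half = z * sigma / np.sqrt(n)

# Fraction of the 10000 observed intervals that contain the true mu.
covered = (xbar - half < mu) & (mu < xbar + half)
coverage = covered.mean()
print(coverage)
```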
Definition 54 Suppose that X_1, . . . , X_n is a random sample from a population with the probability density function f(x; θ), θ ∈ Θ. If the statistics θ̂_1(X) = u(X_1, . . . , X_n) and θ̂_2(X) = v(X_1, . . . , X_n) satisfy
    P_θ( θ̂_1(X) < θ < θ̂_2(X) ) = 1 − α,
then the random interval
    ( θ̂_1(X) , θ̂_2(X) )
is called the 100(1 − α)% confidence interval for the parameter θ.

Example 63 Suppose that X_1, . . . , X_n is a random sample from N(μ, σ²), μ ∈ ℝ and σ² > 0.
1. Suppose that σ² is known. Then
    X̄_n ± z_{α/2} σ/√n
is the 100(1 − α)% confidence interval for μ.
2. Suppose that σ² is unknown. Then
    X̄_n ± t_{α/2}(n − 1) S/√n
is the 100(1 − α)% confidence interval for μ, where
    S² = (1/(n−1)) Σ_{i=1}^n (X_i − X̄_n)²
and t_{α/2}(n − 1) is the α/2-th upper quantile of the t distribution with n − 1 degrees of freedom. Using the relation
    S² = (n/(n−1)) S_n²,  where  S_n² = (1/n) Σ_{i=1}^n (X_i − X̄_n)²,
we see that
    X̄_n ± t_{α/2}(n − 1) S_n/√(n−1)
is also the 100(1 − α)% confidence interval for μ.
3. These confidence intervals are exact in the sense that we derive them using the exact distributions.
Proof.
1. Suppose that σ² is known. In this case we can use
    Z = (X̄ − μ)/(σ/√n) ∼ N(0, 1)
to derive the desired confidence interval.
2. Suppose that σ² is unknown. In this case
    (X̄ − μ)/(σ/√n) ∼ N(0, 1)
is no more suitable to derive the confidence interval. We have to estimate σ. Using the fundamental results from normal sampling, we get
(a) (X̄ − μ)/(σ/√n) ∼ N(0, 1),
(b) (n − 1)S²/σ² ∼ χ²(n − 1),
(c) X̄ and S² are independent.
Write
    T = [ (X̄ − μ)/(σ/√n) ] / √[ ((n − 1)S²/σ²)/(n − 1) ].
Then we get
    T = (X̄ − μ)/(S/√n) ∼ t(n − 1).
Now we can use this to derive the desired confidence interval. Indeed, let t_{α/2}(n − 1) be the α/2-th upper quantile of the t distribution with n − 1 degrees of freedom. Then,
    P( −t_{α/2}(n − 1) < (X̄_n − μ)/(S/√n) < t_{α/2}(n − 1) ) = 1 − α.
Therefore we get
    P( X̄ − t_{α/2}(n − 1) S/√n < μ < X̄ + t_{α/2}(n − 1) S/√n ) = 1 − α.
In this case the random interval
    ( X̄ − t_{α/2}(n − 1) S/√n , X̄ + t_{α/2}(n − 1) S/√n )
is called the 100(1 − α)% confidence interval for μ.
3. These intervals are not approximate but exact.

Example 64 Suppose that X_1, . . . , X_n is a random sample from a population with the mean μ and the variance σ², with E|X|⁴ < ∞.
1. Suppose that σ² is known. Then
    X̄_n ± z_{α/2} σ/√n
is the 100(1 − α)% confidence interval for μ.
2. Suppose that σ² is unknown. Then
    X̄_n ± z_{α/2} S/√n
is the 100(1 − α)% confidence interval for μ, where
    S² = (1/(n−1)) Σ_{i=1}^n (X_i − X̄_n)².
3. These confidence intervals are approximate in the sense that we derive them using the approximate distributions.
Proof. We know that
    E(X̄) = μ  and  V(X̄) = σ²/n.
1. Suppose that σ² is known. In this case we cannot use the standardized quantity
    (X̄ − μ)/(σ/√n)
directly to derive the confidence interval, because we don't have any information about its distribution. If n is large, we may apply the central limit theorem to get the approximate distribution. Indeed, by the central limit theorem, we have
    √n (X̄_n − μ) →_D N(0, σ²) as n → ∞,
or
    (X̄ − μ)/(σ/√n) ≈ N(0, 1) for n large.
Now use the relation to get the approximate confidence interval.
2. Suppose that σ² is unknown. Apply the central limit theorem to get
    (X̄_n − μ)/(σ/√n) →_D N(0, 1) as n → ∞,
and apply the law of large numbers to get
    S_n² →_P σ²  or  S² →_P σ².
Now apply Slutsky's theorem to get
    (X̄ − μ)/(S/√n) →_D N(0, 1) as n → ∞,
or
    (X̄ − μ)/(S/√n) ≈ N(0, 1) for n large.
Now use the relation to get the approximate confidence interval.
3. These intervals are approximate and are valid for large n.

Example 65 Suppose that X_1, . . . , X_n is a random sample from B(1, p), 0 < p < 1. Let
    p̂ = (Σ_{i=1}^n X_i)/n
be the sample ratio. Then, for n large, the 100(1 − α)% confidence intervals for the population ratio p are given by:
1.
    [ p̂ + z_{α/2}²/(2n) ± z_{α/2} √( p̂(1−p̂)/n + z_{α/2}²/(4n²) ) ] / ( 1 + z_{α/2}²/n ).
2.
    p̂ ± z_{α/2} √( p̂(1−p̂)/n ).
3. From
    arcsin √p̂ − z_{α/2}/√(4n) < arcsin √p < arcsin √p̂ + z_{α/2}/√(4n).
4. These are approximate intervals that are valid for n large.
    Performance : [3] > [1] > [2]
    Complexity : [3] < [1] < [2].

Proof.
1. Apply the central limit theorem to get
    √n (p̂ − p) →_D N(0, p(1 − p)).
Therefore we get
    (p̂ − p)/√(p(1−p)/n) →_D N(0, 1),
or
    (p̂ − p)/√(p(1−p)/n) ≈ N(0, 1) for n large.
Now,
    |p̂ − p|/√(p(1−p)/n) < z_{α/2}
    ⟺ (1 + z_{α/2}²/n) p² − 2(p̂ + z_{α/2}²/(2n)) p + p̂² < 0,
and solving this quadratic inequality in p gives the interval [1].
2. Apply the weak law of large numbers and the central limit theorem to get
    p̂ →_P p
and
    (p̂ − p)/√(p(1−p)/n) →_D N(0, 1).
Apply Slutsky's theorem to get
    (p̂ − p)/√(p̂(1−p̂)/n) →_D N(0, 1),
or
    (p̂ − p)/√(p̂(1−p̂)/n) ≈ N(0, 1) for n large.
Now use this to derive the desired confidence interval.
3. Apply the central limit theorem to get
    √n (p̂ − p) →_D N(0, p(1 − p)).
Note that h(x) = arcsin(√x) is differentiable. Apply the delta method to get
    √(4n) ( arcsin √p̂ − arcsin √p ) →_D N(0, 1),
or
    √(4n) ( arcsin √p̂ − arcsin √p ) ≈ N(0, 1) for n large.
Now use this to derive the desired confidence interval.
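The three intervals can be computed side by side. In the sketch below (the counts are hypothetical), interval [1] is what is commonly called the Wilson score interval, [2] the Wald interval, and [3] the arcsine (variance-stabilized) interval, although the text does not use these names.

```python
import math

n, successes = 100, 37        # hypothetical data (illustrative only)
p_hat = successes / n
z = 1.96                      # z_{0.025}

# [1] Score interval, from the quadratic inequality in p
center = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
half1 = (z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))) / (1 + z**2 / n)
ci1 = (center - half1, center + half1)

# [2] Wald-type interval: p_hat +- z * sqrt(p_hat(1-p_hat)/n)
half2 = z * math.sqrt(p_hat * (1 - p_hat) / n)
ci2 = (p_hat - half2, p_hat + half2)

# [3] Arcsine interval, mapped back to the p scale by sin^2
lo = math.asin(math.sqrt(p_hat)) - z / math.sqrt(4 * n)
hi = math.asin(math.sqrt(p_hat)) + z / math.sqrt(4 * n)
ci3 = (math.sin(lo) ** 2, math.sin(hi) ** 2)

print(ci1, ci2, ci3)
```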

The following example shows how to find the confidence interval for the difference of means under the assumption that the population variances are known.

Example 66 Suppose that X_1, . . . , X_m is a random sample from a population with the unknown mean μ_1 and the known variance σ_1², (μ_1, σ_1²) ∈ ℝ × (0, ∞), and Y_1, . . . , Y_n is a random sample from a population with the unknown mean μ_2 and the known variance σ_2², (μ_2, σ_2²) ∈ ℝ × (0, ∞). Suppose that (X_1, . . . , X_m) and (Y_1, . . . , Y_n) are independent.
1. Suppose that X_1, . . . , X_m and Y_1, . . . , Y_n are normal samples. In this case,
    X̄ − Ȳ ± z_{α/2} √( σ_1²/m + σ_2²/n )
is the 100(1 − α)% confidence interval for the difference of means μ_1 − μ_2. This interval is exact.
2. Suppose that X_1, . . . , X_m and Y_1, . . . , Y_n are general samples. In this case,
    X̄ − Ȳ ± z_{α/2} √( σ_1²/m + σ_2²/n )
is the 100(1 − α)% confidence interval for the difference of means μ_1 − μ_2. This interval is approximate and is valid for m and n large.
Proof.
1. Note that
    X̄ ∼ N(μ_1, σ_1²/m),   Ȳ ∼ N(μ_2, σ_2²/n),
and X̄ and Ȳ are independent. Therefore we get
    X̄ − Ȳ ∼ N(μ_1 − μ_2, σ_1²/m + σ_2²/n),
or
    ( X̄ − Ȳ − (μ_1 − μ_2) ) / √( σ_1²/m + σ_2²/n ) ∼ N(0, 1).
Use this to get the exact confidence interval for μ_1 − μ_2.
2. Note that, abusing notations, by the central limit theorem
    X̄ ≈ N(μ_1, σ_1²/m),   Ȳ ≈ N(μ_2, σ_2²/n),
and X̄ and Ȳ are independent. Therefore we get
    X̄ − Ȳ ≈ N(μ_1 − μ_2, σ_1²/m + σ_2²/n),
or
    ( X̄ − Ȳ − (μ_1 − μ_2) ) / √( σ_1²/m + σ_2²/n ) ≈ N(0, 1) for m and n large.
Use this to get the approximate confidence interval for μ_1 − μ_2.

The following example shows how to find the confidence interval for the population variance.

Example 67 Suppose that X_1, . . . , X_n is a random sample from N(μ, σ²), (μ, σ²) ∈ ℝ × (0, ∞).
1. Suppose that μ is known. In this case, the 100(1 − α)% confidence interval for σ² is obtained from
    χ²_{1−α/2}(n) < Σ_{i=1}^n (X_i − μ)²/σ² < χ²_{α/2}(n),
since we know that
    Σ_{i=1}^n (X_i − μ)²/σ² ∼ χ²(n).
2. Suppose that μ is unknown. In this case, the 100(1 − α)% confidence interval for σ² is obtained from
    χ²_{1−α/2}(n − 1) < Σ_{i=1}^n (X_i − X̄)²/σ² < χ²_{α/2}(n − 1),
since we know that
    Σ_{i=1}^n (X_i − X̄)²/σ² ∼ χ²(n − 1).
The following example shows how to find the confidence interval for the ratio of the two population variances.

Example 68 Suppose that X_1, . . . , X_m is a random sample from N(μ_1, σ_1²), (μ_1, σ_1²) ∈ ℝ × (0, ∞), and Y_1, . . . , Y_n is a random sample from N(μ_2, σ_2²), (μ_2, σ_2²) ∈ ℝ × (0, ∞). Suppose that (X_1, . . . , X_m) and (Y_1, . . . , Y_n) are independent.
1. Suppose that μ_1 and μ_2 are known. In this case, the 100(1 − α)% confidence interval for the ratio of variances σ_2²/σ_1² is obtained from
    F_{1−α/2}(m, n) < [ Σ_{i=1}^m (X_i − μ_1)²/(m σ_1²) ] / [ Σ_{i=1}^n (Y_i − μ_2)²/(n σ_2²) ] < F_{α/2}(m, n)
and
    F_{1−α/2}(m, n) = F_{α/2}(n, m)^(−1),
since we know that
    [ Σ_{i=1}^m (X_i − μ_1)²/(m σ_1²) ] / [ Σ_{i=1}^n (Y_i − μ_2)²/(n σ_2²) ] ∼ F(m, n).
2. Suppose that μ_1 and μ_2 are unknown. In this case, the 100(1 − α)% confidence interval for the ratio of variances σ_2²/σ_1² is obtained from
    F_{1−α/2}(m − 1, n − 1) < [ Σ_{i=1}^m (X_i − X̄)²/((m−1) σ_1²) ] / [ Σ_{i=1}^n (Y_i − Ȳ)²/((n−1) σ_2²) ] < F_{α/2}(m − 1, n − 1)
and
    F_{1−α/2}(m − 1, n − 1) = F_{α/2}(n − 1, m − 1)^(−1),
since we know that
    [ Σ_{i=1}^m (X_i − X̄)²/((m−1) σ_1²) ] / [ Σ_{i=1}^n (Y_i − Ȳ)²/((n−1) σ_2²) ] ∼ F(m − 1, n − 1).

3.2.3  Exercises

1. Suppose that X_1, . . . , X_n is a random sample from Γ(α, β), α > 0, β > 0. Find the maximum likelihood estimators of α and β.
2. Suppose that X_1, . . . , X_n is a random sample from P(λ), λ > 0. Prove that X̄_n and S_n² are both method of moment estimators of λ.
3. Suppose that X_1, . . . , X_n is a random sample from P(λ), λ > 0. Find the maximum likelihood estimator of λ.
4. Suppose that (X_1, Y_1)^T, . . . , (X_n, Y_n)^T is a random sample from a population with the mean vector (μ_1, μ_2)^T and the variance-covariance matrix
    ( σ_1²       ρ σ_1 σ_2 )
    ( ρ σ_1 σ_2  σ_2²      ).
Find the method of moment estimator of the population correlation coefficient ρ.
5. Suppose that X ∼ f(x; θ), θ ∈ Θ = {0, 1, 2, 3}, and the probability mass function
    f(x; θ) = P_θ(X = x)
is given by the following table.

                          θ = 0   θ = 1   θ = 2   θ = 3
    f(1; θ) = P(X = 1)    1/4     1/2     1/3     0
    f(2; θ) = P(X = 2)    1/4     0       1/3     1/2
    f(3; θ) = P(X = 3)    1/4     0       1/3     0
    f(4; θ) = P(X = 4)    1/4     1/2     0       1/2
    Total                 1       1       1       1

Find the maximum likelihood estimator of θ.
6. Suppose that X_1, . . . , X_n is a random sample from a population with the probability density function f(x; θ), θ ∈ Θ. In this case, the estimator
    θ̂_LSE = arg min_θ Σ_{i=1}^n (X_i − θ)²
is called the least square estimator of θ. Now, suppose that X_1, . . . , X_n is a random sample from N(μ, 1), μ ∈ ℝ. Prove that the least square estimator of μ is the same as the maximum likelihood estimator of μ.
7. Suppose that X_1, . . . , X_n is a random sample from N(μ, σ²), (μ, σ²) ∈ ℝ × (0, ∞). Find the maximum likelihood estimator of the population variance σ².

3.3  Testing in statistical inference

3.3.1  Testing statistical hypothesis

Testing statistical hypotheses is to determine whether a claim or conjecture about some feature of the population, a parameter, is strongly supported by the information obtained from the sample data. Here we illustrate the testing of hypotheses concerning a population mean μ. The available data will be assumed to be a random sample of size n from the population of interest. Further, the sample size n will be large (n ≥ 30 for a rule of thumb).
The formulation of a hypothesis testing problem and the steps for solving it require a number of definitions and concepts. We will introduce these key statistical concepts, the null hypothesis and the alternative hypothesis, the two types of error, the level of significance, the rejection region, and the p-value, in the context of a specific problem to help integrate them with intuitive reasoning.

Problem 1 Can a new traffic system reduce the mean traffic time in attending school from one's home?
The city wants to reduce the traffic time it takes a student to attend school from his home. From extensive records, it is found that the traffic times have a distribution with mean μ = 60 and standard deviation σ = 5 minutes. The city suggests that a new traffic system will reduce the mean traffic time for a student to attend school from his home. For experimental verification, a random sample of 49 traffic times will be taken with the new system upgrade and the sample mean
    X̄ = (X_1 + ⋯ + X_49)/49
will be calculated. How should the result be used toward a statistical validation of the claim that the true (population) mean traffic time is less than 60 minutes?

3.3.2  Formulating the hypotheses

Definition 55 In the language of statistics, the claim or the research hypothesis that we wish to establish is called the alternative hypothesis H_1. The opposite statement, one that nullifies the research hypothesis, is called the null hypothesis H_0. The word "null" in this context means that the assertion we are seeking to establish is actually void.
When our goal is to establish an assertion with substantive support obtained from the sample, the negation of the assertion is taken to be the null hypothesis H_0 and the assertion itself is taken to be the alternative hypothesis H_1.

Example 69 In our traffic problem,
1. the alternative hypothesis is
    H_1 : the mean traffic time is reduced,
and
2. the null hypothesis is
    H_0 : the mean traffic time is the same as before.

Our initial question, "Is there strong evidence in support of the claim?", now translates to "Is there strong evidence for rejecting H_0?" The first version typically appears in the statement of a practical problem, whereas the second appears in its statistical formulation; note the correspondence between the two formulations of the question.
Before claiming that a statement is established statistically, adequate evidence from data must be produced to support it. A close analogy can be made to a court trial, where the jury clings to the null hypothesis of "not guilty" unless there is convincing evidence of guilt. The intent of the hearings is to establish the assertion that the accused is guilty, rather than to prove that he or she is innocent.

                                  Court Trial             Statistical Testing
    Requires strong evidence      Guilt                   Conjecture (research
    to establish:                                         hypothesis)
    Null hypothesis (H_0):        Not guilty.             Conjecture is false.
    Alternative hypothesis (H_1): Guilty.                 Conjecture is true.
    Attitude:                     Uphold "not guilty"     Retain the null
                                  unless there is a       hypothesis unless it
                                  strong evidence of      makes the sample data
                                  guilt.                  very unlikely to happen.

False rejection of H_0 is a more serious error than failing to reject H_0 when H_1 is true.
Once H_0 and H_1 are formulated, our goal is to analyze the sample data in order to choose between them.
A decision rule, or a test of the null hypothesis, specifies a course of action by stating what sample information is to be used and how it is to be used in making a decision. Bear in mind that we are to make one of the following two decisions:

Rule 1 Either reject H_0 and conclude that H_1 is substantiated, or retain H_0 and conclude that H_1 fails to be substantiated.

Rejection of H_0 amounts to saying that H_1 is substantiated, whereas non-rejection or retention of H_0 means that H_1 fails to be substantiated. A key point is that a decision to reject H_0 must be based on strong evidence. Otherwise, the claim H_1 could not be established beyond a reasonable doubt.
In our problem of evaluating the new traffic system, let μ be the population mean traffic time. Because μ is claimed to be lower than 60 minutes, we formulate the alternative hypothesis as H_1 : μ < 60. According to the description of the problem, the researcher does not care to distinguish between the situations μ = 60 and μ > 60, for the claim is false in either case; it is customary to take the null hypothesis as the statement of no difference. Accordingly, we formulate the following.

Example 70
    Test H_0 : μ = 60 versus H_1 : μ < 60.

3.3.3  Test criterion and rejection region

Naturally, the sample mean X̄, calculated from the measurements of the n = 49 randomly selected traffic times, ought to be the basis for rejecting H_0 or not. The question now is: for what sort of values of X̄ should we reject H_0? Because the claim states that μ is low (a left-sided alternative), only low values of X̄ can contradict H_0 in favor of H_1. Therefore, a reasonable decision rule should be of the form
    Reject H_0 if X̄ ≤ c,
    Retain H_0 if X̄ > c.
This decision rule is conveniently expressed as R : X̄ ≤ c, where R stands for the rejection of H_0. Also, in this context, the set of outcomes [X̄ ≤ c] is called the rejection region or critical region, and the cutoff point c is called the critical value.
The cutoff point c must be specified in order to fully describe a decision rule. To this end, we consider the case when H_0 holds, that is, μ = 60. Rejection of H_0 would then be a wrong decision, amounting to a false acceptance of the claim, a serious error. For an adequate protection against this kind of error, we must ensure that P[X̄ ≤ c] is very small when μ = 60. For example, suppose that we wish to hold a low probability of α = 0.05 for a wrong rejection of H_0. Then our task is to find the c that makes
    P[X̄ ≤ c] = 0.05 when μ = 60.
We know that, for large n, the distribution of X̄ is approximately normal with mean μ and standard deviation σ/√n, whatever the form of the underlying population. Here n = 49 is large, and we initially assume that σ is known. Specifically, we assume that σ = 5 minutes, the same standard deviation as with the original traffic time. Then, when μ = 60, the distribution of X̄ is N(60, 5/√49), so
    Z = (X̄ − 60)/(5/√49)
has the N(0, 1) distribution. Because P[Z ≤ −1.645] = 0.05, the cutoff c on the X̄-scale must be 1.645 standard deviations below μ_0 = 60, or
    c = 60 − 1.645 (5/√49) = 58.8.
Our decision rule is now completely specified by the rejection region
    R : X̄ ≤ 58.8,
which has α = 0.05 as the probability of wrongly rejecting H_0.

Figure. Rejection region with the cutoff c = 58.8: the area under the normal curve (centered at 60.0) to the left of 58.8; equivalently, R : Z ≤ −1.645.

Instead of locating the rejection region on the scale of X̄, we can cast the decision criterion on the standardized scale as well:
    Z = (X̄ − μ_0)/(σ/√n) = (X̄ − 60)/(5/√49),
and set the rejection region as R : Z ≤ −1.645 (see the figure). This form is more convenient because the cutoff −1.645 is directly read off the normal table, whereas the determination of c involves additional numerical work.

Definition 56 The random variable X̄ whose value serves to determine the action is called the test statistic. A test of the null hypothesis is a course of action specifying the set of values of a test statistic X̄ for which H_0 is to be rejected. The set is called the rejection region of the test. A test is completely specified by a test statistic and the rejection region.

3.3.4  Two types of error and their probabilities

Up to this point we only considered the probability of rejecting H_0 when, in fact, H_0 is true, and illustrated how a decision rule is determined by setting this probability equal to 0.05. The following table shows all the consequences that might arise from the use of a decision rule.

                           Unknown True Situation
    Decision Based         H_0 True (μ = 60)          H_1 True (μ < 60)
    On Sample
    Reject H_0             Wrong rejection of H_0     Correct decision
                           (Type I error)
    Retain H_0             Correct decision           Wrong retention of H_0
                                                      (Type II error)

In particular, when our sample-based decision is to reject H_0, we either have a correct decision (if H_1 is true) or we commit a Type I error (if H_0 is true). On the other hand, a decision to retain H_0 either constitutes a correct decision (if H_0 is true) or leads to a Type II error. To summarize:

Definition 57 A Type I error is the error of rejection of H_0 when H_0 is true. A Type II error is the error of non-rejection of H_0 when H_1 is true.
    α = probability of making a Type I error (also called the level of significance),
    β = probability of making a Type II error.

In our problem of evaluating the new traffic system, the rejection region is of the form R : X̄ ≤ c; thus,
    α = P[X̄ ≤ c] when μ = 60 (H_0 true),
    β = P[X̄ > c] when μ < 60 (H_1 true).
Of course, the probability β depends on the numerical value of μ that prevails under H_1. The figure shows the Type I error probability α as the shaded area under the normal curve that has μ = 60 and the Type II error probability β as the shaded area under the normal curve that has μ = 58.8, a case of H_1 being true.

Figure. The error probabilities α and β: α is the area to the left of c under the H_0 curve (μ = 60.0), and β is the area to the right of c under the H_1 curve (μ = 58.8).

From the figure, it is transparent that no choice of the cutoff c can minimize both of the error probabilities α and β. If c is moved to the left, α gets smaller but β gets larger, and if c is moved to the right, just the opposite effects take place. In view of this dilemma and the fact that a wrong rejection of H_0 is the more serious error, we hold α at a predetermined low level such as 0.10, 0.05, or 0.01 when choosing a rejection region. We will not pursue the evaluation of β, but we do note that if β turns out to be uncomfortably large, the sample size must be increased.

3.3.5

Performing a test

When determining the rejection region for this example, we assumed that
= 5 seconds, the same standard deviation as with the original
money
is N (60, 5/ 49) and
machines. Then, when = 60, the distribution of X
58.86 was arrived at by xing = 0.05 and
the rejection region R : X
referring
60
X

Z=
5/ 49
to the standard normal distribution.
In practice, we are usually not sure about the assumption that σ = 5, the standard deviation of the traffic times using the new system, is the same as with the original system. But that does not cause any problem as long as the sample size is large. When n is large (n ≥ 30), the normal approximation for X̄ remains valid even if σ is estimated by the sample standard deviation S. Therefore, for testing H0 : μ = μ0 versus H1 : μ < μ0 with level of significance α, we employ the test statistic

Z = (X̄ − μ0) / (S/√n)

and set the rejection region R : Z ≤ −zα. This test is commonly called a large sample normal test or a Z-test.
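The test just described is easy to mechanize. The sketch below (an illustration with function names of our own choosing) computes the large-sample Z statistic and applies the left-sided rejection rule; the traffic-time numbers serve as a check.

```python
from math import sqrt

def z_statistic(xbar, s, n, mu0):
    # Large-sample test statistic Z = (X-bar - mu0) / (S / sqrt(n)).
    return (xbar - mu0) / (s / sqrt(n))

def left_sided_z_test(xbar, s, n, mu0, z_alpha):
    # Reject H0: mu = mu0 in favor of H1: mu < mu0 when Z <= -z_alpha.
    z = z_statistic(xbar, s, n, mu0)
    return z, z <= -z_alpha

# Data of the traffic example: n = 49, x-bar = 58.5, s = 4.2, mu0 = 60.
z, reject = left_sided_z_test(58.5, 4.2, 49, 60.0, 1.96)
print(round(z, 4), reject)
```

With these inputs the statistic comes out to −2.5, which falls in the rejection region Z ≤ −1.96.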
Example 71 Referring to the new traffic system traffic times, suppose that, from the measurements of a random sample of 49 traffic times, the sample mean and standard deviation are found to be 58.5 and 4.2 minutes, respectively. Test the null hypothesis H0 : μ = 60 versus H1 : μ < 60 using a 2.5% level of significance and state whether or not the claim (μ < 60) is substantiated.

Solution. Because n = 49 and the null hypothesis specifies that μ has the value μ0 = 60, we employ the test statistic

Z = (X̄ − 60) / (S/√49)
The rejection region should consist of small values of Z because H1 is left-sided. For a 2.5% level of significance, we take α = 0.025, and since z0.025 = 1.96, the rejection region is (see Figure) R : Z ≤ −1.96.
Figure: The rejection region for Z. A standard normal curve centered at 0.0; the shaded area of size 0.025 lies to the left of the cutoff −1.96.
With the observed values x̄ = 58.5 and s = 4.2, we calculate the test statistic

z = (58.5 − 60) / (4.2/√49) = −2.5

Because this observed z is in R, the null hypothesis is rejected at the level of significance α = 0.025. We conclude that the claim of a reduction in the mean traffic time is strongly supported by the data.

3.3.6 The p-value

Our test in Example 71 was based on the fixed level of significance α = 0.025, and we rejected H0 because the observed z = −2.5 fell in the rejection region R : Z ≤ −1.96. Strong evidence against H0 emerged due to the fact that a small α was used. The natural question at this point is: How small an α could we use and still arrive at the conclusion of rejecting H0? To answer this question, we consider the observed z = −2.5 itself as the cutoff point (critical value) and calculate the rejection probability

P[Z ≤ −2.5] = 0.0062

The smallest possible α that would permit rejection of H0, on the basis of the observed z = −2.5, is therefore 0.0062. It is called the significance probability or p-value of the observed z. This very small p-value, 0.0062, signifies a strong rejection of H0, or that the result is highly statistically significant.
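The probability P[Z ≤ −2.5] can be computed without tables. The sketch below (illustrative; the helper names are our own) evaluates the left-sided p-value from the standard normal CDF, written via the error function.

```python
from math import erf, sqrt

def norm_cdf(x):
    # Standard normal CDF from the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def left_sided_p_value(z_obs):
    # p-value for a left-sided Z-test: P[Z <= z_obs] under H0.
    return norm_cdf(z_obs)

p = left_sided_p_value(-2.5)
print(f"p-value = {p:.4f}")  # p-value = 0.0062
```

The computed value agrees with the 0.0062 quoted above.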
Definition 58 The p-value is the probability, calculated under H0, that the test statistic takes a value equal to or more extreme than the value actually observed. The p-value serves as a measure of strength of evidence against H0. A small p-value means that the null hypothesis is strongly rejected, or that the result is highly statistically significant.
Our illustrations of the basic concepts of hypothesis tests thus far focused on a problem where the alternative hypothesis is of the form H1 : μ < μ0, called a left-sided alternative. If the alternative hypothesis in a problem states that the true μ is larger than its null hypothesis value of μ0, we formulate the right-sided alternative H1 : μ > μ0 and use a right-sided rejection region R : Z ≥ zα.
We illustrate the right-sided case in an example after summarizing the main steps that are involved in the conduct of a statistical test.
1. Formulate the null hypothesis H0 and the alternative hypothesis H1.
2. Test criterion: State the test statistic and the form of the rejection region.
3. With a specified α, determine the rejection region.
4. Calculate the test statistic from the data.
5. Draw a conclusion: State whether or not H0 is rejected at the specified α and interpret the conclusion in the context of the problem. Also, it is good statistical practice to calculate the p-value and strengthen the conclusion.


Chapter 4
The uniform distribution
1. Let X1, . . . , Xn be a random sample from U(0, θ), θ > 0. Let X(n) = max1≤i≤n Xi. Then X(n) →D θ. That is,

Fn(x) → F(x) =
  0,  x < θ
  1,  x ≥ θ

where Fn is the distribution function of X(n).

Solution. Notice that

Fn(x) =
  0,        x < 0
  (x/θ)^n,  0 ≤ x < θ
  1,        x ≥ θ.

Now take the limit to get

Fn(x) → F(x) :=
  0,  x < θ
  1,  x ≥ θ,

which is the distribution function of the constant θ.
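The convergence in Problem 1 can be checked numerically. The sketch below (illustrative) codes the exact distribution function of X(n) derived above and shows it collapsing toward a point mass at θ as n grows: below θ the cdf tends to 0, while at and above θ it equals 1.

```python
def F_n(x, theta, n):
    # Exact cdf of X(n), the maximum of n iid U(0, theta) draws.
    if x < 0:
        return 0.0
    if x < theta:
        return (x / theta) ** n
    return 1.0

theta = 2.0
# As n grows, F_n(x) -> 0 for any x < theta, while F_n(theta) = 1 always.
for n in (5, 50, 500):
    print(n, F_n(1.9, theta, n), F_n(theta, theta, n))
```

For x = 1.9 the value is (1.9/2)^n = 0.95^n, which is already negligible by n = 500.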

2. Let X1, . . . , Xn be a random sample from U(0, θ), θ > 0. Let X(n) = max1≤i≤n Xi. Then n(θ − X(n)) →D Exponential(1/θ). That is,

Gn(z) → G(z) :=
  0,            z < 0
  1 − e^(−z/θ),  z ≥ 0

where Gn is the distribution function of n(θ − X(n)).

Solution. The distribution function of Zn = n(θ − X(n)) is

Gn(z) = P(X(n) ≥ θ − z/n) = 1 − P(X(n) < θ − z/n)
  =
  0,                                              z < 0
  1 − ((θ − z/n)/θ)^n = 1 − (1 − z/(nθ))^n,  0 ≤ z < nθ
  1,                                              z ≥ nθ.

Now take the limit to get

Gn(z) → G(z) :=
  0,            z < 0
  1 − e^(−z/θ),  z ≥ 0,

which is the distribution function of the Exponential(1/θ) distribution.
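The limit in Problem 2 is also easy to verify numerically. The sketch below (illustrative) codes the exact cdf Gn derived above and the limiting exponential cdf G, and shows the gap between them shrinking as n grows (here at θ = 2 and z = 1.5, values chosen only for demonstration).

```python
from math import exp

def G_n(z, theta, n):
    # Exact cdf of Z_n = n(theta - X(n)) for a U(0, theta) sample of size n.
    if z < 0:
        return 0.0
    if z < n * theta:
        return 1.0 - (1.0 - z / (n * theta)) ** n
    return 1.0

def G(z, theta):
    # Limiting Exponential(1/theta) cdf: 1 - exp(-z/theta) for z >= 0.
    return 0.0 if z < 0 else 1.0 - exp(-z / theta)

theta, z = 2.0, 1.5
for n in (10, 100, 10000):
    print(n, abs(G_n(z, theta, n) - G(z, theta)))
```

The printed gaps decrease toward 0, reflecting (1 − z/(nθ))^n → e^(−z/θ).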

3. Let X1, . . . , Xn be a random sample from U(0, θ), θ > 0. Let X(n) = max1≤i≤n Xi. Then X(n) →P θ.

Solution. In order to show that X(n) is a consistent estimator of θ, we notice that, as n → ∞,
(a) EX(n) = (n/(n+1))θ → θ,
(b) Var(X(n)) = (n/(n+2))θ² − ((n/(n+1))θ)² → 0.
Notice that

E(X(n) − θ)² = Var(X(n)) + (EX(n) − θ)².

The Markov inequality gives X(n) − θ →P 0, and the continuous mapping theorem for convergence in probability gives X(n) →P θ.
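Facts (a) and (b) combine, through the bias–variance decomposition of the mean squared error, into the Markov-inequality bound used above: P(|X(n) − θ| ≥ ε) ≤ E(X(n) − θ)²/ε². A small numeric sketch (illustrative; θ = 2 and ε = 0.05 are arbitrary choices) shows the bound vanishing as n grows.

```python
def mean_max(theta, n):
    # E X(n) = n * theta / (n + 1)
    return n * theta / (n + 1)

def var_max(theta, n):
    # Var X(n) = n*theta^2/(n+2) - (n*theta/(n+1))^2
    return n * theta**2 / (n + 2) - (n * theta / (n + 1)) ** 2

def mse_max(theta, n):
    # E(X(n) - theta)^2 = Var(X(n)) + (E X(n) - theta)^2
    bias = mean_max(theta, n) - theta
    return var_max(theta, n) + bias**2

theta, eps = 2.0, 0.05
for n in (10, 100, 10000):
    # Markov bound: P(|X(n) - theta| >= eps) <= MSE / eps^2 -> 0.
    print(n, mse_max(theta, n), mse_max(theta, n) / eps**2)
```

Algebra gives the closed form E(X(n) − θ)² = 2θ²/((n+1)(n+2)), so the bound is O(1/n²).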


Index

F distribution, 145
p-value, 180
t distribution, 144
absolutely continuous random variable, 83
alternative hypothesis, 173
approximate confidence interval, 164
asymptotic unbiased estimator, 157
Axioms of probability, 35
Bayes theorem, 51
Bayesian probability, 35
Bernoulli random variable, 60
Bernoulli trial, 59
beta distribution, 102
beta function, 102
Binomial random variable, 61
binomial theorem, 31
bivariate normal distribution, 125
Bonferroni's inequality, 53
Boole's inequality, 42
Brownian motion as a natural phenomena, 7
Brownian motion as stochastic process, 8
Central limit theorem, 20
central limit theorem, 18, 134
central moment, 77
chi-squared distribution, 89
Classical definition of probability, 33
classical model, 18
complement, 28
complete data, 18
compound event, 24
conditional distribution function, 107
conditional expectation, 108
conditional mean, 109
conditional probability, 43
conditional probability density function, 107
conditional variance, 109
confidence interval, 160, 161
consistent estimator, 157
converges in distribution, 132
converges in probability, 132
converges with probability one, 131
correlation analysis, 116
countable set, 29
critical region, 176
critical value, 176
de Morgan's law, 41
delta method, 136
densely ordered set, 29
descriptive statistics, 137
deterministic experiment, 22
difference, 28
discrete random variable, 56
distribution function, 18, 81
element, 26
elementary event, 24
empirical distribution function, 18
event, 24
expectation, 69
exponential distribution, 88
family of events, 48
finite set, 28
Frequency probability, 33
gamma distribution, 88
gamma function, 87
gamma integration, 89
geometric Brownian motion, 11
geometric random variable, 61
hypergeometric random variable, 65
independent, 45
independent increment, 10
independent trial, 10
indicator function, 18
inferential statistics, 137
infinite integral, 84
infinite set, 28
intersection, 27
interval estimation, 160
joint distribution function, 103
joint moment generating function, 118
joint probability density function, 103
Kaplan-Meier estimator, 19
Law of large numbers, 19
law of large numbers, 18, 132
Law of the iterated logarithm, 20
law of the iterated logarithm, 18
least square estimator, 172
level of significance, 178
likelihood function, 154
marginal probability density function, 105
medical data, 17
method of moment estimator, 152
moment, 76
moment generating function, 9, 77, 89
multiplication theorem, 43
normal distribution, 9
null event, 24
null hypothesis, 174
order statistic, 140
order statistics, 19
outcome, 22
pairwise independent, 46
parameter, 139
parameter space, 139
partition, 48
point estimate, 150
point estimation, 150
point estimator, 150
Poisson random variable, 62
population success ratio, 151
probability, 21, 36
probability density function, 9, 84
probability mass function, 57
probability theory, 21
proper subset, 27
quantile function, 100
random experiment, 22
random interval, 161
random sample, 18, 139
random variable, 9, 55
realization, 55
regression analysis, 116
regression curve, 116
regression line, 116
rejection region, 176
Riemann integrable, 84
Riemann integral, 84
Riemann sum, 84
sample median, 151
sample space, 23
sample success ratio, 151
set, 26
set build form, 27
set theory, 26
sets are equal, 27
Slutsky theorem, 134
standard bivariate normal distribution, 104
standard Brownian motion, 11
standard deviation, 73
standard normal distribution, 9
state arithmetic, 21
stationary increment, 10
statistic, 21
strong law of large numbers, 132
subjective probability, 35
subset, 27
symmetric difference, 53
tabular form, 27
test statistic, 177
theorem of total probability, 49
total event, 24
unbiased estimator, 157
uncountable set, 29
union, 27
variance, 73
weak law of large numbers, 132, 157
