Statistics
Jongsig Bae
Preface
Statistics is a mathematical science pertaining to the collection, analysis,
interpretation or explanation, and presentation of data. It also provides
tools for prediction and forecasting based on data. It is applicable to a wide
variety of academic disciplines, from the natural and social sciences to the
humanities, government and business.
Statistical methods can be used to summarize or describe a collection of
data; this is called descriptive statistics. In addition, patterns in the data
may be modeled in a way that accounts for randomness and uncertainty in
the observations, and are then used to draw inferences about the process
or population being studied; this is called inferential statistics. Descriptive,
predictive, and inferential statistics comprise applied statistics.
There is also a discipline called mathematical statistics, which is concerned with the theoretical basis of the subject. Moreover, there is a branch
of statistics called exact statistics that is based on exact probability statements.
The word statistics can either be singular or plural. In its singular form,
statistics refers to the mathematical science discussed in this article. In its
plural form, statistics is the plural of the word statistic, which refers to a
quantity (such as a mean) calculated from a set of data.
February, 2016
Contents
1 Panorama in Statistics: Examples
    1.1 Brownian motion
        1.1.1 Brownian motion as a natural phenomenon
        1.1.2 Brownian motion as a stochastic process
        1.1.3 Exercises
    1.2 Stock pricing problem
        1.2.1 A geometric Brownian motion
        1.2.2 Black-Scholes option pricing formula
    1.3 Medical data and limiting theorems
        1.3.1 Complete data and the classical model
        1.3.2 Incomplete data and the Kaplan-Meier model

2 Basic thinking in Probability
Chapter 1

Panorama in Statistics: Examples

1.1 Brownian motion

1.1.1 Brownian motion as a natural phenomenon
Internet 1 Brownian motion, named after the Scottish botanist Robert Brown, is the random movement of microscopic particles suspended in a liquid or gas, caused by collisions with molecules of the surrounding medium. Also called Brownian movement.

More specifically, we call this Brownian motion as a natural phenomenon. We now introduce Brownian motion as a stochastic process. Before doing so, we first introduce the concept of the normal distribution.
1.1.2 Brownian motion as a stochastic process

Recall the Gaussian integral

    \int_{-\infty}^{\infty} e^{-z^2} dz = \sqrt{\pi},

so that

    \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-z^2/2} dz = 1.

Now, if we let

    \phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}, \quad z \in IR,

then \phi \ge 0 and \int_{-\infty}^{\infty} \phi(z) dz = 1.

In general, if a function f satisfies

1. f(x) \ge 0,

2. \int_{-\infty}^{\infty} f(x) dx = 1,

3. P(a \le X \le b) = \int_a^b f(x) dx,

then the variable X is called a random variable having the probability density function f.
We now have the following.

Definition 1 If the random variable Z has the probability density function

    \phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}, \quad z \in IR,

then Z is said to have the standard normal distribution, written Z \sim N(0, 1).

More generally,

    \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2/(2\sigma^2)} dx = 1.

Definition 2 If the random variable X has the probability density function

    f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2/(2\sigma^2)}, \quad x \in IR,

then X is said to have the normal distribution with mean \mu and variance \sigma^2, written X \sim N(\mu, \sigma^2).
Let X \sim N(\mu, \sigma^2). Then the density f attains its maximum value \frac{1}{\sqrt{2\pi}\,\sigma} at x = \mu, and the moment generating function of X is

    M(t) = E[e^{tX}] = e^{\mu t + \sigma^2 t^2/2}, \quad -\infty < t < \infty.
A collection of random variables is called a stochastic process. We first look at some examples of discrete time stochastic processes {X(t) : t \in T}.
Example 2 (Discrete time stochastic process)

1. If T = IN, then \{X_n : n \in IN\} = \{X_1, X_2, \ldots\} is a discrete time stochastic process.

2. If T = ZZ, then \{X_n : n \in ZZ\} = \{\ldots, X_{-1}, X_0, X_1, \ldots\} is a discrete time stochastic process.

If the random variables in a discrete time stochastic process are independent, then we call the stochastic process an independent trial.
We secondly look at examples of continuous time stochastic processes \{X(t) : t \in T\}.

Example 3 (Continuous time stochastic process)

1. If T = [0, 1], then \{X(t) : 0 \le t \le 1\} is a continuous time stochastic process.

2. If T = [0, \infty), then \{X(t) : 0 \le t < \infty\} is a continuous time stochastic process.

3. If T = IR, then \{X(t) : t \in IR\} is a continuous time stochastic process.
Consider a continuous time stochastic process \{X(t) : 0 \le t < \infty\}. If the random variables

    X(t_1) - X(t_0), \; X(t_2) - X(t_1), \; \ldots, \; X(t_k) - X(t_{k-1})

are independent for all t_0 < t_1 < \cdots < t_k, then \{X(t) : 0 \le t < \infty\} is said to have independent increments. If the random variables

    X(t + s) - X(t), \quad X(s) - X(0)

have the same distribution for all s and t, then \{X(t) : 0 \le t < \infty\} is said to have stationary increments.
We now define a Brownian motion as a probabilistic model.

Definition 3 A continuous time stochastic process \{X(t) : t \ge 0\} is called a Brownian motion if the following properties are satisfied:

1. X(0) = 0,

2. \{X(t) : t \ge 0\} has stationary and independent increments, and

3. for fixed s > 0, the random variable X(s) has a normal distribution with mean \mu s and variance \sigma^2 s.

We denote this by \{X(t) : t \ge 0\} \sim BM(\mu, \sigma^2).

Remark 1 Brownian motion is often called a Wiener process.
Theorem 3 (Existence of a Brownian motion) There exists a Brownian motion as a continuous time stochastic process.

Project 1 Think of modeling a heart beat.

Definition 4 Let \{X(t) : t \ge 0\} \sim BM(\mu, \sigma^2) be a Brownian motion. If \mu = 0 and \sigma^2 = 1, then it is called a standard Brownian motion and denoted by \{B(t) : t \ge 0\}. In this case, we have

    X(t) = \sigma B(t) + \mu t.

We now define a geometric Brownian motion.

Definition 5 Let \{Y(t) : t \ge 0\} \sim BM(\mu, \sigma^2) be a Brownian motion. Then the stochastic process \{X(t) : t \ge 0\} defined by

    X(t) = e^{Y(t)}

is called a geometric Brownian motion.
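The defining properties can be illustrated with a short simulation. This is a minimal sketch, not part of the text: the grid size, parameter values, and function names are our own, and each increment is drawn as an independent N(\mu\,dt, \sigma^2\,dt) variable, matching the stationary-independent-increment definition.

```python
import math
import random

def brownian_path(n_steps, dt, mu=0.0, sigma=1.0, seed=0):
    """Simulate X(t) = sigma*B(t) + mu*t on a grid of n_steps increments.

    Each increment X(t+dt) - X(t) is an independent N(mu*dt, sigma^2*dt) draw.
    """
    rng = random.Random(seed)
    x, path = 0.0, [0.0]          # X(0) = 0 by Definition 3
    for _ in range(n_steps):
        x += mu * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

def geometric_brownian_path(n_steps, dt, mu, sigma, seed=0):
    """Geometric Brownian motion X(t) = exp(Y(t)) with Y ~ BM(mu, sigma^2)."""
    return [math.exp(y) for y in brownian_path(n_steps, dt, mu, sigma, seed)]

path = brownian_path(1000, 0.001)
gbm = geometric_brownian_path(1000, 0.001, mu=0.05, sigma=0.2)
print(path[0], gbm[0])  # 0.0 1.0, since X(0) = 0 and exp(0) = 1
```

Note that the geometric process stays strictly positive, which is why it is preferred for stock prices in the next section.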
1.1.3 Exercises

2. Prove that

    \int_{-\infty}^{\infty} e^{-z^2} dz = \sqrt{\pi} \quad \text{and hence} \quad \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-z^2/2} dz = 1.

4. Prove that

    \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2/(2\sigma^2)} dx = 1
    \quad \text{and} \quad
    \int_{-\infty}^{\infty} x \, \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2/(2\sigma^2)} dx = \mu.

5. Prove that

    \int_{-\infty}^{\infty} x^2 \, \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2/(2\sigma^2)} dx = \sigma^2 + \mu^2.

6. Prove that

    \int_{-\infty}^{\infty} e^{tx} \frac{1}{\sqrt{2\pi}} e^{-x^2/2} dx = e^{t^2/2}

for all t.

7. Prove that

    \int_{-\infty}^{\infty} e^{tx} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2/(2\sigma^2)} dx = e^{\mu t + \sigma^2 t^2/2}

for all t.

8. Let c(z) = \frac{1}{1+z^2}. Prove that

    \int_{-\infty}^{\infty} c(z) dz = \pi,

and investigate the integral

    \int_{-\infty}^{\infty} \frac{x}{\pi(1 + x^2)} dx.
1.2 Stock pricing problem

1.2.1 A geometric Brownian motion

A stock price is often modeled by a geometric Brownian motion

    S_t = S_0 \exp\left( \left(\mu - \frac{\sigma^2}{2}\right) t + \sigma W_t \right),

where \{W_t : t \ge 0\} is a standard Brownian motion.
1.2.2 Black-Scholes option pricing formula
In 1973, Fischer Black and Myron Scholes published their groundbreaking paper "The Pricing of Options and Corporate Liabilities." Not only did this specify the first successful option pricing formula, but it also described a general framework for pricing other derivative instruments. That paper launched the field of financial engineering. Black and Scholes had a very hard time getting that paper published. Eventually, it took the intercession of Eugene Fama and Merton Miller to get it accepted by the Journal of Political Economy. In the meantime, Black and Scholes had published in the Journal of Finance a more accessible (1972) paper that cited the as-yet unpublished (1973) option pricing formula in an empirical analysis of current options trading.

The Black-Scholes (1973) option pricing formula prices European put or call options on a stock that does not pay a dividend or make other distributions. The formula assumes the underlying stock price follows a geometric Brownian motion with constant volatility. It is historically significant as the original option pricing formula published by Black and Scholes in their landmark (1973) paper.
In this subsection we briefly give the value of a call price c.

Suppose the present price of a stock is X(0) = x_0, and let X(t) denote its price at time t. Suppose we are interested in the stock over the time interval 0 to T. Assume that the discount factor is \alpha (equivalently, the interest rate is 100\alpha\% compounded continuously), and so the present value of the stock price at time t is e^{-\alpha t} X(t).

We can regard the evolution of the price of the stock over time as our experiment, and the outcome of the experiment is the value of the function X(t), 0 \le t \le T. Suppose the option gives one the right to buy one share of the stock at time t for a price K. Suppose that X(t) is a geometric Brownian motion process. That is,

    X(t) = x_0 e^{Y(t)}    (1.1)

where \{Y(t), t \ge 0\} is a Brownian motion process with drift coefficient \mu and variance parameter \sigma^2. By computing a conditional expectation we have that, for s < t,

    E[X(t) \mid X(u), 0 \le u \le s] = X(s) e^{(t-s)(\mu + \sigma^2/2)}.    (1.2)

It is known that the price of an option to purchase a share of the stock at time t for a fixed price K is given by

    c = E[e^{-\alpha t} (X(t) - K)^+].
By observing that X(t) = x_0 e^{Y(t)}, where Y(t) is normal with mean \mu t and variance \sigma^2 t, we see that

    c e^{\alpha t} = \frac{1}{\sigma\sqrt{2\pi t}} \int_{-\infty}^{\infty} (x_0 e^y - K)^+ \, e^{-(y - \mu t)^2/(2\sigma^2 t)} dy
                   = \frac{1}{\sigma\sqrt{2\pi t}} \int_{\log(K/x_0)}^{\infty} (x_0 e^y - K) \, e^{-(y - \mu t)^2/(2\sigma^2 t)} dy.

Making the change of variable w = (y - \mu t)/(\sigma\sqrt{t}) yields

    c e^{\alpha t} = x_0 e^{\mu t} \frac{1}{\sqrt{2\pi}} \int_a^{\infty} e^{\sigma\sqrt{t}\,w} e^{-w^2/2} dw \;-\; K \frac{1}{\sqrt{2\pi}} \int_a^{\infty} e^{-w^2/2} dw

where

    a = \frac{\log(K/x_0) - \mu t}{\sigma\sqrt{t}}.

Now,

    \frac{1}{\sqrt{2\pi}} \int_a^{\infty} e^{\sigma\sqrt{t}\,w} e^{-w^2/2} dw
        = e^{\sigma^2 t/2} \frac{1}{\sqrt{2\pi}} \int_a^{\infty} e^{-(w - \sigma\sqrt{t})^2/2} dw
        = e^{\sigma^2 t/2} P\{N(\sigma\sqrt{t}, 1) \ge a\}
        = e^{\sigma^2 t/2} P\{N(0, 1) \ge a - \sigma\sqrt{t}\}
        = e^{\sigma^2 t/2} P\{N(0, 1) \le -(a - \sigma\sqrt{t})\}
        = e^{\sigma^2 t/2} \Phi(\sigma\sqrt{t} - a)

where \Phi is the standard normal distribution function. Thus, we see that

    c e^{\alpha t} = x_0 e^{\mu t + \sigma^2 t/2} \Phi(\sigma\sqrt{t} - a) - K \Phi(-a).

Using that

    \mu + \sigma^2/2 = \alpha

and letting b = -a, we can write this as follows:

    c = x_0 \Phi(\sigma\sqrt{t} + b) - K e^{-\alpha t} \Phi(b),

where

    b = \frac{\alpha t - \sigma^2 t/2 - \log(K/x_0)}{\sigma\sqrt{t}}.
Remark 2 The Black-Scholes formula tells us that the price of an option is determined by

1. the initial price x_0,

2. the option exercising time t,

3. the option exercising price K,

4. the interest rate \alpha, and

5. the variance \sigma^2 of the stock.
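The call-price formula above can be sketched in a few lines. This is a minimal illustration in the notation of this subsection; the function names and the test parameters are our own, and \Phi is computed from the error function.

```python
import math

def norm_cdf(x):
    """Standard normal distribution function Phi, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def call_price(x0, K, t, alpha, sigma):
    """Black-Scholes price of a European call, as derived in the text:
    c = x0*Phi(sigma*sqrt(t) + b) - K*exp(-alpha*t)*Phi(b),
    b = (alpha*t - sigma^2*t/2 - log(K/x0)) / (sigma*sqrt(t)).
    """
    b = (alpha * t - 0.5 * sigma**2 * t - math.log(K / x0)) / (sigma * math.sqrt(t))
    return x0 * norm_cdf(sigma * math.sqrt(t) + b) - K * math.exp(-alpha * t) * norm_cdf(b)

# Sanity check: a deep in-the-money call with negligible volatility is
# worth about x0 - K*exp(-alpha*t), and the price always lies below x0.
c = call_price(x0=100.0, K=50.0, t=1.0, alpha=0.05, sigma=0.01)
print(round(c, 2))  # close to 100 - 50*exp(-0.05), i.e. about 52.44
```

Note that \mu and \sigma^2 of the underlying Brownian motion enter only through the combination \alpha = \mu + \sigma^2/2, as Remark 2 indicates.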
1.3 Medical data and limiting theorems

The analogues of the law of large numbers, the central limit theorem, and the law of the iterated logarithm in this setting are the Glivenko-Cantelli theorem (1933), the Donsker theorem (1951), and the Strassen theorem (1964), respectively.
1.3.1 Complete data and the classical model

Let X_1, X_2, \ldots be IID random variables with distribution F. The empirical distribution function F_n is defined by

    F_n(x) = \frac{1}{n} \sum_{i=1}^{n} 1_{(-\infty, x]}(X_i).    (1.3)

Fact 1

1. (Law of large numbers) For fixed x, F_n(x) converges to F(x) strongly. That is,

    P( F_n(x) \to F(x) ) = 1.

2. (Central limit theorem) For fixed x,

    \lim_{n \to \infty} P\left( \frac{\sqrt{n}(F_n(x) - F(x))}{\sqrt{F(x) - F^2(x)}} \le z \right) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} dt.

3. (Law of the iterated logarithm) For fixed x, the set of limit points of the sequence

    \frac{\sqrt{n}(F_n(x) - F(x))}{\sqrt{2 \ln\ln n}}

is bounded.
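The strong convergence of the empirical distribution function can be checked numerically. A minimal sketch, not from the text: the sample size, seed, and function name are our own, and we use a Uniform(0, 1) sample whose true distribution is F(x) = x on [0, 1].

```python
import random

def empirical_cdf(sample, x):
    """F_n(x) = (1/n) * #{ i : X_i <= x }, equation (1.3)."""
    return sum(1 for xi in sample if xi <= x) / len(sample)

rng = random.Random(0)
sample = [rng.random() for _ in range(10000)]  # IID Uniform(0,1), F(x) = x

# By the Glivenko-Cantelli theorem F_n(x) should be close to F(x) = x.
for x in (0.25, 0.5, 0.75):
    print(x, round(empirical_cdf(sample, x), 2))
```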
1.3.2 Incomplete data and the Kaplan-Meier model

We consider the random censorship model where one observes the incomplete data \{(Z_i, \delta_i)\}. The \{Z_i\} are independent copies of Z whose distribution is H. The \{(Z_i, \delta_i)\} are obtained by the equations Z_i = \min(X_i, Y_i) and \delta_i = 1\{X_i \le Y_i\}, where the \{Y_i\} are independent copies of the censoring random variable Y with distribution G, which is also assumed to be independent of F, the distribution of the IID random variables \{X_i\} of original interest in a statistical inference.
This model was introduced by Kaplan and Meier in 1958, and the Kaplan-Meier estimator given below is known as one of the reasonable estimators of the underlying distribution F in the random censorship model.

The Kaplan-Meier estimator F_n is defined as follows:

    F_n(x) = 1 - \prod_{i=1}^{n} \left[ 1 - \frac{1}{n-i+1} \right]^{\delta_{[i;n]} 1_{(-\infty, x]}(Z_{(i)})}    (1.4)

where Z_{(1)}, \ldots, Z_{(n)} are the order statistics of Z_1, \ldots, Z_n and the \delta_{[i;n]} are the concomitants of the Z_{(i)}. That is, if Z_{(i)} = Z_j, then \delta_{[i;n]} = \delta_j.
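Equation (1.4) translates directly into code. A minimal sketch under our own naming, not from the text; for ties we simply take the sorted order, ignoring the tie-breaking conventions a full treatment would need.

```python
def kaplan_meier_cdf(z, delta, x):
    """Kaplan-Meier estimate F_n(x) following equation (1.4).

    z     : observed times Z_i = min(X_i, Y_i)
    delta : indicators delta_i = 1 if X_i <= Y_i (the event was observed)
    """
    n = len(z)
    # Sort pairs so each delta travels with its order statistic (concomitant).
    pairs = sorted(zip(z, delta))
    surv = 1.0
    for i, (zi, di) in enumerate(pairs, start=1):
        if zi <= x and di == 1:          # exponent delta_[i;n] * 1{Z_(i) <= x}
            surv *= 1.0 - 1.0 / (n - i + 1)
    return 1.0 - surv

# With no censoring (all delta = 1) the estimator reduces to the empirical CDF:
z = [2.0, 1.0, 3.0, 4.0]
delta = [1, 1, 1, 1]
print(kaplan_meier_cdf(z, delta, 2.5))  # 2 of 4 observations <= 2.5, so 0.5
```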
We need more notation to describe the limit theorems. Let \Delta F\{a\} = F(a) - F(a-) denote the jump size of F at a, and let A be the set of all atoms of H, which is empty when H is continuous. Let \tau_H = \inf\{x : H(x) = 1\} denote the least upper bound of the support of H. Consider a sub-distribution function \tilde{F} defined by

    \tilde{F}(x) = F(x) 1\{x < \tau_H\} + \left[ F(\tau_H-) + 1\{\tau_H \in A\} \Delta F\{\tau_H\} \right] 1\{x \ge \tau_H\}.    (1.5)
We now introduce the three limit theorems in the medical data of the Kaplan-Meier model.

Fact 2

1. (Law of large numbers) For fixed x, F_n(x) converges to \tilde{F}(x) strongly. That is,

    P( F_n(x) \to \tilde{F}(x) ) = 1.

2. (Central limit theorem) For fixed x,

    \lim_{n \to \infty} P\left( \frac{\sqrt{n}(F_n(x) - \tilde{F}(x))}{\sqrt{\tilde{F}(x) - \tilde{F}^2(x)}} \le z \right) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} dt.

3. (Law of the iterated logarithm) For fixed x, the set of limit points of the sequence

    \frac{\sqrt{n}(F_n(x) - \tilde{F}(x))}{\sqrt{2 \ln\ln n}}

is bounded.
Remark 3 We omit the detailed mathematical conditions.
Chapter 2
Basic thinking in Probability
2.1
2.1.1
Why are you studying probability? We believe that many problems in real life depend on uncertainty, and they can sometimes be answered by probabilistic ideas. Uncertainty is part of our daily experience.

A probability is a number that expresses the degree to which an occurrence is certain or uncertain. Probability is the branch of mathematics that seeks to study uncertainty in a systematic way.

A statistic is a body of data consisting of meaningful numbers; the name originated from "state arithmetic." Statistics is the branch of science that collects data and draws inferences from them.
1. The quantum theory of physics depicts a Universe whose fundamental
structure is random.
2. Modern models of inheritance, too, are based on a probabilistic understanding of how genes are transmitted from parents to children.
Genetic counseling relies on this work.
3. Probability arguments underpin recent modeling of the AIDS epidemic.
4. Have you ever applied for credit?
5. Court cases are increasingly being influenced by forensic evidence based on probabilities.
We begin with listing some experiments.
1. Flip a coin.
2. Record the amount of money needed to buy a gift for your parents.
3. Record the sex of baby born at a local maternity hospital.
4. Interview 100 shoppers in a busy shopping center: record the number
who have heard of a new brand of refrigerator.
5. Test electrical components consecutively as they come off a production line; record the number tested until the first defective component is discovered.

6. Measure the length of a fish.

7. Record a stock price increment.

8. Investigate credits in a statistics class.

9. Operate on a patient to remove a stomach cancer; record the amount of time (months) that goes by until the patient relapses.

10. Record the birth weights (kilograms) of twins.
An outcome is a result of an experiment. An outcome is not always a number. Some outcomes in the above experiments are numbers, others are not. For example, in flipping a coin the outcomes consist of head and tail. The outcomes of recording the birth weights of twins are ordered pairs of numbers.

The experiments in the above examples are called random experiments in the sense that the number of outcomes is at least two.

On the other hand, a deterministic experiment is an experiment whose result is determined in advance, so the number of outcomes in a deterministic experiment is only one. Because we always know the result of a deterministic experiment beforehand, it is no longer an object of study in probability. The following is an example of a deterministic experiment.

Example 4

1. Record the age of a baby born at a local maternity hospital.
2.1.2 Sample space
7. In recording a stock price increment,

    \Omega = (-\infty, \infty)

because the stock price increment can be any real number: positive, zero, or negative.

8. In investigating credits in a statistics class,

    \Omega = \{A+, A, B+, B, C+, C, D+, D, F\}.

9. In recording the amount of time (months) that goes by until the patient relapses after an operation to remove a stomach cancer,

    \Omega = [0, \infty).

10. In recording the birth weights (kilograms) of twins,

    \Omega = [0, \infty) \times [0, \infty).
2.1.3 Event
2. In recording the amount of money needed to buy a gift for your parents, if A is

    A = [0, 50000] \subset \Omega = [0, \infty)

then A is the event that the amount of money is at most 50000.

3. In recording the sex of a baby born at a local maternity hospital, if A is

    A = \{G\} \subset \Omega = \{G, B\}

then A is the event that the baby is a girl.

4. In interviewing 100 shoppers in a busy shopping center and recording the number who have heard of a new brand of refrigerator, if A is

    A = \{1, \ldots, 100\} \subset \Omega = \{0, 1, \ldots, 100\}

then A is the event that at least one of the interviewees has heard of the new brand of refrigerator.

5. In recording the number tested until the first defective component is discovered, if A is

    A = \{100, 101, \ldots\} \subset \Omega = \{0, 1, \ldots\}

then A is the event that the defective component is not discovered before 100.

6. In measuring the length of a fish, if A is

    A = [30, \infty) \subset \Omega = [0, \infty)

then A is the event that the length of the fish is at least 30.

7. In recording a stock price increment, if A is

    A = [-30, 30] \subset \Omega = (-\infty, \infty)

then A is the event that the stock price increment is at most 30 in absolute value.

8. In investigating credits in a statistics class, if A is

    A = \{A+, A, B+, B\} \subset \Omega = \{A+, A, B+, B, C+, C, D+, D, F\}

then A is the event that the credit is at least B.

9. In recording the amount of time (months) that goes by until the patient relapses after an operation to remove a stomach cancer, if A is

    A = [120, \infty) \subset \Omega = [0, \infty)

then A is the event that the patient does not relapse before 10 years.

10. In recording the birth weights (kilograms) of twins, if A is

    A = [4, \infty) \times [4, \infty) \subset \Omega = [0, \infty) \times [0, \infty)

then A is the event that the weights of both twins are at least 4.
2.1.4
There are two ways of describing, or specifying the members of, a set. One way is by intensional definition, using a rule or semantic description. In this case the set is denoted

    \{x \in S : p(x)\}.

This is called set-builder form.
The second way is by extension, that is, listing each member of the set. An extensional definition is notated by enclosing the list of members in braces:

    \{x_1, x_2, \ldots\}

This is called tabular form. The order in which the elements of a set are listed in an extensional definition is irrelevant, as are any repetitions in the list. One often has the choice of specifying a set intensionally or extensionally.
If every member of set A is also a member of set B, then A is said to be a subset of B, written A \subseteq B (also pronounced "A is contained in B"). Equivalently, we can write B \supseteq A, read as "B is a superset of A," "B includes A," or "B contains A." The relationship between sets established by \subseteq is called inclusion or containment.

If A is a subset of, but not equal to, B, then A is called a proper subset of B, written A \subset B ("A is a proper subset of B") or B \supset A ("B is a proper superset of A").
The empty set is a subset of every set, and every set is a subset of itself. The statement that sets A and B are equal means that they have precisely the same members (i.e., every member of A is also a member of B and vice versa). An obvious but very handy identity, which can often be used to show that two seemingly different sets are equal, is this: A = B if and only if A \subseteq B and B \subseteq A.
The set

    A \cup B = \{x : x \in A \text{ or } x \in B\}

is called the union of A and B. The set

    A \cap B = \{x : x \in A \text{ and } x \in B\}

is called the intersection of A and B. The set

    A \setminus B = \{x : x \in A \text{ and } x \notin B\}

is called the difference of A and B.
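The operations just defined map directly onto Python's built-in sets; a minimal illustration with sets of our own choosing:

```python
# Basic set operations from this subsection, on small finite sets.
U = set(range(1, 11))          # a universal set, in the restricted sense below
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

print(A | B)   # union A ∪ B
print(A & B)   # intersection A ∩ B
print(A - B)   # difference A \ B
print(U - A)   # complement A^C relative to U

# The identity A \ B = A ∩ B^C discussed in the text:
assert A - B == A & (U - B)
```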
What is a universal set? Does there exist the universal set U of all sets? If there existed such a U, then the set

    R = \{A \in U : A \notin A\}

would be well-defined. The problem arises when it is considered whether R is an element of itself. If R is an element of R, then according to the definition R is not an element of R. If R is not an element of R, then R has to be an element of R, again by its very definition. The statements "R is an element of R" and "R is not an element of R" cannot both be true; thus the contradiction. This is known as Russell's paradox. We conclude that there does not exist the set of all sets. When we say that U is a universal set, it is the set of all elements in a restricted sense. So, for example, the sample space \Omega is the universal set of all possible outcomes in an experiment. The set IR is the universal set of all real numbers.
Now let U be a given universal set. The set

    A^C = U \setminus A

is called the complement of A. It is clear that

    A \setminus B = A \cap B^C.

The cardinality of a set is the number of members of the set. For example,

1. The cardinality of the empty set is zero.

2. The cardinality of \{x_1, x_2, \ldots, x_n\} is n, which is a finite number.
Some sets have infinite cardinality. The set IN of natural numbers, for instance, is infinite. Some infinite cardinalities are greater than others. For instance, the set of real numbers has greater cardinality than the set of natural numbers. However, it can be shown that the cardinality of (which is to say, the number of points on) a straight line is the same as the cardinality of any segment of that line, of an entire plane, and indeed of any Euclidean space.

If a set A has a proper subset that is in one-to-one correspondence with A itself, then A is called an infinite set. If a set A is not an infinite set, then it is a finite set. For example, the set of natural numbers

    IN = \{1, 2, 3, \ldots\}

has the proper subset of even numbers

    IN_E = \{2, 4, 6, \ldots\}

in one-to-one correspondence with it. So IN is an infinite set. On the other hand, the set

    A = \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}

cannot have any proper subset that is in one-to-one correspondence with A. So A cannot be an infinite set; it is a finite set.
If an infinite set A is in one-to-one correspondence with

    IN = \{1, 2, 3, 4, \ldots\}

then we say that A is a countable set. By definition a countable set can be represented as

    \{x_1, x_2, x_3, \ldots\}.

By definition, IN is a countable set. It is also known that the set ZZ of integers and the set Q of rational numbers are countable sets.

If an infinite set is not countable, then it is called an uncountable set. Consider

    [0, 1] = \{x \in IR : 0 \le x \le 1\}

where IR is the set of real numbers. It is known that the set [0, 1] is an uncountable set; this was proved by Cantor. Therefore the set IR of real numbers and the set

    Q^C = IR \setminus Q

of irrational numbers are uncountable. We observe that

    IN \subset ZZ \subset Q \subset IR.

Because Q is a countable set, any infinite subset of Q is a countable set. The rationals are a densely ordered set: between any two rationals there sits another one, in fact infinitely many other ones.
A new set can be constructed by associating every element of one set with every element of another set. The Cartesian product of two sets A and B, denoted by A \times B, is the set of all ordered pairs (a, b) such that a is a member of A and b is a member of B.

Example 7

1. \{1, 2\} \times \{a, b\} = \{(1, a), (1, b), (2, a), (2, b)\}.

2. \{1, 2, g\} \times \{r, w, g\} = \{(1, r), (1, w), (1, g), (2, r), (2, w), (2, g), (g, r), (g, w), (g, g)\}.

3. \{1, 2\} \times \{1, 2\} = \{(1, 1), (1, 2), (2, 1), (2, 2)\}.

4. A \times B \times C = (A \times B) \times C.

5. IR^2 := IR \times IR = \{(x, y) : x, y \in IR\}.

6. IR^3 := IR \times IR \times IR = \{(x, y, z) : x, y, z \in IR\}.

In general, IR^n is called the Euclidean space.
In some random experiments the outcome is a sequence of numbers obtained by repeating a basic experiment countably many times. In this case, outcomes are sequences of numbers. For example, if the sample space of the basic experiment is \{0, 1\} and we repeat this experiment countably many times, then we could take the new sample space \Omega = \{0, 1\}^{IN}, the space of entire sequences \omega = (\omega_1, \omega_2, \ldots), where the \omega_i are 0 or 1. It is known that the set of outcomes of independent trials is such an example.

In some random experiments the outcome is the path or trajectory followed by a system over an interval of time. In this case, outcomes are functions. For example, if the system is observed over [0, \infty) and its path is continuous, we could take \Omega = C[0, \infty), the vector space of continuous real-valued functions on [0, \infty). It is known that the sample paths of a Brownian motion are such an example.

We now understand the sample space and events as the universal set and subsets.
2.1.5 Binomial theorem

We define \binom{n}{r}, for r \le n, by

    \binom{n}{r} := \frac{n!}{(n-r)!\,r!}.

Note that

    \binom{n}{r} = \frac{n!}{(n-r)!\,r!} = \frac{n!}{r!\,(n-r)!} = \binom{n}{n-r},

and

    \binom{n}{r} = \binom{n-1}{r} + \binom{n-1}{r-1}, \quad 1 \le r \le n.

The binomial theorem states that

    (x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}.    (2.1)

Here, \binom{n}{r} is called a binomial coefficient.
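The identities above can be checked numerically; a minimal sketch using the standard library's binomial coefficient, with test values of our own choosing:

```python
from math import comb

# Pascal's rule: C(n, r) = C(n-1, r) + C(n-1, r-1)
assert comb(5, 2) == comb(4, 2) + comb(4, 1)  # 10 == 6 + 4

# Symmetry: C(n, r) = C(n, n-r)
assert comb(7, 3) == comb(7, 4)

# Binomial theorem (2.1), checked numerically at x = 2, y = 3, n = 4:
x, y, n = 2, 3, 4
lhs = (x + y) ** n
rhs = sum(comb(n, k) * x**k * y**(n - k) for k in range(n + 1))
print(lhs, rhs)  # 625 625
```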
2.1.6 Exercises

1. Prove that

    \sum_{x=0}^{n} \binom{n}{x} p^x (1-p)^{n-x} = 1.

2. Prove that

    \sum_{x=0}^{n} \binom{n}{x}^2 = \binom{2n}{n}.
2.2 Definitions of Probability

In the previous sections, we defined the sample space as the set of all possible outcomes in a stochastic experiment, and introduced the event as a subset of the sample space. In this section we give a formal definition of the probability of an event within a sample space.
2.2.1 Classical probability
The probability of an event is the ratio of the number of cases favorable to it, to the number of all cases possible, when nothing leads us to expect that any one of these cases should occur more than any other, which renders them, for us, equally possible. This definition is essentially a consequence of the principle of indifference. If elementary events are assigned equal probabilities, then the probability of a disjunction of elementary events is just the number of events in the disjunction divided by the total number of elementary events.

The classical definition of probability was called into question by several writers of the nineteenth century, including John Venn and George Boole. The frequentist definition of probability became widely accepted as a result of their criticism, and especially through the works of R.A. Fisher. The classical definition enjoyed a revival of sorts due to the general interest in Bayesian probability.
Example 8 Toss a die, and record the face that lands uppermost. The sample space is \Omega = \{1, 2, 3, 4, 5, 6\}. The die is described as fair if these six outcomes are equally likely. Consider the event A of even numbers, that is, A = \{2, 4, 6\}. In this case, the classical probability of A is given by

    P(A) = \frac{\text{number of outcomes in } A}{\text{number of outcomes in } \Omega} = \frac{3}{6} = 0.5.

2.2.2 Frequency probability
The frequentist school is often associated with the names of Jerzy Neyman and Egon Pearson, who described the logic of statistical hypothesis testing. Other influential figures of the frequentist school include John Venn, R.A. Fisher, and Richard von Mises.

Frequentists talk about probabilities only when dealing with well-defined random experiments. The set of all possible outcomes of a random experiment is called the sample space of the experiment. An event is defined as a particular subset of the sample space that you want to consider. For any event only one of two possibilities can happen: it occurs or it does not occur. The relative frequency of occurrence of an event, in a number of repetitions of the experiment, is a measure of the probability of that event.

A further and more controversial claim is that in the long run, as the number of trials approaches infinity, the relative frequency will converge exactly to the probability. One objection to this is that we can only ever observe a finite sequence, and thus the extrapolation to the infinite involves unwarranted metaphysical assumptions. This conflicts with the standard claim that the frequency interpretation is somehow more objective than other theories of probability.
Example 9 Toss a coin over and over again, and record the face that lands uppermost.

In the first trial sequence, the basic experiment is repeated 100 times. The event \{H\} occurs on 48 out of the 100 replicates. The relative frequency of this event is just the proportion of replicates on which \{H\} occurs. That is,

    RF_{100}(\{H\}) = \frac{48}{100} = .48

The notation RF_{100}(\{H\}) is defined as the relative frequency of the event \{H\} in the 100 replicates.

In the second, the basic experiment is repeated 1000 times. The event \{H\} occurs on 495 out of the 1000 replicates. The relative frequency of this event is

    RF_{1000}(\{H\}) = \frac{495}{1000} = .495

In the third, the basic experiment is repeated 10000 times. The event \{H\} occurs on 4995 out of the 10000 replicates. The relative frequency of this event is

    RF_{10000}(\{H\}) = \frac{4995}{10000} = .4995

In the above example, we see intuitively that the limit of RF_n(\{H\}) is 0.5, the classical probability of \{H\}.
Now we are in a position to give a formal definition of the probability of an event. The probability of the event A, written P(A), is defined to be the long-run relative frequency of A. That is,

    P(A) = \lim_{n \to \infty} RF_n(A)

where RF_n(A) is defined as the relative frequency of the event A in the n replicates; the meaning of the limit will be discussed in a later section.
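Example 9 is easy to reproduce with simulated coin tosses. A minimal sketch, not from the text: the seed and sample sizes are our own, and the pseudo-random draws only mimic independent fair tosses.

```python
import random

def relative_frequency(n, seed=0):
    """RF_n({H}): proportion of heads in n simulated fair-coin tosses."""
    rng = random.Random(seed)
    heads = sum(1 for _ in range(n) if rng.random() < 0.5)
    return heads / n

# As n grows, RF_n({H}) settles near the classical probability 0.5.
for n in (100, 1000, 10000, 100000):
    print(n, relative_frequency(n))
```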
2.2.3 Bayesian probability

2.2.4 Axioms of probability

First axiom

This is the assumption of non-negativity: the probability of any event is a non-negative real number,

    P(A) \ge 0.

Second axiom

This is the assumption of unit measure: the probability that some elementary event in the entire sample space will occur is 1. More specifically, there are no elementary events outside the sample space:

    P(\Omega) = 1.

This is often overlooked in some mistaken probability calculations; if you cannot precisely define the whole sample space, then the probability of any subset cannot be defined either.
Third axiom

This is the assumption of \sigma-additivity: any countable sequence of pairwise disjoint events A_1, A_2, \ldots satisfies

    P\left( \bigcup_{n=1}^{\infty} A_n \right) = \sum_{n=1}^{\infty} P(A_n).
Solution. Let n(A) denote the number of times that the event A occurs in n replicates. Then:

1. Given A \subset \Omega, we have RF_n(A) = n(A)/n, so we have

    0 \le P(A) \le 1.

2. Since RF_n(\Omega) = n/n = 1, we get

    P(\Omega) = 1.

3. Given pairwise disjoint events A_1, A_2, \ldots, we let n_i denote the frequency of the event A_i, where

    n_1 + n_2 + \cdots = n.

Then the frequency of the event

    A_1 \cup A_2 \cup \cdots

is exactly n_1 + n_2 + \cdots. Hence

    RF_n(A_1 \cup A_2 \cup \cdots) = \frac{n_1 + n_2 + \cdots}{n} = \frac{n_1}{n} + \frac{n_2}{n} + \cdots = RF_n(A_1) + RF_n(A_2) + \cdots,

and, letting n \to \infty,

    P\left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} P(A_i).

From the Kolmogorov axioms one can deduce some useful rules for calculating probabilities.
Theorem 5

    P(A^C) = 1 - P(A).

Proof. Notice that

    A \cup A^C = \Omega

and, from the second axiom, we get

    P(A \cup A^C) = P(\Omega) = 1.

Since A and A^C are disjoint, the third axiom gives

    P(A \cup A^C) = P(A) + P(A^C).

Hence

    P(A) + P(A^C) = 1,

that is,

    P(A^C) = 1 - P(A).
Next, given events A and B, define the pairwise disjoint events

    C = A \cap B, \quad D = A \cap B^C, \quad E = A^C \cap B.

The third axiom gives

    P(A \cup B) = P(C) + P(D) + P(E)

and

    P(A) = P(C) + P(D), \quad P(B) = P(C) + P(E).

Hence

    P(A) + P(B) = [P(C) + P(D)] + [P(C) + P(E)]
                = [P(C) + P(D) + P(E)] + P(C)
                = P(A \cup B) + P(A \cap B).

Hence

    P(A \cup B) = P(A) + P(B) - P(A \cap B).

If A and B are disjoint, then P(A \cap B) = P(\emptyset) = 0. Therefore we get

    P(A \cup B) = P(A) + P(B).
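Both rules can be verified on a small finite sample space; a minimal sketch with events of our own choosing, using exact rational arithmetic:

```python
from fractions import Fraction

# Check P(A ∪ B) = P(A) + P(B) - P(A ∩ B) on the fair-die sample space,
# where each outcome has probability 1/6.
omega = {1, 2, 3, 4, 5, 6}
prob = lambda E: Fraction(len(E), len(omega))

A = {2, 4, 6}   # even numbers
B = {4, 5, 6}   # at least 4
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)

# Complement rule P(A^C) = 1 - P(A) from Theorem 5:
assert prob(omega - A) == 1 - prob(A)
print(prob(A | B))  # 4/6 = 2/3
```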
Moreover, one can deduce the following consequences of the axioms:

1. (Countable subadditivity)

    P\left( \bigcup_{n=1}^{\infty} A_n \right) \le \sum_{n=1}^{\infty} P(A_n).

2. (Continuity from below) If A_1 \subset A_2 \subset \cdots, then

    P\left( \bigcup_{n=1}^{\infty} A_n \right) = \lim_{n \to \infty} P(A_n).

3. (Continuity from above) If A_1 \supset A_2 \supset \cdots, then

    P\left( \bigcap_{n=1}^{\infty} A_n \right) = \lim_{n \to \infty} P(A_n).
2.2.5 Exercises

1. (De Morgan's laws) Prove that

    \left( \bigcup_{i=1}^{\infty} A_i \right)^C = \bigcap_{i=1}^{\infty} A_i^C, \qquad \left( \bigcap_{i=1}^{\infty} A_i \right)^C = \bigcup_{i=1}^{\infty} A_i^C.

2. Prove that

    P\left( \bigcup_{i=1}^{\infty} A_i \right) \le \sum_{i=1}^{\infty} P(A_i).

3. Given events A_1, A_2, \ldots, construct pairwise disjoint events B_1, B_2, \ldots such that

    \bigcup_{k=1}^{n} B_k = \bigcup_{k=1}^{n} A_k \text{ for each } n, \qquad \bigcup_{n=1}^{\infty} B_n = \bigcup_{n=1}^{\infty} A_n.

5. Prove that

    P\left( \bigcap_{i=1}^{\infty} A_i \right) \ge 1 - \sum_{i=1}^{\infty} P(A_i^C).
2.3 Conditional probability and independence

In this section, we consider the concepts of conditional probability and independence of events.

2.3.1 Conditional probability

Conditional probability is the probability of some event B, given the occurrence of some other event A. Conditional probability is written P(B|A), and is read "the probability of B, given A."

We give the following.

Definition 7 Given a probability space (\Omega, \mathcal{F}, P) and two events A, B \in \mathcal{F} with P(A) > 0, the conditional probability of B given A is defined by

    P(B|A) = \frac{P(B \cap A)}{P(A)}.    (2.2)
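Definition (2.2) is easy to compute on a finite sample space; a minimal sketch on the two-dice sample space, with events of our own choosing:

```python
from fractions import Fraction

# P(B | A) = P(B ∩ A) / P(A) on the two-dice sample space,
# all 36 outcomes equally likely.
omega = {(i, j) for i in range(1, 7) for j in range(1, 7)}
prob = lambda E: Fraction(len(E), len(omega))

A = {(i, j) for (i, j) in omega if i + j >= 10}   # sum is at least 10
B = {(i, j) for (i, j) in omega if i == 6}        # first die shows 6

p_b_given_a = prob(B & A) / prob(A)
print(p_b_given_a)  # 3 of the 6 outcomes in A have i = 6, so 1/2
```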
For fixed A with P(A) > 0, the conditional probability P(\cdot|A) is itself a probability:

1. For any B, 0 \le P(B|A) \le 1.

2. P(\Omega|A) = 1.

3. For any countable sequence of pairwise disjoint events B_1, B_2, \ldots,

    P\left( \bigcup_{i=1}^{\infty} B_i \,\Big|\, A \right) = \sum_{i=1}^{\infty} P(B_i|A).

Proof.

1. Let B \subset \Omega. Then

    0 \le P(B|A) = \frac{P(B \cap A)}{P(A)} \le 1.

2.

    P(\Omega|A) = \frac{P(\Omega \cap A)}{P(A)} = \frac{P(A)}{P(A)} = 1.

3. Since

    \left( \bigcup_{n=1}^{\infty} B_n \right) \cap A = \bigcup_{n=1}^{\infty} (B_n \cap A)

and the \{B_n \cap A\} are pairwise disjoint, the third axiom gives

    P\left( \bigcup_{n=1}^{\infty} B_n \,\Big|\, A \right) = \frac{P\left( \bigcup_{n=1}^{\infty} (B_n \cap A) \right)}{P(A)} = \frac{\sum_{n=1}^{\infty} P(B_n \cap A)}{P(A)} = \sum_{n=1}^{\infty} P(B_n|A).
2.3.2 Independence of events

Two events A and B are independent if knowledge of A does not change the probability of B; that is, if P(A) > 0,

    P(B|A) = \frac{P(B \cap A)}{P(A)} = P(B),

which is equivalent to

    P(A \cap B) = P(A) P(B).
3. The relation

    P(A \cap B \cap C) = P(A) P(B) P(C)

does not imply the mutual independence of A, B, C. Consider the sample space

    \Omega = \{(i, j) : i, j = 1, 2, 3, 4, 5, 6\}

in tossing two dice. Let

    A = \{ i \text{ is } 1, 2, \text{ or } 3 \},
    B = \{ i \text{ is } 3, 4, \text{ or } 5 \},
    C = \{ i + j \text{ is } 9 \}.

In this case

    A \cap B \cap C = \{(3, 6)\},
    A \cap B = \{(3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6)\},
    B \cap C = \{(3, 6), (4, 5), (5, 4)\},
    C \cap A = \{(3, 6)\},

and

    P(A \cap B \cap C) = 1/36 = P(A) P(B) P(C).

However, A, B, C are not pairwise independent.
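The counterexample above can be verified by exhaustive enumeration; a short check, with only the helper names being our own:

```python
from fractions import Fraction

# Verify the two-dice counterexample: the triple product rule holds
# while A, B, C fail to be pairwise independent.
omega = {(i, j) for i in range(1, 7) for j in range(1, 7)}
prob = lambda E: Fraction(len(E), len(omega))

A = {(i, j) for (i, j) in omega if i in (1, 2, 3)}
B = {(i, j) for (i, j) in omega if i in (3, 4, 5)}
C = {(i, j) for (i, j) in omega if i + j == 9}

assert prob(A & B & C) == prob(A) * prob(B) * prob(C) == Fraction(1, 36)
# But, for instance, P(A ∩ B) = 1/6 while P(A)P(B) = 1/4:
assert prob(A & B) != prob(A) * prob(B)
print(prob(A & B), prob(A) * prob(B))  # 1/6 1/4
```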
In general, we have the following.

Definition 11 The events A_1, A_2, \ldots, A_n are mutually independent if

    P(A_{i_1} \cap \cdots \cap A_{i_m}) = P(A_{i_1}) \cdots P(A_{i_m})

for all 1 \le i_1 < \cdots < i_m \le n. They are pairwise independent if every pair of events is independent.

We now consider the independence of a sequence of events A_1, A_2, A_3, \ldots.

Definition 12 A sequence of events A_1, A_2, A_3, \ldots is mutually independent if every finite sub-collection of events is mutually independent. The events are pairwise independent if every pair of events is independent.
48
Consider a family of events in a sample space Ω.
Example 12
1. {∅, Ω},
2. {∅, A, Aᶜ, Ω}, and
3. {B : B ⊆ Ω}
are meaningful families of events in probability and statistics.
Definition 13 Two families of events A and B are independent if
P(A ∩ B) = P(A)P(B)
for any A ∈ A and any B ∈ B.
Finally we define conditional independence.
Definition 14 Consider A, B, C with P(C) > 0. We say that A and B are conditionally independent given C if
P(A ∩ B|C) = P(A|C)P(B|C).
2.3.3
Bayes theorem
Bayes theorem relates the two conditional probabilities P(A|B) and P(B|A):
P(B|A) = P(A|B) P(B)/P(A).
In other words, one can only assume that P(A|B) is approximately equal to P(B|A) if the prior probabilities P(A) and P(B) are also approximately equal.
We begin by introducing the theorem of total probability.
Definition 15 A family of events {A₁, A₂, . . .} is called a partition of the sample space Ω if
1. ⋃_{n=1}^∞ Aₙ = Ω,
2. Aₙ ∩ Aₘ = ∅ for n ≠ m, and
3. P(Aₙ) > 0 for all n.
Theorem 12 (The theorem of total probability) Let {Aₙ} be a partition of the sample space Ω. Given an event A we have
P(A) = Σ_{n=1}^∞ P(Aₙ)P(A|Aₙ).   (2.3)
Proof. Since
A = A ∩ Ω = A ∩ ⋃_{n=1}^∞ Aₙ = ⋃_{n=1}^∞ (A ∩ Aₙ)
and the events {A ∩ Aₙ} are disjoint, the third axiom and the multiplication theorem give
P(A) = Σ_{n=1}^∞ P(A ∩ Aₙ) = Σ_{n=1}^∞ P(Aₙ)P(A|Aₙ).
We have supposed that the students told the truth when asked about using a legal drug. What if we asked the question about an illegal drug? If asked the question outright, many students might lie to avoid embarrassment. The following example shows the role played by the theorem of total probability.
Example 14 Suppose we plan a sample survey to discover what proportion of students have used an illegal drug. We prepare three question cards. The first card says "Have you ever used an illegal drug?" The second card instructs the student to answer "No" and the third card instructs the student to answer "Yes". We intend to show each respondent all three cards, mix them, allow the respondent to choose one at random, and then to respond appropriately. In this way we hope to persuade virtually everyone to respond truthfully. Suppose r students said "Yes" among the n who answered. How do we find the proportion of students who have used the illegal drug?
Solution. Let p be the probability that a randomly selected student has used the illegal drug. Consider the three events
A₁ = the event of choosing the first card
A₂ = the event of choosing the second card
A₃ = the event of choosing the third card
Then A₁, A₂, A₃ are disjoint, A₁ ∪ A₂ ∪ A₃ = Ω, and
P(A₁) = P(A₂) = P(A₃) = 1/3.
By the theorem of total probability, the probability of a "Yes" answer is
P(Yes) = P(A₁)·p + P(A₂)·0 + P(A₃)·1 = (p + 1)/3.
Estimating P(Yes) by the sample proportion r/n gives
(p + 1)/3 = r/n.
Hence, one can estimate p by
p̂ = 3r/n − 1.
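The three-card survey above is easy to simulate; this is a sketch with a hypothetical true proportion `p_true` (an assumption for the demonstration, not from the text) showing that the estimator p̂ = 3r/n − 1 recovers it.

```python
import random

random.seed(1)

def survey(n, p_true):
    """Simulate the three-card survey: card 1 = truthful answer,
    card 2 = forced "No", card 3 = forced "Yes"."""
    yes = 0
    for _ in range(n):
        card = random.randint(1, 3)
        if card == 1:
            yes += random.random() < p_true  # truthful answer
        elif card == 3:
            yes += 1                         # forced "Yes"
    return yes

n, p_true = 100_000, 0.15    # hypothetical true proportion
r = survey(n, p_true)
p_hat = 3 * r / n - 1        # the estimator derived above
print(round(p_hat, 3))       # close to 0.15
```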
We can now state Bayes theorem: let {Aₙ} be a partition of the sample space Ω and let A be an event with P(A) > 0. Then
P(Aₙ|A) = P(Aₙ)P(A|Aₙ)/P(A) = P(Aₙ)P(A|Aₙ) / Σ_{i=1}^∞ P(Aᵢ)P(A|Aᵢ).
That is,
P(Aₙ|A) ∝ P(Aₙ)P(A|Aₙ),  n = 1, 2, . . . .
Proof. The multiplication theorem of probability gives
P(A ∩ Aₙ) = P(A|Aₙ)P(Aₙ)
and
P(A ∩ Aₙ) = P(Aₙ ∩ A) = P(Aₙ|A)P(A).
Hence
P(Aₙ|A)P(A) = P(A|Aₙ)P(Aₙ).
That is, using the theorem of total probability for P(A),
P(Aₙ|A) = P(Aₙ)P(A|Aₙ)/P(A) = P(Aₙ)P(A|Aₙ) / Σ_{i=1}^∞ P(Aᵢ)P(A|Aᵢ).
Remark 4 Notice that an intuitive way of expressing Bayes theorem is
P(B|A) = P(A|B) P(B)/P(A).
Thinking of B as a cause and A as an observed effect, this reads
P(cause|effect) = P(effect|cause) · P(cause)/P(effect).
In this case, P(cause|effect) is the probability that is not easy to find, yet P(effect|cause) is the probability that is relatively easy to find.
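The cause-and-effect reading of Bayes theorem can be sketched numerically. The priors and likelihoods below are hypothetical illustration values (not from the text): a rare cause with prior 0.01, and an effect that is much more likely under the cause than without it.

```python
def bayes(priors, likelihoods):
    """Posteriors P(A_n | A) from priors P(A_n) and likelihoods P(A | A_n)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)              # theorem of total probability: P(A)
    return [j / total for j in joint]

# Hypothetical numbers: prior P(cause) = 0.01, P(effect|cause) = 0.95,
# P(effect|no cause) = 0.05.
post = bayes([0.01, 0.99], [0.95, 0.05])
print(round(post[0], 3))  # P(cause|effect) ≈ 0.161, far below 0.95
```

Note how unequal priors make P(cause|effect) very different from P(effect|cause), exactly as Remark 4 warns.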
2.3.4
Exercises
1. Prove that
P(A ∪ B) = 1 − (1 − P(A))(1 − P(B))
for any independent events A and B.
2. Prove that
P(⋃_{i=1}^∞ Aᵢ) = 1 − ∏_{i=1}^∞ (1 − P(Aᵢ))
for any sequence of independent events A₁, A₂, . . . .
3. Prove that if A₁, A₂, . . . are independent events with Σ_{n=1}^∞ P(Aₙ) = ∞, then
P(⋃_{n=1}^∞ Aₙ) = 1.
4. Prove that
P(A ∩ B) ≥ P(A) + P(B) − 1
for any events A and B. This is known as Bonferroni's inequality.
5. Define
AΔB = (A \ B) ∪ (B \ A)
for any events A and B. We say that AΔB is the symmetric difference of A and B. Prove that
P(AΔB) = P(A) + P(B) − 2P(A ∩ B).
2.4
Random variables
Definition 16 A random variable X on a probability space (Ω, F, P) is a real valued function defined on the sample space Ω.
Consider a random variable X : Ω → IR. Let ω be an element of Ω. Then the real number X(ω) is called a realization of the random variable X. We usually denote the random variable by a capital letter X, and the realization in lower case x.
Example 18
1. Consider the experiment of recording the number of babies born tomorrow in a hospital. Define the random variable X by the identity function X(ω) = ω. Then the range of X is
RX = {0, 1, 2, . . .}.
2. Consider the experiment of recording the grade received after taking the course Basic Thinking in Probability. Define the random variable Y by the following rule:
Y(A+) = 95;  Y(A) = 90;  Y(B+) = 85;  Y(B) = 80;  Y(C+) = 75;  Y(C) = 70;  Y(D+) = 65;  Y(D) = 60;  Y(F) = 0.
In the above example, the range RX is a countable set and RY is a finite set. In this case the random variables X and Y are called discrete random variables. By contrast, a random variable Z recording, say, a waiting time has range
RZ = [0, ∞),
and in this case the random variable Z is not discrete. We treat such random variables in a later section.
2.4.1
Probability mass function
Definition 18 Consider a discrete random variable X with range RX. The function fX defined by
fX(x) = P(X = x)
for x ∈ IR is called the probability mass function of X. In this case, if x ∈ RXᶜ then fX(x) = 0, and if x ∈ RX then fX(x) > 0.
Proposition 1 Let fX be the probability mass function of the discrete random variable X. Then
1. fX(x) ≥ 0 for all x ∈ IR,
2. RX = {x : fX(x) > 0} is finite or countable, and
3. if RX = {x : fX(x) > 0} = {x₁, x₂, . . .}, then Σ_{i=1}^∞ fX(xᵢ) = 1.
Proof. 1 and 2 are clear from the definition. In order to prove 3 we observe that {X = xᵢ}, i = 1, 2, . . ., is a partition of Ω. Hence we get
Σ_{i=1}^∞ fX(xᵢ) = Σ_{i=1}^∞ P(X = xᵢ) = P(⋃_{i=1}^∞ {X = xᵢ}) = P(Ω) = 1.
Conversely, suppose a function f satisfies
1. f(x) ≥ 0 for all x ∈ IR,
2. R = {x : f(x) > 0} is finite or countable, and
3. if R = {x : f(x) > 0} = {x₁, x₂, . . .}, then Σ_{i=1}^∞ f(xᵢ) = 1.
Then there exists a discrete random variable X whose probability mass function is f.
Example 19 Consider the random variable X given by the table
x : 0, 1, 2
P(X = x) : 1/4, 1/2, 1/4
Then,
fX(0) + fX(1) + fX(2) = 1.
We introduce some discrete random variables.
Bernoulli random variable
In the theory of probability and statistics, a Bernoulli trial is an experiment whose outcome is random and can be either of two possible outcomes, success and failure.
In practice it refers to a single experiment which can have one of two possible outcomes. These events can be phrased as yes-or-no questions:
1. Did the coin land heads?
2. Was the newborn child a girl?
3. Were a person's eyes green?
4. Did a mosquito die after the area was sprayed with insecticide?
5. Did a potential customer decide to buy a product?
6. Did a citizen vote for a specific candidate?
7. Did an employee vote pro-union?
Therefore success and failure are labels for outcomes, and should not be construed literally. Examples of Bernoulli trials include
1. Flipping a coin. In this context, obverse (heads) conventionally denotes success and reverse (tails) denotes failure. A fair coin has probability of success 0.5 by definition.
2. Rolling a die, where a six is success and everything else a failure.
3. In conducting a political opinion poll, choosing a voter at random to ascertain whether that voter will vote "yes" in an upcoming referendum.
In a Bernoulli trial, the sample space can be written as
Ω := {S, F}
where S stands for a success and F stands for a failure. Let p ∈ [0, 1]. Then the probability P is given by
P(S) = p,  P(F) = 1 − p.
Then a random variable X can be defined on this sample space, that is, a function X : Ω → IR. In this case the random variable given by
X(ω) = 1 if ω = S, and X(ω) = 0 if ω = F
is called a Bernoulli random variable with parameter p.
Binomial random variable
We perform n independent Bernoulli trials, each with success probability p. Define a random variable X by the number of successes in the n trials. Then the range of X is
{0, 1, 2, . . . , n}
and the probability mass function is given by
f(x) = P(X = x) = C(n, x) pˣ(1 − p)ⁿ⁻ˣ,  x = 0, 1, . . . , n,
where C(n, x) = n!/(x!(n − x)!). Notice that
Σ_{x=0}^n f(x) = Σ_{x=0}^n C(n, x) pˣ(1 − p)ⁿ⁻ˣ = (p + (1 − p))ⁿ = 1.
We call this random variable X the binomial random variable and denote X ~ B(n, p).
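The binomial probability mass function can be tabulated directly; this minimal sketch checks that it sums to 1 and that its mean equals np (the formula EX = np is proved in Section 2.5).

```python
from math import comb

def binom_pmf(n, p):
    """Probability mass function of B(n, p) as a list over x = 0..n."""
    return [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

f = binom_pmf(10, 0.3)

# The pmf sums to 1 by the binomial theorem: (p + (1-p))^n = 1.
assert abs(sum(f) - 1.0) < 1e-12

# The mean computed from the pmf equals np.
mean = sum(x * fx for x, fx in enumerate(f))
print(mean)  # 3.0 up to rounding
```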
Geometric random variable
We perform Bernoulli trials independently until the first success occurs. Then the sample space is
Ω := {S, FS, FFS, FFFS, . . .}.
Define a random variable X by the number of failures until the first success. Then the range of X is
{0, 1, 2, . . .}
and the probability mass function is given by
f(x) = P(X = x) = p(1 − p)ˣ,  x = 0, 1, 2, . . . .
Notice that
Σ_{x=0}^∞ f(x) = p Σ_{x=0}^∞ (1 − p)ˣ = 1.
We call this random variable X the geometric random variable and denote X ~ G(p).
Poisson random variable
We now introduce the Poisson random variable in a different way. Consider the following function from calculus:
f(x) = e^(−λ) λˣ/x!,  x = 0, 1, . . . ,  λ > 0.
By the Taylor series of e^λ we have Σ_{x=0}^∞ f(x) = 1, so f is a probability mass function. We call a random variable X with this probability mass function the Poisson random variable and denote X ~ P(λ).
That is, if X ~ P(λ), then the probability mass function is, for λ > 0,
f(x) = P(X = x) = e^(−λ) λˣ/x!,  x = 0, 1, . . . ,
and
Σ_{x=0}^∞ f(x) = e^(−λ) Σ_{x=0}^∞ λˣ/x! = e^(−λ) e^λ = 1.
If X ~ B(n, p) with n large, p small, and λ = np moderate, then the distribution of X is approximately P(λ).
Proof. Suppose X ~ B(n, p) and let λ = np. Then
P(X = x) = [n!/((n − x)! x!)] pˣ(1 − p)ⁿ⁻ˣ
= [n!/((n − x)! x!)] (λ/n)ˣ (1 − λ/n)ⁿ⁻ˣ
= [n(n − 1) ⋯ (n − x + 1)/nˣ] · (λˣ/x!) · (1 − λ/n)ⁿ/(1 − λ/n)ˣ.
Since n is large and p is small and λ = np, we get
(1 − λ/n)ⁿ ≈ e^(−λ),  n(n − 1) ⋯ (n − x + 1)/nˣ ≈ 1,  (1 − λ/n)ˣ ≈ 1.
Hence
P(X = x) ≈ e^(−λ) λˣ/x!.
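The Poisson approximation above can be checked numerically; this sketch (with an illustrative choice of n and p) compares the two probability mass functions pointwise for λ = np = 3.

```python
from math import comb, exp, factorial

n, p = 10_000, 0.0003        # n large, p small
lam = n * p                  # lam = 3

binom = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(11)]
poisson = [exp(-lam) * lam**x / factorial(x) for x in range(11)]

# Pointwise agreement of the two pmfs for x = 0, ..., 10.
worst = max(abs(b - q) for b, q in zip(binom, poisson))
assert worst < 1e-3
print(f"{worst:.1e}")
```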
2.4.2
Exercises
1. Consider the experiment of drawing items at random from a lot of 10 items of which 2 are defective and 8 are non-defective.
(a) Suppose that one item is drawn. The probability that it is defective is
C(2, 1)C(8, 0)/C(10, 1) = 2/10,
and the probability that it is non-defective is
C(2, 0)C(8, 1)/C(10, 1) = 8/10.
(b) Suppose that two items are drawn. Then we are interested in the following probabilities.
i. The probability that both of the drawn items are non-defective is given by
C(2, 0)C(8, 2)/C(10, 2) = 28/45.
ii. The probability that one is defective and the other is non-defective is given by
C(2, 1)C(8, 1)/C(10, 2) = 16/45.
iii. The probability that both of the drawn items are defective is given by
C(2, 2)C(8, 0)/C(10, 2) = 1/45.
Notice that
28/45 + 16/45 + 1/45 = 1.
(c) In general, we consider the experiment of drawing n items among N items where M are defective and N − M are non-defective. Then the sample space is
Ω = {0 among M and n among N − M,
1 among M and n − 1 among N − M,
. . . ,
n among M and 0 among N − M},
where each outcome with x defective items requires 0 ≤ x ≤ M and 0 ≤ n − x ≤ N − M. Define X as the number of defective items drawn. Then
f(x) = P(X = x) = C(M, x)C(N − M, n − x)/C(N, n),  x = 0, 1, . . . , min{n, M}.
Prove that
Σ_{x=0}^{min{n,M}} f(x) = Σ_{x} C(M, x)C(N − M, n − x)/C(N, n) = 1.
We call this random variable X the hypergeometric random variable and denote X ~ H(M, N, n).
2. Consider the sequence
aₙ = (1 + 1/n)ⁿ
of rational numbers.
(a) Prove the sequence aₙ is increasing. That is, prove that aₙ ≤ aₙ₊₁ for all n ∈ IN.
(b) Prove that aₙ ≤ 3 for all n ∈ IN.
(c) State the completeness of the real numbers, and prove that the sequence aₙ has a least upper bound and that this least upper bound is the limit.
(d) Write
e := lim_{n→∞} (1 + 1/n)ⁿ.
Then compare e with the number
1 + 1/1! + 1/2! + 1/3! + ⋯.
3. (a) Prove that the series Σ_{n=1}^∞ 1/n² is convergent.
(b) Prove that the series Σ_{n=1}^∞ 1/n is divergent.
(c) Prove that the infinite integral ∫₁^∞ dx/xᵖ converges if and only if p > 1, and use it to determine for which p the series Σ_{n=1}^∞ 1/nᵖ converges.
2.5
Expectation
We note that
1 + 2 = 3
is given by the axioms of the real numbers. We know that the operation + is a binary operation. Then what is the meaning of
1 + 2 + 3 = 6?
The basic meaning of computing it is to first compute
1 + 2 = 3
and then compute
3 + 3 = 6.
That is,
(1 + 2) + 3 = 6.
Now, applying the commutative law and the associative law of real numbers, we see that
(1 + 2) + 3 = 1 + (2 + 3) = (1 + 3) + 2 = 1 + (3 + 2) = (2 + 1) + 3 = 2 + (1 + 3) = (2 + 3) + 1 = 2 + (3 + 1) = 6.
In this case we may write any of the above as
1 + 2 + 3
and the result is 6. Similarly, we can give a meaning to the expression
a₁ + a₂ + a₃ + a₄
and by induction we can define, for n ∈ IN,
a₁ + a₂ + ⋯ + aₙ.
Write
Σ_{i=1}^n aᵢ := a₁ + a₂ + ⋯ + aₙ.
Since the summation index is a dummy variable, each of
Σ_{i=1}^n aᵢ = Σ_{k=1}^n aₖ = Σ_{l=1}^n aₗ = Σ_{x=1}^n aₓ = Σ_{y=1}^n a_y
is equal to
a₁ + a₂ + ⋯ + aₙ.
Given a sequence {aₙ}_{n=1}^∞, we can define the infinite series as
Σ_{n=1}^∞ aₙ := lim_{n→∞} Σ_{i=1}^n aᵢ = a₁ + a₂ + ⋯
in calculus.
Consider the following game of chance. In order to play the game you must pay a dollars. Suppose that the random variable X takes the values x₁, x₂, . . . , xᵣ and that you receive X dollars as the result of the game. Should we play the game? It is difficult to answer the question if we play the game only once. If one plays the game n times, then before the n games you have to pay na dollars, and after n games you receive X₁ + X₂ + ⋯ + Xₙ dollars, where X₁, X₂, . . . , Xₙ are independent and identically distributed random variables with the common probability mass function f. Now, write Nₙ(xᵢ) for the number of games with outcome xᵢ among the n games. Then we have
X₁ + X₂ + ⋯ + Xₙ = Σ_{i=1}^r xᵢ Nₙ(xᵢ).
Dividing by n we get
(X₁ + X₂ + ⋯ + Xₙ)/n = Σ_{i=1}^r xᵢ [Nₙ(xᵢ)/n].
If n is large, then we see that Nₙ(xᵢ)/n is approximately equal to f(xᵢ), and the right hand side is approximately equal to
μ := Σ_{i=1}^r xᵢ f(xᵢ).
If μ > a then you earn money, if μ < a you lose money, and if μ = a you will break even.
We now call the number
μ := Σ_{i=1}^r xᵢ f(xᵢ)
the expectation of X.
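The long-run-average argument above can be sketched by simulation. The payoff values and probabilities below are hypothetical illustration choices; the sample mean over many plays settles near μ.

```python
import random

random.seed(0)

# A hypothetical game: you receive x_i with probability f(x_i).
values = [0, 1, 5]
probs = [0.5, 0.3, 0.2]
mu = sum(x * f for x, f in zip(values, probs))   # expectation: 1.3

# Long-run average payoff over many plays approaches mu.
n = 200_000
draws = random.choices(values, weights=probs, k=n)
print(mu, sum(draws) / n)
```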
Suppose a discrete random variable X has the probability mass function f. Then we can think of the expectation of X as
Σ_j xⱼ f(xⱼ).   (2.4)
If this series converges absolutely, then we define
EX := Σ_j xⱼ f(xⱼ)   (2.5)
and we say that the random variable X has the (finite) expectation.
If X ~ B(n, p), that is, if
f(x) = P(X = x) = C(n, x) pˣ(1 − p)ⁿ⁻ˣ,  x = 0, 1, . . . , n,
where C(n, x) = n!/(x!(n − x)!), then
EX = np.
Proof. Using
x C(n, x) = n C(n − 1, x − 1),
we get
EX := Σ_{x=0}^n x f(x) = Σ_{x=0}^n x C(n, x) pˣ(1 − p)ⁿ⁻ˣ
= np Σ_{x=1}^n C(n − 1, x − 1) p^(x−1)(1 − p)ⁿ⁻ˣ
= np Σ_{y=0}^{n−1} C(n − 1, y) p^y (1 − p)^(n−1−y)
= np (p + (1 − p))^(n−1)
= np.
If X ~ G(p), write q = 1 − p. Then EX = q/p. Indeed,
EX = Σ_{x=0}^∞ x p qˣ = pq Σ_{x=0}^∞ x q^(x−1) = pq Σ_{x=0}^∞ (d/dq) qˣ = pq (d/dq) Σ_{x=0}^∞ qˣ = pq (d/dq)[1/(1 − q)] = pq [1/(1 − q)²] = pq/p² = q/p.
If X ~ P(λ), then EX = λ. Indeed,
EX = Σ_{x=0}^∞ x e^(−λ) λˣ/x! = e^(−λ) Σ_{j=1}^∞ λʲ/(j − 1)! = λ e^(−λ) Σ_{j=0}^∞ λʲ/j! = λ e^(−λ) e^λ = λ.
Next consider the probability mass function
f(x) = 1/(x(x + 1)),  x = 1, 2, . . . .
Now, since
Σ_{x=1}^∞ f(x) = Σ_{x=1}^∞ [1/x − 1/(x + 1)] = (1 − 1/2) + (1/2 − 1/3) + ⋯ = 1,
we see that f is indeed a probability mass function. On the other hand, since
Σ_{x=1}^∞ |x| f(x) = Σ_{x=1}^∞ 1/(1 + x) = ∞,
a random variable X with this probability mass function does not have the expectation.
2.5.1
Properties of expectation
For a function φ, the expectation of φ(X) is computed as E[φ(X)] = Σ_x φ(x) f(x), provided the series converges absolutely.
Theorem 16 Suppose that the random variables X and Y have the expectation.
1. If c is a constant and P(X = c) = 1, then EX = c.
2. If c is a constant, then cX has the expectation and E(cX) = cEX.
3. The random variable X + Y has the expectation and E(X + Y) = EX + EY.
4. If P(X ≤ Y) = 1 then EX ≤ EY; in this case, EX = EY if and only if P(X = Y) = 1.
5. |EX| ≤ E|X|.
Theorem 17 If P(|X| ≤ M) = 1 for a constant M, then X has the expectation and |EX| ≤ M.
Definition 21 Random variables X₁, . . . , Xₙ are independent if
P(X₁ ∈ A₁, . . . , Xₙ ∈ Aₙ) = P(X₁ ∈ A₁) ⋯ P(Xₙ ∈ Aₙ)
for all subsets A₁, . . . , Aₙ of IR.
Theorem 18 Suppose that X and Y are independent and have the expectation. Then XY has the expectation and
E(XY) = EX · EY.
Example 21 Suppose that X is a symmetric random variable and let Y = X². Then
E(XY) = EX · EY.
However, X and Y are not independent.
Theorem 19 Suppose that the discrete random variable X takes nonnegative integer values. Then X has the expectation if and only if the series Σ_{x=1}^∞ P(X ≥ x) converges. In this case,
EX = Σ_{x=1}^∞ P(X ≥ x).
2.5.2
Variance
Definition 22 Let μ = EX. The variance of X is defined as the series
Σ_x (x − μ)² f(x)
and denoted by V(X). If the series Σ_x (x − μ)² f(x) converges then the variance V(X) is defined and is a nonnegative number. In this case, √V(X) is called the standard deviation and is denoted sd(X) or σ(X).
We have the following
Theorem 20
V(X) = EX² − [EX]².
Proof. Let μ = EX. Then
V(X) = E(X − μ)² = Σ_x (x − μ)² f(x) = Σ_x (x² − 2μx + μ²) f(x)
= Σ_x x² f(x) − 2μ Σ_x x f(x) + μ² Σ_x f(x)
= EX² − 2EX · EX + [EX]²
= EX² − [EX]².
If X ~ B(n, p), that is,
f(x) = P(X = x) = C(n, x) pˣ(1 − p)ⁿ⁻ˣ,  x = 0, 1, . . . , n,
then V(X) = np(1 − p).
Proof. Using
x(x − 1) C(n, x) = n(n − 1) C(n − 2, x − 2),
we get
EX(X − 1) := Σ_{x=0}^n x(x − 1) C(n, x) pˣ(1 − p)ⁿ⁻ˣ
= n(n − 1)p² Σ_{x=2}^n C(n − 2, x − 2) p^(x−2)(1 − p)ⁿ⁻ˣ
= n(n − 1)p² Σ_{y=0}^{n−2} C(n − 2, y) p^y (1 − p)^(n−2−y)
= n(n − 1)p².
Hence
V(X) = EX(X − 1) + EX − [EX]² = n(n − 1)p² + np − n²p² = np(1 − p).
If X ~ G(p), then with q = 1 − p,
EX(X − 1) = 2q²/p²  and  V(X) = q/p².
Indeed,
EX(X − 1) = Σ_{x=0}^∞ x(x − 1) p qˣ = pq² Σ_{x=0}^∞ x(x − 1) q^(x−2) = pq² Σ_{x=0}^∞ (d²/dq²) qˣ = pq² (d²/dq²) Σ_{x=0}^∞ qˣ = pq² (d²/dq²)[1/(1 − q)] = pq² · 2/(1 − q)³ = 2q²/p².
Hence
V(X) = EX(X − 1) + EX − [EX]² = 2q²/p² + q/p − q²/p² = q/p².
If X ~ P(λ), that is, f(x) = e^(−λ) λˣ/x!, x = 0, 1, . . ., then
EX(X − 1) = λ²  and  V(X) = λ.
Proof. Using the Taylor series, we get
EX(X − 1) := Σ_{x=0}^∞ x(x − 1) e^(−λ) λˣ/x! = λ² e^(−λ) Σ_{x=2}^∞ λ^(x−2)/(x − 2)! = λ² e^(−λ) Σ_{x=0}^∞ λˣ/x! = λ² e^(−λ) e^λ = λ².
Hence V(X) = λ² + λ − λ² = λ.
Let μ = EX be the first moment of X. Then the r-th central moment of X is defined as
E(X − μ)ʳ,  r = 1, 2, . . . .
We now define the moment generating function of a discrete random variable X. Notice that for each t, e^(tX) is also a random variable.
Definition 23 Suppose that the discrete random variable X has the probability mass function f. Then the moment generating function of X is defined as
MX(t) := Ee^(tX) = Σ_x e^(tx) f(x),  t ∈ IR.
If the series Σ_x e^(tx) f(x) converges then the moment generating function MX is defined.
Given a random variable X, one can generate the moments using the moment generating function. Using the Taylor series we see that
e^z = Σ_{n=0}^∞ zⁿ/n! = 1 + z/1! + z²/2! + z³/3! + ⋯.
Now,
MX(t) = Ee^(tX) = Σ_x e^(tx) f(x)
= Σ_x [1 + xt/1! + (xt)²/2! + (xt)³/3! + ⋯] f(x)
= Σ_x f(x) + t Σ_x x f(x) + (t²/2!) Σ_x x² f(x) + ⋯
= 1 + tEX + (t²/2!)EX² + ⋯.
By differentiating with respect to t we get
MX′(t) = EX + tEX² + (t²/2!)EX³ + ⋯
MX″(t) = EX² + tEX³ + ⋯
. . .
MX⁽ʳ⁾(t) = EXʳ + tEX^(r+1) + ⋯.
Letting t = 0 we get the formula
MX⁽ʳ⁾(0) = EXʳ,  r = 1, 2, . . . .
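The formula MX⁽ʳ⁾(0) = EXʳ can be sketched numerically by differentiating a moment generating function at t = 0 with difference quotients. The sketch uses the Poisson mgf exp[λ(eᵗ − 1)] (derived in Example 27) and recovers EX = λ and V(X) = λ.

```python
from math import exp

def mgf_poisson(t, lam):
    """M_X(t) = exp(lam*(e^t - 1)) for X ~ P(lam) (Example 27)."""
    return exp(lam * (exp(t) - 1.0))

lam = 2.5

# First moment: symmetric difference quotient of M_X at t = 0 gives EX.
h = 1e-5
ex = (mgf_poisson(h, lam) - mgf_poisson(-h, lam)) / (2 * h)

# Second moment: second difference gives EX^2, hence the variance.
h2 = 1e-4
m2 = (mgf_poisson(h2, lam) - 2 * mgf_poisson(0.0, lam) + mgf_poisson(-h2, lam)) / h2**2
var = m2 - ex**2
print(round(ex, 4), round(var, 4))  # both ≈ lam = 2.5
```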
Example 25 (Binomial random variable). If X ~ B(n, p), then the moment generating function of X is
MX(t) = Ee^(tX) = Σ_{x=0}^n e^(tx) C(n, x) pˣ(1 − p)ⁿ⁻ˣ = Σ_{x=0}^n C(n, x) (peᵗ)ˣ(1 − p)ⁿ⁻ˣ = (1 − p + peᵗ)ⁿ.
Now,
EX = MX′(0) = np.
Example 26 (Geometric random variable). If X ~ G(p), then with q = 1 − p,
MX(t) = Σ_{x=0}^∞ e^(tx) qˣ p = p Σ_{x=0}^∞ (qeᵗ)ˣ = p/(1 − qeᵗ),  qeᵗ < 1,
and
EX = MX′(t)|_{t=0} = [pqeᵗ/(1 − qeᵗ)²]|_{t=0} = q/p,
V(X) = MX″(t)|_{t=0} − (q/p)² = q/p².
Example 27 (Poisson random variable). If X ~ P(λ), λ > 0, then the moment generating function of X is
MX(t) = Ee^(tX) = Σ_{x=0}^∞ e^(tx) e^(−λ) λˣ/x! = e^(−λ) Σ_{x=0}^∞ (λeᵗ)ˣ/x! = e^(−λ) e^(λeᵗ) = exp[λ(eᵗ − 1)]
and
EX = MX′(t)|_{t=0} = λ,
V(X) = MX″(t)|_{t=0} − λ² = λ.
2.5.3
Exercises
1. Find
lim_{n→∞} Σ_{k=1}^n (k/n)² (1/n).
2. Find
lim_{n→∞} Σ_{k=1}^n cos(k/n) (1/n).
3. Let z_{α/2} be a positive constant and suppose that p̂ satisfies
(p̂ − p)² ≤ z²_{α/2} p(1 − p)/n.
Show that this is equivalent to the quadratic inequality
(1 + z²_{α/2}/n) p² − (2p̂ + z²_{α/2}/n) p + p̂² ≤ 0,
and solve it for p.
6. We recall the following identity:
Σ_{y=0}^∞ C(y + r − 1, r − 1) pʳ(1 − p)^y = 1  for 0 < p < 1.
Consider
f(x) = C(x − 1, r − 1) pʳ(1 − p)^(x−r),  x = r, r + 1, . . . .
(a) Prove that there exists a random variable X whose probability mass function is f.
(b) Find EX.
(c) Find V(X).
(d) Find the moment generating function MX.
2.6
Distribution functions
2.6.1
The cumulative distribution function
The cumulative distribution function (CDF) of a random variable X is defined by
F(x) = P(X ≤ x),  x ∈ IR,   (2.6)
where the right-hand side represents the probability that the random variable X takes on a value less than or equal to x. The probability that X lies in the interval (a, b] is therefore
F(b) − F(a)
if a < b.
When X has a probability density function f, the CDF of X can be written in terms of f as follows:
F(x) = ∫_{−∞}^x f(t) dt.
Note that in the definition above, the "less than or equal" sign ≤ is a convention, but it is a universally used one, and it is important for discrete distributions. The proper use of tables of the binomial and Poisson distributions depends upon this convention.
Theorem 22 Let F be the cumulative distribution function of a random variable X. Then
1. F is (not necessarily strictly) monotone increasing. That is, if x < y then F(x) ≤ F(y).
2. lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.
3. F is right-continuous:
lim_{y↓x} F(y) = F(x).
Proof.
1. If x < y then {X ≤ x} ⊆ {X ≤ y}. Hence P(X ≤ x) ≤ P(X ≤ y).
2. Observe the following fact: lim_{x→∞} f(x) = ℓ if and only if f(xₙ) → ℓ for every sequence (xₙ) with xₙ → ∞. Now, let yₙ ↑ ∞. Since the events {X ≤ yₙ} are increasing, we get
⋃_{n=1}^∞ {X ≤ yₙ} = {X < ∞}.
Hence
lim_{n→∞} P(X ≤ yₙ) = P(X < ∞) = 1.
Similarly, we get
F(−∞) := lim_{y→−∞} F(y) = 0.
3. Let yₙ ↓ x. Then
⋂_{n=1}^∞ {X ≤ yₙ} = {X ≤ x}.
Hence
lim_{yₙ↓x} F(yₙ) = lim_{n→∞} P(X ≤ yₙ) = P(⋂_{n=1}^∞ {X ≤ yₙ}) = P(X ≤ x) = F(x).
If X is discrete, then
F(x) = P(X ≤ x) = Σ_{xᵢ ≤ x} P(X = xᵢ) = Σ_{xᵢ ≤ x} p(xᵢ).
If the CDF F of X is continuous, then X is a continuous random variable; if furthermore F is absolutely continuous, then there exists a Lebesgue-integrable function f(x) such that
F(b) − F(a) = P(a ≤ X ≤ b) = ∫_a^b f(x) dx
for all real numbers a and b. (The first of the two equalities displayed above would not be correct in general if we had not said that the distribution is continuous. Continuity of the distribution implies that P(X = a) = P(X = b) = 0, so the difference between < and ≤ ceases to be important in this context.) The function f is equal to the derivative of F almost everywhere, and it is called the probability density function of the distribution of X.
Example 28
1. As an example, suppose X is uniformly distributed on the unit interval [0, 1]. Then the CDF of X is given by
F(x) = 0 for x < 0;  F(x) = x for 0 ≤ x ≤ 1;  F(x) = 1 for 1 < x.
2. Take another example: suppose X takes only the discrete values 0 and 1, with equal probability. Then the CDF of X is given by
F(x) = 0 for x < 0;  F(x) = 1/2 for 0 ≤ x < 1;  F(x) = 1 for 1 ≤ x.
2.6.2
Riemann integral
A continuous random variable X with probability density function f has
F(x) = ∫_{−∞}^x f(t) dt
for a real valued function f defined on IR; the f is called the probability density function of X.
In this section we review the Riemann integral
∫_a^b f(x) dx.
Let a = x₀ < x₁ < ⋯ < xₙ = b be a partition of [a, b] and choose points ξᵢ ∈ [xᵢ₋₁, xᵢ]. Form the sum
Σ_{i=1}^n f(ξᵢ)(xᵢ − xᵢ₋₁).   (2.7)
The sum given in (2.7) is called a Riemann sum of f on [a, b]. Letting n → ∞ with the mesh of the partition tending to 0, we can consider the following limit:
lim_{n→∞} Σ_{i=1}^n f(ξᵢ)(xᵢ − xᵢ₋₁).   (2.8)
The limit in (2.8) may or may not exist. We denote the limit as
∫_a^b f(x) dx.
That is,
∫_a^b f(x) dx := lim_{n→∞} Σ_{i=1}^n f(ξᵢ)(xᵢ − xᵢ₋₁).   (2.9)
In particular, if the limit exists, we say that f is Riemann integrable on [a, b].
We have the following on the Riemann integral.
Theorem 23 If f is continuous on [a, b] then it is Riemann integrable.
Theorem 24 Suppose that f and g are Riemann integrable on [a, b]. Let c be a constant. Then
1. f + g is Riemann integrable and
∫_a^b (f(x) + g(x)) dx = ∫_a^b f(x) dx + ∫_a^b g(x) dx.
2. cf is Riemann integrable and
∫_a^b cf(x) dx = c ∫_a^b f(x) dx.
3. If f ≤ g on [a, b] then
∫_a^b f(x) dx ≤ ∫_a^b g(x) dx.
In particular, if f ≥ 0 on [a, b], then ∫_a^b f(x) dx ≥ 0.
Theorem 25
1. |∫_a^b f(x) dx| ≤ ∫_a^b |f(x)| dx.
2. For a < c < b,
∫_a^b f(x) dx = ∫_a^c f(x) dx + ∫_c^b f(x) dx.
Since the variable of integration is a dummy variable,
∫_a^b f(x) dx = ∫_a^b f(t) dt = ∫_a^b f(y) dy = ∫_a^b f(z) dz = ∫_a^b f(θ) dθ.
We define
∫_a^∞ f(x) dx := lim_{b→∞} ∫_a^b f(x) dx
and write it as ∫_a^∞ f(x) dx without regard to the existence of the limit. We say that this integral is an infinite integral. In particular, if the limit exists, we say that the infinite integral converges, and diverges otherwise. Similarly, we can define the infinite integrals
∫_{−∞}^a f(x) dx  and  ∫_{−∞}^∞ f(x) dx.
2.6.3
Continuous distributions
A probability density function f satisfies f(x) ≥ 0 and
∫_{−∞}^∞ f(x) dx = 1.
The uniform distribution on [a, b] has the probability density function
f(x) = 1/(b − a) for a ≤ x ≤ b, and f(x) = 0 otherwise.
The gamma function and the gamma distribution
We introduce the gamma function. By an integration by parts and by induction we get the identity
∫_0^∞ x^(n−1) e^(−x) dx = (n − 1)!.
More generally, for α > 0 define
Γ(α) := ∫_0^∞ x^(α−1) e^(−x) dx.
Then, we have
Theorem 27
1. Γ(1) = 1,
2. Γ(α + 1) = αΓ(α), and
3. Γ(n + 1) = n!,  n = 1, 2, 3, . . . .
Proof.
1. Using the fundamental theorem of calculus, we know that
∫_0^b e^(−x) dx = [−e^(−x)]_0^b = −e^(−b) + 1.
Now, we have
Γ(1) = ∫_0^∞ e^(−x) dx = lim_{b→∞} ∫_0^b e^(−x) dx = 1.
2. Integrating by parts,
Γ(α + 1) = ∫_0^∞ x^α e^(−x) dx = [−x^α e^(−x)]_0^∞ + α ∫_0^∞ x^(α−1) e^(−x) dx = αΓ(α).
3. Combining 1 and 2, Γ(n + 1) = nΓ(n) = ⋯ = n! Γ(1) = n!.
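Theorem 27 can be sketched with the standard library's gamma function; the loop checks Γ(n + 1) = n! and the recursion Γ(α + 1) = αΓ(α), including non-integer arguments.

```python
from math import gamma, factorial, isclose

# Theorem 27.3: Γ(n+1) = n! for positive integers n.
for n in range(1, 10):
    assert isclose(gamma(n + 1), factorial(n))

# Theorem 27.2 holds for non-integer arguments as well: Γ(α+1) = αΓ(α).
for alpha in (0.5, 1.7, 3.2):
    assert isclose(gamma(alpha + 1), alpha * gamma(alpha))

print(gamma(0.5) ** 2)  # Γ(1/2)^2 = π (see Remark 8)
```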
Now, for λ > 0 use the identity
∫_0^∞ λ e^(−λx) dx = 1
to see that
f(x) = λ e^(−λx) for x > 0, and 0 otherwise,
is a probability density function; the corresponding distribution is the exponential distribution E(λ). Similarly, for α > 0,
f(x) = [1/Γ(α)] x^(α−1) e^(−x) for x > 0, and 0 otherwise,
is the probability density function of the gamma distribution Γ(α, 1), and for α, β > 0,
f(x) = [1/(Γ(α) β^α)] x^(α−1) e^(−x/β) for x > 0, and 0 otherwise,
is the probability density function of the gamma distribution Γ(α, β).
Remark 5
1. We observe that E(λ) = Γ(1, 1/λ).
2. The chi-square distribution with r degrees of freedom, χ²(r), is the gamma distribution Γ(r/2, 2); its probability density function is
f(x) = [1/(Γ(r/2) 2^(r/2))] x^(r/2−1) e^(−x/2) for x > 0, and 0 otherwise.
For a continuous random variable X with probability density function f, the expectation of a function u(X) is
E[u(X)] := ∫_{−∞}^∞ u(x) f(x) dx,
and the moment generating function of X is
MX(t) = Ee^(tX) := ∫_{−∞}^∞ e^(tx) f(x) dx.
Suppose X ~ Γ(α, 1). Then
EX = ∫_0^∞ x [1/Γ(α)] x^(α−1) e^(−x) dx.
Observe that
∫_0^∞ x [1/Γ(α)] x^(α−1) e^(−x) dx = [1/Γ(α)] ∫_0^∞ x^(α+1−1) e^(−x) dx = [Γ(α + 1)/Γ(α)] ∫_0^∞ [1/Γ(α + 1)] x^(α+1−1) e^(−x) dx.
Since
[1/Γ(α + 1)] x^(α+1−1) e^(−x)
is a probability density function,
∫_0^∞ [1/Γ(α + 1)] x^(α+1−1) e^(−x) dx = 1.
So we get
EX = Γ(α + 1)/Γ(α) = α.
If X ~ Γ(α, β) then
1. EX = αβ,
2. Var(X) = αβ²,
3. MX(t) = (1 − βt)^(−α),  1 − βt > 0.
Remark 6 We observe that
1. X ~ Γ(α, β) ⟺ MX(t) = (1 − βt)^(−α), 1 − βt > 0,
2. X ~ E(λ) ⟺ MX(t) = (1 − t/λ)^(−1), 1 − t/λ > 0,
3. X ~ χ²(r) ⟺ MX(t) = (1 − 2t)^(−r/2), 1 − 2t > 0,
4. X ~ Γ(α, β) ⟺ X = βZ, Z ~ Γ(α, 1).
By using the moment generating function we can prove the following
Theorem 29 Suppose X₁, . . . , Xₙ are independent random variables.
1. If Xᵢ ~ Γ(αᵢ, β) (i = 1, 2, . . . , n), then Σ_{i=1}^n Xᵢ ~ Γ(Σ_{i=1}^n αᵢ, β).
2. If Xᵢ ~ E(λ) (i = 1, 2, . . . , n), then Σ_{i=1}^n Xᵢ ~ Γ(n, 1/λ).
3. If Xᵢ ~ χ²(rᵢ) (i = 1, 2, . . . , n), then Σ_{i=1}^n Xᵢ ~ χ²(Σ_{i=1}^n rᵢ).
4. If X ~ Γ(α, β) and c > 0, then cX ~ Γ(α, cβ).
Example (waiting times in a Poisson process). Let Nₜ ~ P(λt) be the number of events occurring in the time interval [0, t], let W be the waiting time until the first event, and let W_k be the waiting time until the k-th event.
1. Since Nₜ ~ P(λt), we see that
P(W > t) = P(Nₜ = 0) = e^(−λt),  t > 0.
So
(d/dt) P(W ≤ t) = (d/dt)(1 − e^(−λt)) = λ e^(−λt) for t > 0, and 0 otherwise.
So W ~ E(λ).
2. Since Nₜ ~ P(λt), we see that
P(W_k > t) = P(Nₜ ≤ k − 1) = Σ_{x=0}^{k−1} (λt)ˣ e^(−λt)/x!.
So
(d/dt) P(W_k ≤ t) = −(d/dt) Σ_{x=0}^{k−1} (λt)ˣ e^(−λt)/x!
= −Σ_{x=0}^{k−1} [x λˣ t^(x−1) e^(−λt) − λ^(x+1) tˣ e^(−λt)]/x!
= −Σ_{x=1}^{k−1} λˣ t^(x−1) e^(−λt)/(x − 1)! + Σ_{x=0}^{k−1} λ^(x+1) tˣ e^(−λt)/x!
= λᵏ t^(k−1) e^(−λt)/(k − 1)!
= [λᵏ/Γ(k)] t^(k−1) e^(−λt),
since the two sums telescope. So W_k ~ Γ(k, 1/λ).
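The example above can be sketched by simulation: building each path from independent E(λ) inter-arrival gaps, the count of events in [0, T] should have the P(λT) signature that its mean and variance are both λT (the parameter values below are illustrative choices).

```python
import random

random.seed(42)
lam, T, trials = 2.0, 10.0, 20_000

# Each path: accumulate E(lam) gaps and count events landing in [0, T].
counts = []
for _ in range(trials):
    t, n = 0.0, 0
    while True:
        t += random.expovariate(lam)
        if t > T:
            break
        n += 1
    counts.append(n)

mean = sum(counts) / trials
var = sum((c - mean) ** 2 for c in counts) / trials
print(round(mean, 2), round(var, 2))  # both near lam*T = 20
```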
The normal distribution
We have considered the following
Definition 27 Let μ ∈ IR, σ² ∈ (0, ∞). Then the random variable X is called a normal random variable, denoted X ~ N(μ, σ²), if X has the probability density function
f(x) = [1/(√(2π) σ)] exp{−(x − μ)²/(2σ²)},  x ∈ IR.
We prove that f is indeed a probability density function.
Proof.
1. It is clear that f(x) > 0 for all x ∈ IR.
2. Let
I = ∫_{−∞}^∞ [1/√(2π)] e^(−z²/2) dz.
Then I is a positive constant and we get
I² = [∫_{−∞}^∞ (1/√(2π)) e^(−z²/2) dz][∫_{−∞}^∞ (1/√(2π)) e^(−w²/2) dw] = (1/2π) ∫∫ e^(−(z²+w²)/2) dz dw.
Changing to polar coordinates,
I² = (1/2π) ∫_0^{2π} ∫_0^∞ e^(−r²/2) r dr dθ = (1/2π) ∫_0^{2π} dθ ∫_0^∞ e^(−r²/2) r dr = ∫_0^∞ e^(−t) dt = 1.
Hence
∫_{−∞}^∞ [1/√(2π)] e^(−z²/2) dz = 1,
and if we let z = (x − μ)/σ, then we get
∫_{−∞}^∞ [1/(√(2π) σ)] e^(−(x−μ)²/(2σ²)) dx = 1.
Theorem 30 X ~ N(μ, σ²) ⟺ Z = (X − μ)/σ ~ N(0, 1).
Proof. (⟹) Let Z = (X − μ)/σ. Then
P(Z ≤ z) = P((X − μ)/σ ≤ z) = P(X ≤ σz + μ) = ∫_{−∞}^{σz+μ} [1/(√(2π) σ)] e^(−(x−μ)²/(2σ²)) dx = ∫_{−∞}^z [1/√(2π)] e^(−t²/2) dt,
substituting t = (x − μ)/σ. Hence
(d/dz) P(Z ≤ z) = [1/√(2π)] e^(−z²/2).
That is,
Z ~ N(0, 1).
(⟸) Conversely, let Z ~ N(0, 1) and let X = σZ + μ. Then
P(X ≤ x) = P(σZ + μ ≤ x) = P(Z ≤ (x − μ)/σ) = ∫_{−∞}^{(x−μ)/σ} [1/√(2π)] e^(−z²/2) dz = ∫_{−∞}^x [1/(√(2π) σ)] e^(−(t−μ)²/(2σ²)) dt,
and
fX(x) = (d/dx) P(X ≤ x) = [1/(√(2π) σ)] e^(−(x−μ)²/(2σ²)).
Hence,
X ~ N(μ, σ²).
Theorem 31 Suppose X ~ N(μ, σ²), that is, X has the probability density function
f(x) = [1/(√(2π) σ)] exp{−(x − μ)²/(2σ²)},  x ∈ IR,
or equivalently X = σZ + μ with Z = (X − μ)/σ ~ N(0, 1). Then
MX(t) = exp(μt + σ²t²/2).
Proof. If Z ~ N(0, 1), then the moment generating function of Z is
MZ(t) = Ee^(tZ) = ∫_{−∞}^∞ e^(tz) [1/√(2π)] e^(−z²/2) dz
= ∫_{−∞}^∞ [1/√(2π)] e^(−z²/2 + tz) dz
= ∫_{−∞}^∞ [1/√(2π)] exp{−(z − t)²/2 + t²/2} dz
= exp{t²/2} ∫_{−∞}^∞ [1/√(2π)] exp{−(z − t)²/2} dz
= exp{t²/2},
since
∫_{−∞}^∞ [1/√(2π)] exp{−(z − t)²/2} dz = 1.
Hence
MX(t) = Ee^(t(σZ+μ)) = e^(μt) MZ(σt) = exp(μt + σ²t²/2).
Theorem 32 If X ~ N(μ, σ²), then
1. EX = μ,
2. Var(X) = σ²,
3. for Z ~ N(0, 1),
EZᵐ = (2k)!/(k! 2ᵏ) if m = 2k, and EZᵐ = 0 if m = 2k + 1, k = 1, 2, . . . .
Proof.
X ~ N(μ, σ²) ⟺ X = σZ + μ, Z ~ N(0, 1).
Then
EX = E(σZ + μ) = σE(Z) + μ,
V(X) = V(σZ + μ) = σ²V(Z).
On the other hand,
EZ = ∫_{−∞}^∞ z [1/√(2π)] e^(−z²/2) dz = [1/√(2π)] [∫_{−∞}^0 z e^(−z²/2) dz + ∫_0^∞ z e^(−z²/2) dz] = [1/√(2π)] [−∫_0^∞ e^(−w) dw + ∫_0^∞ e^(−w) dw] = 0,
and
V(Z) = ∫_{−∞}^∞ z² [1/√(2π)] e^(−z²/2) dz = [2/√(2π)] ∫_0^∞ z² e^(−z²/2) dz
= −[2/√(2π)] ∫_0^∞ z d(e^(−z²/2))
= [2/√(2π)] ([−z e^(−z²/2)]_0^∞ + ∫_0^∞ e^(−z²/2) dz)
= ∫_{−∞}^∞ [1/√(2π)] e^(−z²/2) dz
= 1.
So EX = μ and V(X) = σ². For the moments of Z ~ N(0, 1), expand the moment generating function:
MZ(t) = exp{t²/2} = Σ_{k=0}^∞ (1/k!)(t²/2)ᵏ = Σ_{k=0}^∞ [1/(k! 2ᵏ)] t^(2k) = Σ_{k=0}^∞ [(2k)!/(k! 2ᵏ)] t^(2k)/(2k)!.
Comparing with the general moment expansion of a moment generating function, we read off
EZ^(2k) = (2k)!/(k! 2ᵏ),  k = 1, 2, . . . ,  and  EZ^(2k+1) = 0.
Remark 7 Suppose that Z ~ N(0, 1). Then one can evaluate the probability
Φ(z) = P(Z ≤ z)
and form a normal probability table.
Example 30 If X ~ N(μ, σ²), then
P(a < X < b) = P((a − μ)/σ < (X − μ)/σ < (b − μ)/σ) = Φ((b − μ)/σ) − Φ((a − μ)/σ).
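Instead of a printed table, Φ can be evaluated with the error function from the standard library (using the identity Φ(z) = (1 + erf(z/√2))/2); this sketch computes P(a < X < b) by the standardization in Example 30.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF: Φ(z) = (1 + erf(z/√2))/2."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_prob(a, b, mu, sigma):
    """P(a < X < b) for X ~ N(mu, sigma^2), by standardization."""
    return phi((b - mu) / sigma) - phi((a - mu) / sigma)

# The familiar one- and two-sigma probabilities:
print(round(normal_prob(-1, 1, 0, 1), 4))   # ≈ 0.6827
print(round(normal_prob(-2, 2, 0, 1), 4))   # ≈ 0.9545
```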
Example 31 The graph of the probability density function of N(μ, σ²)
1. is symmetric with respect to μ,
2. has its maximum value at μ, and
3. is bell shaped.
One can use the moment generating function of the normal distribution to get the following results.
Theorem 33
1. If X ~ N(μ, σ²), then aX + b ~ N(aμ + b, a²σ²).
2. If X₁, . . . , Xₙ are independent and Xᵢ ~ N(μᵢ, σᵢ²) for i = 1, 2, . . . , n, then Y = a₁X₁ + ⋯ + aₙXₙ ~ N(a₁μ₁ + ⋯ + aₙμₙ, a₁²σ₁² + ⋯ + aₙ²σₙ²).
Remark 8 Using the gamma function and the property of the normal distribution we can get the following identity:
Γ(1/2) = √π.
Indeed, substituting t = z²/2 (so that dt = z dz),
Γ(1/2) = ∫_0^∞ t^(1/2−1) e^(−t) dt = ∫_0^∞ t^(−1/2) e^(−t) dt = ∫_0^∞ (z²/2)^(−1/2) e^(−z²/2) z dz = √2 ∫_0^∞ e^(−z²/2) dz = √2 · (√(2π)/2) = √π,
using ∫_{−∞}^∞ e^(−z²/2) dz = √(2π) and the symmetry of the integrand.
Theorem 34 If Z ~ N(0, 1), then Z² ~ χ²(1).
Proof. Let Φ be the distribution function of Z ~ N(0, 1), and let Y = Z². Then for y > 0, Y has the distribution function
P(Y ≤ y) = P(Z² ≤ y) = P(−√y ≤ Z ≤ √y) = Φ(√y) − Φ(−√y) = 2Φ(√y) − 1.
Hence Y has the probability density function
g(y) = (d/dy) P(Y ≤ y) = (d/dy)[2Φ(√y) − 1] = (1/√y) Φ′(√y) = (1/√y)[1/√(2π)] e^(−y/2).
Now, using Γ(1/2) = √π, we have √(2π) = Γ(1/2) 2^(1/2), so
g(y) = [1/(Γ(1/2) 2^(1/2))] y^(1/2−1) e^(−y/2),  y > 0.
That is, Y = Z² ~ Γ(1/2, 2) = χ²(1).
2.6.4
Exercises
1. Suppose X has the distribution function
F(x) = 0 for x < 0;  1/2 for 0 ≤ x < 1;  2/3 for 1 ≤ x < 2;  11/12 for 2 ≤ x < 3;  1 for 3 ≤ x.
Then,
(a) Find P(X < 3).
(b) Find P(X = 1).
(c) Find P(X > 1/2).
(d) Find P(2 < X ≤ 4).
2. Let F be the distribution function of a general random variable.
(a) If the distribution function F is continuous and strictly increasing, then F⁻¹(y), y ∈ [0, 1], is the unique real number x such that F(x) = y.
(b) Unfortunately, the distribution function does not, in general, have an inverse. One may define, for y ∈ [0, 1],
F⁻¹(y) = inf{x ∈ IR : F(x) ≥ y}.
3. Prove the following properties of the generalized inverse F⁻¹:
(a) F⁻¹ is nondecreasing.
(b) F⁻¹(F(x)) ≤ x.
(c) F(F⁻¹(y)) ≥ y.
(d) F⁻¹(y) ≤ x if and only if y ≤ F(x).
(e) If Y has a U[0, 1] distribution, then F⁻¹(Y) is distributed as F.
(f) If {Xα} is a collection of independent F-distributed random variables defined on the same sample space, then there exist random variables Yα such that Yα is distributed as U[0, 1] and
F⁻¹(Yα) = Xα
with probability 1 for all α.
4. It is known that there is a one to one correspondence between a random variable X and a distribution function F. That is, one can prove the following:
(a) Given a random variable X, define the distribution function
FX(x) = P(X ≤ x).
Then FX is monotone, FX(∞) := lim_{y→∞} FX(y) = 1 and FX(−∞) := lim_{y→−∞} FX(y) = 0, and FX is right continuous. See Theorem 22.
(b) Conversely, given a distribution function F satisfying
i. F is monotone,
ii. lim_{y→∞} F(y) = 1 and lim_{y→−∞} F(y) = 0, and
iii. F is right continuous,
one can define a probability space (Ω, F, P) and a random variable X whose distribution function FX is equal to F. This will be proved in an advanced probability theory course.
5. Evaluate ∫ x² dx.
6. Evaluate ∫ cos x dx.
7. Prove that if X ~ Γ(α, β) then EX = αβ.
8. Prove that if X ~ Γ(α, β) then V(X) = αβ².
9. Prove that if X ~ Γ(α, β) then
MX(t) = (1 − βt)^(−α),  1 − βt > 0.
10. We now introduce the beta distribution.
(a) Evaluate ∫_0^1 x(1 − x) dx.
(b) Evaluate ∫_0^1 x⁹⁹(1 − x) dx.
(c) Evaluate ∫_0^1 x(1 − x)⁹⁹ dx.
(d) Define
B(a, b) := ∫_0^1 x^(a−1)(1 − x)^(b−1) dx.
It can be shown that
B(a, b) = Γ(a)Γ(b)/Γ(a + b).
(e) The beta distribution has the probability density function
f(x) = [Γ(a + b)/(Γ(a)Γ(b))] x^(a−1)(1 − x)^(b−1) for 0 < x < 1, and 0 otherwise.
2.7
Jointly distributed random variables
It is natural that two or more random variables are involved when we try to apply a probabilistic model to a phenomenon in real life. We study the properties that occur when we model using two random variables.
2.7.1
Joint distributions
Consider the joint distribution function of the two random variables X and Y defined by
F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y),  (x, y) ∈ IR².
We suppose that X and Y are absolutely continuous random variables in the sense that there exists a function f : IR² → [0, ∞) such that
F_{X,Y}(x, y) = ∫_{−∞}^x ∫_{−∞}^y f(s, t) dt ds.
Example 32 Find the constant c so that f(x, y) = c·1_(0<x<y<1) is the joint probability density function of random variables X and Y.
Solution. We need
∫∫ f(x, y) dx dy = c ∫∫ 1_(0<x<y<1) dx dy = c ∫_0^1 ∫_0^y dx dy = c ∫_0^1 y dy = c/2 = 1.
Therefore c = 2.
Example 33 Find the constant c so that f(x, y) = c e^(−(x²+y²)/2), (x, y) ∈ IR², is the joint probability density function of the random variables X and Y.
Solution. We have
∫∫ f(x, y) dx dy = c ∫∫ e^(−(x²+y²)/2) dx dy = c [∫ e^(−x²/2) dx][∫ e^(−y²/2) dy].
Now, using the identity
∫_{−∞}^∞ e^(−x²/2) dx = √(2π),
we see that
∫∫ f(x, y) dx dy = c · 2π.
So c = 1/(2π).
The marginal probability density function of X is
fX(x) = ∫_{−∞}^∞ f(x, y) dy.
Proof. Clearly, fX ≥ 0. Moreover,
P(X ≤ x) = P(X ≤ x, −∞ < Y < ∞) = ∫_{−∞}^x [∫_{−∞}^∞ f(s, t) dt] ds = ∫_{−∞}^x fX(s) ds.
Therefore
(d/dx) P(X ≤ x) = fX(x).
Definition 28 Suppose that (X, Y) ~ f(x, y). Then the probability density functions
fX(x) = ∫_{−∞}^∞ f(x, y) dy  and  fY(y) = ∫_{−∞}^∞ f(x, y) dx
are called the marginal probability density functions of X and Y.
For Example 32, with f(x, y) = 2·1_(0<x<y<1),
fX(x) = ∫ f(x, y) dy = ∫_x^1 2 dy = 2(1 − x)·1_(0<x<1),
fY(y) = ∫ f(x, y) dx = ∫_0^y 2 dx = 2y·1_(0<y<1).
For Example 33,
fX(x) = ∫_{−∞}^∞ f(x, y) dy = [1/√(2π)] e^(−x²/2) ∫_{−∞}^∞ [1/√(2π)] e^(−y²/2) dy = [1/√(2π)] e^(−x²/2),  x ∈ IR.
Remark 10
P(a < X < b) = P(a < X < b, −∞ < Y < ∞) = ∫_a^b [∫_{−∞}^∞ f(x, y) dy] dx = ∫_a^b fX(x) dx.
2.7.2
Conditional distribution
Consider the function
f(y|x) = f(x, y)/fX(x),  fX(x) > 0.
Then
1. f(y|x) ≥ 0 for all y, and
2. ∫ f(y|x) dy = ∫ [f(x, y)/fX(x)] dy = [∫ f(x, y) dy]/fX(x) = 1.
Definition 29 Suppose that (X, Y) ~ f(x, y). Then the probability density function
f(y|x) = f(x, y)/fX(x),  fX(x) > 0,
is called the conditional probability density function of Y given X = x.
Definition 30 Suppose that (X, Y) ~ f(x, y). Then
F(y|x) = ∫_{−∞}^y f(t|x) dt
is called the conditional distribution function of Y given X = x, and for a set A,
P(Y ∈ A|X = x) = ∫_A f(t|x) dt.
Example Let (X, Y) ~ f(x, y) = 2·1_(0<x<y<1).
1. Find f(y|x).
2. Find P(Y > 1/2 | X = x).
Solution. We know that
fX(x) = 2(1 − x)·1_(0<x<1).
Therefore, the conditional probability density function of Y given X = x is
f(y|x) = f(x, y)/fX(x) = [1/(1 − x)]·1_(x,1)(y),  0 < x < 1.
Therefore, computing P(Y > 1/2 | X = x) = ∫_{1/2}^∞ f(y|x) dy piecewise:
for 1/2 ≤ x < 1, ∫_x^1 [1/(1 − x)] dy = 1;
for 0 < x < 1/2, ∫_{1/2}^1 [1/(1 − x)] dy = (1/2)/(1 − x).
That is,
P(Y > 1/2 | X = x) = 1 for 1/2 ≤ x < 1, and = 1/(2(1 − x)) for 0 < x < 1/2.
The conditional expectation of u(X, Y) given X = x is defined as
E[u(X, Y)|X = x] := ∫ u(x, y) f(y|x) dy.
In particular,
E(Y|X = x) = ∫ y f(y|x) dy.
Then
E[E(Y|X)] := ∫ [∫ y f(y|x) dy] fX(x) dx = ∫∫ y f(y|x) fX(x) dy dx = ∫∫ y f(x, y) dy dx = E(Y).
Since
E[g(X)Y|X = x] := ∫ g(x) y f(y|x) dy = g(x) ∫ y f(y|x) dy = g(x) E(Y|X = x),
a factor that is a function of X can be pulled out of a conditional expectation given X.
The conditional variance of Y given X = x is defined, with g(x) = E(Y|X = x), as
V(Y|X = x) := ∫ (y − g(x))² f(y|x) dy
= ∫ y² f(y|x) dy − 2g(x) ∫ y f(y|x) dy + g(x)² ∫ f(y|x) dy
= E(Y²|X = x) − (g(x))².
Example 37 Suppose that (X, Y) ~ f(x, y) = 2·1_(0<x<y<1). Find E[Y|X = x] and Var[Y|X = x].
Solution. Notice that
f(y|x) = [1/(1 − x)]·1_(x<y<1).
Therefore
E[Y|X = x] = ∫ y f(y|x) dy = ∫_x^1 [y/(1 − x)] dy = [1/(1 − x)] · (1 − x²)/2 = (1 + x)/2,  0 < x < 1,
and
E[Y²|X = x] = ∫_x^1 [y²/(1 − x)] dy = (1 + x + x²)/3.
Therefore
V[Y|X = x] = (1 + x + x²)/3 − [(1 + x)/2]² = (1 − x)²/12,  0 < x < 1.
For n random variables (X₁, X₂, . . . , Xₙ) ~ f(x₁, x₂, . . . , xₙ) we define the marginal probability density function of X₁ as
f₁(x₁) = ∫ ⋯ ∫ f(x₁, x₂, . . . , xₙ) dx₂ ⋯ dxₙ
and a conditional probability density function of (X₂, . . . , Xₙ) given X₁ = x₁ as
f(x₂, . . . , xₙ|X₁ = x₁) = f(x₁, x₂, . . . , xₙ)/f₁(x₁).
Theorem 39
1. E(Y + Z|X = x) = E(Y|X = x) + E(Z|X = x).
2. E(Y − h(X))² ≥ E[(Y − E(Y|X))²] for any function h.
3. V(Y) = E[V(Y|X)] + Var[E(Y|X)].
Proof. 1.
E(Y + Z|X = x) := ∫∫ (y + z) f(y, z|x) dy dz = ∫ y [∫ f(y, z|x) dz] dy + ∫ z [∫ f(y, z|x) dy] dz = ∫ y f(y|x) dy + ∫ z f(z|x) dz = E[Y|X = x] + E[Z|X = x].
2. Let g(X) = E[Y|X]. Then we get
E[(Y − h(X))²] = E{E[(Y − h(X))²|X]}
= E{E[(Y − g(X) + g(X) − h(X))²|X]}
= E{E[(Y − g(X))²|X]} + E{E[(g(X) − h(X))²|X]} + 2E{E[(Y − g(X))(g(X) − h(X))|X]}
= E[V(Y|X)] + E[(g(X) − h(X))²] + 2E{(g(X) − h(X)) E[Y − g(X)|X]}
= E[V(Y|X)] + E[(g(X) − h(X))²],
since E[Y − g(X)|X] = 0. Noting that E[V(Y|X)] = E{E[(Y − g(X))²|X]} = E[(Y − g(X))²], this gives
E[(Y − h(X))²] = E[(Y − g(X))²] + E[(g(X) − h(X))²] ≥ E[(Y − g(X))²].
Therefore
E(Y − h(X))² ≥ E[(Y − E(Y|X))²] for all h.
3. Let
h(X) = μ₂ = E(Y).
Then, by the identity in 2,
V(Y) = E[(Y − μ₂)²] = E[V(Y|X)] + E[(g(X) − μ₂)²] = E[V(Y|X)] + V(g(X)) = E[V(Y|X)] + V[E(Y|X)],
since E[g(X)] = E[E(Y|X)] = E(Y) = μ₂.
Example 38 For (X, Y) \sim f(x,y) = 2 \cdot 1_{(0<x<y<1)}, the marginals are

    f_Y(y) = 2y \cdot 1_{(0,1)}(y), \qquad f_X(x) = 2(1-x) \cdot 1_{(0,1)}(x).

1.

    V(Y) = \int_0^1 y^2 \cdot 2y\,dy - \Big( \int_0^1 y \cdot 2y\,dy \Big)^2
         = \frac{1}{2} - \Big( \frac{2}{3} \Big)^2 = \frac{1}{18}.

2. Since V(Y|X) = \frac{(1-X)^2}{12},

    E[V(Y|X)] = \frac{1}{12} E[(1-X)^2] = \frac{1}{12} \int_0^1 2(1-x)^3\,dx = \frac{1}{24}.

3. Since E(Y|X) = \frac{1+X}{2},

    V[E(Y|X)] = V\Big( \frac{1+X}{2} \Big) = \frac{1}{4} V(X) = \frac{1}{4} \cdot \frac{1}{18} = \frac{1}{72}.

4.

    E[V(Y|X)] + V[E(Y|X)] = \frac{1}{24} + \frac{1}{72} = \frac{1}{18} = V(Y).
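The arithmetic above can be confirmed with exact rational arithmetic. This is a small Python sketch using the standard fractions module; the moment formulas are those derived in the text.

```python
from fractions import Fraction as F

# Exact moments for f_Y(y) = 2y on (0,1): E[Y^k] = 2/(k+2).
EY  = F(2, 3)
EY2 = F(2, 4)
VY  = EY2 - EY ** 2                   # 1/2 - 4/9 = 1/18

# E[V(Y|X)] = (1/12) E[(1-X)^2]; with f_X(x) = 2(1-x),
# E[(1-X)^2] = \int_0^1 2(1-x)^3 dx = 1/2.
EVYX = F(1, 12) * F(1, 2)             # = 1/24

# V[E(Y|X)] = V((1+X)/2) = V(X)/4; V(X) = 1/6 - 1/9 = 1/18.
VX   = F(1, 6) - F(1, 9)
VEYX = VX / 4                         # = 1/72

assert VY == F(1, 18) and EVYX == F(1, 24) and VEYX == F(1, 72)
assert EVYX + VEYX == VY              # law of total variance
print("V(Y) =", VY)
```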
2.7.3  Covariance and Correlation

Suppose that

    X: \ \mu_1 = E(X), \ \sigma_1^2 = V(X) \quad and \quad Y: \ \mu_2 = E(Y), \ \sigma_2^2 = V(Y).

Definition 33 The covariance of X and Y is defined as

    Cov(X, Y) = E[(X - \mu_1)(Y - \mu_2)],

and the correlation coefficient of X and Y is defined as

    \rho := Corr(X, Y) = \frac{Cov(X, Y)}{\sigma_1 \sigma_2}.

Properties 1
1. Cov(aX + b, cY + d) = ac\,Cov(X, Y).
2. Cov(X, Y) = E(XY) - E(X)E(Y).
3. -1 \le \rho \le 1;
   \rho = 1 \iff P(Y = aX + b) = 1 with a > 0,
   \rho = -1 \iff P(Y = aX + b) = 1 with a < 0.
4. V(X + Y) = V(X) + 2\,Cov(X, Y) + V(Y).
Proof.
1.

    Cov(aX + b, cY + d)
        = E[\{aX + b - E(aX + b)\}\{cY + d - E(cY + d)\}]
        = E[ac(X - E(X))(Y - E(Y))]
        = ac\,Cov(X, Y).

2.

    Cov(X, Y) = E[(X - \mu_1)(Y - \mu_2)]
              = E[XY - \mu_2 X - \mu_1 Y + \mu_1 \mu_2]
              = E(XY) - \mu_2 E(X) - \mu_1 E(Y) + \mu_1 \mu_2
              = E(XY) - \mu_1 \mu_2.

3. Let

    h(t) = E\{(Y - \mu_2) - t(X - \mu_1)\}^2 \ge 0.

Then

    h(t) = E(Y - \mu_2)^2 - 2t\,E(X - \mu_1)(Y - \mu_2) + t^2 E(X - \mu_1)^2
         = \sigma_2^2 - 2t\rho\sigma_1\sigma_2 + t^2\sigma_1^2.

By the Cauchy-Schwarz inequality, we get |\rho| \le 1. Now, putting t = \rho\sigma_2/\sigma_1,

    h\Big( \frac{\rho\sigma_2}{\sigma_1} \Big) = \sigma_2^2 - 2\rho^2\sigma_2^2 + \rho^2\sigma_2^2 = \sigma_2^2(1 - \rho^2) \ge 0.

Therefore,

    \rho^2 = 1 \implies E\Big[ \Big\{ (Y - \mu_2) - \frac{\rho\sigma_2}{\sigma_1}(X - \mu_1) \Big\}^2 \Big] = 0
           \implies P\Big[ Y - \mu_2 - \frac{\rho\sigma_2}{\sigma_1}(X - \mu_1) = 0 \Big] = 1.
Note that

    V(Y|x) = \sigma_2^2(1 - \rho^2)

tends to \sigma_2^2 as \rho^2 \to 0 and to 0 as \rho^2 \to 1.
Proof.
1. Since E(Y|X = x) = a + bx,

    E[E(Y|X)] = E(Y).

Therefore

    \mu_2 = a + b\mu_1.

Now, notice that

    \rho\sigma_1\sigma_2 + \mu_1\mu_2
        = E(XY)
        = E(E(XY|X))
        = E(X \cdot E(Y|X))
        = E(X(a + bX))
        = E[aX + bX^2]
        = a\mu_1 + b(\sigma_1^2 + \mu_1^2).

We get

    \rho\sigma_1\sigma_2 + \mu_1(a + b\mu_1) = a\mu_1 + b\sigma_1^2 + b\mu_1^2,

so that

    b = \frac{\rho\sigma_2}{\sigma_1}.

Therefore

    a = \mu_2 - \frac{\rho\sigma_2}{\sigma_1}\mu_1.

That is,

    E(Y|x) = \mu_2 + \frac{\rho\sigma_2}{\sigma_1}(x - \mu_1).

2. Since

    V(Y) = E[V(Y|x)] + V[E(Y|x)],

we get

    E[V(Y|x)] = V(Y) - V[E(Y|x)]
              = \sigma_2^2 - V(a + bX)
              = \sigma_2^2 - b^2 V(X)
              = \sigma_2^2 - \frac{\rho^2\sigma_2^2}{\sigma_1^2}\sigma_1^2
              = \sigma_2^2(1 - \rho^2).
The joint moment generating function of X = (X_1, \ldots, X_n) is

    M(t_1, t_2, \ldots, t_n) = E\Big[ \exp\Big( \sum_{i=1}^{n} t_i X_i \Big) \Big] = E(e^{t'X}).
2.7.4
Stochastic Independence
(2) Independence between two classes of events.

Definition 37 \{A_1, \ldots, A_m\} and \{B_1, \ldots, B_n\} are independent iff

    P(A_i \cap B_j) = P(A_i)P(B_j), \quad i = 1, \ldots, m, \ j = 1, \ldots, n.

(3) Independence between two discrete random variables.

Definition 38 Let X be a discrete random variable with support A = \{x_1, \ldots\} and let Y be a discrete random variable with support B = \{y_1, \ldots\}. Then X, Y are independent iff \{(X = x_1), (X = x_2), \ldots\} and \{(Y = y_1), (Y = y_2), \ldots\} are independent, i.e.,

    P[(X = x_i) \cap (Y = y_j)] = P[X = x_i]\,P[Y = y_j], \quad i = 1, \ldots, \ j = 1, \ldots,

i.e.,

    f(x, y) = f_X(x)\,f_Y(y) for all x, y.
(4) Independence between two random variables.

Definition 39 Suppose f(x, y) is the joint probability density function of X and Y. Then X and Y are independent iff

    f(x, y) = f_X(x)\,f_Y(y), for all x, y.

Theorem 41 Suppose f(x, y) is the joint probability density function of X and Y. Then X and Y are (stochastically) independent iff f(x, y) = g(x)h(y) for all x, y, for some functions g, h.
Proof. (\Rightarrow) Suppose X and Y are independent. Then f(x, y) = f_X(x)f_Y(y) for all x, y by definition.

(\Leftarrow) Suppose that f(x, y) = g(x)h(y) for all x, y. Then,

    f_X(x) = \int f(x, y)\,dy = g(x) \int h(y)\,dy = c_1 g(x),

with c_1 = \int h(y)\,dy. Similarly,

    f_Y(y) = c_2 h(y),

with c_2 = \int g(x)\,dx. Then

    f_X(x)f_Y(y) = c_1 c_2\,g(x)h(y) = c_1 c_2\,f(x, y).

Since

    c_1 c_2 = \int g(x)\,dx \int h(y)\,dy = \int\!\!\int g(x)h(y)\,dx\,dy = \int\!\!\int f(x, y)\,dx\,dy = 1,

therefore f(x, y) = f_X(x)f_Y(y) for all x, y.
Example 39
1. Let (X, Y) \sim f(x, y) where

    f(x, y) = \ldots
Proof. Using independence and the Fubini theorem, we see that

    E[u(X)v(Y)] := \int\!\!\int u(x)v(y) f(x, y)\,dx\,dy
        = \int v(y) f_2(y) \Big[ \int u(x) f_1(x)\,dx \Big] dy
        = E[u(X)] \int v(y) f_2(y)\,dy
        = E[u(X)]\,E[v(Y)].
Note that (e.g., for X uniform on (-1, 1))

    P\Big[ -\frac{1}{2} < X < \frac{1}{2}, \ X^2 > \frac{1}{4} \Big] = 0,

while

    P\Big[ -\frac{1}{2} < X < \frac{1}{2} \Big] \cdot P\Big[ X^2 > \frac{1}{4} \Big] = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4},

so X and X^2 are not independent.
(\Rightarrow) If X and Y are independent, then

    M(t_1, t_2) = E[\exp(t_1 X + t_2 Y)]
                = E[e^{t_1 X} e^{t_2 Y}]
                = E[e^{t_1 X}]\,E[e^{t_2 Y}] \quad (independence)
                = M_X(t_1) M_Y(t_2).

(\Leftarrow) Suppose

    M(t_1, t_2) = M_X(t_1) M_Y(t_2).

Then,
Definition 40 X_1, \ldots, X_n are independent iff

    f(x) = \prod_{i=1}^{n} f_i(x_i) for all x.

For example, let (X_1, X_2, X_3) have the probability density function

    f(x) = \begin{cases} \frac{1}{4} & for (x_1, x_2, x_3) \in C \\ 0 & otherwise, \end{cases}

where

    C = \{(1,0,0), (0,1,0), (0,0,1), (1,1,0)\}.
Example 41 Let X_1, X_2, X_3 be IID with pdf

    f(x) = 2x \cdot 1_{(0,1)}(x).

1. What is the distribution of Y = \max(X_1, X_2, X_3)?

Solution. (distribution function method) For 0 < y < 1,

    G(y) = P(Y \le y)
         = P(X_1 \le y, X_2 \le y, X_3 \le y)
         = P(X_1 \le y) P(X_2 \le y) P(X_3 \le y)
         = [P(X_1 \le y)]^3
         = \Big[ \int_0^y 2x\,dx \Big]^3 = y^6.

This is equal to

    G(y) = \begin{cases} 0 & y \le 0 \\ y^6 & 0 < y < 1 \\ 1 & y \ge 1. \end{cases}

So, the pdf of Y is

    g(y) = 6y^5 \cdot 1_{(0,1)}(y).
2. What is the distribution of Z = \min(X_1, X_2, X_3)?

Solution. (distribution function method) For 0 < z < 1,

    G(z) = P(Z \le z)
         = P(\min(X_1, X_2, X_3) \le z)
         = 1 - P(\min(X_1, X_2, X_3) > z)
         = 1 - [P(X_1 > z)]^3
         = 1 - (1 - z^2)^3.

This is equal to

    G(z) = \begin{cases} 0 & z \le 0 \\ 1 - (1 - z^2)^3 & 0 < z < 1 \\ 1 & z \ge 1. \end{cases}

So, the pdf of Z is

    g(z) = 6z(1 - z^2)^2 \cdot 1_{(0,1)}(z).
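The distribution function G(y) = y^6 of the maximum in Example 41 can be checked by simulation. The sketch below (Python, stdlib only) draws X = sqrt(U) so that X has the pdf 2x on (0, 1) by the inverse-CDF method; the sample size and test point y = 0.8 are arbitrary illustrative choices.

```python
import random

random.seed(0)
N = 200_000
y0 = 0.8
count_max = 0
for _ in range(N):
    # F(x) = x^2 on (0,1), so X = sqrt(U) has pdf f(x) = 2x.
    xs = [random.random() ** 0.5 for _ in range(3)]
    if max(xs) <= y0:
        count_max += 1
emp = count_max / N
# G(0.8) = 0.8^6 = 0.262144
assert abs(emp - y0 ** 6) < 0.01
print(f"empirical G(0.8) = {emp:.4f}")
```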
2.7.5  The Bivariate Normal Distribution

We say that (X, Y)^T has the bivariate normal distribution, denoted by (X, Y)^T \sim N_2(\mu, \Sigma) with

    \mu = (\mu_1, \mu_2)^T \quad and \quad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \text{sym.} & \sigma_2^2 \end{pmatrix},

if

    f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\Big[ -\frac{1}{2(1-\rho^2)} Q(x, y) \Big],

where

    Q(x, y) = \Big( \frac{x-\mu_1}{\sigma_1} \Big)^2 - 2\rho\Big( \frac{x-\mu_1}{\sigma_1} \Big)\Big( \frac{y-\mu_2}{\sigma_2} \Big) + \Big( \frac{y-\mu_2}{\sigma_2} \Big)^2.
We check that \int\!\!\int f(x, y)\,dx\,dy = 1. First compute the marginal

    f_X(x) = \int f(x, y)\,dy.

Let

    w = \frac{y - \mu_2}{\sigma_2}.

Then

    f_X(x) = \frac{1}{2\pi\sigma_1\sqrt{1-\rho^2}} \int \exp\Big[ -\frac{1}{2(1-\rho^2)} \Big\{ \Big( \frac{x-\mu_1}{\sigma_1} \Big)^2 - 2\rho\Big( \frac{x-\mu_1}{\sigma_1} \Big) w + w^2 \Big\} \Big]\,dw.

Completing the square in w,

    f_X(x) = \frac{\exp\Big[ -\frac{1}{2}\Big( \frac{x-\mu_1}{\sigma_1} \Big)^2 \Big]}{2\pi\sigma_1\sqrt{1-\rho^2}} \int \exp\Big[ -\frac{1}{2(1-\rho^2)} \Big( w - \rho\,\frac{x-\mu_1}{\sigma_1} \Big)^2 \Big]\,dw.

That is, since the integral equals \sqrt{2\pi(1-\rho^2)},

    f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\Big[ -\frac{1}{2}\Big( \frac{x-\mu_1}{\sigma_1} \Big)^2 \Big].

Therefore X \sim N(\mu_1, \sigma_1^2). Therefore

    \int\!\!\int f(x, y)\,dx\,dy = \int f_1(x)\,dx = 1.
Result Suppose that (X, Y)^T \sim N_2(\mu, \Sigma). Then

1. X \sim N(\mu_1, \sigma_1^2); \ Y \sim N(\mu_2, \sigma_2^2).
2. Y|_{X=x} \sim N\Big( \mu_2 + \frac{\rho\sigma_2}{\sigma_1}(x - \mu_1), \ \sigma_2^2(1 - \rho^2) \Big).

Solution. Compute

    f(y|x) = \frac{f(x, y)}{f_X(x)}.
Result Suppose that (X, Y)^T \sim N_2(\mu, \Sigma). Then

1. The mgf is

    M_{(X,Y)}(t_1, t_2) = \exp\Big\{ \mu_1 t_1 + \mu_2 t_2 + \frac{1}{2}\big( \sigma_1^2 t_1^2 + 2\rho\sigma_1\sigma_2 t_1 t_2 + \sigma_2^2 t_2^2 \big) \Big\}.
2. In matrix form,

    f(x) = |2\pi\Sigma|^{-1/2} \exp\Big\{ -\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \Big\},

with

    |\Sigma| = \sigma_1^2\sigma_2^2(1 - \rho^2), \qquad
    \Sigma^{-1} = \frac{1}{1-\rho^2} \begin{pmatrix} \frac{1}{\sigma_1^2} & -\frac{\rho}{\sigma_1\sigma_2} \\ \text{sym.} & \frac{1}{\sigma_2^2} \end{pmatrix},

and

    M(t) = \exp\Big( \mu^T t + \frac{1}{2} t^T \Sigma t \Big), \qquad X \sim VZ + \mu.

Here, Z \sim N_2(0, I_2), V^2 = \Sigma, V is symmetric.
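A bivariate normal pair can be simulated and its moments checked against \Sigma. The Python sketch below uses the conditional (triangular) construction Y = \mu_2 + \sigma_2(\rho Z_1 + \sqrt{1-\rho^2} Z_2) rather than the symmetric square root V of the text; both yield N_2(\mu, \Sigma). The parameter values are arbitrary illustrations.

```python
import random, math

random.seed(1)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 2.0, 0.5, 0.6
N = 100_000
xs, ys = [], []
for _ in range(N):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x = mu1 + s1 * z1
    y = mu2 + s2 * (rho * z1 + math.sqrt(1 - rho * rho) * z2)
    xs.append(x)
    ys.append(y)

mx, my = sum(xs) / N, sum(ys) / N
cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / N
vx = sum((a - mx) ** 2 for a in xs) / N
vy = sum((b - my) ** 2 for b in ys) / N
corr = cov / math.sqrt(vx * vy)
assert abs(corr - rho) < 0.02           # Corr(X, Y) = rho
assert abs(cov - rho * s1 * s2) < 0.05  # Cov(X, Y) = rho*sigma1*sigma2
print(f"sample corr = {corr:.3f}")
```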
Proof.

    M_{(X,Y)}(t_1, t_2) = E\exp\{t_1 X + t_2 Y\} = E[E[\exp\{t_1 X + t_2 Y\} | X]].

Since Y|_{X=x} is normal, we get

    E[\exp\{t_1 X + t_2 Y\} | X = x]
        = \exp(t_1 x)\,E[\exp\{t_2 Y\} | X = x]
        = \exp(t_1 x) \exp\Big( \Big\{ \mu_2 + \frac{\rho\sigma_2}{\sigma_1}(x - \mu_1) \Big\} t_2 + \frac{1}{2} t_2^2 (1-\rho^2)\sigma_2^2 \Big).

That is,

    E[\exp\{t_1 X + t_2 Y\} | X = x]
        = \exp\Big\{ (t_1\sigma_1 + \rho\sigma_2 t_2)\,\frac{x - \mu_1}{\sigma_1} \Big\}
          \exp\Big\{ t_1\mu_1 + t_2\mu_2 + \frac{1}{2}(1-\rho^2)\sigma_2^2 t_2^2 \Big\}.

Therefore, with Z = (X - \mu_1)/\sigma_1 \sim N(0, 1),

    M_{(X,Y)}(t_1, t_2)
        = E\Big[ \exp\Big\{ (t_1\sigma_1 + \rho\sigma_2 t_2)\,\frac{X - \mu_1}{\sigma_1} \Big\} \Big]
          \exp\Big\{ t_1\mu_1 + t_2\mu_2 + \frac{1}{2}(1-\rho^2)\sigma_2^2 t_2^2 \Big\}
        = M_Z(t_1\sigma_1 + \rho\sigma_2 t_2) \exp\Big\{ t_1\mu_1 + t_2\mu_2 + \frac{1}{2}(1-\rho^2)\sigma_2^2 t_2^2 \Big\}
        = \exp\Big\{ \frac{1}{2}(t_1\sigma_1 + \rho\sigma_2 t_2)^2 \Big\} \exp\Big\{ t_1\mu_1 + t_2\mu_2 + \frac{1}{2}(1-\rho^2)\sigma_2^2 t_2^2 \Big\}
        = \exp\Big\{ t_1\mu_1 + t_2\mu_2 + \frac{1}{2}\big( \sigma_1^2 t_1^2 + 2\rho\sigma_1\sigma_2 t_1 t_2 + \sigma_2^2 t_2^2 \big) \Big\}
        = \exp\Big\{ t^T\mu + \frac{1}{2} t^T \Sigma t \Big\}.

Since X \sim VZ + \mu,

    M_X(t) = E\exp(t^T X)
           = E\exp(t^T V Z + t^T \mu)
           = \exp(t^T \mu)\,M_Z(V^T t)
           = \exp(t^T \mu)\exp\Big\{ \frac{1}{2}(V^T t)^T (V^T t) \Big\}
           = \exp\Big( t^T \mu + \frac{1}{2} t^T \Sigma t \Big).
Example Suppose that (X, Y)^T \sim N_2(\mu, \Sigma) and let

    (W_1, W_2) = \Big( \frac{X - \mu_1}{\sigma_1}, \ \frac{Y - \mu_2}{\sigma_2} \Big).

Then

    (W_1, W_2)^T \sim N_2\Big( 0, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \Big).

Solution.

    M_{(W_1,W_2)}(t_1, t_2)
        = E\exp(t_1 W_1 + t_2 W_2)
        = E\exp\Big( t_1\,\frac{X - \mu_1}{\sigma_1} + t_2\,\frac{Y - \mu_2}{\sigma_2} \Big)
        = \exp\Big( -\frac{t_1\mu_1}{\sigma_1} - \frac{t_2\mu_2}{\sigma_2} \Big)\,E\exp\Big( \frac{t_1}{\sigma_1} X + \frac{t_2}{\sigma_2} Y \Big)
        = \exp\Big( -\frac{t_1\mu_1}{\sigma_1} - \frac{t_2\mu_2}{\sigma_2} \Big)\,M_{(X,Y)}\Big( \frac{t_1}{\sigma_1}, \frac{t_2}{\sigma_2} \Big)
        = \exp\Big( -\frac{t_1\mu_1}{\sigma_1} - \frac{t_2\mu_2}{\sigma_2} \Big)
          \exp\Big[ \frac{t_1\mu_1}{\sigma_1} + \frac{t_2\mu_2}{\sigma_2}
          + \frac{1}{2}\Big\{ \sigma_1^2\Big( \frac{t_1}{\sigma_1} \Big)^2 + 2\rho\sigma_1\sigma_2\Big( \frac{t_1}{\sigma_1} \Big)\Big( \frac{t_2}{\sigma_2} \Big) + \sigma_2^2\Big( \frac{t_2}{\sigma_2} \Big)^2 \Big\} \Big]
        = \exp\Big[ \frac{1}{2}\big( t_1^2 + 2\rho t_1 t_2 + t_2^2 \big) \Big]
        = \exp\Big[ \frac{1}{2}\,t^T \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} t \Big].
Result 4 Suppose that (X, Y)^T \sim N_2(\mu, \Sigma). Then

    EX = \mu_1, \ EY = \mu_2, \qquad V(X) = \sigma_1^2, \ V(Y) = \sigma_2^2,
    Cov(X, Y) = \rho\sigma_1\sigma_2, \qquad Corr(X, Y) = \rho.

Solution.
1.

    E(XY) = \frac{\partial^2}{\partial t_1 \partial t_2} M_{(X,Y)}(t_1, t_2) \Big|_{t_1 = t_2 = 0}
          = \rho\sigma_1\sigma_2 + \mu_1\mu_2.

2.

    E\Big[ \frac{X - \mu_1}{\sigma_1} \cdot \frac{Y - \mu_2}{\sigma_2} \Big]
        = \frac{\partial^2}{\partial t_1 \partial t_2} \exp\Big[ \frac{1}{2}\big( t_1^2 + 2\rho t_1 t_2 + t_2^2 \big) \Big] \Big|_{t_1 = t_2 = 0}
        = \rho.

Result Suppose that (X, Y)^T \sim N_2(\mu, \Sigma). Then if \rho = Corr(X, Y) = 0, X and Y are independent.
2.8
2.8.1
For a sequence \{x_n\} of real numbers, we know the meaning of

    x_n \to x, \quad n \to \infty.

Now, we consider a sequence \{X_n\} of random variables and a random variable X. What is the meaning of

    X_n \to X, \quad n \to \infty?

Recall that a random variable X is a function defined on the sample space \Omega. One can consider the pointwise convergence

    X_n(\omega) \to X(\omega), \quad \omega \in \Omega.

However, this convergence is not very useful in statistics and probability. We introduce a few modes of convergence of a sequence of random variables that are commonly used in statistics and probability.
Definition 43
1. A sequence \{X_n\} of random variables converges with probability one to a random variable X if

    P(X_n(\omega) \to X(\omega)) = 1.

We denote this convergence by

    X_n \to_{wp1} X \quad or \quad X_n \to_{a.s.} X.
2. A sequence \{X_n\} of random variables converges in probability to a random variable X if for each \epsilon > 0,

    \lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0.

We denote this convergence by X_n \to_P X.
Theorem 48
1. If X_n converges in probability to X, then X_n converges in distribution to X.
2. The converse of the above statement is not true in general. However, if X is a constant, then the converse is also true. That is,

    X_n \to_P c \iff X_n \to_D c.

Example 42 Let Z_1, \ldots, Z_n be IID N(0, 1). Consider the sample mean

    \bar{Z}_n := \frac{Z_1 + \cdots + Z_n}{n}.

Then

    \bar{Z}_n \to_D 0,

and therefore

    \bar{Z}_n \to_P 0.

Solution. Note that

    \bar{Z}_n \sim N\Big( 0, \frac{1}{n} \Big).
For fixed x,

    F_n(x) = \int_{-\infty}^{x} \frac{\sqrt{n}}{\sqrt{2\pi}}\,e^{-\frac{nt^2}{2}}\,dt
           = \int_{-\infty}^{\sqrt{n}\,x} \frac{1}{\sqrt{2\pi}}\,e^{-\frac{v^2}{2}}\,dv, \quad \sqrt{n}\,t = v.

Therefore

    F_n(x) \to \begin{cases} 0 & x < 0 \\ \frac{1}{2} & x = 0 \\ 1 & x > 0. \end{cases}

That is, if

    F(x) = \begin{cases} 0 & x < 0 \\ 1 & x \ge 0, \end{cases}

then for x \ne 0 we get

    F_n(x) \to F(x).
2.8.2

Theorem (Central limit theorem) Let X_1, \ldots, X_n be IID with mean \mu and variance \sigma^2 \in (0, \infty). Then

    \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \to_D Z \sim N(0, 1), \quad or \quad \sqrt{n}(\bar{X}_n - \mu) \to_D N(0, \sigma^2).

Remark 18 The proof of the central limit theorem needs more advanced tools.
The following theorem is useful.

Theorem 50 (Slutsky theorem) Let X_n \to_D X and V_n \to_P c. Then
1. X_n + V_n \to_D X + c.
2. X_n V_n \to_D cX.

Example 43 Let Z_1, \ldots, Z_n be IID N(0, 1) and let

    V_n = Z_1^2 + \cdots + Z_n^2.

Let Z \sim N(0, 1), with V_n and Z independent. Let

    T_n = \frac{Z}{\sqrt{V_n/n}};

then

    T_n \to_D N(0, 1).
Proof. By the weak law of large numbers,

    \frac{V_n}{n} \to_P E Z_1^2 = 1.

Therefore

    \sqrt{\frac{V_n}{n}} \to_P 1,

and the result follows from the Slutsky theorem.
2.8.3  Exercises

Prove that

    \sqrt{n}(\bar{X}_n - \mu) \to_D Z \sim N(0, 1),

and that

    h(Y_n) \to_P h(\eta).

(g) Prove that, with \hat{p} = \frac{1}{n} \sum_{i=1}^{n} X_i,
Chapter 3

Basic thinking in Statistics

3.1  Sampling Theory

3.1.1  Descriptive statistics

3.1.2  Sampling theory

Consider the random variables X_1, \ldots, X_n having the joint probability density function

    f(x_1, \ldots, x_n; \theta)

that intrinsically contains an unknown parameter \theta.

In statistical inference, one tries to infer the unknown parameter \theta based on random variables X_1, \ldots, X_n that can be obtained as a model of a statistical experiment.
Example 44 Suppose that X_1, \ldots, X_n is a sequence of independent and identically B(1, p), (0 < p < 1), distributed random variables. Then the joint probability density function is

    f(x_1, \ldots, x_n; p) = f(x_1; p) \cdots f(x_n; p)
                           = p^{x_1}(1-p)^{1-x_1} \cdots p^{x_n}(1-p)^{1-x_n}.

Suppose that we want to determine E(X_1 + \cdots + X_n); then we see that

    E(X_1 + \cdots + X_n) = \sum_{x_1=0}^{1} \cdots \sum_{x_n=0}^{1} (x_1 + \cdots + x_n) f(x_1, \ldots, x_n; p) = np,

and this is not a known number unless p is a known number.
Example 45 Suppose that X_1, \ldots, X_n is a sequence of independent and identically N(\theta, 1), \theta \in IR, distributed random variables. Then the joint probability density function is

    f(x_1, \ldots, x_n; \theta) = f(x_1; \theta) \cdots f(x_n; \theta)
        = \frac{1}{\sqrt{2\pi}} \exp\Big\{ -\frac{(x_1 - \theta)^2}{2} \Big\} \cdots \frac{1}{\sqrt{2\pi}} \exp\Big\{ -\frac{(x_n - \theta)^2}{2} \Big\}.

Suppose that we want to determine E(X_1 + \cdots + X_n); then we see that

    E(X_1 + \cdots + X_n) = n\theta.
The unknown constant that we wish to know, such as p or \theta in the previous examples, is called a parameter. The only fact that we know about p is that p is a constant that belongs to the interval [0, 1], or (0, 1). Similarly, the only fact that we know about \theta is that \theta is a constant that belongs to the interval (-\infty, \infty). We say that the set that contains the parameter is a parameter space. In general, if \theta is a parameter and \Theta is the allowable set containing \theta, then \Theta is called a parameter space.
We need the definition of a statistic based on the random variables X_1, \ldots, X_n whose joint probability density function

    f(x_1, \ldots, x_n; \theta)

is determined by a parameter \theta.

Definition 45 Suppose that X_1, \ldots, X_n are random variables with the joint probability density function f(x_1, \ldots, x_n; \theta), which is determined by a parameter \theta. Then we say that T(X_1, \ldots, X_n) is a statistic if it is a function of X_1, \ldots, X_n which does not depend on the unknown parameter \theta. That is, a statistic is an observable random variable which is used for statistical inference about the unknown parameter \theta.

We know the meaning of saying that X_1, \ldots, X_n is IID f(x). That is, X_1, \ldots, X_n is a sequence of independent and identically f(x) = f(x; \theta), \theta \in \Theta, distributed random variables. In this case the joint probability density function is given by

    f(x_1, \ldots, x_n; \theta) = f(x_1; \theta) \cdots f(x_n; \theta).

We use the following definition of a random sample in statistical inference.

Definition 46 We say that X_1, \ldots, X_n is a random sample from a population with the probability density function f(x; \theta) if and only if X_1, \ldots, X_n is IID f(x; \theta).
Example 46 Let X_1, \ldots, X_n be a random sample from a population with mean \mu and variance \sigma^2. Then

    \bar{X} = \bar{X}_n := \frac{X_1 + \cdots + X_n}{n}

is called the sample mean and

    S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2

is the sample variance. These are statistics that will be used for statistical inference about \mu and \sigma^2. The statistic S_n^2,

    S_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2,

is also used. For a normal population,

    \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).
3.1.3
Order statistics
Note that, if the X_i are of continuous type then, with probability 1, X_{(1)} < \cdots < X_{(n)}. Assume that the X_i are continuous for the rest of the section.

Result 6 Let X_1, \ldots, X_n be a random sample from a distribution with probability density function f(x)I(x \in A). Then, the joint probability density function of Y_1 = X_{(1)}, \ldots, Y_n = X_{(n)} is

    g(y_1, \ldots, y_n) = n!\,f(y_1) \cdots f(y_n), \quad y_1 < \cdots < y_n, \ y_i \in A.

Result 7 (Marginal distribution of X_{(k)}, 1 \le k \le n)
Under the conditions in Result 6, the probability density function of X_{(k)} is

    g_k(y) = \begin{cases} \frac{n!}{(k-1)!(n-k)!} [F(y)]^{k-1} [1 - F(y)]^{n-k} f(y) & y \in A \\ 0 & otherwise. \end{cases}
The joint probability density function of (X_{(i)}, X_{(j)}) is denoted by

    g_{ij}(x, y) = \ldots
2. Let Y = F^{-1}(U). Then,

    F_Y(y) := P[F^{-1}(U) \le y] = P[U \le F(y)] = F(y).

Therefore, Y = F^{-1}(U) \sim X.

Remark 20
1. We can drop the assumption that F is strictly increasing by defining

    F^{-1}(y) := \inf\{x : F(x) \ge y\}.

2. The second part of the theorem is used to simulate random variables with distribution function F, whereas the first part is used in the study of order statistics.
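The simulation use of the second part can be sketched in a few lines of Python: to simulate an exponential random variable with rate lam, invert F(x) = 1 - e^{-lam x} to get F^{-1}(u) = -ln(1-u)/lam. The rate and sample size below are illustrative choices.

```python
import random, math

# Inverse-CDF (probability integral transform) sampling:
# F(x) = 1 - exp(-lam*x)  =>  F^{-1}(u) = -ln(1-u)/lam
random.seed(2)
lam = 2.0
N = 100_000
sample = [-math.log(1 - random.random()) / lam for _ in range(N)]

mean = sum(sample) / N
assert abs(mean - 1 / lam) < 0.01            # E(X) = 1/lam = 0.5
emp_cdf_1 = sum(x <= 1.0 for x in sample) / N
assert abs(emp_cdf_1 - (1 - math.exp(-lam))) < 0.01
print(f"sample mean = {mean:.3f}")
```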
Example 48 Let U_1, \ldots, U_n be a random sample from U(0, 1). Then U_{(k)} \sim Beta(k, n-k+1), (1 \le k \le n).

Proof. The U_{(k)} has probability density function

    g_k(y) = \frac{n!}{(k-1)!(n-k)!}\,y^{k-1}(1-y)^{n-k}, \quad 0 < y < 1.
Example 49 Let X_1, \ldots, X_n be a random sample from a distribution with the distribution function F. Suppose that F is continuous and strictly increasing. Then F(X_{(k)}) \sim Beta(k, n-k+1).

Proof. Note that U_i = F(X_i), i = 1, \ldots, n, are IID U(0, 1) by the PIT. Moreover, F is increasing, hence it preserves order: U_{(j)} = F(X_{(j)}), 1 \le j \le n. Therefore, F(X_{(k)}) \sim Beta(k, n-k+1).
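The Beta(k, n-k+1) law of U_{(k)} implies E[U_{(k)}] = k/(n+1), which a quick Monte Carlo check confirms. This Python sketch uses arbitrary illustrative values n = 5, k = 2.

```python
import random

random.seed(3)
n, k = 5, 2
N = 100_000
tot = 0.0
for _ in range(N):
    u = sorted(random.random() for _ in range(n))
    tot += u[k - 1]                       # U_(k), the k-th order statistic
mean = tot / N
# Beta(2, 4) has mean k/(n+1) = 2/6 = 1/3
assert abs(mean - k / (n + 1)) < 0.005
print(f"E[U_(2)] approx {mean:.4f}")
```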
Example 50 Let U_{(1)} < \cdots < U_{(n)} be the order statistics based on a random sample U_1, \ldots, U_n from U(0, 1). Then, prove that

    \frac{U_{(1)}}{U_{(2)}}, \ \Big( \frac{U_{(2)}}{U_{(3)}} \Big)^2, \ \ldots, \ \Big( \frac{U_{(n-1)}}{U_{(n)}} \Big)^{n-1}, \ \big( U_{(n)} \big)^n

are IID U(0, 1).
3.1.4

Denote \chi^2 \sim \chi^2(n).

Remark 21 Note that the random variable \chi^2 has the probability density function

    f(x) = \frac{1}{\Gamma(\frac{n}{2})\,2^{\frac{n}{2}}}\,x^{\frac{n}{2}-1} e^{-\frac{x}{2}}\,I(x > 0),

and, hence, \chi^2(n) = \Gamma\big( \frac{n}{2}, 2 \big).
Theorem 52 (Fundamental results of normal sampling) Suppose that X_1, \ldots, X_n are IID N(\mu, \sigma^2). Write

    \bar{X} = \frac{X_1 + \cdots + X_n}{n}, \qquad S^2 = \frac{(X_1 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n-1}.

Then,

1. \bar{X} \sim N\Big( \mu, \frac{\sigma^2}{n} \Big), \quad or \quad \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).
2. \frac{(n-1)S^2}{\sigma^2} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{\sigma^2} \sim \chi^2(n-1).
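Part 2 of Theorem 52 can be checked by simulation: (n-1)S^2/\sigma^2 should average to n-1, the mean of \chi^2(n-1). A minimal Python sketch, with arbitrary illustrative parameters:

```python
import random, statistics

random.seed(4)
mu, sigma, n = 10.0, 3.0, 8
N = 50_000
tot = 0.0
for _ in range(N):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    s2 = statistics.variance(x)           # sample variance, divisor n-1
    tot += (n - 1) * s2 / sigma ** 2
mean = tot / N
# chi-square(n-1) has mean n-1 = 7
assert abs(mean - (n - 1)) < 0.1
print(f"mean of (n-1)S^2/sigma^2 approx {mean:.3f}")
```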
Definition 49 Suppose that Z \sim N(0, 1), V \sim \chi^2(n), and Z, V are independent. Then, T has the (Student's) t distribution with n degrees of freedom if

    T = \frac{Z}{\sqrt{V/n}},

and this is denoted by T \sim t(n).

Remark 22
1. If T \sim t(n), then T has the probability density function

    f(x) = \frac{\Gamma(\frac{n+1}{2})}{\sqrt{n\pi}\,\Gamma(\frac{n}{2})}\Big( 1 + \frac{x^2}{n} \Big)^{-\frac{n+1}{2}}, \quad x \in IR.

2. If T \sim t(n) then -T \sim t(n). That is, the distribution of T is symmetric with respect to 0.
Example 51 Let X_1, \ldots, X_n be IID N(\mu, \sigma^2), (\mu, \sigma^2) \in IR \times (0, \infty). Let

    S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2.

Then

    \frac{\bar{X}_n - \mu}{S/\sqrt{n}} \sim t(n-1).

Therefore we have

    P\Big( \bar{X}_n - t_{\alpha/2}(n-1)\,\frac{S}{\sqrt{n}} < \mu < \bar{X}_n + t_{\alpha/2}(n-1)\,\frac{S}{\sqrt{n}} \Big) = 1 - \alpha.
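The t interval of Example 51 can be computed directly. The Python sketch below uses hypothetical data and the tabled quantile t_{0.025}(7) = 2.365 (the standard library has no t quantile function, so the value is hard-coded from a t table).

```python
import statistics, math

# hypothetical observed sample (n = 8)
data = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9]
n = len(data)
xbar = statistics.mean(data)
s = statistics.stdev(data)                # divisor n-1, matching S above
t = 2.365                                 # t_{0.025}(7), from a t table
half = t * s / math.sqrt(n)
lo, hi = xbar - half, xbar + half
assert lo < xbar < hi
print(f"95% CI for mu: ({lo:.3f}, {hi:.3f})")
```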
The F(m, n) distribution has probability density function

    f(x) = \frac{\Gamma(\frac{m+n}{2})}{\Gamma(\frac{m}{2})\Gamma(\frac{n}{2})}\Big( \frac{m}{n} \Big)^{\frac{m}{2}} x^{\frac{m}{2}-1}\Big( 1 + \frac{mx}{n} \Big)^{-\frac{m+n}{2}}, \quad x > 0.

For two independent normal samples X_1, \ldots, X_m and Y_1, \ldots, Y_n, write

    S_1^2 = \frac{1}{m-1} \sum_{i=1}^{m} (X_i - \bar{X}_m)^2
and

    S_2^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y}_n)^2.

Then

    \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}
        = \frac{\sum_{i=1}^{m} \big( \frac{X_i - \bar{X}}{\sigma_1} \big)^2 / (m-1)}{\sum_{i=1}^{n} \big( \frac{Y_i - \bar{Y}}{\sigma_2} \big)^2 / (n-1)} \sim F(m-1, n-1).

Therefore we get

    P\Big( \frac{S_2^2}{S_1^2}\,F_{1-\alpha/2}(m-1, n-1) < \frac{\sigma_2^2}{\sigma_1^2} < \frac{S_2^2}{S_1^2}\,F_{\alpha/2}(m-1, n-1) \Big) = 1 - \alpha.
3.1.5  Exercises

For n = 3,

    g_2(y) = \begin{cases} \frac{3!}{1!\,1!}\,F(y)[1 - F(y)]f(y) & y \in A \\ 0 & otherwise. \end{cases}
4. Let X_{(1)} < \cdots < X_{(n)} be the order statistics from E(\lambda), \lambda > 0. Then, nX_{(1)}, (n-1)(X_{(2)} - X_{(1)}), \ldots, (X_{(n)} - X_{(n-1)}) are IID E(\lambda). Use the change of variable technique to prove this result.

Hint. Note that

(a) If we let

    T: \quad Z_1 = nX_{(1)}, \quad Z_2 = (n-1)(X_{(2)} - X_{(1)}), \quad \ldots, \quad Z_n = (X_{(n)} - X_{(n-1)}),

(b) then

    X_{(1)} = \frac{1}{n} Z_1,
    X_{(2)} = \frac{1}{n-1} Z_2 + \frac{1}{n} Z_1,
    \vdots
    X_{(n)} = Z_n + \frac{1}{2} Z_{n-1} + \cdots + \frac{1}{n-1} Z_2 + \frac{1}{n} Z_1.

(c) Then, T is 1-1 from A := \{(x_1, \ldots, x_n) : 0 < x_1 < x_2 < \cdots < x_n\} onto B := \{(z_1, \ldots, z_n) : z_1 > 0, z_2 > 0, \ldots, z_n > 0\} with

    |J_{T^{-1}}| = \Big| \det \begin{pmatrix} \frac{1}{n} & 0 & \cdots & 0 \\ \frac{1}{n} & \frac{1}{n-1} & \cdots & 0 \\ \vdots & & \ddots & \\ \frac{1}{n} & \frac{1}{n-1} & \cdots & 1 \end{pmatrix} \Big| = \frac{1}{n!}.

(d) Therefore the joint probability density function of (Z_1, \ldots, Z_n) is

    g(z_1, \ldots, z_n) = n! \prod_{i=1}^{n} \lambda e^{-\lambda x_i} \cdot \frac{1}{n!}
                        = \lambda^n \exp\Big[ -\lambda \sum_{i=1}^{n} z_i \Big] \prod_{i=1}^{n} I(z_i > 0),

using \sum_{i=1}^{n} x_{(i)} = \sum_{i=1}^{n} z_i.
5. Suppose U_1, \ldots, U_n is a random sample from U(0, 1). Find the probability density function of the range R = U_{(n)} - U_{(1)}.

6. Suppose U_1, \ldots, U_n is a random sample from U(0, 1). Find the joint probability density function of

    X_1 = U_{(1)}, \ X_2 = U_{(2)} - U_{(1)}, \ \ldots, \ X_n = U_{(n)} - U_{(n-1)}, \ X_{n+1} = 1 - U_{(n)}.

7. Let X_1, \ldots, X_n be a random sample from a distribution with the distribution function F. Suppose that F is continuous (and strictly increasing). Let Y_1 = \min_{1 \le i \le n} X_i and Y_n = \max_{1 \le i \le n} X_i.
(a) What are the joint and marginal distributions of Z_1 = F(Y_1) and Z_n = F(Y_n)?
(b) Prove that
Z1
Zn
3.2  Point estimation

3.2.1

3. If we use \hat{\theta} to infer the unknown parameter \theta, then \hat{\theta} can be an estimator or an estimate. In this case we have to understand the meaning from the context.
Example 53 Suppose that X_1, \ldots, X_n is a random sample from B(1, p), 0 < p < 1. In this case, p is the population success ratio.

1. The function of random variables

    \hat{p} = \hat{p}(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} X_i

is a point estimator of p.

2. Suppose that n = 3 and the observed values are 0, 1, 1. In this case,

    \hat{p} = \hat{p}(0, 1, 1) = \frac{0 + 1 + 1}{3} = \frac{2}{3}

is a point estimate of p.
Example 54 Suppose that X_1, \ldots, X_n is a random sample from N(\mu, 1), \mu \in IR. In this case, \mu is the population mean and the population median.

Firstly, we use the sample mean.

1. The function of random variables

    \hat{\mu} = \hat{\mu}(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} X_i

is a point estimator of \mu and it is the sample mean.

2. Suppose that n = 3 and the observed values are -3, -2, 5. In this case,

    \hat{\mu} = \hat{\mu}(-3, -2, 5) = \frac{-3 - 2 + 5}{3} = 0

is a point estimate of \mu.

Secondly, we use the sample median.

1. The function of random variables

    \hat{\mu} = \hat{\mu}(X_1, \ldots, X_n) = \mathrm{median}_{1 \le i \le n}\,X_i

is a point estimator of \mu and it is the sample median.

2. Suppose that n = 3 and the observed values are -3, -2, 5. In this case,

    \hat{\mu} = \hat{\mu}(-3, -2, 5) = -2

is a point estimate of \mu.
There are many other estimators to infer an unknown parameter. Classical point estimators include the method of moment estimator, the maximum likelihood estimator, and the Bayesian estimator. We introduce the method of moment estimator and the maximum likelihood estimator in this section.
Method of moment estimator

Definition 52 (Method of moment estimator) Let \theta = g(m_1, m_2, \ldots, m_k) be a function of the population moments m_r = E X^r, r \ge 1. Then, the method of moment estimator of \theta is

    \hat{\theta}_{MME} := g(\bar{X}, \bar{X_2}, \ldots, \bar{X_k}),

where \bar{X_r} = \frac{1}{n} \sum_{i=1}^{n} X_i^r are the sample moments.

Properties 3
1. \hat{\theta}_{MME} \to_P \theta if g is continuous.

For a random sample from B(1, p), 0 < p < 1:

1. The sample mean

    \hat{p}_{MME} = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}

is the method of moment estimator of p.
2. The statistic

    \widehat{p(1-p)}_{MME} = \frac{1}{n} \sum_{i=1}^{n} \big( X_i - \bar{X} \big)^2

is the method of moment estimator of p(1-p). In this case, by the property

    X_i^2 = X_i

of the Bernoulli random variable, we see that

    \widehat{p(1-p)}_{MME} = \frac{1}{n} \sum_{i=1}^{n} \big( X_i - \bar{X} \big)^2
        = \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \bar{X}^2
        = \bar{X} - \bar{X}^2
        = \bar{X}\big( 1 - \bar{X} \big)
        = \hat{p}_{MME}\big( 1 - \hat{p}_{MME} \big).
3. Note that p^2 = (E X_1)^2 is a function of the population mean. Therefore the function of sample moments

    \widehat{p^2}_{MME} = \Big( \frac{1}{n} \sum_{i=1}^{n} X_i \Big)^2 = \big( \hat{p}_{MME} \big)^2

is the method of moment estimator of p^2.
For a random sample from the gamma distribution with probability density function

    f(x; \alpha, \beta) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\,x^{\alpha-1} e^{-x/\beta}\,I(x > 0),

the population mean and the population variance are given by

    \mu = \alpha\beta \quad and \quad \sigma^2 = \alpha\beta^2.

Therefore we have

    \beta = \sigma^2/\mu \quad and \quad \alpha = \mu^2/\sigma^2.

Hence

    \hat{\alpha}_{MME} = \bar{X}^2 \Big/ \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2
    \quad and \quad
    \hat{\beta}_{MME} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 \Big/ \bar{X}.
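The gamma method of moment estimators can be tested on simulated data. This Python sketch draws from Gamma(alpha = 3, beta = 2) (arbitrary illustrative values) and applies the two formulas above; with a large sample the estimates should land near the true parameters by the consistency property.

```python
import random, statistics

random.seed(5)
alpha, beta = 3.0, 2.0                    # shape and scale: mean 6, variance 12
data = [random.gammavariate(alpha, beta) for _ in range(100_000)]

xbar = statistics.fmean(data)
s2 = statistics.pvariance(data)           # divisor n, matching the MME formula
alpha_mme = xbar ** 2 / s2                # alpha-hat = Xbar^2 / sigma-hat^2
beta_mme = s2 / xbar                      # beta-hat  = sigma-hat^2 / Xbar

assert abs(alpha_mme - alpha) < 0.15
assert abs(beta_mme - beta) < 0.15
print(f"alpha_mme = {alpha_mme:.3f}, beta_mme = {beta_mme:.3f}")
```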
Maximum likelihood estimator

If, for observed data x_1, \ldots, x_n,

    \prod_{i=1}^{n} f(x_i; \theta_1) > \prod_{i=1}^{n} f(x_i; \theta_2),

then \theta_1 is more plausible than \theta_2. This leads to the likelihood function

    L(\theta) := \prod_{i=1}^{n} f(x_i; \theta),

regarded as a function of \theta, and the log-likelihood

    l(\theta) := \log L(\theta) = \sum_{i=1}^{n} \log f(x_i; \theta).
(b) The MLE may not be unique even when it exists.

4. Likelihood equation: \hat{\theta}_{MLE} is a solution of

    \dot{l}(\theta) = 0, \quad where \ \dot{l}(\theta) := \frac{d}{d\theta}\,l(\theta).

For a random sample from B(1, \theta),

    l(\theta) = \sum_{i=1}^{n} x_i \log\theta + \Big( n - \sum_{i=1}^{n} x_i \Big)\log(1 - \theta).

Therefore

    \dot{l}(\theta) = \frac{\sum_{i=1}^{n} x_i}{\theta} - \frac{n - \sum_{i=1}^{n} x_i}{1 - \theta},

    \ddot{l}(\theta) = -\frac{\sum_{i=1}^{n} x_i}{\theta^2} - \frac{n - \sum_{i=1}^{n} x_i}{(1 - \theta)^2} < 0,

so the log-likelihood is concave and

    \hat{\theta}_{MLE} = \frac{1}{n} \sum_{i=1}^{n} X_i.
For a random sample from the logistic distribution with probability density function f(x; \theta) = \frac{e^{-(x-\theta)}}{(1 + e^{-(x-\theta)})^2},

    \dot{l}(\theta) = n - 2\sum_{i=1}^{n} \frac{\exp(-(x_i - \theta))}{1 + \exp(-(x_i - \theta))},

    \ddot{l}(\theta) = -2\sum_{i=1}^{n} \frac{\exp(-(x_i - \theta))}{(1 + \exp(-(x_i - \theta)))^2} < 0.

Therefore \hat{\theta}_{MLE} is the unique solution of

    \sum_{i=1}^{n} \frac{\exp(-(x_i - \theta))}{1 + \exp(-(x_i - \theta))} = \frac{n}{2}.
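Since the likelihood equation above has no closed form, it must be solved numerically. Because the score is strictly decreasing (l'' < 0), bisection suffices. The Python sketch below generates hypothetical logistic data (theta = 1.5 is an arbitrary choice) via the inverse CDF and solves for the MLE; note that e^{-(x-theta)}/(1+e^{-(x-theta)}) = 1/(1+e^{x-theta}).

```python
import math, random

random.seed(6)
theta_true = 1.5
# logistic(theta) via inverse CDF: F^{-1}(u) = theta + ln(u/(1-u))
xs = [theta_true + math.log(u / (1 - u))
      for u in (random.random() for _ in range(20_000))]
n = len(xs)

def score(theta):
    # l'(theta) = n - 2 * sum 1/(1 + e^{x_i - theta}); decreasing in theta
    return n - 2 * sum(1 / (1 + math.exp(x - theta)) for x in xs)

lo, hi = min(xs), max(xs)                 # score(lo) > 0 > score(hi)
for _ in range(60):
    mid = (lo + hi) / 2
    if score(mid) > 0:
        lo = mid
    else:
        hi = mid
theta_mle = (lo + hi) / 2
assert abs(theta_mle - theta_true) < 0.1  # consistency, large n
print(f"theta_mle = {theta_mle:.3f}")
```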
Invariance: with \eta = g(\theta) one-to-one,

    \prod_{i=1}^{n} f(x_i; \theta) = \prod_{i=1}^{n} f(x_i; g^{-1}(\eta)),

and

    \hat{\eta}_{MLE} = g(\hat{\theta}_{MLE}), \quad since \ L(\hat{\theta}_{MLE}, x) = L(g^{-1}(g(\hat{\theta}_{MLE})), x).

For a random sample from the exponential distribution with mean \theta, put \lambda = 1/\theta. Then

    L(\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i}.
Then

    l(\lambda) = n\log\lambda - \lambda\sum_{i=1}^{n} x_i,

    \dot{l}(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i,

    \ddot{l}(\lambda) = -\frac{n}{\lambda^2} < 0.

Therefore

    \hat{\lambda}_{MLE} = \frac{n}{\sum_{i=1}^{n} x_i}.

So

    \hat{\theta}_{MLE} = \frac{1}{\hat{\lambda}_{MLE}} = \frac{1}{n}\sum_{i=1}^{n} x_i.
We review the weak law of large numbers, which is useful in verifying the consistency of an estimator.

Theorem 54 Let X_1, \ldots, X_n be a random sample from a population with mean \mu. Then the sample mean

    \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i

satisfies, for each \epsilon > 0,

    \lim_{n \to \infty} P\big( |\bar{X}_n - \mu| > \epsilon \big) = 0.
For a random sample from B(1, p), 0 < p < 1, the estimator

    \hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i

is
1. a method of moment estimator of p,
2. a maximum likelihood estimator of p,
3. an unbiased estimator of p, and
4. a consistent estimator of p.
Example 61 Suppose that X_1, \ldots, X_n is a random sample from N(\mu, 1), \mu \in IR. Then, the estimator

    \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i

is
1. a method of moment estimator of \mu,
2. a maximum likelihood estimator of \mu,
3. an unbiased estimator of \mu, and
4. a consistent estimator of \mu.
Example 62 Suppose that X_1, \ldots, X_n is a random sample from N(\mu, \sigma^2), \mu \in IR and \sigma^2 > 0. Then, the estimator

    \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i

is
1. a method of moment estimator of \mu,
2. a maximum likelihood estimator of \mu,
3. an unbiased estimator of \mu, and
4. a consistent estimator of \mu.
The estimator

    S_n^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2

is
1. a method of moment estimator of \sigma^2,
2. a maximum likelihood estimator of \sigma^2,
3. a consistent estimator of \sigma^2, and
4. not an unbiased estimator of \sigma^2. Indeed, we have

    S_n^2 = \frac{1}{n}\sum_{i=1}^{n} \big[ X_i - \mu \big]^2 - \big[ \bar{X}_n - \mu \big]^2,

so that

    E S_n^2 = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\,\sigma^2.
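The bias E S_n^2 = (n-1)\sigma^2/n is easy to exhibit by simulation, especially for small n. This Python sketch uses arbitrary illustrative parameters sigma^2 = 4 and n = 5, so the biased average should sit near 3.2 rather than 4.

```python
import random

random.seed(7)
sigma2, n, N = 4.0, 5, 100_000
tot = 0.0
for _ in range(N):
    x = [random.gauss(0.0, 2.0) for _ in range(n)]
    xbar = sum(x) / n
    tot += sum((xi - xbar) ** 2 for xi in x) / n   # S_n^2, divisor n
mean_sn2 = tot / N
# E[S_n^2] = (n-1)/n * sigma^2 = 4/5 * 4 = 3.2, not 4
assert abs(mean_sn2 - (n - 1) / n * sigma2) < 0.05
print(f"average S_n^2 = {mean_sn2:.3f} (true sigma^2 = {sigma2})")
```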
3.2.2  Interval estimation

Suppose that X_1, \ldots, X_n is a random sample from a normal population with mean \mu, so that

    \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).

Suppose that the population variance \sigma^2 is a known number. Let z_{\alpha/2} be the (\alpha/2)th upper quantile of the standard normal distribution. Then,

    P\Big( -z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2} \Big) = 1 - \alpha.

Therefore we get

    P\Big( \bar{X} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \Big) = 1 - \alpha.

The random interval

    \Big( \bar{X} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}, \ \bar{X} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \Big)

is the 100(1-\alpha)% confidence interval for \mu, and

    \Big( \bar{x} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}, \ \bar{x} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \Big)

is the observed interval. Note that, for \alpha = 0.05, the statement

    P\Big( \bar{x} - \frac{1.96\,\sigma}{\sqrt{n}} < \mu < \bar{x} + \frac{1.96\,\sigma}{\sqrt{n}} \Big) = 0.95
is wrong. This probability is 0 or 1. If the interval

    \Big( \bar{x} - \frac{1.96\,\sigma}{\sqrt{n}}, \ \bar{x} + \frac{1.96\,\sigma}{\sqrt{n}} \Big)

contains the parameter \mu, then

    P\Big( \bar{x} - \frac{1.96\,\sigma}{\sqrt{n}} < \mu < \bar{x} + \frac{1.96\,\sigma}{\sqrt{n}} \Big) = 1,

and the probability is 0 if the interval

    \Big( \bar{x} - \frac{1.96\,\sigma}{\sqrt{n}}, \ \bar{x} + \frac{1.96\,\sigma}{\sqrt{n}} \Big)

does not contain the parameter \mu. The 95% confidence interval means that about 95% of the observed intervals do contain the unknown parameter \mu.
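The coverage interpretation, that about 95% of observed intervals contain \mu, can be demonstrated by repeated sampling. This Python sketch uses arbitrary illustrative values \mu = 60, \sigma = 5, n = 49 and counts how often the z interval covers the true mean.

```python
import random, math

random.seed(8)
mu, sigma, n, N = 60.0, 5.0, 49, 20_000
half = 1.96 * sigma / math.sqrt(n)
covered = 0
for _ in range(N):
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    if xbar - half < mu < xbar + half:
        covered += 1
rate = covered / N
assert abs(rate - 0.95) < 0.01     # roughly 95% of intervals cover mu
print(f"coverage rate = {rate:.3f}")
```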
In general we get the following.

Definition 54 Suppose that X_1, \ldots, X_n is a random sample from a population with the probability density function f(x; \theta), \theta \in \Theta. If the statistics \theta_1(X) = u(X_1, \ldots, X_n) and \theta_2(X) = v(X_1, \ldots, X_n) satisfy

    P\big( \theta_1(X) < \theta < \theta_2(X) \big) = 1 - \alpha,

then the random interval (\theta_1(X), \theta_2(X)) is called a 100(1-\alpha)% confidence interval for \theta.
2. Suppose that \sigma^2 is unknown. Then

    \bar{X}_n \mp t_{\alpha/2}(n-1)\,\frac{S}{\sqrt{n}}

is the 100(1-\alpha)% confidence interval for \mu, where

    S^2 = \frac{1}{n-1}\sum_{i=1}^{n} \big( X_i - \bar{X}_n \big)^2.

With

    S_n^2 = \frac{1}{n}\sum_{i=1}^{n} \big( X_i - \bar{X}_n \big)^2,

we see that

    \bar{X}_n \mp t_{\alpha/2}(n-1)\,\frac{S_n}{\sqrt{n-1}}

is also the 100(1-\alpha)% confidence interval for \mu.

3. These confidence intervals are exact in the sense that we derive them using the exact distributions.
Proof.
1. Suppose that \sigma^2 is known. In this case we can use

    Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).

2. Suppose that \sigma^2 is unknown. Note that

(a) \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1),
(b) \frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1),

and the two are independent, so

    T = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \Big/ \sqrt{\frac{(n-1)S^2}{\sigma^2} \Big/ (n-1)}
      = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1).

Now we can use this to derive the desired confidence interval. Indeed, let t_{\alpha/2}(n-1) be the (\alpha/2)th upper quantile of the t distribution with n-1 degrees of freedom. Then,

    P\Big( -t_{\alpha/2}(n-1) < \frac{\bar{X}_n - \mu}{S/\sqrt{n}} < t_{\alpha/2}(n-1) \Big) = 1 - \alpha.

Therefore we get

    P\Big( \bar{X} - t_{\alpha/2}(n-1)\,\frac{S}{\sqrt{n}} < \mu < \bar{X} + t_{\alpha/2}(n-1)\,\frac{S}{\sqrt{n}} \Big) = 1 - \alpha.
2. Suppose that \sigma^2 is unknown. Then

    \bar{X}_n \mp z_{\alpha/2}\,\frac{S}{\sqrt{n}}

is the 100(1-\alpha)% confidence interval for \mu, where

    S^2 = \frac{1}{n-1}\sum_{i=1}^{n} \big( X_i - \bar{X}_n \big)^2.

Proof.
1. We cannot use the exact distribution of

    \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}

to derive the confidence interval because we don't have any information about the distribution. If n is large, we may apply the central limit theorem to get the approximate distribution. Indeed, by the central limit theorem, we have

    \sqrt{n}\big( \bar{X}_n - \mu \big) \to_D N(0, \sigma^2) \ as \ n \to \infty,

or

    \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \approx N(0, 1) \ for \ n \ large.

Now use the relation to get the approximate confidence interval.

2. Suppose that \sigma^2 is unknown. Apply the central limit theorem to get

    \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \to_D N(0, 1) \ as \ n \to \infty,

and apply the law of large numbers to get

    S_n^2 \to_P \sigma^2 \ or \ S^2 \to_P \sigma^2.

Now, apply the Slutsky theorem to get

    \frac{\bar{X} - \mu}{S/\sqrt{n}} \to_D N(0, 1) \ as \ n \to \infty,

or

    \frac{\bar{X} - \mu}{S/\sqrt{n}} \approx N(0, 1) \ for \ n \ large.
2. The interval with endpoints

    \frac{\hat{p} + \frac{z_{\alpha/2}^2}{2n} \mp z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z_{\alpha/2}^2}{4n^2}}}{1 + \frac{z_{\alpha/2}^2}{n}}.

3. The interval

    \hat{p} \mp z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.
Proof.
1. Apply the central limit theorem to get

    \sqrt{n}(\hat{p} - p) \to_D N(0, p(1-p)).

Therefore we get

    \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} \to_D N(0, 1).

Now, the event

    \Big\{ \Big| \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} \Big| < z_{\alpha/2} \Big\}

is equal to the event

    \Big\{ \Big( 1 + \frac{z_{\alpha/2}^2}{n} \Big) p^2 - \Big( 2\hat{p} + \frac{z_{\alpha/2}^2}{n} \Big) p + \hat{p}^2 < 0 \Big\},

a quadratic inequality in p whose solution interval gives the confidence interval in 2.
2. Apply the weak law of large numbers and the central limit theorem to get

    \hat{p} \to_P p

and

    \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} \to_D N(0, 1),

so that, by the Slutsky theorem,

    \frac{\hat{p} - p}{\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}} \to_D N(0, 1).

3. Apply the central limit theorem to get

    \sqrt{n}(\hat{p} - p) \to_D N(0, p(1-p)).
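The two binomial intervals above (the quadratic-based interval of item 2, often called the Wilson interval, and the simple interval of item 3) are easy to compare numerically. A minimal Python sketch with illustrative values \hat{p} = 0.1, n = 50:

```python
import math

def wald_ci(phat, n, z=1.96):
    # interval of item 3: phat -/+ z * sqrt(phat(1-phat)/n)
    half = z * math.sqrt(phat * (1 - phat) / n)
    return phat - half, phat + half

def wilson_ci(phat, n, z=1.96):
    # interval of item 2, from the quadratic inequality in p
    denom = 1 + z * z / n
    center = (phat + z * z / (2 * n)) / denom
    half = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

lo_w, hi_w = wilson_ci(0.1, 50)
lo_a, hi_a = wald_ci(0.1, 50)
assert 0 < lo_w < hi_w < 1     # quadratic interval stays inside (0, 1)
assert lo_a < lo_w             # simple interval reaches lower here
print(f"item 2: ({lo_w:.3f}, {hi_w:.3f}), item 3: ({lo_a:.3f}, {hi_a:.3f})")
```

For \hat{p} near 0 or 1 the quadratic interval is generally preferable, since the simple one can extend outside (0, 1).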
The following example shows how to find the confidence interval for the difference of means under the assumption that the population variances are known.

Example 66 Suppose that X_1, \ldots, X_m is a random sample from a population with the unknown mean \mu_1 and the known variance \sigma_1^2, (\mu_1, \sigma_1^2) \in IR \times (0, \infty), and Y_1, \ldots, Y_n is a random sample from a population with the unknown mean \mu_2 and the known variance \sigma_2^2, (\mu_2, \sigma_2^2) \in IR \times (0, \infty). Suppose that (X_1, \ldots, X_m) and (Y_1, \ldots, Y_n) are independent.

1. Suppose that X_1, \ldots, X_m and Y_1, \ldots, Y_n are normal samples. In this case,

    \bar{X} - \bar{Y} \mp z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n}}

is the 100(1-\alpha)% confidence interval for the difference of means \mu_1 - \mu_2. This interval is exact.

2. Suppose that X_1, \ldots, X_m and Y_1, \ldots, Y_n are general samples. In this case,

    \bar{X} - \bar{Y} \mp z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n}}

is the 100(1-\alpha)% confidence interval for the difference of means \mu_1 - \mu_2. This interval is approximate and is valid for m and n large.
Proof.
1. Note that

    \bar{X} \sim N\Big( \mu_1, \frac{\sigma_1^2}{m} \Big), \qquad \bar{Y} \sim N\Big( \mu_2, \frac{\sigma_2^2}{n} \Big),

so

    \bar{X} - \bar{Y} \sim N\Big( \mu_1 - \mu_2, \frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n} \Big),

or

    \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n}}} \sim N(0, 1).

2. By the central limit theorem,

    \bar{X} \approx N\Big( \mu_1, \frac{\sigma_1^2}{m} \Big), \qquad \bar{Y} \approx N\Big( \mu_2, \frac{\sigma_2^2}{n} \Big),

so

    \bar{X} - \bar{Y} \approx N\Big( \mu_1 - \mu_2, \frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n} \Big),

or

    \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n}}} \approx N(0, 1).

The following example shows how to find the confidence interval for the population variance.
Example 67 Suppose that X_1, \ldots, X_n is a random sample from N(\mu, \sigma^2), (\mu, \sigma^2) \in IR \times (0, \infty).

1. Suppose that \mu is known. In this case, the 100(1-\alpha)% confidence interval for \sigma^2 is obtained from

    \chi^2_{1-\alpha/2}(n) < \frac{\sum_{i=1}^{n} (X_i - \mu)^2}{\sigma^2} < \chi^2_{\alpha/2}(n),

since

    \frac{\sum_{i=1}^{n} (X_i - \mu)^2}{\sigma^2} \sim \chi^2(n).

2. Suppose that \mu is unknown. In this case, the 100(1-\alpha)% confidence interval for \sigma^2 is obtained from

    \chi^2_{1-\alpha/2}(n-1) < \frac{\sum_{i=1}^{n} \big( X_i - \bar{X} \big)^2}{\sigma^2} < \chi^2_{\alpha/2}(n-1),

since

    \frac{\sum_{i=1}^{n} \big( X_i - \bar{X} \big)^2}{\sigma^2} \sim \chi^2(n-1).
The following example shows how to find the confidence interval for the ratio of the two population variances.

Example 68 Suppose that X_1, \ldots, X_m is a random sample from N(\mu_1, \sigma_1^2), (\mu_1, \sigma_1^2) \in IR \times (0, \infty), and Y_1, \ldots, Y_n is a random sample from N(\mu_2, \sigma_2^2), (\mu_2, \sigma_2^2) \in IR \times (0, \infty). Suppose that (X_1, \ldots, X_m) and (Y_1, \ldots, Y_n) are independent.

1. Suppose that \mu_1 and \mu_2 are known. In this case, the 100(1-\alpha)% confidence interval for the ratio of variances \frac{\sigma_2^2}{\sigma_1^2} is obtained from

    F_{1-\alpha/2}(m, n) < \frac{\frac{1}{m}\sum_{i=1}^{m} (X_i - \mu_1)^2 / \sigma_1^2}{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \mu_2)^2 / \sigma_2^2} < F_{\alpha/2}(m, n),

since we know that

    \frac{\frac{1}{m}\sum_{i=1}^{m} (X_i - \mu_1)^2 / \sigma_1^2}{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \mu_2)^2 / \sigma_2^2} \sim F(m, n).

2. Suppose that \mu_1 and \mu_2 are unknown. In this case, the confidence interval is obtained from

    F_{1-\alpha/2}(m-1, n-1) < \frac{\frac{1}{m-1}\sum_{i=1}^{m} \big( X_i - \bar{X} \big)^2 / \sigma_1^2}{\frac{1}{n-1}\sum_{i=1}^{n} \big( Y_i - \bar{Y} \big)^2 / \sigma_2^2} < F_{\alpha/2}(m-1, n-1),

since

    \frac{\frac{1}{m-1}\sum_{i=1}^{m} \big( X_i - \bar{X} \big)^2 / \sigma_1^2}{\frac{1}{n-1}\sum_{i=1}^{n} \big( Y_i - \bar{Y} \big)^2 / \sigma_2^2} \sim F(m-1, n-1).
3.2.3  Exercises

Suppose that X has the probability mass function f(x; \theta) given by the following table, \theta \in \{0, 1, 2, 3\}:

                              \theta = 0   \theta = 1   \theta = 2   \theta = 3
    f(1; \theta) = P(X = 1) |    1/4          1/2          1/3           0
    f(2; \theta) = P(X = 2) |    1/4           0           1/3          1/2
    f(3; \theta) = P(X = 3) |    1/4           0           1/3           0
    f(4; \theta) = P(X = 4) |    1/4          1/2           0           1/2
    Total                   |     1            1            1            1
The value \hat{\theta} minimizing

    \sum_{i=1}^{n} (X_i - \theta)^2

is called the least squares estimator of \theta. Now, suppose that X_1, \ldots, X_n is a random sample from N(\theta, 1), \theta \in IR. Prove that the least squares estimator of \theta is the same as the maximum likelihood estimator of \theta.

7. Suppose that X_1, \ldots, X_n is a random sample from N(\mu, \sigma^2), (\mu, \sigma^2) \in IR \times (0, \infty). Find the maximum likelihood estimator of the population variance \sigma^2.
3.3
3.3.1
3.3.2
Definition 55 In the language of statistics, the claim or the research hypothesis that we wish to establish is called the alternative hypothesis H_1. The opposite statement, one that nullifies the research hypothesis, is called the null hypothesis H_0. The word "null" in this context means that the assertion we are seeking to establish is actually void.

When our goal is to establish an assertion with substantive support obtained from the sample, the negation of the assertion is taken to be the null hypothesis H_0 and the assertion itself is taken to be the alternative hypothesis H_1.
Example 69 In our traffic problem,
1. the alternative hypothesis is

    H_1: The mean traffic time is reduced,

and
2. the null hypothesis is

    H_0: The mean traffic time is the same as before.
Our initial question, "Is there strong evidence in support of the claim?" now translates to, "Is there strong evidence for rejecting H_0?" The first version typically appears in the statement of a practical problem, whereas the second appears in its statistical formulation; one should keep in mind the correspondence between the two formulations of a question.
Before claiming that a statement is established statistically, adequate evidence from data must be produced to support it. A close analogy can be made to a court trial, where the jury clings to the null hypothesis of "not guilty" unless there is convincing evidence of guilt. The intent of the hearings is to establish the assertion that the accused is guilty, rather than to prove that he or she is innocent.
                              Statistical Testing                Court Trial
    Requires strong           Conjecture (research               Guilt
    evidence to establish:    hypothesis)
    Null hypothesis (H_0)     Conjecture is false.               Not guilty.
    Alternative               Conjecture is true.                Guilty.
    hypothesis (H_1)
    Attitude:                 Retain the null hypothesis         Uphold "not guilty"
                              unless it makes the sample         unless there is
                              data very unlikely to happen.      strong evidence
                                                                 of guilt.
False rejection of H_0 is a more serious error than failing to reject H_0 when H_1 is true.

Once H_0 and H_1 are formulated, our goal is to analyze the sample data in order to choose between them.

A decision rule, or a test of the null hypothesis, specifies a course of action by stating what sample information is to be used and how it is to be used in making a decision. Bear in mind that we are to make one of the following two decisions:

Rule 1 Either reject H_0 and conclude that H_1 is substantiated, or retain H_0 and conclude that H_1 fails to be substantiated.

Rejection of H_0 amounts to saying that H_1 is substantiated, whereas non-rejection or retention of H_0 means that H_1 fails to be substantiated. A key point is that a decision to reject H_0 must be based on strong evidence. Otherwise, the claim H_1 could not be established beyond a reasonable doubt.

In our problem of evaluating the new traffic system, let \mu be the population mean traffic time. Because \mu is claimed to be lower than 60 minutes, we formulate the alternative hypothesis as H_1: \mu < 60. According to the description of the problem, the researcher does not care to distinguish between the situations that \mu = 60 and \mu > 60, for the claim is false in either case. Accordingly, we formulate the null hypothesis as the statement of no difference.

Example 70
Test H_0: \mu = 60 versus H_1: \mu < 60.
3.3.3

This decision rule is conveniently expressed as R: \bar{X} \le c, where R stands for the rejection of H_0. Also, in this context, the set of outcomes [\bar{X} \le c] is called the rejection region or critical region, and the cutoff point c is called the critical value.

The cutoff point c must be specified in order to fully describe a decision rule. To this end, we consider the case when H_0 holds, that is, \mu = 60. Rejection of H_0 would then be a wrong decision, amounting to a false acceptance of the claim, a serious error. For an adequate protection against this kind of error, we must ensure that P[\bar{X} \le c] is very small when \mu = 60. For example, suppose that we wish to hold a low probability of \alpha = 0.05 for a wrong rejection of H_0. Then our task is to find the c that makes

    P[\bar{X} \le c] = 0.05 \ when \ \mu = 60.

We know that, for large n, the distribution of \bar{X} is approximately normal, so that when \mu = 60,

    Z = \frac{\bar{X} - 60}{5/\sqrt{49}}

has the N(0, 1) distribution. Therefore

    c = 60 - 1.645 \cdot \frac{5}{\sqrt{49}} = 58.8.
Equivalently, we may express the rejection region in terms of

    Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{\bar{X} - 60}{5/\sqrt{49}}

and set the rejection region as R: Z \le -1.645 (see the Figure). This form is more convenient because the cutoff -1.645 is directly read off the normal table, whereas the determination of c involves additional numerical work.

Definition 56 The random variable \bar{X} whose value serves to determine the action is called the test statistic. A test of the null hypothesis is a course of action specifying the set of values of a test statistic \bar{X} for which H_0 is to be rejected. The set is called the rejection region of the test. A test is completely specified by a test statistic and the rejection region.
3.3.4

                      H_0 True: \mu = 60          H_1 True: \mu < 60
    Reject H_0        Wrong rejection of H_0      Correct decision
                      (Type I error)
    Retain H_0        Correct decision            Wrong retention of H_0
                                                  (Type II error)

Definition 57 A Type I error is the error of rejection of H_0 when H_0 is true. A Type II error is the error of non-rejection of H_0 when H_1 is true.

    \alpha = Probability of making a Type I error (also called the level of significance)
    \beta = Probability of making a Type II error
In our problem of evaluating the new trac system, the rejection region
c; thus,
is of the form R : X
c] when = 60 (H0 true)
= P [X
> c] when < 60 (H1 true)
= P [X
Of course, the probability depends on the numerical value of that
prevails under H1 . Figure shows the Type I error probability as the shaded
area under the normal curve that has = 60 and the Type II error probability
as the shaded area under the numerical that has = 58.8, a case of H1
being true
Figure: The error probabilities $\alpha$ and $\beta$. Two normal curves, one centered at $\mu = 60.0$ ($H_0$ true) and one at $\mu = 58.8$ ($H_1$ true), with the rejection region $\bar{X} \le c$ shaded under each.
Because a wrong rejection of $H_0$ is the more serious error, we hold $\alpha$ at a predetermined low level such as 0.10, 0.05, or 0.01 when choosing a rejection region. We will not pursue the evaluation of $\beta$, but we do note that if $\beta$ turns out to be uncomfortably large, the sample size must be increased.
3.3.5 Performing a test

When determining the rejection region for this example, we assumed that $\sigma = 5$ seconds, the same standard deviation as with the original money machines. Then, when $\mu = 60$, the distribution of $\bar{X}$ is $N(60, 5/\sqrt{49})$, and the rejection region $R : \bar{X} \le 58.8$ was arrived at by fixing $\alpha = 0.05$ and referring
\[
Z = \frac{\bar{X} - 60}{5/\sqrt{49}}
\]
to the standard normal distribution.
In practice, we are usually not sure about the assumption that $\sigma = 5$, the standard deviation of the traffic times using the new system, is the same as with the original system. But that does not cause any problem as long as the sample size is large. When $n$ is large ($n \ge 30$), the normal approximation for $\bar{X}$ remains valid even if $\sigma$ is estimated by the sample standard deviation $S$. Therefore, for testing $H_0 : \mu = \mu_0$ versus $H_1 : \mu < \mu_0$ with level of significance $\alpha$, we employ the test statistic
\[
Z = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}
\]
and set the rejection region $R : Z \le -z_\alpha$. This test is commonly called a large-sample normal test or a $Z$-test.
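The test statistic and rejection rule translate directly into code. The following is a hedged sketch; the function name `z_test_left` and its interface are illustrative assumptions, not part of the text:

```python
import math
from statistics import NormalDist

def z_test_left(xbar, s, n, mu0, alpha):
    """Left-sided large-sample Z-test of H0: mu = mu0 vs H1: mu < mu0.

    Valid for n >= 30, where S may replace an unknown sigma.
    Returns the observed z and whether H0 is rejected at level alpha.
    """
    z = (xbar - mu0) / (s / math.sqrt(n))
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # reject when z <= -z_alpha
    return z, z <= -z_alpha

# Using the sample results reported in Example 71 below:
z, reject = z_test_left(58.5, 4.2, 49, 60, 0.025)
print(round(z, 2), reject)   # -2.5 True
```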
Example 71. Referring to the traffic times under the new traffic system, suppose that, from the measurements of a random sample of 49 traffic times, the sample mean and standard deviation are found to be 58.5 and 4.2 minutes, respectively. Test the null hypothesis $H_0 : \mu = 60$ versus $H_1 : \mu < 60$ using a 2.5% level of significance, and state whether or not the claim ($\mu < 60$) is substantiated.

Solution. Because $n = 49$ and the null hypothesis specifies that $\mu$ has the value $\mu_0 = 60$, we employ the test statistic
\[
Z = \frac{\bar{X} - 60}{S/\sqrt{49}}.
\]
The rejection region should consist of small values of $Z$ because $H_1$ is left-sided. For a 2.5% level of significance, we take $\alpha = 0.025$, and since $z_{0.025} = 1.96$, the rejection region is (see the figure) $R : Z \le -1.96$.

Figure: The rejection region for $Z$. A standard normal curve with the area 0.025 shaded to the left of $-1.96$.
With the observed values $\bar{x} = 58.5$ and $s = 4.2$, we calculate the test statistic
\[
z = \frac{58.5 - 60}{4.2/\sqrt{49}} = -2.5.
\]
Because this observed $z$ is in $R$, the null hypothesis is rejected at the level of significance $\alpha = 0.025$. We conclude that the claim of a reduction in the mean traffic time is strongly supported by the data.
3.3.6 The p-value

Our test in Example 71 was based on the fixed level of significance $\alpha = 0.025$, and we rejected $H_0$ because the observed $z = -2.5$ fell in the rejection region $R : Z \le -1.96$. Strong evidence against $H_0$ emerged because a small $\alpha$ was used. The natural question at this point is: how small an $\alpha$ could we use and still arrive at the conclusion of rejecting $H_0$? To answer this question, we consider the observed $z = -2.5$ itself as the cutoff point (critical value) and calculate the rejection probability
\[
P[Z \le -2.5] = 0.0062.
\]
The smallest possible $\alpha$ that would permit rejection of $H_0$, on the basis of the observed $z = -2.5$, is therefore 0.0062. It is called the significance probability or $p$-value of the observed $z$. This very small $p$-value, 0.0062, signifies a strong rejection of $H_0$, or that the result is highly statistically significant.
Definition 58. The $p$-value is the probability, calculated under $H_0$, that the test statistic takes a value equal to or more extreme than the value actually observed. The $p$-value serves as a measure of the strength of evidence against $H_0$. A small $p$-value means that the null hypothesis is strongly rejected, or that the result is highly statistically significant.
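For a left-sided test, the $p$-value is a single evaluation of the standard normal distribution function. A minimal sketch (the helper name `p_value_left` is my own, not from the text):

```python
from statistics import NormalDist

def p_value_left(z_obs):
    """P[Z <= z_obs] under the standard normal null distribution."""
    return NormalDist().cdf(z_obs)

p = p_value_left(-2.5)
print(round(p, 4))   # 0.0062, matching the text
```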
Our illustrations of the basic concepts of hypothesis tests have thus far focused on a problem where the alternative hypothesis is of the form $H_1 : \mu < \mu_0$, called a left-sided alternative. If the alternative hypothesis in a problem states that the true $\mu$ is larger than its null hypothesis value $\mu_0$, we formulate the right-sided alternative $H_1 : \mu > \mu_0$ and use the right-sided rejection region $R : Z \ge z_\alpha$.

We illustrate the right-sided case in an example after summarizing the main steps involved in the conduct of a statistical test.
1. Formulate the null hypothesis $H_0$ and the alternative hypothesis $H_1$.
2. Test criterion: state the test statistic and the form of the rejection region.
3. With a specified $\alpha$, determine the rejection region.
4. Calculate the test statistic from the data.
5. Draw a conclusion: state whether or not $H_0$ is rejected at the specified $\alpha$ and interpret the conclusion in the context of the problem. Also, it is good statistical practice to calculate the $p$-value and thereby strengthen the conclusion.
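The right-sided case differs from the left-sided one only in the direction of the rejection region and of the $p$-value tail. A hedged sketch of the five steps for a right-sided test; the function name and the input numbers in the demonstration are purely hypothetical illustrations, not from the text:

```python
import math
from statistics import NormalDist

def z_test_right(xbar, s, n, mu0, alpha):
    """Right-sided Z-test of H0: mu = mu0 vs H1: mu > mu0."""
    z = (xbar - mu0) / (s / math.sqrt(n))      # step 4: compute the statistic
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # step 3: rejection cutoff
    p_value = 1 - NormalDist().cdf(z)          # right-tail p-value
    return z, z >= z_alpha, p_value            # step 5: report the decision

# Hypothetical inputs purely for illustration:
z, reject, p = z_test_right(xbar=62.1, s=4.9, n=49, mu0=60, alpha=0.05)
print(round(z, 2), reject, round(p, 5))
```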
Chapter 4

The uniform distribution

1. Let $X_1, \ldots, X_n$ be a random sample from $U(0, \theta)$, $\theta > 0$, and let $X_{(n)} = \max_{1 \le i \le n} X_i$. Then $X_{(n)} \to_D \theta$; that is,
\[
F_n(x) \to F(x) := \begin{cases} 0, & x < \theta \\ 1, & x \ge \theta, \end{cases}
\]
where the distribution function of $X_{(n)}$ is
\[
F_n(x) = \begin{cases} 0, & x < 0 \\ (x/\theta)^n, & 0 \le x < \theta \\ 1, & x \ge \theta. \end{cases}
\]
Show, moreover, that
\[
G_n(z) \to G(z) := \begin{cases} 0, & z < 0 \\ 1 - e^{-z/\theta}, & z \ge 0, \end{cases}
\]
where $G_n$ is the distribution function of $n(\theta - X_{(n)})$.
Solution. The distribution function of $Z_n = n(\theta - X_{(n)})$ is
\[
G_n(z) = P\left[X_{(n)} \ge \theta - \frac{z}{n}\right] = 1 - P\left[X_{(n)} < \theta - \frac{z}{n}\right]
= \begin{cases} 0, & z < 0 \\ 1 - \left(\dfrac{\theta - z/n}{\theta}\right)^{n}, & 0 \le z < n\theta \\ 1, & z \ge n\theta. \end{cases}
\]
Since
\[
1 - \left(\frac{\theta - z/n}{\theta}\right)^{n} = 1 - \left(1 - \frac{z}{n\theta}\right)^{n} \to 1 - e^{-z/\theta} \quad \text{for } z \ge 0,
\]
we conclude that
\[
G_n(z) \to G(z) := \begin{cases} 0, & z < 0 \\ 1 - e^{-z/\theta}, & z \ge 0. \end{cases}
\]
Notice also that
\[
E(X_{(n)})^2 = \mathrm{Var}(X_{(n)}) + (E X_{(n)})^2.
\]
Markov's inequality gives $X_{(n)} \to_P \theta$, and the continuous mapping theorem for convergence in probability then gives $g(X_{(n)}) \to_P g(\theta)$ for every continuous $g$.
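Both limit statements can be checked by simulation. A minimal sketch; the choices $\theta = 2$, $n = 1000$, and 2000 replications are arbitrary assumptions of mine, not from the text:

```python
import random

random.seed(0)
theta, n, reps = 2.0, 1000, 2000

# Draw the sample maximum X(n) of n uniforms on (0, theta), many times.
maxima = [max(random.uniform(0, theta) for _ in range(n)) for _ in range(reps)]

# Check 1: X(n) concentrates near theta (convergence in probability).
frac_close = sum(abs(m - theta) < 0.01 for m in maxima) / reps

# Check 2: n*(theta - X(n)) is approximately Exp with G(z) = 1 - exp(-z/theta),
# whose mean is theta, so the sample mean of the scaled values should be near 2.
scaled = [n * (theta - m) for m in maxima]
mean_scaled = sum(scaled) / reps

print(frac_close, round(mean_scaled, 2))
```

With $n = 1000$, the theoretical value of the first fraction is $1 - (1.99/2)^{1000} \approx 0.993$, so almost every simulated maximum falls within 0.01 of $\theta$.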