Outline
4.1 The bootstrap principle
4.2 Basic methods
4.2.1 Nonparametric and parametric bootstrap
4.2.2 Bootstrapping samples in regression
4.2.3 Bootstrap estimates of bias(θ̂) and se(θ̂)
The population correlation coefficient of X and Y is

ρ = [E(XY) − E(X)E(Y)] / (σ_X σ_Y),

and the sample correlation coefficient is

r = ( Σ_{i=1}^n X_i Y_i − n X̄ Ȳ ) / √( Σ_{i=1}^n (X_i − X̄)² · Σ_{i=1}^n (Y_i − Ȳ)² ).
For a bivariate normal sample of size n, the exact density of r is

f(r) = [ (n−2) Γ(n−1) (1−ρ²)^((n−1)/2) (1−r²)^((n−4)/2) ] / [ √(2π) Γ(n−1/2) (1−ρr)^(n−3/2) ] · ₂F₁( 1/2, 1/2; n−1/2; (1+ρr)/2 ),   (1)

where ₂F₁ is the Gauss hypergeometric function.
Statistical functionals:

T₁(F) = ∫ x dF(x) = E_F(X) = μ is the mean of X ∼ F.

T₂(F) = ∫ x² dF(x) − [∫ x dF(x)]² = Var_F(X) = σ² is the variance of X ∼ F.

T₃(F) = ∫∫ x₁x₂ dF(x₁, x₂) − ∫∫ x₁ dF(x₁, x₂) · ∫∫ x₂ dF(x₁, x₂) = Cov_F(X₁, X₂).
The empirical cdf is F̂(x) = (# of X_i's ≤ x) / n.

As an example, μ̂ = T₁(F̂) = ∫ x dF̂(x) = (1/n) Σ_{i=1}^n X_i. So the sample mean is a plug-in estimator of the population mean μ.

As another example, σ̂² = T₂(F̂) = Σ X_i²/n − [Σ X_i/n]² is a plug-in estimator of σ².
CHAPTER 4: Bootstrap Methods
Consider, e.g., the studentized quantity

R(X, F) = [T₁(F̂) − T₁(F)] / √( (n/(n−1)) T₂(F̂) );

then its bootstrap counterpart is

R(X*, F̂) = [T₁(F̂*) − T₁(F̂)] / √( (n/(n−1)) T₂(F̂*) ).
Toy example: F̂ puts mass P(x) = 1/3 at each of x = X₁, X₂, X₃ (n = 3).
Suppose θ̂ = T(F̂) = (1/3)(X₁ + X₂ + X₃) is used to estimate μ.
Our objective is to bootstrap the distribution of R(X*, F̂) = θ̂* − θ̂.
Note that X = {X₁, X₂, X₃}, and X* = {X₁*, X₂*, X₃*} is a bootstrap sample consisting of elements drawn from F̂. There are nⁿ = 27 possible ordered outcomes for X*, but they consist of only C(2n−1, n) = 10 distinct (unordered) samples.
θ̂*      θ̂* − θ̂       P(θ̂*)           obs. frequency (1000 samples)
 3/3     3/3 − 3      1/27 ≈ 0.037      38/1000
 4/3     4/3 − 3      3/27 ≈ 0.111     100/1000
 8/3     8/3 − 3      3/27 ≈ 0.111     116/1000
 5/3     5/3 − 3      3/27 ≈ 0.111     112/1000
 9/3     9/3 − 3      6/27 ≈ 0.222     245/1000
13/3    13/3 − 3      3/27 ≈ 0.111     105/1000
 6/3     6/3 − 3      1/27 ≈ 0.037      38/1000
10/3    10/3 − 3      3/27 ≈ 0.111     104/1000
14/3    14/3 − 3      3/27 ≈ 0.111     108/1000
18/3    18/3 − 3      1/27 ≈ 0.037      34/1000
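An exact enumeration like the one in this table is easy to reproduce. A minimal Python sketch, assuming (hypothetically) that the observed values are 1, 2 and 6, so that θ̂ = 3 and the ten distinct values and probabilities above can be checked; the helper name is mine:

```python
from itertools import product
from fractions import Fraction
from collections import Counter

def exact_bootstrap_dist(sample):
    """Enumerate all n^n equally likely (ordered) bootstrap samples and
    return the exact distribution of the bootstrap mean theta*."""
    n = len(sample)
    counts = Counter()
    for resample in product(sample, repeat=n):   # n^n ordered outcomes
        counts[Fraction(sum(resample), n)] += 1
    total = n ** n
    return {mean: Fraction(c, total) for mean, c in counts.items()}

dist = exact_bootstrap_dist([1, 2, 6])
print(len(dist))              # 10 distinct values of theta*
print(dist[Fraction(9, 3)])   # P(theta* = 9/3) = 6/27, i.e. 2/9
```

This brute-force enumeration is only feasible for tiny n; for realistic n one samples the nⁿ outcomes instead, which is exactly what the Monte Carlo bootstrap does.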
Nonparametric bootstrap

Draw B i.i.d. bootstrap samples, each of size n, from the empirical cdf F̂. Denote them as X*_i = {X*_{i1}, …, X*_{in}} iid ∼ F̂ for i = 1, …, B.
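Sampling from F̂ is just resampling the observed data with replacement. A Python sketch of this step (the data values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def nonparametric_bootstrap(x, B):
    """Draw B bootstrap samples of size n from the empirical cdf F-hat,
    i.e. resample the observed data with replacement."""
    x = np.asarray(x)
    n = len(x)
    # each row is one bootstrap sample X*_i = {X*_i1, ..., X*_in} iid ~ F-hat
    return rng.choice(x, size=(B, n), replace=True)

x = np.array([1.0, 2.0, 6.0, 4.0, 3.0])
samples = nonparametric_bootstrap(x, B=1000)
theta_star = samples.mean(axis=1)   # bootstrap replicates of the mean
print(samples.shape)                # (1000, 5)
```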
Parametric bootstrap

Here a parametric model F(x|θ) is assumed for the data, instead of the empirical cdf F̂. To estimate the distribution of R(X, F(x|θ)), one can draw B i.i.d. samples, each of size n, from F(x|θ̂), producing B parametric bootstrap samples. Denote them as X*_i = {X*_{i1}, …, X*_{in}} iid ∼ F(x|θ̂) for i = 1, …, B. The empirical cdf of {R(X*_i, F(x|θ̂)), i = 1, …, B} is then used to approximate the distribution of R(X, F(x|θ)).
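A sketch of the parametric case, assuming a normal model F(x|θ) = N(μ, σ²) purely for illustration (the function name and data are mine):

```python
import numpy as np

rng = np.random.default_rng(1)

def parametric_bootstrap_normal(x, B):
    """Parametric bootstrap under a N(mu, sigma^2) model: plug in the MLEs,
    then draw B samples of size n from F(x | theta-hat)."""
    x = np.asarray(x)
    n = len(x)
    mu_hat, sigma_hat = x.mean(), x.std()   # MLEs under normality
    return rng.normal(mu_hat, sigma_hat, size=(B, n))

x = rng.normal(10.0, 2.0, size=50)
samples = parametric_bootstrap_normal(x, B=2000)
theta_star = samples.mean(axis=1)
print(theta_star.std())   # approximates se(X-bar) = sigma-hat/sqrt(n)
```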
Bootstrapping residuals in regression:

1. Fit the regression model to the data to get fitted responses ŷ_i = x_iᵀβ̂ and residuals ε̂_i = y_i − ŷ_i.
2. Bootstrap residuals from {ε̂_1, …, ε̂_n} to get {ε*_1, …, ε*_n}. Note {ε̂_1, …, ε̂_n} are not i.i.d. but roughly so if the regression model is correct.
3. Create a bootstrap sample of responses: Y*_i = ŷ_i + ε*_i for i = 1, …, n.
4. Fit the regression model to {(x_1, Y*_1), …, (x_n, Y*_n)} to get a bootstrap estimate (α̂*, β̂*) of (α, β).
5. Repeat this process B times to obtain {(α̂*_1, β̂*_1), …, (α̂*_B, β̂*_B)}, from which an empirical cdf can be built for inference.
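The five steps above can be sketched as follows (a Python version rather than the chapter's R; the 13 (x_i, y_i) pairs are the ones used in the ratio example of this chapter):

```python
import numpy as np

rng = np.random.default_rng(2)

def residual_bootstrap(x, y, B):
    """Residual bootstrap for simple linear regression y = alpha + beta*x + eps.
    Returns the OLS fit and B bootstrap estimates (alpha*, beta*)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # step 1: (alpha-hat, beta-hat)
    fitted = X @ coef
    resid = y - fitted                             # eps-hat_i = y_i - yhat_i
    out = np.empty((B, 2))
    for b in range(B):
        eps_star = rng.choice(resid, size=len(y), replace=True)  # step 2
        y_star = fitted + eps_star                               # step 3
        out[b], *_ = np.linalg.lstsq(X, y_star, rcond=None)      # step 4
    return coef, out                                             # step 5

x = np.array([0.01, 0.48, 0.71, 0.95, 1.19, 0.01, 0.48, 1.44, 0.71, 1.96, 0.01, 1.44, 1.96])
y = np.array([127.6, 124.0, 110.8, 103.9, 101.5, 130.1, 122.0, 92.3, 113.1, 83.7, 128.0, 91.4, 86.2])
coef, boots = residual_bootstrap(x, y, B=1000)
print(coef[1] / coef[0])   # theta-hat = beta-hat/alpha-hat, about -0.185
```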
Bootstrap estimates of bias(θ̂) and se(θ̂)

bias(θ̂) and se(θ̂) = √Var(θ̂) are the two basic measures of the accuracy of an estimator θ̂.
4. Compute θ̄* = B⁻¹ Σ_{r=1}^B θ̂*_r and estimate bias(θ̂) by b_B(θ̂) = θ̄* − θ̂; compute

se_B(θ̂) = √( (1/(B−1)) Σ_{r=1}^B (θ̂*_r − θ̄*)² )

and estimate se(θ̂) by se_B(θ̂).
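These two estimates are one-liners given the replicates. A Python sketch with synthetic data (helper name mine):

```python
import numpy as np

def boot_bias_se(theta_hat, theta_star):
    """Bootstrap bias and se estimates from B replicates theta*_r."""
    theta_star = np.asarray(theta_star)
    B = len(theta_star)
    theta_bar = theta_star.mean()           # theta-bar* = B^{-1} sum theta*_r
    bias = theta_bar - theta_hat            # b_B(theta-hat) = theta-bar* - theta-hat
    se = np.sqrt(((theta_star - theta_bar) ** 2).sum() / (B - 1))
    return bias, se

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=30)
theta_hat = x.mean()
theta_star = rng.choice(x, size=(2000, 30), replace=True).mean(axis=1)
bias, se = boot_bias_se(theta_hat, theta_star)
print(bias, se)   # bias near 0; se near sigma-hat/sqrt(n)
```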
A bootstrap estimate of MSE(θ̂) = E_F[(θ̂ − θ)²] may be obtained as

MSE_B(θ̂) = (1/B) Σ_{r=1}^B (θ̂*_r − θ̂)².
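A corresponding sketch. Note that MSE_B(θ̂) decomposes as b_B(θ̂)² plus the variance of the replicates around θ̄* (up to a (B−1)/B factor):

```python
import numpy as np

def boot_mse(theta_hat, theta_star):
    """Bootstrap estimate of MSE(theta-hat): mean squared deviation of the
    replicates theta*_r around theta-hat."""
    return float(np.mean((np.asarray(theta_star) - theta_hat) ** 2))

theta_star = np.array([1.1, 0.9, 1.3, 0.8, 1.0])
print(boot_mse(1.0, theta_star))
# = (0.01 + 0.01 + 0.09 + 0.04 + 0)/5 = 0.03 (up to fp rounding)
```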
R package boot
The package boot in R contains many functions for implementing
bootstrap methods. The function boot() is the one for generating
bootstrap samples and various bootstrap estimates about .
library(boot)
boot(data, statistic, R, sim="ordinary", stype="i",
strata=rep(1,n), L=NULL, m=0, weights=NULL,
ran.gen=function(d, p) d, mle=NULL, ...)
The argument statistic is a function defined by the user. See the examples later for illustration.
The functions boot.array() and freq.array() are useful for finding
which original observations and how many times they are included
in each bootstrap sample.
Example data (n = 13):

xi      yi         xi      yi
0.01    127.6      1.44     92.3
0.48    124.0      0.71    113.1
0.71    110.8      1.96     83.7
0.95    103.9      0.01    128.0
1.19    101.5      1.44     91.4
0.01    130.1      1.96     86.2
0.48    122.0
The parameter of interest is the ratio θ = β1/β0, estimated by θ̂ = β̂1/β̂0 = −0.185.

z=matrix(0,13,2)
z[,1]=c(0.01,0.48,0.71,0.95,1.19,0.01,0.48,1.44,0.71,1.96,0.01,1.44,1.96)
z[,2]=c(127.6,124.0,110.8,103.9,101.5,130.1,122.0,92.3,113.1,83.7,128.0,91.4,86.2)
temp=lm(z[,2]~z[,1])
temp$coef[2]/temp$coef[1]
-0.1850722

The bootstrap replicates of θ̂ can then be summarized with mean() and sd().
"t"
"call"
> boot1$t0
> t(boot1$t)
"R"
"stype"
"data"
"strata"
"seed"
"weights"
"statistic"
[1] -0.001099678
> sd(boot1$t)
[1] 0.01011771
CHAPTER 4: Bootstrap Methods
> boot.array(boot1, indices=T)   # indices of obs. used in boot. samples (first 5 rows shown)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,]    2    9   10   11    5   11    6   10    8     7     1     7    12
[2,]    9    1    8    4    4    7    4    3    9     9     5     7     1
[3,]    8    4    4    4    3   12    4    4    5     7    10    10     5
[4,]    9    9   13    3    1   11    7   13    9     4     7     3     1
[5,]   12    7    4    4    3    1    3   11    5    10     2    12     4
For example the 1st bootstrap sample is {(x2, y2), (x9, y9), …, (x7, y7), (x12, y12)}.
> boot.array(boot1, indices=F)   # frequencies of obs. used in boot. samples
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,]    1    1    0    0    1    1    2    1    1     2     2     1     0
[2,]    2    0    1    3    1    0    2    1    3     0     0     0     0
[3,]    0    0    1    5    2    0    1    1    0     2     0     1     0
[4,]    2    0    2    1    0    0    2    0    3     0     1     0     2
[5,]    1    1    2    3    1    0    1    0    0     1     1     2     0
E.g. in the 1st bootstrap sample (x1, y1) appeared once, (x7, y7) appeared twice, etc.
> freq.array(boot.array(boot1, indices=T))   # same as boot.array(boot1, indices=F)
[Figure: density.default(x = boot1$t) — kernel density of the bootstrap replicates of θ̂ from boot1 (N = 1999, bandwidth = 0.001472).]
[Figures: for the residual bootstrap boot2 — the lm diagnostic plots (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage with Cook's distance) and density.default(x = boot2$t), the kernel density of the bootstrap replicates.]
[Figure: density.default(x = boot3$t) — kernel densities of the bootstrap replicates from boot3 (two panels).]
boot.ci(boot.out = boot1)
Intervals :
Level      Normal                Basic
95%   (-0.2002, -0.1670 )   (-0.1963, -0.1626 )

Level     Percentile            BCa
95%   (-0.2076, -0.1738 )   (-0.2047, -0.1731 )
Calculations and Intervals on Original Scale
Warning message:
In boot.ci(boot1) : bootstrap variances needed for studentized intervals
boot.ci(boot.out = boot2)
Intervals :
Level      Normal                Basic
95%   (-0.2003, -0.1704 )   (-0.2006, -0.1710 )

Level     Percentile            BCa
95%   (-0.1992, -0.1695 )   (-0.1978, -0.1677 )
Calculations and Intervals on Original Scale
Warning message:
In boot.ci(boot2) : bootstrap variances needed for studentized intervals
boot.ci(boot.out = boot4)
Intervals :
Level      Basic
95%   (-1.0010, -0.9740 )

Level     Percentile            BCa
95%   (-0.9954, -0.9685 )   (-0.9935, -0.9539 )
Calculations and Intervals on Original Scale
Warning message:
In boot.ci(boot4) : bootstrap variances needed for studentized intervals
boot.ci(boot.out = boot3)
Intervals :
Level      Basic
95%   (-0.2021, -0.1692 )

Level     Percentile            BCa
95%   (-0.2010, -0.1681 )   (-0.1987, -0.1653 )
Calculations and Intervals on Original Scale
Warning message:
In boot.ci(boot3) : bootstrap variances needed for studentized intervals
In many cases, θ̂ is approximately normally distributed, so that

[ (θ̂ − b_B(θ̂)) − z_{1−α/2} se_B(θ̂),  (θ̂ − b_B(θ̂)) + z_{1−α/2} se_B(θ̂) ]   (2)

serves as an approximate 100(1−α)% CI for θ.
Let H(q) = P(θ̂ − θ ≤ q), with α quantile h_α. Then

1 − α = P[ h_{α/2} ≤ θ̂ − θ ≤ h_{1−α/2} ]
      = P[ −h_{1−α/2} + θ̂ ≤ θ ≤ −h_{α/2} + θ̂ ].   (3)

Hence −h_{1−α/2} + θ̂ and −h_{α/2} + θ̂ are the endpoints of a 100(1−α)% CI for θ. By the bootstrap principle, h_α ≈ ĥ_α = ξ̂_α − θ̂, with ξ̂_α being the α quantile of the ideal bootstrap distribution P*(θ̂* ≤ ·), which can be estimated by θ̂*_([(B+1)α]), the sample quantile (order statistic) from the B bootstrap replicates of θ̂.
Noting that H has a symmetric pdf, so that h_{α/2} = −h_{1−α/2}, the statement becomes

P[ ĥ_{α/2} + θ̂ ≤ θ ≤ ĥ_{1−α/2} + θ̂ ] ≈ 1 − α.   (4)

Then

[ ξ̂_{α/2}, ξ̂_{1−α/2} ] = [ θ̂*_([(B+1)α/2]), θ̂*_([(B+1)(1−α/2)]) ]

can serve as an approximate 100(1−α)% CI for θ, which is called the (basic) percentile bootstrap CI.
By the bootstrap principle, ĥ_α = ξ̂_α − θ̂, where ξ̂_α is the α sample quantile of θ̂*. Using this approximation,

[ 2θ̂ − ξ̂_{1−α/2}, 2θ̂ − ξ̂_{α/2} ] = [ 2θ̂ − θ̂*_([(B+1)(1−α/2)]), 2θ̂ − θ̂*_([(B+1)α/2]) ]   (5)

is another approximate 100(1−α)% CI for θ.
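Both interval types are simple functions of the sorted replicates. A minimal Python sketch (synthetic data; np.quantile stands in for the [(B+1)α] order statistic):

```python
import numpy as np

def percentile_and_basic_ci(theta_hat, theta_star, alpha=0.05):
    """Percentile CI [xi_{a/2}, xi_{1-a/2}] and basic CI
    [2*theta_hat - xi_{1-a/2}, 2*theta_hat - xi_{a/2}] from B replicates."""
    lo, hi = np.quantile(theta_star, [alpha / 2, 1 - alpha / 2])
    return (lo, hi), (2 * theta_hat - hi, 2 * theta_hat - lo)

rng = np.random.default_rng(4)
x = rng.normal(5.0, 1.0, size=40)
theta_hat = x.mean()
theta_star = rng.choice(x, size=(4000, 40), replace=True).mean(axis=1)
perc, basic = percentile_and_basic_ci(theta_hat, theta_star)
print(perc, basic)   # both roughly theta_hat +/- 1.96*sigma_hat/sqrt(40)
```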
For these two CIs to work well, the cdf H needs to be free of θ. This implies that a stronger transformation may be needed to obtain a pivotal quantity for θ̂ and to find the CI based on the pivot.
Namely, suppose there is a monotone transformation φ = g(θ), φ̂ = g(θ̂), such that

(φ̂ − φ)/(1 + aφ) + c₀ ∼ N(0, 1).

Then, with p = 1 − α/2,

P( −z_p ≤ (φ̂ − φ)/(1 + aφ) + c₀ ≤ z_p ) = p − (1 − p) = 1 − α,   (7)

which can be inverted into P(L ≤ φ ≤ U) = 1 − α with

L = φ̂ + (1 + aφ̂)(c₀ − z_p)/(1 − a(c₀ − z_p)),
U = φ̂ + (1 + aφ̂)(c₀ + z_p)/(1 − a(c₀ + z_p)),

not computable as g is unknown.
By the bootstrap principle, (φ̂* − φ̂)/(1 + aφ̂) + c₀ ∼ N(0, 1), where φ̂* = g(θ̂*). Hence

P*( φ̂* ≤ U ) = P( (φ̂* − φ̂)/(1 + aφ̂) + c₀ ≤ c₀ + (c₀ + z_p)/(1 − a(c₀ + z_p)) )
             = Φ( c₀ + (c₀ + z_p)/(1 − a(c₀ + z_p)) ), denoted p_U.
Similarly,

P*( φ̂* ≤ L ) = P( (φ̂* − φ̂)/(1 + aφ̂) + c₀ ≤ c₀ + (c₀ − z_p)/(1 − a(c₀ − z_p)) )
             = Φ( c₀ + (c₀ − z_p)/(1 − a(c₀ − z_p)) ), denoted p_L.
Since g is monotone, [ξ̂_{p_L}, ξ̂_{p_U}] = [θ̂*_([(B+1)p_L]), θ̂*_([(B+1)p_U])] is the corresponding CI for θ; e.g.,

[ ξ̂_{0.063}, ξ̂_{0.992} ] = [ θ̂*_([0.063(B+1)]), θ̂*_([0.992(B+1)]) ],

which can be read off from the B bootstrap replicates of θ̂.
The value of c₀ is determined by the relative position of θ̂ among the bootstrap replicates of θ̂, while the value of a is determined by the skewness of the bootstrap replicates of θ̂.
The acceleration constant a can be estimated by

a = Σ_{i=1}^n (J̄ − θ̂_(i))³ / { 6 [ Σ_{i=1}^n (J̄ − θ̂_(i))² ]^{3/2} },

where θ̂_(i) is the estimate computed with the i-th observation deleted (jackknife) and J̄ = (1/n) Σ_{i=1}^n θ̂_(i). Note that the (n−1)(J̄ − θ̂_(i)) are related to the EIF (empirical influence function).
Summary of the BCa procedure:

1. Compute c₀ = Φ⁻¹( (1/B) Σ_{r=1}^B I(θ̂*_r < θ̂) ).
2. Compute a = Σ_{i=1}^n (J̄ − θ̂_(i))³ / { 6 [ Σ_{i=1}^n (J̄ − θ̂_(i))² ]^{3/2} }, where J̄ = (1/n) Σ_{i=1}^n θ̂_(i).
3. Compute p_U = Φ( c₀ + (c₀ + z_{1−α/2})/(1 − a(c₀ + z_{1−α/2})) ) and
   p_L = Φ( c₀ + (c₀ − z_{1−α/2})/(1 − a(c₀ − z_{1−α/2})) ).
4. The 100(1−α)% BCa CI of θ is [θ̂*_([(B+1)p_L]), θ̂*_([(B+1)p_U])], or [ξ̂_{p_L}, ξ̂_{p_U}].
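These four steps can be sketched in Python (stdlib NormalDist supplies Φ and Φ⁻¹; a plain quantile call stands in for the [(B+1)p] order statistic; the helper name is mine):

```python
import numpy as np
from statistics import NormalDist

def bca_ci(x, stat, theta_star, alpha=0.05):
    """BCa interval: bias-correction c0 from the position of theta-hat among
    the replicates, acceleration a from jackknife (leave-one-out) values."""
    nd = NormalDist()
    theta_hat = stat(x)
    # step 1: c0 = Phi^{-1}( #{theta*_r < theta-hat} / B )
    c0 = nd.inv_cdf((theta_star < theta_hat).mean())
    # step 2: a from the skewness of the jackknife values
    n = len(x)
    jack = np.array([stat(np.delete(x, i)) for i in range(n)])
    d = jack.mean() - jack                     # J-bar - theta-hat_(i)
    a = (d ** 3).sum() / (6 * ((d ** 2).sum()) ** 1.5)
    # step 3: adjusted tail probabilities p_L, p_U
    z = nd.inv_cdf(1 - alpha / 2)
    p_lo = nd.cdf(c0 + (c0 - z) / (1 - a * (c0 - z)))
    p_hi = nd.cdf(c0 + (c0 + z) / (1 - a * (c0 + z)))
    # step 4: read the CI off the bootstrap replicates
    return np.quantile(theta_star, [p_lo, p_hi])

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=30)
theta_star = rng.choice(x, size=(2000, 30), replace=True).mean(axis=1)
lo, hi = bca_ci(x, np.mean, theta_star)
print(lo, hi)
```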
Let G and Ĝ denote the cdfs of the studentized quantities (T(F̂) − T(F))/√V(F̂) and (T(F̂*) − T(F̂))/√V(F̂*), i.e. of R(X, F) and R(X*, F̂) respectively.
1 − α = P[ ξ_{α/2}(G) ≤ (T(F̂) − T(F))/√V(F̂) ≤ ξ_{1−α/2}(G) ]
      = P[ T(F̂) − ξ_{1−α/2}(G)√V(F̂) ≤ T(F) ≤ T(F̂) − ξ_{α/2}(G)√V(F̂) ],

where ξ_γ(G) is the γ quantile of G. These quantiles are unknown but can be estimated under the bootstrap principle, so ξ_γ(G) ≈ ξ_γ(Ĝ).
This gives the 100(1−α)% studentized bootstrap CI of θ:

[ T(F̂) − ξ_{1−α/2}(Ĝ)√V(F̂), T(F̂) − ξ_{α/2}(Ĝ)√V(F̂) ]
= [ θ̂ − ξ_{1−α/2}(Ĝ)√Var̂(θ̂), θ̂ − ξ_{α/2}(Ĝ)√Var̂(θ̂) ],

where ξ_γ(Ĝ) is the γ quantile of Ĝ.
Here ξ_γ(Ĝ) is the γ quantile of R(X*, F̂) = (T(F̂*) − T(F̂))/√V(F̂*) w.r.t. the cdf Ĝ. The variance V(F̂*) can be estimated by a (nested) bootstrap estimate Var̂_B(θ̂*) or by other means, e.g. the delta method.
For the ratio θ̂ = β̂1/β̂0 of the regression example, the delta method gives

Var̂(θ̂) ≈ (β̂1/β̂0)² [ Var̂(β̂1)/β̂1² + Var̂(β̂0)/β̂0² − 2 Cov̂(β̂0, β̂1)/(β̂0 β̂1) ].
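A quick numerical check of this first-order approximation against simulation (all numbers hypothetical, loosely on the scale of the regression example; the coefficients are taken independent here, so the covariance term is 0):

```python
import numpy as np

def ratio_var_delta(b0, b1, var_b0, var_b1, cov_b0b1):
    """Delta-method (first-order) variance of theta-hat = b1/b0."""
    theta = b1 / b0
    return theta ** 2 * (var_b1 / b1 ** 2 + var_b0 / b0 ** 2
                         - 2 * cov_b0b1 / (b0 * b1))

rng = np.random.default_rng(6)
b0s = rng.normal(130.0, 2.0, size=200000)   # simulated beta0-hat draws
b1s = rng.normal(-24.0, 1.0, size=200000)   # simulated beta1-hat draws
approx = ratio_var_delta(130.0, -24.0, 4.0, 1.0, 0.0)
print(approx, np.var(b1s / b0s))   # the two numbers should roughly agree
```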
The studentized bootstrap CI can be computed from boot5, whose statistic returns both θ̂* and a variance estimate for it:
Q=(boot5$t[,1]-boot5$t0[1])/sqrt(boot5$t[,2])
sort(Q)[(1999+1)*0.975];
sort(Q)[(1999+1)*0.025]
quant=quantile(Q,prob=c(0.025,0.975),type=6)
c(boot5$t0[1]-sqrt(boot5$t0[2])*quant[2],boot5$t0[1]-sqrt(boot5$t0[2])*quant[1])
c(boot5$t0[1]-sqrt(boot5$t0[2])*1.96, boot5$t0[1]+sqrt(boot5$t0[2])*1.96)
par(mfrow=c(1,3))
plot(density(boot5$t[,1]), ylim=c(0,60),lwd=2)
hist(boot5$t[,1], breaks=50, freq=F, add=T)
plot(density(boot5$t[,2]^0.5), ylim=c(0,1000),lwd=2)
hist(boot5$t[,2]^0.5, breaks=50, freq=F, add=T)
plot(density(Q),ylim=c(0,0.18),lwd=2)
hist(Q, breaks=50, freq=F, add=T)
The 95% studentized bootstrap CI is

[ −0.185 − 4.360 · √(7.4663×10⁻⁶), −0.185 − (−6.129) · √(7.4663×10⁻⁶) ] = [ −0.1970, −0.1683 ],

where 4.360 and −6.129 are the 0.975 and 0.025 quantiles of Q, and 7.4663×10⁻⁶ is the estimate of V(F̂).
Comparison of the intervals (studentized added):

Level      Basic                 Studentized
95%   (-0.1963, -0.1626 )   (-0.1970, -0.1683 )

Level     Percentile            BCa
95%   (-0.2076, -0.1738 )   (-0.2047, -0.1731 )
Calculations and Intervals on Original Scale
[1] 4.360584
   97.5%
4.360584
> c(boot5$t0[1]-sqrt(boot5$t0[2])*quant[2],
+   boot5$t0[1]-sqrt(boot5$t0[2])*quant[1])
-0.1969873 -0.1683238
[Figure: three panels with overlaid histograms — density.default(x = boot5$t[,1]) (N = 1999, bandwidth = 0.001472); density of sqrt(boot5$t[,2]) (N = 1999, bandwidth = 8.967e-05); density.default(x = Q) (N = 1999, bandwidth = 0.5123).]
… and calculate θ̂*_{j1}, …, θ̂*_{jB2}.
3. For each j = 1, …, B1, calculate

s²(θ̂*_j) = (1/(B2−1)) Σ_{k=1}^{B2} (θ̂*_{jk} − θ̄*_j)², with θ̄*_j = (1/B2) Σ_{k=1}^{B2} θ̂*_{jk}.
Let F1(q, F) = P[R1(X, F) ≤ q] be the cdf of R1(X, F). The 100(1−α)% CI for θ based on the bootstrap distribution of R1(X, F) is fashioned after the statement

P[ F1⁻¹(α/2, F) ≤ R1(X, F) ≤ F1⁻¹(1 − α/2, F) ] = 1 − α.
3. For j = 1, …, B0:
   (a) Let F̂_j be the empirical cdf of X*_j. Draw B1 bootstrap samples X**_{j1}, …, X**_{jB1} from F̂_j.
   (b) Compute R0(X**_{jk}, F̂_j) for k = 1, …, B1.
   (c) Compute R1(X*_j, F̂) = F̂0(R0(X*_j, F̂), F̂_j) = (1/B1) Σ_{k=1}^{B1} I[ R0(X**_{jk}, F̂_j) ≤ R0(X*_j, F̂) ].
4. Denote as F̂1 the empirical cdf of R1(X*_1, F̂), …, R1(X*_{B0}, F̂).
5. Use R1({x1, …, xn}, F) = F0(R0({x1, …, xn}, F), F) and quantiles of F̂1 to construct the CI for θ following the statement P[ F̂1⁻¹(α/2) ≤ R1(X, F) ≤ F̂1⁻¹(1 − α/2) ] ≈ 1 − α.
For the example, the resulting 95% CI is

[ θ̂ − 0.02156509, θ̂ − (−0.02195345) ] = [ −0.2066373, −0.1631188 ],

knowing θ̂ = −0.1850722.
 system elapsed
    0.12 2273.27
> res6
$theta
-0.1850722
$qL
   2.5%
0.01925
$qU
  97.5%
  0.997
$L
-0.2066373
$U
-0.1631188
$R1.hat
[1] 0.174 0.864 0.851 0.278    NA 0.509 0.358 0.836 ......
$R0.star
[1] -1.376886e-02  4.653109e-03  6.934917e-03 ......
[Figure: Histogram of res6$R1.hat with overlaid density (left), and kernel density of res6$R0.star (right; N = 1000, bandwidth = 0.001716).]
The bootstrap bias estimate of the sample mean, b_B(X̄) = (1/B) Σ_{j=1}^B X̄*_j − X̄, is unlikely to be 0 in the ordinary bootstrap. This is caused by the Monte Carlo variation in generating bootstrap samples.
However, b_B(X̄) = 0 exactly if each value occurs in the combined collection of bootstrap samples with the same relative frequency as it does in the observed sample.
Unlike the percentile-based methods, the studentized …
Questions?