
Research Method

Lecture 15-3

Truncated Regression and Heckman Sample Selection Corrections

Truncated regression
Truncated regression differs from censored regression in the following way:

Censored regression: the dependent variable may be censored, but the censored observations can still be included in the regression.

Truncated regression: a subset of the observations is dropped entirely, so only the truncated data are available for the regression.
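To make the distinction concrete, here is a minimal Stata sketch (the data, variable names, and the cutoff of 500 are all hypothetical):

    * Simulated income data with a hypothetical cutoff of 500.
    clear
    set obs 1000
    set seed 123
    gen x = rnormal(12, 3)
    gen y = 100 + 30*x + rnormal(0, 50)

    * Censoring: y is recorded as 500 whenever it exceeds 500,
    * but the observation stays in the data set.
    gen y_cens = min(y, 500)

    * Truncation: observations with y >= 500 are absent entirely,
    * so y_trunc is missing (and would drop out of a regression).
    gen y_trunc = y if y < 500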

Reasons why data truncation happens

Example 1 (Truncation by survey design): The Gary negative income tax experiment data, which are used extensively in the economics literature, sample only those families whose income is less than 1.5 times the 1976 poverty line. In this case, families whose incomes are greater than that threshold are dropped from the data by survey design.

Example 2 (Incidental truncation): In the wage offer regression for married women, only those who are working have wage information. Thus, the regression cannot include women who are not working. In this case, it is the people's decision, not the surveyor's decision, that determines the sample selection.

When applying OLS to truncated data causes a bias

Before learning the techniques for dealing with truncated data, it is important to know when applying OLS to truncated data would cause a bias.

Suppose that you consider the following regression:

y_i = β₀ + β₁x_i + u_i

And suppose that you have a random sample of size N. We also assume that all the OLS assumptions are satisfied. (The most important assumption is E(u_i|x_i) = 0.)

Now, suppose that, instead of using all N observations, you select a subsample of the original sample and run OLS using this subsample (the truncated sample) only. Under what conditions would this OLS be unbiased, and under what conditions would it be biased?

A: Running OLS using only the selected subsample (truncated data) would not cause a bias if:

(A-1) Sample selection is done randomly.

(A-2) Sample selection is determined solely by the value of the x-variable. For example, suppose that x is age. Then if you select the sample where age is greater than 20 years old, this OLS is unbiased.

B: Running OLS using only the selected subsample (truncated data) would cause a bias if:

(B-1) Sample selection is determined by the value of the y-variable. For example, suppose that y is family income, and you select the sample where y is greater than a certain threshold. Then this OLS is biased.

(B-2) Sample selection is correlated with u_i. For example, suppose you are running the wage regression wage = β₀ + β₁(educ) + u, where u contains unobserved ability. If the sample is selected based on this unobserved ability, the OLS is biased. In practice, this situation arises when the selection is based on the survey participant's decision. For example, in a wage regression, a person's decision whether or not to work determines whether the person is included in the data. Since the decision is likely to be based on unobserved factors which are contained in u, the selection is likely to be correlated with u.

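The following minimal Stata simulation (all names and numbers are made up) illustrates conditions (A-2) and (B-1): selecting on x leaves the OLS slope essentially unchanged, while selecting on y shrinks it toward zero.

    * True model: y = 50 + 5*x + u, with E(u|x) = 0.
    clear
    set obs 10000
    set seed 42
    gen x = rnormal(40, 10)        // think of x as age
    gen u = rnormal(0, 100)
    gen y = 50 + 5*x + u

    reg y x                        // full sample: slope near 5
    reg y x if x > 20              // (A-2) selection on x: still near 5
    reg y x if y < 300             // (B-1) selection on y: biased toward 0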

Understanding why these conditions make OLS on truncated data unbiased or biased

Now we know the conditions under which OLS using truncated data would be biased or not. Let me explain why these conditions do or do not cause a bias. (There is some repetition in the explanations, but they are more elaborate and contain very important information, so please read them carefully.)

Consider the following regression:

y_i = β₀ + β₁x_i + u_i

Suppose that this regression satisfies all the OLS assumptions. Now, let s_i be a selection indicator: if s_i = 1, then this person is included in the regression; if s_i = 0, then this person is dropped from the data.

Then running OLS using the selected subsample means you run OLS using only the observations with s_i = 1. This is equivalent to running the following regression:

s_i y_i = β₀s_i + β₁s_i x_i + s_i u_i

In this regression, s_i x_i is the explanatory variable and s_i u_i is the error term. The crucial condition under which this OLS is unbiased is the zero conditional mean assumption: E(s_i u_i | s_i x_i) = 0. Thus we need to check under what conditions this is satisfied.

To check E(s_i u_i | s_i x_i) = 0, it is sufficient to check whether E(s_i u_i | x_i, s_i) = 0. (If the latter is zero, the former is also zero.) Further, notice that E(s_i u_i | x_i, s_i) = s_i E(u_i | x_i, s_i), since s_i is in the conditioning set and can be pulled out. Thus, it is sufficient to check the condition which ensures E(u_i | x_i, s_i) = 0. To simplify the notation, I drop the i-subscript from now on, so I will check the condition under which E(u|x, s) = 0.

Condition under which running OLS on the selected subsample (truncated data) is unbiased

(A-1) Sample selection is done randomly. In this case, s is independent of u and x. Then we have E(u|x, s) = E(u|x). But since the original regression satisfies the OLS conditions, we have E(u|x) = 0. Therefore, in this case, the OLS is unbiased.

(A-2) The sample is selected based solely on the value of the x-variable. For example, if x is age and you select a person if their age is at least 20 years old, then s = 1 if x ≥ 20 and s = 0 if x < 20. In this case, s is a deterministic function of x. Thus we have

E(u|x, s) = E(u|x, s(x)) = E(u|x).

(If s is a deterministic function of x, you can drop s(x) from the conditioning set.) But E(u|x) = 0 since the original regression satisfies all the OLS conditions. Therefore, in this case, OLS is unbiased.

Condition under which running OLS on the selected subsample (truncated data) is biased

(B-1) Sample selection is based on the value of the y-variable. For example, y is monthly family income, and you select families whose income is smaller than $500. Then s = 1 if y < 500. Checking whether E(u|x, s) = 0 is equivalent to checking whether E(u|x, s=1) = 0 and E(u|x, s=0) = 0. So we check this.

E(u|x, s=1) = E(u|x, y < 500)
            = E(u|x, β₀ + β₁x + u < 500)
            = E(u|x, u < 500 − β₀ − β₁x)
            ≠ E(u|x)

(Since the set {u < 500 − β₀ − β₁x} directly depends on u, you cannot drop it from the conditioning set. Thus, this is not equal to E(u|x), which means it is not equal to zero.)

Thus, E(u|x, s=1) ≠ 0. Similarly, you can show that E(u|x, s=0) ≠ 0. Thus E(u|x, s) ≠ 0, and this OLS is biased.
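One step the slides leave implicit can be made exact (a standard truncated-normal result, stated here under the normality assumption u ~ N(0, σ²) used later on slide 25):

\[
E\big(u \mid x,\; u < 500 - \beta_0 - \beta_1 x\big)
= -\,\sigma\,\frac{\phi\!\big((500 - \beta_0 - \beta_1 x)/\sigma\big)}{\Phi\!\big((500 - \beta_0 - \beta_1 x)/\sigma\big)} \;<\; 0,
\]

where φ and Φ are the standard normal pdf and cdf. Because this conditional mean varies with x, it cannot be absorbed into the intercept, which is exactly why the OLS slope estimate is biased.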

(B-2) Sample selection is correlated with u_i. This happens when it is the people's decision, not the surveyor's decision, that determines the sample selection. This type of truncation is called incidental truncation. The bias that arises from this type of sample selection is called the sample selection bias.

The leading example is the wage offer regression for married women: wage_i = β₀ + β₁educ_i + u_i. When a woman decides not to work, her wage information is not available, so she is dropped from the data. Since it is the woman's decision, this sample selection is likely to be based on some unobservable factors which are contained in u_i.

For example, a woman decides to work if the wage offer is greater than her reservation wage. This reservation wage is likely to be determined by some unobserved factors in u, such as unobserved ability, unobserved family background, etc. Thus the selection criterion is likely to be correlated with u, which in turn means that s is correlated with u. Mathematically, this can be shown as follows.

If s is correlated with u, then you cannot drop s from the conditioning set. Thus we have

E(u|x, s) ≠ E(u|x),

which means that E(u|x, s) ≠ 0. Thus, this OLS is biased. Again, this type of bias is called the sample selection bias.

A slightly more complicated case

Suppose x is IQ, and a survey participant responds to your survey if IQ > v. In this case, the sample selection is based on the x-variable and a random error v. If you run OLS using only the truncated data, will it cause a bias?

Answer:
Case 1: If v is independent of u, then it does not cause a bias.
Case 2: If v is correlated with u, then this is the same case as (B-2). Thus, the OLS will be biased.

Estimation methods when data are truncated

When you have (B-1) type truncation, we use the truncated regression. When you have (B-2) type truncation (incidental truncation), we use the Heckman sample selection correction method, also called the Heckit model. I will explain these methods one by one.

The Truncated Regression

When the data truncation is of the (B-1) type, you apply the truncated regression model. To explain again, (B-1) type truncation happens because the surveyor samples people based on the value of the y-variable.

Suppose that the following regression satisfies all the OLS assumptions:

y_i = β₀ + β₁x_i + u_i,  u_i ~ N(0, σ²)

But you sample only if y_i < c_i. (This means you drop observations with y_i ≥ c_i by survey design.) In this case, you know the exact value of c_i for each person.

[Figure: Example of (B-1) type data truncation. Scatter plot of family income per month (vertical axis) against education of the household head (horizontal axis). Observations with income above $500 are dropped from the data; the true regression line is steeper than the biased regression line obtained by applying OLS to the truncated data.]

As can be seen, running OLS on the truncated data will cause a bias. The model that produces unbiased estimates is based on maximum likelihood estimation.

The estimation method is as follows. For each observation, we can write u_i = y_i − β₀ − β₁x_i. Thus, the likelihood contribution is the height of the density function. However, since we sample only if y_i < c_i, we have to use the density function of u_i conditional on y_i < c_i. This conditional density function is given on the next slide.

f(u_i | y_i < c_i) = f(u_i | β₀ + β₁x_i + u_i < c_i) = f(u_i | u_i < c_i − β₀ − β₁x_i)

                   = f(u_i) / P(u_i < c_i − β₀ − β₁x_i)
                   = f(u_i) / P(u_i/σ < (c_i − β₀ − β₁x_i)/σ)
                   = f(u_i) / Φ((c_i − β₀ − β₁x_i)/σ),

where f is the N(0, σ²) density:

f(u_i) = (1/√(2πσ²)) exp(−u_i²/(2σ²)) = (1/σ)φ(u_i/σ),

so that

f(u_i | y_i < c_i) = (1/σ)φ(u_i/σ) / Φ((c_i − β₀ − β₁x_i)/σ).

Here φ and Φ denote the standard normal pdf and cdf.

Thus, the likelihood contribution for the i-th observation is obtained by plugging u_i = y_i − β₀ − β₁x_i into the conditional density function. This is given by

L_i = (1/σ)φ((y_i − β₀ − β₁x_i)/σ) / Φ((c_i − β₀ − β₁x_i)/σ)

The likelihood function is given by

L(β₀, β₁, σ) = ∏_{i=1}^{n} L_i

The values of β₀, β₁, σ that maximize L are the estimators of the truncated regression.
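To see what this estimator maximizes in practice, the log likelihood can be coded directly with Stata's ml machinery. The sketch below is illustrative only: it assumes a common truncation point c_i = 800 for all observations (matching the exercise that follows), and the program name is made up; the built-in truncreg command implements the same likelihood.

    * Sketch of the truncated-regression log likelihood, assuming
    * a common upper truncation point of 800.
    program define mytrunc_ll
        args lnf xb lnsigma
        // log density of y minus log P(y < 800), per observation
        quietly replace `lnf' = ln(normalden($ML_y1, `xb', exp(`lnsigma'))) ///
                              - ln(normal((800 - `xb')/exp(`lnsigma')))
    end

    ml model lf mytrunc_ll (xb: familyinc = huseduc) (lnsigma:)
    ml maximize    // should closely reproduce: truncreg familyinc huseduc, ul(800)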

The partial effects

The estimated β₁ shows the effect of x on y in the underlying (untruncated) population. Thus, you can interpret the parameters as if they were OLS parameters estimated on the full, untruncated sample.

Exercise

We do not have suitable data for truncated regression, so let us truncate the data ourselves to check how the truncated regression works.

EX1. Use JPSC_familyinc.dta to estimate the following model using all the observations:
(family income) = β₀ + β₁(husband educ) + u
Family income is in 10,000 yen.

EX2. Then run the OLS using only the observations with familyinc < 800. How did the parameters change?

EX3. Run the truncated regression model for the data truncated from above at 800 (i.e., the data dropping all observations with familyinc ≥ 800). How did the parameters change? Did the truncated regression recover the parameters of the original regression?

OLS using all the observations:

. reg familyinc huseduc

      Source |       SS       df       MS              Number of obs =    7695
-------------+------------------------------           F(  1,  7693) =  924.22
       Model |  38305900.9     1  38305900.9           Prob > F      =  0.0000
    Residual |   318850122  7693  41446.7856           R-squared     =  0.1073
-------------+------------------------------           Adj R-squared =  0.1071
       Total |   357156023  7694  46420.0705           Root MSE      =  203.58

   familyinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     huseduc |   32.93413   1.083325    30.40   0.000     30.81052    35.05775
       _cons |    143.895   15.09181     9.53   0.000     114.3109     173.479
OLS after dropping the observations with familyinc ≥ 800; the parameter on huseduc is biased toward zero:

. reg familyinc huseduc if familyinc<800

      Source |       SS       df       MS              Number of obs =    6274
-------------+------------------------------           F(  1,  6272) =  602.70
       Model |  11593241.1     1  11593241.1           Prob > F      =  0.0000
    Residual |   120645494  6272  19235.5699           R-squared     =  0.0877
-------------+------------------------------           Adj R-squared =  0.0875
       Total |   132238735  6273   21080.621           Root MSE      =  138.69

   familyinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     huseduc |   20.27929   .8260432    24.55   0.000     18.65996    21.89861
       _cons |   244.5233   11.33218    21.58   0.000     222.3084    266.7383

Truncated regression model with the upper truncation limit equal to 800: observations with familyinc ≥ 800 are automatically dropped from this regression.

. truncreg familyinc huseduc, ul(800)
(note: 1421 obs. truncated)

Fitting full model:

Iteration 0:   log likelihood = -39676.782
Iteration 1:   log likelihood = -39618.757
Iteration 2:   log likelihood = -39618.629
Iteration 3:   log likelihood = -39618.629

Truncated regression
Limit:   lower =       -inf                     Number of obs  =       6274
         upper =        800                     Wald chi2(1)   =     569.90
Log likelihood = -39618.629                     Prob > chi2    =     0.0000

   familyinc |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     huseduc |   24.50276     1.0264    23.87   0.000     22.49105    26.51446
       _cons |   203.6856   13.75721    14.81   0.000     176.7219    230.6492
-------------+----------------------------------------------------------------
      /sigma |   153.1291   1.805717    84.80   0.000       149.59    156.6683

The bias seems to be corrected, though not perfectly in this example.

Heckman Sample Selection Bias Correction (Heckit Model)

The most common reason for data truncation is the (B-2) type: incidental truncation. This data truncation usually occurs because sample selection is determined by the people's decision, not the surveyor's decision. Consider the wage regression example. If a person has chosen to work, the person has self-selected into the sample. If the person has decided not to work, the person has self-selected out of the sample. The bias caused by this type of truncation is called the sample selection bias.

Bias correction for this type of data truncation is done by the Heckman sample selection correction method, also called the Heckit model. Consider the wage regression model. In the Heckit model, you have a wage equation and a sample selection equation:

Wage eq:      y_i = x_iβ + u_i,  u_i ~ N(0, σ_u²)
Selection eq: s_i* = z_iγ + e_i,  e_i ~ N(0, 1)

such that the person works if s_i* > 0. That is, s_i = 1 if s_i* > 0, and s_i = 0 if s_i* ≤ 0.

In the above equations, I am using the following vector notation: β = (β₀, β₁, β₂, …, β_k)ᵀ and x_i = (1, x_i1, x_i2, …, x_ik); γ = (γ₀, γ₁, …, γ_m)ᵀ and z_i = (1, z_i1, z_i2, …, z_im).

We assume that x_i and z_i are exogenous in the sense that E(u_i | x_i, z_i) = 0. Further, assume that x_i is a strict subset of z_i; that is, all the x-variables are also part of z_i. For example, x_i = (1, exper_i, age_i) and z_i = (1, exper_i, age_i, kidslt6_i). We require that z_i contain at least one variable that is not contained in x_i.

The structural error u_i and the sample selection s_i are correlated only if u_i and e_i are correlated. In other words, the sample selection causes a bias only if u_i and e_i are correlated. Let us denote the correlation between u_i and e_i by ρ = corr(u_i, e_i).

The data requirement of the Heckit model is as follows.
1. y_i is available only for the observations who are currently working.
2. However, x_i and z_i are available both for those who are working and for those who are not working.

Now, I will describe the Heckit model. First, the expected value of y_i given that the person has participated in the labor force (i.e., s_i = 1) is written as

E(y_i | s_i = 1, z_i) = E(y_i | s_i* > 0, z_i)
                      = E(y_i | z_iγ + e_i > 0, z_i)
                      = E(y_i | e_i > −z_iγ, z_i)
                      = E(x_iβ + u_i | e_i > −z_iγ, z_i)
                      = x_iβ + E(u_i | e_i > −z_iγ, z_i)

Using a result for the bivariate normal distribution, the last term can be shown to be E(u_i | e_i > −z_iγ, z_i) = ρσ_u φ(z_iγ)/Φ(z_iγ). The term φ(z_iγ)/Φ(z_iγ) is the inverse Mills ratio, λ(z_iγ).
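Where this bivariate normal result comes from can be sketched in one line (a standard derivation, not spelled out on the slide): project u_i linearly on e_i.

\[
u_i = \rho\sigma_u\, e_i + v_i,\; v_i \perp e_i
\;\Longrightarrow\;
E(u_i \mid e_i > -z_i\gamma,\, z_i)
= \rho\sigma_u\, E(e_i \mid e_i > -z_i\gamma)
= \rho\sigma_u\,\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)},
\]

using the truncated standard normal mean E(e | e > −c) = φ(c)/Φ(c), which follows from φ(−c) = φ(c) and 1 − Φ(−c) = Φ(c).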

Thus, we have

E(y_i | s_i = 1, z_i) = x_iβ + E(u_i | e_i > −z_iγ, z_i) = x_iβ + ρσ_u λ(z_iγ)

Heckman showed that the sample selection bias can be viewed as an omitted variable bias, where the omitted variable is λ(z_iγ).

The important thing to note is that λ(z_iγ) can easily be estimated. How? Note that the selection equation is simply a probit model of labor force participation. So, estimate the sample selection equation by probit to obtain the estimate of γ, then compute λ(z_iγ̂).

Then you can correct the bias by including λ(z_iγ̂) in the wage regression and estimating the model by OLS. Heckman showed that this method corrects for the sample selection bias. This method is the Heckit model. The next slide summarizes the Heckit model.

Heckman Two-Step Sample Selection Correction Method (Heckit Model)

Wage eq:      y_i = x_iβ + u_i,  u_i ~ N(0, σ_u²)
Selection eq: s_i* = z_iγ + e_i,  e_i ~ N(0, 1)

such that the person works if s_i* > 0 and does not work if s_i* ≤ 0.

Assumption 1: E(u_i | x_i, z_i) = 0
Assumption 2: x_i is a strict subset of z_i.

If u_i and e_i are correlated, OLS estimation of the wage equation (using only the observations who are working) is biased, as the simulation sketched below illustrates.
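As an illustrative sketch (all names and numbers below are invented for the simulation), you can verify on artificial data that OLS on the self-selected sample is biased while heckman recovers the true coefficient:

    * Simulated incidental truncation: the true slope on x is 2.
    clear
    set obs 10000
    set seed 7
    matrix C = (1, .6 \ .6, 1)
    drawnorm u e, corr(C)            // corr(u, e) = 0.6: selection not ignorable
    gen x  = rnormal()
    gen z2 = rnormal()               // in z but excluded from x (exclusion restriction)
    gen s  = (0.3 + 0.8*x + z2 + e > 0)   // selection equation
    gen y  = 1 + 2*x + u if s == 1   // y observed only for the selected

    reg y x                               // biased slope
    heckman y x, select(s = x z2) twostep // slope close to the true value 2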

First step: Estimate the sample selection equation parameters γ by probit. Then compute λ(z_iγ̂).

Second step: Plug λ(z_iγ̂) into the wage equation, then estimate the equation by OLS. That is, estimate the following:

y_i = x_iβ + ρσ_u λ(z_iγ̂) + error_i

In this model, ρσ_u is the coefficient on λ(z_iγ̂). If ρ ≠ 0, then the sample selection bias is present. If ρ = 0, then it is evidence that sample selection bias is not present.

Note: when you follow exactly this procedure, you get the correct coefficients, but you do not get the correct standard errors. For the exact formula for the standard errors, consult Wooldridge (2002). Stata automatically computes the correct standard errors.

Exercise

Using Mroz.dta, estimate the wage offer equation with the Heckit model. The explanatory variables for the wage offer equation are educ, exper, and expersq. The explanatory variables for the sample selection equation are educ, exper, expersq, nwifeinc, age, kidslt6, and kidsge6.

Estimating the Heckit model manually. (Note: you will not get the correct standard errors.)

. **********************************************
. * Estimating heckit model manually           *
. **********************************************
. ***************************
. * First create selection  *
. * variable                *
. ***************************
. gen s=0 if wage==.
(428 missing values generated)

. replace s=1 if wage~=.
(428 real changes made)

. *******************************
. * Next, estimate the probit   *
. * selection equation          *
. *******************************
. probit s educ exper expersq nwifeinc age kidslt6 kidsge6

Iteration 0:   log likelihood =  -514.8732
Iteration 1:   log likelihood = -405.78215
Iteration 2:   log likelihood = -401.32924
Iteration 3:   log likelihood = -401.30219
Iteration 4:   log likelihood = -401.30219

Probit regression                               Number of obs   =        753
                                                LR chi2(7)      =     227.14
                                                Prob > chi2     =     0.0000
Log likelihood = -401.30219                     Pseudo R2       =     0.2206

           s |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1309047   .0252542     5.18   0.000     .0814074     .180402
       exper |   .1233476   .0187164     6.59   0.000     .0866641    .1600311
     expersq |  -.0018871      .0006    -3.15   0.002     -.003063   -.0007111
    nwifeinc |  -.0120237   .0048398    -2.48   0.013    -.0215096   -.0025378
         age |  -.0528527   .0084772    -6.23   0.000    -.0694678   -.0362376
     kidslt6 |  -.8683285   .1185223    -7.33   0.000    -1.100628    -.636029
     kidsge6 |    .036005   .0434768     0.83   0.408     -.049208    .1212179
       _cons |   .2700768    .508593     0.53   0.595    -.7267473    1.266901

The first step: the probit selection equation.

The second step:

. *******************************
. * Then create inverse lambda  *
. *******************************
. predict xdelta, xb

. gen lambda = normalden(xdelta)/normal(xdelta)

. *************************************
. * Finally, estimate the Heckit model *
. *************************************
. reg lwage educ exper expersq lambda

      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  4,   423) =   19.69
       Model |  35.0479487     4  8.76198719           Prob > F      =  0.0000
    Residual |  188.279492   423  .445105182           R-squared     =  0.1569
-------------+------------------------------           Adj R-squared =  0.1490
       Total |  223.327441   427  .523015084           Root MSE      =  .66716

       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1090655   .0156096     6.99   0.000     .0783835    .1397476
       exper |   .0438873   .0163534     2.68   0.008     .0117434    .0760313
     expersq |  -.0008591   .0004414    -1.95   0.052    -.0017267    8.49e-06
      lambda |   .0322619   .1343877     0.24   0.810    -.2318889    .2964126
       _cons |  -.5781032    .306723    -1.88   0.060    -1.180994     .024788

Note: the standard errors are not correct.

Heckit estimated automatically:

. heckman lwage educ exper expersq, select(s=educ exper expersq nwifeinc age kidslt6 kidsge6) twostep

Heckman selection model -- two-step estimates   Number of obs      =       753
(regression model with sample selection)        Censored obs       =       325
                                                Uncensored obs     =       428

                                                Wald chi2(3)       =     51.53
                                                Prob > chi2        =    0.0000

             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lwage        |
        educ |   .1090655    .015523     7.03   0.000     .0786411      .13949
       exper |   .0438873   .0162611     2.70   0.007     .0120163    .0757584
     expersq |  -.0008591   .0004389    -1.96   0.050    -.0017194    1.15e-06
       _cons |  -.5781032   .3050062    -1.90   0.058    -1.175904      .019698
-------------+----------------------------------------------------------------
s            |
        educ |   .1309047   .0252542     5.18   0.000     .0814074     .180402
       exper |   .1233476   .0187164     6.59   0.000     .0866641    .1600311
     expersq |  -.0018871      .0006    -3.15   0.002     -.003063   -.0007111
    nwifeinc |  -.0120237   .0048398    -2.48   0.013    -.0215096   -.0025378
         age |  -.0528527   .0084772    -6.23   0.000    -.0694678   -.0362376
     kidslt6 |  -.8683285   .1185223    -7.33   0.000    -1.100628    -.636029
     kidsge6 |    .036005   .0434768     0.83   0.408     -.049208    .1212179
       _cons |   .2700768    .508593     0.53   0.595    -.7267473    1.266901
-------------+----------------------------------------------------------------
mills        |
      lambda |   .0322619   .1336246     0.24   0.809    -.2296376    .2941613
-------------+----------------------------------------------------------------
         rho |    0.04861
       sigma |  .66362875
      lambda |  .03226186   .1336246

Note: H0: ρ = 0 cannot be rejected, so there is little evidence that sample selection bias is present.
