
Research Method

Lecture 15-3

Truncated Regression and Heckman Sample Selection Corrections

Truncated regression
Truncated regression differs from censored regression in the following way:

Censored regression: the dependent variable may be censored, but the censored observations can still be included in the regression.

Truncated regression: a subset of the observations is dropped entirely, so only the truncated data are available for the regression.
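To make the distinction concrete, here is a minimal Stata sketch (the data, variable names, and the cutoff of 500 are all hypothetical):

    * Simulated income data with a hypothetical cutoff of 500.
    clear
    set obs 1000
    set seed 123
    gen x = rnormal(12, 3)
    gen y = 100 + 30*x + rnormal(0, 50)

    * Censoring: y is recorded as 500 whenever it exceeds 500,
    * but the observation stays in the data set.
    gen y_cens = min(y, 500)

    * Truncation: observations with y >= 500 are absent entirely,
    * so y_trunc is missing (and would drop out of a regression).
    gen y_trunc = y if y < 500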

Reasons why data truncation happens

Example 1 (Truncation by survey design): The Gary negative income tax experiment data, which are used extensively in the economics literature, sample only those families whose income is less than 1.5 times the 1976 poverty line. In this case, families whose incomes are greater than that threshold are dropped from the data by survey design.

Example 2 (Incidental truncation): In the wage offer regression for married women, only those who are working have wage information. Thus, the regression cannot include women who are not working. In this case, it is the people's decision, not the surveyor's decision, that determines the sample selection.

When applying OLS to truncated data causes a bias

Before learning the techniques for dealing with truncated data, it is important to know when applying OLS to truncated data would cause a bias.

Suppose that you consider the following regression:

y_i = β₀ + β₁x_i + u_i

And suppose that you have a random sample of size N. We also assume that all the OLS assumptions are satisfied. (The most important assumption is E(u_i|x_i) = 0.)

Now, suppose that, instead of using all N observations, you select a subsample of the original sample and run OLS using this subsample (the truncated sample) only. Under what conditions would this OLS be unbiased, and under what conditions would it be biased?

A: Running OLS using only the selected subsample (truncated data) would not cause a bias if:

(A-1) Sample selection is done randomly.

(A-2) Sample selection is determined solely by the value of the x-variable. For example, suppose that x is age. Then if you select the sample where age is greater than 20 years old, this OLS is unbiased.

B: Running OLS using only the selected subsample (truncated data) would cause a bias if:

(B-1) Sample selection is determined by the value of the y-variable. For example, suppose that y is family income, and you select the sample where y is greater than a certain threshold. Then this OLS is biased.

(B-2) Sample selection is correlated with u_i. For example, suppose you are running the wage regression wage = β₀ + β₁(educ) + u, where u contains unobserved ability. If the sample is selected based on this unobserved ability, the OLS is biased. In practice, this situation arises when the selection is based on the survey participant's decision. For example, in a wage regression, a person's decision whether or not to work determines whether the person is included in the data. Since the decision is likely to be based on unobserved factors which are contained in u, the selection is likely to be correlated with u.

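The following minimal Stata simulation (all names and numbers are made up) illustrates conditions (A-2) and (B-1): selecting on x leaves the OLS slope essentially unchanged, while selecting on y shrinks it toward zero.

    * True model: y = 50 + 5*x + u, with E(u|x) = 0.
    clear
    set obs 10000
    set seed 42
    gen x = rnormal(40, 10)        // think of x as age
    gen u = rnormal(0, 100)
    gen y = 50 + 5*x + u

    reg y x                        // full sample: slope near 5
    reg y x if x > 20              // (A-2) selection on x: still near 5
    reg y x if y < 300             // (B-1) selection on y: biased toward 0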

Understanding why these conditions make OLS on truncated data unbiased or biased

Now we know the conditions under which OLS using truncated data would be biased or not. Let me explain why these conditions do or do not cause a bias. (There is some repetition in the explanations, but they are more elaborate and contain very important information, so please read them carefully.)

Consider the following regression:

y_i = β₀ + β₁x_i + u_i

Suppose that this regression satisfies all the OLS assumptions. Now, let s_i be a selection indicator: if s_i = 1, then this person is included in the regression; if s_i = 0, then this person is dropped from the data.

Then running OLS using the selected subsample means you run OLS using only the observations with s_i = 1. This is equivalent to running the following regression:

s_i y_i = β₀s_i + β₁s_i x_i + s_i u_i

In this regression, s_i x_i is the explanatory variable and s_i u_i is the error term. The crucial condition under which this OLS is unbiased is the zero conditional mean assumption: E(s_i u_i | s_i x_i) = 0. Thus we need to check under what conditions this is satisfied.

To check E(s_i u_i | s_i x_i) = 0, it is sufficient to check whether E(s_i u_i | x_i, s_i) = 0. (If the latter is zero, the former is also zero.) Further, notice that E(s_i u_i | x_i, s_i) = s_i E(u_i | x_i, s_i), since s_i is in the conditioning set and can be pulled out. Thus, it is sufficient to check the condition which ensures E(u_i | x_i, s_i) = 0. To simplify the notation, I drop the i-subscript from now on, so I will check the condition under which E(u|x, s) = 0.

Condition under which running OLS on the selected subsample (truncated data) is unbiased

(A-1) Sample selection is done randomly. In this case, s is independent of u and x. Then we have E(u|x, s) = E(u|x). But since the original regression satisfies the OLS conditions, we have E(u|x) = 0. Therefore, in this case, the OLS is unbiased.

(A-2) The sample is selected based solely on the value of the x-variable. For example, if x is age and you select a person if their age is at least 20 years old, then s = 1 if x ≥ 20 and s = 0 if x < 20. In this case, s is a deterministic function of x. Thus we have

E(u|x, s) = E(u|x, s(x)) = E(u|x).

(If s is a deterministic function of x, you can drop s(x) from the conditioning set.) But E(u|x) = 0 since the original regression satisfies all the OLS conditions. Therefore, in this case, OLS is unbiased.

Condition under which running OLS on the selected subsample (truncated data) is biased

(B-1) Sample selection is based on the value of the y-variable. For example, y is monthly family income, and you select families whose income is smaller than $500. Then s = 1 if y < 500. Checking whether E(u|x, s) = 0 is equivalent to checking whether E(u|x, s=1) = 0 and E(u|x, s=0) = 0. So we check this.

E(u|x, s=1) = E(u|x, y < 500)
            = E(u|x, β₀ + β₁x + u < 500)
            = E(u|x, u < 500 − β₀ − β₁x)
            ≠ E(u|x)

(Since the set {u < 500 − β₀ − β₁x} directly depends on u, you cannot drop it from the conditioning set. Thus, this is not equal to E(u|x), which means it is not equal to zero.)

Thus, E(u|x, s=1) ≠ 0. Similarly, you can show that E(u|x, s=0) ≠ 0. Thus E(u|x, s) ≠ 0, and this OLS is biased.
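One step the slides leave implicit can be made exact (a standard truncated-normal result, stated here under the normality assumption u ~ N(0, σ²) used later on slide 25):

\[
E\big(u \mid x,\; u < 500 - \beta_0 - \beta_1 x\big)
= -\,\sigma\,\frac{\phi\!\big((500 - \beta_0 - \beta_1 x)/\sigma\big)}{\Phi\!\big((500 - \beta_0 - \beta_1 x)/\sigma\big)} \;<\; 0,
\]

where φ and Φ are the standard normal pdf and cdf. Because this conditional mean varies with x, it cannot be absorbed into the intercept, which is exactly why the OLS slope estimate is biased.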

(B-2) Sample selection is correlated with u_i. This happens when it is the people's decision, not the surveyor's decision, that determines the sample selection. This type of truncation is called incidental truncation. The bias that arises from this type of sample selection is called the sample selection bias.

The leading example is the wage offer regression for married women: wage_i = β₀ + β₁educ_i + u_i. When a woman decides not to work, her wage information is not available, so she is dropped from the data. Since it is the woman's decision, this sample selection is likely to be based on some unobservable factors which are contained in u_i.

For example, a woman decides to work if the wage offer is greater than her reservation wage. This reservation wage is likely to be determined by some unobserved factors in u, such as unobserved ability, unobserved family background, etc. Thus the selection criterion is likely to be correlated with u, which in turn means that s is correlated with u. Mathematically, this can be shown as follows.

If s is correlated with u, then you cannot drop s from the conditioning set. Thus we have

E(u|x, s) ≠ E(u|x),

which means that E(u|x, s) ≠ 0. Thus, this OLS is biased. Again, this type of bias is called the sample selection bias.

A slightly more complicated case

Suppose x is IQ, and a survey participant responds to your survey if IQ > v. In this case, the sample selection is based on the x-variable and a random error v. If you run OLS using only the truncated data, will it cause a bias?

Answer:
Case 1: If v is independent of u, then it does not cause a bias.
Case 2: If v is correlated with u, then this is the same case as (B-2). Thus, the OLS will be biased.

Estimation methods when data are truncated

When you have (B-1) type truncation, we use the truncated regression. When you have (B-2) type truncation (incidental truncation), we use the Heckman sample selection correction method, also called the Heckit model. I will explain these methods one by one.

The Truncated Regression

When the data truncation is of the (B-1) type, you apply the truncated regression model. To explain again, (B-1) type truncation happens because the surveyor samples people based on the value of the y-variable.

Suppose that the following regression satisfies all the OLS assumptions:

y_i = β₀ + β₁x_i + u_i,  u_i ~ N(0, σ²)

But you sample only if y_i < c_i. (This means you drop observations with y_i ≥ c_i by survey design.) In this case, you know the exact value of c_i for each person.

[Figure: Example of (B-1) type data truncation. Scatter plot of family income per month (vertical axis) against education of the household head (horizontal axis). Observations with income above $500 are dropped from the data; the true regression line is steeper than the biased regression line obtained by applying OLS to the truncated data.]

As can be seen, running OLS on the truncated data will cause a bias. The model that produces unbiased estimates is based on maximum likelihood estimation.

The estimation method is as follows. For each observation, we can write u_i = y_i − β₀ − β₁x_i. Thus, the likelihood contribution is the height of the density function. However, since we sample only if y_i < c_i, we have to use the density function of u_i conditional on y_i < c_i. This conditional density function is given on the next slide.

f(u_i | y_i < c_i) = f(u_i | β₀ + β₁x_i + u_i < c_i) = f(u_i | u_i < c_i − β₀ − β₁x_i)

                   = f(u_i) / P(u_i < c_i − β₀ − β₁x_i)
                   = f(u_i) / P(u_i/σ < (c_i − β₀ − β₁x_i)/σ)
                   = f(u_i) / Φ((c_i − β₀ − β₁x_i)/σ),

where f is the N(0, σ²) density:

f(u_i) = (1/√(2πσ²)) exp(−u_i²/(2σ²)) = (1/σ)φ(u_i/σ),

so that

f(u_i | y_i < c_i) = (1/σ)φ(u_i/σ) / Φ((c_i − β₀ − β₁x_i)/σ).

Here φ and Φ denote the standard normal pdf and cdf.

Thus, the likelihood contribution for the i-th observation is obtained by plugging u_i = y_i − β₀ − β₁x_i into the conditional density function. This is given by

L_i = (1/σ)φ((y_i − β₀ − β₁x_i)/σ) / Φ((c_i − β₀ − β₁x_i)/σ)

The likelihood function is given by

L(β₀, β₁, σ) = ∏_{i=1}^{n} L_i

The values of β₀, β₁, σ that maximize L are the estimators of the truncated regression.
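To see what this estimator maximizes in practice, the log likelihood can be coded directly with Stata's ml machinery. The sketch below is illustrative only: it assumes a common truncation point c_i = 800 for all observations (matching the exercise that follows), and the program name is made up; the built-in truncreg command implements the same likelihood.

    * Sketch of the truncated-regression log likelihood, assuming
    * a common upper truncation point of 800.
    program define mytrunc_ll
        args lnf xb lnsigma
        // log density of y minus log P(y < 800), per observation
        quietly replace `lnf' = ln(normalden($ML_y1, `xb', exp(`lnsigma'))) ///
                              - ln(normal((800 - `xb')/exp(`lnsigma')))
    end

    ml model lf mytrunc_ll (xb: familyinc = huseduc) (lnsigma:)
    ml maximize    // should closely reproduce: truncreg familyinc huseduc, ul(800)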

The partial effects

The estimated β₁ shows the effect of x on y in the underlying (untruncated) population. Thus, you can interpret the parameters as if they were OLS parameters estimated on the full, untruncated sample.

Exercise

We do not have suitable data for truncated regression, so let us truncate the data ourselves to check how the truncated regression works.

EX1. Use JPSC_familyinc.dta to estimate the following model using all the observations:
(family income) = β₀ + β₁(husband educ) + u
Family income is in 10,000 yen.

EX2. Then run the OLS using only the observations with familyinc < 800. How did the parameters change?

EX3. Run the truncated regression model for the data truncated from above at 800 (i.e., the data dropping all observations with familyinc ≥ 800). How did the parameters change? Did the truncated regression recover the parameters of the original regression?

OLS using all the observations:

. reg familyinc huseduc

      Source |       SS       df       MS              Number of obs =    7695
-------------+------------------------------           F(  1,  7693) =  924.22
       Model |  38305900.9     1  38305900.9           Prob > F      =  0.0000
    Residual |   318850122  7693  41446.7856           R-squared     =  0.1073
-------------+------------------------------           Adj R-squared =  0.1071
       Total |   357156023  7694  46420.0705           Root MSE      =  203.58

   familyinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     huseduc |   32.93413   1.083325    30.40   0.000     30.81052    35.05775
       _cons |    143.895   15.09181     9.53   0.000     114.3109     173.479
OLS after dropping the observations with familyinc ≥ 800; the parameter on huseduc is biased toward zero:

. reg familyinc huseduc if familyinc<800

      Source |       SS       df       MS              Number of obs =    6274
-------------+------------------------------           F(  1,  6272) =  602.70
       Model |  11593241.1     1  11593241.1           Prob > F      =  0.0000
    Residual |   120645494  6272  19235.5699           R-squared     =  0.0877
-------------+------------------------------           Adj R-squared =  0.0875
       Total |   132238735  6273   21080.621           Root MSE      =  138.69

   familyinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     huseduc |   20.27929   .8260432    24.55   0.000     18.65996    21.89861
       _cons |   244.5233   11.33218    21.58   0.000     222.3084    266.7383

Truncated regression model with the upper truncation limit equal to 800: observations with familyinc ≥ 800 are automatically dropped from this regression.

. truncreg familyinc huseduc, ul(800)
(note: 1421 obs. truncated)

Fitting full model:

Iteration 0:   log likelihood = -39676.782
Iteration 1:   log likelihood = -39618.757
Iteration 2:   log likelihood = -39618.629
Iteration 3:   log likelihood = -39618.629

Truncated regression
Limit:   lower =       -inf                     Number of obs  =       6274
         upper =        800                     Wald chi2(1)   =     569.90
Log likelihood = -39618.629                     Prob > chi2    =     0.0000

   familyinc |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     huseduc |   24.50276     1.0264    23.87   0.000     22.49105    26.51446
       _cons |   203.6856   13.75721    14.81   0.000     176.7219    230.6492
-------------+----------------------------------------------------------------
      /sigma |   153.1291   1.805717    84.80   0.000       149.59    156.6683

The bias seems to be corrected, though not perfectly in this example.

Heckman Sample Selection Bias Correction (Heckit Model)

The most common reason for data truncation is the (B-2) type: incidental truncation. This data truncation usually occurs because sample selection is determined by the people's decision, not the surveyor's decision. Consider the wage regression example. If a person has chosen to work, the person has self-selected into the sample. If the person has decided not to work, the person has self-selected out of the sample. The bias caused by this type of truncation is called the sample selection bias.

Bias correction for this type of data truncation is done by the Heckman sample selection correction method, also called the Heckit model. Consider the wage regression model. In the Heckit model, you have a wage equation and a sample selection equation:

Wage eq:      y_i = x_iβ + u_i,  u_i ~ N(0, σ_u²)
Selection eq: s_i* = z_iγ + e_i,  e_i ~ N(0, 1)

such that the person works if s_i* > 0. That is, s_i = 1 if s_i* > 0, and s_i = 0 if s_i* ≤ 0.

In the above equations, I am using the following vector notation: β = (β₀, β₁, β₂, …, β_k)ᵀ and x_i = (1, x_i1, x_i2, …, x_ik); γ = (γ₀, γ₁, …, γ_m)ᵀ and z_i = (1, z_i1, z_i2, …, z_im).

We assume that x_i and z_i are exogenous in the sense that E(u_i | x_i, z_i) = 0. Further, assume that x_i is a strict subset of z_i; that is, all the x-variables are also part of z_i. For example, x_i = (1, exper_i, age_i) and z_i = (1, exper_i, age_i, kidslt6_i). We require that z_i contain at least one variable that is not contained in x_i.

The structural error u_i and the sample selection s_i are correlated only if u_i and e_i are correlated. In other words, the sample selection causes a bias only if u_i and e_i are correlated. Let us denote the correlation between u_i and e_i by ρ = corr(u_i, e_i).

The data requirement of the Heckit model is as follows.
1. y_i is available only for the observations who are currently working.
2. However, x_i and z_i are available both for those who are working and for those who are not working.

Now, I will describe the Heckit model. First, the expected value of y_i given that the person has participated in the labor force (i.e., s_i = 1) is written as

E(y_i | s_i = 1, z_i) = E(y_i | s_i* > 0, z_i)
                      = E(y_i | z_iγ + e_i > 0, z_i)
                      = E(y_i | e_i > −z_iγ, z_i)
                      = E(x_iβ + u_i | e_i > −z_iγ, z_i)
                      = x_iβ + E(u_i | e_i > −z_iγ, z_i)

Using a result for the bivariate normal distribution, the last term can be shown to be E(u_i | e_i > −z_iγ, z_i) = ρσ_u φ(z_iγ)/Φ(z_iγ). The term φ(z_iγ)/Φ(z_iγ) is the inverse Mills ratio, λ(z_iγ).
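Where this bivariate normal result comes from can be sketched in one line (a standard derivation, not spelled out on the slide): project u_i linearly on e_i.

\[
u_i = \rho\sigma_u\, e_i + v_i,\; v_i \perp e_i
\;\Longrightarrow\;
E(u_i \mid e_i > -z_i\gamma,\, z_i)
= \rho\sigma_u\, E(e_i \mid e_i > -z_i\gamma)
= \rho\sigma_u\,\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)},
\]

using the truncated standard normal mean E(e | e > −c) = φ(c)/Φ(c), which follows from φ(−c) = φ(c) and 1 − Φ(−c) = Φ(c).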

Thus, we have

E(y_i | s_i = 1, z_i) = x_iβ + E(u_i | e_i > −z_iγ, z_i) = x_iβ + ρσ_u λ(z_iγ)

Heckman showed that the sample selection bias can be viewed as an omitted variable bias, where the omitted variable is λ(z_iγ).

The important thing to note is that λ(z_iγ) can easily be estimated. How? Note that the selection equation is simply a probit model of labor force participation. So, estimate the sample selection equation by probit to obtain the estimate of γ, then compute λ(z_iγ̂).

Then you can correct the bias by including λ(z_iγ̂) in the wage regression and estimating the model by OLS. Heckman showed that this method corrects for the sample selection bias. This method is the Heckit model. The next slide summarizes the Heckit model.

Heckman Two-Step Sample Selection Correction Method (Heckit Model)

Wage eq:      y_i = x_iβ + u_i,  u_i ~ N(0, σ_u²)
Selection eq: s_i* = z_iγ + e_i,  e_i ~ N(0, 1)

such that the person works if s_i* > 0 and does not work if s_i* ≤ 0.

Assumption 1: E(u_i | x_i, z_i) = 0
Assumption 2: x_i is a strict subset of z_i.

If u_i and e_i are correlated, OLS estimation of the wage equation (using only the observations who are working) is biased, as the simulation sketched below illustrates.
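As an illustrative sketch (all names and numbers below are invented for the simulation), you can verify on artificial data that OLS on the self-selected sample is biased while heckman recovers the true coefficient:

    * Simulated incidental truncation: the true slope on x is 2.
    clear
    set obs 10000
    set seed 7
    matrix C = (1, .6 \ .6, 1)
    drawnorm u e, corr(C)            // corr(u, e) = 0.6: selection not ignorable
    gen x  = rnormal()
    gen z2 = rnormal()               // in z but excluded from x (exclusion restriction)
    gen s  = (0.3 + 0.8*x + z2 + e > 0)   // selection equation
    gen y  = 1 + 2*x + u if s == 1   // y observed only for the selected

    reg y x                               // biased slope
    heckman y x, select(s = x z2) twostep // slope close to the true value 2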

First step: Estimate the sample selection equation parameters γ by probit. Then compute λ(z_iγ̂).

Second step: Plug λ(z_iγ̂) into the wage equation, then estimate the equation by OLS. That is, estimate the following:

y_i = x_iβ + ρσ_u λ(z_iγ̂) + error_i

In this model, ρσ_u is the coefficient on λ(z_iγ̂). If ρ ≠ 0, then the sample selection bias is present. If ρ = 0, then it is evidence that sample selection bias is not present.

Note: when you follow exactly this procedure, you get the correct coefficients, but you do not get the correct standard errors. For the exact formula for the standard errors, consult Wooldridge (2002). Stata automatically computes the correct standard errors.

Exercise

Using Mroz.dta, estimate the wage offer equation with the Heckit model. The explanatory variables for the wage offer equation are educ, exper, and expersq. The explanatory variables for the sample selection equation are educ, exper, expersq, nwifeinc, age, kidslt6, and kidsge6.

Estimating the Heckit model manually. (Note: you will not get the correct standard errors.)

. **********************************************
. * Estimating heckit model manually           *
. **********************************************
. ***************************
. * First create selection  *
. * variable                *
. ***************************
. gen s=0 if wage==.
(428 missing values generated)

. replace s=1 if wage~=.
(428 real changes made)

. *******************************
. * Next, estimate the probit   *
. * selection equation          *
. *******************************
. probit s educ exper expersq nwifeinc age kidslt6 kidsge6

Iteration 0:   log likelihood =  -514.8732
Iteration 1:   log likelihood = -405.78215
Iteration 2:   log likelihood = -401.32924
Iteration 3:   log likelihood = -401.30219
Iteration 4:   log likelihood = -401.30219

Probit regression                               Number of obs   =        753
                                                LR chi2(7)      =     227.14
                                                Prob > chi2     =     0.0000
Log likelihood = -401.30219                     Pseudo R2       =     0.2206

           s |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1309047   .0252542     5.18   0.000     .0814074     .180402
       exper |   .1233476   .0187164     6.59   0.000     .0866641    .1600311
     expersq |  -.0018871      .0006    -3.15   0.002     -.003063   -.0007111
    nwifeinc |  -.0120237   .0048398    -2.48   0.013    -.0215096   -.0025378
         age |  -.0528527   .0084772    -6.23   0.000    -.0694678   -.0362376
     kidslt6 |  -.8683285   .1185223    -7.33   0.000    -1.100628    -.636029
     kidsge6 |    .036005   .0434768     0.83   0.408     -.049208    .1212179
       _cons |   .2700768    .508593     0.53   0.595    -.7267473    1.266901

The first step: the probit selection equation.

The second step:

. *******************************
. * Then create inverse lambda  *
. *******************************
. predict xdelta, xb

. gen lambda = normalden(xdelta)/normal(xdelta)

. *************************************
. * Finally, estimate the Heckit model *
. *************************************
. reg lwage educ exper expersq lambda

      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  4,   423) =   19.69
       Model |  35.0479487     4  8.76198719           Prob > F      =  0.0000
    Residual |  188.279492   423  .445105182           R-squared     =  0.1569
-------------+------------------------------           Adj R-squared =  0.1490
       Total |  223.327441   427  .523015084           Root MSE      =  .66716

       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1090655   .0156096     6.99   0.000     .0783835    .1397476
       exper |   .0438873   .0163534     2.68   0.008     .0117434    .0760313
     expersq |  -.0008591   .0004414    -1.95   0.052    -.0017267    8.49e-06
      lambda |   .0322619   .1343877     0.24   0.810    -.2318889    .2964126
       _cons |  -.5781032    .306723    -1.88   0.060    -1.180994     .024788

Note: the standard errors are not correct.

Heckit estimated automatically:

. heckman lwage educ exper expersq, select(s=educ exper expersq nwifeinc age kidslt6 kidsge6) twostep

Heckman selection model -- two-step estimates   Number of obs      =       753
(regression model with sample selection)        Censored obs       =       325
                                                Uncensored obs     =       428

                                                Wald chi2(3)       =     51.53
                                                Prob > chi2        =    0.0000

             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lwage        |
        educ |   .1090655    .015523     7.03   0.000     .0786411      .13949
       exper |   .0438873   .0162611     2.70   0.007     .0120163    .0757584
     expersq |  -.0008591   .0004389    -1.96   0.050    -.0017194    1.15e-06
       _cons |  -.5781032   .3050062    -1.90   0.058    -1.175904      .019698
-------------+----------------------------------------------------------------
s            |
        educ |   .1309047   .0252542     5.18   0.000     .0814074     .180402
       exper |   .1233476   .0187164     6.59   0.000     .0866641    .1600311
     expersq |  -.0018871      .0006    -3.15   0.002     -.003063   -.0007111
    nwifeinc |  -.0120237   .0048398    -2.48   0.013    -.0215096   -.0025378
         age |  -.0528527   .0084772    -6.23   0.000    -.0694678   -.0362376
     kidslt6 |  -.8683285   .1185223    -7.33   0.000    -1.100628    -.636029
     kidsge6 |    .036005   .0434768     0.83   0.408     -.049208    .1212179
       _cons |   .2700768    .508593     0.53   0.595    -.7267473    1.266901
-------------+----------------------------------------------------------------
mills        |
      lambda |   .0322619   .1336246     0.24   0.809    -.2296376    .2941613
-------------+----------------------------------------------------------------
         rho |    0.04861
       sigma |  .66362875
      lambda |  .03226186   .1336246

Note: H0: ρ = 0 cannot be rejected, so there is little evidence that sample selection bias is present.
