
Heckman Selection Models

Vartanian: SW 541

We're examining the age at which people first get married. A number of people have missing values
because they have never been married. We want to see whether not being married is a non-random process and,
if it is, correct for this non-randomness in our regression results.
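
In symbols, the setup behind a selection model can be sketched as follows. The notation (x_i, z_i, beta, gamma, and so on) is introduced here for illustration and does not come from the handout; it follows the usual textbook formulation, in which the outcome is seen only for selected cases.

```latex
\begin{aligned}
\text{outcome:}\quad   & y_i = x_i\beta + \varepsilon_i
                         \qquad \text{(observed only if } s_i = 1\text{)} \\
\text{selection:}\quad & s_i = 1\{\, z_i\gamma + u_i > 0 \,\} \\
\text{errors:}\quad    & (\varepsilon_i, u_i) \sim \text{bivariate normal},\quad
                         \operatorname{sd}(\varepsilon_i) = \sigma,\;
                         \operatorname{sd}(u_i) = 1,\;
                         \operatorname{corr}(\varepsilon_i, u_i) = \rho \\
\text{implication:}\quad & E[\, y_i \mid s_i = 1 \,]
                         = x_i\beta + \rho\sigma\,\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}
\end{aligned}
```

When rho = 0 the last term drops out, so a regression on the observed cases alone is not distorted by selection. The product rho*sigma is what the Stata output labels lambda.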

First, we'll examine whether age at first marriage is missing for many people in our sample. I do this in Stata
with the following commands.

. generate nomarr=agemarr~=.

. tab nomarr

     nomarr |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,172       29.43       29.43
          1 |      2,811       70.57      100.00
------------+-----------------------------------
      Total |      3,983      100.00

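Here, we see that of our 3,983 observations, 2,811 have been married and 1,172 have never been married, and the
latter therefore have no valid value for agemarr. (Given the variable's name, it may seem that these numbers
should be reversed, with 2,811 not married. However, the expression agemarr~=. is true, and so coded 1, when
agemarr is not missing, so nomarr=1 actually marks the people who have been married.)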
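
To make the coding explicit, here is a minimal sketch in Python (not Stata) of what the expression agemarr~=. computes. The sample values are made up for illustration, and None stands in for Stata's missing value.

```python
# Hypothetical ages at first marriage; None plays the role of Stata's "." (missing).
agemarr = [24, None, 31, 19, None, 27]

# Mirror of `generate nomarr = agemarr~=.`:
# the comparison is true (1) when agemarr is NOT missing, false (0) when it is.
nomarr = [int(age is not None) for age in agemarr]

ever_married = sum(nomarr)                    # rows coded 1
never_married = len(nomarr) - ever_married    # rows coded 0

print(nomarr)
print("coded 1 (has an age at first marriage):", ever_married)
print("coded 0 (agemarr missing):", never_married)
```

Read this way, the 1 row in the tabulation counts people with an observed age at first marriage, whatever the variable's name suggests.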

Next, we can run a probit analysis (the Heckman model uses a probit rather than a logit for the selection step)
to see whether any factors affect the likelihood of being married, and thus of having an age at first marriage.
(I've chosen these variables somewhat arbitrarily.)

. probit nomarr income male norelig bigcity kds
note: norelig != 0 predicts success perfectly
norelig dropped and 5 obs not used
Iteration 0: log likelihood = -2403.0562
Iteration 1: log likelihood = -2363.3512
Iteration 2: log likelihood = -2363.2058
Iteration 3: log likelihood = -2363.2058
Probit estimates Number of obs = 3966
LR chi2(4) = 79.70
Prob > chi2 = 0.0000
Log likelihood = -2363.2058 Pseudo R2 = 0.0166
------------------------------------------------------------------------------
nomarr | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
income | 4.52e-06 6.33e-07 7.15 0.000 3.28e-06 5.76e-06
male | .0104852 .0424389 0.25 0.805 -.0726934 .0936639
bigcity | -.2310436 .0436178 -5.30 0.000 -.316533 -.1455543
kds | .0123415 .012217 1.01 0.312 -.0116033 .0362862
_cons | .3618608 .0644555 5.61 0.000 .2355303 .4881912
------------------------------------------------------------------------------
From this, we can see that two factors significantly affect the likelihood of being married: higher income
raises it, and living in a big city lowers it. The coefficients on male and kds are not statistically significant.
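
Probit coefficients are on the scale of a standard normal index, so they become probabilities only after the linear index is passed through the normal CDF. The sketch below, in Python rather than Stata, does that by hand with the coefficients reported above. The two example people (their income and city values) are hypothetical.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Coefficients copied from the probit output above.
coef = {"income": 4.52e-06, "male": 0.0104852, "bigcity": -0.2310436,
        "kds": 0.0123415, "_cons": 0.3618608}

def prob_married(income, male, bigcity, kds):
    """Predicted probability that nomarr = 1 (agemarr is observed)."""
    index = (coef["_cons"] + coef["income"] * income + coef["male"] * male
             + coef["bigcity"] * bigcity + coef["kds"] * kds)
    return normal_cdf(index)

# Hypothetical people: identical except for big-city residence.
p_small = prob_married(income=30000, male=0, bigcity=0, kds=0)
p_big = prob_married(income=30000, male=0, bigcity=1, kds=0)

print(f"outside a big city: {p_small:.3f}")
print(f"in a big city:      {p_big:.3f}")
```

Because the normal CDF is nonlinear, the size of the big-city gap depends on the values chosen for the other variables, so the printed figures describe these two made-up cases only.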

Next, we'll run an OLS model, which may suffer from selection bias, and then a Heckman selection model.
C:\WP60\LECT2.PHD\Heckman Selection\Heckman Selection Models.doc 1
OLS model
. regress agemarr income male norelig bigcity kds
Source | SS df MS Number of obs = 2804
-------------+------------------------------ F( 5, 2798) = 48.64
Model | 3961.61695 5 792.323389 Prob > F = 0.0000
Residual | 45582.367 2798 16.2910533 R-squared = 0.0800
-------------+------------------------------ Adj R-squared = 0.0783
Total | 49543.984 2803 17.6753421 Root MSE = 4.0362
------------------------------------------------------------------------------
agemarr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
income | .0000127 2.18e-06 5.81 0.000 8.39e-06 .0000169
male | 1.524309 .1530466 9.96 0.000 1.224213 1.824404
bigcity | 1.409821 .1618219 8.71 0.000 1.092519 1.727124
kds | .1954122 .0457291 4.27 0.000 .105746 .2850784
_cons | 21.02824 .2366657 88.85 0.000 20.56418 21.49229
------------------------------------------------------------------------------

Heckman Selection Model:

Heckman selection model Number of obs = 3971
(regression model with sample selection) Censored obs = 1167
Uncensored obs = 2804
Wald chi2(4) = 234.46
Log likelihood = -10257.34 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
agemarr |
income | .0000126 2.52e-06 4.99 0.000 7.64e-06 .0000175
male | 1.54123 .1530358 10.07 0.000 1.241285 1.841174
bigcity | 1.428649 .1786608 8.00 0.000 1.07848 1.778818
kds | .1910291 .0458917 4.16 0.000 .101083 .2809752
_cons | 21.07386 .4386082 48.05 0.000 20.21421 21.93352
-------------+----------------------------------------------------------------
select |
income | 4.52e-06 6.34e-07 7.13 0.000 3.27e-06 5.76e-06
male | .0084382 .042425 0.20 0.842 -.0747133 .0915897
bigcity | -.2317217 .0436151 -5.31 0.000 -.3172057 -.1462377
kds | .0127582 .0122102 1.04 0.296 -.0111734 .0366898
_cons | .3631359 .0644777 5.63 0.000 .2367619 .4895099
-------------+----------------------------------------------------------------
/athrho | -.0259488 .162368 -0.16 0.873 -.3441843 .2922866
/lnsigma | 1.395891 .0135187 103.26 0.000 1.369395 1.422387
-------------+----------------------------------------------------------------
rho | -.025943 .1622587 -.3312078 .2842381
sigma | 4.038571 .0545964 3.93297 4.147009
lambda | -.1047727 .6555158 -1.38956 1.180015
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0): chi2(1) = 0.02 Prob > chi2 = 0.8841

If we look at this closely, we see that some of the selection variables (income and bigcity) affect the
likelihood of being censored out of the sample, but that we fail to reject the hypothesis that rho=0
(LR chi2(1) = 0.02, p = 0.8841). If we then compare the coefficient estimates for the two models, we'll see
that they are very similar. In this case, there is no evidence that the OLS model is biased by selection.
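
Stata estimates the transformed parameters /athrho and /lnsigma and then derives rho, sigma, and lambda from them. As a rough check on the table above, this Python sketch (not Stata) reproduces those derived values and the Wald z test on /athrho. The variable names are my own.

```python
import math

# Values copied from the Heckman output above (Example 1).
athrho, se_athrho = -0.0259488, 0.162368
lnsigma = 1.395891

rho = math.tanh(athrho)      # athrho is the inverse hyperbolic tangent of rho
sigma = math.exp(lnsigma)    # lnsigma is the natural log of sigma
lam = rho * sigma            # lambda = rho * sigma

# Wald test of athrho = 0 (equivalent to rho = 0): two-sided normal p-value.
z = athrho / se_athrho
p_value = math.erfc(abs(z) / math.sqrt(2.0))

print(f"rho = {rho:.6f}, sigma = {sigma:.6f}, lambda = {lam:.6f}")
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

The Wald test on /athrho and the LR test at the bottom of the table address the same hypothesis by different routes, so their p-values are close but need not be identical.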



Example 2.

We're examining the wages of wives. There are over 800 wives with 0 wages. We would like to determine whether
working is a random process, where some wives happen to work and some do not, or a non-random process. First,
our OLS model gives us the following information.
. regress wageswf income kids youngest
Source | SS df MS Number of obs = 1726
-------------+------------------------------ F( 3, 1722) = 188.40
Model | 34064.8457 3 11354.9486 Prob > F = 0.0000
Residual | 103787.983 1722 60.2717669 R-squared = 0.2471
-------------+------------------------------ Adj R-squared = 0.2458
Total | 137852.828 1725 79.9146831 Root MSE = 7.7635
------------------------------------------------------------------------------
wageswf | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
income | .0001151 4.86e-06 23.70 0.000 .0001056 .0001246
kids | .2038852 .1834453 1.11 0.267 -.1559138 .5636843
youngest | -.0856795 .0406674 -2.11 0.035 -.1654423 -.0059168
_cons | 4.429739 .3830775 11.56 0.000 3.678393 5.181085
------------------------------------------------------------------------------

We next run a Heckman model with a limited number of selection variables.
. heckman wageswf income kids youngest, select(youngest white)
Iteration 0: log likelihood = -8501.7992
Iteration 1: log likelihood = -7903.3723
Iteration 2: log likelihood = -7793.5076
Iteration 3: log likelihood = -7612.9477
Iteration 4: log likelihood = -7565.4426
Iteration 5: log likelihood = -7564.2854
Iteration 6: log likelihood = -7564.2536
Iteration 7: log likelihood = -7564.2529
Heckman selection model Number of obs = 2560
(regression model with sample selection) Censored obs = 834
Uncensored obs = 1726
Wald chi2(3) = 565.28
Log likelihood = -7564.253 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
wageswf |
income | .0001151 4.85e-06 23.72 0.000 .0001056 .0001246
kids | .203736 .1833006 1.11 0.266 -.1555266 .5629986
youngest | -.0864731 .04876 -1.77 0.076 -.182041 .0090947
_cons | 4.452173 .8532013 5.22 0.000 2.779929 6.124417
-------------+----------------------------------------------------------------
select |
youngest | .0459541 .0055746 8.24 0.000 .0350282 .0568801
white | -.0066306 .0583676 -0.11 0.910 -.1210289 .1077678
_cons | .3075549 .0536253 5.74 0.000 .2024512 .4126586
-------------+----------------------------------------------------------------
/athrho | -.0047056 .1600203 -0.03 0.977 -.3183397 .3089284
/lnsigma | 2.048273 .0170247 120.31 0.000 2.014905 2.081641
-------------+----------------------------------------------------------------
rho | -.0047056 .1600168 -.3080049 .2994619
sigma | 7.754496 .1320182 7.500014 8.017612
lambda | -.0364894 1.240864 -2.468538 2.395559
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0): chi2(1) = 0.00 Prob > chi2 = 0.9795
------------------------------------------------------------------------------

From this, we find that the OLS coefficients are not statistically different from the Heckman-corrected estimates.
Let's use a few more selection variables to see what happens.
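
One informal way to see that the two sets of estimates agree is to express each gap between OLS and Heckman in units of the Heckman standard error. The Python sketch below (not Stata) does this with the numbers from the two tables above. It is a descriptive comparison only, not a formal test.

```python
# (coefficient, standard error) pairs copied from the two tables above.
ols = {"income": (0.0001151, 4.86e-06), "kids": (0.2038852, 0.1834453),
       "youngest": (-0.0856795, 0.0406674), "_cons": (4.429739, 0.3830775)}
heckman = {"income": (0.0001151, 4.85e-06), "kids": (0.203736, 0.1833006),
           "youngest": (-0.0864731, 0.04876), "_cons": (4.452173, 0.8532013)}

def gap_in_se_units(name):
    """Heckman estimate minus OLS estimate, scaled by the Heckman standard error."""
    b_ols, _ = ols[name]
    b_heck, se_heck = heckman[name]
    return (b_heck - b_ols) / se_heck

gaps = {name: gap_in_se_units(name) for name in ols}
for name, gap in gaps.items():
    print(f"{name:>8}: {gap:+.3f} Heckman standard errors")
```

Every gap is a small fraction of one standard error, which matches the LR test's failure to reject rho=0 for this specification.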



. heckman wageswf income kids youngest, select(youngest white income kids)
Iteration 0: log likelihood = -9533.1633 (not concave)
Iteration 1: log likelihood = -8821.2528 (not concave)
Iteration 2: log likelihood = -8276.376 (not concave)
Iteration 3: log likelihood = -7667.963
Iteration 4: log likelihood = -7528.6591
Iteration 5: log likelihood = -7510.5426
Iteration 6: log likelihood = -7509.8196
Iteration 7: log likelihood = -7509.8163
Iteration 8: log likelihood = -7509.8163
Heckman selection model Number of obs = 2560
(regression model with sample selection) Censored obs = 834
Uncensored obs = 1726
Wald chi2(3) = 509.92
Log likelihood = -7509.816 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
wageswf |
income | .0001104 4.97e-06 22.22 0.000 .0001006 .0001201
kids | .1194365 .1863746 0.64 0.522 -.245851 .484724
youngest | -.1274718 .0423557 -3.01 0.003 -.2104875 -.0444561
_cons | 6.157827 .5279928 11.66 0.000 5.12298 7.192673
-------------+----------------------------------------------------------------
select |
youngest | .0334537 .0063111 5.30 0.000 .0210843 .0458232
white | -.0936068 .0593574 -1.58 0.115 -.2099452 .0227316
income | 7.26e-06 7.32e-07 9.92 0.000 5.83e-06 8.70e-06
kids | .070942 .0254388 2.79 0.005 .0210828 .1208011
_cons | .0018477 .0627088 0.03 0.976 -.1210593 .1247546
-------------+----------------------------------------------------------------
/athrho | -.3139508 .070147 -4.48 0.000 -.4514365 -.1764652
/lnsigma | 2.071723 .0204002 101.55 0.000 2.031739 2.111707
-------------+----------------------------------------------------------------
rho | -.304027 .0636632 -.4230791 -.174656
sigma | 7.938488 .1619469 7.62734 8.262329
lambda | -2.413515 .5342469 -3.46062 -1.366411
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0): chi2(1) = 10.43 Prob > chi2 = 0.0012
------------------------------------------------------------------------------

With these new selection variables, we reject rho=0 (LR chi2(1) = 10.43, p = 0.0012). This indicates that our
OLS estimates were biased by selection and that the Heckman correction is warranted. Consider what happens to
the effect of youngest, for example: its coefficient moves from -.0857 in the OLS model to -.1275 here.
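
To put numbers on both points, the following Python sketch (not Stata) recomputes the p-value of the LR test from its chi-squared statistic and measures how far the youngest coefficient moves between OLS and the second Heckman model. For one degree of freedom, the chi-squared tail probability equals erfc(sqrt(x/2)), which keeps the example within the standard library.

```python
import math

def chi2_1df_tail(x: float) -> float:
    """P(chi-squared with 1 df > x), using its identity with the normal tail."""
    return math.erfc(math.sqrt(x / 2.0))

# LR test of rho = 0 from the second Heckman model above.
lr_p = chi2_1df_tail(10.43)

# Coefficient on youngest: OLS versus the second Heckman model.
b_ols, b_heckman = -0.0856795, -0.1274718
shift = b_heckman - b_ols
pct_change = 100.0 * shift / abs(b_ols)

print(f"LR test p-value: {lr_p:.4f}")
print(f"youngest: OLS {b_ols:.4f} -> Heckman {b_heckman:.4f} "
      f"(shift {shift:+.4f}, {pct_change:+.1f}%)")
```

The percentage is only a descriptive summary of the shift. The evidence that the correction matters comes from the test of rho=0.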
