This example uses data in the file 2slseg.dta. It contains 2932 observations from a sample of young adult males in the U.S. in 1976. The variables are:
1. nearc2   =1 if lived near a 2 yr college in 1966
2. nearc4   =1 if lived near a 4 yr college in 1966
3. educ     years of schooling, 1976
4. age      age in years, 1976
5. smsa     =1 if lived in an SMSA, 1976 (SMSA = Standard Metropolitan Statistical Area; basically indicates living in an urban area)
6. south    =1 if lived in southern U.S., 1976
7. wage     hourly wage in cents, 1976
8. married  =1 if married, 1976
This data set is used in the article "Using Geographic Variation in College Proximity to Estimate the Returns to Schooling" by D. Card (1994), in L.N. Christofides et al. (eds.), Aspects of Labour Market Behaviour: Essays in Honour of John Vanderkamp, and in the textbook Introductory Econometrics: A Modern Approach, second edition, by Jeffrey M. Wooldridge.
The goal is to estimate the percentage effect on the wage of an extra year of education, by estimating the coefficient on the EDUC variable in a regression equation with the log of WAGE as the dependent variable, controlling for other factors as follows:

log(WAGE) = β0 + β1 EDUC + β2 AGE + β3 MARRIED + β4 SMSA + u
This will be referred to as the wage equation. It is commonly thought that EDUC is correlated with the error term in the wage equation, because the error contains unobserved ability. If that correlation is positive, OLS over-estimates the effect of EDUC on the log wage. Valid instruments are hard to find, though: they must be uncorrelated with the error term, yet help to predict years of schooling. In this example, information on whether these young men lived near two types of colleges 10 years earlier is used as instruments.

Here is the do file without comments:
    ******************************************************************************
    ** 2SLS.do : March 2007
    ******************************************************************************

    clear
    capture log using "C:\Documents and Settings\courses\761 and 762\w07\2SLS\2SLS.log", replace
    use "C:\Documents and Settings\courses\761 and 762\w07\2SLS\2SLSeg.dta"

    summarize

    gen lwage=log(wage)

    ** IV regression (2SLS) **
    ivreg lwage age married smsa (educ = nearc2 nearc4)

    ** general version of Hausman test **
    predict ivresid, residuals
    est store ivreg
    reg lwage educ age married smsa
    hausman ivreg ., constant sigmamore df(1)

    ** Wu version of Hausman test **
    quietly reg educ age married smsa nearc2 nearc4
    predict educhat, xb
    reg lwage educ age married smsa educhat

    ** overidentification test **
    quietly reg ivresid age married smsa nearc2 nearc4
    predict explresid, xb
    matrix accum rssmat = explresid, noconstant
    matrix accum tssmat = ivresid, noconstant
    scalar nobs=e(N)
    scalar x2=nobs*rssmat[1,1]/tssmat[1,1]
    scalar pval=1-chi2(1,x2)
    scalar list x2 pval

    log close
Here is the same do file with comments about some of the commands inserted below them:
    ******************************************************************************
    ** 2SLS.do : March 2007
    ******************************************************************************

    clear
    capture log using "C:\Documents and Settings\courses\761 and 762\w07\2SLS\2SLS.log", replace
    use "C:\Documents and Settings\courses\761 and 762\w07\2SLS\2SLSeg.dta"

    summarize

    gen lwage=log(wage)

    ** IV regression (2SLS) **
    ivreg lwage age married smsa (educ = nearc2 nearc4)
This ivreg command computes the 2SLS estimates. The dependent variable is lwage. The regressors that are assumed exogenous are left outside of the parentheses: age married smsa. The regressors that are assumed endogenous are in the parentheses to the left of the equals sign; there is just one in this example, educ. In the parentheses to the right of the equals sign are the instrumental variables, which are assumed exogenous and do not appear as regressors in the equation; here they are nearc2 and nearc4. The key assumption is that living near 2 yr and 4 yr colleges in 1966 is not correlated with the error in the wage equation, but does help to explain years of schooling in 1976.
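The second half of the key assumption, that the instruments help to explain years of schooling, can be checked directly. The following is a minimal sketch and is not part of the original do file: it runs the first-stage regression and tests the joint significance of the two instruments.

```stata
* Not in the original do file: first-stage check of instrument relevance
reg educ age married smsa nearc2 nearc4
* Joint F-test that the coefficients on both instruments are zero
test nearc2 nearc4
```

A small F statistic here would suggest that the instruments are only weakly related to educ.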
    ** general version of Hausman test **
    predict ivresid, residuals

This post-estimation command stores the 2SLS residuals in a variable that I called ivresid.

    est store ivreg

This post-estimation command stores some of the 2SLS results for later use in a Hausman test.

    reg lwage educ age married smsa

This command estimates the same equation by OLS in order to compute the Hausman test statistic.

    hausman ivreg ., constant sigmamore df(1)
This command computes the Hausman test statistic. The null hypothesis is that the OLS estimator is consistent. If it is not rejected, we would probably prefer OLS to 2SLS, since OLS is efficient under the null. The constant option is necessary to tell Stata to include the constant term in the comparison of the two sets of estimates. The sigmamore option tells Stata to use the same estimate of the variance of the error term for both models. This is desirable here since the error term has the same interpretation in both models. The df(1) option tells Stata that the null distribution has one degree of freedom. Stata was able to figure this out when I left this option out, even though the Hausman test is comparing the values of two 5-element (not one-element) vectors. It probably knew this by finding only one non-zero eigenvalue of the 5-by-5 covariance matrix estimate that it calls (V_b-V_B) in the output. It's safer to impose the d.f. in the hausman command as above.
    ** Wu version of Hausman test **
    quietly reg educ age married smsa nearc2 nearc4

The above OLS regression is done only to get the predicted value of educ, which is needed for the Wu version of the Hausman test as described on p. 82 of the Greene text, 5th edition. To reduce the amount of output in the log file, its output is suppressed by preceding the command with quietly.

    predict educhat, xb
    reg lwage educ age married smsa educhat

This OLS regression takes the original wage equation and adds the OLS predicted values of all of the (suspected) endogenous variables. Here there is only one, educhat. It was predicted using the full set of exogenous variables. The Wu version of the Hausman test is the standard significance test for the coefficient(s) on these added variables. Since there's just one here, use a two-sided t-test.
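An equivalent form of the same test adds the first-stage residuals instead of the predicted values. Since educ is the sum of its predicted value and its residual, the coefficient on the residual is the negative of the coefficient on the predicted value, with the same standard error, so the t-test leads to the same conclusion. The sketch below is not part of the original do file, and vhat is a hypothetical variable name.

```stata
* Not in the original do file; vhat is a hypothetical variable name
quietly reg educ age married smsa nearc2 nearc4
* Save the first-stage residuals rather than the fitted values
predict vhat, residuals
* The t-statistic on vhat matches the one on educhat, with the sign reversed
reg lwage educ age married smsa vhat
```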
    ** overidentification test **
    quietly reg ivresid age married smsa nearc2 nearc4

The uncentred R-squared of the above regression will be computed below to produce the overidentification test statistic, also known as the Sargan statistic. The dependent variable ivresid is the 2SLS residual vector, saved earlier.

    predict explresid, xb

The predicted values from the regression are saved in order to calculate the uncentred R-squared.

    matrix accum rssmat = explresid, noconstant
    matrix accum tssmat = ivresid, noconstant

There's probably a neater way to do this, but I used these matrix accum commands with a noconstant option in order to compute two scalars, rssmat (which is the sum of squares of explresid) and tssmat (which is the sum of squares of ivresid).

    scalar nobs=e(N)

e(N) is the sample size, which was automatically stored by the preceding regression. This command stores that value in a scalar variable nobs.

    scalar x2=nobs*rssmat[1,1]/tssmat[1,1]

This command computes the overidentification test statistic, called x2, as the sample size times the uncentred R-squared.

    scalar pval=1-chi2(1,x2)

This command computes the P-value using the Stata function chi2(n,x), which computes the area to the left of x under a chi-square distribution with n d.f. One degree of freedom is used because there are two instruments and one endogenous regressor, leaving one overidentifying restriction.

    scalar list x2 pval

This prints out the values of x2 and pval.
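A possibly neater route to the same statistic, offered as a sketch and not as part of the original do file: because the wage equation and the instrument set both include a constant, the 2SLS residuals have mean zero, so the uncentred R-squared of the auxiliary regression should coincide with the ordinary R-squared that regress stores in e(r2). The scalar names x2alt and pvalalt are hypothetical.

```stata
* Not in the original do file; x2alt and pvalalt are hypothetical names
quietly reg ivresid age married smsa nearc2 nearc4
* Sample size times the R-squared of the auxiliary regression
scalar x2alt = e(N)*e(r2)
* chi2tail(n,x) is the upper-tail area, i.e. 1-chi2(n,x)
scalar pvalalt = chi2tail(1, x2alt)
scalar list x2alt pvalalt
```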
    log close

Now the log file:
    . use "C:\Documents and Settings\courses\761 and 762\w07\2SLS\2SLSeg.dta"

    . hausman ivreg ., constant sigmamore df(1)

    Note: the rank of the differenced variance matrix (1) does not equal the
          number of coefficients being tested (5); be sure this is what you
          expect, or there may be problems computing the test. Examine the
          output of your estimators for anything unexpected and possibly
          consider scaling your variables so that the coefficients are on a
          similar scale.

                     ---- Coefficients ----
                 |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
                 |     ivreg          .          Difference          S.E.
    -------------+----------------------------------------------------------------
            educ |    .1386543     .0485886        .0900657        .0290232
             age |    .0366522     .0364856        .0001666        .0000537
         married |    .1937981     .1759239        .0178742        .0057599
            smsa |    .0976942     .1962286       -.0985344        .0317522
           _cons |    3.184304     4.326357       -1.142053        .3680211
    ------------------------------------------------------------------------------
                               b = consistent under Ho and Ha; obtained from ivreg
                B = inconsistent under Ha, efficient under Ho; obtained from regress

        Test:  Ho:  difference in coefficients not systematic

                      chi2(1) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                              =        9.63
                    Prob>chi2 =      0.0019
                    (V_b-V_B is not positive definite)

    .
    . ** Wu version of Hausman test **
    . quietly reg educ age married smsa nearc2 nearc4

    .
    . ** overidentification test **
    . quietly reg ivresid age married smsa nearc2 nearc4

    . predict explresid, xb

    . matrix accum rssmat = explresid, noconstant
    (obs=2932)

    . matrix accum tssmat = ivresid, noconstant
    (obs=2932)

    . scalar nobs=e(N)

    . scalar x2=nobs*rssmat[1,1]/tssmat[1,1]

    . scalar pval=1-chi2(1,x2)

    . scalar list x2 pval
            x2 =  5.9600396
          pval =  .01463371

    .
    . log close
           log:  C:\Documents and Settings\courses\761 and 762\w07\2SLS\2SLS.log
      log type:  text
     closed on:  13 Mar 2007, 16:28:25
    ------------------------------------------------------------------------------
The two Hausman tests give essentially identical information. The general version is in chi-square form and equals 9.63, while the Wu version is a t-statistic, t = 3.11, which is approximately the square root of 9.63 (about 3.10). They have the same P-value of about .002, indicating rejection of the consistency of OLS and providing support for using 2SLS.
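Because the differenced variance matrix in the hausman output has rank one, the chi-square statistic appears to reduce to the squared ratio of the difference in the educ coefficients to its standard error. A quick check, using only figures copied from the hausman output in the log, should come out close to the reported 9.63 and 0.0019:

```stata
* Figures copied from the hausman output in the log
* Squared ratio of the educ difference (b-B) to its S.E.
display (.0900657/.0290232)^2
* Upper-tail area of a chi-square with 1 d.f. at the reported statistic
display chi2tail(1, 9.63)
```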
The overidentification test has a P-value of about .015, which is significant at 5% but not at 1%. So at the 5% level we would reject the hypothesis that the instrumental variables nearc2 and nearc4 are both exogenous. If no other instrumental variables are available, it is hard to know what to do about this. We could drop one of the two instruments, but we would not know whether that solves the problem, because we would then have no overidentifying restrictions left to test.
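In later versions of Stata, comparable tests are available as post-estimation commands. The following is a sketch under the assumption that Stata 10 or later is installed, where ivregress replaces ivreg; it is not part of the original do file, and the statistics it reports may differ slightly from the hand-computed ones because of differing degrees-of-freedom conventions.

```stata
* Sketch only: assumes Stata 10 or later
ivregress 2sls lwage age married smsa (educ = nearc2 nearc4)
* Durbin and Wu-Hausman tests of the exogeneity of educ
estat endogenous
* Sargan and Basmann tests of the overidentifying restriction
estat overid
```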