
STATA SESSION

Testing results

1. Commands for testing coefficients and models, using the example regression: reg wage educ educ2

   a. test _b[educ] = 0 : tests whether the coefficient on educ is zero
   b. local sign_educ = sign(_b[educ])
   c. for a one-tailed test use the following (a worked sketch follows the Models exercise below):
      i.  display "Ho: coef <= 0  p-value = " ttail(r(df_r),`sign_educ'*sqrt(r(F)))
      ii. display "Ho: coef > 0   p-value = " 1-ttail(r(df_r),`sign_educ'*sqrt(r(F)))
   d. for the Jarque-Bera test do the following:
      i.    predict uhat, resid
      ii.   summarize uhat, detail
      iii.  calculate the Jarque-Bera statistic, scalar jb = (formula), as follows:
      iv.   scalar jb = (r(N)/6)*((r(skewness)^2) + ((r(kurtosis)-3)^2)/4)
      v.    display "Jarque-Bera = " jb
      vi.   scalar pvalue = chi2tail(2,jb)
      vii.  display "Jarque-Bera p-value = " pvalue
      viii. Note: the Jarque-Bera statistic follows a chi^2 with 2 df, so we can also check it against the chi^2 tables.
      ix.   Alternatively, use sktest uhat
   e. After doing any regression, the next step is to check whether the model is good enough: predict, as we discussed.

Models exercise

The models that you need to work on:

   reg lwage educ exper expersq educ2 female nonwhite
   test educ = 0
   test educ = educ2
   qui reg lwage educ exper female nonwhite
   test educ = 0.05
   qui reg lwage educ exper tenure female nonwhite
   test tenure = exper
   test educ exper

Wait: are we getting the fact that educ becomes significant if we drop educ2 and include the other variables?
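Putting steps a-c together, a minimal end-to-end sketch using the example regression above (all names come from the session data):

   reg wage educ educ2
   test _b[educ] = 0                      // stores r(F) and r(df_r)
   local sign_educ = sign(_b[educ])
   display "Ho: coef <= 0  p-value = " ttail(r(df_r),`sign_educ'*sqrt(r(F)))
   display "Ho: coef > 0   p-value = " 1-ttail(r(df_r),`sign_educ'*sqrt(r(F)))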

   predict lwhat
   predict lresid, residual
   graph twoway scatter lwhat lwage
   twoway (scatter lwhat lwage) (lfit lwhat lwage)
   xi: reg lwage educ exper expersq educ2 nonwhite female i.area

Heteroscedasticity

Suppose the model you want to test is:

   xi: reg lwage educ exper expersq educ2 female nonwhite i.area
   hettest          (or: estat hettest)

This is the Breusch-Pagan/Cook-Weisberg test. The null is constant variance. What do you see? There is heteroscedasticity here. (There are many, many other tests: explore the Stata options.)

   xi: reg lwage educ exper expersq educ2 female nonwhite i.area smsa construc ndurman trcommpu trade services profserv profocc clerocc servocc
   hettest          (or: estat hettest)

There is another test called the White test: estat imtest, white (explore). It seems that the heteroscedasticity issue is getting taken care of.

   xi: reg lwage educ exper expersq educ2 female nonwhite i.area smsa construc ndurman trcommpu trade services profserv profocc clerocc servocc, robust

Using [,robust] does not remove the heteroscedasticity; it makes the standard errors valid in its presence. Note: once you use [,robust], you will not be allowed to use hettest. (Is that not obvious??!)

Other ways: run auxiliary regression equations using the residuals (I would strongly recommend that you do this instead).

Breusch-Pagan test:

   reg y x1 x2 x3 x4
   predict uhat, residual
   gen uhat2 = uhat^2
   reg uhat2 z1 z2 z3 z4

Now calculate LM = n*R2. This LM statistic follows a chi^2 distribution with p-1 degrees of freedom (p-1 is the number of slope coefficients in the auxiliary regression). If the LM statistic is greater than the chi^2 critical value, you reject the null and conclude that there is heteroscedasticity. (A sketch of this calculation in Stata follows the Park test below.)

Glejser test:

   reg y x1 x2 x3 x4
   predict uhat, residual

You now have to generate the absolute value of the residuals:

   gen uhat2 = abs(uhat)
   reg uhat2 z1 z2 z3 z4

Now calculate LM = n*R2, which again follows a chi^2 distribution with p-1 degrees of freedom. To obtain the critical value use: scalar chi = invchi2tail(p-1,0.05). If the LM statistic is greater than chi, you reject the null and conclude that there is heteroscedasticity.

Harvey-Godfrey test:

   reg y x1 x2 x3 x4
   predict uhat, residual
   gen uhat2 = uhat^2
   gen luhat = log(uhat2)
   reg luhat z1 z2 z3 z4

Now calculate LM = n*R2 and compare it with the chi^2(p-1) critical value, exactly as above.

Park test:

   reg y x1 x2 x3 x4
   predict uhat, residual
   gen uhat2 = uhat^2
   gen luhat = log(uhat2)

Additionally, you need to create the log of all independent variables, i.e., lz1, lz2, lz3, etc.

   reg luhat lz1 lz2 lz3 lz4

Now calculate LM = n*R2 and compare with the chi^2(p-1) critical value, as above. If the LM statistic is greater, you reject the null and conclude that there is heteroscedasticity.

Goldfeld-Quandt test: The basic idea here is that under homoscedasticity, the variance in one part of the sample should be the same as the variance in the other part. The Goldfeld-Quandt test can be applied when you can identify a particular X variable to which the variance of the residual is strongly related.

Steps:

- Identify one variable that is closely related to the variance of the disturbance term, and order (rank) the observations on this variable:
     sort x          (use gsort -x for descending order)
- Split the ordered sample into two equal-sized subsamples by omitting c central observations, so that together the two subsamples contain (n-c) observations, i.e., (n-c)/2 each.
- Run an OLS of Y on each subsample and obtain the RSS for each equation. Suppose your sample size is 100 and you want to drop the middle 20 observations; then:
     reg y x in 1/40
     scalar rss1 = e(rmse)^2
     scalar df_rss1 = e(df_r)
     reg y x in 61/100
     scalar rss2 = e(rmse)^2
     scalar df_rss2 = e(df_r)
- Calculate the F-statistic (put the larger RSS in the numerator):
     scalar FGQ = rss2/rss1
- This F-statistic is distributed F((n-c)/2 - k, (n-c)/2 - k), so you can obtain the critical value and the p-value and compare:
     scalar Fcrit = invFtail(df_rss2,df_rss1,0.05)
     scalar pvalue = Ftail(df_rss2,df_rss1,FGQ)
     scalar list FGQ pvalue Fcrit

Now check your result.
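All of the auxiliary-regression tests above finish with the same LM = n*R2 step. A minimal sketch of that step, assuming the auxiliary regression (e.g., reg uhat2 z1 z2 z3 z4) has just been run:

   scalar LM = e(N)*e(r2)                   // n times the R-squared of the auxiliary regression
   scalar pval = chi2tail(e(df_m), LM)      // e(df_m) = number of slope coefficients (p-1)
   scalar crit = invchi2tail(e(df_m), 0.05) // 5% critical value
   scalar list LM pval crit                 // reject homoscedasticity if LM > crit (pval < 0.05)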

White test:

This is the most general test of heteroscedasticity. It is also an LM test, but it has the advantages that (a) it does not require any prior knowledge of the form of the heteroscedasticity, (b) unlike the Breusch-Pagan test, it does not depend on the normality assumption, and (c) it proposes a particular choice for the Z variables in the auxiliary regression equation: the regressors, their squares, and their cross products. Suppose you have a model with only 2 explanatory variables:

   reg y x1 x2
   predict uhat, residual
   gen uhat2 = uhat^2
   reg uhat2 c.x1##c.x2 c.x1#c.x1 c.x2#c.x2

(Stata does not accept x1^2 or x1*x2 directly in a varlist; either use factor-variable notation as above or gen the squared and cross-product terms first.) Now calculate LM = n*R2 and compare it with the chi^2(p-1) critical value, as in the earlier tests. If the LM statistic is greater, you reject the null and conclude that there is heteroscedasticity.

Omitted variable test

   reg lwage educ
   estat ovtest
   xi: reg lwage educ exper expersq educ2 female nonwhite i.area smsa construc ndurman trcommpu trade services profserv profocc clerocc servocc
   estat ovtest

How do we know that we have included all the variables we need to explain the dependent variable? Testing for omitted variable bias is important (in fact key: we usually do end up missing variables), since it is related to the assumption that the error term and the independent variables are uncorrelated (E(e|X) = 0); violating that assumption leads to inconsistent estimates. In ovtest, the null is that there are no omitted variables.

Another command is linktest. linktest basically regresses y on y-hat and y-hat^2. The output table will have _hat and _hatsq. If the p-value on _hatsq is not significant, we fail to reject the null and the model is okay (a manual version is sketched below). Therefore do the following exercise:

   xi: reg lwage educ exper expersq educ2 female nonwhite i.area smsa construc ndurman trcommpu trade services profserv profocc clerocc servocc, robust
   linktest
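For intuition, a minimal sketch of what linktest does under the hood, using an illustrative two-regressor model from the session data:

   quietly reg lwage educ exper
   predict yhat, xb               // the fitted values (_hat in linktest output)
   gen yhatsq = yhat^2            // their square (_hatsq)
   reg lwage yhat yhatsq          // if yhatsq is significant, suspect misspecification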

Multicollinearity

   reg wage educ educ2 female north south west
versus
   xi: reg wage educ educ2 i.female i.area

Why is multicollinearity a problem? There are two kinds. The first is exact multicollinearity. This occurs when, say, an independent variable x2 = 2*x1. If you include both x2 and x1 in your model, Stata will automatically drop one of them before running the regression, so as to prevent division by zero in the OLS estimate (in matrix notation). So this is not a big problem.

Near-exact multicollinearity: this is the serious issue. When it is present, standard errors are inflated and independent variables tend to look insignificant (you may therefore get a high R-squared in your regression output but few significant coefficients).

Use the pwcorr command for eyeballing: it is quite useful. vif is the command to use. It basically regresses each independent variable on the other independent variables. If there is multicollinearity, the R-squared of that auxiliary regression will be high [Var(b_j) = sigma^2/(S_jj*(1 - R_j^2)), where S_jj is the total variation in x_j], which tends to inflate the variance. Rule of thumb: if vif > 10 or 1/vif < 0.1, you are inviting trouble. In the above model you will surely find a high vif between educ and educ2, but we need not worry about that.
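A minimal sketch of these diagnostics on the model above:

   quietly reg wage educ educ2 female north south west
   estat vif                        // VIF and 1/VIF for each regressor (plain vif also works)
   pwcorr educ educ2 female, sig    // pairwise correlations with p-values, for eyeballing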

Serial Correlation

Mainly a time series issue. Basically: do you see a pattern in the errors? If you have tsset your data, then after the regression you may use the dwstat command (estat dwatson in newer versions). Other commands: corrgram, ac, pac, etc. Note that [,robust] corrects the standard errors for heteroscedasticity, not for serial correlation; for the latter, look at options such as newey or clustered standard errors. (A minimal sketch follows the Dummy Variables lines below.)

Dummy Variables

   reg wage educ educ2 female north south west
versus
   xi: reg wage educ educ2 i.female i.area
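As flagged above, a minimal serial-correlation check; the time variable year and the model are illustrative, and the data must be tsset first:

   tsset year
   quietly reg y x
   estat dwatson        // Durbin-Watson statistic (newer syntax for dwstat)
   predict u, resid
   corrgram u           // table of residual autocorrelations
   ac u                 // autocorrelation plot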

Interaction terms

There can be three kinds of interaction terms: between two dummies, between a dummy and a continuous variable, and between two continuous variables. There are several ways in which you can construct an interaction:

- Generate a new variable x3 which is a function of x1 and x2, and then regress y on x1, x2, and x3
- Using # :  reg y x1 x2 c.x1#c.x2
- Using ## : reg y c.x1##c.x2

Prefixing c. means the variable is continuous. Remember: if you want to interact a continuous variable with a dummy variable, do not prefix the dummy variable with c. Inside an interaction, a bare variable is treated as a factor, so for a variable like female you need not write i.female. For a variable like area you can therefore write the regression like:

- reg wage area#c.x2, which is equivalent to reg wage i.area#c.x2
- Depending on your model you may need: reg wage i.area##c.x2
- reg wage area#c.x2 x2
- What if you have the following model: y = f(x1, D1, D2, x1.D1, x1.D2, D1.D2, x1.D1.D2)? (See the sketch below.)
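One way to write that last model compactly, assuming D1 and D2 are 0/1 dummies (names taken from the question above):

   reg y c.x1##i.D1##i.D2

The ## operator expands to all main effects and interactions: x1, D1, D2, x1.D1, x1.D2, D1.D2, and x1.D1.D2, which is exactly the list of terms above.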

Suppose in a model you have simple terms and an interaction term. More specifically, suppose you estimate the following model:

   reg wage c.educ##female

You will find that educ is significant but the female#educ interaction is not. This says something only about the individual coefficients. What about the overall significance of educ? Take the partial derivative with respect to educ. What is the partial? It is a function, specifically a random variable in itself. As a result you need to check overall significance. The way to do it is:

   margins, dydx(educ)

Note that margins, dydx(female) is basically y(female = 1) - y(female = 0). Now suppose the interaction is between two continuous variables, and the model is as simple as reg y x1 x2 x1x2 (where x1x2 = x1*x2); then you can also use:

   lincom x1 + K*x1x2

where K is a chosen value of x2 (the partial of y with respect to x1 is b_x1 + b_x1x2*x2).
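A minimal sketch using the model just named; the second margins call, splitting the effect by female, is an optional extension:

   reg wage c.educ##female
   margins, dydx(educ)              // overall (average) marginal effect of educ
   margins female, dydx(educ)       // marginal effect of educ at female = 0 and female = 1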

Summarizing results in tables

   reg [first model], robust
   eststo model1
   reg [second model], robust
   eststo model2
   reg [third model], robust

   eststo model3
   esttab, r2 ar2 se scalars(rmse)

(Look for other options; eststo and esttab come from the user-written estout package.)

Some more discussion

   a. Nice link: http://www.ats.ucla.edu/stat/stata/
   b. Always use the help command.
   c. We discussed last time that we can use the predict command to check whether a model is good. After doing any regression, the next step is to check whether the model is good enough. We also used predict uhat, resid.
   d. To eyeball whether there is a pattern in the errors, after regressing you can alternatively use rvfplot[, yline(0)]: this plots the residuals against the fitted values.
   e. Obsession with normality of the error terms:
      i.   What is the sample size?
      ii.  Outliers? You remove outliers, but are they of some significance? Can they be removed at all? Or do you need some kind of non-linear transformation?
      iii. Some sense of what the variable is, etc. What if y is count data?
      iv.  Hence, before deciding whether a kdensity plot must be taken seriously, you need to ask yourself these questions.
   f. Be careful of the dummy variable trap! (intercept dummy, slope dummy)
   g. Use avplot to check for outliers in your variables.
   h. Other useful post-regression commands: dfbeta measures the influence of each observation on each coefficient; dfits measures the influence of a particular observation on its fitted value; covratio measures the impact of a particular observation on the standard errors.
   i. Model selection test: likelihood ratio test
      i.   reg lwage educ exper female
      ii.  estimates store A
      iii. reg lwage educ exper
      iv.  estimates store B
      v.   lrtest A (this compares A with the most recent estimates, here B, and automatically assumes B is nested in A). A high chi2, i.e., a low p-value, means you should not have dropped female.
   j. To examine the residuals, useful plots are: kdensity, pnorm, qnorm.
   k. Use the test command to check for joint significance of coefficients (refer to the Models exercise section above).
   l. Note that Stata temporarily stores the coefficients as _b[varname]. You can save coefficients or standard errors using _b[educ] and _se[educ] respectively. This is often useful.
   m. After you run a regression, to obtain the complete list (and full details) of stored results, just type ereturn list.
   n. General guidelines (rules of thumb, if you will):

      i.   A sound theory, the right research question, and deep knowledge of the subject are important. These will give you some sense of whether you are missing variables and whether you are progressing in the right direction.
      ii.  If two independent variables tend to give you identical information, find a way of combining them into one variable.
      iii. Interactions are really useful, especially between variables that have large effects.
      iv.  Remember the following (don't take these as gospel truth, though):
         1. No significance and the sign is off: you may want to drop the variable.
         2. Significance and the sign is okay: keep that variable.
         3. No significance and the sign is okay: you may want to keep it.
         4. Significance and the sign is off: well, hmm... add other variables, add stuff, remove stuff, change your model, and see if the sign changes. If not, then review your theory and intuition.
   o. Saving results for documentation: use the command outreg2 (syntax: outreg2 using ...). You can create Word documents. (What about PDF documents? Check. A minimal sketch follows.)
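A minimal outreg2 sketch, assuming the user-written package is installed (ssc install outreg2) and using a hypothetical filename myresults:

   reg lwage educ exper female
   outreg2 using myresults, word replace    // writes myresults.rtf, which opens in Word
   reg lwage educ exper female nonwhite
   outreg2 using myresults, word append     // adds a second column to the same table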
