
Bader Alowais

EC552
Hw5
1- Using Data_Assignment3a, run a linear regression of y on x1, x2, x3, x4, and x5, where y
is annual salary, x1 is GNP, x2 is housing starts, x3 is unemployment rate, x4 is prime rate
lagged 6 months, and x5 is customer line gains:

a. Find the correlation coefficients among the variables and comment on the significance
and the sign of the coefficients.
> cor(D)
x1 x2 x3 x4 x5 y
x1 1.0000000 -0.3496099 0.7265714 0.80989486 -0.67430404 0.21028644
x2 -0.3496099 1.0000000 -0.4633229 -0.74869250 0.60619956 0.49721783
x3 0.7265714 -0.4633229 1.0000000 0.70094493 -0.80885445 -0.26348210
x4 0.8098949 -0.7486925 0.7009449 1.00000000 -0.83081592 -0.05035153
x5 -0.6743040 0.6061996 -0.8088545 -0.83081592 1.00000000 0.01080028
y 0.2102864 0.4972178 -0.2634821 -0.05035153 0.01080028 1.00000000
The correlation coefficient between each pair of variables is shown in the table above. The
coefficients can be interpreted with the usual rule of thumb:
1) Exactly -1: a perfect downhill (-) linear relationship.
2) Around -0.70: a strong downhill (-) linear relationship.
3) Around -0.50: a moderate downhill (-) linear relationship.
4) Around -0.30: a weak downhill (-) linear relationship.
5) 0: no linear relationship.
6) Around +0.30: a weak uphill (+) linear relationship.
7) Around +0.50: a moderate uphill (+) linear relationship.
8) Around +0.70: a strong uphill (+) linear relationship.
9) Exactly +1: a perfect uphill (+) linear relationship.
Applying these guidelines to the table: the strongest correlations are among the regressors
themselves (x4 with x1 at 0.81, x4 with x5 at -0.83, x3 with x5 at -0.81, x3 with x1 at 0.73),
while y is only moderately correlated with x2 (0.50), weakly with x1 (0.21) and x3 (-0.26),
and essentially uncorrelated with x4 and x5.
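To comment on the significance of these pairwise correlations, each one can be tested against
zero with cor.test; a minimal sketch, assuming the same data frame D used above:
# Test each regressor's correlation with y against H0: rho = 0
for (v in c("x1", "x2", "x3", "x4", "x5")) {
  ct <- cor.test(D[[v]], D$y)    # Pearson correlation test
  cat(v, ": r =", round(ct$estimate, 3), ", p-value =", round(ct$p.value, 4), "\n")
}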

b. Run the regression of y on the x variables and fully analyze the regression statistics and
the relationship among the variables:
Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5, data = D)
Residuals:
Min 1Q Median 3Q Max
-878.32 -271.64 -18.25 179.29 1153.74
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5962.6555 2507.7241 2.378 0.03876 *
x1 4.8837 2.5125 1.944 0.08058 .
x2 2.3640 0.8436 2.802 0.01872 *
x3 -819.1287 187.7072 -4.364 0.00141 **
x4 12.0105 147.0496 0.082 0.93652
x5 -851.3927 292.1447 -2.914 0.01545 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 627.6 on 10 degrees of freedom
Multiple R-squared: 0.8228, Adjusted R-squared: 0.7341
F-statistic: 9.284 on 5 and 10 DF, p-value: 0.001615
Regression statistics:
1) Regression equation (p-values in parentheses):
y = 5962.6555 + 4.8837(x1) + 2.3640(x2) - 819.1287(x3) + 12.0105(x4) - 851.3927(x5)
(0.03876) (0.08058) (0.01872) (0.00141) (0.93652) (0.01545)
Adjusted R^2 = 0.7341.
SE/mean(y) = 627.6/1543.125 ≈ 0.4067
2) F-statistic: the p-value (0.001615) is less than 0.05, so the regression is jointly significant;
since this is a multiple regression, we can proceed to the t-values of the individual coefficients.
3) Estimated intercept = 5962.6555: significant at the 95% confidence level, so we keep it in
the regression equation.
4) Estimated β1 = 4.8837: not significant at the 95% confidence level, so we do not keep it in
the regression equation.
5) Estimated β2 = 2.3640: significant at the 95% confidence level, so we keep it in the
regression equation.
6) Estimated β3 = -819.1287: significant at the 95% confidence level, so we keep it in the
regression equation.
7) Estimated β4 = 12.0105: not significant at the 95% confidence level, so we do not keep it in
the regression equation.
8) Estimated β5 = -851.3927: significant at the 95% confidence level, so we keep it in the
regression equation.
9) Goodness of fit: since this is not a simple regression we look at the adjusted
R-squared = 0.7341, which means about 73% of the variation in the dependent variable is
explained by the independent variables.
10) Residual standard error / mean(y) = 627.6/1543.125 ≈ 0.4067; since this is greater than 0.1,
the model is not good for forecasting.
From this regression alone we can see that the intercept and the x2, x3, and x5 coefficients are
significant, and if we assume the regression is well specified, only those would be kept in our
regression line. However, to make sure the regression is valid we have to run some diagnostic
tests and, based on them, make corrections where needed.
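The fit and the forecasting check in item 10 can be reproduced with a short sketch, assuming
the data frame D used above:
reg <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = D)   # regression of part (b)
rse <- summary(reg)$sigma                         # residual standard error (627.6)
rse / mean(D$y)                                   # about 0.41 > 0.1, so weak for forecasting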
c. Do proper corrections for the estimation, if they are needed.
1) E(ε) = 0:
We cannot test this assumption directly since it is theoretical; to help ensure it holds, we
always keep the intercept in the model.
2) E(εx) = 0:
This assumption is usually violated when macroeconomic variables create two-way causality. We
test for this with the Granger causality test, and if two-way causality is present it can be
corrected with instrumental variables, which we do in part (f) of the question.
3) E(εt εt−1) = 0:
We first run the Durbin-Watson test to check for serial correlation; if serial correlation is
present, we apply the Cochrane-Orcutt correction:
dwt(reg)
lag Autocorrelation D-W Statistic p-value
1 -0.3410202 2.484497 0.858
Alternative hypothesis: rho != 0
DW ≈ 2(1 − ρ̂) = 2(1 − (−0.3410202)) = 2.682 > 2, which points to negative serial correlation,
so we correct for it:
reg1= cochrane.orcutt(reg)
summary(reg1)
Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5, data = D)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6266.72436 1835.53672 3.414 0.0076994 **
x1 4.64416 1.83462 2.531 0.0321612 *
x2 2.21098 0.61058 3.621 0.0055624 **
x3 -829.62466 144.16821 -5.755 0.0002747 ***
x4 17.28393 106.94335 0.162 0.8751770
x5 -769.67635 238.64324 -3.225 0.0104020 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 523.841 on 9 degrees of freedom
Multiple R-squared: 0.8993 , Adjusted R-squared: 0.8433
F-statistic: 16.1 on 5 and 9 DF, p-value: < 2.969e-04
Durbin-Watson statistic
(original): 2.48450 , p-value: 4.093e-01
(transformed): 2.75305 , p-value: 5.219e-01
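A minimal sketch of the commands behind this step, assuming dwt comes from the car package and
cochrane.orcutt from the orcutt package:
library(car)      # dwt(), an alias for durbinWatsonTest()
library(orcutt)   # cochrane.orcutt()
reg  <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = D)
dwt(reg)                       # Durbin-Watson test for first-order serial correlation
reg1 <- cochrane.orcutt(reg)   # iterative Cochrane-Orcutt correction
summary(reg1)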
4) E(ε²) = σ²ε (homoscedasticity):
This assumption is typically violated with cross-sectional data. We can test for it with the
Goldfeld-Quandt, White, or Breusch-Pagan test; if heteroscedasticity is found, it can be
corrected by the scaling method. Since we do not know which variable causes it, we use the
Breusch-Pagan test:
bptest(regh)
studentized Breusch-Pagan test
data: regh
BP = 5.986, df = 5, p-value = 0.3076
The p-value (0.3076) is greater than 0.05, so we fail to reject the null hypothesis of
homoscedasticity; the scaling correction is nonetheless applied as follows:
ress= sum(regh$residuals)
> y=y/abs(ress)
>y
[1] 1.796851e+16 2.402328e+16 2.505433e+16 2.293715e+16 2.610986e+16 2.658103e+16
2.224264e+16 1.535874e+16 1.846415e+16
[10] 2.271687e+16 2.875940e+16 2.860642e+16 2.000920e+16 2.348174e+16 2.269851e+16
2.424050e+16
> x1= x1/abs(ress^2)
> x2= x2/abs(ress^2)
> x3= x3/abs(ress^2)
> x4= x4/abs(ress^2)
> x5= x5/abs(ress^2)
Then,
Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5)
Residuals:
Min 1Q Median 3Q Max
-2.687e+15 -8.311e+14 -5.584e+13 5.485e+14 3.530e+15
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.824e+16 7.672e+15 2.378 0.03876 *
x1 4.884e+00 2.513e+00 1.944 0.08058 .
x2 2.364e+00 8.436e-01 2.802 0.01872 *
x3 -8.191e+02 1.877e+02 -4.364 0.00141 **
x4 1.201e+01 1.470e+02 0.082 0.93652
x5 -8.514e+02 2.921e+02 -2.914 0.01545 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.92e+15 on 10 degrees of freedom
Multiple R-squared: 0.8228, Adjusted R-squared: 0.7341
F-statistic: 9.284 on 5 and 10 DF, p-value: 0.001615
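A minimal sketch of how the heteroscedasticity tests mentioned above could be run, assuming the
lmtest package and the fitted model regh from part (b):
library(lmtest)   # bptest(), gqtest()
regh <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = D)
bptest(regh)                                        # Breusch-Pagan, H0: homoscedasticity
gqtest(regh)                                        # Goldfeld-Quandt alternative
bptest(regh, ~ fitted(regh) + I(fitted(regh)^2))    # White-type test on fitted values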
5) No severe multicollinearity:
To check for severe multicollinearity we examine the same correlation matrix and drop the most
highly correlated variable:
x1 x2 x3 x4 x5 y
x1 1.0000000 -0.3496099 0.7265714 0.80989486 -0.67430404 0.21028644
x2 -0.3496099 1.0000000 -0.4633229 -0.74869250 0.60619956 0.49721783
x3 0.7265714 -0.4633229 1.0000000 0.70094493 -0.80885445 -0.26348210
x4 0.8098949 -0.7486925 0.7009449 1.00000000 -0.83081592 -0.05035153
x5 -0.6743040 0.6061996 -0.8088545 -0.83081592 1.00000000 0.01080028
y 0.2102864 0.4972178 -0.2634821 -0.05035153 0.01080028 1.00000000
As the table shows, the variable most highly correlated with y is x2, so we drop x2 and run the
regression again:
Call:
lm(formula = y ~ x1 + x3 + x4 + x5)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8751.9419 2860.4661 3.060 0.0120523 *
x1 7.8423 2.2653 3.462 0.0061039 **
x3 -982.2009 204.3965 -4.805 0.0007178 ***
x4 -210.2729 108.3255 -1.941 0.0809306 .
x5 -727.4130 319.3248 -2.278 0.0459438 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 668.0214 on 10 degrees of freedom
Multiple R-squared: 0.746 , Adjusted R-squared: 0.6444
F-statistic: 7.3 on 4 and 10 DF, p-value: < 4.998e-03
Durbin-Watson statistic
(original): 1.52187 , p-value: 3.353e-02
(transformed): 2.41680 , p-value: 4.919e-01
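As a complementary diagnostic not used above, variance inflation factors (vif from the car
package) can flag severe multicollinearity directly; a minimal sketch:
library(car)   # vif()
vif(lm(y ~ x1 + x2 + x3 + x4 + x5, data = D))   # values well above 10 suggest severe multicollinearity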
6) Stationary variables:
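Stationarity was not tested above; a minimal sketch of an augmented Dickey-Fuller unit-root
check, assuming the tseries package (with only 16 observations the test has little power):
library(tseries)   # adf.test()
adf.test(D$y)      # H0: unit root (non-stationary); repeat for x1 ... x5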

d. What are the expected signs of the coefficients in this model? Are the empirical results on
the sign of the coefficients consistent with your prior expectations?
Since y is annual salary, x1 (GNP) should be (+), x2 (housing starts) should be (+),
x3 (unemployment rate) should be (-), x4 (prime rate) should be (-), and x5 (customer line gains)
should be (-).
Yes, the estimated signs are broadly consistent with these expectations; the only exception is
x4, whose coefficient is positive (but statistically insignificant) in the original regression
of part (b), and becomes negative once x2 is dropped in part (c).
e. Are the estimated coefficients significant at the 95% level of confidence?
In the original regression all coefficients are significant at the 95% level except x1 and x4;
after the Cochrane-Orcutt correction in part (c), only x4 remains insignificant.
f. Suppose you initially run a regression of Y on X1, X2, X3 only and then decide to add X4
and X5. How would you find out if it is worth adding the variables X4 and X5? What test
do you use? Conduct the test.
When we run the regression without x4 and x5 we get:
summary(regf)
Call:
lm(formula = y ~ x1 + x2 + x3, data = D)
Residuals:
Min 1Q Median 3Q Max
-1410.1 -459.1 160.1 412.1 1520.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 195.6043 2308.3167 0.085 0.93387
x1 6.2084 1.9058 3.258 0.00686 **
x2 1.4954 0.6254 2.391 0.03406 *
x3 -469.7181 198.5697 -2.366 0.03569 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 859.3 on 12 degrees of freedom
Multiple R-squared: 0.6013, Adjusted R-squared: 0.5016
F-statistic: 6.031 on 3 and 12 DF, p-value: 0.009557
To find out whether it is worth adding x4, x5, or both, we run Granger tests for two-way
causality; since these are macroeconomic variables, we might expect to find some:
> grangertest(y~x1)
Granger causality test
Model 1: y ~ Lags(y, 1:1) + Lags(x1, 1:1)
Model 2: y ~ Lags(y, 1:1)
Res.Df Df F Pr(>F)
1 12
2 13 -1 0.0196 0.8911
x1 does not Granger-cause y (p = 0.8911).
> grangertest(x1~y)
Granger causality test
Model 1: x1 ~ Lags(x1, 1:1) + Lags(y, 1:1)
Model 2: x1 ~ Lags(x1, 1:1)
Res.Df Df F Pr(>F)
1 12
2 13 -1 0.6547 0.4342
y does not Granger-cause x1 (p = 0.4342).
> grangertest(y~x2)
Granger causality test
Model 1: y ~ Lags(y, 1:1) + Lags(x2, 1:1)
Model 2: y ~ Lags(y, 1:1)
Res.Df Df F Pr(>F)
1 12
2 13 -1 3.2833 0.09507 .
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
x2 Granger-causes y at the 10% level (p = 0.09507).
> grangertest(x2~y)
Granger causality test
Model 1: x2 ~ Lags(x2, 1:1) + Lags(y, 1:1)
Model 2: x2 ~ Lags(x2, 1:1)
Res.Df Df F Pr(>F)
1 12
2 13 -1 2.7557 0.1228
y does not Granger-cause x2 (p = 0.1228).
> grangertest(y~x3)
Granger causality test
Model 1: y ~ Lags(y, 1:1) + Lags(x3, 1:1)
Model 2: y ~ Lags(y, 1:1)
Res.Df Df F Pr(>F)
1 12
2 13 -1 4e-04 0.9838
x3 does not Granger-cause y (p = 0.9838).
> grangertest(x3~y)
Granger causality test
Model 1: x3 ~ Lags(x3, 1:1) + Lags(y, 1:1)
Model 2: x3 ~ Lags(x3, 1:1)
Res.Df Df F Pr(>F)
1 12
2 13 -1 0.5464 0.474
y does not Granger-cause x3 (p = 0.474).
There is some evidence that x2 Granger-causes y (at the 10% level), but since there is no
two-way causality, a correction with instrumental variables should not be necessary.
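As a complementary check, the joint contribution of x4 and x5 can also be assessed with a
partial F-test on the nested models; a minimal sketch, assuming the fitted objects regf
(restricted) and reg (full) from above:
regf <- lm(y ~ x1 + x2 + x3, data = D)              # restricted model
reg  <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = D)    # full model
anova(regf, reg)   # F-test of H0: the coefficients on x4 and x5 are both zero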
