HW 2 Solutions
STAT 410
September 30, 2016
1)
Textbook 2.4. For part (a) also show that $\hat{\beta}$ is a linear estimator in the $y$ data, being sure to identify the weights,
$c_i$. Add the following part (c) to this exercise: (c) Find an estimator for $\sigma^2$. Is it unbiased?
A)
To get the least squares estimator we want to minimize:
$$\sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \beta x_i)^2 = \sum_{i=1}^n y_i^2 - 2\beta\sum_{i=1}^n x_i y_i + \beta^2 \sum_{i=1}^n x_i^2$$
with respect to $\beta$. If we take the derivative of this with respect to $\beta$ and set the resulting equation equal to 0, we can
solve for the $\beta$ that minimizes the least squares error. Taking the derivative with respect to $\beta$, we obtain:
$$0 = -2\sum_{i=1}^n x_i y_i + 2\beta\sum_{i=1}^n x_i^2 \quad\Longrightarrow\quad \hat{\beta} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}$$
which is the desired least squares estimator from the textbook. We have that $\hat{\beta}$ is a linear estimator in the $y$
data because
$$\hat{\beta} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} = \sum_{i=1}^n c_i y_i$$
where
$$c_i = \frac{x_i}{\sum_{j=1}^n x_j^2}$$
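This closed form is easy to sanity-check numerically. The sketch below is not part of the original solutions (and is in Python rather than the R used later in this document; all names are illustrative): it simulates data from the no-intercept model and confirms that $\sum x_i y_i / \sum x_i^2$ minimizes the squared-error criterion.

```python
import random

# Illustrative check (not from the solutions): simulate y_i = beta*x_i + eps_i
# and compare the closed-form estimator against the squared-error criterion.
random.seed(410)
beta = 2.5
x = [random.uniform(1, 10) for _ in range(200)]
y = [beta * xi + random.gauss(0, 1) for xi in x]

# Closed-form least squares estimator for regression through the origin
beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def sse(b):
    """Sum of squared errors for slope b."""
    return sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))

# beta_hat should minimize the criterion and sit near the true slope
assert sse(beta_hat) <= sse(beta_hat + 0.01)
assert sse(beta_hat) <= sse(beta_hat - 0.01)
assert abs(beta_hat - beta) < 0.2
```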
B)
i.
Show that $E(\hat{\beta}\mid X) = \beta$.
Solution
$$E(\hat{\beta}\mid X) = E\!\left(\frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}\,\middle|\,X\right) = \frac{\sum_{i=1}^n x_i E(y_i)}{\sum_{i=1}^n x_i^2} = \frac{\sum_{i=1}^n x_i(\beta x_i)}{\sum_{i=1}^n x_i^2} = \beta$$
ii.
https://owlspace-ccm.rice.edu/access/content/group/STAT-410-001-F16/hw_sol/s410_f16_hw2_sol.html 1/24
9/28/2017 HW 2 Solutions
Find $Var(\hat{\beta}\mid X)$.
Solution
$$Var(\hat{\beta}\mid X) = Var\!\left(\frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}\right) = \frac{\sum_{i=1}^n x_i^2\,Var(y_i)}{\left(\sum_{i=1}^n x_i^2\right)^2} = \frac{\sigma^2 \sum_{i=1}^n x_i^2}{\left(\sum_{i=1}^n x_i^2\right)^2} = \frac{\sigma^2}{\sum_{i=1}^n x_i^2}$$
iii.
Show that $\hat{\beta}\mid X \sim N\!\left(\beta,\ \sigma^2\big/\sum_{i=1}^n x_i^2\right)$.
Solution
We already showed the mean and variance of $\hat{\beta}\mid X$ above, so all we need to show is that it is normally
distributed. Now,
$$\hat{\beta} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} = \frac{\sum_{i=1}^n x_i(\beta x_i + \epsilon_i)}{\sum_{i=1}^n x_i^2} = \beta + \frac{\sum_{i=1}^n x_i \epsilon_i}{\sum_{i=1}^n x_i^2}$$
and since we know that $\epsilon_i \sim N(0, \sigma^2)$, the above is a linear combination of normally distributed random
variables and is therefore itself normally distributed.
C)
We could use the usual MSE but with one more degree of freedom (since we don't have an intercept):
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^n (y_i - \hat{\beta} x_i)^2}{n-1} = \frac{\sum_{i=1}^n \left(y_i^2 - 2 y_i \hat{\beta} x_i + \hat{\beta}^2 x_i^2\right)}{n-1}$$
Since $(n-1)\hat{\sigma}^2/\sigma^2 \sim \chi^2_{n-1}$,
$$E(\hat{\sigma}^2) = \frac{\sigma^2}{n-1}\,E\!\left(\frac{(n-1)\hat{\sigma}^2}{\sigma^2}\right) = \frac{\sigma^2}{n-1}(n-1) = \sigma^2$$
so $\hat{\sigma}^2$ is unbiased.
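As a numerical illustration (a Python sketch, not part of the original solution; the parameter values are made up), a small Monte Carlo run shows the divisor $n - 1$ giving an average $\hat{\sigma}^2$ close to the true $\sigma^2$:

```python
import random

# Illustrative Monte Carlo check: in the no-intercept model, the average of
# sigma2_hat = SSE/(n-1) over many simulated datasets should be close to
# the true sigma^2. All names and values here are illustrative.
random.seed(1)
beta, sigma2, n = 2.0, 4.0, 30
x = [float(i) for i in range(1, n + 1)]

estimates = []
for _ in range(2000):
    y = [beta * xi + random.gauss(0, sigma2 ** 0.5) for xi in x]
    bhat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    sse = sum((yi - bhat * xi) ** 2 for xi, yi in zip(x, y))
    estimates.append(sse / (n - 1))

mean_est = sum(estimates) / len(estimates)
assert abs(mean_est - sigma2) < 0.3  # close to the true sigma^2 = 4
```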
2)
Consider the SLR model
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
for $i = 1, \ldots, n$, where the $\epsilon_i \sim N(0, \sigma^2)$ are i.i.d. In class we considered an indicator variable defined by 0 for
male and 1 for female. This problem continues that discussion. Assume there are k males and m females.
A)
Using the parameterization $x_i \in \{0, 1\}$ as in class, derive an expression for $Var(\hat{\beta}_1)$. Recall that in general
$Var(\hat{\beta}_1) = \sigma^2 / S_{XX}$.
Solution
First note n = k + m .
Since $x_i \in \{0, 1\}$ for all $i$ (with the indicator $I[x_i = 1] = 1$ if person $i$ is a female and 0 otherwise), we have:
$$S_{XX} = \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - n\bar{x}^2 = \sum_{i=1}^n x_i^2 - n\left(\frac{m}{n}\right)^2 = m - \frac{m^2}{n} = \frac{m(n-m)}{n} = \frac{mk}{k+m}$$
so we have that:
$$Var(\hat{\beta}_1) = \frac{\sigma^2}{S_{XX}} = \frac{\sigma^2(k+m)}{mk}$$
B)
Using the parameterization where a female is coded by +1 and a male by −1, derive an expression for each $\hat{\beta}_j$
and interpret the regression coefficient estimators. Derive an expression for $Var(\hat{\beta}_1)$.
Solution
$$S_{XX} = \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - n\bar{x}^2 = n - n\left(\frac{m-k}{n}\right)^2 = (m+k) - \frac{(m-k)^2}{m+k}$$
$$= \frac{(m+k)^2 - (m-k)^2}{m+k} = \frac{m^2 + 2mk + k^2 - m^2 + 2mk - k^2}{m+k} = \frac{4mk}{m+k}$$
so we have that:
$$Var(\hat{\beta}_1) = \frac{\sigma^2}{S_{XX}} = \frac{\sigma^2(m+k)}{4mk}$$
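Both $S_{XX}$ formulas can be verified directly for concrete group sizes (an illustrative Python sketch, not part of the original solutions; the counts are arbitrary):

```python
# Illustrative check: with k males and m females, compare S_XX under the
# 0/1 coding and the -1/+1 coding against the derived closed forms.
k, m = 12, 7

x01 = [0] * k + [1] * m    # 0 = male, 1 = female
xpm = [-1] * k + [1] * m   # -1 = male, +1 = female

def sxx(x):
    """Corrected sum of squares of x."""
    xbar = sum(x) / len(x)
    return sum((xi - xbar) ** 2 for xi in x)

assert abs(sxx(x01) - m * k / (m + k)) < 1e-9      # mk/(k+m)
assert abs(sxx(xpm) - 4 * m * k / (m + k)) < 1e-9  # 4mk/(m+k)
```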
To derive the estimators, write out the normal equations and substitute $\sum_i x_i = m - k$ and $\sum_i x_i^2 = m + k = n$:
$$\hat{\beta}_0 \sum_{i=1}^n x_i + \hat{\beta}_1 \sum_{i=1}^n x_i^2 = \sum_{i=1}^n Y_i x_i$$
$$\hat{\beta}_0\, n + \hat{\beta}_1 \sum_{i=1}^n x_i = \sum_{i=1}^n Y_i$$
Writing $\sum_f$ and $\sum_m$ for sums over the $m$ females ($x_i = +1$) and the $k$ males ($x_i = -1$):
$$\hat{\beta}_0 (m - k) + \hat{\beta}_1 (m + k) = \sum_f Y_i - \sum_m Y_i$$
$$\hat{\beta}_0 (m + k) + \hat{\beta}_1 (m - k) = \sum_f Y_i + \sum_m Y_i$$
Adding the two equations:
$$2m\hat{\beta}_0 + 2m\hat{\beta}_1 = 2\sum_f Y_i \quad\Longrightarrow\quad \hat{\beta}_0 + \hat{\beta}_1 = \bar{Y}_f \quad\Longrightarrow\quad \hat{\beta}_0 = \bar{Y}_f - \hat{\beta}_1$$
Substituting into the first equation:
$$(\bar{Y}_f - \hat{\beta}_1)(m - k) + \hat{\beta}_1 (m + k) = \sum_f Y_i - \sum_m Y_i$$
$$\bar{Y}_f (m - k) + \hat{\beta}_1 \big((m + k) - (m - k)\big) = \sum_f Y_i - \sum_m Y_i$$
$$2k\hat{\beta}_1 = \sum_f Y_i - \sum_m Y_i - m\bar{Y}_f + k\bar{Y}_f = k\bar{Y}_f - k\bar{Y}_m$$
$$\hat{\beta}_1 = \frac{\bar{Y}_f - \bar{Y}_m}{2}$$
$$\hat{\beta}_0 = \bar{Y}_f - \frac{\bar{Y}_f - \bar{Y}_m}{2} = \frac{2\bar{Y}_f - \bar{Y}_f + \bar{Y}_m}{2} = \frac{\bar{Y}_f + \bar{Y}_m}{2}$$
So $\hat{\beta}_0$ estimates the average of the two group means and $\hat{\beta}_1$ estimates half the female-minus-male difference in mean salary.
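The same result can be checked numerically by solving the two normal equations for simulated data (an illustrative Python sketch, not part of the original solutions; the group sizes and means are made up):

```python
import random

# Illustrative check: with the +1/-1 coding, solve the two normal equations
# directly and confirm beta1_hat = (Ybar_f - Ybar_m)/2 and
# beta0_hat = (Ybar_f + Ybar_m)/2.
random.seed(2)
k, m = 10, 15
x = [-1] * k + [1] * m  # -1 = male, +1 = female
y = [random.gauss(5000 if xi < 0 else 4500, 300) for xi in x]

n = k + m
sx, sxx = sum(x), sum(xi * xi for xi in x)
sy, sxy = sum(y), sum(xi * yi for xi, yi in zip(x, y))

# Solve: b0*n + b1*sx = sy  and  b0*sx + b1*sxx = sxy
det = n * sxx - sx * sx
b0 = (sy * sxx - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det

ybar_m = sum(y[:k]) / k  # male mean (x = -1)
ybar_f = sum(y[k:]) / m  # female mean (x = +1)
assert abs(b1 - (ybar_f - ybar_m) / 2) < 1e-6
assert abs(b0 - (ybar_f + ybar_m) / 2) < 1e-6
```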
C)
The dataset, sexsalary.csv, was collected as part of a legal case on sex discrimination at a certain bank. For
now we are interested in the association between sex (sex) and base salary (bsal) when an employee was
hired. Use each of the above parameterizations for sex to fit a SLR model of base salary on sex. For each fit
interpret the regression coefficient estimates and determine whether there is statistical evidence for sex
differences in base salary. Do the two fits lead to the same conclusion? The dataset, sexsalary.csv, is attached
to this assignment
Solution
For the first model, the results of the regression are shown below:
##
## Call:
## lm(formula = Data1$bsal ~ sex1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1336.88 -338.85 43.12 261.15 2143.12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5956.9 105.3 56.580 < 2e-16 ***
## sex1 -818.0 130.0 -6.293 1.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 595.6 on 91 degrees of freedom
## Multiple R-squared: 0.3032, Adjusted R-squared: 0.2955
## F-statistic: 39.6 on 1 and 91 DF, p-value: 1.076e-08
We see that the model provides a good fit, producing an F statistic of 39.6 on 1 and 91 degrees of freedom with an
associated p-value of approximately 0. We also see that the t statistic for the sex coefficient is −6.293,
producing a p-value of approximately 0 for testing $H_0: \beta_1 = 0$. The intercept is also significantly different from
0. This indicates that women make about 818 less (based on these data) in base salary than their male counterparts.
This is reflected in the scatterplot with the regression line superimposed. Clearly males on average make more
than females for this data in base salary but the data are spread out greatly around the regression line, so this
trend has lots of variation.
These two methods produce the exact same interpretive results! The only differences are that the coefficient values
and standard errors change.
For the Second model, the results of the regression are shown below:
##
## Call:
## lm(formula = Data1$bsal ~ sex2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1336.88 -338.85 43.12 261.15 2143.12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5548 65 85.353 < 2e-16 ***
## sex2 -409 65 -6.293 1.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 595.6 on 91 degrees of freedom
## Multiple R-squared: 0.3032, Adjusted R-squared: 0.2955
## F-statistic: 39.6 on 1 and 91 DF, p-value: 1.076e-08
We see that the model provides a good fit, producing an F statistic of 39.6 on 1 and 91 degrees of freedom with an
associated p-value of approximately 0. We also see that the t statistic for the sex coefficient is −6.293,
producing a p-value of approximately 0 for testing $H_0: \beta_1 = 0$. The intercept is also significantly different from
0. This indicates that women make about $2 \times 409 = 818$ less (based on these data) in base salary than their male counterparts.
This is reflected in the scatterplot with the regression line superimposed. Clearly males on average make more
than females for this data in base salary but the data are spread out greatly around the regression line, so this
trend has lots of variation.
Under this coding the fitted difference between women and men is $\hat{\beta}_1(1) - \hat{\beta}_1(-1) = 2\hat{\beta}_1$, so women make
about $409 + 409 = 818$ less (based on these data) in base salary than their male counterparts. However, we are
cautious because of the low adjusted $R^2 = 0.2955$.
Interpretation: the estimated difference between men and women is the same under either coding. However, because we now have
men and women coded as +1 and −1, the estimate of $\beta_1$ has changed: it is now half of the
estimate of $\beta_1$ under the 0/1 coding (−409 now instead of −818). Likewise, the intercept can no
longer be interpreted as the average salary of the sex coded 0 (males); it is now the average of the two group means.
$\hat{\beta}_1$ evaluated at $x = 1$ is no longer the full decrease in average salary for a female. As mentioned before, to
get this decrease you have to take the difference between $\hat{y}$ at $x = 1$ and $\hat{y}$ at $x = -1$ (still 818).
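This equivalence can be demonstrated on made-up data (an illustrative Python sketch, not part of the original solutions): fit the same responses under both codings and confirm that the fitted female-minus-male difference is identical while the slope halves.

```python
import random

# Illustrative check: fit SLR under the 0/1 and -1/+1 codings and compare.
random.seed(3)
k, m = 8, 11
y = [random.gauss(5600, 600) for _ in range(k + m)]

def slr(x, y):
    """Ordinary least squares slope and intercept for simple regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
        / sum((xi - xbar) ** 2 for xi in x)
    return ybar - b1 * xbar, b1

x01 = [0] * k + [1] * m    # 0 = male, 1 = female
xpm = [-1] * k + [1] * m   # -1 = male, +1 = female

b0a, b1a = slr(x01, y)
b0b, b1b = slr(xpm, y)

# Fitted female-minus-male difference is identical; the slope is halved
diff_a = (b0a + b1a * 1) - (b0a + b1a * 0)
diff_b = (b0b + b1b * 1) - (b0b + b1b * (-1))
assert abs(diff_a - diff_b) < 1e-8
assert abs(b1b - b1a / 2) < 1e-8
```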
3)
Consider the restaurant data (attached as nyc.csv) in chapter 1 of the textbook, section 1.2.3.
A)
Fit a SLR model of price on service. Report (a) a plot of the data with the fitted regression line and a 95%
confidence interval for the SLR model; (b) point estimates of the regression coefficients and their SEs; (c)
hypothesis tests of the regression coefficients; (d) confidence intervals for the regression coefficients with
interpretation; (e) the regression ANOVA table; (f) $R^2$.
names(Dat)
m1 = lm(Dat$Price~Dat$Service)
summary(m1)
##
## Call:
## lm(formula = Dat$Price ~ Dat$Service)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6646 -4.7540 -0.2093 4.3368 26.2460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.9778 5.1093 -2.344 0.0202 *
## Dat$Service 2.8184 0.2618 10.764 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.153 on 166 degrees of freedom
## Multiple R-squared: 0.4111, Adjusted R-squared: 0.4075
## F-statistic: 115.9 on 1 and 166 DF, p-value: < 2.2e-16
names(m1)
We see above that the point estimate for the Service coefficient $\hat{\beta}_1$ is 2.818 with standard error 0.262. The t test
for $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ produces a t-value of $2.818/0.262 = 10.76$ on 166 degrees of freedom.
This test produces a p-value of $p < 2\mathrm{e}{-16}$, indicating that better customer service is associated with higher prices. The
ANOVA table is shown in the output above.
This produces a p-value of approximately 0 for testing whether there is a linear relationship between Price and
Service, that is, $H_0: \beta_1 = 0$. So it appears that there is a relationship.
The SSE, MSE and SSR and MSR, R2 as well as the F statistic are calculated below:
SSE=sum(m1$residual^2)
MSE=sum(m1$residual^2)/m1$df.residual
SSR=sum((m1$fitted.values-mean(Dat$Price))^2)
Fstat=sum((m1$fitted.values-mean(Dat$Price))^2)/(sum(m1$residual^2)/m1$df.residual)
SSE
## [1] 8493.398
MSE
## [1] 51.16505
SSR
## [1] 5928.12
Fstat
## [1] 115.8627
Rsquared= SSR/(SSE+SSR)
Rsquared
## [1] 0.4110608
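These sums of squares fit together as expected; a quick arithmetic check using the printed values (a Python sketch, not part of the original R solution):

```python
# Arithmetic check using the sums of squares printed above
# (values copied from the R output; 166 is the residual df).
SSE = 8493.398
SSR = 5928.12

MSE = SSE / 166         # mean squared error
F = SSR / MSE           # MSR = SSR since the model df is 1
R2 = SSR / (SSE + SSR)  # R-squared

assert abs(MSE - 51.165) < 0.001
assert abs(F - 115.863) < 0.01
assert abs(R2 - 0.4110608) < 1e-4
```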
The plot of the fitted regression line is shown below along with a 95% confidence interval.
x=Dat$Service
y=Dat$Price
## Point of averages
(meanx <- mean(x))
## [1] 19.39881
(meany <- mean(y))
## [1] 42.69643
## Number of samples
n <- length(x)
## SXY <- sum((x-meanx)*(y-meany))
## [1] 2103.339
## SXX <- sum(((x-meanx)^2))
## [1] 746.2798
## beta1hat <- SXY/SXX
## [1] 2.818433
## beta0hat <- meany - beta1hat*meanx
## [1] -11.97781
## [1] 0.6373239
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6646 -4.7540 -0.2093 4.3368 26.2460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.9778 5.1093 -2.344 0.0202 *
## x 2.8184 0.2618 10.764 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.153 on 166 degrees of freedom
## Multiple R-squared: 0.4111, Adjusted R-squared: 0.4075
## F-statistic: 115.9 on 1 and 166 DF, p-value: < 2.2e-16
## Fitted values
(yhat <- beta0hat + beta1hat*x)
## Residuals
(ehat <- y - yhat)
## SSR = 5928.12, SSE = 8493.398, MSE = 51.16505
## Residual standard error = 7.152975 on 166 degrees of freedom
Finally, the 95% confidence intervals for $\beta_0$ and $\beta_1$ are calculated below:
CIB0 = c(m1$coeff[1]-5.1093*1.96,m1$coeff[1]+5.1093*1.96)
CIB1 = c(m1$coeff[2]-0.2618*1.96,m1$coeff[2]+0.2618*1.96)
CIB0
## (Intercept) (Intercept)
## -21.992039 -1.963583
CIB1
## Dat$Service Dat$Service
## 2.305305 3.331561
The values of $\widehat{SE}(\hat{\beta}_0)$ and $\widehat{SE}(\hat{\beta}_1)$ were obtained from the regression summary, but could also have been
obtained directly. We see that the 95% confidence interval for $\beta_1$ lies entirely above 0, corroborating the results
from the hypothesis test (that is, $\beta_1 \neq 0$) and indicating that better restaurant service is associated with higher prices. In
this case, the confidence interval for $\beta_0$ does not contain 0, corroborating the hypothesis test that rejected $H_0: \beta_0 = 0$.
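One small caveat: the intervals above use the normal multiplier 1.96, whereas the exact $t_{0.975}$ quantile with 166 degrees of freedom is about 1.974, which would widen each interval slightly without changing any conclusion. A Python sketch (not part of the original R solution) reproducing the normal-approximation intervals from the printed estimates and SEs:

```python
# Reproduce the normal-approximation 95% intervals computed in R above.
# Estimates and SEs are copied from the regression summary.
est0, se0 = -11.9778, 5.1093   # intercept
est1, se1 = 2.8184, 0.2618     # Service slope

ci0 = (est0 - 1.96 * se0, est0 + 1.96 * se0)
ci1 = (est1 - 1.96 * se1, est1 + 1.96 * se1)

assert abs(ci0[0] - (-21.992)) < 0.01 and abs(ci0[1] - (-1.964)) < 0.01
assert abs(ci1[0] - 2.3053) < 0.001 and abs(ci1[1] - 3.3316) < 0.001
```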
B)
Repeat the above for a SLR model of price on decor.
m1 = lm(Dat$Price~Dat$Decor)
summary(m1)
##
## Call:
## lm(formula = Dat$Price ~ Dat$Decor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9578 -4.4862 -0.4673 4.0422 18.5138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.362 3.292 -0.414 0.68
## Dat$Decor 2.490 0.184 13.537 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.426 on 166 degrees of freedom
## Multiple R-squared: 0.5247, Adjusted R-squared: 0.5218
## F-statistic: 183.2 on 1 and 166 DF, p-value: < 2.2e-16
We see above that the point estimate for the Decor coefficient $\hat{\beta}_1$ is 2.491 with standard error 0.184. The t test
for $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ produces a t-value of $2.491/0.184 = 13.54$ on 166 degrees of freedom.
This test produces a p-value of $p < 2\mathrm{e}{-16}$, indicating that better decor is associated with higher prices. The ANOVA
table is shown in the output above.
This produces a p-value of approximately 0 for testing whether there is a linear relationship between Price and
Decor, that is, $H_0: \beta_1 = 0$. So it appears that there is a relationship.
The SSE, MSE and SSR, MSR, R2 as well as the F statistic are calculated below:
SSE=sum(m1$residual^2)
MSE=sum(m1$residual^2)/m1$df.residual
SSR=sum((m1$fitted.values-mean(Dat$Price))^2)
Fstat=sum((m1$fitted.values-mean(Dat$Price))^2)/(sum(m1$residual^2)/m1$df.residual)
SSE
## [1] 6854.742
MSE
## [1] 41.29363
SSR
## [1] 7566.776
Fstat
## [1] 183.2432
Rsquared= SSR/(SSE+SSR)
Rsquared
## [1] 0.5246865
The regression line and the 95% confidence interval are shown below.
x=Dat$Decor
y=Dat$Price
## Point of averages
(meanx <- mean(x))
## [1] 17.69048
(meany <- mean(y))
## [1] 42.69643
## Number of samples
n <- length(x)
## SXY <- sum((x-meanx)*(y-meany))
## [1] 3038.214
## SXX <- sum(((x-meanx)^2))
## [1] 1219.905
## beta1hat <- SXY/SXX
## [1] 2.490534
## beta0hat <- meany - beta1hat*meanx
## [1] -1.362304
## [1] 0.7200409
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9578 -4.4862 -0.4673 4.0422 18.5138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.362 3.292 -0.414 0.68
## x 2.490 0.184 13.537 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.426 on 166 degrees of freedom
## Multiple R-squared: 0.5247, Adjusted R-squared: 0.5218
## F-statistic: 183.2 on 1 and 166 DF, p-value: < 2.2e-16
## Fitted values
(yhat <- beta0hat + beta1hat*x)
## Residuals
(ehat <- y - yhat)
## SSR = 7566.776, SSE = 6854.742, MSE = 41.29363
## Residual standard error = 6.426012 on 166 degrees of freedom
Finally, the 95% confidence intervals for $\beta_0$ and $\beta_1$ are calculated below:
CIB0 = c(m1$coeff[1]-3.292*1.96,m1$coeff[1]+3.292*1.96)
CIB1 = c(m1$coeff[2]-0.184*1.96,m1$coeff[2]+0.184*1.96)
CIB0
## (Intercept) (Intercept)
## -7.814624 5.090016
CIB1
## Dat$Decor Dat$Decor
## 2.129894 2.851174
The values of $\widehat{SE}(\hat{\beta}_0)$ and $\widehat{SE}(\hat{\beta}_1)$ were obtained from the regression summary, but could also have been
obtained directly. We see that the 95% confidence interval for $\beta_1$ lies entirely above 0, corroborating the results
from the hypothesis test (that is, $\beta_1 \neq 0$) and indicating that better restaurant decor is associated with higher prices. In
this case, the confidence interval for $\beta_0$ contains 0, corroborating the hypothesis test that failed to reject
$H_0: \beta_0 = 0$.
C)
Which of the two predictor variables, service or decor, do you think better predicts price? Explain.
The F statistic and $R^2$ values are higher for the model using decor as the explanatory variable, indicating that for
this data set decor is a better predictor of price than service.