

HW 2 Solutions
STAT 410
September 30, 2016

1)
Textbook 2.4. For part (a) also show that $\hat{\beta}$ is a linear estimator in the $y$ data, being sure to identify the weights $c_i$. Add the following part (c) to this exercise: (c) Find an estimator for $\sigma^2$. Is it unbiased?

Consider the model $Y_i = \beta x_i + \epsilon_i$ where $\epsilon_i \sim N(0, \sigma^2)$.

A)
To get the least squares estimator we want to minimize
$$\sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \beta x_i)^2 = \sum_{i=1}^n y_i^2 - 2\beta\sum_{i=1}^n x_i y_i + \beta^2\sum_{i=1}^n x_i^2$$
with respect to $\beta$. If we take the derivative with respect to $\beta$ and set the resulting equation equal to 0, we can solve for the $\beta$ that minimizes the least squares error. Taking the derivative with respect to $\beta$, we obtain
$$0 = -2\sum_{i=1}^n x_i y_i + 2\hat{\beta}\sum_{i=1}^n x_i^2 \quad\Longrightarrow\quad \hat{\beta} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}$$

which is the desired least squares estimator from the textbook. We have that $\hat{\beta}$ is a linear estimator in the $y$ data because
$$\hat{\beta} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} = \sum_{i=1}^n c_i y_i$$

where
$$c_i = \frac{x_i}{\sum_{j=1}^n x_j^2}$$
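As a quick numeric sanity check (a sketch with simulated data; not part of the original solution), the weighted sum $\sum_i c_i y_i$ reproduces the closed-form estimator:

set.seed(1)
x <- runif(10)
y <- 2*x + rnorm(10)
ci <- x / sum(x^2)                          # the weights c_i identified above
all.equal(sum(ci * y), sum(x*y) / sum(x^2)) # TRUE: the same estimator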

B)
i.
Show that $E(\hat{\beta} \mid X) = \beta$.

Solution
$$E(\hat{\beta} \mid X) = E\left(\frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}\right) = \frac{\sum_{i=1}^n x_i E(y_i)}{\sum_{i=1}^n x_i^2} = \frac{\beta\sum_{i=1}^n x_i^2}{\sum_{i=1}^n x_i^2} = \beta$$

ii.
Show that $\mathrm{Var}(\hat{\beta} \mid x) = \dfrac{\sigma^2}{\sum_{i=1}^n x_i^2}$.

Solution
$$\mathrm{Var}(\hat{\beta} \mid x) = \mathrm{Var}\left(\frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}\right) = \frac{\sum_{i=1}^n x_i^2\,\mathrm{Var}(y_i)}{\left(\sum_{i=1}^n x_i^2\right)^2} = \frac{\sigma^2\sum_{i=1}^n x_i^2}{\left(\sum_{i=1}^n x_i^2\right)^2} = \frac{\sigma^2}{\sum_{i=1}^n x_i^2}$$

iii.
Show that $\hat{\beta} \mid X \sim N\!\left(\beta,\ \dfrac{\sigma^2}{\sum_{i=1}^n x_i^2}\right)$.

Solution

We already showed the mean and variance of $\hat{\beta} \mid X$ above, so all we need to show is that it is normally distributed. Now,
$$\hat{\beta} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} = \frac{\sum_{i=1}^n x_i(\beta x_i + \epsilon_i)}{\sum_{i=1}^n x_i^2} = \beta\frac{\sum_{i=1}^n x_i^2}{\sum_{i=1}^n x_i^2} + \frac{\sum_{i=1}^n x_i \epsilon_i}{\sum_{i=1}^n x_i^2} = \beta + \frac{\sum_{i=1}^n x_i \epsilon_i}{\sum_{i=1}^n x_i^2}$$

and since we know that $\epsilon_i \sim N(0, \sigma^2)$, the above is $\beta$ plus a linear combination of normally distributed random variables and is therefore itself normally distributed.
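As an illustrative check (a simulation sketch with arbitrary choices of $n$, $\beta$, and $\sigma$; not part of the original solution), the empirical sampling distribution of $\hat{\beta}$ matches $N(\beta, \sigma^2/\sum x_i^2)$:

set.seed(410)
n <- 25; beta <- 2; sigma <- 3
x <- runif(n, 1, 10)            # a fixed design, held constant across replications
betahat <- replicate(10000, {
  y <- beta * x + rnorm(n, sd = sigma)
  sum(x * y) / sum(x^2)
})
mean(betahat)                   # approximately beta = 2
var(betahat)                    # approximately the theoretical variance below
sigma^2 / sum(x^2)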

C)
We could use the usual MSE but with one more degree of freedom (since we don't have an intercept):
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^n (y_i - \hat{\beta} x_i)^2}{n-1} = \frac{\sum_{i=1}^n y_i^2 - 2\hat{\beta}\sum_{i=1}^n x_i y_i + \hat{\beta}^2\sum_{i=1}^n x_i^2}{n-1}$$

And we have that
$$\frac{(n-1)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-1}$$
Since $E(\chi^2_{n-1}) = n - 1$,
$$E(\hat{\sigma}^2) = \frac{\sigma^2}{n-1}(n-1) = \sigma^2$$
So this estimator is unbiased!
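A quick simulation sketch (again with arbitrary parameter values; not part of the original solution) supports the unbiasedness claim:

set.seed(2)
n <- 20; beta <- 1.5; sigma <- 2
x <- runif(n, 1, 5)
sigma2hat <- replicate(20000, {
  y <- beta * x + rnorm(n, sd = sigma)
  betahat <- sum(x * y) / sum(x^2)
  sum((y - betahat * x)^2) / (n - 1)  # n - 1 df: only one parameter estimated
})
mean(sigma2hat)                       # approximately sigma^2 = 4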

2)
Consider the SLR model
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
for $i = 1, \ldots, n$, where the $\epsilon_i \sim N(0, \sigma^2)$ are i.i.d. In class we considered an indicator variable defined by 0 for male and 1 for female. This problem continues that discussion. Assume there are $k$ males and $m$ females.

A)


Using the parameterization $x_i \in \{0, 1\}$ as in class, derive an expression for $\mathrm{Var}(\hat{\beta}_1)$. Recall that the general expression for the variance is $\mathrm{Var}(\hat{\beta}_1) = \dfrac{\sigma^2}{S_{XX}}$.

Solution

First note $n = k + m$.

Since $x_i \in \{0, 1\}$ for all $i$ (with the indicator $I[x_i = 1] = 1$ if $x_i = 1$, i.e. person $i$ is female, and 0 otherwise), we have $\sum_{i=1}^n x_i = \sum_{i=1}^n x_i^2 = m$ and $\bar{x} = m/n$, so:

$$S_{XX} = \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - n\bar{x}^2 = m - n\left(\frac{m}{n}\right)^2 = \frac{m(n-m)}{n} = \frac{mk}{k+m}$$

so we have that:

$$\mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{XX}} = \frac{\sigma^2(k+m)}{mk}$$
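As a numeric sketch (simulated data with arbitrary group sizes; not part of the original solution), this formula agrees with the variance lm() reports under the 0/1 coding, once $\sigma^2$ is replaced by its estimate:

set.seed(3)
k <- 30; m <- 20                          # k males (x = 0), m females (x = 1)
x <- c(rep(0, k), rep(1, m))
y <- 5 - 2 * x + rnorm(k + m)
fit <- lm(y ~ x)
vcov(fit)["x", "x"]                       # lm's estimate of Var(beta1hat)
summary(fit)$sigma^2 * (k + m) / (m * k)  # the formula above with sigma2hat plugged in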

B)
Using the parameterization where a female is coded by $+1$ and a male by $-1$, derive an expression for each $\hat{\beta}$ and interpret the regression coefficient estimators. Derive an expression for $\mathrm{Var}(\hat{\beta}_1)$.

Solution

We note that since $x_i^2 = 1$ for all $i$ and $\bar{x} = (m - k)/n$,

$$S_{XX} = \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - n\bar{x}^2 = n - n\left(\frac{m-k}{n}\right)^2 = (m+k) - \frac{(m-k)^2}{m+k} = \frac{m^2 + 2mk + k^2 - (m^2 - 2mk + k^2)}{m+k} = \frac{4mk}{m+k}$$

so we have that:

$$\mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{XX}} = \frac{\sigma^2(m+k)}{4mk}$$

We solve for $\hat{\beta}_1$ and $\hat{\beta}_0$ from the normal equations:

$$\hat{\beta}_0 \sum_{i=1}^n x_i + \hat{\beta}_1 \sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i Y_i
\qquad\text{and}\qquad
\hat{\beta}_0\, n + \hat{\beta}_1 \sum_{i=1}^n x_i = \sum_{i=1}^n Y_i$$

Writing $\sum_f Y_i$ and $\sum_m Y_i$ for the sums over the $m$ females and the $k$ males, and using $\sum_{i=1}^n x_i = m - k$ and $\sum_{i=1}^n x_i^2 = m + k = n$, these become

$$\hat{\beta}_0 (m-k) + \hat{\beta}_1 (m+k) = \sum_f Y_i - \sum_m Y_i$$
$$\hat{\beta}_0 (m+k) + \hat{\beta}_1 (m-k) = \sum_f Y_i + \sum_m Y_i$$

Adding the two equations gives

$$2m\hat{\beta}_0 + 2m\hat{\beta}_1 = 2\sum_f Y_i$$
$$\hat{\beta}_0 + \hat{\beta}_1 = \bar{Y}_f \quad\Longrightarrow\quad \hat{\beta}_0 = \bar{Y}_f - \hat{\beta}_1$$

Substituting into the first equation:

$$(\bar{Y}_f - \hat{\beta}_1)(m-k) + \hat{\beta}_1(m+k) = \sum_f Y_i - \sum_m Y_i$$
$$\bar{Y}_f(m-k) + 2k\hat{\beta}_1 = m\bar{Y}_f - k\bar{Y}_m$$
$$2k\hat{\beta}_1 = k\bar{Y}_f - k\bar{Y}_m$$
$$\hat{\beta}_1 = \frac{\bar{Y}_f - \bar{Y}_m}{2}$$
$$\hat{\beta}_0 = \bar{Y}_f - \frac{\bar{Y}_f - \bar{Y}_m}{2} = \frac{2\bar{Y}_f - \bar{Y}_f + \bar{Y}_m}{2} = \frac{\bar{Y}_f + \bar{Y}_m}{2}$$

Thus $\hat{\beta}_1$ estimates half the difference between the female and male mean responses, and $\hat{\beta}_0$ estimates the average of the two group means.
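A short sketch (simulated data with arbitrary group means; not part of the original solution) confirming these closed forms against lm() under the $\pm 1$ coding:

set.seed(4)
m <- 15; k <- 25
yf <- rnorm(m, mean = 10)             # female responses
ym <- rnorm(k, mean = 12)             # male responses
x <- c(rep(1, m), rep(-1, k))
y <- c(yf, ym)
coef(lm(y ~ x))                       # (beta0hat, beta1hat)
c((mean(yf) + mean(ym)) / 2,          # matches beta0hat
  (mean(yf) - mean(ym)) / 2)          # matches beta1hat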

C)
The dataset, sexsalary.csv, was collected as part of a legal case on sex discrimination at a certain bank. For now we are interested in the association between sex (sex) and base salary (bsal) when an employee was hired. Use each of the above parameterizations for sex to fit an SLR model of base salary on sex. For each fit, interpret the regression coefficient estimates and determine whether there is statistical evidence for sex differences in base salary. Do the two fits lead to the same conclusion? The dataset, sexsalary.csv, is attached to this assignment.

Solution

For the first model, the results of the regression are shown below:


Data1 <- read.csv("sexsalary.csv", as.is = TRUE, header = TRUE)

#### Creating new variable where female = 1, male = 0
sex1 = rep(0, length(Data1$sex))
for (i in 1:length(Data1$sex)) {
  if (Data1$sex[i] == "Female") {
    sex1[i] = 1
  } else {
    sex1[i] = 0
  }
}
## linear model
m2 = lm(Data1$bsal ~ sex1)
summary(m2)

##
## Call:
## lm(formula = Data1$bsal ~ sex1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1336.88 -338.85 43.12 261.15 2143.12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5956.9 105.3 56.580 < 2e-16 ***
## sex1 -818.0 130.0 -6.293 1.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 595.6 on 91 degrees of freedom
## Multiple R-squared: 0.3032, Adjusted R-squared: 0.2955
## F-statistic: 39.6 on 1 and 91 DF, p-value: 1.076e-08

We see that the model provides a good fit, producing an F statistic of 39.6 on 1 and 91 degrees of freedom with an associated p-value of approximately 0. We also see that the t-statistic for the gender coefficient is -6.293, producing a p-value of approximately 0 for testing $H_0: \beta_1 = 0$. The intercept is also significantly different from 0. This indicates that women make about $818 less (based on this data) in base salary than their male counterparts. However, we are cautious because of the low adjusted $R^2 = 0.2955$.

plot(sex1, Data1$bsal, main = "Gender Versus Base Salary",
     xlab = "Gender", ylab = "Base Salary")
abline(m2, col = "red")


This is reflected in the scatterplot with the regression line superimposed. Clearly, males on average make more in base salary than females in these data, but the observations are spread widely around the regression line, so there is substantial variation around this trend.

These two parameterizations produce the same interpretive results; the only differences are the coefficient values and standard errors.

For the second model, the results of the regression are shown below:

#### Creating new variable where female = 1, male = -1
sex2 = rep(0, length(Data1$sex))
for (i in 1:length(Data1$sex)) {
  if (Data1$sex[i] == "Female") {
    sex2[i] = 1
  } else {
    sex2[i] = -1
  }
}
## linear model
m3 = lm(Data1$bsal ~ sex2)
summary(m3)


##
## Call:
## lm(formula = Data1$bsal ~ sex2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1336.88 -338.85 43.12 261.15 2143.12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5548 65 85.353 < 2e-16 ***
## sex2 -409 65 -6.293 1.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 595.6 on 91 degrees of freedom
## Multiple R-squared: 0.3032, Adjusted R-squared: 0.2955
## F-statistic: 39.6 on 1 and 91 DF, p-value: 1.076e-08

We see that the model again provides a good fit, with the same F statistic of 39.6 on 1 and 91 degrees of freedom and a p-value of approximately 0. The t-statistic for the gender coefficient is -6.293, producing a p-value of approximately 0 for testing $H_0: \beta_1 = 0$. The intercept is also significantly different from 0. Under this coding the estimated female-male difference in mean base salary is $\hat{y}(1) - \hat{y}(-1) = 2\hat{\beta}_1 = 2(-409) = -818$, so women again make about $818 less (based on this data) in base salary than their male counterparts. However, we are cautious because of the low adjusted $R^2 = 0.2955$.

plot(sex2, Data1$bsal, main = "Gender Versus Base Salary",
     xlab = "Gender", ylab = "Base Salary")
abline(m3, col = "red")


Again, the scatterplot with the regression line superimposed shows the same pattern: males on average make more in base salary, with wide spread around the regression line.
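The equivalence of the two codings can also be checked directly (a quick sketch, not part of the original solution): both models produce identical fitted values and residuals, and hence identical $R^2$ and F statistics.

all.equal(fitted(m2), fitted(m3))       # TRUE: identical fitted salaries
all.equal(residuals(m2), residuals(m3)) # TRUE: identical residuals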


Interpretation: the estimated difference between men and women is the same under both codings. However, because women and men are now coded $+1$ and $-1$, the estimate of $\beta_1$ has changed: it is now half of the estimate under the 0/1 coding ($-409$ instead of $-818$). Likewise, the intercept can no longer be interpreted as the average base salary of the group coded 0 (males, under the first coding); it is now the average of the two group means. And $\hat{\beta}_1$ by itself is no longer the decrease in average salary for being female; as noted above, that difference is $\hat{y}$ at $x = 1$ minus $\hat{y}$ at $x = -1$, which is still $-818$.

3)
Consider the restaurant data (attached as nyc.csv) in chapter 1 of the textbook, section 1.2.3.

A)
Fit an SLR model of price on service. Report (a) a plot of the data with the fitted regression line and a 95% confidence interval for the SLR model; (b) point estimates of the regression coefficients and their SEs; (c) hypothesis tests of the regression coefficients; (d) confidence intervals for the regression coefficients with interpretation; (e) the regression ANOVA table; (f) $R^2$.

Dat <- read.csv("nyc.csv", as.is = TRUE, header = TRUE);

names(Dat)


## [1] "Case" "Restaurant" "Price" "Food" "Decor"


## [6] "Service" "East"

m1 = lm(Dat$Price~Dat$Service)

summary(m1)

##
## Call:
## lm(formula = Dat$Price ~ Dat$Service)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6646 -4.7540 -0.2093 4.3368 26.2460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.9778 5.1093 -2.344 0.0202 *
## Dat$Service 2.8184 0.2618 10.764 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.153 on 166 degrees of freedom
## Multiple R-squared: 0.4111, Adjusted R-squared: 0.4075
## F-statistic: 115.9 on 1 and 166 DF, p-value: < 2.2e-16

names(m1)

## [1] "coefficients" "residuals" "effects" "rank"


## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"

We see above that the point estimate for the Service coefficient $\hat{\beta}_1$ is 2.818 with standard error 0.262. The t test for $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ produces a t-value of $2.818/0.262 = 10.76$ on 166 degrees of freedom. This test produces a p-value of $p < 2\mathrm{e}{-16}$, indicating that better customer service is associated with higher prices. The ANOVA table is shown in the output above.

This produces a p-value of approximately 0 for testing whether there is a linear relationship between Price and Service, that is $H_0: \beta_1 = 0$. So it appears that there is a relationship.

The SSE, MSE, SSR, MSR, and $R^2$, as well as the F statistic, are calculated below:

SSE = sum(m1$residual^2)                          # residual sum of squares
MSE = sum(m1$residual^2)/m1$df.residual           # SSE/(n - 2)
SSR = sum((m1$fitted.values - mean(Dat$Price))^2) # regression sum of squares (MSR = SSR/1)
Fstat = SSR/MSE                                   # F = MSR/MSE, 1 numerator df
SSE

## [1] 8493.398

MSE


## [1] 51.16505

SSR

## [1] 5928.12

Fstat

## [1] 115.8627

Rsquared= SSR/(SSE+SSR)
Rsquared

## [1] 0.4110608

The plot of the fitted regression line is shown below, along with a 95% confidence interval.

x=Dat$Service
y=Dat$Price
## Point of averages
(meanx <- mean(x))

## [1] 19.39881

(meany <- mean(y))

## [1] 42.69643

## Number of samples
n <- length(x)

(SXY <- sum(x*y) - n*meanx*meany)

## [1] 2103.339

## Also
## SXY <- sum((x-meanx)*(y-meany))

(SXX <- sum(x^2) - n*meanx^2)

## [1] 746.2798


## Also
## SXX <- sum(((x-meanx)^2))

## Coefficients of the regression line


(beta1hat <- SXY/SXX)

## [1] 2.818433

(beta0hat <- meany - beta1hat*meanx)

## [1] -11.97781

(r <- mean((x-meanx)/sd(x) * (y-meany)/sd(y)))

## [1] 0.6373239

## Using R to get the regression coefficients (and a lot more...)


lmfit <-lm(y ~ x)
summary(lmfit)

##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6646 -4.7540 -0.2093 4.3368 26.2460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.9778 5.1093 -2.344 0.0202 *
## x 2.8184 0.2618 10.764 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.153 on 166 degrees of freedom
## Multiple R-squared: 0.4111, Adjusted R-squared: 0.4075
## F-statistic: 115.9 on 1 and 166 DF, p-value: < 2.2e-16

## Fitted values
(yhat <- beta0hat + beta1hat*x)


## [1] 44.39084 41.57241 38.75398 35.93555 47.20928 47.20928 47.20928


## [8] 47.20928 50.02771 41.57241 44.39084 47.20928 44.39084 41.57241
## [15] 47.20928 47.20928 52.84614 50.02771 38.75398 50.02771 52.84614
## [22] 47.20928 38.75398 47.20928 50.02771 41.57241 50.02771 38.75398
## [29] 44.39084 44.39084 38.75398 38.75398 41.57241 52.84614 52.84614
## [36] 47.20928 44.39084 50.02771 52.84614 47.20928 38.75398 35.93555
## [43] 38.75398 52.84614 47.20928 47.20928 44.39084 44.39084 47.20928
## [50] 33.11711 44.39084 44.39084 35.93555 35.93555 33.11711 30.29868
## [57] 44.39084 41.57241 44.39084 38.75398 47.20928 35.93555 35.93555
## [64] 41.57241 35.93555 38.75398 44.39084 30.29868 30.29868 44.39084
## [71] 44.39084 50.02771 50.02771 38.75398 50.02771 47.20928 44.39084
## [78] 38.75398 41.57241 41.57241 47.20928 47.20928 55.66457 41.57241
## [85] 52.84614 41.57241 44.39084 55.66457 47.20928 38.75398 47.20928
## [92] 52.84614 50.02771 47.20928 38.75398 38.75398 44.39084 41.57241
## [99] 35.93555 30.29868 47.20928 35.93555 44.39084 35.93555 47.20928
## [106] 47.20928 33.11711 41.57241 41.57241 47.20928 41.57241 38.75398
## [113] 47.20928 47.20928 30.29868 35.93555 27.48025 41.57241 38.75398
## [120] 35.93555 38.75398 38.75398 35.93555 44.39084 41.57241 35.93555
## [127] 44.39084 38.75398 41.57241 38.75398 35.93555 50.02771 47.20928
## [134] 44.39084 44.39084 38.75398 44.39084 35.93555 33.11711 35.93555
## [141] 41.57241 44.39084 47.20928 47.20928 50.02771 47.20928 35.93555
## [148] 52.84614 44.39084 38.75398 44.39084 52.84614 44.39084 38.75398
## [155] 33.11711 41.57241 47.20928 35.93555 33.11711 41.57241 50.02771
## [162] 41.57241 38.75398 33.11711 35.93555 35.93555 47.20928 33.11711

## Residuals
(ehat <- y - yhat)


## [1] -1.39084347 -9.57241077 -4.75397807 5.06445464 6.79072383


## [6] 4.79072383 -13.20927617 -13.20927617 -11.02770887 2.42758923
## [11] 0.60915653 -0.20927617 7.60915653 -6.57241077 -0.20927617
## [16] -10.20927617 -7.84614158 6.97229113 -0.75397807 0.97229113
## [21] 1.15385842 3.79072383 -0.75397807 1.79072383 -5.02770887
## [26] -4.57241077 -0.02770887 4.24602193 4.60915653 20.60915653
## [31] -4.75397807 12.24602193 7.42758923 -1.84614158 9.15385842
## [36] 2.79072383 6.60915653 1.97229113 4.15385842 1.79072383
## [41] -5.75397807 7.06445464 2.24602193 5.15385842 8.79072383
## [46] -3.20927617 -7.39084347 11.60915653 10.79072383 10.88288734
## [51] 1.60915653 -4.39084347 3.06445464 0.06445464 0.88288734
## [56] 23.70132004 6.60915653 -0.57241077 -4.39084347 -14.75397807
## [61] 5.79072383 -4.93554536 -0.93554536 7.42758923 2.06445464
## [66] 9.24602193 -1.39084347 -1.29867996 6.70132004 10.60915653
## [71] -7.39084347 4.97229113 -1.02770887 -5.75397807 1.97229113
## [76] -0.20927617 -1.39084347 -5.75397807 -3.57241077 6.42758923
## [81] 2.79072383 -1.20927617 -17.66457428 -8.57241077 -6.84614158
## [86] -4.57241077 5.60915653 -1.66457428 -6.20927617 -1.75397807
## [91] 2.79072383 7.15385842 -14.02770887 6.79072383 0.24602193
## [96] -3.75397807 -14.39084347 -0.57241077 -5.93554536 -5.29867996
## [101] -4.20927617 9.06445464 12.60915653 -3.93554536 3.79072383
## [106] 0.79072383 2.88288734 -4.57241077 -10.57241077 -0.20927617
## [111] -1.57241077 -1.75397807 -4.20927617 3.79072383 -11.29867996
## [116] -7.93554536 -5.48024726 -0.57241077 -5.75397807 -6.93554536
## [121] -5.75397807 6.24602193 2.06445464 7.60915653 -3.57241077
## [126] 11.06445464 1.60915653 1.24602193 -9.57241077 26.24602193
## [131] 11.06445464 14.97229113 -2.20927617 1.60915653 -0.39084347
## [136] 1.24602193 1.60915653 -3.93554536 -10.11711266 6.06445464
## [141] -12.57241077 4.60915653 5.79072383 -2.20927617 12.97229113
## [146] 4.79072383 4.06445464 -7.84614158 -6.39084347 -0.75397807
## [151] -2.39084347 4.15385842 -5.39084347 4.24602193 -4.11711266
## [156] 0.42758923 2.79072383 -1.93554536 -2.11711266 -10.57241077
## [161] -4.02770887 0.42758923 -7.75397807 -2.11711266 -9.93554536
## [166] -4.93554536 -9.20927617 0.88288734

(SSreg <- sum((yhat-meany)^2))

## [1] 5928.12

## This quantity appears in the x row of anova(lmfit) below.

(RSS <- sum(ehat^2))

## [1] 8493.398

(S2 <- RSS/(n-2))

## [1] 51.16505

## These quantities appear in the line of residuals of:


anova(lmfit)


## Analysis of Variance Table


##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 5928.1 5928.1 115.86 < 2.2e-16 ***
## Residuals 166 8493.4 51.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## The square root of S2 is the residual standard error


(S <- sqrt(S2))

## [1] 7.152975

## where the degrees of freedom are


n-2

## [1] 166

## This info appears in the line "Residual standard error:" of


## summary(lmfit)
## above.
############################

xast <- seq(min(x), max(x), l=200)


yhatast <- beta0hat + beta1hat*xast

se_yhatast <- S * sqrt(1/n + (xast-meanx)^2/SXX)

plot(x, y, xlab="Service", ylab="Price")


abline(a=beta0hat, b=beta1hat)
## 95% confidence intervals for the population regression line
alpha <- (1 - 0.95)
lowerc <- yhatast - se_yhatast * qt(p=1-alpha/2, df=n-2)
upperc <- yhatast + se_yhatast * qt(p=1-alpha/2, df=n-2)
lines(xast, lowerc)
lines(xast, upperc)
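Equivalently, the same pointwise band can be obtained from base R's predict() with interval = "confidence" (a sketch; the manual construction above is what produced the plot):

ci <- predict(lmfit, newdata = data.frame(x = xast),
              interval = "confidence", level = 0.95)
lines(xast, ci[, "lwr"], lty = 2)   # should coincide with lowerc
lines(xast, ci[, "upr"], lty = 2)   # should coincide with upperc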


Finally, the 95% confidence intervals for $\beta_0$ and $\beta_1$ are calculated below:

## Using z = 1.96 as the multiplier; the exact value is qt(0.975, df = 166), about 1.974
CIB0 = c(m1$coeff[1] - 5.1093*1.96, m1$coeff[1] + 5.1093*1.96)
CIB1 = c(m1$coeff[2] - 0.2618*1.96, m1$coeff[2] + 0.2618*1.96)
CIB0

## (Intercept) (Intercept)
## -21.992039 -1.963583

CIB1

## Dat$Service Dat$Service
## 2.305305 3.331561

The values of $SE(\hat{\beta}_0)$ and $SE(\hat{\beta}_1)$ were obtained from the regression summary, but could also have been obtained directly. We see that the 95% confidence interval for $\beta_1$ lies entirely above 0, corroborating the result of the hypothesis test (that is, $\beta_1 \neq 0$) and indicating that better restaurant service is associated with higher prices. In this case, the confidence interval for $\beta_0$ does not contain 0, corroborating the hypothesis test that rejected $H_0: \beta_0 = 0$.
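For reference, R's built-in confint() computes these intervals directly, using the exact t quantile rather than the 1.96 normal approximation (a usage sketch):

confint(m1, level = 0.95)   # rows: (Intercept) and Dat$Service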

B)
Repeat the above for an SLR model of price on decor.


m1 = lm(Dat$Price~Dat$Decor)
summary(m1)

##
## Call:
## lm(formula = Dat$Price ~ Dat$Decor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9578 -4.4862 -0.4673 4.0422 18.5138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.362 3.292 -0.414 0.68
## Dat$Decor 2.490 0.184 13.537 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.426 on 166 degrees of freedom
## Multiple R-squared: 0.5247, Adjusted R-squared: 0.5218
## F-statistic: 183.2 on 1 and 166 DF, p-value: < 2.2e-16

We see above that the point estimate for the Decor coefficient $\hat{\beta}_1$ is 2.490 with standard error 0.184. The t test for $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ produces a t-value of $2.490/0.184 \approx 13.54$ on 166 degrees of freedom. This test produces a p-value of $p < 2\mathrm{e}{-16}$, indicating that better decor is associated with higher prices. The ANOVA table is shown in the output above.

This produces a p-value of approximately 0 for testing whether there is a linear relationship between Price and Decor, that is $H_0: \beta_1 = 0$. So it appears that there is a relationship.

The SSE, MSE, SSR, MSR, and $R^2$, as well as the F statistic, are calculated below:

SSE=sum(m1$residual^2)
MSE=sum(m1$residual^2)/m1$df.residual
SSR=sum((m1$fitted.values-mean(Dat$Price))^2)
Fstat=sum((m1$fitted.values-mean(Dat$Price))^2)/(sum(m1$residual^2)/m1$df.residual)
SSE

## [1] 6854.742

MSE

## [1] 41.29363

SSR

## [1] 7566.776

Fstat

## [1] 183.2432


Rsquared= SSR/(SSE+SSR)
Rsquared

## [1] 0.5246865

The regression line and the 95% confidence interval are shown below:

x=Dat$Decor
y=Dat$Price
## Point of averages
(meanx <- mean(x))

## [1] 17.69048

(meany <- mean(y))

## [1] 42.69643

## Number of samples
n <- length(x)

(SXY <- sum(x*y) - n*meanx*meany)

## [1] 3038.214

## Also
## SXY <- sum((x-meanx)*(y-meany))

(SXX <- sum(x^2) - n*meanx^2)

## [1] 1219.905

## Also
## SXX <- sum(((x-meanx)^2))

## Coefficients of the regression line


(beta1hat <- SXY/SXX)

## [1] 2.490534

(beta0hat <- meany - beta1hat*meanx)

## [1] -1.362304

(r <- mean((x-meanx)/sd(x) * (y-meany)/sd(y)))


## [1] 0.7200409

## Using R to get the regression coefficients (and a lot more...)


lmfit <-lm(y ~ x)
summary(lmfit)

##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9578 -4.4862 -0.4673 4.0422 18.5138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.362 3.292 -0.414 0.68
## x 2.490 0.184 13.537 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.426 on 166 degrees of freedom
## Multiple R-squared: 0.5247, Adjusted R-squared: 0.5218
## F-statistic: 183.2 on 1 and 166 DF, p-value: < 2.2e-16

## Fitted values
(yhat <- beta0hat + beta1hat*x)

## [1] 43.46731 45.95784 31.01464 48.44838 45.95784 53.42944 38.48624


## [8] 43.46731 45.95784 40.97677 40.97677 45.95784 45.95784 40.97677
## [15] 43.46731 45.95784 43.46731 50.93891 40.97677 48.44838 48.44838
## [22] 40.97677 43.46731 50.93891 48.44838 40.97677 45.95784 38.48624
## [29] 45.95784 48.44838 38.48624 48.44838 45.95784 53.42944 53.42944
## [36] 50.93891 43.46731 45.95784 48.44838 48.44838 40.97677 40.97677
## [43] 40.97677 50.93891 40.97677 40.97677 35.99571 40.97677 43.46731
## [50] 38.48624 43.46731 48.44838 43.46731 33.50517 35.99571 38.48624
## [57] 40.97677 33.50517 40.97677 31.01464 48.44838 38.48624 38.48624
## [64] 45.95784 35.99571 38.48624 45.95784 33.50517 43.46731 50.93891
## [71] 43.46731 48.44838 48.44838 33.50517 48.44838 38.48624 38.48624
## [78] 40.97677 38.48624 43.46731 43.46731 45.95784 45.95784 38.48624
## [85] 45.95784 35.99571 43.46731 58.41051 45.95784 35.99571 43.46731
## [92] 53.42944 38.48624 45.95784 43.46731 38.48624 31.01464 40.97677
## [99] 33.50517 35.99571 43.46731 35.99571 38.48624 35.99571 50.93891
## [106] 48.44838 38.48624 40.97677 45.95784 45.95784 38.48624 38.48624
## [113] 48.44838 53.42944 21.05250 33.50517 13.58090 40.97677 35.99571
## [120] 35.99571 40.97677 38.48624 38.48624 55.91998 40.97677 43.46731
## [127] 43.46731 40.97677 35.99571 55.91998 50.93891 53.42944 40.97677
## [134] 53.42944 45.95784 45.95784 43.46731 35.99571 33.50517 50.93891
## [141] 43.46731 43.46731 58.41051 45.95784 60.90105 55.91998 48.44838
## [148] 50.93891 40.97677 40.97677 38.48624 45.95784 45.95784 43.46731
## [155] 33.50517 38.48624 45.95784 38.48624 35.99571 40.97677 45.95784
## [162] 35.99571 38.48624 35.99571 38.48624 38.48624 40.97677 23.54304


## Residuals
(ehat <- y - yhat)

## [1] -0.46730814 -13.95784214 2.98536185 -7.44837614 8.04215786


## [6] -1.42944414 -4.48624014 -9.46730814 -6.95784214 3.02322586
## [11] 4.02322586 1.04215786 6.04215786 -5.97677414 3.53269186
## [16] -8.95784214 1.53269186 6.06108986 -2.97677414 2.55162386
## [21] 5.55162386 10.02322586 -5.46730814 -1.93891014 -3.44837614
## [26] -3.97677414 4.04215786 4.51375986 3.04215786 16.55162386
## [31] -4.48624014 2.55162386 3.04215786 -2.42944414 8.57055586
## [36] -0.93891014 7.53269186 6.04215786 8.55162386 0.55162386
## [41] -7.97677414 2.02322586 0.02322586 7.06108986 15.02322586
## [46] 3.02322586 1.00429386 15.02322586 14.53269186 5.51375986
## [51] 2.53269186 -8.44837614 -4.46730814 2.49482786 -1.99570614
## [56] 15.51375986 10.02322586 7.49482786 -0.97677414 -7.01463815
## [61] 4.55162386 -7.48624014 -3.48624014 3.04215786 2.00429386
## [66] 9.51375986 -2.95784214 -4.50517214 -6.46730814 4.06108986
## [71] -6.46730814 6.55162386 0.55162386 -0.50517214 3.55162386
## [76] 8.51375986 4.51375986 -7.97677414 -0.48624014 4.53269186
## [81] 6.53269186 0.04215786 -7.95784214 -5.48624014 0.04215786
## [86] 1.00429386 6.53269186 -4.41051214 -4.95784214 1.00429386
## [91] 6.53269186 6.57055586 -2.48624014 8.04215786 -4.46730814
## [96] -3.48624014 -1.01463815 0.02322586 -3.50517214 -10.99570614
## [101] -0.46730814 9.00429386 18.51375986 -3.99570614 0.06108986
## [106] -0.44837614 -2.48624014 -3.97677414 -14.95784214 1.04215786
## [111] 1.51375986 -1.48624014 -5.44837614 -2.42944414 -2.05250215
## [116] -5.50517214 8.41909985 0.02322586 -2.99570614 -6.99570614
## [121] -7.97677414 6.51375986 -0.48624014 -3.91997814 -2.97677414
## [126] 3.53269186 2.53269186 -0.97677414 -3.99570614 9.08002186
## [131] -3.93891014 11.57055586 4.02322586 -7.42944414 -1.95784214
## [136] -5.95784214 2.53269186 -3.99570614 -10.50517214 -8.93891014
## [141] -14.46730814 5.53269186 -5.41051214 -0.95784214 2.09895386
## [146] -3.91997814 -8.44837614 -5.93891014 -2.97677414 -2.97677414
## [151] 3.51375986 11.04215786 -6.95784214 -0.46730814 -4.50517214
## [156] 3.51375986 4.04215786 -4.48624014 -4.99570614 -9.97677414
## [161] 0.04215786 6.00429386 -7.48624014 -4.99570614 -12.48624014
## [166] -7.48624014 -2.97677414 10.45696385

(SSreg <- sum((yhat-meany)^2))

## [1] 7566.776

## This quantity appears in the x row of anova(lmfit) below.

(RSS <- sum(ehat^2))

## [1] 6854.742

(S2 <- RSS/(n-2))

## [1] 41.29363


## These quantities appear in the line of residuals of:


anova(lmfit)

## Analysis of Variance Table


##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 7566.8 7566.8 183.24 < 2.2e-16 ***
## Residuals 166 6854.7 41.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## The square root of S2 is the residual standard error


(S <- sqrt(S2))

## [1] 6.426012

## where the degrees of freedom are


n-2

## [1] 166

## This info appears in the line "Residual standard error:" of


## summary(lmfit)
## above.
############################

xast <- seq(min(x), max(x), l=200)


yhatast <- beta0hat + beta1hat*xast

se_yhatast <- S * sqrt(1/n + (xast-meanx)^2/SXX)

plot(x, y, xlab="Decor", ylab="Price")


abline(a=beta0hat, b=beta1hat)
## 95% confidence intervals for the population regression line
alpha <- (1 - 0.95)
lowerc <- yhatast - se_yhatast * qt(p=1-alpha/2, df=n-2)
upperc <- yhatast + se_yhatast * qt(p=1-alpha/2, df=n-2)
lines(xast, lowerc)
lines(xast, upperc)


Finally, the 95% confidence intervals for $\beta_0$ and $\beta_1$ are calculated below:

CIB0 = c(m1$coeff[1]-3.292*1.96,m1$coeff[1]+3.292*1.96)
CIB1 = c(m1$coeff[2]-0.184*1.96,m1$coeff[2]+0.184*1.96)
CIB0

## (Intercept) (Intercept)
## -7.814624 5.090016

CIB1

## Dat$Decor Dat$Decor
## 2.129894 2.851174

The values of $SE(\hat{\beta}_0)$ and $SE(\hat{\beta}_1)$ were obtained from the regression summary, but could also have been obtained directly. We see that the 95% confidence interval for $\beta_1$ lies entirely above 0, corroborating the result of the hypothesis test (that is, $\beta_1 \neq 0$) and indicating that better restaurant decor is associated with higher prices. In this case, the confidence interval for $\beta_0$ contains 0, corroborating the hypothesis test that failed to reject $H_0: \beta_0 = 0$.

C)
Which of the two predictor variables, service or decor, do you think better predicts price? Explain.

The F statistic and $R^2$ are higher for the model using decor as the explanatory variable, indicating that for this data set, decor is a better predictor of price than service.
