HW 2 Solutions
STAT 410
September 30, 2016
1)
Textbook 2.4. For part (a) also show that $\hat{\beta}$ is a linear estimator in the $y$ data, being sure to identify the weights,
$c_i$. Add the following part (c) to this exercise: (c) Find an estimator for $\sigma^2$. Is it unbiased?
A)
To get the least squares estimator we want to minimize:
$$\sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \beta x_i)^2 = \sum_{i=1}^n y_i^2 - 2\beta\sum_{i=1}^n x_i y_i + \beta^2 \sum_{i=1}^n x_i^2$$
with respect to $\beta$. If we take the derivative of this with respect to $\beta$ and set the resulting equation equal to 0, we can
solve for the $\beta$ that minimizes the least squares error. Taking the derivative with respect to $\beta$, we obtain:
$$0 = -2\sum_{i=1}^n x_i y_i + 2\beta\sum_{i=1}^n x_i^2 \quad\Longrightarrow\quad \hat{\beta} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}$$
which is the desired least squares estimator from the textbook. We have that $\hat{\beta}$ is a linear estimator in the $y$
data because
$$\hat{\beta} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} = \sum_{i=1}^n c_i y_i$$
where
$$c_i = \frac{x_i}{\sum_{j=1}^n x_j^2}$$
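This closed form is easy to sanity-check numerically. The sketch below is not part of the original solutions (and is in Python rather than the R used later in this document; all names are illustrative): it simulates data from the no-intercept model and confirms that $\sum x_i y_i / \sum x_i^2$ minimizes the squared-error criterion.

```python
import random

# Illustrative check (not from the solutions): simulate y_i = beta*x_i + eps_i
# and compare the closed-form estimator against the squared-error criterion.
random.seed(410)
beta = 2.5
x = [random.uniform(1, 10) for _ in range(200)]
y = [beta * xi + random.gauss(0, 1) for xi in x]

# Closed-form least squares estimator for regression through the origin
beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def sse(b):
    """Sum of squared errors for slope b."""
    return sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))

# beta_hat should minimize the criterion and sit near the true slope
assert sse(beta_hat) <= sse(beta_hat + 0.01)
assert sse(beta_hat) <= sse(beta_hat - 0.01)
assert abs(beta_hat - beta) < 0.2
```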
B)
i.
Show that $E(\hat{\beta}\mid X) = \beta$.
Solution
$$E(\hat{\beta}\mid X) = E\!\left(\frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}\,\middle|\,X\right) = \frac{\sum_{i=1}^n x_i E(y_i)}{\sum_{i=1}^n x_i^2} = \frac{\sum_{i=1}^n x_i(\beta x_i)}{\sum_{i=1}^n x_i^2} = \beta$$
ii.
https://owlspace-ccm.rice.edu/access/content/group/STAT-410-001-F16/hw_sol/s410_f16_hw2_sol.html 1/24
9/28/2017 HW 2 Solutions
Find $Var(\hat{\beta}\mid X)$.
Solution
$$Var(\hat{\beta}\mid X) = Var\!\left(\frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}\right) = \frac{\sum_{i=1}^n x_i^2\,Var(y_i)}{\left(\sum_{i=1}^n x_i^2\right)^2} = \frac{\sigma^2 \sum_{i=1}^n x_i^2}{\left(\sum_{i=1}^n x_i^2\right)^2} = \frac{\sigma^2}{\sum_{i=1}^n x_i^2}$$
iii.
Show that $\hat{\beta}\mid X \sim N\!\left(\beta,\ \sigma^2\big/\sum_{i=1}^n x_i^2\right)$.
Solution
We already showed the mean and variance of $\hat{\beta}\mid X$ above, so all we need to show is that it is normally
distributed. Now,
$$\hat{\beta} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} = \frac{\sum_{i=1}^n x_i(\beta x_i + \epsilon_i)}{\sum_{i=1}^n x_i^2} = \beta + \frac{\sum_{i=1}^n x_i \epsilon_i}{\sum_{i=1}^n x_i^2}$$
and since we know that $\epsilon_i \sim N(0, \sigma^2)$, the above is a linear combination of normally distributed random
variables and is therefore itself normally distributed.
C)
We could use the usual MSE but with one more degree of freedom (since we don't have an intercept):
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^n (y_i - \hat{\beta} x_i)^2}{n-1} = \frac{\sum_{i=1}^n \left(y_i^2 - 2 y_i \hat{\beta} x_i + \hat{\beta}^2 x_i^2\right)}{n-1}$$
Since $(n-1)\hat{\sigma}^2/\sigma^2 \sim \chi^2_{n-1}$,
$$E(\hat{\sigma}^2) = \frac{\sigma^2}{n-1}\,E\!\left(\frac{(n-1)\hat{\sigma}^2}{\sigma^2}\right) = \frac{\sigma^2}{n-1}(n-1) = \sigma^2$$
so $\hat{\sigma}^2$ is unbiased.
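As a numerical illustration (a Python sketch, not part of the original solution; the parameter values are made up), a small Monte Carlo run shows the divisor $n - 1$ giving an average $\hat{\sigma}^2$ close to the true $\sigma^2$:

```python
import random

# Illustrative Monte Carlo check: in the no-intercept model, the average of
# sigma2_hat = SSE/(n-1) over many simulated datasets should be close to
# the true sigma^2. All names and values here are illustrative.
random.seed(1)
beta, sigma2, n = 2.0, 4.0, 30
x = [float(i) for i in range(1, n + 1)]

estimates = []
for _ in range(2000):
    y = [beta * xi + random.gauss(0, sigma2 ** 0.5) for xi in x]
    bhat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    sse = sum((yi - bhat * xi) ** 2 for xi, yi in zip(x, y))
    estimates.append(sse / (n - 1))

mean_est = sum(estimates) / len(estimates)
assert abs(mean_est - sigma2) < 0.3  # close to the true sigma^2 = 4
```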
2)
Consider the SLR model
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
for $i = 1, \ldots, n$, where the $\epsilon_i \sim N(0, \sigma^2)$ are i.i.d. In class we considered an indicator variable defined by 0 for
male and 1 for female. This problem continues that discussion. Assume there are k males and m females.
A)
Using the parameterization $x_i \in \{0, 1\}$ as in class, derive an expression for $Var(\hat{\beta}_1)$. Recall that in general
$Var(\hat{\beta}_1) = \sigma^2 / S_{XX}$.
Solution
First note n = k + m .
Since $x_i \in \{0, 1\}$ for all $i$ (with the indicator $I[x_i = 1] = 1$ if person $i$ is a female and 0 otherwise), we have:
$$S_{XX} = \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - n\bar{x}^2 = \sum_{i=1}^n x_i^2 - n\left(\frac{m}{n}\right)^2 = m - \frac{m^2}{n} = \frac{m(n-m)}{n} = \frac{mk}{k+m}$$
so we have that:
$$Var(\hat{\beta}_1) = \frac{\sigma^2}{S_{XX}} = \frac{\sigma^2(k+m)}{mk}$$
B)
Using the parameterization where a female is coded by +1 and a male by −1, derive an expression for each $\hat{\beta}_j$
and interpret the regression coefficient estimators. Derive an expression for $Var(\hat{\beta}_1)$.
Solution
$$S_{XX} = \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - n\bar{x}^2 = n - n\left(\frac{m-k}{n}\right)^2 = (m+k) - \frac{(m-k)^2}{m+k}$$
$$= \frac{(m+k)^2 - (m-k)^2}{m+k} = \frac{m^2 + 2mk + k^2 - m^2 + 2mk - k^2}{m+k} = \frac{4mk}{m+k}$$
so we have that:
$$Var(\hat{\beta}_1) = \frac{\sigma^2}{S_{XX}} = \frac{\sigma^2(m+k)}{4mk}$$
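Both $S_{XX}$ formulas can be verified directly for concrete group sizes (an illustrative Python sketch, not part of the original solutions; the counts are arbitrary):

```python
# Illustrative check: with k males and m females, compare S_XX under the
# 0/1 coding and the -1/+1 coding against the derived closed forms.
k, m = 12, 7

x01 = [0] * k + [1] * m    # 0 = male, 1 = female
xpm = [-1] * k + [1] * m   # -1 = male, +1 = female

def sxx(x):
    """Corrected sum of squares of x."""
    xbar = sum(x) / len(x)
    return sum((xi - xbar) ** 2 for xi in x)

assert abs(sxx(x01) - m * k / (m + k)) < 1e-9      # mk/(k+m)
assert abs(sxx(xpm) - 4 * m * k / (m + k)) < 1e-9  # 4mk/(m+k)
```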
To derive the estimators, write out the normal equations and substitute $\sum_i x_i = m - k$ and $\sum_i x_i^2 = m + k = n$:
$$\hat{\beta}_0 \sum_{i=1}^n x_i + \hat{\beta}_1 \sum_{i=1}^n x_i^2 = \sum_{i=1}^n Y_i x_i$$
$$\hat{\beta}_0\, n + \hat{\beta}_1 \sum_{i=1}^n x_i = \sum_{i=1}^n Y_i$$
Writing $\sum_f$ and $\sum_m$ for sums over the $m$ females ($x_i = +1$) and the $k$ males ($x_i = -1$):
$$\hat{\beta}_0 (m - k) + \hat{\beta}_1 (m + k) = \sum_f Y_i - \sum_m Y_i$$
$$\hat{\beta}_0 (m + k) + \hat{\beta}_1 (m - k) = \sum_f Y_i + \sum_m Y_i$$
Adding the two equations:
$$2m\hat{\beta}_0 + 2m\hat{\beta}_1 = 2\sum_f Y_i \quad\Longrightarrow\quad \hat{\beta}_0 + \hat{\beta}_1 = \bar{Y}_f \quad\Longrightarrow\quad \hat{\beta}_0 = \bar{Y}_f - \hat{\beta}_1$$
Substituting into the first equation:
$$(\bar{Y}_f - \hat{\beta}_1)(m - k) + \hat{\beta}_1 (m + k) = \sum_f Y_i - \sum_m Y_i$$
$$\bar{Y}_f (m - k) + \hat{\beta}_1 \big((m + k) - (m - k)\big) = \sum_f Y_i - \sum_m Y_i$$
$$2k\hat{\beta}_1 = \sum_f Y_i - \sum_m Y_i - m\bar{Y}_f + k\bar{Y}_f = k\bar{Y}_f - k\bar{Y}_m$$
$$\hat{\beta}_1 = \frac{\bar{Y}_f - \bar{Y}_m}{2}$$
$$\hat{\beta}_0 = \bar{Y}_f - \frac{\bar{Y}_f - \bar{Y}_m}{2} = \frac{2\bar{Y}_f - \bar{Y}_f + \bar{Y}_m}{2} = \frac{\bar{Y}_f + \bar{Y}_m}{2}$$
So $\hat{\beta}_0$ estimates the average of the two group means and $\hat{\beta}_1$ estimates half the female-minus-male difference in mean salary.
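The same result can be checked numerically by solving the two normal equations for simulated data (an illustrative Python sketch, not part of the original solutions; the group sizes and means are made up):

```python
import random

# Illustrative check: with the +1/-1 coding, solve the two normal equations
# directly and confirm beta1_hat = (Ybar_f - Ybar_m)/2 and
# beta0_hat = (Ybar_f + Ybar_m)/2.
random.seed(2)
k, m = 10, 15
x = [-1] * k + [1] * m  # -1 = male, +1 = female
y = [random.gauss(5000 if xi < 0 else 4500, 300) for xi in x]

n = k + m
sx, sxx = sum(x), sum(xi * xi for xi in x)
sy, sxy = sum(y), sum(xi * yi for xi, yi in zip(x, y))

# Solve: b0*n + b1*sx = sy  and  b0*sx + b1*sxx = sxy
det = n * sxx - sx * sx
b0 = (sy * sxx - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det

ybar_m = sum(y[:k]) / k  # male mean (x = -1)
ybar_f = sum(y[k:]) / m  # female mean (x = +1)
assert abs(b1 - (ybar_f - ybar_m) / 2) < 1e-6
assert abs(b0 - (ybar_f + ybar_m) / 2) < 1e-6
```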
C)
The dataset, sexsalary.csv, was collected as part of a legal case on sex discrimination at a certain bank. For
now we are interested in the association between sex (sex) and base salary (bsal) when an employee was
hired. Use each of the above parameterizations for sex to fit a SLR model of base salary on sex. For each fit
interpret the regression coefficient estimates and determine whether there is statistical evidence for sex
differences in base salary. Do the two fits lead to the same conclusion? The dataset, sexsalary.csv, is attached
to this assignment
Solution
For the first model, the results of the regression are shown below:
##
## Call:
## lm(formula = Data1$bsal ~ sex1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1336.88 -338.85 43.12 261.15 2143.12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5956.9 105.3 56.580 < 2e-16 ***
## sex1 -818.0 130.0 -6.293 1.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 595.6 on 91 degrees of freedom
## Multiple R-squared: 0.3032, Adjusted R-squared: 0.2955
## F-statistic: 39.6 on 1 and 91 DF, p-value: 1.076e-08
We see that the model provides a good fit, producing an F statistic of 39.6 on 1 and 91 degrees of freedom with an
associated p-value of approximately 0. We also see that the t statistic for the sex coefficient is −6.293,
producing a p-value of approximately 0 for testing $H_0: \beta_1 = 0$. The intercept is also significantly different from
0. This indicates that women make about 818 less (based on these data) in base salary than their male counterparts.
This is reflected in the scatterplot with the regression line superimposed. Clearly males on average make more
than females for this data in base salary but the data are spread out greatly around the regression line, so this
trend has lots of variation.
These two methods produce the exact same interpretive results! The only differences are that the coefficient values
and standard errors change.
For the Second model, the results of the regression are shown below:
##
## Call:
## lm(formula = Data1$bsal ~ sex2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1336.88 -338.85 43.12 261.15 2143.12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5548 65 85.353 < 2e-16 ***
## sex2 -409 65 -6.293 1.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 595.6 on 91 degrees of freedom
## Multiple R-squared: 0.3032, Adjusted R-squared: 0.2955
## F-statistic: 39.6 on 1 and 91 DF, p-value: 1.076e-08
We see that the model provides a good fit, producing an F statistic of 39.6 on 1 and 91 degrees of freedom with an
associated p-value of approximately 0. We also see that the t statistic for the sex coefficient is −6.293,
producing a p-value of approximately 0 for testing $H_0: \beta_1 = 0$. The intercept is also significantly different from
0. This indicates that women make about $2 \times 409 = 818$ less (based on these data) in base salary than their male counterparts.
This is reflected in the scatterplot with the regression line superimposed. Clearly males on average make more
than females for this data in base salary but the data are spread out greatly around the regression line, so this
trend has lots of variation.
Under this coding the fitted difference between women and men is $\hat{\beta}_1(1) - \hat{\beta}_1(-1) = 2\hat{\beta}_1$, so women make
about $409 + 409 = 818$ less (based on these data) in base salary than their male counterparts. However, we are
cautious because of the low adjusted $R^2 = 0.2955$.
Interpretation: the estimated difference between men and women is the same under either coding. However, because we now have
men and women coded as +1 and −1, the estimate of $\beta_1$ has changed: it is now half of the
estimate of $\beta_1$ under the 0/1 coding (−409 now instead of −818). Likewise, the intercept can no
longer be interpreted as the average salary of the sex coded 0 (males); it is now the average of the two group means.
$\hat{\beta}_1$ evaluated at $x = 1$ is no longer the full decrease in average salary for a female. As mentioned before, to
get this decrease you have to take the difference between $\hat{y}$ at $x = 1$ and $\hat{y}$ at $x = -1$ (still 818).
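This equivalence can be demonstrated on made-up data (an illustrative Python sketch, not part of the original solutions): fit the same responses under both codings and confirm that the fitted female-minus-male difference is identical while the slope halves.

```python
import random

# Illustrative check: fit SLR under the 0/1 and -1/+1 codings and compare.
random.seed(3)
k, m = 8, 11
y = [random.gauss(5600, 600) for _ in range(k + m)]

def slr(x, y):
    """Ordinary least squares slope and intercept for simple regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
        / sum((xi - xbar) ** 2 for xi in x)
    return ybar - b1 * xbar, b1

x01 = [0] * k + [1] * m    # 0 = male, 1 = female
xpm = [-1] * k + [1] * m   # -1 = male, +1 = female

b0a, b1a = slr(x01, y)
b0b, b1b = slr(xpm, y)

# Fitted female-minus-male difference is identical; the slope is halved
diff_a = (b0a + b1a * 1) - (b0a + b1a * 0)
diff_b = (b0b + b1b * 1) - (b0b + b1b * (-1))
assert abs(diff_a - diff_b) < 1e-8
assert abs(b1b - b1a / 2) < 1e-8
```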
3)
Consider the restaurant data (attached as nyc.csv) in chapter 1 of the textbook, section 1.2.3.
A)
Fit a SLR model of price on service. Report (a) a plot of the data with the fitted regression line and a 95%
confidence interval for the SLR model; (b) point estimates of the regression coefficients and their SEs; (c)
hypothesis tests of the regression coefficients; (d) confidence intervals for the regression coefficients with
interpretation; (e) the regression ANOVA table; (f) $R^2$.
names(Dat)
m1 = lm(Dat$Price~Dat$Service)
summary(m1)
##
## Call:
## lm(formula = Dat$Price ~ Dat$Service)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6646 -4.7540 -0.2093 4.3368 26.2460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.9778 5.1093 -2.344 0.0202 *
## Dat$Service 2.8184 0.2618 10.764 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.153 on 166 degrees of freedom
## Multiple R-squared: 0.4111, Adjusted R-squared: 0.4075
## F-statistic: 115.9 on 1 and 166 DF, p-value: < 2.2e-16
names(m1)
We see above that the point estimate for the Service coefficient $\hat{\beta}_1$ is 2.818 with standard error 0.262. The t test
for $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ produces a t-value of $2.818/0.262 = 10.76$ on 166 degrees of freedom.
This test produces a p-value of $p < 2\mathrm{e}{-16}$, indicating that better customer service is associated with higher prices. The
ANOVA table is shown in the output above.
This produces a p-value of approximately 0 for testing whether there is a linear relationship between Price and
Service, that is, $H_0: \beta_1 = 0$. So it appears that there is a relationship.
The SSE, MSE and SSR and MSR, R2 as well as the F statistic are calculated below:
SSE=sum(m1$residual^2)
MSE=sum(m1$residual^2)/m1$df.residual
SSR=sum((m1$fitted.values-mean(Dat$Price))^2)
Fstat=sum((m1$fitted.values-mean(Dat$Price))^2)/(sum(m1$residual^2)/m1$df.residual)
SSE
## [1] 8493.398
MSE
## [1] 51.16505
SSR
## [1] 5928.12
Fstat
## [1] 115.8627
Rsquared= SSR/(SSE+SSR)
Rsquared
## [1] 0.4110608
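These sums of squares fit together as expected; a quick arithmetic check using the printed values (a Python sketch, not part of the original R solution):

```python
# Arithmetic check using the sums of squares printed above
# (values copied from the R output; 166 is the residual df).
SSE = 8493.398
SSR = 5928.12

MSE = SSE / 166         # mean squared error
F = SSR / MSE           # MSR = SSR since the model df is 1
R2 = SSR / (SSE + SSR)  # R-squared

assert abs(MSE - 51.165) < 0.001
assert abs(F - 115.863) < 0.01
assert abs(R2 - 0.4110608) < 1e-4
```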
The plot of the fitted regression line is shown below along with a 95% confidence interval.
x=Dat$Service
y=Dat$Price
## Point of averages
(meanx <- mean(x))
## [1] 19.39881
(meany <- mean(y))
## [1] 42.69643
## Number of samples
n <- length(x)
## SXY <- sum((x-meanx)*(y-meany))
## [1] 2103.339
## SXX <- sum(((x-meanx)^2))
## [1] 746.2798
## beta1hat <- SXY/SXX
## [1] 2.818433
## beta0hat <- meany - beta1hat*meanx
## [1] -11.97781
## [1] 0.6373239
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6646 -4.7540 -0.2093 4.3368 26.2460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.9778 5.1093 -2.344 0.0202 *
## x 2.8184 0.2618 10.764 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.153 on 166 degrees of freedom
## Multiple R-squared: 0.4111, Adjusted R-squared: 0.4075
## F-statistic: 115.9 on 1 and 166 DF, p-value: < 2.2e-16
## Fitted values
(yhat <- beta0hat + beta1hat*x)
## Residuals
(ehat <- y - yhat)
## SSR = 5928.12, SSE = 8493.398, MSE = 51.16505
## Residual standard error = 7.152975 on 166 degrees of freedom
Finally, the 95% confidence intervals for $\beta_0$ and $\beta_1$ are calculated below:
CIB0 = c(m1$coeff[1]-5.1093*1.96,m1$coeff[1]+5.1093*1.96)
CIB1 = c(m1$coeff[2]-0.2618*1.96,m1$coeff[2]+0.2618*1.96)
CIB0
## (Intercept) (Intercept)
## -21.992039 -1.963583
CIB1
## Dat$Service Dat$Service
## 2.305305 3.331561
The values of $\widehat{SE}(\hat{\beta}_0)$ and $\widehat{SE}(\hat{\beta}_1)$ were obtained from the regression summary, but could also have been
obtained directly. We see that the 95% confidence interval for $\beta_1$ lies entirely above 0, corroborating the results
from the hypothesis test (that is, $\beta_1 \neq 0$) and indicating that better restaurant service is associated with higher prices. In
this case, the confidence interval for $\beta_0$ does not contain 0, corroborating the hypothesis test that rejected $H_0: \beta_0 = 0$.
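One small caveat: the intervals above use the normal multiplier 1.96, whereas the exact $t_{0.975}$ quantile with 166 degrees of freedom is about 1.974, which would widen each interval slightly without changing any conclusion. A Python sketch (not part of the original R solution) reproducing the normal-approximation intervals from the printed estimates and SEs:

```python
# Reproduce the normal-approximation 95% intervals computed in R above.
# Estimates and SEs are copied from the regression summary.
est0, se0 = -11.9778, 5.1093   # intercept
est1, se1 = 2.8184, 0.2618     # Service slope

ci0 = (est0 - 1.96 * se0, est0 + 1.96 * se0)
ci1 = (est1 - 1.96 * se1, est1 + 1.96 * se1)

assert abs(ci0[0] - (-21.992)) < 0.01 and abs(ci0[1] - (-1.964)) < 0.01
assert abs(ci1[0] - 2.3053) < 0.001 and abs(ci1[1] - 3.3316) < 0.001
```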
B)
Repeat the above for a SLR model of price on decor.
m1 = lm(Dat$Price~Dat$Decor)
summary(m1)
##
## Call:
## lm(formula = Dat$Price ~ Dat$Decor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9578 -4.4862 -0.4673 4.0422 18.5138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.362 3.292 -0.414 0.68
## Dat$Decor 2.490 0.184 13.537 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.426 on 166 degrees of freedom
## Multiple R-squared: 0.5247, Adjusted R-squared: 0.5218
## F-statistic: 183.2 on 1 and 166 DF, p-value: < 2.2e-16
We see above that the point estimate for the Decor coefficient $\hat{\beta}_1$ is 2.491 with standard error 0.184. The t test
for $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ produces a t-value of $2.491/0.184 = 13.54$ on 166 degrees of freedom.
This test produces a p-value of $p < 2\mathrm{e}{-16}$, indicating that better decor is associated with higher prices. The ANOVA
table is shown in the output above.
This produces a p-value of approximately 0 for testing whether there is a linear relationship between Price and
Decor, that is, $H_0: \beta_1 = 0$. So it appears that there is a relationship.
The SSE, MSE and SSR, MSR, R2 as well as the F statistic are calculated below:
SSE=sum(m1$residual^2)
MSE=sum(m1$residual^2)/m1$df.residual
SSR=sum((m1$fitted.values-mean(Dat$Price))^2)
Fstat=sum((m1$fitted.values-mean(Dat$Price))^2)/(sum(m1$residual^2)/m1$df.residual)
SSE
## [1] 6854.742
MSE
## [1] 41.29363
SSR
## [1] 7566.776
Fstat
## [1] 183.2432
Rsquared= SSR/(SSE+SSR)
Rsquared
## [1] 0.5246865
The regression line and the 95% confidence interval are shown below.
x=Dat$Decor
y=Dat$Price
## Point of averages
(meanx <- mean(x))
## [1] 17.69048
(meany <- mean(y))
## [1] 42.69643
## Number of samples
n <- length(x)
## SXY <- sum((x-meanx)*(y-meany))
## [1] 3038.214
## SXX <- sum(((x-meanx)^2))
## [1] 1219.905
## beta1hat <- SXY/SXX
## [1] 2.490534
## beta0hat <- meany - beta1hat*meanx
## [1] -1.362304
## [1] 0.7200409
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9578 -4.4862 -0.4673 4.0422 18.5138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.362 3.292 -0.414 0.68
## x 2.490 0.184 13.537 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.426 on 166 degrees of freedom
## Multiple R-squared: 0.5247, Adjusted R-squared: 0.5218
## F-statistic: 183.2 on 1 and 166 DF, p-value: < 2.2e-16
## Fitted values
(yhat <- beta0hat + beta1hat*x)
## Residuals
(ehat <- y - yhat)
## SSR = 7566.776, SSE = 6854.742, MSE = 41.29363
## Residual standard error = 6.426012 on 166 degrees of freedom
Finally, the 95% confidence intervals for $\beta_0$ and $\beta_1$ are calculated below:
CIB0 = c(m1$coeff[1]-3.292*1.96,m1$coeff[1]+3.292*1.96)
CIB1 = c(m1$coeff[2]-0.184*1.96,m1$coeff[2]+0.184*1.96)
CIB0
## (Intercept) (Intercept)
## -7.814624 5.090016
CIB1
## Dat$Decor Dat$Decor
## 2.129894 2.851174
The values of $\widehat{SE}(\hat{\beta}_0)$ and $\widehat{SE}(\hat{\beta}_1)$ were obtained from the regression summary, but could also have been
obtained directly. We see that the 95% confidence interval for $\beta_1$ lies entirely above 0, corroborating the results
from the hypothesis test (that is, $\beta_1 \neq 0$) and indicating that better restaurant decor is associated with higher prices. In
this case, the confidence interval for $\beta_0$ contains 0, corroborating the hypothesis test that failed to reject
$H_0: \beta_0 = 0$.
C)
Which of the two predictor variables, service or decor, do you think better predicts price? Explain.
The F statistic and $R^2$ values are higher for the model using decor as the explanatory variable, indicating that for
this data set decor is a better predictor of price than service.