II1
ALR21, 29
II2
ALR22, 23
II3
[Figure: scatterplot of $y_i$ versus $x_i$]
II4
[Figure: scatterplot of $y_i$ versus $x_i$]
II5
Q: if perchance we had $\hat\beta_0 = 0$ & $\hat\beta_1 = 1$, would fitted value $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$ be equal to actual value $y_i$?
ALR22
II6
i.e., the residual sum of squares when we use $b_0$ for the intercept and $b_1$ for the slope
least squares estimators $\hat\beta_0$ & $\hat\beta_1$ are such that
$$\mathrm{RSS}(\hat\beta_0, \hat\beta_1) < \mathrm{RSS}(b_0, b_1)$$
when either $b_0 \neq \hat\beta_0$ or $b_1 \neq \hat\beta_1$ (or both)
ALR24
II7
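this minimizing property is easy to verify numerically; below is a minimal R sketch (simulated data; all object names are mine, not from the overheads) comparing RSS at the least squares estimates with RSS at nearby candidate coefficients:

## minimal sketch (simulated data): RSS at the least squares estimates
## is smaller than at any other candidate pair (b0, b1)
set.seed(42)
x <- runif(20, 0, 10)
y <- 2 + 0.5 * x + rnorm(20)           # "true" line plus error

rss <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

fit   <- lm(y ~ x)
b.hat <- coef(fit)                     # least squares estimates
rss(b.hat[1], b.hat[2])                # minimum achievable RSS
rss(b.hat[1] + 0.1, b.hat[2])          # perturbing either coefficient...
rss(b.hat[1], b.hat[2] - 0.05)         # ...strictly increases RSS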
ALR24, 47, 48
II8
[Figure: quadratic $f(z) = az^2 + bz$ plotted versus $z$]
if $f(\tilde z) < 0$, the minimizer $\tilde z$ is halfway between $0$ and the nonzero root of $az^2 + bz$
II9
alternative approach to finding minimizer of $\mathrm{RSS}(b_1)$: differentiate with respect to $b_1$, set result to 0 and solve for $b_1$:
$$\frac{d\,\mathrm{RSS}(b_1)}{db_1} = \frac{d}{db_1}\sum_i \left[y_i - b_1 x_i\right]^2 = -2\sum_i x_i(y_i - b_1 x_i) = 0,$$
ALR47, 48
II10
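the closed-form solution $b_1 = \sum_i x_i y_i / \sum_i x_i^2$ that falls out of this derivative can be checked against lm() with the intercept suppressed; a minimal sketch with simulated data (names are assumptions):

## sketch: setting -2 * sum(x * (y - b1 * x)) = 0 gives
## b1 = sum(x * y) / sum(x^2), the no-intercept least squares slope
set.seed(1)
x <- runif(25, 0, 10)
y <- 0.7 * x + rnorm(25)

b1 <- sum(x * y) / sum(x^2)   # closed-form minimizer of RSS(b1)
b1
coef(lm(y ~ x - 1))           # lm() with intercept suppressed agrees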
ALR293
II11
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x$$
II12
$$\hat\beta_1 = \frac{\sum_i x_i y_i - n\,\bar x\,\bar y}{\sum_i x_i^2 - n\,\bar x^2} = \frac{\mathrm{SXY}}{\mathrm{SXX}} \qquad (\star)$$
note: should avoid computing $\sum_i x_i y_i - n\bar x\bar y$ & $\sum_i x_i^2 - n\bar x^2$ directly when actually computing $\hat\beta_1$; use SXY and SXX from $(\star)$ instead
ALR294, 23
II13
Sufficient Statistics
since
$$\hat\beta_1 = \frac{\mathrm{SXY}}{\mathrm{SXX}} \quad\text{and}\quad \hat\beta_0 = \bar y - \hat\beta_1\bar x,$$
need only know $\bar x$, $\bar y$, SXY & SXX to form $\hat\beta_0$ & $\hat\beta_1$
since
$$\bar x = \frac{1}{n}\sum_{i=1}^n x_i,\quad \bar y = \frac{1}{n}\sum_{i=1}^n y_i,\quad \mathrm{SXY} = \sum_{i=1}^n x_i y_i - n\bar x\bar y,\quad \mathrm{SXX} = \sum_{i=1}^n x_i^2 - n\bar x^2,$$
need only the sums $\sum_i x_i$, $\sum_i y_i$, $\sum_i x_i y_i$ & $\sum_i x_i^2$ (together with $n$)
II14
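a minimal R sketch (simulated data; names are mine) confirming that $\hat\beta_0$ & $\hat\beta_1$ can be assembled from the sufficient statistics alone:

## sketch: estimates depend on the data only through n, sum(x), sum(y),
## sum(x^2) and sum(x*y)
set.seed(2)
x <- runif(30, 0, 10)
y <- 3 + 2 * x + rnorm(30)

n    <- length(x)
xbar <- sum(x) / n
ybar <- sum(y) / n
SXY  <- sum(x * y) - n * xbar * ybar
SXX  <- sum(x^2)   - n * xbar^2

beta1.hat <- SXY / SXX
beta0.hat <- ybar - beta1.hat * xbar
c(beta0.hat, beta1.hat)
coef(lm(y ~ x))               # same two numbers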
II15
[Figure: scatterplot of log(Pressure) versus Boiling point, Forbes data]
ALR6
II16
ALR8, 9
II17
$\hat\beta_0 = \bar y - \hat\beta_1 \bar x = 28.6358$
[Figure: scatterplot of $y_i$ versus $x_i$]
II18
II19
$-1 \le r_{xy} \le 1$
note that, if $x_i$s & $y_i$s are such that $\mathrm{SD}_y = \mathrm{SD}_x$, then estimated slope is same as sample correlation (since $\hat\beta_1 = r_{xy}\,\mathrm{SD}_y/\mathrm{SD}_x$), as the following set of plots illustrates
ALR24
II20
[Figure: scatterplots of $y_i$ versus $x_i$ illustrating slope = correlation when $\mathrm{SD}_y = \mathrm{SD}_x$]
II21
Estimating $\sigma^2$: I
simple linear regression model has 3 parameters: $\beta_0$ (intercept), $\beta_1$ (slope) & $\sigma^2$ (variance of errors)
with $\beta_0$ & $\beta_1$ estimated by $\hat\beta_0$ & $\hat\beta_1$, will base estimator for $\sigma^2$ on variance of residuals (observed errors)
recall definition of residuals: $\hat e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)$
in view of, e.g.,
$$\mathrm{SD}_x^2 = \frac{\sum_i (x_i - \bar x)^2}{n - 1},$$
obvious estimator of $\sigma^2$ would appear to be
$$\frac{\sum_i (\hat e_i - \bar e)^2}{n - 1}, \quad\text{where}\quad \bar e = \frac{1}{n}\sum_{i=1}^n \hat e_i$$
II22
Estimating $\sigma^2$: II
obvious estimator thus simplifies to $\sum_i \hat e_i^2/(n-1)$
taking RSS to be shorthand for $\mathrm{RSS}(\hat\beta_0, \hat\beta_1)$, we have
$$\mathrm{RSS} = \sum_{i=1}^n [y_i - (\hat\beta_0 + \hat\beta_1 x_i)]^2 = \sum_{i=1}^n \hat e_i^2,$$
i.e., the obvious estimator is $\mathrm{RSS}/(n-1)$; the estimator actually used is $\hat\sigma^2 = \mathrm{RSS}/(n-2)$
II23
Estimating $\sigma^2$: III
assumption needed for $\hat\sigma^2$ to be an unbiased estimator of $\sigma^2$: errors $e_i$ are uncorrelated RVs with zero means and common variance $\sigma^2$
$$\mathrm{RSS} = \mathrm{SYY} - \frac{(\mathrm{SXY})^2}{\mathrm{SXX}} = \mathrm{SYY} - \hat\beta_1^2\,\mathrm{SXX}$$
Q: for a given SYY, what range of values can RSS assume, and how would you describe the extreme cases?
ALR26, 27
II24
[Figure: scatterplot of log(Pressure) versus Boiling point, Forbes data]
ALR6
II25
[Figure: residuals versus Boiling point, Forbes data]
ALR6
II26
Estimating $\sigma^2$: IV
for the Forbes data, computations in R yield $\mathrm{SYY} = 0.04279135$, $\mathrm{SXY} = 4.753781$ & $\mathrm{SXX} = 530.7824$, from which we get
$$\mathrm{RSS} = \mathrm{SYY} - \frac{(\mathrm{SXY})^2}{\mathrm{SXX}} = 0.0002156426$$
since $n = 17$ for the Forbes data,
$$\hat\sigma^2 = \frac{\mathrm{RSS}}{n - 2} = \frac{\mathrm{RSS}}{15} = 0.00001437617$$
standard error of regression is
$$\hat\sigma = \sqrt{0.00001437617} = 0.003791592$$
(also called residual standard error)
red dashed horizontal lines on residual plot show …
ALR26
II27
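a minimal R sketch of the same computation (it assumes numeric vectors x = boiling points and y = log10 pressures for the 17 Forbes cases are already in the workspace):

## sketch: estimate sigma^2 from RSS with n - 2 degrees of freedom
n   <- length(x)
SYY <- sum(y^2)   - n * mean(y)^2
SXY <- sum(x * y) - n * mean(x) * mean(y)
SXX <- sum(x^2)   - n * mean(x)^2

RSS        <- SYY - SXY^2 / SXX
sigma2.hat <- RSS / (n - 2)
sqrt(sigma2.hat)              # residual standard error
summary(lm(y ~ x))$sigma      # same value via lm()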
Estimating $\sigma^2$: V
for the Fort Collins data, computations in R yield …
II28
[Figure: scatterplot of $y_i$ versus $x_i$, Fort Collins data]
II29
[Figure: residuals versus predictors, Fort Collins data]
II30
II31
$$\mathbf X' = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}$$
ALR299, 300, 301
II32
II33
define
$$\frac{df(\mathbf b)}{d\mathbf b} = \begin{pmatrix} df(\mathbf b)/db_1 \\ df(\mathbf b)/db_2 \\ \vdots \\ df(\mathbf b)/db_q \end{pmatrix}$$
ALR301, 304
$$f_3(\mathbf b) = \mathbf b'\mathbf A\mathbf b = \sum_{i=1}^q \sum_{j=1}^q b_i A_{i,j}\, b_j$$
II34
Q: what is the derivative of $f_4(\mathbf b) = \mathbf b'\mathbf b = \sum_i b_i^2$?
Q: what is the derivative of
$$f_5(\mathbf b) = \mathbf c'\mathbf C\mathbf b = \sum_{i=1}^p \sum_{j=1}^q c_i C_{i,j}\, b_j\,?$$
$$\mathrm{RSS}(\mathbf b) = \mathbf Y'\mathbf Y - 2\,\mathbf Y'\mathbf X\mathbf b + \mathbf b'\mathbf X'\mathbf X\mathbf b,$$
ALR304
II36
$$\mathbf X'\mathbf X = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} = \begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix} = \begin{pmatrix} n & n\bar x \\ n\bar x & \sum_i x_i^2 \end{pmatrix}$$
and
$$\mathbf X'\mathbf Y = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix} = \begin{pmatrix} n\bar y \\ \sum_i x_i y_i \end{pmatrix}$$
II37
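a minimal R sketch (simulated data; names are mine) solving the normal equations $\mathbf X'\mathbf X\mathbf b = \mathbf X'\mathbf Y$ directly:

## sketch: least squares via the normal equations, model matrix X = [1 x]
set.seed(3)
x <- runif(15, 0, 10)
y <- 1 + 0.3 * x + rnorm(15)

X <- cbind(1, x)                        # n x 2 model matrix
solve(crossprod(X), crossprod(X, y))    # solves (X'X) b = X'y
coef(lm(y ~ x))                         # same estimates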
ALR27, 28
II38
[Figure: scatterplot of log(Pressure) versus Boiling point, Forbes data]
ALR6
II39
[Figure: scatterplot of $y_i$ versus $x_i$, Fort Collins data]
II40
since $\hat\beta_1 = \mathrm{SXY}/\mathrm{SXX}$ and since $\mathrm{SXY} = \sum_i (x_i - \bar x)\,y_i$ (see Problem 1), we have
$$\hat\beta_1 = \frac{\sum_i (x_i - \bar x)\,y_i}{\mathrm{SXX}} = \sum_{i=1}^n c_i y_i, \quad\text{where}\quad c_i = \frac{x_i - \bar x}{\mathrm{SXX}}$$
ALR27
II41
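a minimal R sketch (simulated data; names are mine) verifying that the weighted sum $\sum_i c_i y_i$ reproduces the slope estimate:

## sketch: the slope estimate is a weighted sum of the responses
set.seed(4)
x <- runif(20, 0, 10)
y <- 2 - 0.4 * x + rnorm(20)

SXX <- sum((x - mean(x))^2)
ci  <- (x - mean(x)) / SXX
sum(ci * y)                   # weighted sum of the y's
coef(lm(y ~ x))[2]            # equals the least squares slope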
[Figure: weights $c_i$ versus Boiling point, Forbes data]
II42
[Figure: scatterplot of log(Pressure) versus Boiling point, Forbes data]
ALR6
II43
II44
II45
ALR304, 292
II46
II47
$$\begin{aligned} \mathrm E(\hat{\boldsymbol\beta} \mid \mathbf X) &= (\mathbf X'\mathbf X)^{-1}\mathbf X'\,\mathrm E(\mathbf Y \mid \mathbf X) \\ &= (\mathbf X'\mathbf X)^{-1}\mathbf X'\,\mathrm E(\mathbf X\boldsymbol\beta + \mathbf e \mid \mathbf X) \\ &= (\mathbf X'\mathbf X)^{-1}\mathbf X'\bigl(\mathbf X\boldsymbol\beta + \mathrm E(\mathbf e \mid \mathbf X)\bigr) \\ &= (\mathbf X'\mathbf X)^{-1}\mathbf X'\mathbf X\boldsymbol\beta = \boldsymbol\beta \end{aligned}$$
II48
ALR305
II49
$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - cb}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$$
since
$$\mathbf X'\mathbf X = \begin{pmatrix} n & n\bar x \\ n\bar x & \sum_i x_i^2 \end{pmatrix}, \quad\text{get}\quad (\mathbf X'\mathbf X)^{-1} = \frac{1}{n\sum_i x_i^2 - n^2\bar x^2}\begin{pmatrix} \sum_i x_i^2 & -n\bar x \\ -n\bar x & n \end{pmatrix}$$
since
$$\mathrm{Var}(\hat{\boldsymbol\beta} \mid \mathbf X) = \begin{pmatrix} \mathrm{Var}(\hat\beta_0 \mid \mathbf X) & \mathrm{Cov}(\hat\beta_0, \hat\beta_1 \mid \mathbf X) \\ \mathrm{Cov}(\hat\beta_0, \hat\beta_1 \mid \mathbf X) & \mathrm{Var}(\hat\beta_1 \mid \mathbf X) \end{pmatrix} = \sigma^2 (\mathbf X'\mathbf X)^{-1},$$
and $n\sum_i x_i^2 - n^2\bar x^2 = n\,\mathrm{SXX}$, we have in particular $\mathrm{Var}(\hat\beta_1 \mid \mathbf X) = \sigma^2 n/(n\,\mathrm{SXX}) = \sigma^2/\mathrm{SXX}$
ALR64, 305, 28
II50
$$\widehat{\mathrm{Var}}(\hat\beta_1 \mid \mathbf X) = \frac{\hat\sigma^2}{\mathrm{SXX}}$$
term standard error is sometimes (but not always) used to refer to the square root of an estimated variance:
$$\mathrm{se}(\hat\beta_1) = \sqrt{\hat\sigma^2/\mathrm{SXX}}$$
ALR29
II51
critical value $t(\alpha/2, d)$ is defined by $\Pr\bigl(T > t(\alpha/2, d)\bigr) = \alpha/2$, where $T$ has a t-distribution with $d$ degrees of freedom
II52
plot below shows probability density function (PDF) for t-distribution with $d = 15$ degrees of freedom, with $t(0.05, 15) = 1.753$ marked by vertical dashed line (thus area under PDF to right of line is 0.05, and area to left is 0.95)
[Figure: PDF of the t-distribution with 15 degrees of freedom]
II53
ALR31
II54
can test NH that $\beta_1 = 0$ by forming t-statistic
$$t = \frac{\hat\beta_1}{\mathrm{se}(\hat\beta_1)}$$
and comparing it to percentage points for t-distribution with $n - 2$ degrees of freedom
example: for Fort Collins data ($n = 93$), $\hat\beta_1 = 0.2035$ and …
II55
ALR31
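a minimal R sketch of this test (simulated data; names are mine), assembling $t$ and its two-sided p-value by hand and reading the same numbers off summary():

## sketch: t-statistic and two-sided p-value for NH: beta1 = 0
set.seed(5)
x <- runif(93, 0, 60)
y <- 30 + 0.2 * x + rnorm(93, sd = 10)

fit <- lm(y ~ x)
se1 <- summary(fit)$coefficients[2, "Std. Error"]
tt  <- unname(coef(fit)[2]) / se1
pp  <- 2 * pt(abs(tt), df = length(x) - 2, lower.tail = FALSE)
c(t = tt, p.value = pp)
summary(fit)$coefficients[2, ]    # same t value and Pr(>|t|)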
II56
[Figure: two-sided p-value versus $|t|$]
II57
Prediction: I
suppose now we want to predict a yet-to-be-observed response $y$ given a setting $x$ for the predictor
if assumed-to-be-true linear regression model were known perfectly, prediction would be $\tilde y = \beta_0 + \beta_1 x$, whereas model says
$$y = \beta_0 + \beta_1 x + e = \tilde y + e$$
prediction error would be $y - \tilde y = e$, with $\mathrm{Var}(e \mid x) = \sigma^2$
II58
Prediction: II
using fitted mean function $\widehat{\mathrm E}(Y \mid X = x) = \hat\beta_0 + \hat\beta_1 x$ to predict response $y$ for given $x$, prediction is now
$$\hat y = \hat\beta_0 + \hat\beta_1 x,$$
and prediction error becomes
$$y - \hat y = \beta_0 + \beta_1 x + e - (\hat\beta_0 + \hat\beta_1 x)$$
recall that, if $U$ & $V$ are uncorrelated RVs, then $\mathrm{Var}(U - V) = \mathrm{Var}(U) + \mathrm{Var}(V)$ (see Equation (A.2), p. 291, of Weisberg)
letting $x^+$ be shorthand for $x, x_1, \ldots, x_n$, we can write
$$\mathrm{Var}(y - \hat y \mid x^+) = \mathrm{Var}(\beta_0 + \beta_1 x + e \mid x^+) + \mathrm{Var}(\hat\beta_0 + \hat\beta_1 x \mid x^+)$$
II59
Prediction: III
study pieces $\mathrm{Var}(\beta_0 + \beta_1 x + e \mid x^+)$ & $\mathrm{Var}(\hat\beta_0 + \hat\beta_1 x \mid x^+)$ one at a time
using fact that $\mathrm{Var}(c + U) = \mathrm{Var}(U)$ for a constant $c$, we have
$$\mathrm{Var}(\beta_0 + \beta_1 x + e \mid x^+) = \mathrm{Var}(e \mid x^+) = \sigma^2$$
hence
$$\mathrm{Var}(\hat\beta_0 + \hat\beta_1 x \mid x^+) = \mathrm{Var}(\hat\beta_0 \mid x^+) + x^2\,\mathrm{Var}(\hat\beta_1 \mid x^+) + 2x\,\mathrm{Cov}(\hat\beta_0, \hat\beta_1 \mid x^+)$$
II60
Prediction: IV
expressions for $\mathrm{Var}(\hat\beta_0 \mid x_1, \ldots, x_n)$, $\mathrm{Var}(\hat\beta_1 \mid x_1, \ldots, x_n)$ and $\mathrm{Cov}(\hat\beta_0, \hat\beta_1 \mid x_1, \ldots, x_n)$ can be extracted from matrix
$$\mathrm{Var}(\hat{\boldsymbol\beta} \mid \mathbf X) = \begin{pmatrix} \mathrm{Var}(\hat\beta_0 \mid \mathbf X) & \mathrm{Cov}(\hat\beta_0, \hat\beta_1 \mid \mathbf X) \\ \mathrm{Cov}(\hat\beta_0, \hat\beta_1 \mid \mathbf X) & \mathrm{Var}(\hat\beta_1 \mid \mathbf X) \end{pmatrix} = \sigma^2(\mathbf X'\mathbf X)^{-1}$$
Exercise (unassigned): using elements of $(\mathbf X'\mathbf X)^{-1}$, show that
$$\mathrm{Var}(\hat\beta_0 + \hat\beta_1 x \mid x^+) = \sigma^2\left(\frac{1}{n} + \frac{(x - \bar x)^2}{\mathrm{SXX}}\right)$$
$$\mathrm{Var}(y - \hat y \mid x^+) = \sigma^2\left(1 + \frac{1}{n} + \frac{(x - \bar x)^2}{\mathrm{SXX}}\right)$$
ALR32, 33, 295
II61
Prediction: V
estimating $\sigma^2$ by $\hat\sigma^2$ and taking square root lead to standard error of prediction (sepred) at $x$:
$$\mathrm{sepred}(\hat y \mid x^+) = \hat\sigma\left(1 + \frac{1}{n} + \frac{(x - \bar x)^2}{\mathrm{SXX}}\right)^{1/2}$$
using Forbes data as an example, suppose we want to predict log10(pressure) at a hypothetical location for which boiling point of water $x$ is somewhere between 190 and 215
prediction for log10(pressure) given boiling point $x$ is
$$\hat y = \hat\beta_0 + \hat\beta_1 x = -0.4216418 + 0.008956178\,x$$
$(1 - \alpha) \times 100\%$ prediction interval runs from
$$\hat y - t(\alpha/2, n-2)\,\mathrm{sepred}(\hat y \mid x^+) \quad\text{to}\quad \hat y + t(\alpha/2, n-2)\,\mathrm{sepred}(\hat y \mid x^+)$$
ALR32, 33
II62
Prediction: VI
here $n = 17$, so, for a 99% prediction interval, we set $\alpha = 0.01$ and use $t(0.005, 15) = 2.947$
since $\hat\sigma = 0.003792$, $\bar x = 203.0$ and $\mathrm{SXX} = 530.8$, we have
$$\mathrm{sepred}(\hat y \mid x^+) = 0.003792\left(1 + \frac{1}{17} + \frac{(x - 203.0)^2}{530.8}\right)^{1/2}$$
solid red curves on following plot depict 99% prediction interval as $x$ sweeps from 190 to 215 (black lines show intervals assuming unrealistically no uncertainty in parameter estimates)
for $x = 200$, prediction is $\hat y = 1.370$, and 99% prediction interval is specified by $1.358 \le y \le 1.381$
in original space, prediction is $10^{\hat y} = 23.42$, and interval is $10^{1.358} \le 10^{y} \le 10^{1.381}$, i.e., $22.80 \le 10^{y} \le 24.05$
ALR32, 33
II63
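a minimal R sketch of these computations (again assuming Forbes vectors x and y are in the workspace); predict() reproduces the hand-computed limits:

## sketch: 99% prediction interval at boiling point x0 = 200
fit    <- lm(y ~ x)
n      <- length(x)
s.hat  <- summary(fit)$sigma
SXX    <- sum((x - mean(x))^2)
x0     <- 200
sepred <- s.hat * sqrt(1 + 1/n + (x0 - mean(x))^2 / SXX)
y.hat  <- sum(coef(fit) * c(1, x0))
y.hat + c(-1, 1) * qt(0.995, df = n - 2) * sepred   # interval by hand
predict(fit, newdata = data.frame(x = x0),
        interval = "prediction", level = 0.99)      # same limits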
[Figure: 99% prediction intervals for log(Pressure) versus Boiling point, Forbes data]
II64
[Figure: 99% prediction intervals for Pressure versus Boiling point, Forbes data]
II65
ALR35, 36
II66
[Figure: scatterplot of $y_i$ versus $x_i$, Fort Collins data]
II67
$$\mathrm{SSreg} = \mathrm{SYY} - \left(\mathrm{SYY} - \frac{(\mathrm{SXY})^2}{\mathrm{SXX}}\right) = \frac{(\mathrm{SXY})^2}{\mathrm{SXX}}$$
ALR35, 36
II68
must have $0 \le R^2 \le 1$
ALR35, 36
II69
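a minimal R sketch (simulated data; names are mine) computing $R^2$ from its definition and checking it against summary():

## sketch: R^2 = SSreg/SYY, also reported by summary(lm())
set.seed(6)
x <- runif(25, 0, 10)
y <- 1 + x + rnorm(25)

fit   <- lm(y ~ x)
SYY   <- sum((y - mean(y))^2)
RSS   <- sum(resid(fit)^2)
SSreg <- SYY - RSS
SSreg / SYY                   # coefficient of determination
summary(fit)$r.squared        # same value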
II70
since
$$\bar x = \frac{1}{n}\sum_{i=1}^n x_i, \quad \mathrm{SXX} = \sum_{i=1}^n x_i^2 - n\bar x^2 \quad\text{and}\quad \mathrm{SXY} = \sum_{i=1}^n x_i y_i - n\bar x\bar y$$
(with analogous equations for $\bar y$ and SYY), it follows that basic linear regression analysis depends only on five so-called sufficient statistics:
$$\sum_{i=1}^n x_i,\quad \sum_{i=1}^n y_i,\quad \sum_{i=1}^n x_i^2,\quad \sum_{i=1}^n y_i^2 \quad\text{and}\quad \sum_{i=1}^n x_i y_i$$
II71
II72
[Figure: Response versus Predictor, one of Anscombe's four data sets]
ALR13
II73
[Figure: Response versus Predictor, one of Anscombe's four data sets]
ALR13
II74
[Figure: Response versus Predictor, one of Anscombe's four data sets]
ALR13
II75
[Figure: Response versus Predictor, one of Anscombe's four data sets]
ALR13
II76
Residuals: I
looking at residuals $\hat e_i$ is a vital step in regression analysis
can check assumptions to prevent garbage in/garbage out
basic tool is a plot of residuals versus other quantities, of which three obvious choices are (see the R sketch following this overhead):
1. residuals versus fitted values $\hat y_i$
2. residuals versus predictors $x_i$
3. residuals versus case numbers $i$
special nature of certain data might suggest other plots
useful residual plot resembles a null plot when assumptions hold, and a non-null plot when some assumption fails
let's look at plots 1 to 3 using Anscombe's data sets as examples
ALR36, 37, 38
II77
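a minimal R sketch of the three plots (simulated data; names are mine):

## sketch: the three basic residual plots listed on the previous overhead
set.seed(7)
x   <- runif(40, 0, 10)
y   <- 2 + 0.5 * x + rnorm(40)
fit <- lm(y ~ x)
e   <- resid(fit)

plot(fitted(fit), e)      # 1. residuals versus fitted values
plot(x, e)                # 2. residuals versus predictors
plot(seq_along(e), e)     # 3. residuals versus case numbers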
[Figure: residuals versus fitted values, Anscombe data]
II78
[Figure: residuals versus predictors, Anscombe data]
II79
[Figure: residuals versus fitted values, Anscombe data]
II80
[Figure: residuals versus predictors, Anscombe data]
II81
[Figure: residuals versus fitted values, Anscombe data]
II82
[Figure: residuals versus predictors, Anscombe data]
II83
[Figure: residuals versus fitted values, Anscombe data]
II84
[Figure: residuals versus predictors, Anscombe data]
II85
Residuals: II
Q: why is a plot of residuals versus $\hat y_i$ identical to a plot of residuals versus $x_i$ after relabeling the horizontal axis?
II86
[Figure: residuals versus case numbers, Anscombe data]
II87
[Figure: residuals versus case numbers, Anscombe data]
II88
[Figure: residuals versus case numbers, Anscombe data]
II89
[Figure: residuals versus case numbers, Anscombe data]
II90
Residuals: III
although plots of $\hat e_i$ versus $i$ were not particularly useful for Anscombe's data, plot is useful for certain other data sets (particularly where cases are collected sequentially in time)
fourth obvious choice: plot residuals $\hat e_i$ versus responses $y_i$
this choice is problematic because of the relationship
$$y_i = \hat y_i + \hat e_i,$$
which induces a correlation between $\hat e_i$ and $y_i$ (as the following plots illustrate)
[Figure: scatterplot of $y_i$ versus $x_i$, Fort Collins data]
II92
[Figure: residuals versus fitted values, Fort Collins data]
II93
[Figure: residuals versus predictors, Fort Collins data]
II94
[Figure: residuals versus case numbers, Fort Collins data]
II95
[Figure: residuals versus responses, Fort Collins data]
II96
Residuals: IV
reconsider Forbes data, focusing first on 3 following overheads
red dashed horizontal lines on residual plot show …
recall definition of weights $c_i$:
$$\hat\beta_1 = \frac{\sum_i (x_i - \bar x)\,y_i}{\mathrm{SXX}} = \sum_{i=1}^n c_i y_i, \quad\text{where}\quad c_i = \frac{x_i - \bar x}{\mathrm{SXX}}$$
ALR36, 37, 38
II97
[Figure: scatterplot of log(Pressure) versus Boiling point, Forbes data]
ALR6
II98
[Figure: residuals versus Boiling point, Forbes data]
ALR6
II99
[Figure: weights $c_i$ versus Boiling point, Forbes data]
II100
Residuals: V
Weisberg notes that Forbes deemed this case evidently a mistake, but perhaps just because of its appearance as an outlier
Weisberg (p. 38) shows that, if (x12, y12) is removed and regression analysis is redone on reduced data set, resulting slope
estimate is virtually the same, but and quantities that depend
upon it drastically change (see overheads that follow)
to delete or not to delete that is the question:
ALR36, 37, 38
II101
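a minimal R sketch of the delete-and-refit comparison; it assumes a data frame forbes with columns bp (boiling point) and lpres (log pressure), which are names of my choosing:

## sketch: refit with case 12 deleted and compare the two fits
fit.all  <- lm(lpres ~ bp, data = forbes)
fit.drop <- lm(lpres ~ bp, data = forbes[-12, ])
coef(fit.all); coef(fit.drop)        # slope estimates nearly identical
summary(fit.all)$sigma               # sigma.hat and its dependents...
summary(fit.drop)$sigma              # ...change drastically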
[Figure: log(Pressure) versus Boiling point, Forbes data, all 17 cases]
ALR6
II102
[Figure: log(Pressure) versus Boiling point, case 12 deleted]
II103
[Figure: residuals versus Boiling point, all 17 cases]
ALR6
II104
[Figure: residuals versus Boiling point, case 12 deleted]
II105
[Figure: 99% prediction intervals, all 17 cases]
II106
[Figure: 99% prediction intervals, case 12 deleted]
II107
Main Points: I
given a response $Y$ and a predictor $X$, simple linear regression assumes (1) a linear mean function
$$\mathrm E(Y \mid X = x) = \beta_0 + \beta_1 x,$$
where $\beta_0$ (intercept term) and $\beta_1$ (slope term) are unknown parameters (constants), and (2) a constant variance function
$$\mathrm{Var}(Y \mid X = x) = \sigma^2;$$
model can equivalently be written as
$$Y = \mathrm E(Y \mid X = x) + e = \beta_0 + \beta_1 x + e,$$
where $e$ is a statistical error, a random variable (RV) such that $\mathrm E(e) = 0$ and $\mathrm{Var}(e) = \sigma^2$
II108
Main Points: II
let $(x_i, y_i)$, $i = 1, \ldots, n$, be RVs obeying $Y = \beta_0 + \beta_1 x + e$ (predictor/response data are realizations of these $2n$ RVs)
for the $i$th case, have $y_i = \beta_0 + \beta_1 x_i + e_i$
errors $e_1, \ldots, e_n$ are independent RVs such that $\mathrm E(e_i \mid x_i) = 0$
model for data can also be written in matrix notation as
$$\mathbf Y = \mathbf X\boldsymbol\beta + \mathbf e,$$
where
$$\mathbf Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad \mathbf X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \quad \boldsymbol\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}, \quad \mathbf e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}$$
II109
Main Points: III
letting
$$\bar x = \frac{1}{n}\sum_{i=1}^n x_i \quad\text{and}\quad \bar y = \frac{1}{n}\sum_{i=1}^n y_i$$
II110
Main Points: IV
letting
$$\mathrm{RSS}(b_0, b_1) = \sum_{i=1}^n [y_i - (b_0 + b_1 x_i)]^2,$$
least squares estimators $\hat\beta_0$ & $\hat\beta_1$ minimize $\mathrm{RSS}(b_0, b_1)$; fitted values are $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$, residuals are $\hat e_i = y_i - \hat y_i$, and
$$\hat\sigma^2 = \frac{\sum_i \hat e_i^2}{n - 2}$$
ALR24, 22, 23
Main Points: V
in matrix notation, least squares estimator of $\boldsymbol\beta$ is such that $\mathbf X'\mathbf X\hat{\boldsymbol\beta} = \mathbf X'\mathbf Y$, i.e., $\hat{\boldsymbol\beta}$ is solution to normal equations $\mathbf X'\mathbf X\mathbf b = \mathbf X'\mathbf Y$
$2 \times 2$ matrix $\mathbf X'\mathbf X$ has an inverse as long as $\mathrm{SXX} \neq 0$, so
$$\hat{\boldsymbol\beta} = (\mathbf X'\mathbf X)^{-1}\mathbf X'\mathbf Y$$
since $\mathrm E(\hat{\boldsymbol\beta}) = \boldsymbol\beta$, estimators $\hat\beta_0$ & $\hat\beta_1$ are unbiased, as is $\hat\sigma^2$ also: $\mathrm E(\hat\beta_0) = \beta_0$, $\mathrm E(\hat\beta_1) = \beta_1$ and $\mathrm E(\hat\sigma^2) = \sigma^2$
also have $\mathrm{Var}(\hat{\boldsymbol\beta} \mid \mathbf X) = \sigma^2(\mathbf X'\mathbf X)^{-1}$, leading us to deduce
$$\mathrm{Var}(\hat\beta_1 \mid \mathbf X) = \frac{\sigma^2}{\mathrm{SXX}}, \quad\text{which can be estimated by}\quad \widehat{\mathrm{Var}}(\hat\beta_1 \mid \mathbf X) = \frac{\hat\sigma^2}{\mathrm{SXX}},$$
the square root of which is $\mathrm{se}(\hat\beta_1)$, the standard error of $\hat\beta_1$
ALR304, 305, 61, 62, 63, 27, 28
II112
Main Points: VI
can test null hypothesis (NH) that $\beta_1 = 0$ by forming t-statistic $t = \hat\beta_1/\mathrm{se}(\hat\beta_1)$ and comparing it to percentage points $t(\alpha, n-2)$ for t-distribution with $n - 2$ degrees of freedom, with a large value of $|t|$ giving evidence against NH via a small p-value
$(1 - \alpha) \times 100\%$ confidence interval for $\beta_1$ is set of points in interval whose end points are
$$\hat\beta_1 - t(\alpha/2, n-2)\,\mathrm{se}(\hat\beta_1) \quad\text{and}\quad \hat\beta_1 + t(\alpha/2, n-2)\,\mathrm{se}(\hat\beta_1)$$
ALR31
II113
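a minimal R sketch (simulated data; names are mine) forming the confidence interval by hand and via confint():

## sketch: 95% confidence interval for beta1
set.seed(8)
x   <- runif(30, 0, 10)
y   <- 2 + 0.5 * x + rnorm(30)
fit <- lm(y ~ x)

alpha <- 0.05
b1    <- unname(coef(fit)[2])
se1   <- summary(fit)$coefficients[2, "Std. Error"]
b1 + c(-1, 1) * qt(1 - alpha/2, df = fit$df.residual) * se1
confint(fit, level = 1 - alpha)[2, ]    # same interval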
Main Points: VII
standard error of prediction at $x$ is
$$\mathrm{sepred}(\hat y \mid x^+) = \hat\sigma\left(1 + \frac{1}{n} + \frac{(x - \bar x)^2}{\mathrm{SXX}}\right)^{1/2},$$
where $x^+$ denotes $x$ along with original predictors $x_1, \ldots, x_n$
$(1 - \alpha) \times 100\%$ prediction interval constitutes all values from
$$\hat y - t(\alpha/2, n-2)\,\mathrm{sepred}(\hat y \mid x^+) \quad\text{to}\quad \hat y + t(\alpha/2, n-2)\,\mathrm{sepred}(\hat y \mid x^+)$$
ALR32, 33
II114
Main Points: VIII
a summary of the usefulness of the mean function
$$\mathrm E(Y \mid X = x) = \beta_0 + \beta_1 x$$
is sum of squares due to regression:
$$\mathrm{SSreg} = \mathrm{SYY} - \mathrm{RSS} = \frac{(\mathrm{SXY})^2}{\mathrm{SXX}}$$
large value for SSreg indicates $\beta_1$ is valuable to include
coefficient of determination is $R^2 = \mathrm{SSreg}/\mathrm{SYY}$
II115
Main Points: IX
plots of residuals $\hat e_i$ are invaluable for assessing reasonableness of fitted model (a point that cannot be emphasized too much)
standard plot is residuals $\hat e_i$ versus fitted values $\hat y_i$, which is equivalent to $\hat e_i$s versus predictors $x_i$
plot of residuals versus case number $i$ is potentially but not always useful
do not plot residuals versus responses $y_i$ (misleading!)
failure to plot residuals is potentially bad for your health!
ALR36, 37, 38
II116
Additional References
F. Mosteller, A.F. Siegel, E. Trapido and C. Youtz (1981), "Eye Fitting Straight Lines," The American Statistician, 35(3), pp. 150–152
C.R. Rao (1972), Linear Statistical Inference and Its Applications (Second Edition),
New York: John Wiley & Sons, Inc.
G.A.F. Seber (1977), Linear Regression Analysis, New York: John Wiley & Sons, Inc.
II117