
Introduction to Simple Linear Regression: I

consider response Y and predictor X

model for simple linear regression assumes a mean function E(Y | X = x) of the form β0 + β1x and a variance function Var(Y | X = x) that is constant:

E(Y | X = x) = β0 + β1x and Var(Y | X = x) = σ²

3 parameters are β0 (intercept), β1 (slope) & σ² > 0

to interpret σ², define random variable e = Y - E(Y | X = x) so that Y = E(Y | X = x) + e

results in A.2.4 of ALR say that E(e) = 0 and Var(e) = σ²

σ² tells how close Y is likely to be to mean value E(Y | X = x)


ALR21, 292

II1

Introduction to Simple Linear Regression: II


let (xi, yi), i = 1, . . . , n, denote predictor/response pairs (X, Y )
ei = yi - E(Y | X = xi) = yi - (β0 + β1xi) is called a statistical error and represents distance between yi and its mean function

will make two additional assumptions about the errors:

[1] E(ei | xi) = 0, implying a scatter plot of ei versus xi should resemble a null plot (random deviations about zero)
[2] e1, . . . , en are a set of n independent random variables (RVs)

will make third additional assumption upon occasion:

[3] conditional on xis, errors eis are normally distributed


note: normally distributed is same as Gaussian distributed
(preferred expression in engineering & physical sciences)

ALR21, 29

II2

Estimation of Model Parameters: I


Q: given data (xi, yi), i = 1, . . . , n (n realizations of RVs X and Y), how should we determine parameters β0 & β1?

since β0 & β1 determine a line, question is equivalent to deciding how best to draw a line through a scatterplot of (xi, yi)

for n > 2, possibilities for defining "best" (lots more exist!):

hire an expert to eyeball a line (Mosteller et al., 1981)

find line minimizing distances between data and all possible lines, with some considerations being direction (vertical, horizontal, perpendicular) and squared difference, absolute difference, etc.

look at all possible lines determined by two distinct points in scatterplot, and pick one with median slope (sounds bizarre, but later on will discuss why this might be of interest)

ALR22,23

II3

[Figure: Vertical, Horizontal & Perpendicular Least Squares (schematic); x-axis: xi, y-axis: yi]

II4

[Figure: Vertical, Horizontal & Perpendicular Least Squares; x-axis: xi (roughly 10-50), y-axis: yi (roughly 10-60)]

II5

Estimation of Model Parameters: II


one strategy: form to-be-defined estimators β̂0 & β̂1 of β0 & β1, after which form residuals (observed errors):

êi = yi - (β̂0 + β̂1xi) = yi - ŷi,

where ŷi = β̂0 + β̂1xi is fitted value for ith case

Q: why isn't residual êi in general equal to error ei = yi - (β0 + β1xi)?

Q: if per chance we had β̂0 = β0 & β̂1 = β1, would fitted value ŷi be equal to actual value yi?

ALR22

II6

Least Squares Criterion: I


least squares scheme: estimate β0 & β1 such that sum of squares of resulting residuals is as small as possible

since residuals are given by êi = yi - (β̂0 + β̂1xi) once β̂0 & β̂1 are known, consider

RSS(b0, b1) = Σ_{i=1}^n [yi - (b0 + b1xi)]²,

i.e., the residual sum of squares when we use b0 for the intercept and b1 for the slope

least squares estimators β̂0 & β̂1 are such that RSS(β̂0, β̂1) < RSS(b0, b1) when either b0 ≠ β̂0 or b1 ≠ β̂1 (or both)

ALR24

II7
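to make the criterion concrete, here is a minimal R sketch of RSS(b0, b1); the data vectors and candidate values are hypothetical illustrations, not taken from ALR:

    # residual sum of squares for a candidate intercept b0 and slope b1
    # (x and y are numeric vectors of equal length)
    rss <- function(b0, b1, x, y) sum((y - (b0 + b1 * x))^2)

    set.seed(42)                                   # hypothetical data
    x <- runif(20, 190, 212)
    y <- -0.4 + 0.009 * x + rnorm(20, sd = 0.004)

    rss(-0.4, 0.009, x, y)                         # RSS at one candidate pair
    fit <- lm(y ~ x)                               # least squares fit
    rss(coef(fit)[1], coef(fit)[2], x, y)          # smaller than at any other (b0, b1)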

Least Squares Criterion: II


Q: how do we set b0 & b1 to make RSS(b0, b1) the smallest?

could try lots of different values (a grid search, a potentially exhausting task!), but can put calculus to good use here

to motivate how to find β̂0 & β̂1, first consider simpler mean function E(Y | X = x) = β1x (regression through origin)

model is now Y = β1x + e, and task is to find b1 minimizing

RSS(b1) = Σ_{i=1}^n [yi - b1xi]² = Σ_{i=1}^n [yi² - 2b1xiyi + b1²xi²]

Q: why is the b1 minimizing RSS(b1) the same as the z minimizing f(z) = az² + bz, where a = Σ_{i=1}^n xi² and b = -2 Σ_{i=1}^n xiyi?

ALR24, 47, 48

II8


Least Squares Criterion: III


since a = Σi xi² > 0, f(z) = az² + bz → ∞ as z → ±∞

since f(0) = 0, minimizer ẑ must be such that f(ẑ) ≤ 0

[Figure: sketch of f(z) = az² + bz versus z, with the minimizer ẑ marked; x-axis: z, y-axis: f(z)]

if f(ẑ) < 0, ẑ is halfway between 0 and nonzero root of az² + bz
II9

Least Squares Criterion: IV


roots of polynomial az² + bz + c given by quadratic formula:

(-b ± √(b² - 4ac)) / (2a)

when c = 0, one root is 0, and nonzero root is -b/a, so

ẑ = -b/(2a) = Σi xiyi / Σi xi² = β̂1    (∗)

alternative approach to finding minimizer of RSS(b1): differentiate with respect to b1, set result to 0 and solve for b1:

d RSS(b1)/db1 = d Σi [yi - b1xi]² / db1 = -2 Σi xi(yi - b1xi) = 0,

which yields same expression for β̂1 as stated in (∗)

ALR47, 48

II10
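as a numerical check on (∗), a minimal R sketch (hypothetical data) compares the closed form Σi xiyi / Σi xi² with R's built-in no-intercept fit:

    set.seed(1)                        # hypothetical data
    x <- runif(30, 1, 10)
    y <- 2 * x + rnorm(30)

    sum(x * y) / sum(x^2)              # minimizer of RSS(b1) for E(Y | X = x) = beta1 * x
    coef(lm(y ~ x - 1))                # same value from regression through the origin via lm()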

Least Squares Criterion: V


return now to mean function E(Y | X = x) = β0 + β1x, for which RSS(b0, b1) = Σi [yi - b0 - b1xi]²

calculus-based approach to get least squares estimators β̂0 & β̂1 follows a path similar to that for E(Y | X = x) = β1x

leads to two equations to solve for two unknowns (β̂0 & β̂1)

differentiate RSS(b0, b1) with respect to b0 and set result to 0:

-2 Σi (yi - b0 - b1xi) = 0, giving b0n + b1 Σi xi = Σi yi

differentiate RSS(b0, b1) with respect to b1 and set result to 0:

-2 Σi xi(yi - b0 - b1xi) = 0, giving b0 Σi xi + b1 Σi xi² = Σi xiyi

ALR293

II11

Least Squares Criterion: VI


so-called normal equations for simple linear regression are thus

b0n + b1 Σi xi = Σi yi and b0 Σi xi + b1 Σi xi² = Σi xiyi

using x̄ = (1/n) Σi xi & ȳ = (1/n) Σi yi, 1st normal equation gives

b0 = ȳ - b1x̄

replace b0 in 2nd normal equation with right-hand side of above:

(ȳ - b1x̄) Σi xi + b1 Σi xi² = Σi xiyi

after a bit of algebra, get

b1 = (Σi xiyi - ȳ Σi xi) / (Σi xi² - x̄ Σi xi) = (Σi xiyi - n x̄ȳ) / (Σi xi² - n x̄²) = β̂1,

and hence β̂0 = ȳ - β̂1x̄
ALR293, 204

II12
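a minimal R sketch (hypothetical data) of the estimators just derived, checked against lm():

    set.seed(2)                               # hypothetical data
    x <- runif(25, 0, 10)
    y <- 1 + 0.5 * x + rnorm(25, sd = 0.3)

    SXY <- sum((x - mean(x)) * (y - mean(y)))
    SXX <- sum((x - mean(x))^2)
    b1 <- SXY / SXX                           # slope estimate
    b0 <- mean(y) - b1 * mean(x)              # intercept estimate
    c(b0, b1)
    coef(lm(y ~ x))                           # same solution of the normal equations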

Sum of Cross Products and Sum of Squares


define sum of cross products and sum of squares for x's:

SXY = Σi (xi - x̄)(yi - ȳ) & SXX = Σi (xi - x̄)²    (∗)

Problem 1: show that

Σi (xi - x̄)(yi - ȳ) = Σi xiyi - n x̄ȳ & Σi (xi - x̄)² = Σi xi² - n x̄²

can thus write

β̂1 = (Σi xiyi - n x̄ȳ) / (Σi xi² - n x̄²) = SXY / SXX

note: should avoid Σi xiyi - n x̄ȳ & Σi xi² - n x̄² when actually computing β̂1; use SXY and SXX from (∗) instead
ALR294, 23

II13

Sufficient Statistics
since

β̂1 = SXY/SXX and β̂0 = ȳ - β̂1x̄,

need only know x̄, ȳ, SXY & SXX to form β̂0 & β̂1

since

x̄ = (1/n) Σ_{i=1}^n xi, ȳ = (1/n) Σ_{i=1}^n yi, SXY = Σ_{i=1}^n xiyi - n x̄ȳ, SXX = Σ_{i=1}^n xi² - n x̄²,

follows that β̂0 & β̂1 depend only on four sufficient statistics:

Σ_{i=1}^n xi, Σ_{i=1}^n yi, Σ_{i=1}^n xiyi and Σ_{i=1}^n xi²

in theory, can dispense with 2n values (xi, yi), i = 1, . . . , n, and just keep 4 sufficient statistics as far as β̂0 & β̂1 are concerned
ALR294, 23

II14

Atmospheric Pressure & Boiling Point of Water


as 1st example, reconsider Forbes's recordings of atmospheric pressure and boiling point of water, which physics suggests are related by

log (pressure) = β0 + β1 × boiling point

taking response Y to be log10(pressure) and predictor X to be boiling point, will estimate β0 & β1 for model

Y = β0 + β1X + e

via least squares based upon data (xi, yi), i = 1, . . . , 17

taking log to mean log base 10, computations in R yield

x̄ = 202.9529, ȳ = 1.396041, SXY = 4.753781 & SXX = 530.7824,

from which we get

β̂1 = SXY/SXX = 0.008956178 and β̂0 = ȳ - β̂1x̄ = -0.4216418
ALR25, 26

II15
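a sketch of the R computations behind these numbers; it assumes the Forbes data are available as a data frame named forbes with columns bp (boiling point) and pres (pressure), as in the versions shipped with the MASS and alr4 packages (column names may differ):

    library(MASS)                              # forbes: 17 boiling point / pressure pairs
    x <- forbes$bp                             # boiling point
    y <- log10(forbes$pres)                    # log base 10 of pressure

    SXY <- sum((x - mean(x)) * (y - mean(y)))
    SXX <- sum((x - mean(x))^2)
    b1 <- SXY / SXX                            # about  0.008956
    b0 <- mean(y) - b1 * mean(x)               # about -0.4216
    coef(lm(y ~ x))                            # same estimates from lm()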

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point (Forbes data); x-axis: Boiling point, y-axis: log(Pressure)]

ALR6

II16

Predicting the Weather


as 2nd example, reconsider n = 93 years of measured early/late season snowfalls from Fort Collins, Colorado

taking response Y and predictor X to be late season (Jan-June) and early season (Sept-Dec) snowfalls, entertain model

Y = β0 + β1X + e

computations in R yield

x̄ = 16.74409, ȳ = 32.04301, SXY = 2229.014 & SXX = 10954.07,

from which we get

β̂1 = SXY/SXX = 0.2034873 and β̂0 = ȳ - β̂1x̄ = 28.6358

ALR8, 9

II17


[Figure: Scatterplot of Late Snowfall Versus Early Snowfall (Fort Collins data); x-axis: Early snowfall (inches), y-axis: Late snowfall (inches)]

ALR8

II18

Sample Variances, Covariance and Correlation


define sample variance and sample standard deviation of x's:

SDx² = Σi (xi - x̄)² / (n - 1) = SXX / (n - 1) & SDx = √(SXX / (n - 1))

note: sometimes n is used in place of n - 1 in defining SDx²

after defining SYY = Σi (yi - ȳ)² (sum of squares for y's), define similar quantities for y's:

SDy² = Σi (yi - ȳ)² / (n - 1) = SYY / (n - 1) & SDy = √(SYY / (n - 1))

finally define sample covariance and then sample correlation:

sxy = Σi (xi - x̄)(yi - ȳ) / (n - 1) = SXY / (n - 1) & rxy = sxy / (SDx SDy)
ALR23

II19

Alternative Expression for Slope Estimator


Problem 2: alternative expression for β̂1 = SXY/SXX is

β̂1 = rxy SDy / SDx

Problem 3: -1 ≤ rxy ≤ 1

note that, if xi's & yi's are such that SDy = SDx, then estimated slope is same as sample correlation, as the plot following the R sketch below illustrates

ALR24

II20
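a quick numerical check of the identity in Problem 2, using hypothetical data:

    set.seed(3)                          # hypothetical data
    x <- rnorm(50)
    y <- 1 + 2 * x + rnorm(50)

    coef(lm(y ~ x))[2]                   # slope estimate from least squares
    cor(x, y) * sd(y) / sd(x)            # r_xy * SD_y / SD_x, same value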

[Figure: Sample Correlation rxy = 0.999; x-axis: xi, y-axis: yi]

II21

Estimating σ²: I

simple linear regression model has 3 parameters: β0 (intercept), β1 (slope) & σ² (variance of errors)

with β0 & β1 estimated by β̂0 & β̂1, will base estimator for σ² on variance of residuals (observed errors)

recall definition of residuals: êi = yi - (β̂0 + β̂1xi)

in view of, e.g.,

SDx² = Σi (xi - x̄)² / (n - 1),

obvious estimator of σ² would appear to be

Σi (êi - ē)² / (n - 1), where ē = (1/n) Σ_{i=1}^n êi is the average of the residuals

Problem 4: show that ē = 0 always for simple linear regression


ALR26

II22

Estimating σ²: II

obvious estimator thus simplifies to Σi êi² / (n - 1)

taking RSS to be shorthand for RSS(β̂0, β̂1), we have

RSS = Σ_{i=1}^n [yi - (β̂0 + β̂1xi)]² = Σ_{i=1}^n êi²,

so the obvious estimator of σ² is RSS/(n - 1)

can show (e.g., Seber, 1977, p. 51) that unbiased estimator σ̂² of σ², i.e., E(σ̂²) = σ², is

σ̂² = RSS / (n - 2),

where n - 2 = sample size - # of parameters in mean function; obvious estimator RSS/(n - 1) is smaller than σ̂² and hence is biased towards zero
ALR26, 27, 306

II23

Estimating σ²: III

assumption needed for σ̂² to be an unbiased estimator of σ²: errors ei are uncorrelated RVs with zero means and common variance σ²

(slightly less restrictive than assumptions [1] & [2] stated on overhead II2)

with additional assumption that ei's are normally distributed, can argue that RV (n - 2)σ̂²/σ² obeys a chi-squared distribution with n - 2 degrees of freedom

Problem 5:

RSS = SYY - (SXY)²/SXX = SYY - β̂1² SXX

Q: for a given SYY, what range of values can RSS assume, and how would you describe the extreme cases?
ALR26, 27

II24

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point (Forbes data); x-axis: Boiling point, y-axis: log(Pressure)]

ALR6

II25

[Figure: Residual Plot for Forbes Data; x-axis: Boiling point, y-axis: Residuals]

ALR6

II26

Estimating σ²: IV

for the Forbes data, computations in R yield

SYY = 0.04279135, SXY = 4.753781 & SXX = 530.7824,

from which we get

RSS = SYY - (SXY)²/SXX = 0.0002156426

since n = 17 for the Forbes data,

σ̂² = RSS/(n - 2) = RSS/15 = 0.00001437617

standard error of regression is

σ̂ = √0.00001437617 = 0.003791592

(also called residual standard error)
red dashed horizontal lines on residual plot show ±σ̂
ALR26

II27
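the same quantities can be read off a fitted lm object; a sketch assuming the Forbes vectors x (boiling point) and y (log10 pressure) from the earlier sketch:

    fit <- lm(y ~ x)
    RSS <- sum(residuals(fit)^2)            # about 0.0002156
    RSS / df.residual(fit)                  # sigma_hat^2 = RSS/(n - 2), about 1.44e-05
    summary(fit)$sigma                      # sigma_hat, the residual standard error, about 0.003792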

Estimating σ²: V

for the Fort Collins data, computations in R yield

SYY = 17572.41, SXY = 2229.014 & SXX = 10954.07,

from which we get

RSS = SYY - (SXY)²/SXX = 17118.83

since n = 93 for the Fort Collins data,

σ̂² = RSS/(n - 2) = RSS/91 = 188.119
standard error of regression is

σ̂ = √188.119 = 13.71565
ALR26

II28

[Figure: Scatterplot of Late Snowfall Versus Early Snowfall (Fort Collins data); x-axis: Early snowfall (inches), y-axis: Late snowfall (inches)]

ALR8

II29

[Figure: Residual Plot for Fort Collins Data; x-axis: Early snowfall (inches), y-axis: Residuals]

II30

Matrix Formulation of Simple Linear Regression: I


matrix theory offers an alternative formulation for simple linear regression, with the advantage that it generalizes readily to handle multiple linear regression

start by defining an n-dimensional column vector Y containing the yi's; an n × 2 matrix X whose 1st column consists of just 1's, and whose 2nd has the xi's; a 2-dimensional vector β containing β0 and β1; and an n-dimensional vector e containing the ei's:

Y = (y1, y2, . . . , yn)', X = the n × 2 matrix whose ith row is (1, xi), β = (β0, β1)', e = (e1, e2, . . . , en)'

matrix version of simple linear regression model is Y = Xβ + e
ALR63, 64, 60

II31

Matrix Formulation of Simple Linear Regression: II


since X has ith row (1, xi) & β = (β0, β1)', it follows that Xβ = (β0 + β1x1, β0 + β1x2, . . . , β0 + β1xn)'

hence ith row of matrix equation Y = Xβ + e says

yi = β0 + β1xi + ei,

which is consistent with model Y = β0 + β1X + e; see also II2

let e' and X' denote the transposes of e and X; i.e., e' is an n-dimensional row vector (e1 e2 . . . en), while X' is a 2 × n matrix whose 1st row is (1 1 . . . 1) and whose 2nd row is (x1 x2 . . . xn)
ALR299, 300, 301

II32

Matrix Formulation of Simple Linear Regression: III


since e = Y - Xβ and since e'e = Σi ei², can express sum of squares of errors as

e'e = (Y - Xβ)'(Y - Xβ)

if we entertain b' = (b0 b1) rather than unknown β' = (β0 β1), corresponding residuals are given by Y - Xb, so residual sum of squares can be written as

RSS(b) = (Y - Xb)'(Y - Xb)
       = (Y' - b'X')(Y - Xb)
       = Y'Y - Y'Xb - b'X'Y + b'X'Xb
       = Y'Y - 2Y'Xb + b'X'Xb,

where we make use of 2 facts: (1) transpose of a product is product of transposes in reverse order & (2) transpose of a scalar is itself (hence b'X'Y = (b'X'Y)' = Y'Xb)
ALR61, 62, 300, 301, 304

II33

Taking Derivatives with Respect to Vector b: I


suppose f(b) is a scalar-valued function of vector b (elements are b1, b2, . . . , bq)

two examples, for which a is a vector (ith element is ai) & A is a q × q matrix (element in ith row & jth column is Ai,j):

f1(b) = a'b = Σ_{i=1}^q ai bi and f2(b) = b'Ab = Σ_{i=1}^q Σ_{j=1}^q bi Ai,j bj

define df(b)/db to be the q-dimensional column vector whose ith element is df(b)/dbi:

df(b)/db = (df(b)/db1, df(b)/db2, . . . , df(b)/dbq)'

ALR301, 304

II34

Taking Derivatives with Respect to Vector b: II


can show (see, e.g., Rao, 1973, p. 71) that

f1(b) = a'b has derivative df1(b)/db = a

and

f2(b) = b'Ab has derivative df2(b)/db = 2Ab when A is symmetric (as X'X will be below)

(not hard to show: do it for fun and games!)

Q: what is the derivative of f3(b) = b'a?

Q: what is the derivative of f4(b) = b'b = Σi bi²?

Q: what is the derivative of

f5(b) = c'Cb = Σ_{i=1}^p Σ_{j=1}^q ci Ci,j bj,

where c is a p-dimensional vector and C is a p × q matrix?


II35

Matrix Formulation of Simple Linear Regression: IV


returning to

RSS(b) = Y'Y - 2Y'Xb + b'X'Xb,

taking the derivative of f(b) = RSS(b) with respect to b and setting the resulting expression to 0 (a vector of zeros) yields the matrix version of the normal equations:

X'Xb = X'Y,

where we have made use of the facts

d(Y'Y)/db = 0, d(Y'Xb)/db = X'Y and d(b'X'Xb)/db = 2X'Xb

least squares estimator β̂ of β is solution to normal equations:

X'Xβ̂ = X'Y

ALR304

II36

Matrix Formulation of Simple Linear Regression: V


let's verify that solution to X'Xβ̂ = X'Y yields same estimators β̂1 & β̂0 as before, namely, SXY/SXX & ȳ - β̂1x̄

now X'X is the 2 × 2 matrix with 1st row (n, Σi xi) and 2nd row (Σi xi, Σi xi²), i.e., with 1st row (n, n x̄) and 2nd row (n x̄, Σi xi²)

and X'Y is the 2-dimensional vector (Σi yi, Σi xiyi)' = (n ȳ, Σi xiyi)'

Problem 6: finish the verification!


ALR63, 64

II37
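a minimal R sketch (hypothetical data) of solving the normal equations in matrix form; solve(), crossprod() and cbind() are base R:

    set.seed(4)                                        # hypothetical data
    x <- runif(20)
    y <- 3 - 2 * x + rnorm(20, sd = 0.1)

    X <- cbind(1, x)                                   # n x 2 design matrix: column of 1's, then x
    beta_hat <- solve(crossprod(X), crossprod(X, y))   # solves (X'X) b = X'Y
    drop(beta_hat)
    coef(lm(y ~ x))                                    # same intercept and slope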

Properties of Least Squares Estimators: I


since E(Y | X = x) = β0 + β1x, fitted mean function is

Ê(Y | X = x) = β̂0 + β̂1x,    (∗)

which is a line with intercept β̂0 and slope β̂1

recalling that β̂0 = ȳ - β̂1x̄, start from the right-hand side of (∗) with x set to x̄ to get

β̂0 + β̂1x̄ = ȳ - β̂1x̄ + β̂1x̄ = ȳ,

which says point (x̄, ȳ) must lie on fitted mean function

vertical dashed line on following plots indicates the value of x̄, while horizontal dashed line, the value of ȳ

ALR27, 28

II38

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point (Forbes data); x-axis: Boiling point, y-axis: log(Pressure)]

ALR6

II39

[Figure: Scatterplot of Late Snowfall Versus Early Snowfall (Fort Collins data); x-axis: Early snowfall (inches), y-axis: Late snowfall (inches)]

ALR8

II40

Properties of Least Squares Estimators: II


both β̂0 and β̂1 can be written as a linear combination of responses y1, y2, . . . , yn

since β̂1 = SXY/SXX and since SXY = Σi (xi - x̄)yi (see Problem 1), we have

β̂1 = Σi (xi - x̄)yi / SXX = Σ_{i=1}^n ci yi, where ci = (xi - x̄)/SXX

Q: which yi's will have the most/least influence on β̂1?

let's look at ci plotted versus xi for Forbes data

ALR27

II41

[Figure: Weights Versus Boiling Point for Forbes Data; x-axis: Boiling point, y-axis: ci]

II42

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point (Forbes data); x-axis: Boiling point, y-axis: log(Pressure)]

ALR6

II43

Random Vectors and Their Properties: I


column vector U is said to be a random vector if each of its elements Ui is an RV (random variable)

expected value of random vector, denoted by E(U), is a vector whose ith element is expected value of ith RV Ui in U

for example, if U' = (U1 U2 U3), then E(U) = (E(U1), E(U2), E(U3))'

if U has dimension q, if a is a p-dimensional column vector of constants and if A is a p × q dimensional matrix of constants, can show (fairly easily: give it a try!) that

E(a + AU) = a + A E(U)
ALR303

II44

Random Vectors and Their Properties: II


recall that, if Ui and Uj are two RVs, their covariance is defined to be

Cov(Ui, Uj) = E([Ui - E(Ui)][Uj - E(Uj)])

note that Cov(Uj, Ui) = Cov(Ui, Uj) and that

Cov(Ui, Ui) = E([Ui - E(Ui)][Ui - E(Ui)]) = E([Ui - E(Ui)]²) = Var(Ui)

by definition, the covariance matrix for q-dimensional random vector U, to be denoted by Var(U), is the q × q matrix whose (i, j)th element is Cov(Ui, Uj)

for example, if U' = (U1 U2 U3), then Var(U) is the 3 × 3 matrix with rows (Var(U1), Cov(U1, U2), Cov(U1, U3)), (Cov(U2, U1), Var(U2), Cov(U2, U3)) and (Cov(U3, U1), Cov(U3, U2), Var(U3))
ALR291, 292, 303

II45

Random Vectors and Their Properties: III


if U has dimension q, if a is a p-dimensional column vector of constants and if A is p × q dimensional matrix of constants, can show (a bit more challenging, but still worth a try!) that

Var(a + AU) = A Var(U) A'

RVs in U are uncorrelated if Cov(Ui, Uj) = 0 when i ≠ j

if each Ui in U has the same variance (σ², say) and if Ui's are uncorrelated, then Var(U) = σ²I, where I is the q × q identity matrix (1's along its diagonal and 0's elsewhere)

for this special case,

Var(a + AU) = A(σ²I)A' = σ²AA'

ALR304, 292

II46

Properties of Least Squares Estimators: III


recall that least squares estimator β̂ solves normal equations:

X'Xβ̂ = X'Y    (∗)

Problem 6: in the case of simple linear regression, X'X is an invertible matrix and thus has an inverse (X'X)⁻¹ such that

(X'X)⁻¹X'X = X'X(X'X)⁻¹ = I

premultiplication of both sides of (∗) by (X'X)⁻¹ yields

(X'X)⁻¹X'Xβ̂ = (X'X)⁻¹X'Y,

from which we get Iβ̂ = (X'X)⁻¹X'Y and hence

β̂ = (X'X)⁻¹X'Y

above succinctly expresses the fact that β̂0 & β̂1 (the elements of β̂) are linear combinations of yi's (the elements of Y)
ALR61, 64, 304

II47

Properties of Least Squares Estimators: IV


considering β̂ = (X'X)⁻¹X'Y and taking conditional expectation of both sides yields

E(β̂ | X) = E((X'X)⁻¹X'Y | X)
         = (X'X)⁻¹X' E(Y | X)
         = (X'X)⁻¹X' E(Xβ + e | X)
         = (X'X)⁻¹X' (Xβ + E(e | X))
         = (X'X)⁻¹X'Xβ
         = β

Q: what's the justification for each step above?

E(β̂ | X) = β holds for all X and hence E(β̂) = β unconditionally, from which we can conclude that β̂0 & β̂1 are unbiased estimators of β0 & β1: E(β̂0) = β0 and E(β̂1) = β1
ALR305

II48

Properties of Least Squares Estimators: V


since (i) Var(a + AU) = A Var(U) A', (ii) (AB)' = B'A' and (iii) (A⁻¹)' = (A')⁻¹ for a square matrix, we have

Var(β̂ | X) = Var((X'X)⁻¹X'Y | X)
           = (X'X)⁻¹X' Var(Y | X) ((X'X)⁻¹X')'
           = (X'X)⁻¹X' Var(Xβ + e | X) X(X'X)⁻¹
           = (X'X)⁻¹X' Var(e | X) X(X'X)⁻¹
           = (X'X)⁻¹X' (σ²I) X(X'X)⁻¹
           = σ²(X'X)⁻¹X'X(X'X)⁻¹
           = σ²(X'X)⁻¹

Q: justification for each step above?

ALR305

II49

Properties of Least Squares Estimators: VI


can readily verify that the inverse of the 2 × 2 matrix with rows (a, b) and (c, d) is 1/(ad - cb) times the matrix with rows (d, -b) and (-c, a)

since X'X is the 2 × 2 matrix with rows (n, n x̄) and (n x̄, Σi xi²), get that (X'X)⁻¹ is 1/(n Σi xi² - n²x̄²) times the matrix with rows (Σi xi², -n x̄) and (-n x̄, n)

since

Var(β̂ | X) = the 2 × 2 matrix with rows (Var(β̂0 | X), Cov(β̂0, β̂1 | X)) and (Cov(β̂0, β̂1 | X), Var(β̂1 | X)) = σ²(X'X)⁻¹,

we find that, e.g., Var(β̂1 | X) = σ² / (Σi xi² - n x̄²) = σ² / SXX by making use of Σi xi² - n x̄² = SXX (see Problem 1)

ALR64, 305, 28

II50

Properties of Least Squares Estimators: VII


Q: what happens to Var(β̂1 | X) = σ²/SXX if we have the luxury of making the sample size n as large as we want?

in practice, σ² is usually unknown and must be estimated via σ̂², leading to the following estimator for Var(β̂1 | X):

V̂ar(β̂1 | X) = σ̂²/SXX

term standard error is sometimes (but not always) used to refer to the square root of an estimated variance

standard error of β̂1, denoted by se(β̂1), is thus σ̂/√SXX

ALR29

II51
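in R the estimated matrix σ̂²(X'X)⁻¹ is available from vcov(); a sketch with hypothetical data showing three equivalent routes to se(β̂1):

    set.seed(5)                                      # hypothetical data
    x <- runif(40, 0, 10)
    y <- 2 + 0.7 * x + rnorm(40)
    fit <- lm(y ~ x)

    summary(fit)$sigma / sqrt(sum((x - mean(x))^2))  # sigma_hat / sqrt(SXX)
    sqrt(vcov(fit)[2, 2])                            # from sigma_hat^2 (X'X)^{-1}
    summary(fit)$coefficients[2, "Std. Error"]       # from lm's coefficient table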

Confidence Intervals and Tests for Slope: I


assuming errors ei in simple linear regression to be normally distributed, parameter estimator β̂1 for slope β1 is also normally distributed (same holds for β̂0 also)

further assuming errors ei to have mean 0 and unknown variance σ², distribution of β̂1 also depends upon unknown σ²

with σ² estimated by σ̂², confidence intervals (CIs) and tests concerning unknown true β1 need to be based on t-distribution with degrees of freedom in sync with divisor used to form σ̂²

let T be a random variable with a t-distribution with d degrees of freedom, and let t(α/2, d) be percentage point such that

Pr(T ≥ t(α/2, d)) = α/2

ALR30, 31
II52

Confidence Intervals and Tests for Slope: II

plot below shows probability density function (PDF) for t-distribution with d = 15 degrees of freedom, with t(0.05, 15) = 1.753 marked by vertical dashed line (thus area under PDF to right of line is 0.05, and area to left is 0.95)

[Figure: PDF of the t-distribution with 15 degrees of freedom; x-axis: x, y-axis: PDF]

II53

Confidence Intervals and Tests for Slope: III


(1 - α) × 100% CI for slope β1 is set of points β1 in interval

β̂1 - t(α/2, n - 2) se(β̂1) ≤ β1 ≤ β̂1 + t(α/2, n - 2) se(β̂1)

example: for Forbes data (n = 17), β̂1 = 0.008956 and se(β̂1) = σ̂/√SXX = 0.0001646 since σ̂ = 0.003792 and √SXX = 23.04, so 90% CI for β1 is

0.008956 - 1.753 × 0.0001646 ≤ β1 ≤ 0.008956 + 1.753 × 0.0001646

because t(0.05, 15) = 1.753, yielding

0.008668 ≤ β1 ≤ 0.009245

ALR31

II54
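a sketch of the same 90% CI in R; it assumes the Forbes fit fit <- lm(y ~ x) from the earlier sketch:

    b1 <- coef(fit)[2]                                 # slope estimate
    se1 <- summary(fit)$coefficients[2, "Std. Error"]
    tc <- qt(1 - 0.10/2, df = df.residual(fit))        # t(0.05, 15) = 1.753
    c(b1 - tc * se1, b1 + tc * se1)                    # 90% CI endpoints
    confint(fit, "x", level = 0.90)                    # same interval from confint()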

Confidence Intervals and Tests for Slope: IV


can test null hypothesis β1 = β1* versus alternative hypothesis β1 ≠ β1* by computing t-statistic

t = (β̂1 - β1*) / se(β̂1)

and comparing it to percentage points for t-distribution with n - 2 degrees of freedom

example: for Fort Collins data (n = 93), β̂1 = 0.2035 and se(β̂1) = σ̂/√SXX = 0.1310 since σ̂ = 13.72 and √SXX = 104.7, so t-statistic for test of zero slope (β1* = 0) is

t = (0.2035 - 0)/0.1310 = 1.553
ALR31

II55

Confidence Intervals and Tests for Slope: V


letting G(x) denote cumulative probability distribution function for random variable T with t(91) distribution, i.e.,

G(x) = Pr(T ≤ x),

p-value associated with t is 2(1 - G(|t|)); see the overhead following the R sketch below

p-value is 0.1239, which is not small by common standards (e.g., 0.05 or 0.01), so not much support for rejecting null hypothesis

ALR31

II56
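the two-sided p-value 2(1 - G(|t|)) can be computed directly with pt(); a sketch using the Fort Collins t-statistic and its 91 degrees of freedom:

    t_stat <- 1.553                          # t-statistic for the zero-slope test
    2 * (1 - pt(abs(t_stat), df = 91))       # two-sided p-value, about 0.124
    # the coefficient table from summary(lm(...)) reports the same t value and p-value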

[Figure: Obtaining p-value for t-test for Fort Collins Data; x-axis: |t|, y-axis: p-value]

II57

Prediction: I
suppose now we want to predict a yet-to-be-observed response y given a setting x for the predictor

if assumed-to-be-true linear regression model were known perfectly, prediction would be ỹ = β0 + β1x, whereas model says

y = β0 + β1x + e = ỹ + e

prediction error would be y - ỹ = e, which has variance Var(e | x) = σ²

in general we must be satisfied using estimators β̂0 & β̂1 in lieu of true values β0 & β1, which intuitively should lead to predictions that are not as good, resulting in a prediction error with a variance inflated above σ²
ALR32, 33

II58

Prediction: II
using fitted mean function Ê(Y | X = x) = β̂0 + β̂1x to predict response y for given x, prediction is now

ỹ = β̂0 + β̂1x,

and prediction error becomes

y - ỹ = β0 + β1x + e - (β̂0 + β̂1x)

recall that, if U & V are uncorrelated RVs, then Var(U - V) = Var(U) + Var(V) (see Equation (A.2), p. 291, of Weisberg)

assuming that e is uncorrelated with RVs involved in formation of β̂0 & β̂1, can regard U = β0 + β1x + e and V = β̂0 + β̂1x as uncorrelated RVs when conditioned on x and x1, . . . , xn

letting x+ be shorthand for x, x1, . . . , xn, we can write

Var(y - ỹ | x+) = Var(β0 + β1x + e | x+) + Var(β̂0 + β̂1x | x+)

ALR32, 33, 291

II59

Prediction: III
study pieces Var(β0 + β1x + e | x+) & Var(β̂0 + β̂1x | x+) one at a time

using fact that Var(c + U) = Var(U) for a constant c, we have

Var(β0 + β1x + e | x+) = Var(e | x+) = σ²

recall that, if U & V are correlated RVs and c is a constant, then Var(U + cV) = Var(U) + c² Var(V) + 2c Cov(U, V) (see Equation (A.3), p. 292, of Weisberg)

hence

Var(β̂0 + β̂1x | x+) = Var(β̂0 | x+) + x² Var(β̂1 | x+) + 2x Cov(β̂0, β̂1 | x+)
                    = Var(β̂0 | x1, . . . , xn) + x² Var(β̂1 | x1, . . . , xn) + 2x Cov(β̂0, β̂1 | x1, . . . , xn)

under assumption x is independent of RVs forming β̂0 & β̂1
ALR32, 33, 292

II60

Prediction: IV
expressions for Var(β̂0 | x1, . . . , xn), Var(β̂1 | x1, . . . , xn) and Cov(β̂0, β̂1 | x1, . . . , xn) can be extracted from matrix

Var(β̂ | X) = the 2 × 2 matrix with rows (Var(β̂0 | X), Cov(β̂0, β̂1 | X)) and (Cov(β̂0, β̂1 | X), Var(β̂1 | X)) = σ²(X'X)⁻¹

Exercise (unassigned): using elements of (X'X)⁻¹, show that

Var(β̂0 + β̂1x | x+) = σ² (1/n + (x - x̄)²/SXX)

above represents increase in variance of prediction error due to necessity of estimating β0 & β1, with the actual variance being

Var(y - ỹ | x+) = σ² (1 + 1/n + (x - x̄)²/SXX)
ALR32, 33, 295

II61

Prediction: V
estimating σ² by σ̂² and taking square root lead to standard error of prediction (sepred) at x:

sepred(ỹ | x+) = σ̂ (1 + 1/n + (x - x̄)²/SXX)^{1/2}

using Forbes data as an example, suppose we want to predict log10(pressure) at a hypothetical location for which boiling point of water x is somewhere between 190 and 215

prediction for log10(pressure) given boiling point x is

ỹ = β̂0 + β̂1x = -0.4216418 + 0.008956178x

(1 - α) × 100% prediction interval is set of points y in interval

ỹ - t(α/2, n - 2) sepred(ỹ | x+) ≤ y ≤ ỹ + t(α/2, n - 2) sepred(ỹ | x+)

ALR32, 33

II62

Prediction: VI
here n = 17, so, for a 99% prediction interval, we set α = 0.01 and use t(0.005, 15) = 2.947

since σ̂ = 0.003792, x̄ = 203.0 and SXX = 530.8, we have

sepred(ỹ | x+) = 0.003792 (1 + 1/17 + (x - 203.0)²/530.8)^{1/2}

solid red curves on the plot following the R sketch below depict 99% prediction interval as x sweeps from 190 to 215 (black lines show intervals assuming, unrealistically, no uncertainty in parameter estimates)

for x = 200, prediction is ỹ = 1.370, and 99% prediction interval is specified by 1.358 ≤ y ≤ 1.381

in original space, prediction is 10^ỹ = 23.42, and interval is 10^1.358 ≤ 10^y ≤ 10^1.381, i.e., 22.80 ≤ 10^y ≤ 24.05
ALR32, 33

II63
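in R, predict() produces these prediction intervals directly; a sketch assuming the Forbes fit fit <- lm(y ~ x), with x the boiling point and y = log10(pressure):

    new <- data.frame(x = 200)                          # boiling point at which to predict
    pi99 <- predict(fit, newdata = new,
                    interval = "prediction", level = 0.99)
    pi99                                                # fit ~ 1.370, lwr ~ 1.358, upr ~ 1.381
    10^pi99                                             # back on the original pressure scale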

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point with 99% prediction intervals; x-axis: Boiling point (190-215), y-axis: log(Pressure)]

II64

[Figure: Scatterplot of Pressure Versus Boiling Point with 99% prediction intervals; x-axis: Boiling point (190-215), y-axis: Pressure]

II65

Coefficient of Determination R²: I

ignoring potential predictors, best prediction of response is sample average ȳ of observed responses y1, y2, . . . , yn

for Fort Collins data, total sum of squares SYY = Σi (yi - ȳ)² is sum of squares of deviations of data from horizontal dashed line on next plot

with inclusion of predictors, unexplained variation is RSS

for Fort Collins data, RSS is sum of squares of deviations from solid line on next plot

ALR35, 36

II66

[Figure: Scatterplot of Late Snowfall Versus Early Snowfall (Fort Collins data); x-axis: Early snowfall (inches), y-axis: Late snowfall (inches)]

ALR8

II67

Coefficient of Determination R²: II

difference between SYY and RSS is called sum of squares due to regression:

SSreg = SYY - RSS

Problem 5 says that

RSS = SYY - (SXY)²/SXX

hence

SSreg = SYY - (SYY - (SXY)²/SXX) = (SXY)²/SXX

divide SSreg by SYY to get definition for coefficient of determination:

R² = SSreg/SYY = (SXY)² / (SXX SYY) = 1 - RSS/SYY

ALR35, 36

II68

Coefficient of Determination R²: III

Exercise (unassigned): R² = rxy² (squared sample correlation)

must have 0 ≤ R² ≤ 1

R² × 100 gives percentage of total sum of squares explained by regression (concept of R² generalizes to multiple regression)

examples: R² = 0.026 for Fort Collins & R² = 0.995 for Forbes

ALR35, 36

II69
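a sketch of these computations in R, assuming a fitted simple regression fit <- lm(y ~ x) with predictor x and response y in hand:

    RSS <- sum(residuals(fit)^2)
    SYY <- sum((y - mean(y))^2)
    1 - RSS/SYY                   # coefficient of determination R^2
    cor(x, y)^2                   # squared sample correlation, same value
    summary(fit)$r.squared        # R^2 as reported by lm()
    summary(fit)$adj.r.squared    # adjusted R^2 (next overhead)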

Coefficient of Determination R²: IV

R and other computer packages report both R² and a variation known as the adjusted R²:

R²adj = 1 - (RSS/df) / (SYY/(n - 1)) as compared to R² = 1 - RSS/SYY,

where df is the degrees of freedom

for simple linear regression, df = n - 2, so R²adj gets closer and closer to R² as n increases

in general, df = n minus # of parameters in mean function

R²adj is intended to facilitate comparison of models, but Weisberg notes (p. 36) there are better ways of doing so

note: R² useless if mean function does not have intercept term (e.g., regression through the origin: E(Y | X = x) = β1x)
ALR36

II70

Inadequacy of Sufficient Statistics: I


all of the data-dependent variables connected with a simple linear regression (e.g., β̂0, β̂1, σ̂², SSreg, RSS, R², etc.) can be formed using just five fundamental statistics:

x̄, ȳ, SXX, SYY and SXY

since

x̄ = (1/n) Σ_{i=1}^n xi, SXX = Σ_{i=1}^n xi² - n x̄² and SXY = Σ_{i=1}^n xiyi - n x̄ȳ

(with analogous equations for ȳ and SYY), it follows that basic linear regression analysis depends only on five so-called sufficient statistics:

Σ_{i=1}^n xi, Σ_{i=1}^n yi, Σ_{i=1}^n xi², Σ_{i=1}^n yi² and Σ_{i=1}^n xiyi

ALR293, 294, 23, 24, 25

II71


Inadequacy of Sufficient Statistics: II


under assumptions of normality and correctness of regression model, we do not in theory lose any probabilistic information by tossing away the original data (xi, yi), i = 1, . . . , n, and just keeping five sufficient statistics

reliance on sufficient statistics is dangerous in actual applications, where adequacy of basic assumptions (normality, correctness of model) is always open to question

Anscombe (1973) constructed an example of four data sets (n = 11) with sufficient statistics that are identical (to within rounding error), offering much food for thought

third data set: reconsider scheme of picking median slope amongst all possible lines determined by two distinct points
ALR12, 13

II72
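Anscombe's four data sets are built into base R as the data frame anscombe (columns x1-x4, y1-y4); a brief sketch showing that the fitted coefficients and R² are essentially identical across the quartet, even though the plots that follow look very different:

    data(anscombe)                        # built-in: columns x1..x4 and y1..y4
    fits <- lapply(1:4, function(j)
      lm(anscombe[[paste0("y", j)]] ~ anscombe[[paste0("x", j)]]))
    sapply(fits, coef)                                  # nearly identical intercepts and slopes
    sapply(fits, function(f) summary(f)$r.squared)      # nearly identical R^2 values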

[Figure: Anscombe's First Data Set; x-axis: Predictor, y-axis: Response]

ALR13

II73

[Figure: Anscombe's Second Data Set; x-axis: Predictor, y-axis: Response]

ALR13

II74

[Figure: Anscombe's Third Data Set; x-axis: Predictor, y-axis: Response]

ALR13

II75

[Figure: Anscombe's Fourth Data Set; x-axis: Predictor, y-axis: Response]

ALR13

II76

Residuals: I
looking at residuals êi is a vital step in regression analysis: can check assumptions to prevent garbage in/garbage out

basic tool is a plot of residuals versus other quantities, of which three obvious choices are:

1. residuals versus fitted values ŷi
2. residuals versus predictors xi
3. residuals versus case numbers i

special nature of certain data might suggest other plots

useful residual plot resembles a null plot when assumptions hold, and a non-null plot when some assumption fails

let's look at plots 1 to 3 using Anscombe's data sets as examples (plots follow the R sketch below)
ALR36, 37, 38

II77
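a minimal R sketch of the three plots just listed, for a fitted simple regression fit <- lm(y ~ x) with hypothetical vectors x and y:

    e <- residuals(fit)                                                   # residuals
    plot(fitted(fit), e, xlab = "Fitted values", ylab = "Residuals")      # plot 1
    abline(h = 0, lty = 2)
    plot(x, e, xlab = "Predictors", ylab = "Residuals")                   # plot 2
    abline(h = 0, lty = 2)
    plot(seq_along(e), e, xlab = "Case numbers", ylab = "Residuals")      # plot 3
    abline(h = 0, lty = 2)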

[Figure: Residuals Versus Fitted Values, Data Set #1; x-axis: Fitted values, y-axis: Residuals]

II78

[Figure: Residuals Versus Predictors, Data Set #1; x-axis: Predictors, y-axis: Residuals]

II79

[Figure: Residuals Versus Fitted Values, Data Set #2; x-axis: Fitted values, y-axis: Residuals]

II80

[Figure: Residuals Versus Predictors, Data Set #2; x-axis: Predictors, y-axis: Residuals]

II81

[Figure: Residuals Versus Fitted Values, Data Set #3; x-axis: Fitted values, y-axis: Residuals]

II82

[Figure: Residuals Versus Predictors, Data Set #3; x-axis: Predictors, y-axis: Residuals]

II83

[Figure: Residuals Versus Fitted Values, Data Set #4; x-axis: Fitted values, y-axis: Residuals]

II84

[Figure: Residuals Versus Predictors, Data Set #4; x-axis: Predictors, y-axis: Residuals]

II85

Residuals: II
Q: why is a plot of residuals versus ŷi identical to a plot of residuals versus xi after relabeling the horizontal axis?

II86

[Figure: Residuals Versus Case Numbers, Data Set #1; x-axis: Case numbers, y-axis: Residuals]

II87

[Figure: Residuals Versus Case Numbers, Data Set #2; x-axis: Case numbers, y-axis: Residuals]

II88

[Figure: Residuals Versus Case Numbers, Data Set #3; x-axis: Case numbers, y-axis: Residuals]

II89

[Figure: Residuals Versus Case Numbers, Data Set #4; x-axis: Case numbers, y-axis: Residuals]

II90

Residuals: III
although plots of êi versus i were not particularly useful for Anscombe's data, plot is useful for certain other data sets (particularly where cases are collected sequentially in time)

fourth obvious choice: plot residuals êi versus responses yi

this choice is problematic because relationship

yi = ŷi + êi

says that, if spread of ŷi's is small compared to spread of êi's, large êi's will correspond to large yi's even if model is correct

thus residuals versus responses is not a useful residual plot because it need not resemble a null plot when assumptions hold

as an example, reconsider Fort Collins data
II91

[Figure: Scatterplot of Late Snowfall Versus Early Snowfall (Fort Collins data); x-axis: Early snowfall (inches), y-axis: Late snowfall (inches)]

ALR8

II92

[Figure: Residuals Versus Fitted Values, Fort Collins Data; x-axis: Fitted values, y-axis: Residuals]

II93

[Figure: Residuals Versus Predictors, Fort Collins Data; x-axis: Predictors, y-axis: Residuals]

II94

[Figure: Residuals Versus Case Numbers, Fort Collins Data; x-axis: Case numbers, y-axis: Residuals]

II95

[Figure: Residuals Versus Responses, Fort Collins Data; x-axis: Responses, y-axis: Residuals]

II96

Residuals: IV
reconsider Forbes data, focusing first on 3 following overheads

red dashed horizontal lines on residual plot show ±σ̂

recall definition of weights ci:

β̂1 = Σi (xi - x̄)yi / SXX = Σ_{i=1}^n ci yi, where ci = (xi - x̄)/SXX

ALR36, 37, 38

II97

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point (Forbes data); x-axis: Boiling point, y-axis: log(Pressure)]

ALR6

II98

[Figure: Plot of Residuals versus Predictors for Forbes Data; x-axis: Boiling point, y-axis: Residuals]

ALR6

II99

[Figure: Weights Versus Boiling Point for Forbes Data; x-axis: Boiling point, y-axis: ci]

II100

Residuals: V
Weisberg notes that Forbes deemed this case "evidently a mistake", but perhaps just because of its appearance as an outlier

Weisberg (p. 38) shows that, if (x12, y12) is removed and regression analysis is redone on reduced data set, resulting slope estimate is virtually the same, but σ̂ and quantities that depend upon it drastically change (see overheads that follow)

to delete or not to delete: that is the question

if we don't delete, normality assumption is questionable

if we do delete, normality assumption is tenable, but no real scientific justification for doing so (open to charges of data massaging)

ALR36, 37, 38

II101

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point; x-axis: Boiling point, y-axis: log(Pressure)]

ALR6

II102

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point; x-axis: Boiling point, y-axis: log(Pressure)]

II103

[Figure: Plot of Residuals versus Predictors for Forbes Data; x-axis: Boiling point, y-axis: Residuals]

ALR6

II104

[Figure: Plot of Residuals versus Predictors for Forbes Data; x-axis: Boiling point, y-axis: Residuals]

II105

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point; x-axis: Boiling point (190-215), y-axis: log(Pressure)]

II106

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point; x-axis: Boiling point (190-215), y-axis: log(Pressure)]

II107

Main Points: I
given a response Y and a predictor X, simple linear regression assumes (1) a linear mean function

E(Y | X = x) = β0 + β1x,

where β0 (intercept term) and β1 (slope term) are unknown parameters (constants), and (2) a constant variance function

Var(Y | X = x) = σ²,

where σ² > 0 is a third unknown parameter

simple linear regression model can also be written as

Y = E(Y | X = x) + e = β0 + β1x + e,

where e is a statistical error, a random variable (RV) such that E(e) = 0 and Var(e) = σ²

ALR21, 292, 293

II108

Main Points: II
let (xi, yi), i = 1, . . . , n, be RVs obeying Y = β0 + β1x + e (predictor/response data are realizations of these 2n RVs)

for the ith case, have yi = β0 + β1xi + ei

errors e1, . . . , en are independent RVs such that E(ei | xi) = 0

model for data can also be written in matrix notation as

Y = Xβ + e,

where Y = (y1, y2, . . . , yn)', X is the n × 2 matrix whose ith row is (1, xi), β = (β0, β1)' and e = (e1, e2, . . . , en)'

ALR21, 29, 63, 64

II109

Main Points: III


given sample means

x̄ = (1/n) Σ_{i=1}^n xi and ȳ = (1/n) Σ_{i=1}^n yi

and sample cross products and sums of squares

SXY = Σ_{i=1}^n (xi - x̄)(yi - ȳ), SXX = Σ_{i=1}^n (xi - x̄)² & SYY = Σ_{i=1}^n (yi - ȳ)²,

can form least squares estimators for parameters β1 and β0:

β̂1 = SXY/SXX and β̂0 = ȳ - β̂1x̄

corresponding estimator for error variance σ² is

σ̂² = RSS/(n - 2), where RSS = Σ_{i=1}^n [yi - (β̂0 + β̂1xi)]² = SYY - (SXY)²/SXX

ALR293, 294, 24, 25

II110

Main Points: IV
letting

RSS(b0, b1) = Σ_{i=1}^n [yi - (b0 + b1xi)]²,

least squares estimators β̂0 and β̂1 are choices for b0 and b1 such that RSS(b0, b1) is minimized

fitted values ŷi and residuals êi are defined as

ŷi = β̂0 + β̂1xi and êi = yi - (β̂0 + β̂1xi) = yi - ŷi,

in terms of which we have

RSS = RSS(β̂0, β̂1) = Σ_{i=1}^n êi² and σ̂² = Σi êi² / (n - 2)

ALR24, 22, 23

II111

Main Points: V
in matrix notation, least squares estimator β̂ of β is such that X'Xβ̂ = X'Y, i.e., β̂ is solution to normal equations X'Xb = X'Y

2 × 2 matrix X'X has an inverse as long as SXX ≠ 0, so

β̂ = (X'X)⁻¹X'Y

since E(β̂) = β, estimators β̂0 & β̂1 are unbiased, as is σ̂² also:

E(β̂0) = β0, E(β̂1) = β1 and E(σ̂²) = σ²

also have Var(β̂ | X) = σ²(X'X)⁻¹, leading us to deduce

Var(β̂1 | X) = σ²/SXX, which can be estimated by V̂ar(β̂1 | X) = σ̂²/SXX,

the square root of which is se(β̂1), the standard error of β̂1

ALR304, 305, 61, 62, 63, 27, 28

II112

Main Points: VI
can test null hypothesis (NH) that β1 = 0 by forming t-statistic t = β̂1/se(β̂1) and comparing it to percentage points t(α, n - 2) for t-distribution with n - 2 degrees of freedom, with a large value of |t| giving evidence against NH via a small p-value

(1 - α) × 100% confidence interval for β1 is set of points in interval whose end points are

β̂1 - t(α/2, n - 2) se(β̂1) and β̂1 + t(α/2, n - 2) se(β̂1)

ALR31

II113

Main Points: VII


can predict a yet-to-be-observed response y given a setting x for the predictor using ỹ = β̂0 + β̂1x, which has a standard error given by

sepred(ỹ | x+) = σ̂ (1 + 1/n + (x - x̄)²/SXX)^{1/2},

where x+ denotes x along with original predictors x1, . . . , xn

(1 - α) × 100% prediction interval constitutes all values from

ỹ - t(α/2, n - 2) sepred(ỹ | x+) to ỹ + t(α/2, n - 2) sepred(ỹ | x+)

ALR32, 33

II114

Main Points: VIII


difference between residual sum of squares SYY for model

E(Y | X = x) = β0

and residual sum of squares RSS for model

E(Y | X = x) = β0 + β1x

is sum of squares due to regression:

SSreg = SYY - RSS = (SXY)²/SXX

large value for SSreg indicates β1 is valuable to include

coefficient of determination is R² = SSreg/SYY

R² × 100 gives percentage of total sum of squares explained by linear regression (must have 0 ≤ R² ≤ 1)
ALR35, 36

II115

Main Points: IX
plots of residuals êi are invaluable for assessing reasonableness of fitted model (a point that cannot be emphasized too much)

standard plot is residuals êi versus fitted values ŷi, which is equivalent to êi's versus predictors xi

plot of residuals versus case number i is potentially (but not always) useful

do not plot residuals versus responses yi (misleading!)

failure to plot residuals is potentially bad for your health!

Thou Shalt Plot Residuals (a proposed 11th commandment!)

ALR36, 37, 38

II116

Additional References
F. Mosteller, A.F. Siegel, E. Trapido and C. Youtz (1981), Eye Fitting Straight Lines, The American Statistician, 35, pp. 150-152
C.R. Rao (1973), Linear Statistical Inference and Its Applications (Second Edition), New York: John Wiley & Sons, Inc.
G.A.F. Seber (1977), Linear Regression Analysis, New York: John Wiley & Sons, Inc.

II117
