
Introduction to Simple Linear Regression: I

consider response Y and predictor X

model for simple linear regression assumes a mean function E(Y | X = x) of the form β0 + β1x and a variance function Var(Y | X = x) that is constant:

E(Y | X = x) = β0 + β1x and Var(Y | X = x) = σ²

3 parameters are β0 (intercept), β1 (slope) & σ² > 0

to interpret σ², define random variable e = Y - E(Y | X = x) so that Y = E(Y | X = x) + e

results in A.2.4 of ALR say that E(e) = 0 and Var(e) = σ²

σ² tells how close Y is likely to be to mean value E(Y | X = x)


ALR21, 292

II1

Introduction to Simple Linear Regression: II


let (xi, yi), i = 1, . . . , n, denote predictor/response pairs (X, Y )
ei = yi - E(Y | X = xi) = yi - (β0 + β1xi) is called a statistical error and represents distance between yi and its mean function

will make two additional assumptions about the errors:

[1] E(ei | xi) = 0, implying a scatter plot of ei versus xi should resemble a null plot (random deviations about zero)
[2] e1, . . . , en are a set of n independent random variables (RVs)

will make third additional assumption upon occasion:

[3] conditional on xis, errors eis are normally distributed


note: normally distributed is same as Gaussian distributed
(preferred expression in engineering & physical sciences)

ALR21, 29

II2

Estimation of Model Parameters: I


Q: given data (xi, yi), i = 1, . . . , n (n realizations of RVs X and Y), how should we determine parameters β0 & β1?

since β0 & β1 determine a line, question is equivalent to deciding how best to draw a line through a scatterplot of (xi, yi)

for n > 2, possibilities for defining "best" (lots more exist!):

hire an expert to eyeball a line (Mosteller et al., 1981)

find line minimizing distances between data and all possible lines, with some considerations being direction (vertical, horizontal, perpendicular) and squared difference, absolute difference, etc.

look at all possible lines determined by two distinct points in scatterplot, and pick one with median slope (sounds bizarre, but later on will discuss why this might be of interest)

ALR22,23

II3

[Figure: Vertical, Horizontal & Perpendicular Least Squares (schematic); x-axis: xi, y-axis: yi]

II4

[Figure: Vertical, Horizontal & Perpendicular Least Squares; x-axis: xi (roughly 10-50), y-axis: yi (roughly 10-60)]

II5

Estimation of Model Parameters: II


one strategy: form to-be-defined estimators β̂0 & β̂1 of β0 & β1, after which form residuals (observed errors):

êi = yi - (β̂0 + β̂1xi) = yi - ŷi,

where ŷi = β̂0 + β̂1xi is fitted value for ith case

Q: why isn't residual êi in general equal to error ei = yi - (β0 + β1xi)?

Q: if per chance we had β̂0 = β0 & β̂1 = β1, would fitted value ŷi be equal to actual value yi?

ALR22

II6

Least Squares Criterion: I


least squares scheme: estimate β0 & β1 such that sum of squares of resulting residuals is as small as possible

since residuals are given by êi = yi - (β̂0 + β̂1xi) once β̂0 & β̂1 are known, consider

RSS(b0, b1) = Σ_{i=1}^n [yi - (b0 + b1xi)]²,

i.e., the residual sum of squares when we use b0 for the intercept and b1 for the slope

least squares estimators β̂0 & β̂1 are such that RSS(β̂0, β̂1) < RSS(b0, b1) when either b0 ≠ β̂0 or b1 ≠ β̂1 (or both)

ALR24

II7
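to make the criterion concrete, here is a minimal R sketch of RSS(b0, b1); the data vectors and candidate values are hypothetical illustrations, not taken from ALR:

    # residual sum of squares for a candidate intercept b0 and slope b1
    # (x and y are numeric vectors of equal length)
    rss <- function(b0, b1, x, y) sum((y - (b0 + b1 * x))^2)

    set.seed(42)                                   # hypothetical data
    x <- runif(20, 190, 212)
    y <- -0.4 + 0.009 * x + rnorm(20, sd = 0.004)

    rss(-0.4, 0.009, x, y)                         # RSS at one candidate pair
    fit <- lm(y ~ x)                               # least squares fit
    rss(coef(fit)[1], coef(fit)[2], x, y)          # smaller than at any other (b0, b1)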

Least Squares Criterion: II


Q: how do we set b0 & b1 to make RSS(b0, b1) the smallest?

could try lots of different values (a grid search, a potentially exhausting task!), but can put calculus to good use here

to motivate how to find β̂0 & β̂1, first consider simpler mean function E(Y | X = x) = β1x (regression through origin)

model is now Y = β1x + e, and task is to find b1 minimizing

RSS(b1) = Σ_{i=1}^n [yi - b1xi]² = Σ_{i=1}^n [yi² - 2b1xiyi + b1²xi²]

Q: why is the b1 minimizing RSS(b1) the same as the z minimizing f(z) = az² + bz, where a = Σ_{i=1}^n xi² and b = -2 Σ_{i=1}^n xiyi?

ALR24, 47, 48

II8


Least Squares Criterion: III


since a = Σi xi² > 0, f(z) = az² + bz → ∞ as z → ±∞

since f(0) = 0, minimizer ẑ must be such that f(ẑ) ≤ 0

[Figure: sketch of f(z) = az² + bz versus z, with the minimizer ẑ marked; x-axis: z, y-axis: f(z)]

if f(ẑ) < 0, ẑ is halfway between 0 and nonzero root of az² + bz
II9

Least Squares Criterion: IV


roots of polynomial az² + bz + c given by quadratic formula:

(-b ± √(b² - 4ac)) / (2a)

when c = 0, one root is 0, and nonzero root is -b/a, so

ẑ = -b/(2a) = Σi xiyi / Σi xi² = β̂1    (∗)

alternative approach to finding minimizer of RSS(b1): differentiate with respect to b1, set result to 0 and solve for b1:

d RSS(b1)/db1 = d Σi [yi - b1xi]² / db1 = -2 Σi xi(yi - b1xi) = 0,

which yields same expression for β̂1 as stated in (∗)

ALR47, 48

II10
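as a numerical check on (∗), a minimal R sketch (hypothetical data) compares the closed form Σi xiyi / Σi xi² with R's built-in no-intercept fit:

    set.seed(1)                        # hypothetical data
    x <- runif(30, 1, 10)
    y <- 2 * x + rnorm(30)

    sum(x * y) / sum(x^2)              # minimizer of RSS(b1) for E(Y | X = x) = beta1 * x
    coef(lm(y ~ x - 1))                # same value from regression through the origin via lm()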

Least Squares Criterion: V


return now to mean function E(Y | X = x) = β0 + β1x, for which RSS(b0, b1) = Σi [yi - b0 - b1xi]²

calculus-based approach to get least squares estimators β̂0 & β̂1 follows a path similar to that for E(Y | X = x) = β1x

leads to two equations to solve for two unknowns (β̂0 & β̂1)

differentiate RSS(b0, b1) with respect to b0 and set result to 0:

-2 Σi (yi - b0 - b1xi) = 0, giving b0n + b1 Σi xi = Σi yi

differentiate RSS(b0, b1) with respect to b1 and set result to 0:

-2 Σi xi(yi - b0 - b1xi) = 0, giving b0 Σi xi + b1 Σi xi² = Σi xiyi

ALR293

II11

Least Squares Criterion: VI


so-called normal equations for simple linear regression are thus

b0n + b1 Σi xi = Σi yi and b0 Σi xi + b1 Σi xi² = Σi xiyi

using x̄ = (1/n) Σi xi & ȳ = (1/n) Σi yi, 1st normal equation gives

b0 = ȳ - b1x̄

replace b0 in 2nd normal equation with right-hand side of above:

(ȳ - b1x̄) Σi xi + b1 Σi xi² = Σi xiyi

after a bit of algebra, get

b1 = (Σi xiyi - ȳ Σi xi) / (Σi xi² - x̄ Σi xi) = (Σi xiyi - n x̄ȳ) / (Σi xi² - n x̄²) = β̂1,

and hence β̂0 = ȳ - β̂1x̄
ALR293, 204

II12
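a minimal R sketch (hypothetical data) of the estimators just derived, checked against lm():

    set.seed(2)                               # hypothetical data
    x <- runif(25, 0, 10)
    y <- 1 + 0.5 * x + rnorm(25, sd = 0.3)

    SXY <- sum((x - mean(x)) * (y - mean(y)))
    SXX <- sum((x - mean(x))^2)
    b1 <- SXY / SXX                           # slope estimate
    b0 <- mean(y) - b1 * mean(x)              # intercept estimate
    c(b0, b1)
    coef(lm(y ~ x))                           # same solution of the normal equations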

Sum of Cross Products and Sum of Squares


define sum of cross products and sum of squares for x's:

SXY = Σi (xi - x̄)(yi - ȳ) & SXX = Σi (xi - x̄)²    (∗)

Problem 1: show that

Σi (xi - x̄)(yi - ȳ) = Σi xiyi - n x̄ȳ & Σi (xi - x̄)² = Σi xi² - n x̄²

can thus write

β̂1 = (Σi xiyi - n x̄ȳ) / (Σi xi² - n x̄²) = SXY / SXX

note: should avoid Σi xiyi - n x̄ȳ & Σi xi² - n x̄² when actually computing β̂1; use SXY and SXX from (∗) instead
ALR294, 23

II13

Sufficient Statistics
since

β̂1 = SXY/SXX and β̂0 = ȳ - β̂1x̄,

need only know x̄, ȳ, SXY & SXX to form β̂0 & β̂1

since

x̄ = (1/n) Σ_{i=1}^n xi, ȳ = (1/n) Σ_{i=1}^n yi, SXY = Σ_{i=1}^n xiyi - n x̄ȳ, SXX = Σ_{i=1}^n xi² - n x̄²,

follows that β̂0 & β̂1 depend only on four sufficient statistics:

Σ_{i=1}^n xi, Σ_{i=1}^n yi, Σ_{i=1}^n xiyi and Σ_{i=1}^n xi²

in theory, can dispense with 2n values (xi, yi), i = 1, . . . , n, and just keep 4 sufficient statistics as far as β̂0 & β̂1 are concerned
ALR294, 23

II14

Atmospheric Pressure & Boiling Point of Water


as 1st example, reconsider Forbes's recordings of atmospheric pressure and boiling point of water, which physics suggests are related by

log (pressure) = β0 + β1 × boiling point

taking response Y to be log10(pressure) and predictor X to be boiling point, will estimate β0 & β1 for model

Y = β0 + β1X + e

via least squares based upon data (xi, yi), i = 1, . . . , 17

taking log to mean log base 10, computations in R yield

x̄ = 202.9529, ȳ = 1.396041, SXY = 4.753781 & SXX = 530.7824,

from which we get

β̂1 = SXY/SXX = 0.008956178 and β̂0 = ȳ - β̂1x̄ = -0.4216418
ALR25, 26

II15
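a sketch of the R computations behind these numbers; it assumes the Forbes data are available as a data frame named forbes with columns bp (boiling point) and pres (pressure), as in the versions shipped with the MASS and alr4 packages (column names may differ):

    library(MASS)                              # forbes: 17 boiling point / pressure pairs
    x <- forbes$bp                             # boiling point
    y <- log10(forbes$pres)                    # log base 10 of pressure

    SXY <- sum((x - mean(x)) * (y - mean(y)))
    SXX <- sum((x - mean(x))^2)
    b1 <- SXY / SXX                            # about  0.008956
    b0 <- mean(y) - b1 * mean(x)               # about -0.4216
    coef(lm(y ~ x))                            # same estimates from lm()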

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point (Forbes data); x-axis: Boiling point, y-axis: log(Pressure)]

ALR6

II16

Predicting the Weather


as 2nd example, reconsider n = 93 years of measured early/late season snowfalls from Fort Collins, Colorado

taking response Y and predictor X to be late season (Jan-June) and early season (Sept-Dec) snowfalls, entertain model

Y = β0 + β1X + e

computations in R yield

x̄ = 16.74409, ȳ = 32.04301, SXY = 2229.014 & SXX = 10954.07,

from which we get

β̂1 = SXY/SXX = 0.2034873 and β̂0 = ȳ - β̂1x̄ = 28.6358

ALR8, 9

II17


[Figure: Scatterplot of Late Snowfall Versus Early Snowfall (Fort Collins data); x-axis: Early snowfall (inches), y-axis: Late snowfall (inches)]

ALR8

II18

Sample Variances, Covariance and Correlation


define sample variance and sample standard deviation of x's:

SDx² = Σi (xi - x̄)² / (n - 1) = SXX / (n - 1) & SDx = √(SXX / (n - 1))

note: sometimes n is used in place of n - 1 in defining SDx²

after defining SYY = Σi (yi - ȳ)² (sum of squares for y's), define similar quantities for y's:

SDy² = Σi (yi - ȳ)² / (n - 1) = SYY / (n - 1) & SDy = √(SYY / (n - 1))

finally define sample covariance and then sample correlation:

sxy = Σi (xi - x̄)(yi - ȳ) / (n - 1) = SXY / (n - 1) & rxy = sxy / (SDx SDy)
ALR23

II19

Alternative Expression for Slope Estimator


Problem 2: alternative expression for β̂1 = SXY/SXX is

β̂1 = rxy SDy / SDx

Problem 3: -1 ≤ rxy ≤ 1

note that, if xi's & yi's are such that SDy = SDx, then estimated slope is same as sample correlation, as the plot following the R sketch below illustrates

ALR24

II20
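a quick numerical check of the identity in Problem 2, using hypothetical data:

    set.seed(3)                          # hypothetical data
    x <- rnorm(50)
    y <- 1 + 2 * x + rnorm(50)

    coef(lm(y ~ x))[2]                   # slope estimate from least squares
    cor(x, y) * sd(y) / sd(x)            # r_xy * SD_y / SD_x, same value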

[Figure: Sample Correlation rxy = 0.999; x-axis: xi, y-axis: yi]

II21

Estimating σ²: I

simple linear regression model has 3 parameters: β0 (intercept), β1 (slope) & σ² (variance of errors)

with β0 & β1 estimated by β̂0 & β̂1, will base estimator for σ² on variance of residuals (observed errors)

recall definition of residuals: êi = yi - (β̂0 + β̂1xi)

in view of, e.g.,

SDx² = Σi (xi - x̄)² / (n - 1),

obvious estimator of σ² would appear to be

Σi (êi - ē)² / (n - 1), where ē = (1/n) Σ_{i=1}^n êi is the average of the residuals

Problem 4: show that ē = 0 always for simple linear regression


ALR26

II22

Estimating σ²: II

obvious estimator thus simplifies to Σi êi² / (n - 1)

taking RSS to be shorthand for RSS(β̂0, β̂1), we have

RSS = Σ_{i=1}^n [yi - (β̂0 + β̂1xi)]² = Σ_{i=1}^n êi²,

so the obvious estimator of σ² is RSS/(n - 1)

can show (e.g., Seber, 1977, p. 51) that unbiased estimator σ̂² of σ², i.e., E(σ̂²) = σ², is

σ̂² = RSS / (n - 2),

where n - 2 = sample size - # of parameters in mean function; obvious estimator RSS/(n - 1) is smaller than σ̂² and hence is biased towards zero
ALR26, 27, 306

II23

Estimating σ²: III

assumption needed for σ̂² to be an unbiased estimator of σ²: errors ei are uncorrelated RVs with zero means and common variance σ²

(slightly less restrictive than assumptions [1] & [2] stated on overhead II2)

with additional assumption that ei's are normally distributed, can argue that RV (n - 2)σ̂²/σ² obeys a chi-squared distribution with n - 2 degrees of freedom

Problem 5:

RSS = SYY - (SXY)²/SXX = SYY - β̂1² SXX

Q: for a given SYY, what range of values can RSS assume, and how would you describe the extreme cases?
ALR26, 27

II24

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point (Forbes data); x-axis: Boiling point, y-axis: log(Pressure)]

ALR6

II25

[Figure: Residual Plot for Forbes Data; x-axis: Boiling point, y-axis: Residuals]

ALR6

II26

Estimating σ²: IV

for the Forbes data, computations in R yield

SYY = 0.04279135, SXY = 4.753781 & SXX = 530.7824,

from which we get

RSS = SYY - (SXY)²/SXX = 0.0002156426

since n = 17 for the Forbes data,

σ̂² = RSS/(n - 2) = RSS/15 = 0.00001437617

standard error of regression is

σ̂ = √0.00001437617 = 0.003791592

(also called residual standard error)
red dashed horizontal lines on residual plot show ±σ̂
ALR26

II27
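the same quantities can be read off a fitted lm object; a sketch assuming the Forbes vectors x (boiling point) and y (log10 pressure) from the earlier sketch:

    fit <- lm(y ~ x)
    RSS <- sum(residuals(fit)^2)            # about 0.0002156
    RSS / df.residual(fit)                  # sigma_hat^2 = RSS/(n - 2), about 1.44e-05
    summary(fit)$sigma                      # sigma_hat, the residual standard error, about 0.003792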

Estimating σ²: V

for the Fort Collins data, computations in R yield

SYY = 17572.41, SXY = 2229.014 & SXX = 10954.07,

from which we get

RSS = SYY - (SXY)²/SXX = 17118.83

since n = 93 for the Fort Collins data,

σ̂² = RSS/(n - 2) = RSS/91 = 188.119
standard error of regression is

σ̂ = √188.119 = 13.71565
ALR26

II28

[Figure: Scatterplot of Late Snowfall Versus Early Snowfall (Fort Collins data); x-axis: Early snowfall (inches), y-axis: Late snowfall (inches)]

ALR8

II29

[Figure: Residual Plot for Fort Collins Data; x-axis: Early snowfall (inches), y-axis: Residuals]

II30

Matrix Formulation of Simple Linear Regression: I


matrix theory offers an alternative formulation for simple linear regression, with the advantage that it generalizes readily to handle multiple linear regression

start by defining an n-dimensional column vector Y containing the yi's; an n × 2 matrix X whose 1st column consists of just 1's, and whose 2nd has the xi's; a 2-dimensional vector β containing β0 and β1; and an n-dimensional vector e containing the ei's:

Y = (y1, y2, . . . , yn)', X = the n × 2 matrix whose ith row is (1, xi), β = (β0, β1)', e = (e1, e2, . . . , en)'

matrix version of simple linear regression model is Y = Xβ + e
ALR63, 64, 60

II31

Matrix Formulation of Simple Linear Regression: II


since X has ith row (1, xi) & β = (β0, β1)', it follows that Xβ = (β0 + β1x1, β0 + β1x2, . . . , β0 + β1xn)'

hence ith row of matrix equation Y = Xβ + e says

yi = β0 + β1xi + ei,

which is consistent with model Y = β0 + β1X + e; see also II2

let e' and X' denote the transposes of e and X; i.e., e' is an n-dimensional row vector (e1 e2 . . . en), while X' is a 2 × n matrix whose 1st row is (1 1 . . . 1) and whose 2nd row is (x1 x2 . . . xn)
ALR299, 300, 301

II32

Matrix Formulation of Simple Linear Regression: III


since e = Y - Xβ and since e'e = Σi ei², can express sum of squares of errors as

e'e = (Y - Xβ)'(Y - Xβ)

if we entertain b' = (b0 b1) rather than unknown β' = (β0 β1), corresponding residuals are given by Y - Xb, so residual sum of squares can be written as

RSS(b) = (Y - Xb)'(Y - Xb)
       = (Y' - b'X')(Y - Xb)
       = Y'Y - Y'Xb - b'X'Y + b'X'Xb
       = Y'Y - 2Y'Xb + b'X'Xb,

where we make use of 2 facts: (1) transpose of a product is product of transposes in reverse order & (2) transpose of a scalar is itself (hence b'X'Y = (b'X'Y)' = Y'Xb)
ALR61, 62, 300, 301, 304

II33

Taking Derivatives with Respect to Vector b: I


suppose f(b) is a scalar-valued function of vector b (elements are b1, b2, . . . , bq)

two examples, for which a is a vector (ith element is ai) & A is a q × q matrix (element in ith row & jth column is Ai,j):

f1(b) = a'b = Σ_{i=1}^q ai bi and f2(b) = b'Ab = Σ_{i=1}^q Σ_{j=1}^q bi Ai,j bj

define df(b)/db to be the q-dimensional column vector whose ith element is df(b)/dbi:

df(b)/db = (df(b)/db1, df(b)/db2, . . . , df(b)/dbq)'

ALR301, 304

II34

Taking Derivatives with Respect to Vector b: II


can show (see, e.g., Rao, 1973, p. 71) that

f1(b) = a'b has derivative df1(b)/db = a

and

f2(b) = b'Ab has derivative df2(b)/db = 2Ab when A is symmetric (as X'X will be below)

(not hard to show: do it for fun and games!)

Q: what is the derivative of f3(b) = b'a?

Q: what is the derivative of f4(b) = b'b = Σi bi²?

Q: what is the derivative of

f5(b) = c'Cb = Σ_{i=1}^p Σ_{j=1}^q ci Ci,j bj,

where c is a p-dimensional vector and C is a p × q matrix?


II35

Matrix Formulation of Simple Linear Regression: IV


returning to

RSS(b) = Y'Y - 2Y'Xb + b'X'Xb,

taking the derivative of f(b) = RSS(b) with respect to b and setting the resulting expression to 0 (a vector of zeros) yields the matrix version of the normal equations:

X'Xb = X'Y,

where we have made use of the facts

d(Y'Y)/db = 0, d(Y'Xb)/db = X'Y and d(b'X'Xb)/db = 2X'Xb

least squares estimator β̂ of β is solution to normal equations:

X'Xβ̂ = X'Y

ALR304

II36

Matrix Formulation of Simple Linear Regression: V


let's verify that solution to X'Xβ̂ = X'Y yields same estimators β̂1 & β̂0 as before, namely, SXY/SXX & ȳ - β̂1x̄

now X'X is the 2 × 2 matrix with 1st row (n, Σi xi) and 2nd row (Σi xi, Σi xi²), i.e., with 1st row (n, n x̄) and 2nd row (n x̄, Σi xi²)

and X'Y is the 2-dimensional vector (Σi yi, Σi xiyi)' = (n ȳ, Σi xiyi)'

Problem 6: finish the verification!


ALR63, 64

II37
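a minimal R sketch (hypothetical data) of solving the normal equations in matrix form; solve(), crossprod() and cbind() are base R:

    set.seed(4)                                        # hypothetical data
    x <- runif(20)
    y <- 3 - 2 * x + rnorm(20, sd = 0.1)

    X <- cbind(1, x)                                   # n x 2 design matrix: column of 1's, then x
    beta_hat <- solve(crossprod(X), crossprod(X, y))   # solves (X'X) b = X'Y
    drop(beta_hat)
    coef(lm(y ~ x))                                    # same intercept and slope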

Properties of Least Squares Estimators: I


since E(Y | X = x) = β0 + β1x, fitted mean function is

Ê(Y | X = x) = β̂0 + β̂1x,    (∗)

which is a line with intercept β̂0 and slope β̂1

recalling that β̂0 = ȳ - β̂1x̄, start from the right-hand side of (∗) with x set to x̄ to get

β̂0 + β̂1x̄ = ȳ - β̂1x̄ + β̂1x̄ = ȳ,

which says point (x̄, ȳ) must lie on fitted mean function

vertical dashed line on following plots indicates the value of x̄, while horizontal dashed line, the value of ȳ

ALR27, 28

II38

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point (Forbes data); x-axis: Boiling point, y-axis: log(Pressure)]

ALR6

II39

[Figure: Scatterplot of Late Snowfall Versus Early Snowfall (Fort Collins data); x-axis: Early snowfall (inches), y-axis: Late snowfall (inches)]

ALR8

II40

Properties of Least Squares Estimators: II


both β̂0 and β̂1 can be written as a linear combination of responses y1, y2, . . . , yn

since β̂1 = SXY/SXX and since SXY = Σi (xi - x̄)yi (see Problem 1), we have

β̂1 = Σi (xi - x̄)yi / SXX = Σ_{i=1}^n ci yi, where ci = (xi - x̄)/SXX

Q: which yi's will have the most/least influence on β̂1?

let's look at ci plotted versus xi for Forbes data

ALR27

II41

[Figure: Weights Versus Boiling Point for Forbes Data; x-axis: Boiling point, y-axis: ci]

II42

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point (Forbes data); x-axis: Boiling point, y-axis: log(Pressure)]

ALR6

II43

Random Vectors and Their Properties: I


column vector U is said to be a random vector if each of its elements Ui is an RV (random variable)

expected value of random vector, denoted by E(U), is a vector whose ith element is expected value of ith RV Ui in U

for example, if U' = (U1 U2 U3), then E(U) = (E(U1), E(U2), E(U3))'

if U has dimension q, if a is a p-dimensional column vector of constants and if A is a p × q dimensional matrix of constants, can show (fairly easily: give it a try!) that

E(a + AU) = a + A E(U)
ALR303

II44

Random Vectors and Their Properties: II


recall that, if Ui and Uj are two RVs, their covariance is defined to be

Cov(Ui, Uj) = E([Ui - E(Ui)][Uj - E(Uj)])

note that Cov(Uj, Ui) = Cov(Ui, Uj) and that

Cov(Ui, Ui) = E([Ui - E(Ui)][Ui - E(Ui)]) = E([Ui - E(Ui)]²) = Var(Ui)

by definition, the covariance matrix for q-dimensional random vector U, to be denoted by Var(U), is the q × q matrix whose (i, j)th element is Cov(Ui, Uj)

for example, if U' = (U1 U2 U3), then Var(U) is the 3 × 3 matrix with rows (Var(U1), Cov(U1, U2), Cov(U1, U3)), (Cov(U2, U1), Var(U2), Cov(U2, U3)) and (Cov(U3, U1), Cov(U3, U2), Var(U3))
ALR291, 292, 303

II45

Random Vectors and Their Properties: III


if U has dimension q, if a is a p-dimensional column vector of constants and if A is p × q dimensional matrix of constants, can show (a bit more challenging, but still worth a try!) that

Var(a + AU) = A Var(U) A'

RVs in U are uncorrelated if Cov(Ui, Uj) = 0 when i ≠ j

if each Ui in U has the same variance (σ², say) and if Ui's are uncorrelated, then Var(U) = σ²I, where I is the q × q identity matrix (1's along its diagonal and 0's elsewhere)

for this special case,

Var(a + AU) = A(σ²I)A' = σ²AA'

ALR304, 292

II46

Properties of Least Squares Estimators: III


recall that least squares estimator β̂ solves normal equations:

X'Xβ̂ = X'Y    (∗)

Problem 6: in the case of simple linear regression, X'X is an invertible matrix and thus has an inverse (X'X)⁻¹ such that

(X'X)⁻¹X'X = X'X(X'X)⁻¹ = I

premultiplication of both sides of (∗) by (X'X)⁻¹ yields

(X'X)⁻¹X'Xβ̂ = (X'X)⁻¹X'Y,

from which we get Iβ̂ = (X'X)⁻¹X'Y and hence

β̂ = (X'X)⁻¹X'Y

above succinctly expresses the fact that β̂0 & β̂1 (the elements of β̂) are linear combinations of yi's (the elements of Y)
ALR61, 64, 304

II47

Properties of Least Squares Estimators: IV


considering β̂ = (X'X)⁻¹X'Y and taking conditional expectation of both sides yields

E(β̂ | X) = E((X'X)⁻¹X'Y | X)
         = (X'X)⁻¹X' E(Y | X)
         = (X'X)⁻¹X' E(Xβ + e | X)
         = (X'X)⁻¹X' (Xβ + E(e | X))
         = (X'X)⁻¹X'Xβ
         = β

Q: what's the justification for each step above?

E(β̂ | X) = β holds for all X and hence E(β̂) = β unconditionally, from which we can conclude that β̂0 & β̂1 are unbiased estimators of β0 & β1: E(β̂0) = β0 and E(β̂1) = β1
ALR305

II48

Properties of Least Squares Estimators: V


since (i) Var(a + AU) = A Var(U) A', (ii) (AB)' = B'A' and (iii) (A⁻¹)' = (A')⁻¹ for a square matrix, we have

Var(β̂ | X) = Var((X'X)⁻¹X'Y | X)
           = (X'X)⁻¹X' Var(Y | X) ((X'X)⁻¹X')'
           = (X'X)⁻¹X' Var(Xβ + e | X) X(X'X)⁻¹
           = (X'X)⁻¹X' Var(e | X) X(X'X)⁻¹
           = (X'X)⁻¹X' (σ²I) X(X'X)⁻¹
           = σ²(X'X)⁻¹X'X(X'X)⁻¹
           = σ²(X'X)⁻¹

Q: justification for each step above?

ALR305

II49

Properties of Least Squares Estimators: VI


can readily verify that the inverse of the 2 × 2 matrix with rows (a, b) and (c, d) is 1/(ad - cb) times the matrix with rows (d, -b) and (-c, a)

since X'X is the 2 × 2 matrix with rows (n, n x̄) and (n x̄, Σi xi²), get that (X'X)⁻¹ is 1/(n Σi xi² - n²x̄²) times the matrix with rows (Σi xi², -n x̄) and (-n x̄, n)

since

Var(β̂ | X) = the 2 × 2 matrix with rows (Var(β̂0 | X), Cov(β̂0, β̂1 | X)) and (Cov(β̂0, β̂1 | X), Var(β̂1 | X)) = σ²(X'X)⁻¹,

we find that, e.g., Var(β̂1 | X) = σ² / (Σi xi² - n x̄²) = σ² / SXX by making use of Σi xi² - n x̄² = SXX (see Problem 1)

ALR64, 305, 28

II50

Properties of Least Squares Estimators: VII


Q: what happens to Var(β̂1 | X) = σ²/SXX if we have the luxury of making the sample size n as large as we want?

in practice, σ² is usually unknown and must be estimated via σ̂², leading to the following estimator for Var(β̂1 | X):

V̂ar(β̂1 | X) = σ̂²/SXX

term standard error is sometimes (but not always) used to refer to the square root of an estimated variance

standard error of β̂1, denoted by se(β̂1), is thus σ̂/√SXX

ALR29

II51
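in R the estimated matrix σ̂²(X'X)⁻¹ is available from vcov(); a sketch with hypothetical data showing three equivalent routes to se(β̂1):

    set.seed(5)                                      # hypothetical data
    x <- runif(40, 0, 10)
    y <- 2 + 0.7 * x + rnorm(40)
    fit <- lm(y ~ x)

    summary(fit)$sigma / sqrt(sum((x - mean(x))^2))  # sigma_hat / sqrt(SXX)
    sqrt(vcov(fit)[2, 2])                            # from sigma_hat^2 (X'X)^{-1}
    summary(fit)$coefficients[2, "Std. Error"]       # from lm's coefficient table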

Confidence Intervals and Tests for Slope: I


assuming errors ei in simple linear regression to be normally distributed, parameter estimator β̂1 for slope β1 is also normally distributed (same holds for β̂0 also)

further assuming errors ei to have mean 0 and unknown variance σ², distribution of β̂1 also depends upon unknown σ²

with σ² estimated by σ̂², confidence intervals (CIs) and tests concerning unknown true β1 need to be based on t-distribution with degrees of freedom in sync with divisor used to form σ̂²

let T be a random variable with a t-distribution with d degrees of freedom, and let t(α/2, d) be percentage point such that

Pr(T ≥ t(α/2, d)) = α/2

ALR30, 31
II52

Confidence Intervals and Tests for Slope: II

plot below shows probability density function (PDF) for t-distribution with d = 15 degrees of freedom, with t(0.05, 15) = 1.753 marked by vertical dashed line (thus area under PDF to right of line is 0.05, and area to left is 0.95)

[Figure: PDF of the t-distribution with 15 degrees of freedom; x-axis: x, y-axis: PDF]

II53

Confidence Intervals and Tests for Slope: III


(1 - α) × 100% CI for slope β1 is set of points β1 in interval

β̂1 - t(α/2, n - 2) se(β̂1) ≤ β1 ≤ β̂1 + t(α/2, n - 2) se(β̂1)

example: for Forbes data (n = 17), β̂1 = 0.008956 and se(β̂1) = σ̂/√SXX = 0.0001646 since σ̂ = 0.003792 and √SXX = 23.04, so 90% CI for β1 is

0.008956 - 1.753 × 0.0001646 ≤ β1 ≤ 0.008956 + 1.753 × 0.0001646

because t(0.05, 15) = 1.753, yielding

0.008668 ≤ β1 ≤ 0.009245

ALR31

II54
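a sketch of the same 90% CI in R; it assumes the Forbes fit fit <- lm(y ~ x) from the earlier sketch:

    b1 <- coef(fit)[2]                                 # slope estimate
    se1 <- summary(fit)$coefficients[2, "Std. Error"]
    tc <- qt(1 - 0.10/2, df = df.residual(fit))        # t(0.05, 15) = 1.753
    c(b1 - tc * se1, b1 + tc * se1)                    # 90% CI endpoints
    confint(fit, "x", level = 0.90)                    # same interval from confint()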

Confidence Intervals and Tests for Slope: IV


can test null hypothesis β1 = β1* versus alternative hypothesis β1 ≠ β1* by computing t-statistic

t = (β̂1 - β1*) / se(β̂1)

and comparing it to percentage points for t-distribution with n - 2 degrees of freedom

example: for Fort Collins data (n = 93), β̂1 = 0.2035 and se(β̂1) = σ̂/√SXX = 0.1310 since σ̂ = 13.72 and √SXX = 104.7, so t-statistic for test of zero slope (β1* = 0) is

t = (0.2035 - 0)/0.1310 = 1.553
ALR31

II55

Confidence Intervals and Tests for Slope: V


letting G(x) denote cumulative probability distribution function for random variable T with t(91) distribution, i.e.,

G(x) = Pr(T ≤ x),

p-value associated with t is 2(1 - G(|t|)); see the overhead following the R sketch below

p-value is 0.1239, which is not small by common standards (e.g., 0.05 or 0.01), so not much support for rejecting null hypothesis

ALR31

II56
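the two-sided p-value 2(1 - G(|t|)) can be computed directly with pt(); a sketch using the Fort Collins t-statistic and its 91 degrees of freedom:

    t_stat <- 1.553                          # t-statistic for the zero-slope test
    2 * (1 - pt(abs(t_stat), df = 91))       # two-sided p-value, about 0.124
    # the coefficient table from summary(lm(...)) reports the same t value and p-value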

[Figure: Obtaining p-value for t-test for Fort Collins Data; x-axis: |t|, y-axis: p-value]

II57

Prediction: I
suppose now we want to predict a yet-to-be-observed response y given a setting x for the predictor

if assumed-to-be-true linear regression model were known perfectly, prediction would be ỹ = β0 + β1x, whereas model says

y = β0 + β1x + e = ỹ + e

prediction error would be y - ỹ = e, which has variance Var(e | x) = σ²

in general we must be satisfied using estimators β̂0 & β̂1 in lieu of true values β0 & β1, which intuitively should lead to predictions that are not as good, resulting in a prediction error with a variance inflated above σ²
ALR32, 33

II58

Prediction: II
using fitted mean function Ê(Y | X = x) = β̂0 + β̂1x to predict response y for given x, prediction is now

ỹ = β̂0 + β̂1x,

and prediction error becomes

y - ỹ = β0 + β1x + e - (β̂0 + β̂1x)

recall that, if U & V are uncorrelated RVs, then Var(U - V) = Var(U) + Var(V) (see Equation (A.2), p. 291, of Weisberg)

assuming that e is uncorrelated with RVs involved in formation of β̂0 & β̂1, can regard U = β0 + β1x + e and V = β̂0 + β̂1x as uncorrelated RVs when conditioned on x and x1, . . . , xn

letting x+ be shorthand for x, x1, . . . , xn, we can write

Var(y - ỹ | x+) = Var(β0 + β1x + e | x+) + Var(β̂0 + β̂1x | x+)

ALR32, 33, 291

II59

Prediction: III
study pieces Var(β0 + β1x + e | x+) & Var(β̂0 + β̂1x | x+) one at a time

using fact that Var(c + U) = Var(U) for a constant c, we have

Var(β0 + β1x + e | x+) = Var(e | x+) = σ²

recall that, if U & V are correlated RVs and c is a constant, then Var(U + cV) = Var(U) + c² Var(V) + 2c Cov(U, V) (see Equation (A.3), p. 292, of Weisberg)

hence

Var(β̂0 + β̂1x | x+) = Var(β̂0 | x+) + x² Var(β̂1 | x+) + 2x Cov(β̂0, β̂1 | x+)
                    = Var(β̂0 | x1, . . . , xn) + x² Var(β̂1 | x1, . . . , xn) + 2x Cov(β̂0, β̂1 | x1, . . . , xn)

under assumption x is independent of RVs forming β̂0 & β̂1
ALR32, 33, 292

II60

Prediction: IV
expressions for Var(β̂0 | x1, . . . , xn), Var(β̂1 | x1, . . . , xn) and Cov(β̂0, β̂1 | x1, . . . , xn) can be extracted from matrix

Var(β̂ | X) = the 2 × 2 matrix with rows (Var(β̂0 | X), Cov(β̂0, β̂1 | X)) and (Cov(β̂0, β̂1 | X), Var(β̂1 | X)) = σ²(X'X)⁻¹

Exercise (unassigned): using elements of (X'X)⁻¹, show that

Var(β̂0 + β̂1x | x+) = σ² (1/n + (x - x̄)²/SXX)

above represents increase in variance of prediction error due to necessity of estimating β0 & β1, with the actual variance being

Var(y - ỹ | x+) = σ² (1 + 1/n + (x - x̄)²/SXX)
ALR32, 33, 295

II61

Prediction: V
estimating σ² by σ̂² and taking square root lead to standard error of prediction (sepred) at x:

sepred(ỹ | x+) = σ̂ (1 + 1/n + (x - x̄)²/SXX)^{1/2}

using Forbes data as an example, suppose we want to predict log10(pressure) at a hypothetical location for which boiling point of water x is somewhere between 190 and 215

prediction for log10(pressure) given boiling point x is

ỹ = β̂0 + β̂1x = -0.4216418 + 0.008956178x

(1 - α) × 100% prediction interval is set of points y in interval

ỹ - t(α/2, n - 2) sepred(ỹ | x+) ≤ y ≤ ỹ + t(α/2, n - 2) sepred(ỹ | x+)

ALR32, 33

II62

Prediction: VI
here n = 17, so, for a 99% prediction interval, we set α = 0.01 and use t(0.005, 15) = 2.947

since σ̂ = 0.003792, x̄ = 203.0 and SXX = 530.8, we have

sepred(ỹ | x+) = 0.003792 (1 + 1/17 + (x - 203.0)²/530.8)^{1/2}

solid red curves on the plot following the R sketch below depict 99% prediction interval as x sweeps from 190 to 215 (black lines show intervals assuming, unrealistically, no uncertainty in parameter estimates)

for x = 200, prediction is ỹ = 1.370, and 99% prediction interval is specified by 1.358 ≤ y ≤ 1.381

in original space, prediction is 10^ỹ = 23.42, and interval is 10^1.358 ≤ 10^y ≤ 10^1.381, i.e., 22.80 ≤ 10^y ≤ 24.05
ALR32, 33

II63
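in R, predict() produces these prediction intervals directly; a sketch assuming the Forbes fit fit <- lm(y ~ x), with x the boiling point and y = log10(pressure):

    new <- data.frame(x = 200)                          # boiling point at which to predict
    pi99 <- predict(fit, newdata = new,
                    interval = "prediction", level = 0.99)
    pi99                                                # fit ~ 1.370, lwr ~ 1.358, upr ~ 1.381
    10^pi99                                             # back on the original pressure scale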

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point with 99% prediction intervals; x-axis: Boiling point (190-215), y-axis: log(Pressure)]

II64

[Figure: Scatterplot of Pressure Versus Boiling Point with 99% prediction intervals; x-axis: Boiling point (190-215), y-axis: Pressure]

II65

Coefficient of Determination R²: I

ignoring potential predictors, best prediction of response is sample average ȳ of observed responses y1, y2, . . . , yn

for Fort Collins data, total sum of squares SYY = Σi (yi - ȳ)² is sum of squares of deviations of data from horizontal dashed line on next plot

with inclusion of predictors, unexplained variation is RSS

for Fort Collins data, RSS is sum of squares of deviations from solid line on next plot

ALR35, 36

II66

[Figure: Scatterplot of Late Snowfall Versus Early Snowfall (Fort Collins data); x-axis: Early snowfall (inches), y-axis: Late snowfall (inches)]

ALR8

II67

Coefficient of Determination R²: II

difference between SYY and RSS is called sum of squares due to regression:

SSreg = SYY - RSS

Problem 5 says that

RSS = SYY - (SXY)²/SXX

hence

SSreg = SYY - (SYY - (SXY)²/SXX) = (SXY)²/SXX

divide SSreg by SYY to get definition for coefficient of determination:

R² = SSreg/SYY = (SXY)² / (SXX SYY) = 1 - RSS/SYY

ALR35, 36

II68

Coefficient of Determination R²: III

Exercise (unassigned): R² = rxy² (squared sample correlation)

must have 0 ≤ R² ≤ 1

R² × 100 gives percentage of total sum of squares explained by regression (concept of R² generalizes to multiple regression)

examples: R² = 0.026 for Fort Collins & R² = 0.995 for Forbes

ALR35, 36

II69
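a sketch of these computations in R, assuming a fitted simple regression fit <- lm(y ~ x) with predictor x and response y in hand:

    RSS <- sum(residuals(fit)^2)
    SYY <- sum((y - mean(y))^2)
    1 - RSS/SYY                   # coefficient of determination R^2
    cor(x, y)^2                   # squared sample correlation, same value
    summary(fit)$r.squared        # R^2 as reported by lm()
    summary(fit)$adj.r.squared    # adjusted R^2 (next overhead)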

Coefficient of Determination R²: IV

R and other computer packages report both R² and a variation known as the adjusted R²:

R²adj = 1 - (RSS/df) / (SYY/(n - 1)) as compared to R² = 1 - RSS/SYY,

where df is the degrees of freedom

for simple linear regression, df = n - 2, so R²adj gets closer and closer to R² as n increases

in general, df = n minus # of parameters in mean function

R²adj is intended to facilitate comparison of models, but Weisberg notes (p. 36) there are better ways of doing so

note: R² useless if mean function does not have intercept term (e.g., regression through the origin: E(Y | X = x) = β1x)
ALR36

II70

Inadequacy of Sufficient Statistics: I


all of the data-dependent variables connected with a simple linear regression (e.g., β̂0, β̂1, σ̂², SSreg, RSS, R², etc.) can be formed using just five fundamental statistics:

x̄, ȳ, SXX, SYY and SXY

since

x̄ = (1/n) Σ_{i=1}^n xi, SXX = Σ_{i=1}^n xi² - n x̄² and SXY = Σ_{i=1}^n xiyi - n x̄ȳ

(with analogous equations for ȳ and SYY), it follows that basic linear regression analysis depends only on five so-called sufficient statistics:

Σ_{i=1}^n xi, Σ_{i=1}^n yi, Σ_{i=1}^n xi², Σ_{i=1}^n yi² and Σ_{i=1}^n xiyi

ALR293, 294, 23, 24, 25

II71


Inadequacy of Sufficient Statistics: II


under assumptions of normality and correctness of regression model, we do not in theory lose any probabilistic information by tossing away the original data (xi, yi), i = 1, . . . , n, and just keeping five sufficient statistics

reliance on sufficient statistics is dangerous in actual applications, where adequacy of basic assumptions (normality, correctness of model) is always open to question

Anscombe (1973) constructed an example of four data sets (n = 11) with sufficient statistics that are identical (to within rounding error), offering much food for thought

third data set: reconsider scheme of picking median slope amongst all possible lines determined by two distinct points
ALR12, 13

II72
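Anscombe's four data sets are built into base R as the data frame anscombe (columns x1-x4, y1-y4); a brief sketch showing that the fitted coefficients and R² are essentially identical across the quartet, even though the plots that follow look very different:

    data(anscombe)                        # built-in: columns x1..x4 and y1..y4
    fits <- lapply(1:4, function(j)
      lm(anscombe[[paste0("y", j)]] ~ anscombe[[paste0("x", j)]]))
    sapply(fits, coef)                                  # nearly identical intercepts and slopes
    sapply(fits, function(f) summary(f)$r.squared)      # nearly identical R^2 values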

[Figure: Anscombe's First Data Set; x-axis: Predictor, y-axis: Response]

ALR13

II73

[Figure: Anscombe's Second Data Set; x-axis: Predictor, y-axis: Response]

ALR13

II74

[Figure: Anscombe's Third Data Set; x-axis: Predictor, y-axis: Response]

ALR13

II75

[Figure: Anscombe's Fourth Data Set; x-axis: Predictor, y-axis: Response]

ALR13

II76

Residuals: I
looking at residuals êi is a vital step in regression analysis: can check assumptions to prevent garbage in/garbage out

basic tool is a plot of residuals versus other quantities, of which three obvious choices are:

1. residuals versus fitted values ŷi
2. residuals versus predictors xi
3. residuals versus case numbers i

special nature of certain data might suggest other plots

useful residual plot resembles a null plot when assumptions hold, and a non-null plot when some assumption fails

let's look at plots 1 to 3 using Anscombe's data sets as examples (plots follow the R sketch below)
ALR36, 37, 38

II77
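a minimal R sketch of the three plots just listed, for a fitted simple regression fit <- lm(y ~ x) with hypothetical vectors x and y:

    e <- residuals(fit)                                                   # residuals
    plot(fitted(fit), e, xlab = "Fitted values", ylab = "Residuals")      # plot 1
    abline(h = 0, lty = 2)
    plot(x, e, xlab = "Predictors", ylab = "Residuals")                   # plot 2
    abline(h = 0, lty = 2)
    plot(seq_along(e), e, xlab = "Case numbers", ylab = "Residuals")      # plot 3
    abline(h = 0, lty = 2)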

[Figure: Residuals Versus Fitted Values, Data Set #1; x-axis: Fitted values, y-axis: Residuals]

II78

[Figure: Residuals Versus Predictors, Data Set #1; x-axis: Predictors, y-axis: Residuals]

II79

[Figure: Residuals Versus Fitted Values, Data Set #2; x-axis: Fitted values, y-axis: Residuals]

II80

[Figure: Residuals Versus Predictors, Data Set #2; x-axis: Predictors, y-axis: Residuals]

II81

[Figure: Residuals Versus Fitted Values, Data Set #3; x-axis: Fitted values, y-axis: Residuals]

II82

[Figure: Residuals Versus Predictors, Data Set #3; x-axis: Predictors, y-axis: Residuals]

II83

[Figure: Residuals Versus Fitted Values, Data Set #4; x-axis: Fitted values, y-axis: Residuals]

II84

[Figure: Residuals Versus Predictors, Data Set #4; x-axis: Predictors, y-axis: Residuals]

II85

Residuals: II
Q: why is a plot of residuals versus ŷi identical to a plot of residuals versus xi after relabeling the horizontal axis?

II86

[Figure: Residuals Versus Case Numbers, Data Set #1; x-axis: Case numbers, y-axis: Residuals]

II87

[Figure: Residuals Versus Case Numbers, Data Set #2; x-axis: Case numbers, y-axis: Residuals]

II88

[Figure: Residuals Versus Case Numbers, Data Set #3; x-axis: Case numbers, y-axis: Residuals]

II89

[Figure: Residuals Versus Case Numbers, Data Set #4; x-axis: Case numbers, y-axis: Residuals]

II90

Residuals: III
although plots of êi versus i were not particularly useful for Anscombe's data, plot is useful for certain other data sets (particularly where cases are collected sequentially in time)

fourth obvious choice: plot residuals êi versus responses yi

this choice is problematic because relationship

yi = ŷi + êi

says that, if spread of ŷi's is small compared to spread of êi's, large êi's will correspond to large yi's even if model is correct

thus residuals versus responses is not a useful residual plot because it need not resemble a null plot when assumptions hold

as an example, reconsider Fort Collins data
II91

[Figure: Scatterplot of Late Snowfall Versus Early Snowfall (Fort Collins data); x-axis: Early snowfall (inches), y-axis: Late snowfall (inches)]

ALR8

II92

[Figure: Residuals Versus Fitted Values, Fort Collins Data; x-axis: Fitted values, y-axis: Residuals]

II93

[Figure: Residuals Versus Predictors, Fort Collins Data; x-axis: Predictors, y-axis: Residuals]

II94

[Figure: Residuals Versus Case Numbers, Fort Collins Data; x-axis: Case numbers, y-axis: Residuals]

II95

[Figure: Residuals Versus Responses, Fort Collins Data; x-axis: Responses, y-axis: Residuals]

II96

Residuals: IV
reconsider Forbes data, focusing first on 3 following overheads

red dashed horizontal lines on residual plot show ±σ̂

recall definition of weights ci:

β̂1 = Σi (xi - x̄)yi / SXX = Σ_{i=1}^n ci yi, where ci = (xi - x̄)/SXX

ALR36, 37, 38

II97

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point (Forbes data); x-axis: Boiling point, y-axis: log(Pressure)]

ALR6

II98

[Figure: Plot of Residuals versus Predictors for Forbes Data; x-axis: Boiling point, y-axis: Residuals]

ALR6

II99

[Figure: Weights Versus Boiling Point for Forbes Data; x-axis: Boiling point, y-axis: ci]

II100

Residuals: V
Weisberg notes that Forbes deemed this case "evidently a mistake", but perhaps just because of its appearance as an outlier

Weisberg (p. 38) shows that, if (x12, y12) is removed and regression analysis is redone on reduced data set, resulting slope estimate is virtually the same, but σ̂ and quantities that depend upon it drastically change (see overheads that follow)

to delete or not to delete: that is the question

if we don't delete, normality assumption is questionable

if we do delete, normality assumption is tenable, but no real scientific justification for doing so (open to charges of data massaging)

ALR36, 37, 38

II101

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point; x-axis: Boiling point, y-axis: log(Pressure)]

ALR6

II102

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point; x-axis: Boiling point, y-axis: log(Pressure)]

II103

[Figure: Plot of Residuals versus Predictors for Forbes Data; x-axis: Boiling point, y-axis: Residuals]

ALR6

II104

[Figure: Plot of Residuals versus Predictors for Forbes Data; x-axis: Boiling point, y-axis: Residuals]

II105

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point; x-axis: Boiling point (190-215), y-axis: log(Pressure)]

II106

[Figure: Scatterplot of log10(Pressure) Versus Boiling Point; x-axis: Boiling point (190-215), y-axis: log(Pressure)]

II107

Main Points: I
given a response Y and a predictor X, simple linear regression assumes (1) a linear mean function

E(Y | X = x) = β0 + β1x,

where β0 (intercept term) and β1 (slope term) are unknown parameters (constants), and (2) a constant variance function

Var(Y | X = x) = σ²,

where σ² > 0 is a third unknown parameter

simple linear regression model can also be written as

Y = E(Y | X = x) + e = β0 + β1x + e,

where e is a statistical error, a random variable (RV) such that E(e) = 0 and Var(e) = σ²

ALR21, 292, 293

II108

Main Points: II
let (xi, yi), i = 1, . . . , n, be RVs obeying Y = β0 + β1x + e (predictor/response data are realizations of these 2n RVs)

for the ith case, have yi = β0 + β1xi + ei

errors e1, . . . , en are independent RVs such that E(ei | xi) = 0

model for data can also be written in matrix notation as

Y = Xβ + e,

where Y = (y1, y2, . . . , yn)', X is the n × 2 matrix whose ith row is (1, xi), β = (β0, β1)' and e = (e1, e2, . . . , en)'

ALR21, 29, 63, 64

II109

Main Points: III


given sample means

x̄ = (1/n) Σ_{i=1}^n xi and ȳ = (1/n) Σ_{i=1}^n yi

and sample cross products and sums of squares

SXY = Σ_{i=1}^n (xi - x̄)(yi - ȳ), SXX = Σ_{i=1}^n (xi - x̄)² & SYY = Σ_{i=1}^n (yi - ȳ)²,

can form least squares estimators for parameters β1 and β0:

β̂1 = SXY/SXX and β̂0 = ȳ - β̂1x̄

corresponding estimator for error variance σ² is

σ̂² = RSS/(n - 2), where RSS = Σ_{i=1}^n [yi - (β̂0 + β̂1xi)]² = SYY - (SXY)²/SXX

ALR293, 294, 24, 25

II110

Main Points: IV
letting

RSS(b0, b1) = Σ_{i=1}^n [yi - (b0 + b1xi)]²,

least squares estimators β̂0 and β̂1 are choices for b0 and b1 such that RSS(b0, b1) is minimized

fitted values ŷi and residuals êi are defined as

ŷi = β̂0 + β̂1xi and êi = yi - (β̂0 + β̂1xi) = yi - ŷi,

in terms of which we have

RSS = RSS(β̂0, β̂1) = Σ_{i=1}^n êi² and σ̂² = Σi êi² / (n - 2)

ALR24, 22, 23

II111

Main Points: V
in matrix notation, least squares estimator β̂ of β is such that X'Xβ̂ = X'Y, i.e., β̂ is solution to normal equations X'Xb = X'Y

2 × 2 matrix X'X has an inverse as long as SXX ≠ 0, so

β̂ = (X'X)⁻¹X'Y

since E(β̂) = β, estimators β̂0 & β̂1 are unbiased, as is σ̂² also:

E(β̂0) = β0, E(β̂1) = β1 and E(σ̂²) = σ²

also have Var(β̂ | X) = σ²(X'X)⁻¹, leading us to deduce

Var(β̂1 | X) = σ²/SXX, which can be estimated by V̂ar(β̂1 | X) = σ̂²/SXX,

the square root of which is se(β̂1), the standard error of β̂1

ALR304, 305, 61, 62, 63, 27, 28

II112

Main Points: VI
can test null hypothesis (NH) that β1 = 0 by forming t-statistic t = β̂1/se(β̂1) and comparing it to percentage points t(α, n - 2) for t-distribution with n - 2 degrees of freedom, with a large value of |t| giving evidence against NH via a small p-value

(1 - α) × 100% confidence interval for β1 is set of points in interval whose end points are

β̂1 - t(α/2, n - 2) se(β̂1) and β̂1 + t(α/2, n - 2) se(β̂1)

ALR31

II113

Main Points: VII


can predict a yet-to-be-observed response y given a setting x for the predictor using ỹ = β̂0 + β̂1x, which has a standard error given by

sepred(ỹ | x+) = σ̂ (1 + 1/n + (x - x̄)²/SXX)^{1/2},

where x+ denotes x along with original predictors x1, . . . , xn

(1 - α) × 100% prediction interval constitutes all values from

ỹ - t(α/2, n - 2) sepred(ỹ | x+) to ỹ + t(α/2, n - 2) sepred(ỹ | x+)

ALR32, 33

II114

Main Points: VIII


difference between residual sum of squares SYY for model

E(Y | X = x) = β0

and residual sum of squares RSS for model

E(Y | X = x) = β0 + β1x

is sum of squares due to regression:

SSreg = SYY - RSS = (SXY)²/SXX

large value for SSreg indicates β1 is valuable to include

coefficient of determination is R² = SSreg/SYY

R² × 100 gives percentage of total sum of squares explained by linear regression (must have 0 ≤ R² ≤ 1)
ALR35, 36

II115

Main Points: IX
plots of residuals êi are invaluable for assessing reasonableness of fitted model (a point that cannot be emphasized too much)

standard plot is residuals êi versus fitted values ŷi, which is equivalent to êi's versus predictors xi

plot of residuals versus case number i is potentially (but not always) useful

do not plot residuals versus responses yi (misleading!)

failure to plot residuals is potentially bad for your health!

Thou Shalt Plot Residuals (a proposed 11th commandment!)

ALR36, 37, 38

II116

Additional References
F. Mosteller, A.F. Siegel, E. Trapido and C. Youtz (1981), Eye Fitting Straight Lines, The American Statistician, 35, pp. 150-152
C.R. Rao (1973), Linear Statistical Inference and Its Applications (Second Edition), New York: John Wiley & Sons, Inc.
G.A.F. Seber (1977), Linear Regression Analysis, New York: John Wiley & Sons, Inc.

II117
