
MTH5120 Statistical Modelling I, 2015/2016

WEEK 2

Outline

- Assessing the Model
  - Analysis of Variance
  - F test
  - Estimating σ²
  - Coefficient of Determination
  - Minitab Example
- Residuals
  - Crude Residuals
  - Standardized/Studentized Residuals
- Residuals Diagnostics
- Inference about the regression parameters
  - Example: Overheads

Assessing the Model
Analysis of Variance

Parameter estimates obtained for the model

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

can be used to estimate the mean response corresponding to each variable $Y_i$. That is,

$$\widehat{E(Y_i)} = \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i, \quad i = 1, \ldots, n.$$

Values of $\widehat{E(Y_i)}$ for a given data set $(x_i, y_i)$ are called fitted values and are denoted by $\hat{y}_i$.

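Computationally, the estimates and fitted values follow directly from the usual least squares formulas $\hat{\beta}_1 = S_{xy}/S_{xx}$ and $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$. A minimal sketch in Python (NumPy), on made-up illustrative data rather than any course data set:

import numpy as np

# Hypothetical illustrative data (not the course data)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
y = np.array([1.1, 2.0, 2.7, 3.9, 4.8, 5.9])

Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

beta1_hat = Sxy / Sxx                        # slope estimate
beta0_hat = y.mean() - beta1_hat * x.mean()  # intercept estimate

y_hat = beta0_hat + beta1_hat * x            # fitted values on the line
print(beta0_hat, beta1_hat, y_hat)
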
Assessing the Model
Analysis of Variance

- They are points on the fitted regression line corresponding to the values $x_i$.
- The observed values $y_i$ usually do not fall exactly on the line and so are usually not equal to the fitted values $\hat{y}_i$, as is shown in the figure below.

[Figure: scatter plot of $y$ against $x$ with the fitted regression line.]

Assessing the Model
Analysis of Variance

The residuals (also called crude residuals) are defined as

$$e_i := Y_i - \hat{Y}_i, \quad i = 1, \ldots, n.$$

These are estimators of the random errors $\varepsilon_i$. Thus

$$e_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) = Y_i - (\bar{Y} - \hat{\beta}_1 \bar{x} + \hat{\beta}_1 x_i) = Y_i - \bar{Y} - \hat{\beta}_1 (x_i - \bar{x}).$$

We shall use the following identity:

$$\sum_{i=1}^n e_i = \sum_{i=1}^n (Y_i - \bar{Y}) - \hat{\beta}_1 \sum_{i=1}^n (x_i - \bar{x}) = 0.$$

Assessing the Model
Analysis of Variance

- Note that the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ minimize the function $S(\beta_0, \beta_1)$.
- The minimum is called the Residual Sum of Squares and is denoted by $SS_E$.

That is,

$$SS_E = \sum_{i=1}^n [Y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]^2 = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n e_i^2.$$

Assessing the Model
Analysis of Variance

Consider the constant model

$$Y_i = \beta_0 + \varepsilon_i.$$

[Figure: scatter plot of $Y$ against $X$ with the fitted constant (horizontal) line.]

Assessing the Model
Analysis of Variance

For this model $\hat{\beta}_0 = \bar{Y}$ and we have

$$\hat{Y}_i = \bar{Y}, \qquad e_i = Y_i - \hat{Y}_i = Y_i - \bar{Y}$$

and

$$SS_E = SS_T = \sum_{i=1}^n (Y_i - \bar{Y})^2.$$

This sum is called the Total Sum of Squares and is denoted by $SS_T$. For the constant model $SS_E = SS_T$.

Assessing the Model
Analysis of Variance

When the model is not constant, the difference $Y_i - \bar{Y}$ can be split into two components: one due to the regression model fit and one due to the residuals. That is,

$$Y_i - \bar{Y} = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y}).$$

For a given data set this can be represented as follows.

[Figure: scatter plot with the fitted line, showing the observed $y_{(14)}$ and the fitted $\hat{y}_{(14)}$ at $x = 14$ and the split of the deviation from $\bar{y}$ into its two components.]

Assessing the Model
Analysis of Variance Identity

Theorem
In the simple linear regression model the total sum of squares is the sum of the regression sum of squares and the residual sum of squares, that is,

$$SS_T = SS_R + SS_E,$$

where

$$SS_T = \sum_{i=1}^n (Y_i - \bar{Y})^2, \qquad SS_R = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2, \qquad SS_E = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2.$$

Assessing the Model
Analysis of Variance Identity

Proof

$$SS_T = \sum_{i=1}^n (Y_i - \bar{Y})^2 = \sum_{i=1}^n [(Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})]^2$$

$$= \sum_{i=1}^n [(Y_i - \hat{Y}_i)^2 + (\hat{Y}_i - \bar{Y})^2 + 2(Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y})] = SS_E + SS_R + 2A,$$

where

Assessing the Model
Analysis of Variance Identity

$$A = \sum_{i=1}^n (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) = \sum_{i=1}^n (Y_i - \hat{Y}_i)\hat{Y}_i - \bar{Y}\sum_{i=1}^n (Y_i - \hat{Y}_i)$$

$$= \sum_{i=1}^n e_i \hat{Y}_i - \bar{Y}\underbrace{\sum_{i=1}^n e_i}_{=0} = \sum_{i=1}^n e_i(\hat{\beta}_0 + \hat{\beta}_1 x_i) = \hat{\beta}_0\underbrace{\sum_{i=1}^n e_i}_{=0} + \hat{\beta}_1\underbrace{\sum_{i=1}^n e_i x_i}_{=0}.$$

Hence $A = 0$. ∎

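The identity (and the fact that the residuals sum to zero) is easy to verify numerically. A minimal Python check on hypothetical data; any least squares fit satisfies it:

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])   # hypothetical data
y = np.array([1.1, 2.0, 2.7, 3.9, 4.8, 5.9])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)                            # crude residuals

SST = np.sum((y - y.mean()) ** 2)
SSR = np.sum((b0 + b1 * x - y.mean()) ** 2)
SSE = np.sum(e ** 2)

print(np.isclose(SST, SSR + SSE))  # True: the ANOVA identity
print(np.isclose(np.sum(e), 0.0))  # True: residuals sum to zero
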
Assessing the Model
Analysis of Variance

For a given data set:

- $SS_R$ represents the variability in the observations $Y_i$ accounted for by the fitted model.
- $SS_E$ represents the variability in $Y_i$ accounted for by the differences between the observations and the fitted values.
- $SS_T$ represents the total variability in $Y_i$.

Assessing the Model
ANOVA Table

The split of the sources of variability is customarily presented in a table called the ANOVA table.

ANOVA table
Source of variation   d.f.         SS     MS                  VR
Regression            ν_R = 1      SS_R   MS_R = SS_R / ν_R   MS_R / MS_E
Residual              ν_E = n − 2  SS_E   MS_E = SS_E / ν_E
Total                 ν_T = n − 1  SS_T

The table shows the sources of variation, the sums of squares and the statistic, based on the sums of squares, for testing the significance of the regression slope.

Assessing the Model
ANOVA Table

The abbreviation d.f. is short for degrees of freedom, the number of independent pieces of information used for the estimation of each of the sums of squares.

The Mean Squares (MS) are measures of the average variation in each source.

The Variance Ratio

$$VR = \frac{MS_R}{MS_E}$$

measures the variation explained by the model fit relative to the variation due to the residuals.

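The table entries are straightforward to compute from a fit. A sketch in Python (NumPy/SciPy); the function name anova_table is just for illustration:

import numpy as np
from scipy import stats

def anova_table(x, y):
    """Return the ANOVA entries for a simple linear regression fit."""
    n = len(x)
    Sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * x.mean()
    fitted = b0 + b1 * x
    SSR = np.sum((fitted - y.mean()) ** 2)   # regression SS, 1 d.f.
    SSE = np.sum((y - fitted) ** 2)          # residual SS, n - 2 d.f.
    MSR, MSE = SSR / 1.0, SSE / (n - 2)
    VR = MSR / MSE                           # the variance ratio
    p = stats.f.sf(VR, 1, n - 2)             # upper-tail F probability
    return {"SSR": SSR, "SSE": SSE, "SST": SSR + SSE,
            "MSR": MSR, "MSE": MSE, "VR": VR, "p": p}
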
Assessing the Model
F-test

The mean squares are functions of the random variables $Y_i$ and so is their ratio. We denote it by $F$. We will see later that if $\beta_1 = 0$, then

$$F = \frac{MS_R}{MS_E} \sim F_{1,n-2}.$$

Thus, to test the null hypothesis

$$H_0: \beta_1 = 0$$

versus the alternative

$$H_1: \beta_1 \neq 0,$$

we use the variance ratio $F$ as the test statistic. Under $H_0$ the ratio has an $F$ distribution with 1 and $n - 2$ degrees of freedom.

Assessing the Model
F-test

We reject $H_0$ at a significance level $\alpha$ if

$$F_{cal} > F_{\alpha;1,n-2},$$

where $F_{cal}$ denotes the value of the variance ratio $F$ calculated for a given data set and $F_{\alpha;1,n-2}$ is such that

$$P(F > F_{\alpha;1,n-2}) = \alpha.$$

There is no evidence to reject $H_0$ if $F_{cal} < F_{\alpha;1,n-2}$.

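Statistical software gives both the critical value and the p-value. A sketch with SciPy, using the F value and sample size from the overheads example later in these notes:

from scipy import stats

F_cal, n, alpha = 23.46, 16, 0.05          # overheads example values
F_crit = stats.f.ppf(1 - alpha, 1, n - 2)  # F_{alpha;1,n-2}
p_value = stats.f.sf(F_cal, 1, n - 2)      # P(F > F_cal) under H0

print(F_crit, F_cal > F_crit)  # reject H0 at level alpha if True
print(p_value)
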
Assessing the Model
F-test

Rejecting $H_0$ means that the slope $\beta_1 \neq 0$ and the full regression model

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

is better than the constant model

$$Y_i = \beta_0 + \varepsilon_i.$$

Assessing the Model
Estimating σ²

Theorem
In the full simple linear regression model we have

$$E(SS_E) = (n - 2)\sigma^2.$$

From the theorem we obtain

$$E(MS_E) = E\left(\frac{1}{n-2} SS_E\right) = \sigma^2,$$

so $MS_E$ is an unbiased estimator of $\sigma^2$. It is often denoted by $S^2$.

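This unbiasedness is easy to illustrate by simulation: generate many samples from a model with known $\sigma^2$ and average the resulting values of $MS_E$. A sketch with assumed (arbitrary) true parameter values:

import numpy as np

rng = np.random.default_rng(0)
n, beta0, beta1, sigma = 30, 2.0, 0.5, 1.5   # assumed true values
x = np.linspace(0.0, 10.0, n)

mse = []
for _ in range(10000):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    sse = np.sum((y - b0 - b1 * x) ** 2)
    mse.append(sse / (n - 2))

print(np.mean(mse), sigma ** 2)   # the average is close to sigma^2 = 2.25
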
Assessing the Model
Estimating σ²

Notice that in the full model $S^2$ is not the sample variance. We have

$$S^2 = MS_E = \frac{1}{n-2} \sum_{i=1}^n \left(Y_i - \widehat{E(Y_i)}\right)^2, \quad \text{where } \widehat{E(Y_i)} = \hat{\beta}_0 + \hat{\beta}_1 x_i.$$

It is the sample variance in the constant (null) model, where $\widehat{E(Y_i)} = \hat{\beta}_0 = \bar{Y}$ and $\nu_E = n - 1$. Then

$$S^2 = \frac{1}{n-1} \sum_{i=1}^n (Y_i - \bar{Y})^2.$$

Assessing the Model
Coefficient of Determination R²

$R^2$ is the percentage of the total variation in the data explained by the fitted model. That is,

$$R^2 = \frac{SS_R}{SS_T} \cdot 100\% = \frac{SS_T - SS_E}{SS_T} \cdot 100\% = \left(1 - \frac{SS_E}{SS_T}\right) \cdot 100\%.$$

A computational sketch is given after the notes below.

Note:
- $R^2 \in [0, 100]$.
- $R^2 = 0$ indicates that none of the variability in the response is explained by the regression model.
- $R^2 = 100$ indicates that $SS_E = 0$ and all observations fall on the fitted line exactly.
- A small value of $R^2$ does not always imply a poor relationship between $Y$ and $X$, which may, for example, follow another model.

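A direct computation of $R^2$ from a fit, sketched in Python (the helper name is illustrative):

import numpy as np

def r_squared_percent(y, fitted):
    """R^2 as a percentage: 100 * (1 - SSE/SST)."""
    SSE = np.sum((y - fitted) ** 2)
    SST = np.sum((y - y.mean()) ** 2)
    return 100.0 * (1.0 - SSE / SST)

# Usage: r_squared_percent(y, beta0_hat + beta1_hat * x) for a fitted model.
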
Assessing the Model
Example: Sparrows' Wings continued

[Figure: fitted line plot of wing length against age for the sparrows' wings data.]

Assessing the Model
Example: Sparrows' Wings continued

The regression equation is
yi = 0.550 + 0.303 xi

S = 0.119189   R-Sq = 99.1%   R-Sq(adj) = 99.1%

Analysis of Variance
Source          DF      SS      MS        F     P
Regression       1  82.216  82.216  5787.39  0.000
Residual Error  53   0.753   0.014
Total           54  82.969

Assessing the Model
Example: Sparrows' Wings continued

Comments:
- We fitted a simple linear model of the form
  $$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, 55, \quad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2).$$
- The estimated values of the parameters are
  - intercept: $\hat{\beta}_0 \approx 0.550$
  - slope: $\hat{\beta}_1 \approx 0.303$

Assessing the Model
Example: Sparrows' Wings continued

The ANOVA table shows the significance of the regression (slope); that is, the null hypothesis

$$H_0: \beta_1 = 0$$

versus the alternative

$$H_1: \beta_1 \neq 0$$

can be rejected at the significance level $\alpha = 0.001$ ($p \approx 0.000$).

- The test requires the assumptions of normality and of constant variance of the random errors.
- It should be checked whether the assumptions are approximately met.
- If not, the tests may not be valid.

Assessing the Model
Example: Sparrows' Wings continued

- The value of $R^2$ is very high: $R^2 = 99.1\%$.
- This means that the fitted model explains the variability in the observed responses very well.
- The graph shows that the observations lie along the fitted line and there are no strange points which are far from the line or which could strongly affect the slope.

Assessing the Model
Example: Sparrows' Wings continued

Final conclusions:
We can conclude that the data indicate that the length of sparrows' wings depends linearly on their age (within the range 3–18 days). The mean increase in wing length per day is estimated as $\hat{\beta}_1 \approx 0.303$ cm.

However, it might be wrong to predict the length or its increase per day outside the range of the observed time. We would expect the growth to slow down over time, so that the relationship becomes non-linear.

Residuals
Crude Residuals

We defined the residuals as

$$e_i = Y_i - \hat{Y}_i.$$

These are often called crude residuals. We have seen that

$$\sum_{i=1}^n e_i = 0.$$

What is the distribution of the crude residuals?

Residuals
Crude Residuals

Expectation:

$$E[e_i] = E[Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i] = E[Y_i] - E[\hat{\beta}_0] - x_i E[\hat{\beta}_1] = \beta_0 + \beta_1 x_i - \beta_0 - \beta_1 x_i = 0.$$

Variance:

$$\operatorname{var}[e_i] = \sigma^2\left[1 - \left(\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}\right)\right] =: \sigma^2(1 - h_{ii}).$$

The derivation is shown below.

Residuals
Crude Residuals

$$e_i = Y_i - \hat{Y}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i = Y_i - \bar{Y} - \hat{\beta}_1 (x_i - \bar{x})$$

$$= Y_i - \bar{Y} - \sum_{j=1}^n c_j Y_j (x_i - \bar{x}) = Y_i - \frac{1}{n}\sum_{j=1}^n Y_j - \sum_{j=1}^n c_j Y_j (x_i - \bar{x})$$

$$= Y_i - \sum_{j=1}^n \left[\frac{1}{n} + (x_i - \bar{x}) c_j\right] Y_j = \left[1 - \frac{1}{n} - (x_i - \bar{x}) c_i\right] Y_i - \sum_{j \neq i} \left[\frac{1}{n} + (x_i - \bar{x}) c_j\right] Y_j,$$

where, as before, $c_j = (x_j - \bar{x})/S_{xx}$, so that $\hat{\beta}_1 = \sum_{j=1}^n c_j Y_j$.

Residuals
Crude Residuals

This is a linear combination of the independent random variables $Y_j$, so its variance is the sum of the variances of the $Y_j$ multiplied by the squared coefficients.

Furthermore, $\operatorname{var}(Y_j) = \sigma^2$, $j = 1, \ldots, n$.

Residuals
Crude Residuals

Hence,

$$\operatorname{var}(e_i) = \sigma^2\left[1 - \frac{1}{n} - (x_i - \bar{x}) c_i\right]^2 + \sigma^2 \sum_{j \neq i} \left[\frac{1}{n} + (x_i - \bar{x}) c_j\right]^2$$

$$= \sigma^2\left\{1 - 2\left[\frac{1}{n} + (x_i - \bar{x}) c_i\right] + \left[\frac{1}{n} + (x_i - \bar{x}) c_i\right]^2 + \sum_{j \neq i}\left[\frac{1}{n} + (x_i - \bar{x}) c_j\right]^2\right\}$$

$$= \sigma^2\left\{1 - 2\left[\frac{1}{n} + (x_i - \bar{x}) c_i\right] + \sum_{j=1}^n \left[\frac{1}{n} + (x_i - \bar{x}) c_j\right]^2\right\}$$

$$= \cdots = \sigma^2\left[1 - \left(\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}\right)\right]$$

by $\sum_{j=1}^n c_j = 0$ and $\sum_{j=1}^n c_j^2 = \frac{1}{S_{xx}}$.

Residuals
Crude Residuals

- Note that the variance depends on $i$; that is, $\operatorname{var}(e_i)$ is not constant, unlike $\operatorname{var}(\varepsilon_i)$.
- Similarly, it can be shown that the covariance of two residuals $e_i$ and $e_j$ is
  $$\operatorname{cov}[e_i, e_j] = -\sigma^2\left(\frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{S_{xx}}\right) = -\sigma^2 h_{ij}.$$
- Also, the residuals are normally distributed (as linear combinations of normally distributed random variables), but they are not independent.
- We know that $\operatorname{var}[\varepsilon_i] = \sigma^2$ and $\operatorname{cov}[\varepsilon_i, \varepsilon_j] = 0$.
- So the crude residuals $e_i$ do not quite follow the properties of the $\varepsilon_i$.

Residuals
Standardized/Studentized Residuals

To standardize the residuals we calculate

$$d_i = \frac{e_i - E(e_i)}{\sqrt{\operatorname{var} e_i}} = \frac{e_i}{\sqrt{\sigma^2(1 - h_{ii})}}.$$

Then

$$d_i \sim N(0, 1).$$

They are not independent, though for large samples the correlation should be small.

Residuals
Standardized/Studentized Residuals

- However, we do not know $\sigma^2$.
- If we replace $\sigma^2$ by $S^2$ we get the so-called studentized residuals (in Minitab they are called standardized residuals; see the sketch after this list),
  $$r_i = \frac{e_i}{\sqrt{S^2(1 - h_{ii})}}.$$
- For large samples they approximate the standard $d_i$.

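Putting the pieces together, a computational sketch in Python; these are the internally studentized residuals, which Minitab labels "standardized":

import numpy as np

def studentized_residuals(x, y):
    """Internally studentized residuals r_i = e_i / sqrt(S^2 (1 - h_ii))."""
    n = len(x)
    Sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * x.mean()
    e = y - (b0 + b1 * x)                      # crude residuals
    h = 1.0 / n + (x - x.mean()) ** 2 / Sxx    # leverages h_ii
    S2 = np.sum(e ** 2) / (n - 2)              # MSE, the estimate of sigma^2
    return e / np.sqrt(S2 * (1.0 - h))
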
Residuals Diagnostics
Residual Plots

To check for constant variance (homoscedasticity) and also for linearity, we plot $r_i$ against $x_i$, as shown below, or against $\hat{y}_i$, as shown in the next set of figures.

[Figures (a) and (b): studentized residuals plotted against $x_i$.]
(a) No problem apparent.
(b) Clear non-linearity.

Residuals Diagnostics
Residual Plots

[Figures (a) and (b): studentized residuals plotted against fitted values $\hat{y}_i$.]
(a) No problem apparent.
(b) Variance increases as the mean response increases.

Residuals Diagnostics
Histograms of Simulated Data from Four Different Distributions

[Figure: four histograms of data simulated from four different distributions.]

Residuals Diagnostics
Cumulative Distributions: Empirical and Predicted

- Various tests of normality are based on the comparison of the empirical distribution and the predicted normal distribution.
- For example, the Ryan-Joiner test is based on the correlation between the two.

Residuals Diagnostics
Normal Probability Plot

- To check whether the distribution of the residuals follows the symmetric shape of the normal distribution we can draw a so-called Normal Probability Plot.
- It plots each value of the ordered residuals vs. the percentage of values in the sample that are less than or equal to it, along a fitted distribution line.
- The scales are transformed so that the fitted distribution forms a straight line.
- A plot that departs substantially from linearity suggests that the error distribution is not normal. A small sketch for producing such a plot follows this list.

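Most packages draw this plot directly. A minimal sketch using SciPy's probplot, on simulated residuals standing in for model residuals:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
resid = rng.normal(size=100)   # stand-in for the model residuals

# Ordered residuals are plotted against normal quantiles; points close to
# the fitted straight line support the normality assumption.
stats.probplot(resid, dist="norm", plot=plt)
plt.show()
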
Residuals Diagnostics
Normal Probability Plot: Data from Normal Distribution

[Figures (a) and (b).]
(a) Histogram of data simulated from the standard normal distribution.
(b) Normal Probability Plot; no problem apparent.

Residuals Diagnostics
Normal Probability Plot: Data from Log-Normal Distribution

[Figures (a) and (b).]
(a) Histogram of data simulated from the standard log-normal distribution.
(b) Normal Probability Plot indicates skewness of the distribution.

Residuals Diagnostics
Normal Probability Plot: Data from Beta Distribution

[Figures (a) and (b).]
(a) Histogram of data simulated from a Beta(0,1) distribution.
(b) Normal Probability Plot indicates light tails.

Residuals Diagnostics
Normal Probability Plot: Data from Student t-Distribution

[Figures (a) and (b).]
(a) Histogram of data simulated from a t-distribution.
(b) Normal Probability Plot indicates heavy tails.

Residuals Diagnostics
Normal Probability Plot: Sparrows' Wings

The Normal Probability Plot does not indicate any apparent problems with normality of the residuals.

MINITAB
Stat → Basic Statistics → Normality Test...

Inference about the regression parameters
Example: Overheads

A company builds custom electronic instruments and computer components. All jobs are manufactured to customer specifications. The firm wants to be able to estimate its overhead cost. As part of a preliminary investigation, the firm decides to focus on a particular department and investigates the relationship between total departmental overhead cost (Y) and total direct labor hours (X).

Inference about the regression parameters
Example: Overheads

Two objectives of this investigation are:

1. to summarize the relationship between total departmental overhead and total direct labor hours;
2. to estimate the expected and to predict the actual total departmental overhead from the total direct labor hours.

Inference about the regression parameters
Example: Overheads

The regression equation is
Ovhd = 16310 + 11.0 Labor

Predictor  Coef    SE Coef  T     P
Constant   16310   2421     6.74  0.000
Labor      10.982  2.268    4.84  0.000

S = 1645.61   R-Sq = 62.6%   R-Sq(adj) = 60.0%

Analysis of Variance
Source          DF  SS         MS        F      P
Regression       1  63517077   63517077  23.46  0.000
Residual Error  14  37912232   2708017
Total           15  101429309

MINITAB
Stat → Regression → Regression...

Inference about the regression parameters
Example: Overheads

[Figures (a) and (b).]
(a) The Residuals versus Fits plot does not contradict a constant variance nor the linearity of the model.
(b) The Normal Probability Plot does not contradict the normality assumption.

Inference about the regression parameters
Example: Overheads

Comments:
- The model fit is $\hat{y}_i = 16310 + 11 x_i$.
- There is a significant relationship between the overheads and the labor hours ($p < 0.001$ in the ANOVA).
- An increase of labor hours by 1 will increase the mean overheads by about £11.
- There is rather large variability in the data; moreover, the percentage of the total variation explained by the model is rather small ($R^2 = 62.6\%$).
- Hence the question is: how accurate is the estimate of the slope?

Inference about the regression parameters
Inference about β1

Theorem
In the full simple linear regression model (SLRM) the distribution of the LSE of $\beta_1$, $\hat{\beta}_1$, is normal with expectation $E(\hat{\beta}_1) = \beta_1$ and variance $\operatorname{var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}$, that is,

$$\hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right).$$

Remark
For large samples, where there is no assumption of normality of $Y_i$, the sampling distribution of $\hat{\beta}_1$ is approximately normal.

Inference about the regression parameters
Inference about β1

- The theorem allows us to derive a confidence interval (CI) for $\beta_1$ and a test of non-significance for $\beta_1$.
- After standardization of $\hat{\beta}_1$ we obtain
  $$\frac{\hat{\beta}_1 - \beta_1}{\sigma/\sqrt{S_{xx}}} \sim N(0, 1).$$
- However, the error variance is usually not known and it is replaced by its estimator.
- Then the normal distribution changes to a Student t-distribution.

Inference about the regression parameters
Inference about β1

Lemma
If $Z \sim N(0, 1)$ and $U \sim \chi^2_\nu$, and $Z$ and $U$ are independent, then

$$\frac{Z}{\sqrt{U/\nu}} \sim t_\nu.$$

Here we have

$$Z = \frac{\hat{\beta}_1 - \beta_1}{\sigma/\sqrt{S_{xx}}} \sim N(0, 1).$$

We will see later that

$$U = \frac{(n-2)S^2}{\sigma^2} \sim \chi^2_{n-2}$$

and that $S^2$ and $\hat{\beta}_1$ are independent.

Inference about the regression parameters
Inference about β1

It follows that

$$T = \frac{\dfrac{\hat{\beta}_1 - \beta_1}{\sigma/\sqrt{S_{xx}}}}{\sqrt{\dfrac{(n-2)S^2}{\sigma^2(n-2)}}} = \frac{\hat{\beta}_1 - \beta_1}{S/\sqrt{S_{xx}}} \sim t_{n-2}.$$

Inference about the regression parameters
Inference about β1: Confidence Interval

To find a CI for an unknown parameter $\theta$ means to find values of the boundaries $A$ and $B$ which satisfy

$$P(A < \theta < B) = 1 - \alpha$$

for some small $\alpha$, that is, for a high confidence level $(1 - \alpha)100\%$.

Here we have

$$P\left(-t_{\frac{\alpha}{2},n-2} < \frac{\hat{\beta}_1 - \beta_1}{S/\sqrt{S_{xx}}} < t_{\frac{\alpha}{2},n-2}\right) = 1 - \alpha,$$

where $t_{\frac{\alpha}{2},n-2}$ is such that $P(|T| < t_{\frac{\alpha}{2},n-2}) = 1 - \alpha$.

Inference about the regression parameters
Inference about β1: Confidence Interval

This gives

$$P\left(\hat{\beta}_1 - t_{\frac{\alpha}{2},n-2}\frac{S}{\sqrt{S_{xx}}} < \beta_1 < \hat{\beta}_1 + t_{\frac{\alpha}{2},n-2}\frac{S}{\sqrt{S_{xx}}}\right) = 1 - \alpha.$$

That is, the CI for $\beta_1$ is

$$[A, B] = \left[\hat{\beta}_1 - t_{\frac{\alpha}{2},n-2}\frac{S}{\sqrt{S_{xx}}},\; \hat{\beta}_1 + t_{\frac{\alpha}{2},n-2}\frac{S}{\sqrt{S_{xx}}}\right].$$

Inference about the regression parameters
Inference about β1: Confidence Interval

Example continued
For the given data we obtained the following values of $\hat{\beta}_1$, $S$ and $S_{xx}$ for the overhead costs:

$$\hat{\beta}_1 = 10.982, \quad S = 1645.61, \quad S_{xx} = 526656.9.$$

Also $t_{0.025,14} = 2.14479$. Hence, the 95% CI for $\beta_1$ is

$$\left[10.982 - 2.14479\frac{1645.61}{\sqrt{526656.9}},\; 10.982 + 2.14479\frac{1645.61}{\sqrt{526656.9}}\right] = [6.11851, 15.8455].$$

We would expect (with 95% confidence) that a one-hour increase in labour will increase the mean cost by between £6.12 and £15.85.

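The same interval can be reproduced in a couple of lines; a sketch with SciPy using the values quoted above:

import numpy as np
from scipy import stats

b1, S, Sxx, n = 10.982, 1645.61, 526656.9, 16  # overheads example values
t = stats.t.ppf(1 - 0.025, n - 2)              # t_{0.025,14} = 2.14479
half = t * S / np.sqrt(Sxx)

print(b1 - half, b1 + half)                    # about [6.12, 15.85]
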
Inference about the regression parameters
Inference about β1: Test of H0: β1 = 0

The null hypothesis $H_0: \beta_1 = 0$ means that the slope is zero and a better model is the constant model

$$Y_i = \beta_0 + \varepsilon_i, \quad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2),$$

showing no relationship between $Y$ and $X$. We have

$$T = \frac{\hat{\beta}_1 - \beta_1}{S/\sqrt{S_{xx}}} \sim t_{n-2}.$$

Hence, if $H_0$ is true, then

$$T = \frac{\hat{\beta}_1}{S/\sqrt{S_{xx}}} \overset{H_0}{\sim} t_{n-2}.$$

Inference about the regression parameters
Inference about β1: Test of H0: β1 = 0

We reject $H_0$ at a significance level $\alpha$ when, for a given data set, the calculated value of the test statistic, $T_{cal}$, is in the rejection region, that is,

$$|T_{cal}| > t_{\frac{\alpha}{2},n-2}.$$

This is equivalent to the F-test, since if the random variable $W \sim t_\nu$ then $W^2 \sim F_{1,\nu}$.

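A sketch of this test for the overheads data, using the coefficient and standard error from the Minitab output; squaring $T_{cal}$ recovers the $F$ statistic from the ANOVA table, illustrating the equivalence:

from scipy import stats

b1, se_b1, n = 10.982, 2.268, 16     # from the overheads Minitab output
T_cal = b1 / se_b1                   # = 4.84, as printed by Minitab
t_crit = stats.t.ppf(1 - 0.025, n - 2)
p_value = 2 * stats.t.sf(abs(T_cal), n - 2)

print(abs(T_cal) > t_crit, p_value)  # reject H0 at the 5% level
print(T_cal ** 2)                    # about 23.4, the ANOVA F statistic
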
Inference about the regression parameters
Inference about β1: Test of H0: β1 = 0

Remark
The square root of the variance $\operatorname{var}(\hat{\beta}_1)$ is called the standard error of $\hat{\beta}_1$ and is denoted by $se(\hat{\beta}_1)$. That is,

$$se(\hat{\beta}_1) = \sqrt{\frac{\sigma^2}{S_{xx}}}.$$

Its estimator is

$$\widehat{se}(\hat{\beta}_1) = \sqrt{\frac{S^2}{S_{xx}}}.$$

Often this estimated standard error is also simply called the standard error. You should be aware of the difference between the two.

Inference about the regression parameters
Inference about β1: Test of H0: β1 = 0

Remark
Note that the $(1 - \alpha)100\%$ CI for $\beta_1$ can be written as

$$\left[\hat{\beta}_1 - t_{\frac{\alpha}{2},n-2}\,\widehat{se}(\hat{\beta}_1),\; \hat{\beta}_1 + t_{\frac{\alpha}{2},n-2}\,\widehat{se}(\hat{\beta}_1)\right]$$

and the test statistic for $H_0: \beta_1 = 0$ as

$$T = \frac{\hat{\beta}_1}{\widehat{se}(\hat{\beta}_1)} \sim t_{n-2}.$$

Inference about the regression parameters
Inference about β0

Theorem
In the full SLRM the distribution of the LSE of $\beta_0$, $\hat{\beta}_0$, is normal with expectation $E(\hat{\beta}_0) = \beta_0$ and variance $\operatorname{var}(\hat{\beta}_0) = \left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)\sigma^2$, that is,

$$\hat{\beta}_0 \sim N\left(\beta_0, \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)\right).$$

Inference about the regression parameters
Inference about β0

Corollary
Assuming the full simple linear regression model, we obtain:

CI for $\beta_0$:

$$\left[\hat{\beta}_0 - t_{\frac{\alpha}{2},n-2}\,\widehat{se}(\hat{\beta}_0),\; \hat{\beta}_0 + t_{\frac{\alpha}{2},n-2}\,\widehat{se}(\hat{\beta}_0)\right]$$

Test of the hypothesis $H_0: \beta_0 = \beta_0^\star$:

$$T = \frac{\hat{\beta}_0 - \beta_0^\star}{\widehat{se}(\hat{\beta}_0)} \overset{H_0}{\sim} t_{n-2},$$

where

$$\widehat{se}(\hat{\beta}_0) = \sqrt{S^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)}.$$

Inference about the regression parameters
Inference about β0

Example continued
The calculated values for the overhead costs are as follows:

$$\hat{\beta}_0 = 16310, \quad \widehat{se}(\hat{\beta}_0) = 2421.$$

Hence, the 95% CI for $\beta_0$ is

$$[a, b] = [16310 - 2.14479 \times 2421,\; 16310 + 2.14479 \times 2421] = [11117.5, 21502.5].$$

We would expect (with 95% confidence) that even with zero hours of labor, the overhead cost is between £11117.5 and £21502.5.

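As a check, this interval too can be reproduced from the printed output; a short sketch:

from scipy import stats

b0, se_b0, n = 16310.0, 2421.0, 16      # from the overheads output
t = stats.t.ppf(1 - 0.025, n - 2)       # t_{0.025,14}

print(b0 - t * se_b0, b0 + t * se_b0)   # about [11117.5, 21502.5]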
