
Lecture #4

MULTIPLE REGRESSION
&
MULTIFACTOR ANOVA (GLM)

“If people do not believe that mathematics is simple, it is only because they do not realize how complicated life is.”
- John von Neumann (1903 - 1957)
Learning Objectives

1. Understand how multiple linear regression (MLR) analysis is conducted


and can be used to develop relationships involving one dependent
variable and several independent variables.

2. Be able to interpret the coefficients in a multiple regression analysis.

3. Understand what multicollinearity is and how it can affect the coefficients


in the regression equation.

4. Know the assumptions and procedures to conduct statistical tests


involving the hypothesized model.
Learning Objectives

5. Know how to use the F-ratio to test for model significance.

6. Know how to use the t-values to test for significance of individual


predictors in the equation.

7. Be able to use General Linear Models (GLM) to perform an ANOVA on 2
or more factors

8. Know the components of a GLM (ANOVA) Table


Why Study MLR & GLM?

Y, Response variable

                  Continuous                           Discrete
                  (output has a mean and variance)     (output is a proportion, e.g., 15 out of 50, or 30%)

X, Input(s)
  Continuous      - Correlation                        - Logistic Regression
                  - Regression*

  Discrete        - T-test                             - Proportions
                      - Paired                             - One Proportion
                      - One Sample                         - Two Proportions
                      - Two Sample
                          - Equal var.
                          - Unequal var.
                  - ANOVA (an X with 2 or more         - Chi-Sq (more than 2 proportions)
                    means, or 2 or more X's
                    being investigated)

RARELY do you just look at one variable at a time, and it’s not an efficient way to
experiment. Typically multiple factors influence a response, and looking at them at
the same time allows us to investigate interactions!
Overview

 Multiple Linear Regression (MLR) Example


 Interpretation of MLR Coefficients
 Goodness of Fit
 Inference
 Multifactor ANOVA
 Exercise
Multiple Regression

Ordinary least squares multiple regression is


a direct extension of least squares simple
regression. The mathematics involved are
much more complex, but we will rely on the
computer for help in that regard.
Advantages and Disadvantages

 The advantage of multiple regression is potentially


increased explanatory power for the response
variable as a result of including more information.

 The disadvantages of multiple regression include a more complicated model, the increased commitment of resources to collecting information and maintaining the database, greater risk of misuse of the model (extrapolation, for example), and mathematical problems, among others.
Goal - PARSIMONY

Develop a model that allows us to


predict the response variable
accurately, but that is as brief as
possible.
Mini-Case: Newsprint Quality

A major newspaper printing facility was interested in


understanding those factors that affect the quality of the
printing produced by their equipment. A major factor in print
quality is the nature of the paper stock used for the printing
operation.
Data were collected on three variables and are presented on a
following slide. Each observation is the mean of a sample of
five sheets taken from the same roll. The variables were print
quality (Print_Q), roughness of the paper (Rough), and the
tensile strength of the paper (Strength).
Details of the measurement process for the variables are
provided in the next two slides.
Response Variable

 Print_Q is measured by the La Rocque printability


test. A precision 6" × 7" plate is inked with a
controlled quantity of ink and printed on a sample of
newsprint using a printing pressure of 60 lbs. per
inch. Printability is measured as a percentage of the
difference in reflectance between printed and
unprinted paper areas as measured by a Hunter
reflectometer.
Possible Predictors

 Rough is measured by the Bendtsen Smoothness and


Porosity Tester. It works on the principle of measuring the
quantity of regulated air flow escaping between the head
and the surface of the paper sample. Values are reported in
Bendtsen Units. Note that this variable not only measures
smoothness, but also the likelihood that ink may be able to
penetrate through the paper.
 Strength is measured as the tensile strength of the paper in
lbs. per square inch. The variable is measured with the use
of a standardized test apparatus which applies a precisely
defined force and angle to the sample. Values reported are
in the ANPA scale units.
Data
OBS Print_Q Rough Strength OBS Print_Q Rough Strength
1 68.7 103.6 145.20 16 72.8 87.3 153.00
2 67.2 167.8 170.05 17 62.4 141.2 133.04
3 66.6 121.4 169.78 18 74.3 67.7 184.53
4 68.3 119.6 162.85 19 69.5 128.6 142.30
5 71.2 68.0 159.65 20 67.5 159.0 167.25
6 71.7 117.4 181.76 21 75.1 62.5 156.27
7 73.0 59.4 159.97 22 74.3 83.3 173.13
8 70.8 97.5 156.71 23 71.7 83.8 178.05
9 61.5 177.5 141.30 24 76.4 97.9 169.90
10 72.3 82.8 157.95 25 67.1 104.8 158.80
11 66.1 123.5 154.81 26 67.2 132.6 141.32
12 65.4 131.4 144.56 27 71.2 94.5 167.04
13 73.7 91.9 162.23 28 72.6 76.0 155.53
14 70.9 100.1 168.05 29 66.2 108.3 135.42
15 65.8 108.6 153.45 30 71.9 76.7 159.19
Summary

Descriptive Statistics: Print_Q, Rough, Strength

Variable N Mean StDev Min Max


Print_Q 30 69.780 3.732 61.500 76.400
Rough 30 105.82 30.43 59.40 177.50
Strength 30 158.77 13.10 133.04 184.53

 Note: SST = (n − 1)sY² = (29)(3.732)² = 403.907
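The SST note above can be checked in a few lines of Python (not part of the slides; a minimal sketch using only the summary statistics reported):

```python
# Sketch: verify the SST note from the descriptive statistics.
# SST = (n - 1) * s_y^2, with n = 30 and s_y = 3.732 for Print_Q.
n = 30
s_y = 3.732
sst = (n - 1) * s_y ** 2
print(round(sst, 3))  # 403.907
```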


Summary of Relationships

Correlation Table Print_Q Rough Strength


Print_Q 1.000
Rough -0.789 1.000
Strength 0.582 -0.312 1.000
Simple Regression With Best Predictor

Population Model: Yi = β0 + β1Xi + εi


Multiple Adjusted StErr of
Summary R R-Square R-Square Estimate
0.7890 0.6225 0.6090 2.3339

Degrees of Sum of Mean of


ANOVA Table Freedom Squares Squares F-Ratio p-Value
Explained 1 251.4745 251.4745 46.1683 < 0.0001
Unexplained 28 152.5135 5.44691

Standard Confidence Interval 95%


Regression Table Coefficient Error t-Value p-Value Lower Upper
Constant 80.0208 1.5662 51.0908 0.0000 76.8125 83.2291
Rough -0.0968 0.0142 -6.7947 0.0000 -0.1259 -0.0676
Sample Equation: ŷ = b0 + b1x

Print_Q̂ = 80.0208 − 0.0968 Rough
Regression: Print_Q with Rough and Strength

Multiple Adjusted StErr of


Summary R R-Square R-Square Estimate
0.8645 0.7474 0.7287 1.9441

Degrees of Sum of Mean of


ANOVA Table Freedom Squares Squares F-Ratio p-Value
Explained 2 301.9384 150.9692 39.9430 < 0.0001
Unexplained 27 102.0496 3.7796

Standard Confidence Interval 95%


Regression Table Coefficient Error t-Value p-Value Lower Upper
Constant 61.6823 5.1856 11.8950 < 0.0001 51.0424 72.3223
Rough -0.0825 0.0125 -6.6107 < 0.0001 -0.1082 -0.0569
Strength 0.1060 0.0290 3.6540 0.0011 0.0465 0.1656

The fitted equation:

Print_Q̂ = 61.6823 − 0.0825 Rough + 0.1060 Strength
“Plane” of Best Fit

A plot of the regression plane and original data points for


our mini-case is presented below:

Print_ Q̂  61.6823  0.0825Rough  0.1060Strength


Sample Multiple Regression Equation

The regression model estimated from the data is of the form:

Ŷi = b0 + b1X1 + b2X2

Sample statistics: b0, b1, and b2 estimate β0, β1, and β2, respectively; Ŷi estimates μY|X1,X2; e estimates ε; and s estimates σ.

The estimators are again determined to minimize SSE:

Σei² = Σ(Yi − Ŷi)² = Σ(Yi − (b0 + b1X1 + b2X2))²
Overview

 Multiple Linear Regression (MLR) Example


 Interpretation of MLR Coefficients
 Goodness of Fit
 Inference
 Multifactor ANOVA
 Exercise
Interpretation of the multiple
regression coefficients

CHANGE ONLY ONE VARIABLE AT A TIME

BE WARY OF COLLINEARITY


Manipulate One Predictor at a Time

When Xj is a quantitative variable, the coefficient bj should


be interpreted as the estimated change in the mean of Y
when Xj is increased 1 unit while holding all other
predictors in the equation constant.

Example: Print_Q̂ = 61.6823 − 0.0825 Rough + 0.1060 Strength

The average print quality is estimated to decrease 0.0825 units for every 1-unit increase in the roughness of the paper, given the same tensile strength.
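A small Python sketch (not from the slides; the input values 100 and 160 are arbitrary illustrations) showing that a 1-unit increase in Rough with Strength held fixed changes the prediction by exactly the Rough coefficient:

```python
# Sketch: "holding the other predictor constant" with the fitted equation.
def print_q(rough, strength):
    return 61.6823 - 0.0825 * rough + 0.1060 * strength

base = print_q(100.0, 160.0)
plus_one = print_q(101.0, 160.0)   # Rough up 1 unit, Strength held fixed
print(round(plus_one - base, 4))   # -0.0825: the Rough coefficient
```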
Collinearity

In general, the values of the bj are affected by the


other variables present in the model.

Example:

Print_Q̂ = 80.0208 − 0.0968 Rough

Print_Q̂ = 61.6823 − 0.0825 Rough + 0.1060 Strength

There is little difference in the Rough coefficient due to the low correlation (−0.312) between Rough and Strength.
Multicollinearity

 As the strength of relationship between predictors increases, it


becomes difficult to estimate their individual coefficients.

 The coefficients become unstable and may exhibit drastic


changes when those predictors are used simultaneously in the
same model.

 The coefficients will have inflated standard errors.

 This condition is called multicollinearity.

 Multicollinearity is identified by a VIF > 10

 The effect of multicollinearity is inflated variability in the estimated coefficients
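With only two predictors, each VIF reduces to 1/(1 − r²), where r is the correlation between the predictors. A quick check (not from the slides) using r = −0.312 from the correlation table:

```python
# Sketch: VIF for a two-predictor model is 1 / (1 - r^2).
r = -0.312                     # correlation between Rough and Strength
vif = 1.0 / (1.0 - r ** 2)
print(round(vif, 3))           # about 1.108, far below the cutoff of 10
```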
Overview

 Multiple Linear Regression (MLR) Example


 Interpretation of MLR Coefficients
 Goodness of Fit
 Inference
 Multifactor ANOVA
 Exercise
Measures of goodness of fit

1. S

2. R²

3. ADJUSTED R²
Print_Q vs. Rough and Strength
R2 and s
Multiple Adjusted StErr of
Summary R R-Square R-Square Estimate
0.8645 0.7474 0.7287 1.9441

Degrees of Sum of Mean of


ANOVA Table Freedom Squares Squares F-Ratio p-Value
Explained 2 301.9384 150.9692 39.9430 < 0.0001
Unexplained 27 102.0496 3.7796

                    Standard                          Confidence Interval 95%
Regression Table  Coefficient    Error   t-Value   p-Value     Lower     Upper
Constant              61.6823   5.1856   11.8950  < 0.0001   51.0424   72.3223
Rough                 -0.0825   0.0125   -6.6107  < 0.0001   -0.1082   -0.0569
Strength               0.1060   0.0290    3.6540    0.0011    0.0465    0.1656

S = 1.9441: Recall that sY = 3.732. The dispersion around the regression plane is almost 50% smaller than the dispersion around ȳ.

R² = 0.7474: Nearly 75% of the variation in print quality can be explained by the roughness and strength of the paper.
Print_Q vs. Rough and Strength
R2-adjusted
Multiple Adjusted StErr of
Summary R R-Square R-Square Estimate
0.8645 0.7474 0.7287 1.9441

Degrees of Sum of Mean of


ANOVA Table Freedom Squares Squares F-Ratio p-Value
Explained 2 301.9384 150.9692 39.9430 < 0.0001
Unexplained 27 102.0496 3.7796

SSE  n  1  Standard
The 2 Confidence Interval 95%
Regression Error adjusted R will always
R a2  1  Table Coefficient. t-Value p-Value be smaller
Lower Upper
Constant  p  1  5.1856
SST  n61.6823 than the11.8950 < 0.0001
unadjusted 51.0424
coefficient. 72.3223
Rough -0.0825 0.0125 -6.6107 < 0.0001 -0.1082 -0.0569
Strength 0.1060 0.0290 3.6540 0.0011 0.0465 0.1656
After adjusting for the use of 2 predictors with n = 30,
nearly 73% of the variation in print quality can be
explained by the roughness and strength of the
paper used.
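Both coefficients can be reproduced from the ANOVA table entries. A short Python check (not part of the slides):

```python
# Sketch: R-squared and adjusted R-squared from the ANOVA table.
sse, ssr = 102.0496, 301.9384
sst = sse + ssr                                  # total sum of squares
n, p = 30, 2                                     # observations, predictors
r2 = ssr / sst
r2_adj = 1 - (sse / sst) * ((n - 1) / (n - p - 1))
print(round(r2, 4), round(r2_adj, 4))            # 0.7474 0.7287
```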
“Flaw” in R2

R2 can be inflated by the use of many predictors, but


each predictor has a price - a degree of freedom
paid by the Error source of variation:
Source       SS     df          MS     F
Regression   SSR    p           MSR    MSR/MSE
Error        SSE    n − p − 1   MSE
Total        SST    n − 1

Too many predictors “overfit” the data


Purpose of R2 Adjusted

 The purpose behind the adjusted R2 is to account


for the number of predictors.
o This is particularly relevant when n is small.

 It helps identify “overfitting” in regression models.

 The adjusted coefficient normally gives a more


appropriate estimate of the proportion of variance
in the dependent variable that is explained when
multiple predictors are used.
Overview

 Multiple Linear Regression (MLR) Example


 Interpretation of MLR Coefficients
 Goodness of Fit
 Inference
 Multifactor ANOVA
 Exercise
Inference

Mathematically more complex,


but otherwise the same as in
Simple Regression, with one
exception.
The significance tests based on
F obs and t obs do not address
identical hypotheses.
F-test of Model Significance

Shows if there is a linear relationship between any or


all X variables and Y

Hypotheses:
H0: β1 = β2 = ... = βp = 0
Ha: At least one βj is not 0

Test Statistic:
F-ratio = MSR / MSE ~ F(p, n − p − 1)

Decision Rule:
Reject H0 if the F-ratio exceeds the critical value F(p, n − p − 1)
Test of Model Significance for the Mini-case

H0: β1 = β2 = 0
Ha: Not both β1 and β2 are 0
Degrees of Sum of Mean of
ANOVA Table Freedom Squares Squares F-Ratio p-Value
Explained 2 301.9384 150.9692 39.9430 < 0.0001
Unexplained 27 102.0496 3.7796

If α = 0.05, Fc = 3.3541

We conclude that there is a statistically significant predictive


relationship between print quality and newsprint
roughness, or newsprint tensile strength, or both.
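The F-ratio is just MSR/MSE from the ANOVA table. A quick check (not from the slides):

```python
# Sketch: F-ratio from the ANOVA table, vs. the alpha = 0.05 critical
# value F(2, 27) = 3.3541 quoted on the slide.
msr, mse = 150.9692, 3.7796
f_ratio = msr / mse
print(round(f_ratio, 2))   # 39.94, far beyond 3.3541: reject H0
```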
T-test for the Significance of Each Predictor

Goal: Parsimony
 We want to identify variables that are not
significant in the model, and consider removing
any that aren’t.
 Remove one predictor (least significant) at a time.

 A predictor is not significant if its population


coefficient is 0.
 The test is identical to the t-test in simple
regression.
t-Test of Individual Predictor in the Model

Hypotheses (one set for each predictor):

H0: βj = 0 Implies an insignificant predictor.
Ha: βj ≠ 0 Implies a significant predictor.

Test Statistic: t_obs = (bj − 0) / s_bj ~ t(n − p − 1)

[Two-tailed rejection regions: α/2 in each tail; reject H0 if t_obs < −tc or t_obs > tc]
t-Tests

Hypotheses: H0: β1 = 0      H0: β2 = 0
            Ha: β1 ≠ 0      Ha: β2 ≠ 0
Standard Confidence Interval 95%
Regression Table Coefficient Error t-Value p-Value Lower Upper
Constant 61.6823 5.1856 11.8950 < 0.0001 51.0424 72.3223
Rough -0.0825 0.0125 -6.6107 < 0.0001 -0.1082 -0.0569
Strength 0.1060 0.0290 3.6540 0.0011 0.0465 0.1656
If α = 0.05, tc = t(27, .05/2) = 2.052: Reject H0 if t_obs < −2.052 or t_obs > 2.052
We conclude that there is strong evidence that the roughness of the
paper is a significant predictor of print quality in this model.

We conclude that there is strong evidence that the strength of the


paper is a significant predictor of print quality in this model.
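The t-values and confidence intervals in the table can be recomputed by hand from the coefficients and standard errors. A Python sketch (not from the slides); small discrepancies against the table come from rounding in the reported values:

```python
# Sketch: t-values and 95% CIs from coefficients and standard errors,
# using the critical value t(27, 0.025) = 2.052 quoted on the slide.
t_c = 2.052
results = {}
for name, coef, se in [("Rough", -0.0825, 0.0125), ("Strength", 0.1060, 0.0290)]:
    t_obs = coef / se
    lo, hi = coef - t_c * se, coef + t_c * se
    results[name] = (t_obs, lo, hi)
    print(name, round(t_obs, 2), round(lo, 4), round(hi, 4))
```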
Overview

 Multiple Linear Regression (MLR) Example


 Interpretation of MLR Coefficients
 Goodness of Fit
 Inference
 Multifactor ANOVA
 Exercise
Multifactor ANOVA
Advantages & Disadvantages

 The advantage of multifactor ANOVA is efficiency


versus one factor at a time analysis. It provides an
examination of the interaction.

 The disadvantages of multifactor ANOVA include a


more complicated model, the increased
commitment of resources, costly & time-
consuming, to collecting information and greater
risk of misuse of the model.
Mini-Case: Sales vs. Display Height & Width

The Acme Bakery wanted to determine if sales are


affected by the display height and display width.
Acme Bakery conducted a test in one store using the same item. The display heights were eye level (5’) and knee level (2’). The widths of the displays were narrow (18”) and wide (36”). Sales were recorded for one week for each design and are in $1000’s.
Significance level = 0.01
Data collected on sales, display height and width are
on the following slide.
Data
Two-Factor ANOVA Table
(Fixed Factors)

Source df SS MS F
Factor A a-1 SSA MSA = SSA/(a-1) MSA/MSE
Factor B b-1 SSB MSB = SSB/(b-1) MSB/MSE
AB Interaction (a-1)(b-1) SSAB MSAB = SSAB/(a-1)(b-1) MSAB/MSE
Error ab(n-1) SSE MSE = SSE/ab(n-1) -
Total N-1 SST - -

In our case, a = 2, b = 2 and n, number of replications at each


possible combination, equals 3, and N = 12
a–1=1
b–1=1
(a – 1)(b – 1) = 1
ab(n – 1) = 2x2x(3 – 1) = 8
N – 1 = 12 – 1 = 11
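The degrees-of-freedom bookkeeping above can be verified in a few lines (not from the slides):

```python
# Sketch: degrees of freedom for the 2x2 fixed-factor design, n = 3 replicates.
a, b, n = 2, 2, 3
N = a * b * n                      # 12 observations in total
df_A = a - 1                       # 1
df_B = b - 1                       # 1
df_AB = (a - 1) * (b - 1)          # 1
df_error = a * b * (n - 1)         # 8
assert df_A + df_B + df_AB + df_error == N - 1   # df must sum to N - 1
print(df_A, df_B, df_AB, df_error, N - 1)        # 1 1 1 8 11
```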
Two-Factor ANOVA—Help from Technology

GLM provides flexibility: Will handle unbalanced and balanced designs. Here,
since the design is balanced, the Sequential SS and Adjusted SS are the same.
Main Effects Plot

[Main effects plot for Sales (data means): mean Sales by Height (Eye, Knee) and by Width (Narrow, Wide)]

Interaction Plot

[Interaction plot for Sales (data means): mean Sales vs. Width (Narrow, Wide), with separate lines for Height = Eye and Height = Knee]
Examining One Variable at a Time

Notice that SST is the same in both models—64.34. The SS for Width is also the same—25.81. However, in the one-way ANOVA this factor is not significant at an alpha of 0.01, whereas in the two-way ANOVA with the interaction it is highly significant.
Examining One Variable at a Time

Same issue with Height. If a factor is not a part of the analysis, its effect is
in the error term—the denominator of the F-ratios. The error term is
supposed to be just random error and not assignable causes. If there are
assignable causes in the error term, significant effects can be missed.
BREAK

FRIDAY’S
HANDS-ON EXERCISE
Overview

 Multiple Linear Regression (MLR) Example


 Interpretation of MLR Coefficients
 Goodness of Fit
 Inference
 Multifactor ANOVA
 Exercise
Generate MLR Data

Exercise: Shoot the catapult to generate a multiple regression equation for Distance as a function of three inputs.

Objective: MLR
- Experiment to collect data regarding 3 factors
- Use statistical software to generate the model
- Recognize if multicollinearity exists in the model

Time: Take 45 minutes

Directions: Following slides

[Catapult diagram labels: Cup, Rubber Band, Hook Pin, Vertical Pin, Start Angle, Stop Pin]
Generate MLR Data: Directions
 Break into groups of 4 or 5
 Get catapult, duct tape, tape measure, 12’ aluminum foil & rubber ball
 Duct tape catapult to floor securely
 Use aluminum foil to indicate where the ball landed—secure it to the
floor with a couple of pieces of tape to prevent it from sliding when the
ball lands on it
 Place, and then secure, the tape measure along side the aluminum foil
starting from the front of the catapult
 Response variable = Distance ball travels before hitting the ground
 Experimental factors, independent variables…
 Hook pin
 Vertical pin
 Start angle

[Setup diagram labels: catapult, aluminum foil, tape measure]
Generate MLR Data: Directions

 Experimental factors
 Hook pin
 Vertical pin
 Start angle

Constants:
Put cup at location 1
Put stop pin in location 3
Generate MLR Data

 For Minitab, create a column for each variable—Response variable and the
three factors

 Generate as much data as you believe necessary to calculate:


 Main effects
 Interactions
 Experimental error
Generate MLR Data

 You will have to generate the interactions in Minitab to include them in the analysis

 Multiply two main effects in a separate column to create the interaction
 Calc>Calculator will give you the above dialog box (left)
 Store it in the next column, here I labeled it “HP*VP” for the interaction of Hook Pin and Vertical Pin
 In “Expression” box, double click on Hook Pin, click on the multiplier button, *, and double click on Vertical Pin
 Click on “OK”
 Continue this for the other two, two-way interactions, HP*SA and VP*SA
Generate MLR Data

When you’re ready to analyze the data, your


worksheet should look like this:

One column for your response variable, Distance, C4

The other columns are the factors you will regress on,
the main effects and their interactions
Generate MLR Data

 Select: Stat>Regression>Regression

 Enter Distance in the “Response:” window


 Place cursor in the “Predictors:” window by clicking in the window
 Enter all the main effects and interactions by double clicking on them
Generate MLR Data

 Click the “Options” button and select Variance inflation factors

 Click “OK”

 Click the “Storage” button and select Standardized Residuals

 Click “OK”
Generate MLR Data

 Click on the Graphs button and select the 4-in-1 option

 Click on OK

 Click on OK
Generate MLR Data

 Each person on your team must analyze YOUR team’s results


 Hand it in next week as a homework exercise

 Determine
 Which factors are significant; eliminate insignificant factors/interactions
 Provide model with only significant variables
 If multicollinearity exists
 What’s the proof
 If it does, what are the issues with your model
 Validate your assumptions
 Residuals are normal, independent, have constant variance, and E{ei} = 0
 How much of the total variation does your model explain
 What is the 95% range, ballpark, I can expect a predicted value to fall within
 Is your model good or are there issues

Level of Significance = 0.05


Summary of Key Learning Points

 With MLR, many more concerns


 Goodness of Fit: R², s and F-ratio
 R²adj adjusts for overfitting the model
 Residuals still need to be
 Normal, independent, with equal variances and E{ei} = 0

 In an ANOVA table, what happens to the SS of a


factor if it’s not in the analysis?
