
Lecture #4

MULTIPLE REGRESSION
&
MULTIFACTOR ANOVA (GLM)

“If people do not believe that mathematics is simple, it is only because they do not realize how complicated life is.”
- John von Neumann (1903 - 1957)
Learning Objectives

1. Understand how multiple linear regression (MLR) analysis is conducted


and can be used to develop relationships involving one dependent
variable and several independent variables.

2. Be able to interpret the coefficients in a multiple regression analysis.

3. Understand what multicollinearity is and how it can affect the coefficients


in the regression equation.

4. Know the assumptions and procedures to conduct statistical tests


involving the hypothesized model.
Learning Objectives

5. Know how to use the F-ratio to test for model significance.

6. Know how to use the t-values to test for significance of individual


predictors in the equation.

7. Be able to use General Linear Models (GLM) to perform an ANOVA on 2
or more factors

8. Know the components of a GLM (ANOVA) Table


Why Study MLR & GLM?

Y, Response variable

                  Continuous                           Discrete
                  (output has a mean and variance)     (output is a proportion, e.g., 15 out of 50, or 30%)

X, Input(s)
  Continuous      - Correlation                        - Logistic Regression
                  - Regression*

  Discrete        - T-test                             - Proportions
                      - Paired                             - One Proportion
                      - One Sample                         - Two Proportions
                      - Two Sample
                          - Equal var.
                          - Unequal var.
                  - ANOVA (an X with 2 or more         - Chi-Sq (more than 2 proportions)
                    means, or 2 or more X's
                    being investigated)

RARELY do you just look at one variable at a time, and it’s not an efficient way to
experiment. Typically multiple factors influence a response, and looking at them at
the same time allows us to investigate interactions!
Overview

 Multiple Linear Regression (MLR) Example


 Interpretation of MLR Coefficients
 Goodness of Fit
 Inference
 Multifactor ANOVA
 Exercise
Multiple Regression

Ordinary least squares multiple regression is


a direct extension of least squares simple
regression. The mathematics involved are
much more complex, but we will rely on the
computer for help in that regard.
Advantages and Disadvantages

 The advantage of multiple regression is potentially


increased explanatory power for the response
variable as a result of including more information.

 The disadvantages of multiple regression include a more complicated model, the increased commitment of resources to collecting information and maintaining the database, greater risk of misuse of the model (extrapolation, for example), and mathematical problems, among others.
Goal - PARSIMONY

Develop a model that allows us to


predict the response variable
accurately, but that is as brief as
possible.
Mini-Case: Newsprint Quality

A major newspaper printing facility was interested in


understanding those factors that affect the quality of the
printing produced by their equipment. A major factor in print
quality is the nature of the paper stock used for the printing
operation.
Data were collected on three variables and are presented on a
following slide. Each observation is the mean of a sample of
five sheets taken from the same roll. The variables were print
quality (Print_Q), roughness of the paper (Rough), and the
tensile strength of the paper (Strength).
Details of the measurement process for the variables are
provided in the next two slides.
Response Variable

 Print_Q is measured by the La Rocque printability


test. A precision 6" × 7" plate is inked with a
controlled quantity of ink and printed on a sample of
newsprint using a printing pressure of 60 lbs. per
inch. Printability is measured as a percentage of the
difference in reflectance between printed and
unprinted paper areas as measured by a Hunter
reflectometer.
Possible Predictors

 Rough is measured by the Bendtsen Smoothness and


Porosity Tester. It works on the principle of measuring the
quantity of regulated air flow escaping between the head
and the surface of the paper sample. Values are reported in
Bendtsen Units. Note that this variable not only measures
smoothness, but also the likelihood that ink may be able to
penetrate through the paper.
 Strength is measured as the tensile strength of the paper in
lbs. per square inch. The variable is measured with the use
of a standardized test apparatus which applies a precisely
defined force and angle to the sample. Values reported are
in the ANPA scale units.
Data
OBS Print_Q Rough Strength OBS Print_Q Rough Strength
1 68.7 103.6 145.20 16 72.8 87.3 153.00
2 67.2 167.8 170.05 17 62.4 141.2 133.04
3 66.6 121.4 169.78 18 74.3 67.7 184.53
4 68.3 119.6 162.85 19 69.5 128.6 142.30
5 71.2 68.0 159.65 20 67.5 159.0 167.25
6 71.7 117.4 181.76 21 75.1 62.5 156.27
7 73.0 59.4 159.97 22 74.3 83.3 173.13
8 70.8 97.5 156.71 23 71.7 83.8 178.05
9 61.5 177.5 141.30 24 76.4 97.9 169.90
10 72.3 82.8 157.95 25 67.1 104.8 158.80
11 66.1 123.5 154.81 26 67.2 132.6 141.32
12 65.4 131.4 144.56 27 71.2 94.5 167.04
13 73.7 91.9 162.23 28 72.6 76.0 155.53
14 70.9 100.1 168.05 29 66.2 108.3 135.42
15 65.8 108.6 153.45 30 71.9 76.7 159.19
Summary

Descriptive Statistics: Print_Q, Rough, Strength

Variable N Mean StDev Min Max


Print_Q 30 69.780 3.732 61.500 76.400
Rough 30 105.82 30.43 59.40 177.50
Strength 30 158.77 13.10 133.04 184.53

 Note: SST = (n − 1)sY² = (29)(3.732)² = 403.907
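The SST note above can be checked in a few lines of Python (not part of the slides; a minimal sketch using only the summary statistics reported):

```python
# Sketch: verify the SST note from the descriptive statistics.
# SST = (n - 1) * s_y^2, with n = 30 and s_y = 3.732 for Print_Q.
n = 30
s_y = 3.732
sst = (n - 1) * s_y ** 2
print(round(sst, 3))  # 403.907
```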


Summary of Relationships

Correlation Table Print_Q Rough Strength


Print_Q 1.000
Rough -0.789 1.000
Strength 0.582 -0.312 1.000
Simple Regression With Best Predictor

Population Model: Yi = β0 + β1Xi + εi


Multiple Adjusted StErr of
Summary R R-Square R-Square Estimate
0.7890 0.6225 0.6090 2.3339

Degrees of Sum of Mean of


ANOVA Table Freedom Squares Squares F-Ratio p-Value
Explained 1 251.4745 251.4745 46.1683 < 0.0001
Unexplained 28 152.5135 5.44691

Standard Confidence Interval 95%


Regression Table Coefficient Error t-Value p-Value Lower Upper
Constant 80.0208 1.5662 51.0908 0.0000 76.8125 83.2291
Rough -0.0968 0.0142 -6.7947 0.0000 -0.1259 -0.0676
Sample Equation: ŷ = b0 + b1x

Print_Q̂ = 80.0208 − 0.0968 Rough
Regression: Print_Q with Rough and Strength

Multiple Adjusted StErr of


Summary R R-Square R-Square Estimate
0.8645 0.7474 0.7287 1.9441

Degrees of Sum of Mean of


ANOVA Table Freedom Squares Squares F-Ratio p-Value
Explained 2 301.9384 150.9692 39.9430 < 0.0001
Unexplained 27 102.0496 3.7796

Standard Confidence Interval 95%


Regression Table Coefficient Error t-Value p-Value Lower Upper
Constant 61.6823 5.1856 11.8950 < 0.0001 51.0424 72.3223
Rough -0.0825 0.0125 -6.6107 < 0.0001 -0.1082 -0.0569
Strength 0.1060 0.0290 3.6540 0.0011 0.0465 0.1656

The fitted equation:

Print_Q̂ = 61.6823 − 0.0825 Rough + 0.1060 Strength
“Plane” of Best Fit

A plot of the regression plane and original data points for


our mini-case is presented below:

Print_ Q̂  61.6823  0.0825Rough  0.1060Strength


Sample Multiple Regression Equation

The regression model estimated from the data is of the form:

Ŷi = b0 + b1X1 + b2X2

Sample statistics: b0, b1, and b2 estimate β0, β1, and β2, respectively; Ŷi estimates μY|X1,X2; e estimates ε; and s estimates σ.

The estimators are again determined to minimize SSE:

Σei² = Σ(Yi − Ŷi)² = Σ(Yi − (b0 + b1X1 + b2X2))²
Overview

 Multiple Linear Regression (MLR) Example


 Interpretation of MLR Coefficients
 Goodness of Fit
 Inference
 Multifactor ANOVA
 Exercise
Interpretation of the multiple
regression coefficients

CHANGE ONLY ONE VARIABLE AT A TIME

BE WARY OF COLLINEARITY


Manipulate One Predictor at a Time

When Xj is a quantitative variable, the coefficient bj should


be interpreted as the estimated change in the mean of Y
when Xj is increased 1 unit while holding all other
predictors in the equation constant.

Example: Print_Q̂ = 61.6823 − 0.0825 Rough + 0.1060 Strength

The average print quality is estimated to decrease 0.0825 units for every 1-unit increase in the roughness of the paper, given the same tensile strength.
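A small Python sketch (not from the slides; the input values 100 and 160 are arbitrary illustrations) showing that a 1-unit increase in Rough with Strength held fixed changes the prediction by exactly the Rough coefficient:

```python
# Sketch: "holding the other predictor constant" with the fitted equation.
def print_q(rough, strength):
    return 61.6823 - 0.0825 * rough + 0.1060 * strength

base = print_q(100.0, 160.0)
plus_one = print_q(101.0, 160.0)   # Rough up 1 unit, Strength held fixed
print(round(plus_one - base, 4))   # -0.0825: the Rough coefficient
```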
Collinearity

In general, the values of the bj are affected by the


other variables present in the model.

Example:

Print_Q̂ = 80.0208 − 0.0968 Rough

Print_Q̂ = 61.6823 − 0.0825 Rough + 0.1060 Strength

There is little difference in the Rough coefficient due to the low correlation (−0.312) between Rough and Strength.
Multicollinearity

 As the strength of relationship between predictors increases, it


becomes difficult to estimate their individual coefficients.

 The coefficients become unstable and may exhibit drastic


changes when those predictors are used simultaneously in the
same model.

 The coefficients will have inflated standard errors.

 This condition is called multicollinearity.

 Multicollinearity is identified by a VIF > 10

 The effect of multicollinearity is inflated variability in the estimated coefficients
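With only two predictors, each VIF reduces to 1/(1 − r²), where r is the correlation between the predictors. A quick check (not from the slides) using r = −0.312 from the correlation table:

```python
# Sketch: VIF for a two-predictor model is 1 / (1 - r^2).
r = -0.312                     # correlation between Rough and Strength
vif = 1.0 / (1.0 - r ** 2)
print(round(vif, 3))           # about 1.108, far below the cutoff of 10
```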
Overview

 Multiple Linear Regression (MLR) Example


 Interpretation of MLR Coefficients
 Goodness of Fit
 Inference
 Multifactor ANOVA
 Exercise
Measures of goodness of fit

1. S

2. R²

3. ADJUSTED R²
Print_Q vs. Rough and Strength
R2 and s
Multiple Adjusted StErr of
Summary R R-Square R-Square Estimate
0.8645 0.7474 0.7287 1.9441

Degrees of Sum of Mean of


ANOVA Table Freedom Squares Squares F-Ratio p-Value
Explained 2 301.9384 150.9692 39.9430 < 0.0001
Unexplained 27 102.0496 3.7796

                    Standard                          Confidence Interval 95%
Regression Table  Coefficient    Error   t-Value   p-Value     Lower     Upper
Constant              61.6823   5.1856   11.8950  < 0.0001   51.0424   72.3223
Rough                 -0.0825   0.0125   -6.6107  < 0.0001   -0.1082   -0.0569
Strength               0.1060   0.0290    3.6540    0.0011    0.0465    0.1656

S = 1.9441: Recall that sY = 3.732. The dispersion around the regression plane is almost 50% smaller than the dispersion around ȳ.

R² = 0.7474: Nearly 75% of the variation in print quality can be explained by the roughness and strength of the paper.
Print_Q vs. Rough and Strength
R2-adjusted
Multiple Adjusted StErr of
Summary R R-Square R-Square Estimate
0.8645 0.7474 0.7287 1.9441

Degrees of Sum of Mean of


ANOVA Table Freedom Squares Squares F-Ratio p-Value
Explained 2 301.9384 150.9692 39.9430 < 0.0001
Unexplained 27 102.0496 3.7796

SSE  n  1  Standard
The 2 Confidence Interval 95%
Regression Error adjusted R will always
R a2  1  Table Coefficient. t-Value p-Value be smaller
Lower Upper
Constant  p  1  5.1856
SST  n61.6823 than the11.8950 < 0.0001
unadjusted 51.0424
coefficient. 72.3223
Rough -0.0825 0.0125 -6.6107 < 0.0001 -0.1082 -0.0569
Strength 0.1060 0.0290 3.6540 0.0011 0.0465 0.1656
After adjusting for the use of 2 predictors with n = 30,
nearly 73% of the variation in print quality can be
explained by the roughness and strength of the
paper used.
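Both coefficients can be reproduced from the ANOVA table entries. A short Python check (not part of the slides):

```python
# Sketch: R-squared and adjusted R-squared from the ANOVA table.
sse, ssr = 102.0496, 301.9384
sst = sse + ssr                                  # total sum of squares
n, p = 30, 2                                     # observations, predictors
r2 = ssr / sst
r2_adj = 1 - (sse / sst) * ((n - 1) / (n - p - 1))
print(round(r2, 4), round(r2_adj, 4))            # 0.7474 0.7287
```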
“Flaw” in R2

R2 can be inflated by the use of many predictors, but


each predictor has a price - a degree of freedom
paid by the Error source of variation:
Source       SS     df          MS     F
Regression   SSR    p           MSR    MSR/MSE
Error        SSE    n − p − 1   MSE
Total        SST    n − 1

Too many predictors “overfit” the data


Purpose of R2 Adjusted

 The purpose behind the adjusted R2 is to account


for the number of predictors.
o This is particularly relevant when n is small.

 It helps identify “overfitting” in regression models.

 The adjusted coefficient normally gives a more


appropriate estimate of the proportion of variance
in the dependent variable that is explained when
multiple predictors are used.
Overview

 Multiple Linear Regression (MLR) Example


 Interpretation of MLR Coefficients
 Goodness of Fit
 Inference
 Multifactor ANOVA
 Exercise
Inference

Mathematically more complex,


but otherwise the same as in
Simple Regression, with one
exception.
The significance tests based on
F obs and t obs do not address
identical hypotheses.
F-test of Model Significance

Shows if there is a linear relationship between any or


all X variables and Y

Hypotheses:
H0: β1 = β2 = ... = βp = 0
Ha: At least one βj is not 0

Test Statistic:
F-ratio = MSR / MSE ~ F(p, n − p − 1)

Decision Rule:
Reject H0 if the F-ratio exceeds the critical value F(p, n − p − 1)
Test of Model Significance for the Mini-case

H0: β1 = β2 = 0
Ha: Not both β1 and β2 are 0
Degrees of Sum of Mean of
ANOVA Table Freedom Squares Squares F-Ratio p-Value
Explained 2 301.9384 150.9692 39.9430 < 0.0001
Unexplained 27 102.0496 3.7796

If α = 0.05, Fc = 3.3541

We conclude that there is a statistically significant predictive


relationship between print quality and newsprint
roughness, or newsprint tensile strength, or both.
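The F-ratio is just MSR/MSE from the ANOVA table. A quick check (not from the slides):

```python
# Sketch: F-ratio from the ANOVA table, vs. the alpha = 0.05 critical
# value F(2, 27) = 3.3541 quoted on the slide.
msr, mse = 150.9692, 3.7796
f_ratio = msr / mse
print(round(f_ratio, 2))   # 39.94, far beyond 3.3541: reject H0
```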
T-test for the Significance of Each Predictor

Goal: Parsimony
 We want to identify variables that are not
significant in the model, and consider removing
any that aren’t.
 Remove one predictor (least significant) at a time.

 A predictor is not significant if its population


coefficient is 0.
 The test is identical to the t-test in simple
regression.
t-Test of Individual Predictor in the Model

Hypotheses (one set for each predictor):

H0: βj = 0 Implies an insignificant predictor.
Ha: βj ≠ 0 Implies a significant predictor.

Test Statistic: t_obs = (bj − 0) / s_bj ~ t(n − p − 1)

[Two-tailed rejection regions: α/2 in each tail; reject H0 if t_obs < −tc or t_obs > tc]
t-Tests

Hypotheses: H0: β1 = 0      H0: β2 = 0
            Ha: β1 ≠ 0      Ha: β2 ≠ 0
Standard Confidence Interval 95%
Regression Table Coefficient Error t-Value p-Value Lower Upper
Constant 61.6823 5.1856 11.8950 < 0.0001 51.0424 72.3223
Rough -0.0825 0.0125 -6.6107 < 0.0001 -0.1082 -0.0569
Strength 0.1060 0.0290 3.6540 0.0011 0.0465 0.1656
If α = 0.05, tc = t(27, .05/2) = 2.052: Reject H0 if t_obs < −2.052 or t_obs > 2.052
We conclude that there is strong evidence that the roughness of the
paper is a significant predictor of print quality in this model.

We conclude that there is strong evidence that the strength of the


paper is a significant predictor of print quality in this model.
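The t-values and confidence intervals in the table can be recomputed by hand from the coefficients and standard errors. A Python sketch (not from the slides); small discrepancies against the table come from rounding in the reported values:

```python
# Sketch: t-values and 95% CIs from coefficients and standard errors,
# using the critical value t(27, 0.025) = 2.052 quoted on the slide.
t_c = 2.052
results = {}
for name, coef, se in [("Rough", -0.0825, 0.0125), ("Strength", 0.1060, 0.0290)]:
    t_obs = coef / se
    lo, hi = coef - t_c * se, coef + t_c * se
    results[name] = (t_obs, lo, hi)
    print(name, round(t_obs, 2), round(lo, 4), round(hi, 4))
```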
Overview

 Multiple Linear Regression (MLR) Example


 Interpretation of MLR Coefficients
 Goodness of Fit
 Inference
 Multifactor ANOVA
 Exercise
Multifactor ANOVA
Advantages & Disadvantages

 The advantage of multifactor ANOVA is efficiency


versus one factor at a time analysis. It provides an
examination of the interaction.

 The disadvantages of multifactor ANOVA include a


more complicated model, the increased
commitment of resources, costly & time-
consuming, to collecting information and greater
risk of misuse of the model.
Mini-Case: Sales vs. Display Height & Width

The Acme Bakery wanted to determine if sales are


affected by the display height and display width.
Acme Bakery conducted a test in one store using the same item. The display heights were eye level (5’) and knee level (2’). The widths of the displays were narrow (18”) and wide (36”). Sales were recorded for one week for each design and are in $1000’s.
Significance level = 0.01
Data collected on sales, display height and width are
on the following slide.
Data
Two-Factor ANOVA Table
(Fixed Factors)

Source df SS MS F
Factor A a-1 SSA MSA = SSA/(a-1) MSA/MSE
Factor B b-1 SSB MSB = SSB/(b-1) MSB/MSE
AB Interaction (a-1)(b-1) SSAB MSAB = SSAB/(a-1)(b-1) MSAB/MSE
Error ab(n-1) SSE MSE = SSE/ab(n-1) -
Total N-1 SST - -

In our case, a = 2, b = 2 and n, number of replications at each


possible combination, equals 3, and N = 12
a–1=1
b–1=1
(a – 1)(b – 1) = 1
ab(n – 1) = 2x2x(3 – 1) = 8
N – 1 = 12 – 1 = 11
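The degrees-of-freedom bookkeeping above can be verified in a few lines (not from the slides):

```python
# Sketch: degrees of freedom for the 2x2 fixed-factor design, n = 3 replicates.
a, b, n = 2, 2, 3
N = a * b * n                      # 12 observations in total
df_A = a - 1                       # 1
df_B = b - 1                       # 1
df_AB = (a - 1) * (b - 1)          # 1
df_error = a * b * (n - 1)         # 8
assert df_A + df_B + df_AB + df_error == N - 1   # df must sum to N - 1
print(df_A, df_B, df_AB, df_error, N - 1)        # 1 1 1 8 11
```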
Two-Factor ANOVA—Help from Technology

GLM provides flexibility: Will handle unbalanced and balanced designs. Here,
since the design is balanced, the Sequential SS and Adjusted SS are the same.
Main Effects Plot

[Main effects plot for Sales (data means): mean Sales by Height (Eye, Knee) and by Width (Narrow, Wide)]

Interaction Plot

[Interaction plot for Sales (data means): mean Sales vs. Width (Narrow, Wide), with separate lines for Height = Eye and Height = Knee]
Examining One Variable at a Time

Notice that SST is the same in both models—64.34. The SS for Width is also the same—25.81. However, in the one-way ANOVA this factor is not significant at an alpha of 0.01, whereas in the two-way ANOVA with the interaction it is highly significant.
Examining One Variable at a Time

Same issue with Height. If a factor is not a part of the analysis, its effect is
in the error term—the denominator of the F-ratios. The error term is
supposed to be just random error and not assignable causes. If there are
assignable causes in the error term, significant effects can be missed.
BREAK

FRIDAY’S
HANDS-ON EXERCISE
Overview

 Multiple Linear Regression (MLR) Example


 Interpretation of MLR Coefficients
 Goodness of Fit
 Inference
 Multifactor ANOVA
 Exercise
Generate MLR Data

Exercise: Shoot the catapult to generate a multiple regression equation for Distance as a function of three inputs.

Objective: MLR
- Experiment to collect data regarding 3 factors
- Use statistical software to generate the model
- Recognize if multicollinearity exists in the model

Time: Take 45 minutes

Directions: Following slides

[Catapult diagram labels: Cup, Rubber Band, Hook Pin, Vertical Pin, Start Angle, Stop Pin]
Generate MLR Data: Directions
 Break into groups of 4 or 5
 Get catapult, duct tape, tape measure, 12’ aluminum foil & rubber ball
 Duct tape catapult to floor securely
 Use aluminum foil to indicate where the ball landed—secure it to the
floor with a couple of pieces of tape to prevent it from sliding when the
ball lands on it
 Place, and then secure, the tape measure along side the aluminum foil
starting from the front of the catapult
 Response variable = Distance ball travels before hitting the ground
 Experimental factors, independent variables…
 Hook pin
 Vertical pin
 Start angle

[Setup diagram labels: catapult, aluminum foil, tape measure]
Generate MLR Data: Directions

 Experimental factors
 Hook pin
 Vertical pin
 Start angle

Constants:
Put cup at location 1
Put stop pin in location 3
Generate MLR Data

 For Minitab, create a column for each variable—Response variable and the
three factors

 Generate as much data as you believe necessary to calculate:


 Main effects
 Interactions
 Experimental error
Generate MLR Data

 You will have to generate the interactions in Minitab to include them in the analysis

 Multiply two main effects in a separate column to create the interaction
 Calc>Calculator will give you the above dialog box (left)
 Store it in the next column, here I labeled it “HP*VP” for the interaction of Hook Pin and Vertical Pin
 In “Expression” box, double click on Hook Pin, click on the multiplier button, *, and double click on Vertical Pin
 Click on “OK”
 Continue this for the other two, two-way interactions, HP*SA and VP*SA
Generate MLR Data

When you’re ready to analyze the data, your


worksheet should look like this:

One column for your response variable, Distance, C4

The other columns are the factors you will regress on,
the main effects and their interactions
Generate MLR Data

 Select: Stat>Regression>Regression

 Enter Distance in the “Response:” window


 Place cursor in the “Predictors:” window by clicking in the window
 Enter all the main effects and interactions by double clicking on them
Generate MLR Data

 Click the “Options” button and select Variance inflation factors

 Click “OK”

 Click the “Storage” button and select Standardized Residuals

 Click “OK”
Generate MLR Data

 Click on the Graphs button and select the 4-in-1 option

 Click on OK

 Click on OK
Generate MLR Data

 Each person on your team must analyze YOUR team’s results


 Hand it in next week as a homework exercise

 Determine
 Which factors are significant; eliminate insignificant factors/interactions
 Provide model with only significant variables
 If multicollinearity exists
 What’s the proof
 If it does, what are the issues with your model
 Validate your assumptions
 Residuals are normal, independent, have constant variance, and E{ei} = 0
 How much of the total variation does your model explain
 What is the 95% range, ballpark, I can expect a predicted value to fall within
 Is your model good or are there issues

Level of Significance = 0.05


Summary of Key Learning Points

 With MLR, many more concerns


 Goodness of Fit: R², s and F-ratio
 R²adj adjusts for overfitting the model
 Residuals still need to be
 Normal, independent, with equal variances and E{ei} = 0

 In an ANOVA table, what happens to the SS of a


factor if it’s not in the analysis?
