MIT2 854F10 Reg

Data and Regression Analysis
Lecturer: Prof. Duane S. Boning
Rev 10
Agenda
1. Comparison of Treatments (One Variable)
   - Analysis of Variance (ANOVA)
2. Multivariate Analysis of Variance
   - Model forms
3. Regression Modeling
   - Regression fundamentals
   - Significance of model terms
   - Confidence intervals
Is Process B Better Than Process A?

time order   method   yield
 1           A        89.7
 2           A        81.4
 3           A        84.5
 4           A        84.8
 5           A        87.3
 6           A        79.7
 7           A        85.1
 8           A        81.7
 9           A        83.7
10           A        84.5
11           B        84.7
12           B        86.1
13           B        83.2
14           B        91.9
15           B        86.3
16           B        79.3
17           B        82.6
18           B        89.1
19           B        83.7
20           B        88.5

[Figure: yield (78 to 92) plotted against time order (0 to 20).]
Assume variances in A and B are equal.
Two Means with Internal Estimate of Variance

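Method A: $n_A = 10$, $\bar{y}_A = 84.24$, $s_A^2 = 8.42$
Method B: $n_B = 10$, $\bar{y}_B = 85.54$, $s_B^2 = 13.32$
Pooled estimate of $\sigma^2$:
$$s^2 = \frac{(n_A - 1)\,s_A^2 + (n_B - 1)\,s_B^2}{n_A + n_B - 2} = 10.87$$
with $\nu = 18$ d.o.f.
Estimated variance of $\bar{y}_B - \bar{y}_A$:
$$s^2 \left( \frac{1}{n_A} + \frac{1}{n_B} \right)$$
Estimated standard error of $\bar{y}_B - \bar{y}_A$:
$$s \sqrt{\frac{1}{n_A} + \frac{1}{n_B}} = 1.47$$
$$t_0 = \frac{\bar{y}_B - \bar{y}_A}{1.47} = \frac{1.30}{1.47} = 0.88, \qquad \Pr(t_{18} > 0.88) \approx 0.195$$
So we are only about 80.5% confident that the mean difference is real (significant).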
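As a rough check of this comparison, the sketch below recomputes the pooled variance, the standard error of the difference, and the t ratio from the yield table. The function name `pooled_t` is an illustrative assumption, and only the Python standard library is used, so the tail probability itself is not computed.

```python
from math import sqrt
from statistics import mean, variance

# Yields from the table above (methods A and B, ten runs each).
yield_a = [89.7, 81.4, 84.5, 84.8, 87.3, 79.7, 85.1, 81.7, 83.7, 84.5]
yield_b = [84.7, 86.1, 83.2, 91.9, 86.3, 79.3, 82.6, 89.1, 83.7, 88.5]


def pooled_t(sample_a, sample_b):
    """Return (pooled variance, std. error, t0, dof) for mean(b) - mean(a)."""
    n_a, n_b = len(sample_a), len(sample_b)
    dof = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    s2 = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / dof
    # Standard error of the difference between the two sample means.
    se = sqrt(s2 * (1 / n_a + 1 / n_b))
    t0 = (mean(sample_b) - mean(sample_a)) / se
    return s2, se, t0, dof


s2, se, t0, dof = pooled_t(yield_a, yield_b)
print(f"pooled s^2 = {s2:.2f}, std. error = {se:.2f}, t0 = {t0:.2f}, dof = {dof}")
```

The t ratio of roughly 0.88 on 18 degrees of freedom is what leads to the modest confidence level quoted on the slide.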

Comparison of Treatments
[Figure: three populations A, B, and C, with one sample (Sample A, Sample B, Sample C) drawn from each.]
Consider multiple conditions (treatments, settings for some variable)

There is an overall mean $\mu$ and real effects (deltas) $\tau_i$ between conditions. We observe samples at each condition of interest.
Key question: are the observed differences in mean significant?

Typical assumption (should be checked): the underlying variances are all the same, usually an unknown value ($\sigma_0^2$).
Steps/Issues in Analysis of Variance

1. Within group variation
   - Estimate underlying population variance
2. Between group variation
   - Estimate group to group variance
3. Compare the two estimates of variance
   - If there is a difference between the treatments, then the between group variation estimate will be inflated compared to the within group estimate
   - We will be able to establish confidence in whether or not observed differences between treatments are significant
   - Hint: we'll be using F tests to look at ratios of variances
(1) Within Group Variation

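Assume that each group is normally distributed and shares a common variance $\sigma_0^2$.
$SS_t$ = sum of squared deviations within the $t$th group (there are $k$ groups):
$$SS_t = \sum_{i=1}^{n_t} (y_{ti} - \bar{y}_t)^2$$
Estimate of within group variance in the $t$th group (just the variance formula):
$$s_t^2 = \frac{SS_t}{\nu_t} = \frac{SS_t}{n_t - 1}$$
Pool these (across different conditions) to get an estimate of the common within group variance:
$$s_R^2 = \frac{\nu_1 s_1^2 + \nu_2 s_2^2 + \cdots + \nu_k s_k^2}{\nu_1 + \nu_2 + \cdots + \nu_k} = \frac{SS_R}{N - k} = \frac{SS_R}{\nu_R}$$
This is the within group mean square (variance estimate), where $SS_R = \sum_t SS_t$ and $N = \sum_t n_t$.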
(2) Between Group Variation

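We will be testing the hypothesis $\mu_1 = \mu_2 = \cdots = \mu_k$.
If all the means are in fact equal, then a second estimate of $\sigma^2$ could be formed based on the observed differences between group means:
$$s_T^2 = \frac{\sum_{t=1}^{k} n_t (\bar{y}_t - \bar{y})^2}{k - 1} = \frac{SS_T}{\nu_T}$$
If the treatments in fact have different means, then $s_T^2$ estimates something larger:
$$\sigma^2 + \frac{\sum_{t=1}^{k} n_t \tau_t^2}{k - 1}$$
Variance is inflated by the real treatment effects $\tau_t$.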
(3) Compare Variance Estimates

We now have two different possibilities for $s_T^2$, depending on whether the observed sample mean differences are real or are just occurring by chance (by sampling).
Use the F statistic to see if the ratio of these variances is likely to have occurred by chance!
Formal test for significance: is $s_T^2 / s_R^2$ significantly greater than 1?
(4) Compute Significance Level

Calculate the observed F ratio (with appropriate degrees of freedom in numerator and denominator).
Use the F distribution to find how likely a ratio this large is to have occurred by chance alone, $\Pr(F_{\nu_T, \nu_R} \ge F_0)$.
This is our significance level.
Define the observed ratio:
$$F_0 = \frac{s_T^2}{s_R^2}$$
If $F_0 > F_{\alpha, \nu_T, \nu_R}$, then we say that the mean differences or treatment effects are significant to $(1 - \alpha) \cdot 100\%$ confidence or better.
(5) Variance Due to Treatment Effects

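We also want to estimate the sum of squared deviations from the grand mean among all samples:
$$SS_D = \sum_{t=1}^{k} \sum_{i=1}^{n_t} (y_{ti} - \bar{y})^2, \qquad s_D^2 = \frac{SS_D}{N - 1} = \frac{SS_D}{\nu_D}$$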
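A short derivation, added here as a sketch, shows why the total sum of squares about the grand mean splits into a within group piece and a between group piece. The symbols $SS_D$, $SS_R$, and $SS_T$ are assumed names for the total, within group, and between group sums of squares; $\bar{y}_t$ is the mean of group $t$ and $\bar{y}$ the grand mean.

```latex
\begin{align*}
SS_D &= \sum_{t=1}^{k}\sum_{i=1}^{n_t}
        \bigl[(y_{ti}-\bar{y}_t) + (\bar{y}_t-\bar{y})\bigr]^2 \\
     &= \underbrace{\sum_{t}\sum_{i}(y_{ti}-\bar{y}_t)^2}_{SS_R}
      + \underbrace{\sum_{t} n_t(\bar{y}_t-\bar{y})^2}_{SS_T}
      + 2\sum_{t}(\bar{y}_t-\bar{y})
        \underbrace{\sum_{i}(y_{ti}-\bar{y}_t)}_{=\,0} \\
     &= SS_R + SS_T
\end{align*}
```

The cross term vanishes because deviations from a group's own mean sum to zero. The degrees of freedom add in the same way: $N - 1 = (N - k) + (k - 1)$.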
(6) Results: The ANOVA Table

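source of variation             sum of squares   degrees of freedom   mean square   F0                Pr(F0)
Between treatments              $SS_T$           $\nu_T = k - 1$      $s_T^2$       $s_T^2 / s_R^2$   $\Pr(F \ge F_0)$
Within treatments               $SS_R$           $\nu_R = N - k$      $s_R^2$
Total about the grand average   $SS_D$           $\nu_D = N - 1$      $s_D^2$

The within treatments sum of squares is also referred to as the residual SS.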
Example: Anova
A (t = 1)   B (t = 2)   C (t = 3)
11          10          12
10           8          10
12           6          11

[Figure: observations plotted by treatment A (t = 1), B (t = 2), C (t = 3); vertical axis 6 to 12.]

Excel: Data Analysis, One-Variation Anova

Anova: Single Factor
SUMMARY
Groups   Count   Sum   Average   Variance
A        3       33    11        1
B        3       24    8         4
C        3       33    11        1

ANOVA
Source of Variation   SS   df   MS   F     P-value   F crit
Between Groups        18   2    9    4.5   0.064     5.14
Within Groups         12   6    2
Total                 30   8
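The Excel table can be reproduced by hand. The following is a minimal sketch, using only the Python standard library, of the between and within sums of squares for the three groups above; the function name `one_way_anova` and the dictionary layout are illustrative assumptions, and the critical value 5.14 is simply copied from the Excel output rather than computed.

```python
from statistics import mean

# Observations for treatments A, B, C from the example above.
groups = {"A": [11, 10, 12], "B": [10, 8, 6], "C": [12, 10, 11]}

F_CRIT = 5.14  # "F crit" as reported in the Excel output above


def one_way_anova(groups):
    """Return sums of squares, degrees of freedom, mean squares and F0."""
    all_values = [y for ys in groups.values() for y in ys]
    grand_mean = mean(all_values)
    k, n_total = len(groups), len(all_values)
    # Between-treatment SS: group size times squared offset of each group mean.
    ss_between = sum(len(ys) * (mean(ys) - grand_mean) ** 2
                     for ys in groups.values())
    # Within-treatment SS: squared deviations from each group's own mean.
    ss_within = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)
    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "ss_within": ss_within,
        "ss_total": ss_between + ss_within,
        "df_between": df_between, "df_within": df_within,
        "ms_between": ms_between, "ms_within": ms_within,
        "f0": ms_between / ms_within,
    }


result = one_way_anova(groups)
significant = result["f0"] > F_CRIT
print(result, "significant:", significant)
```

Because the observed ratio of 4.5 falls below the reported critical value of 5.14, the treatment differences are not significant at that level, which agrees with the P-value of 0.064 in the table.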
ANOVA Implied Model

The ANOVA approach assumes a simple mathematical model:
$$y_{ti} = \mu_t + \epsilon_{ti} = \mu + \tau_t + \epsilon_{ti}$$
where $\mu_t$ is the treatment mean (for treatment type $t$), $\tau_t$ is the treatment effect, and the $\epsilon_{ti}$ are zero mean normal residuals $\sim N(0, \sigma_0^2)$.
Checks
- Plot residuals against time order
- Examine distribution of residuals: should be IID, Normal
- Plot residuals vs. estimates
- Plot residuals vs. other variables of interest
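Each of these checks starts from the residuals themselves. As a small sketch, the snippet below computes fitted values and residuals for the three-group example under the model in which each observation is its treatment mean plus a residual. The helper name `anova_residuals` is an illustrative assumption, and plotting is left to whatever tool is at hand.

```python
from statistics import mean

# Three-group example data from the ANOVA example in these notes.
groups = {"A": [11, 10, 12], "B": [10, 8, 6], "C": [12, 10, 11]}


def anova_residuals(groups):
    """Return (fitted value, residual) pairs, in the order the data are listed."""
    pairs = []
    for ys in groups.values():
        mu_t = mean(ys)  # estimated treatment mean for this group
        pairs.extend((mu_t, y - mu_t) for y in ys)
    return pairs


pairs = anova_residuals(groups)
fitted = [f for f, _ in pairs]
residuals = [r for _, r in pairs]
print("fitted:   ", fitted)
print("residuals:", residuals)
```

Plotting `residuals` against their position in the list gives the time-order check (assuming the data are listed in run order), and plotting them against `fitted` gives the residuals vs. estimates check.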
MANOVA Two Dependencies

Can extend to two (or more) variables of interest. MANOVA assumes a mathematical model, again simply capturing the means (or treatment offsets) for each discrete variable level:
$$y_{tq} = \mu + \tau_t + \beta_q + \epsilon_{tq}$$
^ indicates estimates:
$$\hat{y}_{tq} = \hat{\mu} + \hat{\tau}_t + \hat{\beta}_q$$
Assumes that the effects from the two variables are additive.
Example: Two Factor MANOVA

Two LPCVD deposition tube types, three gas suppliers. Does supplier matter in average particle counts on wafers?
Experiment: 3 lots on each tube, for each gas; report average # particles added

                    Factor 1: Gas
Factor 2: Tube      A      B      C      Tube mean
1                   7      36     2      15
2                   13     44     18     25
Gas mean            10     40     10

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      3    1350.00          450.0         32.14     0.0303
Error      2    28.00            14.0
C. Total   5    1378.00

Effect Tests
Source   Nparm   DF   Sum of Squares   F Ratio   Prob > F
Tube     1       1    150.00           10.71     0.0820
Gas      2       2    1200.00          42.85     0.0228

Decomposition of the data (rows are tubes 1 and 2, columns are gases A, B, C):

 data           grand mean      gas effects       tube effects     residuals
 7  36   2  =   20  20  20  +   -10  20  -10  +   -5  -5  -5   +    2   1  -3
13  44  18      20  20  20      -10  20  -10       5   5   5       -2  -1   3
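The decomposition of the particle-count data into grand mean, gas effects, tube effects, and residuals can be sketched in a few lines. This is an illustrative reconstruction assuming the additive two-factor model; variable names such as `tube_effect` and `gas_effect` are not from the original output.

```python
from statistics import mean

# Average particle counts from the table above: rows are tubes 1 and 2,
# columns are gases A, B, C.
counts = [[7, 36, 2],
          [13, 44, 18]]

n_tubes, n_gases = len(counts), len(counts[0])
grand_mean = mean([v for row in counts for v in row])

# Main effects: offsets of each row (tube) and column (gas) mean from the grand mean.
tube_effect = [mean(row) - grand_mean for row in counts]
gas_effect = [mean([row[q] for row in counts]) - grand_mean
              for q in range(n_gases)]

# Residuals are what the additive model leaves unexplained in each cell.
residual = [[counts[t][q] - (grand_mean + tube_effect[t] + gas_effect[q])
             for q in range(n_gases)] for t in range(n_tubes)]

ss_tube = n_gases * sum(e ** 2 for e in tube_effect)
ss_gas = n_tubes * sum(e ** 2 for e in gas_effect)
ss_error = sum(r ** 2 for row in residual for r in row)
df_error = (n_tubes - 1) * (n_gases - 1)
ms_error = ss_error / df_error

f_tube = (ss_tube / (n_tubes - 1)) / ms_error
f_gas = (ss_gas / (n_gases - 1)) / ms_error
print(f"SS tube = {ss_tube}, SS gas = {ss_gas}, SS error = {ss_error}")
print(f"F tube = {f_tube:.2f}, F gas = {f_gas:.2f}")
```

The sums of squares and F ratios agree with the Effect Tests table to within rounding.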
MANOVA Two Factors with Interactions

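May be interaction: effects are not simply additive, but may depend synergistically on both factors:
$$y_{tqi} = \mu_{tq} + \epsilon_{tqi}, \qquad \epsilon_{tqi} \text{ IID}, \sim N(0, \sigma^2)$$
where $\mu_{tq}$ is an effect that depends on both the $t$ and $q$ factors simultaneously, and
- $t$ = first factor = 1, 2, ..., $k$ ($k$ = # levels of first factor)
- $q$ = second factor = 1, 2, ..., $n$ ($n$ = # levels of second factor)
- $i$ = replication = 1, 2, ..., $m$ ($m$ = # replications at the $t$, $q$th combination of factor levels)
Can split out the model more explicitly:
$$y_{tqi} = \mu + \tau_t + \beta_q + \omega_{tq} + \epsilon_{tqi}$$
with $\omega_{tq}$ the interaction term.
Estimate by:
$$\hat{y}_{tq} = \bar{y} + (\bar{y}_t - \bar{y}) + (\bar{y}_q - \bar{y}) + (\bar{y}_{tq} - \bar{y}_t - \bar{y}_q + \bar{y})$$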
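To make the interaction estimate concrete, here is a minimal sketch on a small replicated data set. The numbers are hypothetical (they do not come from these notes), and the estimate used is the usual one for a balanced design: cell mean minus row mean minus column mean plus grand mean.

```python
from statistics import mean

# Hypothetical replicated data (not from the notes): data[t][q] holds the
# m = 2 replicate observations at level t of factor 1 and level q of factor 2.
data = [[[10, 12], [20, 22]],
        [[14, 16], [34, 36]]]

k, n = len(data), len(data[0])
cell_mean = [[mean(data[t][q]) for q in range(n)] for t in range(k)]
grand_mean = mean([y for row in data for cell in row for y in cell])
row_mean = [mean([y for cell in data[t] for y in cell]) for t in range(k)]
col_mean = [mean([y for t in range(k) for y in data[t][q]]) for q in range(n)]

# Interaction estimate: what remains of each cell mean after removing the
# grand mean and the two additive main effects.
interaction = [[cell_mean[t][q] - row_mean[t] - col_mean[q] + grand_mean
                for q in range(n)] for t in range(k)]

# Within-cell (error) sum of squares, available only because of replication.
ss_error = sum((y - cell_mean[t][q]) ** 2
               for t in range(k) for q in range(n) for y in data[t][q])

print("interaction estimates:", interaction)
print("within-cell SS:", ss_error)
```

In this made-up case the second factor has a larger effect at the second level of the first factor, so the interaction estimates are nonzero; in a purely additive data set they would all be zero.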
MANOVA Table Two Way with Interactions

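source of variation              sum of squares   degrees of freedom   mean square   F0                Pr(F0)
Between levels of factor 1 (T)   $SS_T$           $k - 1$              $s_T^2$       $s_T^2 / s_E^2$
Between levels of factor 2 (B)   $SS_B$           $n - 1$              $s_B^2$       $s_B^2 / s_E^2$
Interaction                      $SS_I$           $(k - 1)(n - 1)$     $s_I^2$       $s_I^2 / s_E^2$
Within Groups (Error)            $SS_E$           $nk(m - 1)$          $s_E^2$
Total about the grand average    $SS_D$           $nkm - 1$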
Measures of Model Goodness: $R^2$

Goodness of fit: $R^2$
Question considered: how much better does the model do than just using the grand average?
$$R^2 = \frac{SS_T}{SS_D} = 1 - \frac{SS_R}{SS_D}$$
Think of this as the fraction of squared deviations (from the grand average) in the data which is captured by the model.
Adjusted $R^2$
For fair comparison between models with different numbers of coefficients, an alternative is often used:
$$R_{adj}^2 = 1 - \frac{SS_R / \nu_R}{SS_D / \nu_D} = 1 - \frac{s_R^2}{s_D^2}$$
Think of this as (1 - variance remaining in the residual). Recall $\nu_R = \nu_D - \nu_T$.
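Both measures are one-line computations once the residual and total sums of squares are known. The sketch below uses the error and total sums of squares reported for the quadratic growth-rate fit in the JMP output later in these notes; the function names are illustrative assumptions, and the parameter count of 3 includes the intercept.

```python
def r_squared(ss_resid, ss_total):
    """Fraction of squared deviation about the grand average captured by the model."""
    return 1.0 - ss_resid / ss_total


def adjusted_r_squared(ss_resid, ss_total, n_obs, n_params):
    """One minus the ratio of residual variance to total variance."""
    df_resid = n_obs - n_params   # nu_R
    df_total = n_obs - 1          # nu_D, total about the grand average
    return 1.0 - (ss_resid / df_resid) / (ss_total / df_total)


# Error and total sums of squares from the quadratic growth-rate fit in these
# notes: 10 observations, 3 fitted parameters (intercept, x, x^2).
ss_resid, ss_total = 45.19383, 710.9
r2 = r_squared(ss_resid, ss_total)
r2_adj = adjusted_r_squared(ss_resid, ss_total, n_obs=10, n_params=3)
print(f"R^2 = {r2:.6f}, adjusted R^2 = {r2_adj:.6f}")
```

The adjusted value is lower because each extra coefficient uses up a residual degree of freedom.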
Regression Fundamentals
Use least square error as measure of goodness to estimate coefficients in a model
One parameter model:
- Model form
- Squared error
- Estimation using normal equations
- Estimate of experimental error
- Precision of estimate: variance in b
- Confidence interval for $\beta$
- Analysis of variance: significance of b
- Lack of fit vs. pure error
Polynomial regression
Least Squares Regression

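We use least-squares to estimate coefficients in typical regression models.
One-Parameter Model:
$$y_i = \beta x_i + \epsilon_i, \qquad \hat{y}_i = b x_i$$
Goal is to estimate $\beta$ with the "best" $b$. How do we define "best"?

That $b$ which minimizes the sum of squared error between prediction and data:
$$SS(\beta) = \sum_{i=1}^{n} (y_i - \beta x_i)^2$$
The residual sum of squares (for the best estimate) is
$$SS_R = SS(b) = \sum_{i=1}^{n} (y_i - b x_i)^2$$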
Least Squares Regression, cont.

Least squares estimation via normal equations
For linear problems, we need not calculate $SS(\beta)$; rather, direct solution for $b$ is possible.
Recognize that the vector of residuals will be normal to the vector of $x$ values at the least squares estimate:
$$\sum_{i} (y_i - b x_i)\, x_i = 0 \quad\Rightarrow\quad b = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$$
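The same condition follows from calculus. As a brief sketch for the one-parameter model $y_i = \beta x_i + \epsilon_i$ (with $SS(\beta)$ denoting the sum of squared errors as a function of $\beta$), setting the derivative to zero gives the normal equation:

```latex
\begin{align*}
\frac{d\,SS(\beta)}{d\beta}
  &= \frac{d}{d\beta}\sum_{i}(y_i-\beta x_i)^2
   = -2\sum_{i} x_i\,(y_i-\beta x_i) \\
0 &= \sum_{i} x_i\,(y_i - b\,x_i)
  \quad\Longrightarrow\quad
  b = \frac{\sum_i x_i y_i}{\sum_i x_i^2}
\end{align*}
```

The stationarity condition is exactly the statement that the residual vector is orthogonal to the vector of $x$ values.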
Estimate of experimental error

Assuming the model structure is adequate, an estimate $s^2$ of $\sigma^2$ can be obtained:
$$s^2 = \frac{SS_R}{n - 1}$$
Precision of Estimate: Variance in b

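We can calculate the variance in our estimate of the slope, b:
$$\mathrm{Var}(b) = \frac{\sigma^2}{\sum_i x_i^2}, \qquad s_b^2 = \frac{s^2}{\sum_i x_i^2}$$
Why? $b = \sum_i x_i y_i / \sum_i x_i^2$ is a linear combination of the independent $y_i$, each with variance $\sigma^2$, so
$$\mathrm{Var}(b) = \sum_i \left( \frac{x_i}{\sum_j x_j^2} \right)^2 \sigma^2 = \frac{\sigma^2}{\sum_i x_i^2}$$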
Confidence Interval for $\beta$

Once we have the standard error in b, we can calculate confidence intervals to some desired $(1 - \alpha) \cdot 100\%$ level of confidence:
$$\beta = b \pm t_{\alpha/2,\,\nu}\, s_b$$
Analysis of variance
Test hypothesis: $H_0: \beta = 0$. If the confidence interval for $\beta$ includes 0, then $\beta$ is not significant.
Degrees of freedom (needed in order to use the t distribution): $\nu = n - p$
$p$ = # parameters estimated by least squares
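As a small illustration of the interval and the significance check, the sketch below uses the slope and standard error reported for the age/income example in these notes. The critical value 2.306 is the two-sided 95% point of the t distribution with 8 degrees of freedom, taken from standard tables rather than computed; the function name is an illustrative assumption.

```python
def confidence_interval(estimate, std_error, t_crit):
    """Return (low, high) for estimate +/- t_crit * std_error."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width


# Slope and standard error as reported for the age/income example in these notes.
b, s_b = 0.500983, 0.015152
T_CRIT_95_DOF8 = 2.306  # tabulated t value, alpha/2 = 0.025, nu = 8

low, high = confidence_interval(b, s_b, T_CRIT_95_DOF8)
# The slope is significant if the interval excludes zero.
significant = not (low <= 0.0 <= high)
print(f"95% CI for beta: ({low:.4f}, {high:.4f}), significant: {significant}")
```

Since the interval lies well away from zero, the slope is significant, in line with the very small Prob>|t| in that example.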
Example Regression
Age      8      22     35      40      57      73      78      87      98
Income   6.16   9.88   14.35   24.06   30.34   32.17   42.18   43.23   48.76

Whole Model
Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio    Prob > F
Model      1    8836.6440        8836.64       1093.146   <.0001
Error      8    64.6695          8.08
C. Total   9    8901.3135
Tested against reduced model: Y=0

Parameter Estimates
Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept   Zeroed 0   0           .         .
age         0.500983   0.015152    33.06     <.0001

Effect Tests
Source   Nparm   DF   Sum of Squares   F Ratio    Prob > F
age      1       1    8836.6440        1093.146   <.0001

[Figure: income leverage plot, income (10 to 50) against age (0 to 100), with leverage residuals; age leverage, P<.0001.]

Note that this simple model assumes an intercept of zero: the model must go through the origin.
We will relax this requirement soon.
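The parameter estimates for this example can be reproduced directly from the normal equation for a model through the origin. The sketch below is a minimal version using only the standard library; `fit_through_origin` is an illustrative name, and the residual degrees of freedom are taken as n - 1 because a single parameter is estimated.

```python
from math import sqrt

# Age and income data from the example above.
age = [8, 22, 35, 40, 57, 73, 78, 87, 98]
income = [6.16, 9.88, 14.35, 24.06, 30.34, 32.17, 42.18, 43.23, 48.76]


def fit_through_origin(x, y):
    """Least-squares slope for y = b*x, with residual SS, s^2 and std. error of b."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    b = sxy / sxx                      # normal equation solution
    ss_resid = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
    s2 = ss_resid / (len(x) - 1)       # one parameter estimated
    s_b = sqrt(s2 / sxx)               # standard error of the slope
    return b, ss_resid, s2, s_b


b, ss_resid, s2, s_b = fit_through_origin(age, income)
t_ratio = b / s_b
print(f"b = {b:.6f}, SS_R = {ss_resid:.4f}, s^2 = {s2:.2f}, "
      f"s_b = {s_b:.6f}, t = {t_ratio:.2f}")
```

The slope, error sum of squares, standard error, and t ratio match the JMP output above to the digits shown there.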
Lack of Fit Error vs. Pure Error

Sometimes we have replicated data
- E.g. multiple runs at same x values in a designed experiment
We can decompose the residual error contributions:
$$SS_R = SS_L + SS_E$$
where
- $SS_R$ = residual sum of squares error
- $SS_L$ = lack of fit squared error
- $SS_E$ = pure replicate error
This allows us to TEST for lack of fit
- By lack of fit we mean evidence that the linear model form is inadequate
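The decomposition can be sketched in code as follows, using the growth-rate data that appear later in these notes and a straight-line fit with intercept. The function names are illustrative assumptions; pure error is computed from replicates sharing the same x value, and lack of fit is whatever residual sum of squares remains.

```python
from collections import defaultdict
from statistics import mean

# Growth-rate data from the polynomial regression example in these notes.
x = [10, 10, 15, 20, 20, 25, 25, 25, 30, 35]
y = [73, 78, 85, 90, 91, 87, 86, 91, 75, 65]


def straight_line_fit(x, y):
    """Return intercept a and slope b of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def lack_of_fit(x, y, predict, n_params):
    """Split the residual SS into lack-of-fit and pure-error parts."""
    ss_resid = sum((yi - predict(xi)) ** 2 for xi, yi in zip(x, y))
    replicates = defaultdict(list)
    for xi, yi in zip(x, y):
        replicates[xi].append(yi)
    # Pure error: scatter of replicates about their own mean at each x.
    ss_pure = sum((yi - mean(ys)) ** 2 for ys in replicates.values() for yi in ys)
    df_pure = len(y) - len(replicates)
    ss_lof = ss_resid - ss_pure
    df_lof = len(replicates) - n_params
    ratio = (ss_lof / df_lof) / (ss_pure / df_pure)
    return ss_lof, df_lof, ss_pure, df_pure, ratio


a, b = straight_line_fit(x, y)
ss_lof, df_lof, ss_pure, df_pure, ratio = lack_of_fit(
    x, y, lambda xi: a + b * xi, n_params=2)
print(f"SS_L = {ss_lof:.1f} ({df_lof} dof), SS_E = {ss_pure:.1f} ({df_pure} dof), "
      f"ratio = {ratio:.2f}")
```

A ratio this far above 1 is the kind of evidence of lack of fit that the first order growth-rate table in these notes reports.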
Regression: Mean Centered Models

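Model form:
$$y_i = \alpha + \beta (x_i - \bar{x}) + \epsilon_i$$
Estimate by:
$$\hat{y}_i = a + b (x_i - \bar{x}), \qquad a = \bar{y}, \qquad b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$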
Regression: Mean Centered Models

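Confidence Intervals
In the mean centered model the estimates $a$ and $b$ are uncorrelated, so the variance of the prediction at $x$ is
$$\mathrm{Var}(\hat{y}) = \mathrm{Var}(a) + (x - \bar{x})^2\, \mathrm{Var}(b) = \frac{s^2}{n} + \frac{(x - \bar{x})^2\, s^2}{\sum_i (x_i - \bar{x})^2}$$
and the interval is $\hat{y} \pm t_{\alpha/2,\, n-2} \sqrt{\mathrm{Var}(\hat{y})}$.
Our confidence interval on y widens as we get further from the center of our data!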
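The widening can be seen numerically. The sketch below fits a mean centered straight line to the growth-rate data from later in these notes and evaluates the standard error of the fitted value at the center of the data and at its edge. The function names are illustrative assumptions, and the standard error formula is the usual one for a two-parameter straight-line fit.

```python
from math import sqrt
from statistics import mean

# Growth-rate data from the polynomial regression example in these notes.
x = [10, 10, 15, 20, 20, 25, 25, 25, 30, 35]
y = [73, 78, 85, 90, 91, 87, 86, 91, 75, 65]


def mean_centered_fit(x, y):
    """Fit y = a + b*(x - x_bar); return a, b, x_bar, Sxx and s^2."""
    n = len(x)
    x_bar = mean(x)
    a = mean(y)  # in the centered model the intercept estimate is the mean of y
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - a) for xi, yi in zip(x, y)) / sxx
    ss_resid = sum((yi - (a + b * (xi - x_bar))) ** 2 for xi, yi in zip(x, y))
    s2 = ss_resid / (n - 2)  # two parameters estimated
    return a, b, x_bar, sxx, s2


def std_error_of_fit(x0, n, x_bar, sxx, s2):
    """Standard error of the fitted mean response at x0."""
    return sqrt(s2 / n + (x0 - x_bar) ** 2 * s2 / sxx)


a, b, x_bar, sxx, s2 = mean_centered_fit(x, y)
se_center = std_error_of_fit(x_bar, len(x), x_bar, sxx, s2)
se_edge = std_error_of_fit(35, len(x), x_bar, sxx, s2)
print(f"a = {a:.1f}, b = {b:.4f}, s^2 = {s2:.1f}")
print(f"std. error at x_bar = {se_center:.2f}, at x = 35: {se_edge:.2f}")
```

Multiplying either standard error by the appropriate t value gives the half-width of the interval at that x, so the interval at the edge of the data is markedly wider than at the center.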
Polynomial Regression
We may believe that a higher order model structure applies. Polynomial forms are also linear in the coefficients and can be fit with least squares:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$$
Curvature is included through the $x^2$ term.
Example: Growth rate data
Regression Example: Growth Rate Data

[Figure: Bivariate Fit of y By x. y (60 to 95) against x (5 to 40), showing the fit mean, the linear fit, and the polynomial fit (degree = 2). Image by MIT OpenCourseWare.]
Replicate data provides opportunity to check for lack of fit
Growth Rate First Order Model

Mean significant, but linear term not
Clear evidence of lack of fit

Source                 Sum of squares    Degrees of freedom   Mean square
Model                  SM = 67,428.6     2
  mean                 67,404.1          1                    67,404.1
  extra for linear     24.5              1                    24.5
Residual               SR = 686.4        8                    85.8
  lack of fit          SL = 659.40       4                    164.85
  pure error           SE = 27.0         4                    6.75
Total                  ST = 68,115.0     10

Lack of fit ratio = 164.85 / 6.75 = 24.42
Growth Rate Second Order Model

No evidence of lack of fit
Quadratic term significant

Source                   Sum of squares    Degrees of freedom   Mean square
Model                    SM = 68,071.8     3
  mean                   67,404.1          1                    67,404.1
  extra for linear       24.5              1                    24.5
  extra for quadratic    643.2             1                    643.2
Residual                 SR = 43.2         7
  lack of fit            SL = 16.2         3                    5.40
  pure error             SE = 27.0         4                    6.75
Total                    ST = 68,115.0     10

Lack of fit ratio = 5.40 / 6.75 = 0.80
Polynomial Regression In Excel

Create additional input columns for each input (here, an x^2 column)
Use the Data Analysis and Regression tool

x    x^2    y
10   100    73
10   100    78
15   225    85
20   400    90
20   400    91
25   625    87
25   625    86
25   625    91
30   900    75
35   1225   65

Regression Statistics
Multiple R          0.968
R Square            0.936
Adjusted R Square   0.918
Standard Error      2.541
Observations        10

ANOVA
             df   SS        MS        F        Significance F
Regression   2    665.706   332.853   51.555   6.48E-05
Residual     7    45.194    6.456
Total        9    710.9

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   35.657         5.618            6.347    0.0004    22.373      48.942
x           5.263          0.558            9.431    3.1E-05   3.943       6.582
x^2         -0.128         0.013            -9.966   2.2E-05   -0.158      -0.097
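The same quadratic fit can be sketched without a spreadsheet. The code below builds the columns 1, x, x^2, forms the normal equations, and solves them by Gaussian elimination. It is a minimal standard-library sketch, not how Excel computes the result; the function names are illustrative assumptions, and a numerical library would normally use a more stable factorization.

```python
# Growth-rate data from the table above.
x = [10, 10, 15, 20, 20, 25, 25, 25, 30, 35]
y = [73, 78, 85, 90, 91, 87, 86, 91, 75, 65]


def solve_linear_system(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's lists are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef


def polyfit_normal_equations(x, y, degree):
    """Least-squares polynomial coefficients [b0, b1, ...] via X'X b = X'y."""
    columns = [[xi ** p for xi in x] for p in range(degree + 1)]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns]
           for ci in columns]
    xty = [sum(u * yi for u, yi in zip(ci, y)) for ci in columns]
    return solve_linear_system(xtx, xty)


b0, b1, b2 = polyfit_normal_equations(x, y, degree=2)
ss_resid = sum((yi - (b0 + b1 * xi + b2 * xi ** 2)) ** 2 for xi, yi in zip(x, y))
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.4f}, residual SS = {ss_resid:.3f}")
```

The coefficients and residual sum of squares should agree with the Excel output above to the precision it reports.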
Polynomial Regression
Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      2    665.70617        332.853       51.5551   <.0001
Error      7    45.19383         6.456
C. Total   9    710.90000

Lack Of Fit
Source        DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Lack Of Fit   3    18.193829        6.0646        0.8985    0.5157
Pure Error    4    27.000000        6.7500
Total Error   7    45.193829
Max RSq       0.9620

Summary of Fit
RSquare                      0.936427
RSquare Adj                  0.918264
Root Mean Sq Error           2.540917
Mean of Response             82.1
Observations (or Sum Wgts)   10

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   35.657437   5.617927    6.35      0.0004
x           5.2628956   0.558022    9.43      <.0001
x*x         -0.127674   0.012811    -9.97     <.0001

Effect Tests
Source   Nparm   DF   Sum of Squares   F Ratio   Prob > F
x        1       1    574.28553        88.9502   <.0001
x*x      1       1    641.20451        99.3151   <.0001

Generated using JMP package
Summary
- Comparison of Treatments: ANOVA
- Multivariate Analysis of Variance
- Regression Modeling
Next Time
- Time Series Models
- Forecasting
MIT OpenCourseWare http://ocw.mit.edu
2.854 / 2.853 Introduction to Manufacturing Systems

Fall 2010
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
