
What is a general linear model?

Use General Linear Model to determine whether the means of two or more groups
differ. You can include random factors, covariates, or a mix of crossed and nested
factors. You can also use stepwise regression to help determine the model. You can
then use the model to predict values for new observations, identify the combination of
predictor values that jointly optimize one or more fitted values, and create surface plots,
contour plots, and factorial plots.
GLM is an ANOVA procedure in which the calculations are performed using a least
squares regression approach to describe the statistical relationship between one or
more predictors and a continuous response variable. Predictors can be factors and
covariates. GLM codes factor levels as indicator variables using a 1, 0, -1 coding
scheme, although you can choose to change this to a binary coding scheme (0, 1).
Factors may be crossed or nested, fixed or random. Covariates may be crossed with
each other or with factors, or nested within factors. The design may be balanced or
unbalanced. GLM can perform multiple comparisons between factor level means to find
significant differences.
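As a sketch of the least squares regression approach described above, the following Python snippet (made-up data, not Minitab output) builds the 1, 0, -1 indicator coding for a three-level factor and fits it by least squares; the recovered level effects are deviations from the overall mean, and the baseline level's effect is the negated sum of the displayed coefficients.

```python
import numpy as np

# Illustrative effect coding (1, 0, -1) for a 3-level factor; data are made up.
levels = np.array([1, 1, 2, 2, 3, 3])                     # factor level per observation
y = np.array([10.0, 10.0, 12.0, 12.0, 17.0, 17.0])        # response

# Effect coding: level 1 -> (1, 0), level 2 -> (0, 1), level 3 -> (-1, -1)
codes = {1: (1, 0), 2: (0, 1), 3: (-1, -1)}
X = np.array([(1,) + codes[lv] for lv in levels], dtype=float)  # intercept + 2 columns

# Least squares fit, as in GLM's regression approach
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, c1, c2 = beta
print(intercept)      # overall mean (13.0 here, balanced data)
print(c1, c2)         # level 1 and 2 effects relative to the overall mean
print(-(c1 + c2))     # implied effect of the baseline level 3
```

With balanced data, the intercept equals the grand mean and each coefficient is that level's deviation from it, which is why the baseline level's effect can be recovered by negating the sum of the displayed coefficients.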
Example of a general linear model
Suppose you are studying the effect of an additive (factor with three levels) and
temperature (covariate) on the coating thickness of your product. You collect your data
and fit a general linear model. The following output is a portion of the results from
Minitab:

Factor Information

Factor Type Levels Values
Additive fixed 3 1, 2, 3

Analysis of Variance

Source F P
Temperature 719.21 0.000
Additive 56.65 0.000
Additive*Temperature 69.94 0.000

Model Summary

S R-Sq R-Sq(adj) R-sq(pred)
19.1185 99.73% 99.61% 99.39%

Coefficients

Term Coef T P
Constant -4968 -25.97 0.000
Temperature 83.87 26.82 0.000
Additive*Temperature -0.2852 -22.83 0.000
Additive
1 -24.40 -5.52 0.000
2 -27.87 -6.30 0.000
Because the p-values are less than any reasonable alpha level, evidence exists that
your two predictors and their interaction have a significant effect on coating thickness.
In addition, your model explains 99.73% of the variance. The coefficient for the
covariate, temperature, indicates that the mean coating thickness increases by 83.87
units for each one-degree increase in temperature when all other predictors are held
constant. For the additive factor, the mean for level 1 is 24.40 units below the overall
mean, while the mean for level 2 is 27.87 units below the overall mean. Level 3 is the
baseline level, so it is not displayed. You can calculate the baseline factor level mean
by adding all the level coefficients for a factor (excluding the intercept) and multiplying
by -1. In this case, it is 52.27 ((-24.40 - 27.87) * -1) units above the overall mean.
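The baseline-level calculation above can be checked with a line of arithmetic:

```python
# Recover the baseline (level 3) effect from the displayed level coefficients:
# add the displayed coefficients and multiply by -1.
level_coefs = [-24.40, -27.87]
baseline_effect = -sum(level_coefs)
print(round(baseline_effect, 2))  # 52.27 units above the overall mean
```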
Perform a fully nested ANOVA
Use Fully Nested ANOVA to determine whether the means of two or more groups differ
when all the factors are nested. For example, compare the production rates of two
machines that each have unique operators. To perform ANOVA with nested factors in
Minitab you can use either Fully Nested ANOVA or General Linear Model. Fully Nested
ANOVA does not display F and p values when the data are unbalanced while General
Linear Model does.
The following options show how to perform a fully nested ANOVA with both methods,
using an example. Suppose you want to understand the sources of variability in the
manufacture of glass jars. You do an experiment and measure furnace temperature
three times during a work shift for each of four operators from each plant on four
different shifts. Use the Minitab sample data set FURNTEMP.MTW, where Temp is the
response and the four nested factors are Plant, Operator, Shift, and Batch.
IN THIS TOPIC
Option 1: Use Fully Nested ANOVA
Option 2: Use General Linear Model
Option 1: Use Fully Nested ANOVA
1. Choose Stat > ANOVA > Fully Nested ANOVA.
2. In Responses, enter Temp.
3. In Factors, enter Plant Operator Shift Batch. Click OK.
Note
In step 3, the factors are listed in hierarchical order.
Option 2: Use General Linear Model
1. Choose Stat > ANOVA > General Linear Model > Fit General Linear Model.
2. In Responses, enter Temp.
3. In Factors, enter Plant Operator Shift Batch.
4. Click Random/Nest. Complete the Nesting Table as follows:
Factor/Covariate Nested in specified factors
Plant
Operator Plant
Shift Operator
Batch Shift
5. In the Factor type table, change all of the factors to Random.
6. Click OK in each dialog box.
The general linear model output will not be identical to the fully nested ANOVA output.
What is analysis of means?
Analysis of means is a graphical analog to ANOVA that tests the equality of population
means. The graph displays each factor level mean, the overall mean, and the decision
limits. If a point falls outside the decision limits, then evidence exists that the factor level
mean represented by that point is significantly different from the overall mean.
For example, you are investigating how temperature and additive settings affect the
rating of your product. After your experiment, you use ANOM to generate the following
graph.
The top plot shows that the interaction effects are well within the decision limits,
signifying no evidence of interaction. The lower two plots show the means for the levels
of the two factors, with the main effect being the difference between the mean and the
center line. In the lower left plot, the point representing the third mean of the factor
Temperature is displayed by a red symbol, indicating that there is evidence that the
Temperature 200 mean is significantly different from the overall mean at α = 0.05. The
main effects for levels 1 and 3 of the Additive factor are well outside the decision limits
of the lower right plot, signifying that there is evidence that these means are different
from the overall mean.
Comparison of ANOM (analysis of means) and ANOVA
ANOVA tests whether the treatment means differ from each other. ANOM tests whether
the treatment means differ from the overall mean (also called grand mean).
Often, both analyses yield similar results. However, there are some scenarios in which
the results can differ:
If one group of means is above the overall mean and a different group of means is below the
overall mean, ANOVA might indicate evidence for differences where ANOM might not.
If the mean of one group is separated from the other means, the ANOVA F-test might not
indicate evidence for differences whereas ANOM might flag this group as being different
from the overall mean.
One more important difference is that ANOVA assumes that your data follow a normal
distribution, while ANOM can be used with data that follows a normal, binomial, or
Poisson distribution.
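The decision-limit idea behind ANOM can be sketched in a few lines. Exact ANOM uses special tabulated critical values (h); the sketch below (made-up data, not a Minitab algorithm) substitutes a Bonferroni-adjusted t quantile as a rough approximation, and simply flags level means that fall outside the limits.

```python
import numpy as np
from scipy import stats

def anom_flags(groups, alpha=0.05):
    """Rough one-way ANOM sketch: flag level means outside the decision limits.

    Exact ANOM uses tabulated h critical values; a Bonferroni-adjusted t
    quantile is used here as an approximate stand-in.
    """
    k = len(groups)
    data = [np.asarray(g, dtype=float) for g in groups.values()]
    n = len(data[0])                                      # assumes equal group sizes
    grand = np.mean([g.mean() for g in data])             # overall mean (center line)
    s = np.sqrt(np.mean([g.var(ddof=1) for g in data]))   # pooled within-group s
    df = k * (n - 1)
    t = stats.t.ppf(1 - alpha / (2 * k), df)              # approximation of h
    margin = t * s * np.sqrt((k - 1) / (k * n))           # half-width of decision limits
    return {name: abs(np.mean(g) - grand) > margin for name, g in groups.items()}

# Made-up measurements: groups A and B sit far from the overall mean, C does not
groups = {"A": [4, 6] * 5, "B": [7, 9] * 5, "C": [5.5, 7.5] * 5}
flags = anom_flags(groups)
print(flags)
```

A flagged (True) level corresponds to a point plotted outside the decision limits on the ANOM chart.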
What is MANOVA (multivariate analysis of variance)?
A test that analyzes the relationship between several response variables and a common
set of predictors at the same time. Like ANOVA, MANOVA requires continuous
response variables and categorical predictors. MANOVA has several important
advantages over doing multiple ANOVAs, one response variable at a time.
Increased power
You can use the covariance structure of the data between the response variables to test the
equality of means at the same time. If the response variables are correlated, then this
additional information can help detect differences too small to be detected through individual
ANOVAs.
Detects multivariate response patterns
The factors may affect the relationship between responses instead of affecting a single
response. ANOVAs will not detect these multivariate patterns as the following figures show.
Controls the family error rate
Your chance of incorrectly rejecting the null hypothesis increases with each successive
ANOVA. Doing one MANOVA to test all response variables at the same time keeps the
family error rate equal to your alpha level.
For example, you are studying the effects of different alloys (1, 2, and 3) on the
strength and flexibility of your company's building products. You first perform two
separate ANOVAs but the results are not significant. Surprised, you plot the raw
data for both response variables using individual value plots. These plots visually
confirm the insignificant ANOVA results.

Because the response variables are correlated, you perform a MANOVA. This
time the results are significant with p-values less than 0.05. You create a
scatterplot to better understand the results.

The individual value plots show, from a univariate perspective, that the alloys do
not significantly affect either strength or flexibility. However, the scatterplot of the
same data shows that the different alloys change the relationship between the
two response variables. That is, for a specified flexibility score, Alloy 3 usually
has a higher strength score than Alloys 1 and 2. MANOVA can detect this type of
multivariate response whereas ANOVA cannot.
Note
Usually, you should graph the data before conducting any analyses because it
will help you decide what approach is appropriate.
Which multivariate tests are included in MANOVA?
Minitab automatically performs four multivariate tests for each term in the model and
for specially requested terms:
Wilks' test
Lawley-Hotelling test
Pillai's test
Roy's largest root test
All four tests are based on two SSCP (sums of squares and cross products)
matrices:
An H (hypothesis) matrix associated with each term; also called between sample
sums of squares
An E (error) matrix associated with the error for the test; also called within sample
sums of squares
The SSCP matrices are displayed when you request the hypothesis matrices. You can
express the test statistics in terms of H and E, or as the eigenvalues of E⁻¹H. You can
request to have these eigenvalues displayed. (If the eigenvalues are repeated, the
corresponding eigenvectors are not unique, and in this case the eigenvectors that
Minitab displays and those in books or other software may not agree. The MANOVA
tests, however, are always unique.)
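All four statistics are simple functions of the eigenvalues of E⁻¹H, which the following sketch computes for a pair of hypothetical 2×2 SSCP matrices (the matrices are made up for illustration).

```python
import numpy as np

# Hypothetical 2x2 SSCP matrices for illustration
H = np.array([[4.0, 0.0], [0.0, 1.0]])   # hypothesis (between) SSCP
E = np.array([[2.0, 0.0], [0.0, 2.0]])   # error (within) SSCP

# Eigenvalues of E^-1 H drive all four MANOVA test statistics
lam = np.linalg.eigvals(np.linalg.solve(E, H)).real

wilks  = np.prod(1.0 / (1.0 + lam))   # Wilks' lambda
lawley = np.sum(lam)                  # Lawley-Hotelling trace
pillai = np.sum(lam / (1.0 + lam))    # Pillai's trace
roy    = np.max(lam)                  # Roy's largest root
print(wilks, lawley, pillai, roy)
```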
Analyzing a repeated measures design
You can use Fit General Linear Model to analyze a repeated measures design in
Minitab. To use Fit General Linear Model, choose Stat > ANOVA > General Linear
Model > Fit General Linear Model.
In all cases, you must arrange the data in the Minitab worksheet so the response values
are in one column, subject IDs are in a different column, and each factor has its own
separate column.
The following examples show analyses of several different repeated measures designs.
You can find the data and more information on these examples in J. Neter, M.H. Kutner,
C.J. Nachtsheim, and W. Wasserman (1996). Applied Linear Statistical Models, 4th
edition. WCB/McGraw-Hill.
IN THIS TOPIC
Example of a single-factor experiment with repeated measures on all treatments
Example of a two-factor experiment with repeated measures on both factors
Example of a two-factor experiment with repeated measures on one factor
Example of a single-factor experiment with repeated measures on all
treatments
In this designed experiment each subject receives each treatment in succession. Create
three columns in the Minitab worksheet: one column for the measurements, one column
identifying which subject corresponds to that measurement, and one column identifying
the treatment applied to that subject. Each row represents a single measurement.
For more information, see page 1166, model 29.1 in Neter, Kutner, Nachtsheim, and
Wasserman (1996).
C1 C2 C3
Subject Dosage Measurement
A low 1.33
A medium 0.27
B medium 0.49
B low 0.99
C medium 0.41
C low 1.12
1. Choose Stat > ANOVA > General Linear Model > Fit General Linear Model.
2. In Responses, enter Measurement.
3. In Factors, enter Subject Dosage.
4. Click Random/Nest.
5. Under Factor type, choose Random in the field beside Subject.
6. Click OK in each dialog box.
Example of a two-factor experiment with repeated measures on both factors
In this designed experiment each subject is measured after receiving, successively,
every combination of the levels of the two factors A and B. For example, suppose there
are three subjects, and factors A and B each have two levels. For more information, see
page 1177, model 29.10 in Neter, Kutner, Nachtsheim, and Wasserman (1996). The
designed experiment continues as follows:

Treatment order: 1 2 3 4
Subject 1 A1B2 A2B2 A1B1 A2B1
Subject 2 A2B1 A1B2 A2B2 A1B1
Subject 3 A1B1 A2B1 A1B2 A2B2
1. Create four columns in the Minitab worksheet: one column for the measurements, one
column identifying which subject corresponds to that measurement, one column for Factor A,
and one column for Factor B.
C1 C2 C3 C4
Subject Temperature Fabric Measurement
A High Old 10.4
A High New 9.5
A Low New 7.6
A Low Old 6.9
B High New 9.1
B High Old 7.9
B Low New 10.0
B Low Old 8.1
2. Choose Stat > ANOVA > General Linear Model > Fit General Linear Model.
3. In Responses, enter Measurement.
4. In Factors, enter Subject Temperature Fabric.
5. Click Random/Nest.
6. Under Factor type, choose Random in the field beside Subject.
7. Click OK.
8. Click Model.
9. Use the dialog box to add interactions to the model. For example, to add the interaction
between Temperature and Fabric:
1. In the field under Factors and covariates, select both Temperature and Fabric.
2. Verify that 2 is selected beside Interactions through order.
3. Click Add beside the field that has 2 selected.
4. Click OK in each dialog box.
Example of a two-factor experiment with repeated measures on one factor
In this designed experiment each subject is measured after receiving, successively, all
levels of Factor B in combination with only one level of Factor A. For more information,
see page 1186, model 29.16 in Neter, Kutner, Nachtsheim, and Wasserman (1996).
This designed experiment continues as follows:
Factor A Subject Treatment Order 1 Treatment Order 2
A1 1, ..., n A1B1, ..., A1B2 A1B2, ..., A1B1
A2 n+1, ..., 2n A2B2, ..., A2B1 A2B1, ..., A2B2
1. Create four columns in the Minitab worksheet: one column for the measurement, one column
identifying which subject corresponds to that measurement, one column for Factor A, and
one column for Factor B.
C1 C2 C3 C4
Subject Temperature Fabric Measurement
A High Old 1.1
A High New 2.2
B High New 1.9
B High Old 1.2
C1 C2 C3 C4
Subject Temperature Fabric Measurement
C Low Old 0.8
C Low New 1.1
D Low Old 0.9
D Low New 1.3
2. Choose Stat > ANOVA > General Linear Model > Fit General Linear Model.
3. In Responses, enter Measurement.
4. In Factors, enter Subject Temperature Fabric.
5. Click Random/Nest.
6. Under Nesting in specified factors, enter Temperature beside Subject.
7. Under Factor type, choose Random in the field beside Subject.
Note
If any factors besides Subject are random, choose Random for them too.
8. Click OK.
9. Click Model.
10. Use the dialog box to add interactions to the model. For example, to add the interaction
between Temperature and Fabric:
1. In the field under Factors and covariates, select both Temperature and Fabric.
2. Verify that 2 is selected beside Interactions through order.
3. Click Add beside the field that has 2 selected.
4. Click OK in each dialog box.
How variability can affect your ANOVA
The data sets in the following two individual value plots have exactly the same factor
level means. Therefore, the variability in the data because of the factor is the same for
both data sets. When you examine the plots, you might be tempted to conclude that the
means are different in both cases. Notice, however, that the variability within factor
levels is much greater in the second data set than in the first.
To assess the differences between means, you must compare these differences with
the spread of the observations about the means. This is exactly what an analysis of
variance does. Using analysis of variance, the p-value corresponding to the first plot is
0.000, whereas the p-value corresponding to the second plot is 0.109.
Therefore, using an α of 0.05, the test says the means in the first data set are
significantly different. The differences in the sample means for the second data set,
however, could very well be a random result of the large overall variability in the data.
Plot with low variability
Plot with high variability
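This contrast can be reproduced in a few lines with scipy's one-way ANOVA. The two made-up data sets below have identical group means (10, 11, and 12) but very different within-group spread, and only the tight data set yields a small p-value.

```python
from scipy import stats

# Two data sets with identical group means (10, 11, 12) but different spread
low_var = ([9.9, 10.1], [10.9, 11.1], [11.9, 12.1])   # tight within-group spread
high_var = ([5.0, 15.0], [6.0, 16.0], [7.0, 17.0])    # wide within-group spread

_, p_low = stats.f_oneway(*low_var)
_, p_high = stats.f_oneway(*high_var)
print(p_low, p_high)   # small p-value for the tight data, large for the wide data
```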
Perform a two-way ANOVA
To perform a two-way ANOVA in Minitab, use Stat > ANOVA > General Linear Model >
Fit General Linear Model. Suppose your response is called A and your factors are B
and C.
1. Choose Stat > ANOVA > General Linear Model > Fit General Linear Model.
2. In Responses, enter A.
3. In Factors, enter B C.
4. Click Model.
5. In Factors and covariates, select both B and C. To the right of Interactions through order,
choose 2 and click Add.
6. Click OK in each dialog box.
What is a main effects plot?
Use a main effects plot to examine differences between level means for one or more
factors. There is a main effect when different levels of a factor affect the response
differently. A main effects plot graphs the response mean for each factor level
connected by a line.
When you choose Stat > ANOVA > Main Effects Plot, Minitab creates a plot that uses
data means. After you have fit a model, you can use the stored model to generate plots
that use fitted means.
Example
For example, fertilizer company B compares the growth rate of plants treated with its
product to the growth rate of plants treated with company A's fertilizer. The company
tested the two fertilizers in two locations. The following are the main effects plots of
these two factors.

Fertilizer seems to affect the plant growth rate because the line is not horizontal.
Fertilizer B has a higher plant growth rate mean than fertilizer A. Location also affects
the plant growth rate. Location 1 has a higher plant growth rate mean than location 2.
The reference line represents the overall mean.
General patterns to look for:
When the line is horizontal (parallel to the x-axis), then there is no main effect. Each level of
the factor affects the response in the same way, and the response mean is the same across
all factor levels.
When the line is not horizontal, then there is a main effect. Different levels of the factor affect
the response differently. The steeper the slope of the line, the greater the magnitude of the
main effect.
Main effects plots will not show interactions. To view interactions between factors, use
an interaction plot.
Important
To determine whether a pattern is statistically significant, you must do an appropriate
test.
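A main effects plot is built from the level means and the overall mean, which can be computed directly; the sketch below uses made-up growth-rate data for the fertilizer example.

```python
import numpy as np

# Made-up growth-rate data for illustration (fertilizers A and B)
fertilizer = np.array(["A", "A", "B", "B"])
growth = np.array([10.0, 12.0, 15.0, 17.0])

overall_mean = growth.mean()
level_means = {lv: growth[fertilizer == lv].mean() for lv in np.unique(fertilizer)}
print(overall_mean)   # the reference line of the plot
print(level_means)    # the plotted points; unequal level means suggest a main effect
```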
What is sum of squares?
The sum of squares represents a measure of variation or deviation from the mean. It is
calculated as a summation of the squares of the differences from the mean. The
calculation of the total sum of squares considers both the sum of squares from the
factors and from randomness or error.
Sum of squares in ANOVA
In analysis of variance (ANOVA), the total sum of squares helps express the total
variation that can be attributed to various factors. For example, you do an experiment to
test the effectiveness of three laundry detergents.
The total sum of squares = treatment sum of squares (SST) + sum of squares of the
residual error (SSE)
The treatment sum of squares is the variation attributed to, or in this case between, the
laundry detergents. The sum of squares of the residual error is the variation attributed to
the error.
Converting the sum of squares into mean squares by dividing by the degrees of
freedom lets you compare these ratios and determine whether there is a significant
difference due to detergent. The larger this ratio is, the more the treatments affect the
outcome.
Sum of squares in regression
In regression, the total sum of squares helps express the total variation of the y's. For
example, you collect data to determine a model explaining overall sales as a function of
your advertising budget.
The total sum of squares = regression sum of squares (SSR) + sum of squares of the
residual error (SSE)

The regression sum of squares is the variation attributed to the relationship between the
x's and y's, or in this case between the advertising budget and your sales. The sum of
squares of the residual error is the variation attributed to the error.
By comparing the regression sum of squares to the total sum of squares, you determine
the proportion of the total variation that is explained by the regression model (R², the
coefficient of determination). The larger this value is, the better the relationship
explaining sales as a function of advertising budget.
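The decomposition above can be sketched with a simple straight-line fit on made-up sales data: SSR/SST equals 1 − SSE/SST, which is R².

```python
import numpy as np

# Hypothetical sales vs. advertising-budget data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # advertising budget
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])    # sales

slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept

sst = np.sum((y - y.mean()) ** 2)           # total sum of squares
sse = np.sum((y - fitted) ** 2)             # residual (error) sum of squares
ssr = np.sum((fitted - y.mean()) ** 2)      # regression sum of squares

r_squared = ssr / sst
print(r_squared)                            # equals 1 - sse/sst for a least squares fit
```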
Comparison of sequential sums of squares and adjusted sums of squares
Minitab breaks down the SS Regression or Treatments component of variance into
sums of squares for each factor.
Sequential sums of squares
Sequential sums of squares depend on the order the factors are entered into the model. The
sequential sum of squares for a factor is the unique portion of SS Regression explained by
that factor, given any previously entered factors.
For example, if you have a model with three factors, X1, X2, and X3, the sequential sums of
squares for X2 shows how much of the remaining variation X2 explains, given that X1 is
already in the model. To obtain a different sequence of factors, repeat the regression
procedure entering the factors in a different order.
Adjusted sums of squares
Adjusted sums of squares do not depend on the order the factors are entered into the
model. The adjusted sum of squares for a factor is the unique portion of SS Regression
explained by that factor, given all other factors in the model, regardless of the order they
were entered into the model.
For example, if you have a model with three factors, X1, X2, and X3, the adjusted sum of
squares for X2 shows how much of the remaining variation X2 explains, given that X1 and
X3 are also in the model.
When will the sequential and adjusted sums of squares be the same?
The sequential and adjusted sums of squares are always the same for the last term
in the model. For example, if your model contains the terms A, B, and C (in that
order), then both sums of squares for C represent the reduction in the sum of
squares of the residual error that occurs when C is added to a model containing
both A and B.
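The last-term property can be checked numerically by expressing each sum of squares as a reduction in residual sum of squares. The sketch below (made-up data, plain least squares) computes a sequential SS for B under two entry orders and its adjusted SS; when B is entered last, the two coincide.

```python
import numpy as np

def rss(y, cols):
    """Residual sum of squares after regressing y on an intercept plus cols."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

rng = np.random.default_rng(1)
a, b, c = rng.normal(size=30), rng.normal(size=30), rng.normal(size=30)
y = 1.0 + 2.0 * a + 0.5 * b - 1.0 * c + rng.normal(size=30)

seq_b_middle = rss(y, [a]) - rss(y, [a, b])    # B entered second, order A, B, C
seq_b_last = rss(y, [a, c]) - rss(y, [a, c, b])  # B entered last, order A, C, B
adj_b = rss(y, [a, c]) - rss(y, [a, b, c])     # B adjusted for all other terms
print(seq_b_last, adj_b)  # equal: sequential SS matches adjusted SS for the last term
```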
The sequential and adjusted sums of squares will be the same for all terms if the
design matrix is orthogonal. The most common case where this occurs is with
factorial and fractional factorial designs (with no covariates) when analyzed in
coded units. In these designs, the columns in the design matrix for all main effects
and interactions are orthogonal to each other. Plackett-Burman designs have
orthogonal columns for main effects (usually the only terms in the model) but
interactions terms, if any, may be partially confounded with other terms (that is, not
orthogonal). In response surface designs, the columns for squared terms are not
orthogonal to each other.
For any design, if the design matrix is in uncoded units then there may be columns
that are not orthogonal unless the factor levels are still centered at zero.
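The orthogonality of a factorial design matrix in coded units is easy to verify: if XᵀX is diagonal, all columns are mutually orthogonal and the sequential and adjusted sums of squares agree. A minimal sketch for a 2×2 factorial:

```python
import numpy as np

# Design matrix for a 2x2 factorial in coded units (-1, +1)
A = np.array([-1, -1, 1, 1], dtype=float)
B = np.array([-1, 1, -1, 1], dtype=float)
AB = A * B                                  # interaction column

X = np.column_stack([np.ones(4), A, B, AB])
print(X.T @ X)   # diagonal matrix: intercept, main effects, and interaction are orthogonal
```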
Can the adjusted sums of squares be less than, equal to, or greater than the
sequential sums of squares?
The adjusted sums of squares can be less than, equal to, or greater than the
sequential sums of squares.
Suppose you fit a model with terms A, B, C, and A*B. Let SS (A,B,C, A*B) be the
sum of squares when A, B, C, and A*B are in the model. Let SS (A, B, C) be the
sum of squares when A, B, and C are included in the model. Then, the adjusted
sum of squares for A*B, is:
SS(A, B, C, A*B) - SS(A, B, C)
However, with the same terms A, B, C, A*B in the model, the sequential sums of
squares for A*B depends on the order the terms are specified in the model.
Using similar notation, if the order is A, B, A*B, C, then the sequential sums of
squares for A*B is:
SS(A, B, A*B) - SS(A, B)
Depending on the data set and the order in which the terms are entered, all the
following cases are possible:
SS(A, B, C, A*B) - SS(A, B, C) < SS(A, B, A*B) - SS(A, B), or
SS(A, B, C, A*B) - SS(A, B, C) = SS(A, B, A*B) - SS(A, B), or
SS(A, B, C, A*B) - SS(A, B, C) > SS(A, B, A*B) - SS(A, B)
What is uncorrected sum of squares?
Squares each value in the column, and calculates the sum of those squared
values. That is, if the column contains x1, x2, ..., xn, then the sum of squares
calculates (x1² + x2² + ... + xn²). Unlike the corrected sum of squares, the uncorrected
sum of squares includes error. The data values are squared without first subtracting
the mean.
In Minitab, you can use descriptive statistics to display the uncorrected sum of
squares (choose Stat > Basic Statistics > Display Descriptive Statistics). You can
also use the sum of squares (SSQ) function in the Calculator to calculate the
uncorrected sum of squares for a column or row. For example, you are calculating
a formula manually and you want to obtain the sum of the squares for a set of
response (y) variables.
Choose Calc > Calculator and enter the expression: SSQ (C1)
Store the results in C2 to see the sum of the squares, uncorrected. The following
worksheet shows the results from using the calculator to calculate the sum of
squares of column y.
C1 C2
y Sum of Squares
2.40 41.5304
4.60
2.50
1.60
2.20
0.98
Note
Minitab omits missing values from the calculation of this function.
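The SSQ result above can be reproduced directly, squaring each value without subtracting the mean:

```python
import numpy as np

y = np.array([2.40, 4.60, 2.50, 1.60, 2.20, 0.98])
uncorrected_ss = np.sum(y ** 2)     # squares each value, no mean subtraction
print(round(uncorrected_ss, 4))     # 41.5304, matching the worksheet result above
```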
What are mean squares?
Mean squares represent an estimate of population variance. A mean square is
calculated by dividing the corresponding sum of squares by the degrees of freedom.
Regression
In regression, mean squares are used to determine whether terms in the model are
significant.
The term mean square is obtained by dividing the term sum of squares by the degrees of
freedom.
The mean square of the error (MSE) is obtained by dividing the sum of squares of the
residual error by the degrees of freedom. The MSE is the variance (s²) around the fitted
regression line.
Dividing the MS (term) by the MSE gives F, which follows the F-distribution with
degrees of freedom for the term and degrees of freedom for error.
ANOVA
In ANOVA, mean squares are used to determine whether factors (treatments) are
significant.
The treatment mean square is obtained by dividing the treatment sum of squares by the
degrees of freedom. The treatment mean square represents the variation between the
sample means.
The mean square of the error (MSE) is obtained by dividing the sum of squares of the
residual error by the degrees of freedom. The MSE represents the variation within the
samples.
For example, you do an experiment to test the effectiveness of three laundry
detergents. You collect 20 observations for each detergent. The variation in means
between Detergent 1, Detergent 2, and Detergent 3 is represented by the treatment
mean square. The variation within the samples is represented by the mean square of
the error.
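The detergent example can be sketched with standard one-way ANOVA formulas (made-up data, fewer observations than the 20 described above); the hand-computed F agrees with scipy's.

```python
import numpy as np
from scipy import stats

# Hypothetical detergent measurements, for illustration only
d1 = np.array([5.0, 6.0, 7.0, 6.0])
d2 = np.array([8.0, 9.0, 8.0, 9.0])
d3 = np.array([4.0, 5.0, 4.0, 5.0])
groups = [d1, d2, d3]

k = len(groups)
n = sum(len(g) for g in groups)
grand = np.concatenate(groups).mean()

ss_treatment = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_error = sum(np.sum((g - g.mean()) ** 2) for g in groups)

ms_treatment = ss_treatment / (k - 1)   # variation between the sample means
ms_error = ss_error / (n - k)           # variation within the samples
F = ms_treatment / ms_error

print(F, stats.f_oneway(d1, d2, d3).statistic)  # the two F values agree
```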
What are adjusted mean squares?
Adjusted mean squares are calculated by dividing the adjusted sum of squares by the
degrees of freedom. The adjusted sum of squares does not depend on the order the
factors are entered into the model. It is the unique portion of SS Regression explained
by a factor, assuming all other factors in the model, regardless of the order they were
entered into the model.
For example, if you have a model with three factors, X1, X2, and X3, the adjusted sum
of squares for X2 shows how much of the remaining variation X2 explains, assuming
that X1 and X3 are also in the model.
What are expected mean squares?
If you do not specify any factors to be random, Minitab assumes that they are fixed. In
this case, the denominator for F-statistics will be the MSE. However, for models which
include random terms, the MSE is not always the correct error term. You can examine
the expected mean squares to determine the error term that was used in the F-test.
When you perform General Linear Model, Minitab displays a table of expected mean
squares, estimated variance components, and the error term (the denominator mean
squares) used in each F-test by default. The expected mean squares are the expected
values of these terms with the specified model. If there is no exact F-test for a term,
Minitab solves for the appropriate error term in order to construct an approximate F-test.
This test is called a synthesized test.
The estimates of variance components are the unbiased ANOVA estimates. They are
obtained by setting each calculated mean square equal to its expected mean square,
which gives a system of linear equations in the unknown variance components that is
then solved. Unfortunately, this approach can cause negative estimates, which should
be set to zero. Minitab, however, displays the negative estimates because they
sometimes indicate that the model being fit is inappropriate for the data. Variance
components are not estimated for fixed terms.
How the F-statistics in the ANOVA output are calculated
Each F-statistic is a ratio of mean squares. The numerator is the mean square for the
term. The denominator is chosen such that the expected value of the numerator mean
square differs from the expected value of the denominator mean square only by the
effect of interest. The effect for a random term is represented by the variance
component of the term. The effect for a fixed term is represented by the sum of squares
of the model components associated with that term divided by its degrees of freedom.
Therefore, a high F-statistic indicates a significant effect.
When all the terms in the model are fixed, the denominator for each F-statistic is the
mean square of the error (MSE). However, for models that include random terms, the
MSE is not always the correct mean square. The expected mean squares (EMS) can be
used to determine which is appropriate for the denominator.
Example
Suppose you performed an ANOVA with the fixed factor Screen and the random factor
Tech, and get the following output for the EMS:
Source Expected Mean Square for Each Term
(1) Screen (4) + 2.0000(3) + Q[1]
(2) Tech (4) + 2.0000(3) + 4.0000(2)
(3) Screen*Tech (4) + 2.0000(3)
(4) Error (4)
A number with parentheses indicates a random effect associated with the term listed
beside the source number. (2) represents the random effect of Tech, (3) represents the
random effect of the Screen*Tech interaction, and (4) represents the random effect of
Error. The EMS for Error is the effect of the error term. In addition, the EMS for
Screen*Tech is the effect of the error term plus two times the effect of the Screen*Tech
interaction.
To calculate the F-statistic for Screen*Tech, the mean square for Screen*Tech is
divided by the mean square of the error so that the expected value of the numerator
(EMS for Screen*Tech = (4) + 2.0000(3)) differs from the expected value of the
denominator (EMS for Error = (4)) only by the effect of the interaction (2.0000(3)).
Therefore, a high F-statistic indicates a significant Screen*Tech interaction.
A number with Q[ ] indicates the fixed effect associated with the term listed beside the
source number. For example, Q[1] is the fixed effect of Screen. The EMS for Screen is
the effect of the error term plus two times the effect of the Screen*Tech interaction plus
a constant times the effect of Screen. Q[1] equals (b*n * sum((coefficients for levels of
Screen)**2)) divided by (a - 1), where a and b are the number of levels of Screen and
Tech, respectively, and n is the number of replicates.
To calculate the F-statistic for Screen, the mean square for Screen is divided by the
mean square for Screen*Tech so that the expected value of the numerator (EMS for
Screen = (4) + 2.0000(3) + Q[1] ) differs from the expected value of the denominator
(EMS for Screen*Tech = (4) + 2.0000(3) ) only by the effect due to the Screen (Q[1]).
Therefore, a high F-statistic indicates a significant Screen effect.
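The F-test for Screen described above uses MS(Screen*Tech), not MSE, as its denominator. A minimal sketch with made-up mean squares and degrees of freedom (not taken from the output above):

```python
from scipy import stats

# Hypothetical mean squares and degrees of freedom for the Screen example;
# these values are made up for illustration only.
ms_screen, df_screen = 120.0, 1
ms_interaction, df_interaction = 8.0, 1

# Per the EMS table, the denominator for Screen is MS(Screen*Tech),
# not MSE, because Tech is random.
F = ms_screen / ms_interaction
p = stats.f.sf(F, df_screen, df_interaction)
print(F, p)
```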
Why does my ANOVA output include an "x" beside a p-value in the
ANOVA table and the label "Not an exact F-test"?
An exact F-test for a term is one in which the expected value of the numerator mean
squares differs from the expected value of the denominator mean squares only by the
variance component or the fixed factor of interest.
Sometimes, however, such a mean square cannot be calculated. In this case, Minitab
uses a mean square that results in an approximate F-test and displays an "x" beside the
p-value to identify that the F-test is not exact.
For example, suppose you performed an ANOVA with the fixed factor Supplement and
the random factor Lake, and got the following output for the expected mean squares
(EMS):
Source Expected Mean Square for Each Term
(1) Supplement (4) + 1.7500(3) + Q[1]
(2) Lake (4) + 1.7143(3) + 5.1429(2)
(3) Supplement*Lake (4) + 1.7500(3)
(4) Error (4)
The F-statistic for Supplement is the mean square for Supplement divided by the mean
square for the Supplement*Lake interaction. If the effect for Supplement is very small,
the expected value of the numerator equals the expected value of the denominator. This
is an example of an exact F-test.
Notice, however, that for a very small Lake effect, there are no mean squares such that
the expected value of the numerator equals the expected value of the denominator.
Therefore, Minitab uses an approximate F-test. In this example, the mean square for
Lake is divided by the mean square for the Supplement*Lake interaction. This results in
an expected value of the numerator being approximately equal to that of the
denominator if the Lake effect is very small.
Return to top
About the "Denominator of F-test is zero or undefined" message
Minitab will display an error that the denominator of the F-test is zero or undefined for
one of the following reasons:
There is not at least one degree of freedom for error.
The adjusted MS values are very small, and thus there is not enough precision to display the
F- and p-values. As a workaround, multiply the response column by 10. Then fit the same
model, but use this new column as the response.
Note
Multiplying the response values by 10 will not affect the F- and p-values that Minitab
displays in the output. However, the decimal position will be affected in the remaining
output, specifically in the sequential sums of squares, Adj SS, Adj MS, Fit, standard
error of the fits, and residual columns.
What is a variance component?
Use variance components to assess sources of variation.
Variance components assess the amount of variation in the response because of
random factors. Random factors have levels that are selected at random, whereas fixed
factors have levels that are the only levels of interest. For example, you do a study on
the effect of two levels of pressure on output measured by randomly chosen operators.
Pressure is fixed (two levels) and operator is random. The variance components output
lists the estimated variance for the operator and error term.
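For a design with one random factor, the ANOVA method estimates the variance component from the mean squares. A minimal sketch, assuming illustrative mean squares and five observations per operator:

```python
# ANOVA-method estimate of a variance component for a random factor.
# The mean squares and replicate count below are illustrative assumptions.
ms_operator = 21.4   # MS for the random Operator term
ms_error = 4.6       # MS for Error
n = 5                # observations per operator

var_operator = (ms_operator - ms_error) / n   # variance due to Operator
var_error = ms_error                          # residual (error) variance

print(round(var_operator, 2))  # 3.36
```

When ms_operator is smaller than ms_error, this formula produces a negative estimate, which is the situation discussed in the next section.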
Interpret a negative variance component
The following are possible ways to deal with negative estimates of variance
components:
Accept the estimate as evidence of a true value of zero and use zero as the estimate,
recognizing that the estimator will no longer be unbiased.
Retain the negative estimate, recognizing that subsequent calculations using the results
might not make much sense.
Interpret that the negative component estimate indicates an incorrect statistical model.
Employ a method different from ANOVA for estimating the variance components.
Collect more data and analyze them separately or in conjunction with the existing data and
hope that increased information will yield positive estimates.
Comparison of data means and fitted means
Data means are the raw response variable means for each factor level
combination whereas fitted means use least squares to predict the mean
response values of a balanced design. Therefore, the two types of means are
identical for balanced designs but can be different for unbalanced designs.
Fitted means are useful for assessing response differences due to changes in
factor levels rather than differences due to the unbalanced experimental
conditions. While you can use raw data with unbalanced designs to obtain a
general idea of which main effects may be evident, it is generally good practice to
use the fitted means to obtain more precise results.
Example of data means and fitted means
For example, you are investigating how time and temperature affect the yield of a
chemical reaction. The two factors each have two levels producing four
experimental conditions. This is an exaggerated unbalanced experiment to
emphasize the difference between the two types of means. All experimental
conditions are measured twice except for the time and temperature combination
of 50 and 200 which is measured four times. The following tables summarize the
designed experiment and results.
Number of Observations per Experimental Condition
         Temp 150   Temp 200
Time 20      2          2
Time 50      2          4
Means by Factor Level
          Data Means   Fitted Means
Time 20      44.01        44.03
Time 50      47.63        47.02
Temp 150     44.13        44.14
Temp 200     47.55        46.90
The "Time 20" and "Temp 150" data means and fitted means are virtually
identical because all experimental conditions involving either one or both of these
factor levels are measured exactly twice (top table). However, the combination
"Time 50" and "Temp 200" is measured four times, which overrepresents its effect
in the raw data means. The fitted means adjust for this imbalance and predict what a
balanced design would yield.
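The adjustment can be sketched with hypothetical cell means for a 2x2 layout like the one above (the numbers below are illustrative, not the data behind the table): the data mean weights each cell by its observation count, while the fitted mean averages the cell means equally.

```python
import numpy as np

# Hypothetical cell means and counts: rows = Time (20, 50),
# columns = Temp (150, 200). Values are illustrative assumptions.
cell_means = np.array([[42.0, 46.0],
                       [44.5, 49.5]])
cell_counts = np.array([[2, 2],
                        [2, 4]])

# Data (raw) mean for Time 50: cell means weighted by observation counts.
data_mean_time50 = (cell_means[1] * cell_counts[1]).sum() / cell_counts[1].sum()

# Fitted (least squares) mean for Time 50: unweighted average of the cell
# means, which is what a balanced design would produce.
fitted_mean_time50 = cell_means[1].mean()

print(round(data_mean_time50, 2))  # 47.83
print(fitted_mean_time50)          # 47.0
```

The over-sampled high-yield cell pulls the raw mean above the fitted mean, mirroring the pattern in the table above.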
What is the overall mean (also called grand mean)?
The overall mean is the mean of all observations, as opposed to the mean of
individual groups. In ANOVA, for example, this is the mean of all observations
across factor levels, as opposed to the means of individual levels.
Here is a main effects plot comparing absenteeism at each of four different local
schools. Each dot on the plot identifies the mean absenteeism of students from a
different school. The line drawn across the plot (called the center line) marks the
overall mean: the mean absenteeism of students from all schools.
What is the variance-covariance matrix?
A variance-covariance matrix is a square matrix that contains the variances and
covariances associated with several variables. The diagonal elements of the
matrix contain the variances of the variables and the off-diagonal elements
contain the covariances between all possible pairs of variables.
For example, you create a variance-covariance matrix for three variables X, Y, and Z. In
the following table, the variances are displayed in bold along the diagonal; the variance
of X, Y, and Z are 2.0, 3.4, and 0.82, respectively. The covariance between X and Y
is -0.86.
X Y Z
X 2.0 -0.86 -0.15
Y -0.86 3.4 0.48
Z -0.15 0.48 0.82
The variance-covariance matrix is symmetric because the covariance between X
and Y is the same as the covariance between Y and X. Therefore, the covariance
for each pair of variables is displayed twice in the matrix: the covariance between
the ith and jth variables is displayed at positions (i, j) and (j, i).
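A quick way to see these properties is to compute a variance-covariance matrix with NumPy (illustrative data; `rowvar=False` tells NumPy that each column is a variable):

```python
import numpy as np

# Three illustrative variables as columns (X, Y, Z); the values are
# made-up sample data, not the matrix shown above.
data = np.array([[1.0, 5.2, 0.7],
                 [2.5, 3.1, 1.9],
                 [0.8, 6.0, 0.2],
                 [3.1, 2.4, 2.2]])

vcov = np.cov(data, rowvar=False)   # 3x3 variance-covariance matrix

print(np.allclose(vcov, vcov.T))  # True: the matrix is symmetric
print(np.diag(vcov))              # variances of X, Y, Z on the diagonal
```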
Many statistical applications calculate the variance-covariance matrix for the estimators
of parameters in a statistical model. It is often used to calculate standard errors of
estimators or functions of estimators. For example, logistic regression creates this matrix
for the estimated coefficients, letting you view the variances of coefficients and the
covariances between all possible pairs of coefficients.
Note
For most statistical analyses, if a missing value exists in any column, Minitab
ignores the entire row when it calculates the correlation or covariance matrix.
However, when you calculate the covariance matrix by itself, Minitab does not
ignore entire rows in its calculations when there are missing values. To obtain
only the covariance matrix, choose Stat > Basic Statistics > Covariance.
What is the Hotelling's t-squared test?
Hotelling's T-squared test compares two groups in a special case of MANOVA, using one
factor that has two levels. The usual T-squared test statistic can be calculated from
Minitab's output using the relationship T-squared = (N - 2)U, where N is the total
number of observations in the data set and U is the Lawley-Hotelling trace.
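A sketch of the two-sample Hotelling's T-squared statistic computed directly with NumPy (the simulated bivariate data and group sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=(12, 2))   # group 1: 12 bivariate observations
y = rng.normal(0.5, 1.0, size=(10, 2))   # group 2: 10 bivariate observations

n1, n2 = len(x), len(y)
d = x.mean(axis=0) - y.mean(axis=0)      # difference of mean vectors

# Pooled sample covariance matrix.
s_pooled = ((n1 - 1) * np.cov(x, rowvar=False)
            + (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)

# Two-sample Hotelling's T-squared statistic.
t_sq = (n1 * n2 / (n1 + n2)) * d @ np.linalg.solve(s_pooled, d)

# The relationship quoted above, T-squared = (N - 2)U, lets you recover
# the Lawley-Hotelling trace U from Minitab's output (and vice versa).
u = t_sq / (n1 + n2 - 2)
print(t_sq, u)
```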
Calculate Levene's test in Minitab
The modified Levene's test uses the absolute deviation of the observations in each
treatment from the treatment median. It then assesses whether the means of these
deviations are equal for all treatments. If the mean deviations are equal, the variances
of the observations in all treatments will be the same. The test statistic for Levene's test
is the ANOVA F-statistic for testing equality of means applied to the absolute deviations.
You can do this in Minitab by making a new column where each value is the absolute
value of the response minus the median of that treatment. Then perform a One-Way
ANOVA using the new column as the response. The F-statistic and p-value will be the
test statistic and p-value for Levene's test.
For example, suppose the responses are in C1 and the treatments are in C2, and C3-
C6 are empty.
C1 C2
Responses Treatments
10 1
8 1
6 1
4 1
3 1
16 2
14 2
10 2
6 2
2 2
To perform Levene's test in Minitab for this data set:
1. Choose Stat > Basic Statistics > 2 Variances.
2. Click Both samples are in one column.
3. In Samples, enter C1.
4. In Sample IDs, enter C2. Click OK.

Tests

Test
Method DF1 DF2 Statistic P-Value
Bonett 1 2.14 0.143
Levene 1 8 2.20 0.176
You can verify these calculations using One-Way ANOVA:
1. Choose Stat > Basic Statistics > Store Descriptive Statistics.
2. In Variables, enter C1.
3. In By variables (optional), enter C2.
4. Click Statistics.
5. Deselect all fields except Median.
6. Click OK in each dialog box.
7. Enter the treatment medians in C5.
C1 C2 C3 C4 C5
Responses Treatments ByVar1 Median1 Treatment Medians
10 1 1 6 6
8 1 2 10 6
6 1 6
4 1 6
3 1 6
16 2 10
14 2 10
10 2 10
6 2 10
2 2 10
8. Choose Calc > Calculator.
9. In Store result in variable, enter C6.
10. In Expression, enter ABSO(C1-C5). Click OK.
11. Choose Stat > ANOVA > One-Way.
12. Choose Response data are in one column for all factor levels.
13. In Response, enter C6.
14. In Factor, enter C2. Click OK.

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Treatments 1 12.10 12.100 2.20 0.176
Error 8 44.00 5.500
Total 9 56.10
When you examine the output, you see that the F-statistic and p-value in the one-way
ANOVA table are identical to the test statistic and p-value for Levene's test.
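If you have SciPy available, `scipy.stats.levene` with `center='median'` performs the same modified Levene's test and reproduces the statistic and p-value above:

```python
from scipy.stats import levene

# The same two treatments used in the worksheet above.
treatment1 = [10, 8, 6, 4, 3]
treatment2 = [16, 14, 10, 6, 2]

# center='median' gives the modified (Brown-Forsythe) Levene's test,
# i.e., absolute deviations from each treatment median.
stat, p = levene(treatment1, treatment2, center='median')
print(round(stat, 2), round(p, 3))  # 2.2 0.176
```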
Scenario for examples
Suppose the design has 2 factors (Factor1 and Factor2). Factor1 has two levels (a and
b) and Factor2 has three levels (x, y, and z). The data for Factor1 are in C1, Factor2 are
in C2, and the responses are in C3. You perform General Linear Model with Factor1,
Factor2, and the 2-way interaction Factor1*Factor2 in the model.
Return to top
Example of letting Minitab calculate the fitted values and store them in the
worksheet
This option allows you to determine the fitted values using the values in the worksheet.
1. Choose Stat > ANOVA > General Linear Model > Fit General Linear Model.
2. In Responses, enter C3. In Factors, enter Factor1 Factor2.
3. Click Model. In the field under Factors and covariates, select both Factor1 and Factor2.
Verify that 2 is selected in the field beside Interactions through order.
4. Click Add, then click OK.
5. Click Storage. Check Fits.
6. Click OK in each dialog box.
The fitted values are stored in the next available blank column in the worksheet, named
FITS1.
Return to top
Example of entering coded values into the equation
Suppose you obtain the following coefficients in the output:

Term Coef SE Coef T P
Constant 8.0000 0.5528 14.47 0.000
Factor1
a -0.6667 0.5528 -1.21 0.273
Factor2
x 5.0000 0.7817 6.40 0.001
y -2.0000 0.7817 -2.56 0.043
Factor1*Factor2
a x -2.8333 0.7817 -3.62 0.011
a y 1.6667 0.7817 2.13 0.077

1. Using the coefficients from the table above, you can obtain the following regression equation:
Fit = 8.0000 - 0.6667 a + 5.0000 x - 2.0000 y - 2.8333 a*x + 1.6667 a*y
Using the default coding that Minitab uses:
o If Factor1 is a, use a = 1
o If Factor1 is b, use a = -1
o If Factor2 is x, use x = 1 and y = 0
o If Factor2 is y, use x = 0 and y = 1
o If Factor2 is z, use x = -1 and y = -1
2. Put the factor levels into the equation.
Suppose the 9th row in the data set has Factor1 = b and Factor2 = z. The fitted value is:
= 8.0000 - 0.6667*(-1) + 5.0000*(-1) - 2.0000*(-1) - 2.8333*(-1)*(-1) + 1.6667*(-1)*(-1)
= 8.0000 + 0.6667 - 5.0000 + 2.0000 - 2.8333 + 1.6667
= 4.5
If you choose to store the fits as described in Option 1, you will see 4.5 in row 9 (with Factor1
= b and Factor2 = z) of the FITS1 column.
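The hand calculation above can be sketched in a few lines (coefficients copied from the table; the dictionary keys are just labels for this example):

```python
# Evaluating the regression equation with Minitab's -1, 0, 1 coding for
# the row with Factor1 = b and Factor2 = z. Coefficients are taken from
# the table above; the key names are labels chosen for this sketch.
coef = {
    'const': 8.0000, 'a': -0.6667,
    'x': 5.0000, 'y': -2.0000,
    'a*x': -2.8333, 'a*y': 1.6667,
}

a = -1          # Factor1 = b
x, y = -1, -1   # Factor2 = z

fit = (coef['const'] + coef['a'] * a + coef['x'] * x + coef['y'] * y
       + coef['a*x'] * a * x + coef['a*y'] * a * y)
print(round(fit, 1))  # 4.5
```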
Return to top
How to display all the coefficients
You can have Minitab display the coefficients that are not displayed by default.
1. Choose Stat > ANOVA > General Linear Model > Fit General Linear Model.
2. Enter the response columns and factor columns.
3. Click Results and beside Coefficients select Full set of coefficients.
4. Click OK in each dialog box.
To calculate least squares means when you have a single covariate, do the following:
1. Open Least Squares Data.MTW.
2. Choose Stat > ANOVA > General Linear Model > Fit General Linear Model.
3. In Responses, enter Size.
4. In Factors, enter PlasticMix and Subject.
5. In Covariates, enter Temperature.
6. Click Options, and beside Means select Main effects.
7. Click OK in each dialog box.
You should obtain the following results:
General Linear Model: Size versus Temperature, PlasticMix, Subject


Factor Information

Factor Type Levels Values
PlasticMix Fixed 3 1, 2, 3
Subject Fixed 6 1, 2, 3, 4, 5, 6


Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Temperature 1 158.30 158.304 17.83 0.000
PlasticMix 2 55.58 27.792 3.13 0.057
Subject 5 39.99 7.997 0.90 0.492
Error 34 301.91 8.880
Lack-of-Fit 32 242.97 7.593 0.26 0.969
Pure Error 2 58.95 29.473
Total 42 618.46


Model Summary

S R-sq R-sq(adj) R-sq(pred)
2.97990 51.18% 39.70% 23.15%


Coefficients

Term Coef SE Coef T-Value P-Value VIF
Constant -8.43 4.61 -1.83 0.076
Temperature 0.1877 0.0445 4.22 0.000 1.32
PlasticMix
1 -1.023 0.731 -1.40 0.171 1.61
2 2.031 0.813 2.50 0.017 2.31
Subject
1 -2.81 1.51 -1.86 0.072 3.75
2 -0.97 1.31 -0.74 0.466 3.09
3 2.22 1.43 1.55 0.131 3.34
4 1.35 1.35 1.00 0.323 3.50
5 0.207 0.913 0.23 0.822 2.26


Regression Equation

PlasticMix Subject
1 1 Size = -12.27 + 0.1877 Temperature

1 2 Size = -10.42 + 0.1877 Temperature

1 3 Size = -7.24 + 0.1877 Temperature

1 4 Size = -8.10 + 0.1877 Temperature

1 5 Size = -9.24 + 0.1877 Temperature

1 6 Size = -9.44 + 0.1877 Temperature

2 1 Size = -9.21 + 0.1877 Temperature

2 2 Size = -7.37 + 0.1877 Temperature

2 3 Size = -4.18 + 0.1877 Temperature

2 4 Size = -5.05 + 0.1877 Temperature

2 5 Size = -6.19 + 0.1877 Temperature

2 6 Size = -6.39 + 0.1877 Temperature

3 1 Size = -12.25 + 0.1877 Temperature

3 2 Size = -10.41 + 0.1877 Temperature

3 3 Size = -7.22 + 0.1877 Temperature

3 4 Size = -8.09 + 0.1877 Temperature

3 5 Size = -9.23 + 0.1877 Temperature

3 6 Size = -9.43 + 0.1877 Temperature


Fits and Diagnostics for Unusual Observations

Obs Size Fit Resid Std Resid
8 7.10 15.20 -8.10 -3.00 R
13 9.20 14.36 -5.16 -2.07 R
31 13.00 7.47 5.53 2.03 R

R Large residual


Means

Fitted
Term Mean SE Mean
PlasticMix
1 9.766 0.972
2 12.820 0.884
3 9.780 0.867
Subject
1 7.97 1.71
2 9.82 1.47
3 13.00 1.57
4 12.14 1.42
5 10.996 0.893
6 10.801 0.860


Covariate Data Mean StDev
Temperature 102.4 11.9

8. To get the Least Squares mean corresponding to PlasticMix = 1, you need to calculate 6
fitted values:
1. PlasticMix = 1, Subject = 1
Size = -12.27 + .1877 * 102.4 = 6.95048
2. PlasticMix = 1, Subject = 2
Size = -10.42 + .1877 * 102.4 = 8.80048
3. PlasticMix = 1, Subject = 3
Size = -7.24 + .1877 * 102.4 = 11.98048
4. PlasticMix = 1, Subject = 4
Size = -8.10 + .1877 * 102.4 = 11.12048
5. PlasticMix = 1, Subject = 5
Size = -9.24 + .1877 * 102.4 = 9.98048
6. PlasticMix = 1, Subject = 6
Size = -9.44 + .1877 * 102.4 = 9.78048
9. Now you need to average the 6 fitted values to obtain 9.766, the Least Squares Mean
corresponding to PlasticMix = 1. You can do the same thing for PlasticMix = 2 and PlasticMix
= 3.
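The averaging in steps 8 and 9 can be sketched as follows (intercepts copied from the regression equation table for PlasticMix = 1; the small difference from 9.766 comes from rounding in the displayed coefficients):

```python
# Least squares mean for PlasticMix = 1: average the six fitted values,
# one per Subject, evaluated at the mean temperature.
intercepts = [-12.27, -10.42, -7.24, -8.10, -9.24, -9.44]
mean_temp = 102.4   # covariate mean from the output above

fits = [b + 0.1877 * mean_temp for b in intercepts]
ls_mean = sum(fits) / len(fits)
print(round(ls_mean, 2))  # 9.77, matching the fitted mean 9.766 up to rounding
```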
Calculate plotted points on an analysis of means
interaction effects plot for the normal case
1. Choose Stat > ANOVA > General Linear Model > Fit General Linear Model.
2. In Responses, enter the column that contains the response variable.
3. In Factors, enter the two columns that contain the factors.
4. Click Storage and check Residuals. Click OK in each dialog box.
5. Choose Stat > Basic Statistics > Store Descriptive Statistics.
6. In Variables, enter the residuals column (RESI1 by default).
7. In By variables (optional), enter the two columns that contain the factors.
8. Click Statistics and check Mean.
9. Click OK in each dialog box.
The values of the plotted points will be stored in the mean column (Mean1 by default).
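Conceptually, the stored point values are the mean residuals in each factor-level cell after fitting a main-effects model. A minimal NumPy sketch with an illustrative balanced 2x3 design:

```python
import numpy as np

# Illustrative 2x3 design with 2 replicates per cell: y[i, j, r] is the
# r-th response at level i of FactorA and level j of FactorB.
rng = np.random.default_rng(7)
y = rng.normal(10, 1, size=(2, 3, 2))

cell = y.mean(axis=2)                 # cell means
grand = cell.mean()                   # grand mean
row_eff = cell.mean(axis=1) - grand   # FactorA main effects
col_eff = cell.mean(axis=0) - grand   # FactorB main effects

# Mean residual in each cell after removing the main effects: these are
# the interaction effects plotted on the ANOM chart.
points = cell - grand - row_eff[:, None] - col_eff[None, :]
print(np.round(points, 3))
```

As expected for interaction effects, the points sum to zero across each row and each column.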
What are multiple comparisons?
Multiple comparisons let you assess the statistical significance of differences between
means using a set of confidence intervals, a set of hypothesis tests, or both. As usual,
the null hypothesis of no difference between means is rejected if and only if zero is not
contained in the confidence interval.
Return to top
Are the individual error rates and family error rates exact?
Individual error rates are exact in all cases. Family error rates are exact for equal group
sizes. If group sizes are unequal, the true family error rate for Tukey, Fisher, and MCB
will be slightly smaller than stated, resulting in conservative confidence intervals. The
Dunnett family error rates are exact for unequal sample sizes.
Return to top
Perform multiple comparisons using One-Way ANOVA
1. Choose Stat > ANOVA > One-Way.
2. Complete the dialog with the appropriate settings for your situation.
3. Click Comparisons.
4. Select the comparisons you want Minitab to display.
5. Click OK in each dialog box.
Return to top
Perform multiple comparisons using a general linear model
1. Choose Stat > ANOVA > General Linear Model > Fit General Linear Model.
2. Complete the dialog with the appropriate settings for your situation and click OK.
3. Choose Stat > ANOVA > General Linear Model > Comparisons.
4. Under Method, select the comparisons you want Minitab to display.
5. Under Choose terms for comparisons, double click the factors that you want Minitab to make
multiple comparisons for.
6. Click OK.
Return to top
Which multiple comparison method should I use?
The selection of the appropriate multiple comparison method depends on the inference
that you want. It is inefficient to use the Tukey all-pairwise approach when Dunnett or
MCB is suitable, because the Tukey confidence intervals will be wider and the
hypothesis tests less powerful for a particular family error rate. For the same reasons,
MCB is superior to Dunnett if you want to eliminate factor levels that are not the best
and to identify those that are best or close to the best. The choice of Tukey versus
Fisher's LSD methods depends on which error rate, family or individual, you want to
specify.
The characteristics and advantages of each method are summarized in the following
table:
Tukey
  Normal data: Yes. Comparison with a control: No. Pairwise comparison: Yes.
  Strength: Most powerful test when doing all pairwise comparisons.
Dunnett
  Normal data: Yes. Comparison with a control: Yes. Pairwise comparison: No.
  Strength: Most powerful test when comparing to a control.
Bonferroni's inequality
  Normal data: -. Comparison with a control: Yes. Pairwise comparison: Yes.
  Strength: Robust procedure, but produces larger confidence intervals. Usually
  conservative.
Sidak's inequality
  Normal data: Yes. Comparison with a control: Yes. Pairwise comparison: Yes.
  Strength: Slightly better than Bonferroni's procedure. Usually conservative.
Hsu's MCB method
  Normal data: Yes. Comparison with a control: No. Pairwise comparison: Yes.
  Strength: The most powerful test when you compare the group with the highest or
  lowest mean to the other groups.
Games-Howell
  Normal data: Yes. Comparison with a control: No. Pairwise comparison: Yes.
  Strength: Used when you do not assume equal variances.
Note
One-Way ANOVA also offers Fisher's LSD method for individual confidence intervals.
Fisher's is not a multiple comparison method; instead, it constructs the individual
confidence intervals for the pairwise differences between means using an individual
error rate. Fisher's LSD method inflates the family error rate, which is displayed in the
output.
Return to top
Which means should I compare?
It is important to consider which means to compare when using multiple comparisons; a
bad choice can result in confidence levels that are not what you think. Issues that
should be considered when making this choice might include:
1. How deep into the design should you compare means: only within each factor, within each
combination of first-level interactions, or across combinations of higher-level interactions?
2. Should you compare the means for only those terms with a significant F-test or for those sets
of means for which differences seem to be large?
How deep within the design should you compare means? There is a trade-off: if you
compare means at all two-factor combinations and higher orders turn out to be
significant, then the means that you compare might be a combination of effects; if you
compare means at too deep a level, you lose power because the sample sizes become
smaller and the number of comparisons becomes larger. You might decide to compare
means for factor level combinations for which you believe the interactions are
meaningful.
Minitab restricts the terms that you can compare means for to fixed terms or interactions
between fixed terms. Nesting is considered to be a form of interaction.
Usually, you should decide which means you will compare before you collect your data.
If you compare only those means with differences that seem to be large, which is called
data snooping, then you are increasing the likelihood that the results indicate a real
difference where no difference exists. If you condition the application of multiple
comparisons on achieving a significant F-test, then you increase the probability that
differences exist among the groups but you do not detect them. Because the multiple
comparison methods already protect against the detection of a difference that does not
exist, you do not need the F-test to guard against this probability.
However, many people commonly use F-tests to guide the choice of which means to
compare. The ANOVA F-tests and multiple comparisons are not entirely separate
assessments. For example, if the p-value of an F-test is 0.9, you probably will not
discover statistically significant differences between means by multiple comparisons.
Return to top
What if the p-value from the ANOVA table conflicts with the multiple
comparisons output?
The p-value in the ANOVA table and the multiple comparison results are based on
different methodologies and can occasionally produce contradictory results. For
example, it is possible that the ANOVA p-value indicates that there are no
differences between the means while the multiple comparisons output indicates that
some means are different. In this case, you can generally trust the multiple
comparisons output.
You do not need to rely on a significant p-value in the ANOVA table to reduce the
chance of detecting a difference that doesn't exist. This protection is already
incorporated in the Tukey, Dunnett, and MCB tests (and Fisher's test when the means
are equal).
What are individual and family error rates?
The type I error rates associated with the multiple comparisons are often used to
identify significant differences between specific factor levels in an ANOVA.
What is the individual error rate?
The individual error rate is the maximum probability that a single comparison will
incorrectly conclude that the observed difference is significantly different from the null
hypothesis.
What is the family error rate?
The family error rate is the maximum probability that a procedure consisting of more
than one comparison will incorrectly conclude that at least one of the observed
differences is significantly different from the null hypothesis. The family error rate is
based on both the individual error rate and the number of comparisons. For a single
comparison, the family error rate is equal to the individual error rate which is the alpha
value. However, each additional comparison causes the family error rate to increase in
a cumulative manner.
It is important to consider the family error rate when making multiple comparisons
because your chances of committing a type I error for a series of comparisons are
greater than the error rate for any one comparison alone.
Example of setting the individual error rate and family error rate
You do a one-way ANOVA to examine steel strength from five different steel plants
using 25 samples from each plant.
You decide to examine all 10 comparisons between the five plants to determine
specifically which means are different. If you assign an alpha of 0.05 to each of the 10
comparisons (the individual error rate), Minitab calculates a family error rate of 0.28 for
the set of 10 comparisons. However, if you want the entire set of comparisons to have a
family error rate of 0.05, then Minitab automatically assigns each individual comparison
an alpha of 0.007.
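You can see this inflation with a small Monte Carlo sketch: draw five groups from the same population and run all 10 pairwise t-tests at alpha = 0.05. Minitab's exact value (0.28) accounts for the dependence among the comparisons; the simulation below gives a similar figure.

```python
from itertools import combinations
import numpy as np
from scipy.stats import ttest_ind

# Five groups of 25 drawn from the SAME population, so every rejection
# is a false alarm. Count how often at least one of the 10 pairwise
# t-tests at alpha = 0.05 rejects.
rng = np.random.default_rng(0)
alpha, n_sims, false_alarm = 0.05, 1000, 0

for _ in range(n_sims):
    groups = [rng.normal(size=25) for _ in range(5)]
    pvals = [ttest_ind(g1, g2).pvalue for g1, g2 in combinations(groups, 2)]
    if min(pvals) < alpha:
        false_alarm += 1

family_rate = false_alarm / n_sims
print(round(family_rate, 2))  # roughly 0.3, far above the 0.05 individual rate
```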
Tukey's method, Fisher's least significant difference (LSD), Hsu's multiple comparisons
with the best (MCB), and Bonferroni confidence intervals are methods for calculating
and controlling the individual and family error rates for multiple comparisons.
Using confidence levels to identify significant
differences between factor levels in multiple
comparisons
The confidence levels associated with the confidence intervals are often used in multiple
comparisons to identify significant differences between specific factor levels in an
ANOVA. These confidence levels are analogous to the individual and family error rates
but apply to confidence intervals.
What is an individual confidence level?
The percentage of confidence intervals that will include the true population parameter or
true difference between factor levels if the study were repeated multiple times.
What is a simultaneous confidence level?
The percentage of times that a group of confidence intervals will all include the true
population parameters or true differences between factor levels if the study were
repeated multiple times. The simultaneous confidence level is based on both the
individual confidence level and the number of confidence intervals. For a single
comparison, the simultaneous confidence level is equal to the individual confidence
level. However, each additional confidence interval causes the simultaneous confidence
level to decrease in a cumulative way.
It is important to consider the simultaneous confidence level when you examine multiple
confidence intervals because your chances of excluding a parameter or true difference
between factor levels for a family of confidence intervals is greater than for any one
confidence interval.
Tukey's method, Fisher's least significant difference (LSD), Hsu's multiple comparisons
with the best (MCB), and Bonferroni confidence intervals are methods for calculating
and controlling the individual and simultaneous confidence levels.
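For intervals that were independent, the arithmetic is simple. This is a sketch of the idea only: Tukey's method accounts for the dependence between pairwise intervals, which is why its required individual level (99.35% in the examples) is lower than the independence-based value computed here.

```python
# Simultaneous coverage of k INDEPENDENT confidence intervals is the
# product of the individual levels (Sidak's idea).
k = 10
individual = 0.95
simultaneous = individual ** k
print(round(simultaneous, 3))  # 0.599: ten 95% intervals jointly cover ~60%

# Individual level needed so that 10 independent intervals are jointly 95%.
needed = 0.95 ** (1 / k)
print(round(needed * 100, 2))  # 99.49
```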
Example of individual confidence intervals and simultaneous confidence
intervals
For example, you measure the response times for memory chips. You take a sample of
25 chips from five different manufacturers.
You decide to examine all 10 comparisons between the five manufacturers to determine
specifically which means are different. Using Tukey's method, you specify that the entire
set of confidence intervals should have a 95% simultaneous confidence level. Minitab
calculates that the 10 individual confidence levels need to be 99.35% to obtain the 95%
simultaneous confidence level. These wider Tukey confidence intervals provide less
precise estimates of the population parameters but limit the probability that one or more
of the confidence intervals does not contain the true difference to a maximum of 5%.
Understanding this context, you can then examine the confidence intervals to determine
whether any do not include zero, which would denote a significant difference.
Confidence intervals with 95% individual confidence levels
Confidence intervals with 99.35% individual confidence levels to obtain a 95% simultaneous confidence level using Tukey's method
Comparison of 95% confidence intervals to the wider 99.35% confidence intervals used
by Tukey's in the previous example. The reference line at 0 shows how the wider Tukey
confidence intervals can change your conclusions. Confidence intervals that contain
zero denote no difference. (Only 5 of the 10 comparisons are shown because of space
considerations.)
What is Tukey's method for multiple comparisons?
Tukey's method is used in ANOVA to create confidence intervals for all pairwise
differences between factor level means while controlling the family error rate to a level
you specify. It is important to consider the family error rate when making multiple
comparisons because your chances of making a type I error for a series of comparisons
are greater than the error rate for any one comparison alone. To counter this higher error
rate, Tukey's method adjusts the confidence level for each individual interval so that the
resulting simultaneous confidence level is equal to the value you specify.
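In recent SciPy versions (1.8+), `scipy.stats.tukey_hsd` performs these pairwise comparisons; a sketch with three illustrative groups:

```python
import numpy as np
from scipy.stats import tukey_hsd

# Three illustrative groups of 25 observations; group c has a clearly
# different mean from groups a and b.
rng = np.random.default_rng(3)
a = rng.normal(10.0, 1.0, 25)
b = rng.normal(10.2, 1.0, 25)
c = rng.normal(12.0, 1.0, 25)

res = tukey_hsd(a, b, c)
print(res.pvalue)   # 3x3 matrix of Tukey-adjusted p-values

# Simultaneous 95% confidence intervals for all pairwise differences.
ci = res.confidence_interval(confidence_level=0.95)
print(ci.low[0, 2], ci.high[0, 2])   # interval for mean(a) - mean(c)
```

An interval that excludes zero, such as the one for mean(a) - mean(c) here, denotes a significant pairwise difference at the chosen simultaneous level.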
Example of Tukey confidence intervals
You are measuring the response times for memory chips. You sampled 25 chips from
five different manufacturers.
You decide to examine all 10 comparisons between the five manufacturers to determine
specifically which means are different. Using Tukey's method, you specify that the entire
set of comparisons should have a family error rate of 0.05 (equivalent to a 95%
simultaneous confidence level). Minitab calculates that the 10 individual confidence
levels need to be 99.35% in order to obtain the 95% joint confidence level. These wider
Tukey confidence intervals provide less precise estimates of the population parameter
but limit the probability that one or more of the confidence intervals does not contain the
true difference to a maximum of 5%. Understanding this context, you can then examine
the confidence intervals to determine whether any do not include zero, indicating a
significant difference.
Confidence intervals with 95% individual confidence levels
Confidence intervals with 99.35% individual confidence levels to obtain a 95% joint confidence level using Tukey's method
Comparison of 95% confidence intervals to the wider 99.35% confidence intervals used
by Tukey's in the previous example. The reference line at 0 shows how the wider Tukey
confidence intervals can change your conclusions. Confidence intervals that contain
zero indicate no difference. (Only 5 of the 10 comparisons are shown due to space
considerations.)
What is Dunnett's method for multiple comparisons?
Dunnett's method is used in ANOVA to create confidence intervals for differences
between the mean of each factor level and the mean of a control group. If an interval
contains zero, then there is no significant difference between the two means under
comparison. You specify a family error rate for all comparisons, and Dunnett's method
determines the confidence levels for each individual comparison accordingly.
Example of Dunnett's method
You are studying three weight loss pills to determine whether they are significantly
different from a placebo. In a double-blind experiment, fifty people receive Pill A, fifty
people receive Pill B, fifty people receive Pill C, and fifty people receive a placebo. The
placebo group is the control group. You record the average weight loss of each group
and perform ANOVA with Dunnett's method to determine whether any of the three pills
produce weight loss that is significantly different than the placebo. Dunnett's method
produces three confidence intervals: one for the difference in mean weight loss between
group A and the placebo group, one for the difference in mean weight loss between
group B and the placebo group, and one for the difference in mean weight loss between
group C and the placebo group. You set the family error rate for all three comparisons
at 0.10, so the confidence level for all the comparisons is 90%.
The confidence interval for the difference between Pill A and the placebo contains zero;
therefore, you conclude that there is no difference between the weight loss in group A
and the placebo group. The confidence interval for the difference between Pill B and the
placebo contains only negative numbers; therefore, you conclude that subjects in group
B lost less weight than subjects in the placebo group. In other words, Pill B prevents
weight loss. Finally, the confidence interval for the difference between Pill C and the
placebo contains only positive numbers; therefore, you conclude that Pill C produces
significantly greater weight loss than a placebo. As a result of this study, you
recommend Pill C.
What is Fisher's least significant difference (LSD)
method for multiple comparisons?
Fisher's LSD method is used in ANOVA to create confidence intervals for all pairwise
differences between factor level means while controlling the individual error rate to a
significance level you specify. Fisher's LSD method then uses the individual error rate
and number of comparisons to calculate the simultaneous confidence level for all
confidence intervals. This simultaneous confidence level is the probability that all
confidence intervals contain the true difference. It is important to consider the family
error rate when making multiple comparisons because the chance of committing a
type I error for a series of comparisons is greater than the error rate for any one
comparison alone.
Example of Fisher's LSD method
For example, you are measuring the response times for memory chips. You take a
sample of 25 chips from five different manufacturers. The ANOVA resulted in a p-value
of 0.01, leading you to conclude that at least one of the manufacturer means is different
from the others.
You decide to examine all 10 comparisons between the five manufacturers to determine
specifically which means are different. Using Fisher's LSD method, you specify that
each comparison should have an individual error rate of 0.05 (equivalent to a 95%
confidence level). Minitab creates these ten 95% confidence intervals and calculates
that this set yields a 71.79% simultaneous confidence level. Understanding this context,
you can then examine the confidence intervals to determine whether any do not include
zero, indicating a significant difference.
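To see the scale of the problem in this example, you can count the comparisons and bound the family error rate. This is only an illustrative sketch: the first formula assumes independent comparisons and the second is the conservative Bonferroni bound, so neither reproduces Minitab's exact 71.79% simultaneous confidence level, which accounts for the correlation between comparisons that share the same data.

```python
from math import comb

# Number of pairwise comparisons among k factor levels
k = 5
n_comparisons = comb(k, 2)  # 10 comparisons among 5 manufacturers

# Individual error rate for each Fisher LSD comparison
alpha = 0.05

# If the comparisons were independent, the family error rate would be
# 1 - (1 - alpha)^n; pairwise comparisons are correlated, so Minitab
# computes the exact simultaneous level differently
family_error_if_independent = 1 - (1 - alpha) ** n_comparisons

# Bonferroni inequality: the family error rate is at most the sum of the
# individual error rates, even for correlated comparisons
family_error_upper_bound = min(1.0, n_comparisons * alpha)

print(n_comparisons)                          # 10
print(round(family_error_if_independent, 4))  # 0.4013
print(family_error_upper_bound)               # 0.5
```

Either way, the family error rate is far larger than the 0.05 individual rate, which is why the simultaneous confidence level (71.79% here) is worth reporting.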
What is Hsu's multiple comparisons with the best
(MCB)?
Hsu's multiple comparisons with the best (MCB) is a multiple comparison method
designed to identify the factor level that is the best, the levels that are not significantly
different from the best, and those that are significantly different from the
best. You can define "best" as either the highest or lowest mean. This procedure is
usually used after an ANOVA to more precisely analyze differences between level
means.
Hsu's MCB method creates a confidence interval for the difference between each level
mean and the best of the remaining level means. If an interval has zero as an end point,
there is a statistically significant difference between the corresponding means.
Specifically:

                                         Highest is best       Lowest is best
Confidence interval contains zero        No difference         No difference
Confidence interval entirely above zero  Significantly better  Significantly worse
Confidence interval entirely below zero  Significantly worse   Significantly better
For this method, you specify the family error rate and the individual error rate is adjusted
to achieve it. Hsu's MCB method only compares a subset of all possible pairwise
comparisons, unlike Tukey's method which does all comparisons. Therefore, Hsu's
MCB method will generate tighter confidence intervals and more powerful tests for any
specified family error rate.
Example of Hsu's MCB method
For example, a memory chip manufacturer randomly samples four production lines to
determine which line produces the chips with the fastest response time. The mean
response time for each production line is in the following table.
Production line Mean response time N
1 4.85 20
2 10.05 20
3 7.45 20
4 1.20 20
The analyst defines "best" as being the lowest (fastest) mean response time, which is
line 4, and uses Hsu's MCB method to identify any production lines that are significantly
different from the best. This produces the following confidence intervals [tested level -
best of remaining levels].
Production line (compared to best) Lower limit Center Upper limit
1 -1.2 3.65 8.5
2 0 8.85 13.3
3 0 6.25 10.2
4 -8.5 -3.65 1.2
Based on the confidence intervals, the analyst concludes that lines 2 and 3 are
producing chips that are significantly slower (higher mean) than line 4 because their
confidence intervals are entirely above zero. However, there is no evidence to indicate a
significant difference between lines 1 and 4 because their confidence intervals contain
zero (no difference). The analyst might investigate the processes of lines 2 and 3 more
carefully.
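The centers of the MCB intervals in the table can be reproduced directly: each center is the tested level's mean minus the best ("best" = lowest here) of the remaining level means. The sketch below computes only the centers; the interval widths require Hsu's critical values, which are not derived here.

```python
# Mean response times for the four production lines (from the table above)
means = {1: 4.85, 2: 10.05, 3: 7.45, 4: 1.20}

# Center of each MCB interval: tested level mean minus the best
# (lowest) of the remaining level means
centers = {}
for line, mean in means.items():
    best_of_rest = min(m for other, m in means.items() if other != line)
    centers[line] = round(mean - best_of_rest, 2)

print(centers)  # {1: 3.65, 2: 8.85, 3: 6.25, 4: -3.65}
```

The signs match the conclusions above: lines 2 and 3 have large positive centers (slower than the best), while lines 1 and 4 have centers within the width of their intervals.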
Note
When the tested level is significantly better or worse than the comparison level, Hsu's
MCB method does not provide a minimum bound on how much better/worse.
What is the Bonferroni method?
The Bonferroni method controls the simultaneous confidence level for an entire set of
confidence intervals. It is important to consider the simultaneous confidence level when
you examine multiple confidence intervals because the chance that at least one of the
confidence intervals does not contain the population parameter is greater for a set of
intervals than for any single interval. To counter this higher error rate, the Bonferroni
method adjusts the confidence level for each individual interval so that the resulting
simultaneous confidence level is equal to the value you specify.
Example of Bonferroni confidence intervals
You want to examine the confidence intervals for delivery time in days from five shipping
centers. You generate the two sets of five confidence intervals using the same data.

Unadjusted 95% Confidence Intervals for Delivery Times by Shipping Center

Bonferroni 95% Confidence Intervals for Delivery Times by Shipping Center (99% Individual Confidence Intervals)
These graphs compare regular 95% confidence intervals to the Bonferroni 95%
confidence intervals. The wider Bonferroni confidence intervals provide less precise
estimates of the population parameter but limit the probability that one or more of the
confidence intervals does not contain the parameter to a maximum of 5%. In
comparison, the family error rate associated with the five regular 95% confidence
intervals can be as high as 25%.
This conservative method ensures that the overall confidence level is at least 1 − α. To
obtain an overall confidence level of 1 − α for the joint interval estimates, Minitab
constructs each interval with a confidence level of 1 − α/g, where g is the number of
intervals. In the Bonferroni intervals, Minitab uses 99% confidence intervals (1.00 −
0.05/5 = 0.99) to achieve the 95% simultaneous confidence level.
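The adjustment itself is a one-line calculation. A minimal sketch using the numbers from the shipping-center example:

```python
# Bonferroni adjustment: to achieve a simultaneous confidence level of
# 1 - alpha across g intervals, build each interval at level 1 - alpha/g
def bonferroni_individual_level(alpha, g):
    return 1 - alpha / g

# Five shipping-center intervals with a 95% simultaneous level
level = bonferroni_individual_level(0.05, 5)
print(level)  # 0.99
```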
Manually calculate Bonferroni confidence intervals for
the standard deviations (sigmas)
Complete the following steps to manually calculate the Bonferroni confidence intervals
for the standard deviations (sigmas) of your factor levels instead of using Stat > Basic
Statistics > 2 Variances or Stat > ANOVA > Test for Equal Variances.
Suppose you are using the Minitab sample data set FURNACE.MTW and analyzing the
response 'BTU.In' and factor Damper, which has 2 levels. To calculate the confidence
interval for level 1 of Damper, using a familywise confidence level of 95% (0.95), do the
following:
1. Calculate K and store the value in a constant called K1.
1. Open the Minitab sample data set FURNACE.MTW.
2. Choose Calc > Calculator.
3. In Store result in variable, enter K1.
4. In Expression, enter 0.05 / (2 * 2).
5. Click OK.
2. Calculate the variance and N for each level of Damper and store the results in the data
window.
1. Choose Stat > Basic Statistics > Store Descriptive Statistics.
2. In Variables, enter BTU.In.
3. In By variables (optional), enter Damper.
4. Click Statistics. Check Variance and N nonmissing.
5. Click OK in each dialog box.
For Damper level 1, var = 9.11960 and n = 40.
3. Calculate U and store the result in a worksheet constant named U1.
1. Choose Calc > Probability Distributions > Chi-Square.
2. Choose Inverse cumulative probability.
3. In Degrees of freedom, type 39.
4. Choose Input constant, and enter K1.
5. In Optional storage, enter U1.
6. Click OK.
4. Calculate the upper bound for the confidence interval.
1. Choose Calc > Calculator.
2. In Store result in variable, enter UpperL1.
3. In Expression, enter ((39 * 9.11960) / U1)**0.5.
4. Click OK.
The upper boundary for the 95% Bonferroni confidence interval for level 1 of Damper is
4.02726. The lower bound is calculated the same way, using L instead of U.
5. Calculate L and store the result in a worksheet constant named L1.
1. Choose Calc > Calculator.
2. In Store result in variable, enter K2.
3. In Expression, enter 1 - K1.
4. Click OK.
5. Choose Calc > Probability Distributions > Chi-Square.
6. Choose Inverse cumulative probability.
7. In Degrees of freedom, type 39.
8. Choose Input constant and enter K2.
9. In Optional storage, enter L1.
10. Click OK.
6. Calculate the lower bound for the confidence interval.
1. Choose Calc > Calculator.
2. In Store result in variable, enter LowerL1.
3. In Expression, enter ((39 * 9.11960) / L1)**0.5.
4. Click OK.
The lower bound for the 95% Bonferroni confidence interval for level 1 of Damper is 2.40655.
Note
When there is more than one factor, you need to consider each distinct factor level
combination as a separate factor level.
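The same computation can be reproduced outside Minitab. The sketch below uses SciPy's chi-square inverse CDF in place of Minitab's Calculator (this assumes SciPy is installed); the inputs come from the FURNACE.MTW example above: Damper level 1, variance 9.11960, n = 40, 2 factor levels, and a 95% familywise confidence level.

```python
from math import sqrt
from scipy.stats import chi2

alpha, m = 0.05, 2            # family error rate, number of factor levels
var, n = 9.11960, 40          # sample variance and size for Damper level 1
k = alpha / (2 * m)           # per-tail error rate, 0.0125 (Minitab's K1)

u = chi2.ppf(k, n - 1)        # lower chi-square quantile (Minitab's U1)
low = chi2.ppf(1 - k, n - 1)  # upper chi-square quantile (Minitab's L1)

upper = sqrt((n - 1) * var / u)    # about 4.027, matching 4.02726
lower = sqrt((n - 1) * var / low)  # about 2.407, matching 2.40655

print(round(lower, 3), round(upper, 3))
```

Note that the smaller chi-square quantile goes in the denominator of the upper bound, mirroring steps 3 and 4 above.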
What is the adjusted p-value in multiple comparisons?
Used for multiple comparisons in General Linear Model ANOVA, the adjusted p-value
indicates which factor level comparisons within a family of comparisons (hypothesis
tests) are significantly different. If the adjusted p-value is less than alpha, then you
reject the null hypothesis. The adjustment limits the family error rate to the alpha level
you choose. If you use a regular p-value for multiple comparisons, then the family error
rate grows with each additional comparison. The adjusted p-value also represents the
smallest family error rate at which a particular null hypothesis will be rejected.
It is important to consider the family error rate when making multiple comparisons
because the chance of committing a type I error for a series of comparisons is greater
than the error rate for any one comparison alone.
Example of adjusted p-values
Suppose you compare the hardness of 4 different blends of paint. You analyze the data
and get the following output:

Tukey Simultaneous Tests for Differences of Means

                      Difference  SE of                                 Adjusted
Difference of Levels  of Means    Difference  95% CI            T-Value  P-Value
Blend 2 - Blend 1     -6.17       2.28        (-12.55,  0.22)   -2.70    0.061
Blend 3 - Blend 1     -1.75       2.28        ( -8.14,  4.64)   -0.77    0.868
Blend 4 - Blend 1      3.33       2.28        ( -3.05,  9.72)    1.46    0.478
Blend 3 - Blend 2      4.42       2.28        ( -1.97, 10.80)    1.94    0.245
Blend 4 - Blend 2      9.50       2.28        (  3.11, 15.89)    4.17    0.002
Blend 4 - Blend 3      5.08       2.28        ( -1.30, 11.47)    2.23    0.150

Individual confidence level = 98.89%
You choose an alpha of 0.05 which, in conjunction with the adjusted p-value, limits the
family error rate to 0.05. At this level, the difference between blends 4 and 2 is
significant. If you lower the family error rate to 0.01, the difference between blends 4
and 2 is still significant.
What is an interaction?
An interaction occurs when the effect of one factor depends on the level of another
factor. You can use an interaction plot to visualize possible interactions.
Parallel lines in an interaction plot indicate no interaction. The greater the difference in
slope between the lines, the higher the degree of interaction. However, the interaction
plot doesn't tell you whether the interaction is statistically significant.
Example of an interaction plot
For example, cereal grains must be dry enough before the packaging process. Lab
technicians collect moisture data on grains at several oven times and temperatures.

This plot indicates an interaction between the oven temperature and oven time. The
grain has a lower moisture percentage when baked for a time of 60 minutes as opposed
to 30 minutes at 125 and 130 degrees. However, when the temperature is 135 degrees,
the grain has a lower moisture percentage when baked for 30 minutes.
Interaction plots are most often used to visualize interactions during ANOVA or DOE.
Minitab draws a single interaction plot if you enter two factors, or a matrix of interaction
plots if you enter more than two factors.
Which interaction plots are available in Minitab?
Minitab provides interaction plots to accompany various analyses. Use the interaction
option available through:
Stat > DOE > Factorial > Factorial Plots to generate interaction plots specifically for factorial
designs.
Stat > DOE > Mixture > Factorial Plots to generate interaction plots specifically for process
variables in mixture designs.
Stat > ANOVA > General Linear Model > Factorial Plots to generate interaction plots for the
fitted values from doing an analysis of variance.
Stat > Regression and then choose either Regression > Factorial Plots, Binary Logistic
Regression > Factorial Plots, or Poisson Regression > Factorial Plots to generate interaction
plots from a regression model.
Note
All of these options let you use fitted means. Usually, plots using fitted values and plots
created by the Interaction Plot command using the response will not be the same. They
will be the same if the data set is balanced and you fit a full model.
Enter factor levels in the worksheet with make
patterned data
You can use Calc > Make Patterned Data > Simple Set of Numbers to enter the level
numbers of a factor. Here is an easy way to enter the level numbers for a three-way
crossed design with a, b, and c levels of factors A, B, C, with n observations per cell:
1. Enter the levels for factor A.
1. Choose Calc > Make Patterned Data > Simple Set of Numbers, and press the F3 key
to reset defaults.
2. Enter A in Store patterned data in.
3. Enter 1 in From first value. Enter the number of levels in A in To last value.
4. Enter the product of bcn in Number of times to list the sequence.
5. Click OK.
2. Enter the levels for factor B.
1. Choose Calc > Make Patterned Data > Simple Set of Numbers, and press the F3 key
to reset defaults.
2. Enter B in Store patterned data in.
3. Enter 1 in From first value. Enter the number of levels in B in To last value.
4. Enter the number of levels in A in Number of times to list each value. Enter the
product of cn in Number of times to list the sequence.
5. Click OK.
3. Enter the levels for factor C.
1. Choose Calc > Make Patterned Data > Simple Set of Numbers, and press the F3 key
to reset defaults.
2. Enter C in Store patterned data in.
3. Enter 1 in From first value. Enter the number of levels in C in To last value.
4. Enter the product of ab in Number of times to list each value. Enter the sample size n
in Number of times to list the sequence.
5. Click OK.
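The three steps above can be sketched in Python. The helper below mirrors the Simple Set of Numbers dialog: "number of times to list each value" and "number of times to list the sequence" map directly to the two repeat counts. A hypothetical 2 × 2 × 2 design with one observation per cell is used for illustration.

```python
# Mimic Calc > Make Patterned Data > Simple Set of Numbers:
# list each value 'each' times, then repeat the whole sequence 'times' times
def patterned(last, each, times):
    return [v for _ in range(times)
              for v in range(1, last + 1)
              for _ in range(each)]

# Hypothetical a x b x c design with n observations per cell,
# following the three steps above
a, b, c, n = 2, 2, 2, 1
A = patterned(a, 1, b * c * n)  # sequence repeated bcn times
B = patterned(b, a, c * n)      # each value a times, sequence cn times
C = patterned(c, a * b, n)      # each value ab times, sequence n times

# Every factor-level combination appears exactly once per replicate
print(sorted(set(zip(A, B, C))))
```

For a = b = c = 2 and n = 1 this yields all 8 combinations, confirming that the three columns together form a full crossed design.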
What are reduced models and hierarchical models?
Reduced models
A reduced model is a model that does not include all the possible terms. For example,
suppose you have a three-factor design with factors A, B, and C. The full model would
include:
all one factor terms: A, B, C
all two-factor interactions: A * B, A * C, B * C
the three-factor interaction: A * B * C
It becomes a reduced model by omitting terms. You might reduce a model if terms are
not significant or if you need additional error degrees of freedom and you can assume
that certain terms are zero.
Hierarchical models
A hierarchical model is a model where for each term in the model, all lower order terms
contained in it must also be in the model. For example, suppose there is a model with
four factors: A, B, C, and D. If the term A * B * C is in the model then the terms A, B, C,
A*B, A*C, and B*C must also be in the model, though any terms with D do not have to
be in the model. The hierarchical structure applies to nesting as well. If B (A) is in the
model, then A must be also.
A model is non-hierarchical if it does not contain all of the lower order terms for each
term in the model.
What are randomized block designs and Latin square
designs?
Some designed experiments can effectively provide information when measurements
are difficult or expensive to make or can minimize the effect of unwanted variability on
treatment inference. The following is a brief discussion of two commonly used designs.
To show these designs, two treatment factors (A and B) and their interaction (A*B) are
considered. These designs are not restricted to two factors, however. If your design is
balanced, you can use Balanced ANOVA to analyze your data. If it is not, use GLM.
Randomized block design
A randomized block design is a commonly used design for minimizing the effect of
variability when it is associated with discrete units (e.g. location, operator, plant, batch,
time). The usual case is to randomize one replication of each treatment combination
within each block. There is usually no intrinsic interest in the blocks and these are
considered to be random factors. The usual assumption is that the block by treatment
interaction is zero and this interaction becomes the error term for testing treatment
effects. If you name the blocking variable as Block, the terms in the model would be
Block, A, B, and A*B. You would also specify Block as a random factor.
Latin square with repeated measures design
A repeated measures design is a design where repeated measurements are made on
the same subject. There are a number of ways in which treatments can be assigned to
subjects. With living subjects especially, systematic differences (because of learning,
acclimation, resistance, and so on) between successive observations can be suspected.
One common way to assign treatments to subjects is to use a Latin square design. An
advantage of this design for a repeated measures experiment is that it ensures a
balanced fraction of a complete factorial (that is, all treatment combinations
represented) when subjects are limited and the sequence effect of treatment can be
considered to be negligible.
A Latin square design is a blocking design with two orthogonal blocking variables. In an
agricultural experiment there might be perpendicular gradients that might lead you to
choose this design. For a repeated measures experiment, one blocking variable is the
group of subjects and the other is time. If the treatment factor B has three levels, b1, b2,
and b3, then one of twelve possible Latin square randomizations of the levels of B to
subjects groups over time is:

Time 1 Time 2 Time 3
Group 1 b2 b3 b1
Group 2 b3 b1 b2
Group 3 b1 b2 b3
The subjects receive the treatment levels in the order specified across the row. In this
example, group 1 subjects would receive the treatment levels in order b2, b3, b1. The
interval between administering treatments should be chosen to minimize carryover
effect of the previous treatment.
This design is commonly modified to provide information about one or more additional
factors. If each group was assigned a different level of factor A, then information about
the A and A*B effects could be made available with minimal effort if an assumption
about the sequence effect given to the groups can be made. If the sequence effects are
negligible compared to the effects of factor A, then the group effect could be attributed
to factor A. If interactions with time are negligible, then partial information about the A *
B interaction can be obtained. In the language of repeated measures designs, factor A
is called a between-subjects factor and factor B a within-subjects factor.
It is not necessary to randomize a repeated measures experiment with a Latin square
design.
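A Latin square of any size can be sketched by cycling the treatment sequence one position per row, so that every treatment appears exactly once in each row (group) and each column (time). This produces one valid square; the example above shows a different one of the twelve possible randomizations for three treatments.

```python
# Build a cyclic Latin square: shift the treatment list by one per row
def cyclic_latin_square(treatments):
    n = len(treatments)
    return [[treatments[(row + col) % n] for col in range(n)]
            for row in range(n)]

square = cyclic_latin_square(["b1", "b2", "b3"])
for row in square:
    print(row)  # each row (group) and each column (time) holds all treatments
```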
Balanced and unbalanced designs in ANOVA models
In ANOVA and DOE, a balanced design has an equal number of observations for all
possible combinations of factor levels. An unbalanced design has an unequal number of
observations.
Balanced Design
You have exactly one observation for all possible combinations of the factor levels for
factors A, B, and C: (0, 0, 0); (0, 0, 1); (0, 1, 0); (0, 1, 1); (1, 0, 0); (1, 0, 1); (1, 1, 0); and
(1, 1, 1).
C1 C2 C3
A B C
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
Unbalanced Design
Here, you are missing the (1, 0, 0) factor level combination and you have two
observations of the (0, 1, 0) combination. Either one of these conditions, by itself,
makes this design unbalanced.
C1 C2 C3
A B C
0 0 0
0 1 0
0 1 0
0 0 1
0 1 1
1 1 0
1 0 1
1 1 1
Analysis of a balanced design is usually straightforward because you can use the
differences between the raw factor level means for your estimates of the main and
interaction effects. If your design is not balanced, either by plan or by accidental loss of
data, differences in the raw factor level means may reflect the unbalanced observations
rather than changes in the factor levels. For unbalanced designs, you can use fitted means
to predict the results a balanced design would have produced.
Determine whether your data are balanced
Your design must be balanced to use Balanced ANOVA. For a small data set, you can
look in the worksheet and easily see if the data are balanced.
To determine whether your data are balanced with large data sets, create a cross
tabulation table. To create this table, choose Stat > Tables > Cross Tabulation and Chi-
Square. Examine the cells in the resulting output. A cell is the intersection of a row and
a column. If a cell's count is not equal to the counts of all other cells, you have
unbalanced data.
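The cross tabulation check can be sketched in a few lines of Python: count each factor-level combination, then verify that every possible combination is present and that all counts are equal. The data below are the unbalanced example above, where (1, 0, 0) is missing and (0, 1, 0) appears twice.

```python
from collections import Counter
from itertools import product

# Rows from the unbalanced design example (factors A, B, C)
rows = [(0, 0, 0), (0, 1, 0), (0, 1, 0), (0, 0, 1),
        (0, 1, 1), (1, 1, 0), (1, 0, 1), (1, 1, 1)]

counts = Counter(rows)  # the cross tabulation: one cell per combination

# Balanced: every possible combination is present, and every
# combination appears the same number of times
levels = [sorted(set(col)) for col in zip(*rows)]
all_present = len(counts) == len(list(product(*levels)))
balanced = all_present and len(set(counts.values())) == 1

print(balanced)  # False
```

Either failed condition alone, the missing cell or the unequal counts, is enough to make the design unbalanced.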
Rank deficiency and full rank in ANOVA models
About the "rank deficiency" message
Linear models are full rank when there are an adequate number of observations per
factor level combination to be able to estimate all terms included in the model. When not
enough observations are in the data to fit the model, Minitab removes terms until the
model is small enough to fit. It is possible that other models may fit the data better.
Suppose you have a two-factor GLM model. You try to fit the model with terms A B and
A*B and receive an error about "rank deficiency." This indicates that there are not
enough observations per factor level combination. Try removing the interaction term
(A*B).
What is rank deficiency?
Rank deficiency is a condition that can prevent Minitab from performing matrix
calculations. For example, consider the following data set with two predictor variables
and one response variable:
C1 C2 C3
X1 X2 Y
1.5 9.7 15.0
1.4 8.4 14.0
1.6 8.6 16.0
1.7 8.9 17.0
1.7 8.1 14.5
X1 and X2 are the predictor variables and Y is the response variable. The regression
analysis in Minitab uses least squares to calculate the estimated coefficients b0, b1, b2, in
the following linear equation:
Y = b0 + b1X1 + b2X2
The least squares procedure is equivalent to solving the matrix equation

b = (XᵀX)⁻¹XᵀY
where b is a column vector containing the estimated model coefficients, X is a matrix
whose first column is a column of ones (used for estimating the intercept/constant) and
whose remaining columns are the columns of predictor data (X1, X2, ...), and Y is the
column vector of response data. For the previous data set, the matrices are:

Minitab uses the QR decomposition to calculate the estimates of the parameters (b0,
b1, and b2) and the standard deviations of the parameters. The calculation depends on
the eigenvalues of the XᵀX matrix. If some eigenvalues of XᵀX are essentially zero, the
square matrix XᵀX is either singular, or close to being singular, and Minitab will not
be able to do the calculations.
What causes rank deficiency?
Rank deficiency occurs if any X variable columns can be written as a linear combination
of the other X columns. Two examples are shown, using C1, C2, and C3 as predictor
(X) variables:
Example 1
C1 C2 C3
X1 X2 X3
1 2 3
2 3 5
1.5 2.5 4
Example 2
C1 C2 C3
X1 X2 X3
1 2 3
2 4 5
1.5 3 4
In the first example, notice that C1 + C2 = C3.
In the second example, notice that 2*C1 = C2.
If you try to perform regression (or ANOVA) using these predictors, Minitab will remove
terms from the model in order to perform the analysis.
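The linear dependence in Example 1 can be confirmed with a determinant check. Because C3 = C1 + C2, the columns are linearly dependent, so the matrix (which happens to be square here) is singular and XᵀX cannot be inverted. A minimal pure-Python sketch:

```python
# 3x3 determinant by cofactor expansion (no libraries needed)
def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Example 1 above: the third column is the sum of the first two
X = [[1.0, 2.0, 3.0],
     [2.0, 3.0, 5.0],
     [1.5, 2.5, 4.0]]

print(det3(X))  # 0.0 -- singular, so the least squares inverse does not exist
```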
Rank deficiency can also occur with categorical data:
Example 3
C1 C2 C3
Machine Operator Response
1 Joel 15
1 Joel 18
1 Joel 17
2 Bill 14
2 Bill 15
2 Bill 16
In this example, notice that the machine column has the exact same pattern as the
operator column. If you perform ANOVA with this data set, Minitab will remove terms
from the model in order to perform the analysis.
When you perform ANOVA, rank deficiency can also occur for the following reasons:
An interaction term in the model does not have at least one observation for each
combination of the factor levels. For example, A has 3 levels, B has 4 levels, and you
include the A*B interaction in the model without having at least one observation for all
12 combinations of the factor levels.
There is unbalanced nesting.
A continuous variable in the model is not specified as a covariate.
The degrees of freedom for Error are negative.
What are factors, crossed factors, and nested
factors?
What is a factor?
Factors are predictor variables (also called independent variables) which you
choose to systematically vary during an experiment to determine their effect on
the response (dependent) variable.
For example, you want to inspect the surface finish of metal parts coming from
several machines and measured by several operators. Both 'Machine' and
'Operators' are factors in this experiment. 'Machine' and 'Operators' can be
crossed or nested factors, depending on how experimenters collect the data.
What is a crossed factor?
Two factors are crossed when each level of one factor occurs in combination with
each level of the other factor.

For example, if you use crossed factors in your experiment, the same three
operators would inspect surface finish from both machines.
What is a nested factor?
Two factors are nested when the levels of one factor are similar but not identical,
and each occurs in combination with different levels of another factor.

For example, if Machine 1 is in Galveston and Machine 2 is in Baton Rouge,
each machine will have different operators.
What is the difference between fixed and random
factors?
In ANOVA, factors are either fixed or random. Usually, if the investigator controls
the levels of a factor, then the factor is fixed. Conversely, if the investigator
randomly sampled the levels of a factor from a population, then the factor is
random.
Suppose you have a factor called "operator," and it has three levels. If you
intentionally select these three operators and want your results to apply to only
these operators, then the factor is fixed. However, if you randomly sample three
operators from a larger number of operators, and you want your results to apply
to all operators, then the factor is random.
In Minitab, different ANOVA commands can handle different types of factors.
General Linear Model and Balanced ANOVA can analyze both fixed and random
factors. Fully Nested ANOVA requires random factors.
Should I use a restricted or unrestricted mixed
model?
A mixed model is one with both fixed and random factors. There are two forms of
this model: one requires the crossed, mixed terms to sum to zero over subscripts
corresponding to fixed effects (this is called the restricted model), and the other
does not. Many textbooks use the restricted model. Most statistics programs use
the unrestricted model. Minitab fits the unrestricted model by default, but you can
choose to fit the restricted form. The reasons to choose one form instead of the
other are not well defined in the statistical literature.
Your choice of model form does not affect the sums of squares, degrees of
freedom, mean squares, or marginal and cell means. It does affect the expected
mean squares, error terms for F-tests, and the estimated variance components.
Factors and factor levels in ANOVA models
What are factors and factor levels?
Factors are varied during an experiment in order to determine their effect on the
response variable. Factors can only assume a limited number of possible values,
known as factor levels. A factor can be a categorical variable, or it can be based on
a continuous variable that uses only a few controlled values in the experiment.
Example of factor levels
For example, you are studying factors that could affect plastic strength during the
manufacturing process. You decide to include Additive and Temperature in your
experiment. The additive is a categorical variable. It can only be type A or type B.
Conversely, temperature is a continuous variable, but here it is a factor because
only three temperature settings of 100C, 150C, and 200C are tested in the
experiment.
Factor   Additive   Temperature
Level    A          Low (100C)
Level    B          Medium (150C)
Level               High (200C)
Using patterned data to set up factor levels
Minitab's make patterned data capability can be helpful when entering numeric
factor levels. For example, to enter the level values for a three-way crossed
design with a, b, and c (a, b, and c represent numbers) levels of factors A, B, C,
and n observations per cell, complete the Calc > Make Patterned Data > Simple
Set of Numbers dialog box and execute 3 times, one time for each factor, as
shown:
Dialog item Factor A Factor B Factor C
From first value 1 1 1
To last value a b c
Number of times to list each value bcn cn n
Number of times to list the sequence 1 a ab
Design matrix for general linear model (GLM) in
Minitab
General Linear Model uses a regression approach to fit the model that you specify. First
Minitab creates a design matrix, from the factors and covariates, and the model that you
specify. The columns of this matrix are the predictors for the regression.
The design matrix has n rows, where n = number of observations, and one block of
columns, often called indicator variables, for each term in the model. There are as many
columns in a block as there are degrees of freedom for the term. The first block is for
the constant and contains one column, a column of all ones. The block for a covariate
also contains one column, the covariate column itself.
Suppose A is a factor with 4 levels and the model uses -1, 0, +1 coding. Then it has 3
degrees of freedom and its block contains 3 columns, call them A1, A2, A3. Each row is
coded as one of the following:
Level of A A1 A2 A3
1 1 0 0
2 0 1 0
3 0 0 1
4 -1 -1 -1
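The coding rule in the table above is mechanical: level i gets a 1 in column i, and the last level (the reference level) gets −1 in every column. A minimal sketch:

```python
# (-1, 0, +1) effect coding for one factor with n_levels levels,
# producing n_levels - 1 indicator columns as in the table above
def effect_code(level, n_levels):
    if level == n_levels:                    # reference level: all -1
        return [-1] * (n_levels - 1)
    return [1 if j == level - 1 else 0 for j in range(n_levels - 1)]

# Reproduce the 4-level factor A table
for level in range(1, 5):
    print(level, effect_code(level, 4))
```

The nested-factor block that follows applies the same rule within each level of A, padding the other levels' columns with zeros.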
Suppose factor B has 3 levels nested within each level of A. Then its block contains (3 -
1) x 4 = 8 columns, call them B11, B12, B21, B22, B31, B32, B41, B42, coded as
follows:
Level of A  Level of B  B11  B12  B21  B22  B31  B32  B41  B42
1           1            1    0    0    0    0    0    0    0
1           2            0    1    0    0    0    0    0    0
1           3           -1   -1    0    0    0    0    0    0
2           1            0    0    1    0    0    0    0    0
2           2            0    0    0    1    0    0    0    0
2           3            0    0   -1   -1    0    0    0    0
3           1            0    0    0    0    1    0    0    0
3           2            0    0    0    0    0    1    0    0
3           3            0    0    0    0   -1   -1    0    0
4           1            0    0    0    0    0    0    1    0
4           2            0    0    0    0    0    0    0    1
4           3            0    0    0    0    0    0   -1   -1
To calculate the indicator variables for an interaction term, multiply all the corresponding
dummy variables for the factors and/or covariates in the interaction. For example,
suppose factor A has 6 levels, C has 3 levels, D has 4 levels, and Z and W are
covariates. Then the term A * C * D * Z * W * W has 5 x 2 x 3 x 1 x 1 x 1 = 30 indicator
variables. To obtain them, multiply each indicator variable for A by each for C, by each
for D, by the covariates Z one time and W twice.
Coefficients for the reference level in general linear
model (GLM) in Minitab
General Linear Model (GLM) uses a regression approach to fit your model. After GLM
codes the factor levels as indicator variables, it uses them to calculate the coefficients
for all terms. The interpretation of the coefficients depends on whether the indicator
variables use (-1,0,+1) coding or (1,0) coding. With (-1,0,+1) coding, the coefficients
represent the distance between the factor levels and the overall mean. With (1,0)
coding, the coefficients represent the difference between the other factor levels and the
reference level for the factor.
For both types of coding, one of the levels is the reference level. By default, Minitab
does not list the coefficient for the reference level in the coefficients table. Sometimes,
you may want to know the reference level coefficient to understand how the reference
value compares in size and direction to the overall mean.
How to display the coefficient for the reference level
In GLM, the coefficient for the reference level appears in the regression equation. To
display the full set of coefficients in the coefficients table, follow these steps:
1. Choose Stat > ANOVA > General Linear Model > Fit General Linear Model.
2. Click Results.
3. For Coefficients, choose Full set of coefficients.
4. Click OK in each dialog box.
How to calculate the coefficient for the reference level
Suppose you perform a general linear model test with 2 factors. Factor 1 has 3 different
settings (35, 44, and 52). Factor 2 is 2 different times (1 and 2). Minitab uses (-1,0,+1)
coding. The factors and their indicator variables are in the tables that follow:
Factor 1 has 3 levels, so Factor 1 has 2 indicator variables. When the setting is 35, the
first indicator variable is 1 and the second indicator variable is 0. When the setting is 44,
the first indicator variable is 0 and the second indicator variable is 1. When the setting is
52, both indicator variables are -1. The level where the setting is 52 is the reference
level.
Factor 1  Indicator 1  Indicator 2
52        -1           -1
35         1            0
44         0            1
52        -1           -1
44         0            1
35         1            0
For Factor 2, when the time is 1 the indicator variable is also 1. When the time is 2, the
indicator variable is -1. The level where time is 2 is the reference level.
Factor 2  Indicator
1          1
1          1
2         -1
2         -1
1          1
2         -1
You obtain the following table of coefficients:
Coefficients

Term      Coef    SE Coef  T-Value  P-Value  VIF
Constant  68.22   1.28     53.36    0.000
Setting
  35      -27.64  1.81     -15.29   0.000    1.33
  44        4.86  1.81       2.69   0.011    1.33
Time
  1        -0.50  1.28      -0.39   0.698    1.00

The ANOVA model is:

Regression Equation

Thickness = 68.22 - 27.64 Setting_35 + 4.86 Setting_44 + 22.78 Setting_52
            - 0.50 Time_1 + 0.50 Time_2
Notice that the table does not include the coefficients for 52 (Factor 1) or 2 (Factor 2),
which are the reference levels for each factor. However, you can easily calculate these
values by subtracting the overall mean from each level mean. The constant term is the
overall mean.
To view the mean for each level in Minitab, follow these steps:
1. Choose Stat > Basic Statistics > Display Descriptive Statistics.
2. In Variables, enter the response variable.
3. In By variables (optional), enter the factor.
4. Click OK.
Repeat the steps for each factor.
The means for the example data follow:
Overall = 68.22
Setting 35 (Factor 1) = 40.583
Setting 44 (Factor 1) = 73.08
Setting 52 (Factor 1) = 91
Time 1 (Factor 2) = 67.72
Time 2 (Factor 2) = 68.72
The coefficients are calculated as the level mean minus the overall mean. Thus, the
coefficients for each level are:
Setting 35 (Factor 1) = 40.58 - 68.22 = -27.64
Setting 44 (Factor 1) = 73.08 - 68.22 = 4.86
Setting 52 (Factor 1) = 91 - 68.22 = 22.78 (not shown in the coefficients table)
Time 1 (Factor 2) = 67.72 - 68.22 = -0.50
Time 2 (Factor 2) = 68.72 - 68.22 = 0.50 (not shown in the coefficients table)
Note
A quick way to obtain the coefficient for the reference level is to add the level
coefficients for a factor (excluding the intercept) and multiply by -1. For example, the
coefficient for Setting 52 = -1 * [(-27.64) + (4.86)] = 22.78.
If you add a covariate or have unequal sample sizes within each group, coefficients are
based on weighted means for each factor level instead of the arithmetic mean (sum of
the observations divided by n).
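The shortcut in the note is simple enough to check with a few lines of Python, using the Setting coefficients from the table above:

```python
# Coefficients that Minitab lists for the non-reference levels of Setting
setting_coefs = [-27.64, 4.86]   # Setting 35 and Setting 44

# With (-1, 0, +1) coding, the coefficients for a factor sum to zero, so the
# reference-level coefficient is -1 times the sum of the listed coefficients.
ref_coef = -1 * sum(setting_coefs)
print(round(ref_coef, 2))   # prints 22.78, the Setting 52 coefficient
```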
Does the response need to follow a normal
distribution?
ANOVA does not assume that the entire response column follows a normal distribution.
ANOVA assumes that the residuals from the ANOVA model follow a normal distribution.
Because ANOVA assumes the residuals follow a normal distribution, residual analysis
typically accompanies an ANOVA analysis. Plot the residuals, and use other diagnostic
statistics, to determine whether the assumptions of ANOVA are met.
You can evaluate the assumption that the residuals follow a normal distribution from the
response data when the data do not include a covariate. In ANOVA, the entire response
column is typically nonnormal because the different groups in the data have different
means. If the data for each individual group follow a normal distribution, then the data
meet the assumption that the errors follow a normal distribution. This condition is
typically stronger than the condition that the residuals follow a normal distribution. If the
groups contain enough data, you can use normal probability plots and tests for
normality on each group.
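As a sketch, a per-group check might use the Shapiro-Wilk test from SciPy. The group data here are invented for illustration, and with so few observations per group such tests have little power:

```python
from scipy import stats

# Hypothetical measurements for three treatment groups
groups = {
    "group 1": [36, 41, 39, 42, 49],
    "group 2": [40, 48, 39, 45, 44],
    "group 3": [35, 37, 42, 34, 32],
}

for name, values in groups.items():
    stat, p = stats.shapiro(values)
    # A small p-value (for example, p < 0.05) is evidence against normality
    print(f"{name}: W = {stat:.3f}, p = {p:.3f}")
```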
Note
The ANOVA procedure with fixed factors and equal sample sizes works quite well even
when the assumption of normality is violated, unless one or more of the distributions are
highly skewed or the variances are very different.
Is the variance of the response appropriate for power
and sample size calculations for ANOVA?
The variance that is important in an ANOVA is not the variance of the response. The
important variance is the variance of the error term.
If a significant treatment effect is in the model, then the variation of the response is
much higher than the variance of the error term because the treatment explains some of
the variation in the response.
The mean square error is the best estimate of the variance of the error term. The mean
square error from a pilot study is typically useful when you want to perform power and
sample size calculations for an ANOVA.
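With a one-way layout, the mean square error is the pooled within-group sum of squares divided by its degrees of freedom. A sketch with illustrative data:

```python
from statistics import mean

# Illustrative one-way data: three groups of five observations
groups = [
    [36, 41, 39, 42, 49],
    [40, 48, 39, 45, 44],
    [35, 37, 42, 34, 32],
]

# Pooled within-group (error) sum of squares
sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
df_error = sum(len(g) - 1 for g in groups)   # 15 - 3 = 12

mse = sse / df_error   # the error-variance estimate to use in power calculations
print(round(mse, 2))   # prints 17.17
```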
Adding a covariate to general linear model (GLM)
When assessing data using a general linear model, adding covariates can greatly
improve the accuracy of the model and may significantly affect the final analysis results.
In a general linear model, a covariate is any continuous predictor, which might or might
not be controllable.
Example of adding a covariate to a general linear model
A textile company uses three different machines to manufacture monofilament fibers.
They want to determine whether the breaking strength of the fiber differs based on
which machine is used. They collect data on the strength and diameter for 5 randomly
selected fibers from each machine. Because fiber strength is related to its diameter,
they also record the fiber diameter for use as a possible covariate.
C1 C2 C3
Machine Diameter Strength
1 20 36
1 25 41
1 24 39
1 25 42
1 32 49
2 22 40
2 28 48
2 22 39
2 30 45
2 28 44
3 21 35
3 23 37
3 26 42
3 21 34
3 15 32
1. Verify that the covariate and response are linearly related. You can do this in Minitab by
analyzing the data with a fitted line plot.
1. Choose Stat > Regression > Fitted Line Plot.
2. In Response (Y) enter Strength.
3. In Predictor (X) enter Diameter.
4. Assess how closely the data fall along the fitted line and how close R² is to a
"perfect fit" (100%).
The fitted line plot indicates a strong linear relationship (R² = 87.2%) between diameter and
strength.
2. Perform the GLM analysis with the covariate.
1. Choose Stat > ANOVA > General Linear Model > Fit General Linear Model.
2. In Responses, enter Strength.
3. In Factors, enter Machine.
4. In Covariates, enter Diameter.
5. Click OK.
For the fiber production data, Minitab displays the following results:
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Diameter 1 178.014 178.014 69.97 0.000
Machine 2 13.284 6.642 2.61 0.118
Error 11 27.986 2.544
Lack-of-Fit 7 18.486 2.641 1.11 0.487
Pure Error 4 9.500 2.375
Total 14 346.400
The F-statistic for Machine is 2.61 and the p-value is 0.118. Because the p-value is greater
than 0.05, you fail to reject, at the 5% significance level, the null hypothesis that the fiber
strengths do not differ based on the machine used. You can assume that the fiber strengths
are the same on all the machines. Notice that the F-statistic for Diameter (the covariate) is
69.97 with a p-value of 0.000. This indicates that the covariate effect is significant. That is,
diameter has a statistically significant impact on the fiber strength.
Now, suppose you rerun the analysis and omit the covariate. This will result in the following
output:
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Machine 2 140.4 70.20 4.09 0.044
Error 12 206.0 17.17
Total 14 346.4
Notice that the F-statistic is 4.09 with a p-value of 0.044. Without the covariate in the model,
you reject the null hypothesis at the 5% significance level and conclude the fiber strengths do
differ based on which machine is used.
This conclusion is the complete opposite of the conclusion from the analysis that included
the covariate. This example shows how failing to include a covariate can produce
misleading analysis results.
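As a check, the adjusted F-test for Machine can be reproduced from the data above with plain least squares. This is a from-scratch NumPy sketch of the extra-sum-of-squares calculation, not Minitab's implementation:

```python
import numpy as np

# Fiber data from the example: machine, diameter (covariate), and strength
machine  = np.repeat([1, 2, 3], 5)
diameter = np.array([20, 25, 24, 25, 32, 22, 28, 22, 30, 28, 21, 23, 26, 21, 15], float)
strength = np.array([36, 41, 39, 42, 49, 40, 48, 39, 45, 44, 35, 37, 42, 34, 32], float)

def sse(X, y):
    """Residual sum of squares from a least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

n = len(strength)
# (-1, 0, +1) coding for the 3-level Machine factor; level 3 is the reference
m1 = np.where(machine == 1, 1.0, np.where(machine == 3, -1.0, 0.0))
m2 = np.where(machine == 2, 1.0, np.where(machine == 3, -1.0, 0.0))

full    = np.column_stack([np.ones(n), diameter, m1, m2])  # Diameter + Machine
reduced = np.column_stack([np.ones(n), diameter])          # Diameter only

# Adjusted F-test for Machine: extra sum of squares over the error mean square
df_machine, df_error = 2, n - full.shape[1]                # 2 and 11
f_stat = ((sse(reduced, strength) - sse(full, strength)) / df_machine) / (
    sse(full, strength) / df_error)
print(round(f_stat, 2))   # prints 2.61, matching the F-value for Machine above
```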