LINEAR REGRESSION METHODS for FOREST RESEARCH

U.S. DEPARTMENT OF AGRICULTURE
FOREST SERVICE
FOREST PRODUCTS LABORATORY
MADISON, WIS.
SUMMARY
This Research Paper discusses the methods of linear regression analysis that have
been found most useful in forest research. Among the topics treated are the fitting and
testing of linear models, weighted regression, confidence limits, covariance analysis,
and discriminant functions.
The discussions are kept at a fairly elementary level and the various methods are
illustrated by presenting typical numerical examples and their solution. The logical
basis of regression analysis is also presented to a limited extent.
ACKNOWLEDGMENTS
Appreciation is extended to Professor George W. Snedecor and the Iowa State University Press, Ames, Iowa, for their permission to reprint from their book Statistical Methods (ed. 5) the material in tables 1 and 8 of Appendix E of this Research Paper.
We are also indebted to the Literary Executor of the late Professor Sir Ronald A. Fisher, F.R.S., Cambridge, to Dr. Frank Yates, F.R.S., Rothamsted, and to Messrs. Oliver and Boyd Ltd., Edinburgh, Scotland, for their permission to reprint Table No. III from their book Statistical Tables for Biological, Agricultural, and Medical Research (Table 7 of Appendix E of this Research Paper); also Table 10.5.3 from Snedecor's Statistical Methods (ed. 5), shown as Table 6 in Appendix E of this Research Paper.
CONTENTS

INTRODUCTION
    Fitting a Regression
    Confidence Limits and Tests of Significance
    Interpreting a Fitted Regression
THE MATHEMATICAL MODEL
FITTING THE LINEAR MODEL
SOME ELEMENTS OF MATRIX ALGEBRA
ANALYSIS OF VARIANCE
    A General Test Procedure
    Degrees of Freedom
    Problem IX--Test of the Hypothesis that β1 + β2 = 1
    Problem X--Test of the Hypothesis that β2 = 0
    Problem XI--Working With Corrected Sums of Squares and Products
    Problem XII--Test of the Hypothesis that β2 = β3 = 0
    Problem XIII--Test of the Hypothesis that β1 + 2β2 = 0
    Problem XIV--Hypothesis Testing in a Weighted Regression
    An Alternate Way to Compute the Gain Due to a Set of X Variables
THE t-TEST
    Problem XV--Test of a Non-Zero Hypothesis
CONFIDENCE LIMITS
    General
    Confidence Limits on Ŷ
    Problem XVI--Confidence Limits in Multiple Regression
    Problem XVII--Confidence Limits on a Simple Linear Regression
    Confidence Limits on Individual Values of Y
COVARIANCE ANALYSIS
    Problem XVIII--Covariance Analysis
    Covariance Analysis with Dummy Variables
    Problem XIX--Covariance Analysis with Dummy Variables
DISCRIMINANT FUNCTION
    Use and Interpretation of the Discriminant Function
    Testing a Fitted Discriminant
    Testing the Contribution of Individual Variables or Sets of Variables
    Reliability of Classifications
    Reducing the Probability of a Misclassification
    Basic Assumptions
ELECTRONIC COMPUTERS
CORRELATION COEFFICIENTS
    General
    The Simple Correlation Coefficient
    Partial Correlation Coefficients
    The Coefficient of Determination
    Tests of Significance
SELECTED REFERENCES
APPENDIX E--TABLES
    Table 6--The F-Distribution
    Table 7--The t-Distribution
    Table 8--The Cumulative Normal Distribution
INTRODUCTION
Underlined numbers in parentheses refer to Selected References at the end of this report.
εi = the difference between the Y value of the i-th unit and the population mean (Yi - μy). This is sometimes called a deviation or error.
A measure of how widely the individual values are spread around the mean is
known as the variance, and the square root of the variance is called the standard
deviation. For the population, the variance is the average squared deviation.
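As an illustration only (not part of the original report), a short Python sketch of these definitions follows; the Y values used here are hypothetical.

    # Population mean, variance (average squared deviation), and standard deviation.
    y = [12.0, 15.0, 9.0, 14.0, 10.0]        # hypothetical Y values for one population
    mean_y = sum(y) / len(y)
    variance = sum((yi - mean_y) ** 2 for yi in y) / len(y)   # average squared deviation
    std_dev = variance ** 0.5
    print(mean_y, variance, std_dev)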
Now, think in terms of a series of such populations, each with its own mean and
variance. Often there will be some other characteristic (X) that has the same value
for all units within a given population but varies among populations. It also happens
at times that there is some sort of functional relationship between the mean Y values
for the populations and the associated X values. Graphically, such a relationship
might appear as shown in figure 1.
Figure 1.--Y and X values for four populations.
The line showing the relationship between mean Y and X is called a regression line,
and its mathematical expression is called a regression function. If the relationship
between the mean value of Y (μy) and the value of X is a straight line, we could write

μy = a + bX

where: X = the value of X for the population having a mean Y value of μy
a, b = constants, indicating the level and the slope of the straight line.
Thus, a regression can be thought of as a form of average that changes with changes in the value of the X variable--a moving average. One of the aims in regression analysis is to find an equation representing this relationship. In this relationship, Y is usually called the dependent variable and X the independent variable. This does not mean, however, that there has to be a cause and effect relationship; it only indicates that the Y values are associated with the X values in some manner that can be described approximately by some mathematical equation. Knowing the value of X gives us some information about the value of Y. The person concerned with the regression makes his own inferences as to what is implied by the indicated relationship.
The equation

μy = a + bX

specifies the relationship between the mean values of Y and the level of X. To indicate that the individual values of Y vary about the mean, we might write

Yi = μy + εi

Or, since μy varies linearly with the level of X,

Yi = a + bXi + εi

In other words, this says that the Y value of any individual unit is due to the regression of mean Y on X plus a deviation (εi) from the mean.
If the spread (as measured by the variance) of the Y values about their mean (μy)
is the same for all of the populations, regardless of the value of the associated X
variable, the variance is said to be homogeneous. If the variance is not the same for
all populations, it is said to be heterogeneous.
Frequently, the populations can be characterized by more than one X variable (for example, X1, X2, and X3) and it may happen that the mean (μy) associated with each combination of values of these variables is functionally related to these values. Thus, we might have the regression equation
μy = β0 + β1X1 + β2X2 + β3X3

where: β0, β1, β2, and β3 are constants (usually called regression coefficients)
X1, X2, and X3 are the numerical values of three associated characteristics.

This equation merely says that if we specify values for X1, X2, and X3 then we would, on the average, expect the characteristic labeled Y to have the value (μy) given by the equation. The relationship of μy to the independent variables is sometimes spoken of as a regression surface or a response surface, even though a direct geometric analogy breaks down beyond two X variables.
Since μy represents a mean value, some individual values of Y will be above it and some below it. For an individual unit we would write

Yi = μy + εi

or

Yi = β0 + β1X1 + β2X2 + β3X3 + εi
And again, if the spread of the Y values about their mean is the same for all
points on the regression surface (that is, at all combinations of the independent
variables), the variance is said to be homogeneous. If the spread of Y values is not
the same at all points, the variance is heterogeneous.
In this introduction to the idea of a regression, we have talked as though there were
a number of separate populations--one for each value of X or one for each possible
combination of values for several different Xs. It is also possible (and more common)
to think in terms of a single population of units, each unit being characterized by a
Y value and one or more X values. There is a regression surface representing the
relationship of the Y value to the associated X values, but the Y values are not all
right on the surface: some of them are above it and some below. A given point on the
surface represents the mean Y value of all the units having the same X values as
those associated with that point. The spread of Y values above and below the surface
may be the same for all points (homogeneous variance) or it may differ from point to
point (heterogeneous variance).
Fitting a Regression
If there is a relationship between y and the independent variables (X1, X2, etc.),
it may be very desirable to know what the relationship is. To illustrate, it might be
used in predicting the value of Y that would, on the average, be associated with any
(3) Y = β0 + β1X1 + β2X2 + β3X3

(4) Y = β0 + β1X1 + β2(1/X1)
Note that the model can be linear in the coefficients even though it is nonlinear as far as the variables (Y and X) are concerned.

An equation in which the coefficients are raised to other than the first power, appear as exponents, or are combined other than by addition or subtraction is said to be nonlinear in the coefficients. The following are examples:
(1) Y = a + b^X

(2) Y = aX^b

(3) Y = a(X - b)^c
In some cases models that are nonlinear in the coefficients can be put into a linear form by a transformation of the variables. Thus, the second equation above could be converted to a linear form by taking the logarithm of both sides, giving

log Y = log a + b(log X)

or

Y' = a' + bX'

where: Y' = log Y
X' = log X
a' = log a
This Research Paper will be confined to the fitting and testing of linear models.
The fitting of nonlinear models requires more mathematical skill than is assumed
here. While this will be an inconvenient restriction at times, it will often be found
that a linear model provides a very good approximation to the nonlinear relationship.
Having selected a mathematical model, we should next examine the variability of
the Y values about the regression surface. Two aspects of this variability are of
interest:
(1) Is it the same or nearly so at all points of the regression surface
(homogeneous variance), or does it vary (heterogeneous variance)? In the latter case,
we would also like to know how the variance changes with changes in the independent
variables.
(2) What is the form of the distribution of individual Y values about the
regression surface? In many populations, the values will follow the familiar Normal
or Gaussian Distribution.
The answer to the first question affects the method of estimating the regression
coefficients. If the variance is homogeneous, we can use equal weighting of all
observations. If the variance is not homogeneous, it will be more efficient (and more
work) to use unequal weighting. The answer to the second question is needed if tests
of significance are to be made and confidence limits obtained for the estimated
regression coefficients or for functions of these coefficients.
These questions are not easily answered, and the less fastidious users of regression
tend to bypass them by making a number of assumptions. Given sufficient critical
familiarity with a population, the assumptions as to homogeneity of variance and form
of distribution may be quite valid. Without this familiarity, special studies may have
to be made to obtain the necessary information.
Confidence Limits and Tests of Significance
When the mean of a population is estimated from a sample of the units of that
population, it is well known that this sample estimate is subject to variation. Its
value will depend on which units were, by chance, included in the sample. Such an
estimate would be worthless without some means of determining how far it might be
from the true value. Fortunately, the statisticians have shown that the variability of
the individual units in a sample can be used in obtaining an indication of the variability
of the estimated mean. This in turn enables us to test some hypothesis about the
value of the true mean or to determine confidence limits that have a specified
probability of including the true mean.
In fitting a regression, the estimated regression coefficients are also subject to
sampling variation. Again it is important to have a method of testing various
hypotheses about the coefficients and of determining some limits within which the
true coefficients or the true regression may be found. This would include testing
whether or not any or all of the coefficients could actually be zero, which would imply
no association between Y and a particular X or set of X variables. The statisticians
have provided the means of doing this. The procedures that have been devised for
testing various hypotheses about the regression coefficients or for setting confidence
limits on the regression estimates will be discussed following the description of the
fitting techniques.
Interpreting a Fitted Regression
Deriving the meaning of a fitted equation is one of the very difficult and dangerous
phases. Here there are no strict rules to follow. On the assumption that it has not
been copyrighted, 'THINK' is suggested as the guiding principle.
In searching for the meaning of a regression, the fact that it is man-made should
never be overlooked. It is an attempt to describe some phenomenon that may be
controlled by very complex biological, physical, or economic laws. It may, at times,
be an excellent description, but it is not a law in itself; only a mathematical
approximation.
Not only is the fitted regression an artificial description of a relationship, but it
is also a description that may not be reliable beyond the range of the sample
observations used in fitting the regression. For example, a straight line may be a
very good approximation of the relationship between two variables over the range of
the sample data, but this does not mean that outside of this range the relationship is
Xi = an independent variable
βi = the regression coefficient for Xi (to be estimated in fitting).
This does not mean that we will only be able to fit straight lines or flat surfaces. For
example, the general equation for a second-degree parabola is
Y = a + bX + cX²

Graphically, this might look roughly like one of the curves shown in figure 2. To fit this curve with the general linear model, we merely let X1 = X and X2 = X², and then fit the model

Y = β0 + β1X1 + β2X2
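As an illustration (not part of the original report), the following Python sketch fits a second-degree parabola by treating X and X² as two independent variables; the data values are hypothetical.

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])        # hypothetical X values
    Y = np.array([2.1, 3.9, 7.2, 11.8, 18.1, 26.2])     # hypothetical Y values

    design = np.column_stack([np.ones_like(X), X, X**2])  # columns for b0, b1, b2
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)     # least squares solution
    b0, b1, b2 = coef
    print(b0, b1, b2)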
As another example, we might want to fit a hyperbolic function with the general equation

Y = a + b(1/X)

The form of this curve is illustrated in figure 3. This function can be fitted with the general linear model by letting X1 = 1/X.

A curve of the form Y = AB^X can be fitted by a linear model if we take the logarithm of both sides, giving

log Y = log A + X(log B)

This is the same as the linear model

Y' = β0 + β1X

where we let Y' = log Y, β0 = log A, and β1 = log B.

As noted earlier, when we speak of a linear model we are referring to a model that is linear in the coefficients. The above examples show that a linear model may be nonlinear as far as the variables are concerned.
C = k/g

in which k = a constant.

(4) This reasoning then led to the model³

S = a + dT + kbg

³Stage, Albert R. Specific gravity and tree weight of single-tree samples of grand fir. U.S. Forest Serv. Res. Paper INT-4, 11 pp., Intermountain Forest and Range Expt. Sta., Ogden, Utah. 1963.
S = β0 + β1X1 + β2X2

where: X2 = ...
In many cases, our knowledge of the subject will be less specific, but the same line of development must still be followed. If we were studying the relationship between Y and two independent variables (X1 and X2), we might have an idea that the relationship between Y and one of the variables (say X1) could be represented by a straight line (fig. 5).

Now, to work X2 into the model, we have to consider how changes in X2 might affect the relationship of Y to X1. Equal increments in X2 might result in a series of equally spaced parallel straight lines for the relationship of Y to X1 (fig. 6). This suggests that in the equation Y = a + bX1 the slope (b) remains unchanged, but the value of the Y intercept is a linear function of X2 (that is, a = a' + b'X2). Then substituting for (a) in the relationship between Y and X1, we have

Y = a' + b'X2 + bX1
which is the same as the model

Y = β0 + β1X1 + β2X2

It may happen instead that the slope of the line changes linearly with the level of X2 (that is, b = a' + b'X2) while the intercept remains the same. Substituting for (b) then gives

Y = a + a'X1 + b'X1X2

which is the same as the model

Y = β0 + β1X1 + β2X'2

where: X'2 = X1X2
In cases such as this, we say that there is an interaction between the effects of X1 and X2, and the variable X'2 = X1X2 is called an interaction term. It implies that the effect that one variable has on changes in Y depends on (interacts with) the level of the other variable.
Most likely, if the slope changes, the Y intercept will also change. If both of these changes are thought to be linear, then

a = a' + b'X2
b = a'' + b''X2

and the model becomes

Y = a' + b'X2 + a''X1 + b''X1X2

or

Y = β0 + β1X1 + β2X2 + β3X3

where: X3 = X1X2
If absolutely nothing is known about the form of relationship, then the selection of a model gets to be a rather loose and frustrating process. There are no good rules to follow. Plotting the data will sometimes suggest the appropriate model to fit. For a model with two independent variables (say X1 and X2), we can plot Y over X1, using different symbols to represent different levels of X2. As an example, a set of hypothetical observations has been plotted in figure 7. The different lines represent the relationship of Y to X1 at the various levels of X2. The relationship of Y to X1 seems to be linear (Y = a + bX1), and it appears that the Y intercept and the slope of the line increase linearly with X2. This suggests fitting the model

Y = β0 + β1X1 + β2X2 + β3X3

where: X3 = X1X2.

Of course, real data will rarely behave so nicely.
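The following Python sketch (not from the original report, with hypothetical data) shows how such a model, including the interaction term X3 = X1X2, can be fitted by least squares.

    import numpy as np

    X1 = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
    X2 = np.array([1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0])
    Y  = np.array([3.1, 5.0, 7.2, 9.1, 4.9, 8.1, 11.0, 13.9])

    X3 = X1 * X2                                   # interaction term
    design = np.column_stack([np.ones_like(X1), X1, X2, X3])
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    print(coef)                                    # estimates of b0, b1, b2, b3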
If there are more than two independent variables, the graphics may not be much
more illuminating than the basic data tabulation. The usual procedure is to plot the
single variables in the hope of spotting some overall trend, and then perhaps to plot
pairs of independent variables (as above) to reveal some of the two-variable interactions. There are other graphical manipulations that can be tried, but the probability
of success is seldom high.
With the advent of electronic computers, much of the computational drudgery has
been removed from the fitting of a regression. This has led to what might be called a
'shotgun' technique of fitting. A guess is made as to the most likely form for each
variable, a number of interaction terms are introduced, usually up to the capacity of
the program, and then a machine run is made. This may consist of fitting all possible
linear combinations of a set of independent variables (as discussed in the section on
Electronic Computers) or may employ a stepwise fitting technique (Appendix A, Method III). From the output of the machine, the variables that seem best are selected.
The analysis may end here or further trials may be made using new variables or new
forms of the variables tried in the first run. Statistically, this technique has some
flaws. Nonetheless, it is useful when little or nothing is known about the nature of the
relationships involved. But, it should be recognized and used strictly as an exploratory
procedure.
FITTING THE LINEAR MODEL
Suppose that the model to be fitted is the general linear model

Yi = β0 + β1X1i + β2X2i + ... + βkXki + εi

where: Yi = the value of Y on the i-th unit in the sample (i = 1, 2, ..., n)
Xji = the value of the j-th independent variable (j = 1, 2, ..., k) on the i-th sample unit
βj = the regression coefficient for the j-th independent variable
εi = the deviation of the i-th unit from its population mean (= Yi - β0 - β1X1i - β2X2i - ... - βkXki).

We do not, of course, know the values of the coefficients, but must estimate them from the sample data. The principle of least squares says that under certain conditions, the best estimates of the coefficients are those that make the sum of squared deviations a minimum.

Now, for the i-th unit the deviation from the fitted regression is (Yi - β̂0 - β̂1X1i - ... - β̂kXki), and minimizing the sum of squares of these deviations leads to a set of simultaneous equations in the estimates β̂0, β̂1, ..., β̂k.
These are known as the least squares normal equations (LSNE), and the solutions β̂0, β̂1, ..., β̂k are called the least squares estimates of the regression coefficients. The first equation is called the β0 equation, the second the β1 equation, etc.
For those who are familiar with differential calculus, it can be mentioned that the βj equation is obtained by taking the derivative of the sum of squared deviations with respect to β̂j and setting it equal to zero, the familiar procedure for finding the value of a variable for which a function is a maximum or minimum. Setting the derivative equal to zero and moving the term with no β̂ coefficient to the right-hand side gives the βj equation.
But, writing the normal equations for a particular linear model does not require a knowledge of calculus. Merely use the set of equations given above as a general set and select those needed to solve for the coefficients in the model to be fitted, eliminating unwanted coefficients from the selected equations. Thus, for the model

Y = β0 + β1X1 + β6X6

the equations, with unwanted coefficients eliminated, would be the β0, β1, and β6 equations.
For the model

Y = β1X1 + β2X2

only the β1 and β2 equations would be needed.
When the model contains a constant term (β0), it is possible to simplify the normal equations and their solution. The simplification arises from the fact that the solution of the normal equations will give as the estimate of β0,

β̂0 = Ȳ - β̂1X̄1 - β̂2X̄2 - ... - β̂kX̄k

where: Ȳ, X̄1, etc. = the sample means of Y, X1, etc.

Using this value, we can rewrite the model as

y = β1x1 + β2x2 + ... + βkxk + ε

where: y = Y - Ȳ, x1 = X1 - X̄1, etc.

The normal equations can then be written in terms of the corrected sums of squares and products. Some details of the analysis of variance depend on which fitting procedure is used, as will be noted later on.
Problem I - Multiple Linear Regression With a Constant Term
A number of units (n = 13) were selected at random from a population. On each unit, measurements were made of a Y variable and three independent variables (X1, X2, and X3). The model to be fitted is of the form

Y = β0 + β1X1 + β2X2 + β3X3 + ε

Since the model contains a constant term, it will be simpler to work with the corrected sums of squares and products. For this method, the normal equations are the β1, β2, and β3 equations written in terms of the corrected sums of squares and products.
and similarly,
In computing the sums of squares and products, it should be noted that a sum of
products may be either positive or negative, but a sum of squares must always be
positive; a negative sum of squares indicates a computational error.
Substituting the sums of squares and products into the normal equations, we have
The solution
Equation
or,
This presents no additional problems. Since the model contains no constant term, we
will have to work with uncorrected sums of squares and products. The normal
equations to be solved are:
Coefficient
Equation
or
= 3.1793 and
is just a simple case of the general methods used in Problems I and II. Since the
model has a constant term, we can work with the corrected sums of squares and
products. This results in the single normal equation
Coefficient
Equation
The solution is
Equation
As the model contains a constant term, the normal equations can be written
Coefficient
Equation
variables. Suppose, for example, that we were going to fit the model

Y = β0 + β1X1 + β2X2

to the data of Problem I, but we wished to impose the restriction that

β1 + β2 = 1.

This is equivalent to

β2 = 1 - β1

and writing this into the original model gives

Y = β0 + β1X1 + (1 - β1)X2

or

(Y - X2) = β0 + β1(X1 - X2).

This is the same as the simple model

Y' = β0 + β1X'

where: Y' = Y - X2 and X' = X1 - X2.
There are two ways of getting the sums of squares and products of the revised variables; one is to compute revised values for each observation. For each of the 13 observations we would compute Y' = Y - X2 and X' = X1 - X2. The sums of the revised values are ΣY' = 143 and ΣX' = 26, and the means are Ȳ' = 11 and X̄' = 2.
Often it will be easier to work directly with the original values, thus:

Σx'² = ΣX'² - (ΣX')²/n
     = Σ(X1 - X2)² - [Σ(X1 - X2)]²/n
     = (ΣX1² - 2ΣX1X2 + ΣX2²) - [(ΣX1)² - 2(ΣX1)(ΣX2) + (ΣX2)²]/n

Then, using the values that have already been computed for the original variables,

Σx'² = 1,235 - 2(847) + 809 - [(117)² - 2(117)(91) + (91)²]/13 = 298, as before.

Similarly,

Σx'y' = ΣX'Y' - (ΣX')(ΣY')/n
      = Σ(X1 - X2)(Y - X2) - [Σ(X1 - X2)][Σ(Y - X2)]/n
      = (ΣX1Y - ΣX2Y - ΣX1X2 + ΣX2²) - [(ΣX1)(ΣY) - (ΣX2)(ΣY) - (ΣX1)(ΣX2) + (ΣX2)²]/n
      = 2,554 - 1,382 - 847 + 809 - [(117)(234) - (91)(234) - (117)(91) + (91)²]/13
      = 848, as before.

The single normal equation is then

298 β̂1 = 848

giving

β̂1 = 2.8456

and

β̂0 = 11 - (2.8456)(2) = 5.3088.

The fitted equation is therefore

(Y - X2) = 5.3088 + 2.8456(X1 - X2)

or

Ŷ = 5.3088 + 2.8456X1 - 1.8456X2.
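A short Python sketch (not part of the original report) verifies this arithmetic from the sums given in the text for Problem I; the only inputs are the sums quoted above.

    n = 13
    SX1, SX2, SY = 117.0, 91.0, 234.0
    SX1X1, SX1X2, SX2X2 = 1235.0, 847.0, 809.0
    SX1Y, SX2Y = 2554.0, 1382.0

    # corrected sum of squares and products of X' = X1 - X2 and Y' = Y - X2
    sxx = (SX1X1 - 2*SX1X2 + SX2X2) - (SX1 - SX2)**2 / n
    sxy = (SX1Y - SX2Y - SX1X2 + SX2X2) - (SX1 - SX2)*(SY - SX2) / n

    b1 = sxy / sxx                                # estimate of beta1
    b0 = (SY - SX2)/n - b1 * (SX1 - SX2)/n        # estimate of beta0
    print(sxx, sxy, b1, b0)                       # 298, 848, about 2.8456 and 5.3088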
It should be noted that fitting a regression by the least squares principle does not
require that the Y values be normally distributed about the regression surface.
However, the commonly used procedures for computing confidence limits and making
tests of significance (t and F tests) do assume normality.
The regression fitting procedures that have been described will give unbiased estimates of the regression coefficients, whether the variance is homogeneous or not. However, if the variance is not homogeneous, a weighted regression procedure may give more precise estimates of the coefficients. In a weighted regression, each squared deviation is assigned a weight (wi), and the regression coefficients are estimated so as to minimize the weighted sum of squared deviations. That is, values are found for the β̂j's so as to minimize

Σwi(Yi - β̂0 - β̂1X1i - ... - β̂kXki)².
The weights are usually made inversely proportional to the known (or assumed)
variance of Y about the regression surface. To understand the reasoning behind this,
refer to figure 8, in which a hypothetical regression of Y on X has been plotted along
with a number of individual unit values.
It is obvious that the variance of Y about the regression line is not homogeneous:
it is larger for large values of X than for small values. It is also fairly obvious that
a single observation from the lower end of the line tells much more about the location
of the line than does a single observation from the upper end. That is, units that are
likely to vary less from the line (small variance) give more information about the
location of the line than do the units that are subject to large variation. It stands to
reason that in fitting this regression the units with small variance should be given
more weight than the units with large variance. This can be accomplished by assigning weights that are inversely proportional to the variance. Thus, if the variance is known to be proportional to the value of one of the X variables (say Xj), then the weight could be

wi = 1/Xji.
Suppose that the model to be fitted is

Y = β0 + β1X1

with the weights wi = 1/X1i. For these weights, ΣwiX1i² = ΣX1i = 117 and ΣwiX1iYi = ΣYi = 234. The weighted normal equations are

(Σwi) β̂0 + (ΣwiX1i) β̂1 = ΣwiYi
(ΣwiX1i) β̂0 + (ΣwiX1i²) β̂1 = ΣwiX1iYi

or

1.9129 β̂0 + 13 β̂1 = 24.631
13 β̂0 + 117 β̂1 = 234.

The solution gives the fitted regression

Ŷ = 2.3247X1 - 2.9224.
... the ratio of Y to X (R = Y/X). This is equivalent to fitting the regression model

Y = β1X1.

The appropriate estimate of β1 will depend on how the variance of Y changes with the level of X1. Three situations will be considered: (1) the variance of Y is proportional to X1, (2) the variance of Y is proportional to X1², and (3) the variance is homogeneous.
(1) Variance of Y proportional to X1. In this case, we would fit a weighted regression using the weights

wi = 1/X1i.

The normal equation for β̂1 would be

(ΣwiX1i²) β̂1 = ΣwiX1iYi

so that

β̂1 = ΣwiX1iYi / ΣwiX1i².

However, wi = 1/X1i, so that ΣwiX1iYi = ΣYi and ΣwiX1i² = ΣX1i. Hence,

β̂1 = ΣYi / ΣX1i = nȲ / nX̄1 = Ȳ / X̄1.

(2) Variance of Y proportional to X1². But, if wi = 1/X1i², then ΣwiX1iYi = Σ(Yi/X1i) and ΣwiX1i² = n. Hence,

β̂1 = Σ(Yi/X1i) / n.

Thus, if the variance of Y is proportional to X1², the ratio of Y to X1 is estimated by computing the ratio of Y to X1 for each unit and then taking the average of these ratios. In sampling, this is called the mean-of-ratios estimator.

(3) Variance of Y homogeneous. With equal weighting, the normal equation is

(ΣX1i²) β̂1 = ΣX1iYi

or

β̂1 = ΣX1iYi / ΣX1i².

Using the sums computed earlier (ΣY = 234, ΣX1 = 117, Σ(Y/X1) = 24.631, ΣX1Y = 2,554, ΣX1² = 1,235, n = 13), the three estimates of β1 are

(1)  ΣY/ΣX1 = 234/117 = 2.0000
(2)  Σ(Y/X1)/n = 24.631/13 = 1.8947
(3)  ΣX1Y/ΣX1² = 2,554/1,235 = 2.0680
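The Python sketch below (not part of the original report, hypothetical data) computes the same three estimators from a set of paired observations.

    x = [2.0, 4.0, 5.0, 7.0, 10.0]     # hypothetical X1 values
    y = [4.1, 7.6, 10.4, 14.2, 19.8]   # hypothetical Y values
    n = len(x)

    ratio_of_means = sum(y) / sum(x)                          # variance proportional to X1
    mean_of_ratios = sum(yi/xi for xi, yi in zip(x, y)) / n   # variance proportional to X1**2
    least_squares  = sum(xi*yi for xi, yi in zip(x, y)) / sum(xi*xi for xi in x)  # homogeneous

    print(ratio_of_means, mean_of_ratios, least_squares)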
... + βkXk, where: ...
Transformations
Fitting a regression in the presence of heterogeneous variance may be a lot of work.
Special study of the variance is often required to select the proper weighting procedure, and the computations involved in a weighted fitting can be quite laborious.
To avoid the computations of a weighted regression, some workers resort to a
transformation of the variables. The hope is that the transformation will largely
eliminate the heterogeneity of variance, thus permitting the use of equal weighting
procedures. The most common transformations are log Y, arc sin √Y (used where Y is a percentage), and √Y (frequently used if Y is a count rather than a measured variable).
This may be perfectly valid if the transformation does actually induce homogeneity.
But, there is some tendency to use transformations without really knowing what
happens to the variance. Also, it should be remembered that the use of a transformation may also change the implied relationship between Y and the X variables. Thus,
if we fit
log Y = β0 + β1X1

we are implying that the relationship of Y to X1 is of the form

Y = ab^X1

Fitting

√Y = β0 + β1X1

implies the quadratic relationship

Y = a + bX1 + cX1²
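To illustrate the first of these points, the following Python sketch (not from the original report, hypothetical data) fits log Y on X by ordinary least squares and then expresses the implied multiplicative form Y = a·b^X.

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([2.0, 3.9, 8.1, 16.2, 31.8])     # roughly doubles with each unit of X

    logY = np.log10(Y)
    design = np.column_stack([np.ones_like(X), X])
    (b0, b1), *_ = np.linalg.lstsq(design, logY, rcond=None)
    a, b = 10**b0, 10**b1
    print(a, b)        # implied relationship Y = a * b**X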
SOME ELEMENTS OF MATRIX ALGEBRA
It is not necessary to know anything about matrix algebra (as such) in order to make
a regression analysis. If you can compute and use the c-multipliers as discussed in
the sections dealing with the t-test and confidence limits, then you have the essentials.
However, an elementary knowledge of matrix algebra is very helpful in understanding certain procedures and terms that are used in regression work.
Definitions and Terminology
A matrix is simply a rectangular array of numbers (or letters). The array is
usually enclosed in brackets. The dimensions of a matrix are specified by the number
of rows and columns (in that order) that it contains. Thus in the matrices,
This is an m by n or (m x n) matrix.
A square matrix is one in which the number of rows equals the number of columns. In a square matrix, the elements along the line from the upper left corner to the lower right corner constitute the diagonal of the matrix (that is, a11, a22, ..., ann).

If the elements above the diagonal of a square matrix are a mirror image of those below the diagonal (that is, aij = aji for all values of i and j), the matrix is said to be symmetrical. Some examples of symmetrical matrices are ...

A square matrix in which every element of the diagonal is a one and every other element a zero is called the identity matrix and is usually symbolized by the letter I. The last matrix above is an identity matrix.
Two matrices are equal if they have the same dimensions and if all corresponding
elements are equal. Thus,
only if
The transpose of a matrix is formed by 'rotating' the matrix so that the rows become the columns and the columns become the rows; the transpose of the row matrix [3  2], for example, is the corresponding (2 x 1) column matrix.
Note that the sum (or difference) matrix has the same dimensions as the matrices
that were added (or subtracted).
Matrix Multiplication
Two matrices can be multiplied only if the number of columns of the first matrix is
equal to the number of rows of the second. If A is a (4 x 3) matrix, B is a (3 x 2),
and C is a (2 x 3), then the multiplications AB, BC, and CB are possible, while the
multiplications AC, BA and CA are not possible.
If the first matrix is (r x s) and the second is (s x m), the dimensions of the product matrix will be (r x m). In words, the above rule states that the element in the i-th row and the j-th column of the product matrix is obtained as the sum of the products of elements from the i-th row of the first matrix and the corresponding elements from the j-th column of the second matrix.
Most persons find it easier to spot the pattern of matrix multiplication than to
follow the above rule. A few examples may be helpful:
(1)
(2)
(3)
(4)
(5)
Multiplication is not possible; the number of columns in
the first matrix does not equal the number of rows in the
second.
(6)
(3 x 2)(4 x 3)
(4 x 3)(3 x 2)
(1 x 200)(200 x 1)
A second point to note is that even though the multiplications AB and BA may both
be possible (if A and B are both square matrices), the products will generally not be
the same. That is, matrix multiplication is, in general, not commutative. This is
illustrated by examples (1) and (6). The identity matrix (I) is one exception to this
rule; it will give the same results whether it is used in pre- or post-multiplication
(IA = AI = A).
Finally, it should be noted that any matrix is unchanged when multiplied by the
identity matrix, as in example (4).
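These properties can be checked with a short Python sketch (not part of the original report; the matrices are arbitrary examples).

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
    B = np.array([[0.0, 1.0],
                  [1.0, 1.0]])
    I = np.identity(2)

    print(A @ B)        # one product
    print(B @ A)        # generally not the same as A @ B
    print(I @ A)        # pre-multiplication by I returns A
    print(A @ I)        # post-multiplication by I also returns A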
The Inverse Matrix
In ordinary algebra, an equation such as
ab = c
can be solved for b by dividing c by a (if a is not equal to zero). In the case of
matrices, this form of division is not possible. In place of division, we make use of
the inverse matrix, which basically is not too different from ordinary algebraic
division.
The inverse of a square matrix (A) is a matrix (called A inverse and symbolized by A⁻¹) such that the product of the matrix and its inverse will be the identity matrix:

A⁻¹A = I
Finding the inverse of a matrix is not too complicated though it may be a lot of work
if an electronic computer is not available. One method is to work from the basic
definition. In the matrix (A) given above, we can symbolize the elements of the inverse
matrix by the letter c, with subscripts to identify the row and column of the element.
Thus we can write,
or
Now, two matrices are equal only if all of their corresponding elements are equal;
therefore, we have three sets of simultaneous equations, each involving three
unknowns.
as given before.
When the matrix to be inverted is symmetrical, the inversion process is not quite
so laborious, for it turns out that the inverse of a symmetrical matrix will also be
symmetrical. This means that only the elements in and above the diagonal will have
to be computed; the elements below the diagonal can be determined from those above
the diagonal. A calculating routine for inverting a symmetrical matrix is given in
Appendix B.
As a final note, it may be mentioned that although matrix multiplication is not usually commutative (AB ≠ BA), the product of a matrix and its inverse does commute (A⁻¹A = AA⁻¹).
Or even better,

A β̂ = R

where:
A is the matrix of sums, and sums of squares and products (computed from the data)
β̂ is the matrix of estimated regression coefficients (to be computed)
R is the matrix of right-hand sides of the normal equations (computed from the data).
If this were ordinary algebra, β̂ could be solved for simply by dividing both sides by A, giving β̂ = R/A. This does not work in matrix algebra, but there is a comparable process that will lead to the desired result. This is to multiply each side of the equation by the inverse of A (= A⁻¹), giving

A⁻¹A β̂ = A⁻¹R

We have seen that a matrix is unchanged when multiplied by the identity matrix, so A⁻¹A β̂ = I β̂ = β̂, and the above equation becomes

β̂ = A⁻¹R

(Note again that the matrix of c-multipliers, with elements cij, is symmetric.)
Then, since two matrices are equal only if corresponding elements are equal, we
have
or in general
To reassure ourselves that this procedure actually works, let us take a simple
numerical example. Suppose we have the normal equations
The inverse can (and should) be checked by multiplying it and the original matrix to see that the result is the identity matrix. Then, the regression coefficients are obtained by multiplying the inverse by the column of right-hand sides (β̂ = A⁻¹R).

This is only one of the uses of the inverse matrix. It is, in fact, one of the least important uses, since the regression coefficients are just as easily computed by the more familiar simultaneous equation techniques. Other uses of the inverse will be discussed later under testing and computing confidence limits.
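A Python sketch of this process follows (not part of the original report; the normal equations shown are hypothetical).

    import numpy as np

    A = np.array([[298.0, 120.0],      # hypothetical matrix of sums of squares and products
                  [120.0, 215.0]])
    R = np.array([848.0, 310.0])       # hypothetical right-hand sides

    A_inv = np.linalg.inv(A)           # the c-multipliers
    print(A_inv @ A)                   # check: should be (near) the identity matrix
    beta_hat = A_inv @ R               # estimated regression coefficients
    print(beta_hat)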
A word on notation is in order here. In regression work, the subscripting of the
c-multipliers depends on the form of the normal equations. If the normal equations
contain the constant term (β0), the c-multipliers will carry subscripts beginning with zero (c00, c01, c02, and so on). If the normal equations do not contain the constant term, the subscripts for the c-multipliers will begin with one (c11, c12, and so on).
It will be recalled that there are two situations in which the normal equations will
not have a constant term. The first is, of course, when the model being fitted does not
contain a constant term. The second is when the model being fitted does have a
constant term but the fitting is being done with corrected sums of squares and products.
ANALYSIS OF VARIANCE
It is important to keep in mind that when a linear model such as

Y = β0 + β1X1 + β2X2 + ... + βkXk + ε

is fitted to a set of sample data, we are, in effect, obtaining sample estimates of the population regression coefficients β0, β1, β2, ..., βk. These estimates will obviously be subject to sampling variation; their values will depend on which units were, by chance, selected for the sample.
If we have some hypothesis concerning what one or more of the coefficients should
be, this leaves us with the problem of determining whether the differences between
the observed and hypothesized values are real or could have occurred by chance. For
example, suppose we have a hypothesis that the relationship between Y and X is
linear. From a sample of Y and X values, we obtain an estimated equation. To test our hypothesis of a linear relationship, we would want to test whether the observed value of β̂1 = .082 represents a real or only a chance departure from a true value of β1 = 0.
... in sampling a population for which β1 = 0 and σ² = K.
The exact form of the hypothesis will depend on the objectives of the research. The
main requirements are that the hypothesis be specified before the equation is fitted
and that it be meaningful in terms of the research objective.
This portion of the paper will deal with the use of analysis-of-variance procedures
in testing hypotheses about the coefficients. Some hypotheses can also be tested by
the t-test and this will be described later.
A General Test Procedure
There is a basic procedure that may be used in all situations, but in practice the
computational routine often varies with the method of fitting and the hypothesis to be
tested. First, let's look at the basic procedure and then illustrate the computations for some of the more common testing situations. To illustrate the discussion of the basic procedure, assume that we have a set of n observations on a Y-variable and on each of several X-variables, and that the maximum model has been fitted by least squares.
From the fitted maximum model we compute

Residual = ΣY² - Σβ̂jRj

where Rj is the right-hand side of the j-th normal equation. In this equation the first term (ΣY²) is called the total sum of squares. The second term (Σβ̂jRj) is called the reduction or regression sum of squares.

The next step is to rewrite the basic model, imposing the conditions specified by the hypothesis. In this case the hypothesis is that β2 = 1 and β3 = 2, so we have a revised (hypothesis) model in which only the remaining coefficients need to be estimated.
Now we fit this "hypothesis" model by the standard least squares procedures and again compute

Residual = (total sum of squares) - Σβ̂kRk

In this equation, the Rk term is the right-hand side of the k-th normal equation of the set used to fit the hypothesis model, and the β̂k are the resulting solutions of that set.
The analysis of variance can now be outlined in a table with columns for Source, Degrees of freedom, Sum of squares, Mean square, and F for testing the hypothesis.
In this table, the residual sums of squares for the hypothesis and maximum models
are computed according to the equations given above. The difference is obtained by
subtraction. The degrees of freedom for a residual sum of squares will always be
equal to the number of observations (n) minus the number of independently estimated
coefficients. Thus, in the maximum model we estimated five coefficients so the residual sum of squares would have n-5 degrees of freedom. In the hypothesis model, we estimated three coefficients so the residuals will have n-3 degrees of freedom. The degrees of freedom for the difference (2) are then obtained by subtraction. The mean squares are equal to the sum of squares divided by the degrees of freedom. Finally, the test of the hypothesis can be made by computing

F = (mean square for the difference) / (residual mean square for the maximum model)

This value is compared to the tabular value of F (table 6, Appendix E) with 2 and n-5 degrees of freedom (in this instance). If the computed value exceeds the tabular value at the selected probability level, the hypothesis is rejected.
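The general procedure can be sketched in Python as follows (not part of the original report; the data and the particular hypothesis, β2 = 0, are hypothetical).

    import numpy as np

    Y  = np.array([10.0, 12.0, 15.0, 9.0, 20.0, 14.0, 17.0, 11.0, 16.0, 13.0])
    X1 = np.array([1.0, 2.0, 3.0, 1.0, 5.0, 3.0, 4.0, 2.0, 4.0, 3.0])
    X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0, 2.0, 3.0, 1.0, 5.0, 4.0])

    def residual_ss(design, y):
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        return np.sum((y - design @ coef) ** 2)

    n = len(Y)
    max_design = np.column_stack([np.ones(n), X1, X2])   # maximum model: b0 + b1*X1 + b2*X2
    hyp_design = np.column_stack([np.ones(n), X1])       # hypothesis model: b2 = 0

    ss_max, df_max = residual_ss(max_design, Y), n - 3
    ss_hyp, df_hyp = residual_ss(hyp_design, Y), n - 2

    diff_ss, diff_df = ss_hyp - ss_max, df_hyp - df_max
    F = (diff_ss / diff_df) / (ss_max / df_max)
    print(F, diff_df, df_max)   # compare F to the tabular value with these df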
Degrees of Freedom

If the number of coefficients to be fitted is equal to the number of observations, it will turn out that every point will lie on the regression surface. The sum of squared
residuals will be zero. This will be true regardless of the independent variables used
in the model. This is a statistical form of the geometrical fact that two points may
define a straight line, three points may define a plane in three-dimensional space,
and n points define a "hyperplane" in n-dimensional space.
Problem IX - Test of the Hypothesis that β1 + β2 = 1

Whenever the hypothesis specifies that a coefficient or some linear function of the coefficients has a value other than zero, the basic test procedure must be used. To illustrate the test of this non-zero hypothesis we will assume that we have fitted the model (maximum model)

Y = β0 + β1X1 + β2X2
Since three coefficients were fitted, the reduction sum of squares has 3 degrees of freedom, and

Residual = Total - Reduction

The next step is to fit the hypothesis model. Under the hypothesis that β1 + β2 = 1 (or β2 = 1 - β1), the model becomes

(Y - X2) = β0 + β1(X1 - X2)

or

Y' = β0 + β1X'1

where: Y' = Y - X2 and X'1 = X1 - X2.
At this stage individual values could be found for the new variables X'1 and Y', and these could be used to compute the sums, and sums of squares and products needed. However, with a little algebraic manipulation, we can save ourselves a lot of work by expressing these in terms of the sums already computed for the original variables. For the hypothesis model we then obtain

Residual = Total - Reduction
         = 4,091 - 3,986.0545
         = 104.9455, with 13 - 2 = 11 df.
With these values we can now summarize the analysis of variance and F-test.

Source                            df    Sum of squares    Mean square
Residual--hypothesis model        11        104.9455
Residual--maximum model           10        101.7204        10.17204
Difference                         1          3.2251         3.2251

F = 3.2251/10.17204 = 0.317, with 1 and 10 degrees of freedom, which is not significant at the .05 level; the hypothesis that β1 + β2 = 1 is not rejected.

Problem X - Test of the Hypothesis that β2 = 0
One of the most common situations in regression is to test whether the dependent variable (Y) is significantly related to a particular independent variable. We might want to test this hypothesis when the variable is fitted alone, in which instance (if the variable is X2) we might fit

Y = β0 + β2X2

and test the hypothesis that β2 = 0. Or, we might want to test the same hypothesis when the variable has been fitted in the presence of one or more other independent variables. We could, for example, fit

Y = β0 + β1X1 + β2X2 + β3X3

and again test the hypothesis that β2 = 0.

The latter situation will be illustrated with the data of Problem I. If we work with uncorrected sums of squares and products, the normal equations for fitting the maximum model are:
Residual = Total - Reduction
         = 6,046 - 5,948.5182
         = 97.4818, with 13 - 4 = 9 df.

Under the hypothesis that β2 = 0, the model reduces to Y = β0 + β1X1 + β3X3.

At this point we depart slightly from the basic procedure. Ordinarily, the residuals for the hypothesis model would next be computed and then the difference in residuals between the maximum and hypothesis models would be obtained. But, where the hypothesis results in no change in the Y-variable, the difference between the residuals is the same as the difference between the reductions for the two models:

Difference in residuals = Hypothesis model residuals - Maximum model residuals
                        = Maximum model reduction - Hypothesis model reduction.
Source                                  df    Sum of squares    Mean square
Reduction due to X1, X2, and X3          4       5,948.5182
Reduction due to X1 and X3               3       5,335.4130
Difference (gain due to X2)              1         613.1052        613.1052
Residual                                 9          97.4818         10.8313
Total                                   13       6,046

F = 613.1052/10.8313 = 56.6, with 1 and 9 degrees of freedom.

As F exceeds the tabular value at the .01 level, the hypothesis would be rejected at this level. We say that X2 makes a significant reduction (in the residuals) when fitted after X1 and X3.
Problem XI - Working With Corrected Sums of Squares and Products
It has been shown that if the model contains a constant term (β0), the fitting may be accomplished with less effort by working with the corrected sums of squares and products. That is, instead of fitting

Y = β0 + β1X1 + β2X2 + β3X3

we can fit

y = β1x1 + β2x2 + β3x3

When this is done, it must be remembered that y = Y - Ȳ is the deviation of Y from its mean and that with n observations only n-1 of the deviations are free to vary (since the deviations must sum to zero). Thus, the total sum of squares (corrected sum of squares) will have only n-1 df's. Also, in the maximum model we now estimate three rather than four coefficients, so the reduction will have only three degrees of freedom.
Despite these changes, the test will lead (as it should) to exactly the same conclusion
that was obtained in working with uncorrected sums of squares and products.
Thus, using the data of Problem I and testing the same hypothesis that was tested
in Problem X, the normal equations for the maximum model are:
The solutions are β̂1 = 2.7288, β̂2 = -1.9218, and β̂3 = -0.0975. Therefore the reduction sum of squares is

Reduction = 1,736.5182, with 3 df

and

Residual = Total - Reduction = 1,834 - 1,736.5182 = 97.4818, with 9 df.

Under the hypothesis that β2 = 0, the reduced model would become y = β1x1 + β3x3, for which the normal equations are
The solutions are β̂1 = 2.3990 and β̂3 = -0.2149, so the reduction due to fitting this model is

Reduction = (2.3990)(448) + (-0.2149)(-226)
          = 1,123.3194, with 2 df.

Source                                  df    Sum of squares    Mean square
Reduction due to X1, X2, and X3          3       1,736.5182
Reduction due to X1 and X3               2       1,123.3194
Difference (gain due to X2)              1         613.1988        613.1988
Residual                                 9          97.4818         10.8313
Total (corrected)                       12       1,834
To show what variables are involved in the two models, the various sources in the analysis of variance may be relabeled as follows: Due to X1, X2, and X3; Due to X1 and X3; Gain due to X2 after X1 and X3; Residuals; Total.

Problem XII - Test of the Hypothesis that β2 = β3 = 0
For the maximum model we have, as before,

Reduction = 1,736.5182, with 3 df
Total (corrected) = 1,834, with 12 df
Residual = 97.4818, with 9 df.

Under the hypothesis that β2 = β3 = 0, the model becomes y = β1x1, and the reduction due to X1 alone is 1,102.7520, with 1 df, giving

Source                                  df    Sum of squares    Mean square
Reduction due to X1, X2, X3              3       1,736.5182
Reduction due to X1                      1       1,102.7520
Gain due to X2 and X3 after X1           2         633.7662        316.8831
Residuals                                9          97.4818         10.8313
Total                                   12       1,834

F = 316.8831/10.8313 = 29.3, with 2 and 9 degrees of freedom.
Problem XIII - Test of the Hypothesis that β1 + 2β2 = 0

For the maximum model we again have

Reduction = 1,736.5182, with 3 df
Total (corrected) = 1,834, with 12 df
Residual = 97.4818, with 9 df.
Under the hypothesis that β1 + 2β2 = 0 (or β1 = -2β2), the model can be written

y = -2β2x1 + β2x2 + β3x3

or

y = β2x'2 + β3x3

where: x'2 = x2 - 2x1.

The normal equations for this model involve the sums of squares and products of x'2 and x3, which can be obtained so as to make use of the sums of squares and products that were computed in fitting the maximum model.
Thus,

Source                          df    Sum of squares    Mean square
Maximum model reduction          3       1,736.5182
Hypothesis model reduction       2            ...
Difference                       1            ...             ...
Residual                         9          97.4818         10.8313
Total                           12       1,834
... This model gives a reduction of 317.156 with 1 df, so the test of the hypothesis is:

Source          df    Sum of squares    Mean square
Difference       1        154.842          154.842
Residual        11         81.743            7.431
Total           13        553.741

F = 154.842/7.431 = 20.8, with 1 and 11 degrees of freedom.
An Alternate Way to Compute the Gain Due to a Set of X Variables

Two examples that illustrate the method will be given. In Problem XII we fitted the model

y = β1x1 + β2x2 + β3x3

and tested the hypothesis that β2 = β3 = 0. We found β̂2 = -1.9218 and β̂3 = -0.0975. Using the inverse of the matrix of corrected sums of squares and products, the gain due to X2 and X3 after X1 works out by the present method to 633.7408.

In Problem X we tested the hypothesis that β2 = 0. We found β̂2 = -1.9218, and by the present method the gain could be computed as β̂2²/c22 = (-1.9218)²/0.006 022 9 = 613.2, in agreement with the value obtained earlier.
Whether or not this method saves any time or labor will depend on the number of
variables being tested, the number of variables in the hypothesis model, and the
individual's facility at solving simultaneous equations and inverting matrices. If only
one of the X variables is being tested (as in the last example), the t-test as described in the next chapter will usually be the easiest. If the hypothesis model involves only one or two variables, or if several variables are to be tested, it may be easiest to fit each model and find the reduction due to each rather than to work with the c-multipliers. For the beginner, the best method will usually be the one
with which he is most familiar.
THE t-TEST
The coefficients of a fitted regression are sample estimates which, like all estimates, are subject to sampling variation. The statistical measure of the variation of a variable is the variance, and the measure of the association in the variation of two variables is the covariance.
For making t-tests, three items are needed:

1. The variance of an estimated coefficient, which is

   Variance (β̂j) = cjj (Residual Mean Square)

   where cjj is the diagonal c-multiplier corresponding to β̂j.

2. The covariance of two coefficients estimated from the same set of normal equations, which is

   Covariance (β̂j, β̂k) = cjk (Residual Mean Square).

3. The t for testing the hypothesis that a linear function of the coefficients is equal to some specified value, which is

   t = (observed value of the function - hypothesized value) / √(variance of the function)

   where the variance of the linear function is computed from the variances and covariances above.

The t computed by this equation will have degrees of freedom equal to those of the mean square used in the denominator.

Putting these three items together, we could, for example, test the hypothesis that β1 + 2β2 = 8 by

t = (β̂1 + 2β̂2 - 8) / √(Variance (β̂1 + 2β̂2))
Since this last is the hypothesis that was tested in Problem XI, the t-test can be illustrated with the same data. In that example, we had β̂2 = -1.9218, and the residual mean square with 9 degrees of freedom was 10.8313. From the inverse of the matrix of coefficients of the normal equations, the needed c-multiplier is c22 = 0.006 022 9, so

t = β̂2 / √[c22 (Residual Mean Square)] = -1.9218 / √[(0.006 022 9)(10.8313)] = -7.525.

If the absolute value (algebraic sign ignored) of t exceeds the tabular (table 7, Appendix E) value for 9 df at the desired probability level, then the hypothesis is rejected. In this case tabular t = 2.262 (at the .05 level), so we would reject the hypothesis that β2 = 0.
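This computation can be sketched in Python directly from the values given in the text.

    b2_hat = -1.9218
    c22 = 0.0060229
    residual_ms = 10.8313

    se_b2 = (c22 * residual_ms) ** 0.5     # standard error of the coefficient
    t = b2_hat / se_b2                     # test of the hypothesis beta2 = 0
    print(t)                               # about -7.5, well beyond tabular t = 2.262 for 9 df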
The F-test of this hypothesis also leads to a rejection. Those who are not familiar
with the relationship between the t and F distributions sometimes ask which is
the best test. The answer is that where both are applicable, the easiest one is best
because essentially there is no difference between them. If a given set of data is
used to test some hypothesis by both the t and F tests, it will be found that F with 1
and k degrees of freedom is equal to the square of t with k degrees of freedom. In the
2
example given above, we found t = -7.525 so t = 56.63. The F value for testing the
same hypothesis was
except for rounding errors, they should be identical.
Problem XV - Test of a Non-Zero Hypothesis
The main advantage of the t-test over the analysis of variance is in the test of a
non-zero hypothesis. It will be recalled that an F-test of this type of hypothesis
required the computation of separate total and residual sums of squares for both the
maximum and the hypothesis model.
With the t-test, a non-zero hypothesis is handled as easily as any other. This can be illustrated using the data in Problem XI for a test of the hypothesis that β1 + β2 = 1.

In that problem we had β̂1 = 2.7288, β̂2 = -1.9218, and Residual Mean Square = 10.8313 with 9 df. As we have seen, the inverse matrix supplies the needed c-multipliers, and

t = (β̂1 + β̂2 - 1) / √[(c11 + 2c12 + c22)(Residual Mean Square)]

The absolute value of t is less than the tabular value (2.262) at the .05 level, so the hypothesis would not be rejected.

It should be noted that this is a test of the hypothesis that β1 + β2 = 1 when the coefficients are fitted in the model

Y = β0 + β1X1 + β2X2 + β3X3

This is not the same as the hypothesis that was tested in Problem IX. In that problem, β1 + β2 = 1 was tested when the coefficients were fitted in the model

Y = β0 + β1X1 + β2X2
CONFIDENCE LIMITS
General
If you have ever asked for estimates on the cost of repairing a car or a TV set,
you are probably well aware of the fact that there are good and there are bad
estimates. Sample-based regression estimates can also be good or bad, and it is
important to provide some indication of just how good or bad they might be.
The variation that may be encountered in fitting sample regressions can be
illustrated by five separate samples of 10 units each, selected at random from a
population in which Y was known to have no relationship to the X variable. The simple
linear regressions that resulted from the five samples were as follows:
The sample regressions have been plotted in figure 9. The heavy horizontal line
represents the mean value of Y for the population.
These sample regressions illustrate two points that should never be forgotten by
those who work with regression. The first is that the fitting procedure can be applied
to any set of data. Put in the numbers, turn the crank, and out will come an equation
that expresses one variable in terms of one or more other variables. But no matter how hard or how many times the crank is turned, it is impossible to induce relationships that did not exist to start with. It may all look very scientific, but the mere
existence of an equation with coefficients computed to eight decimal places on a
$3 million computer does not prove that there is a relationship.
The second point is that sample estimates are subject to variation. The variation in
these regressions may be quite startling to those who have had little experience with
the behavior of sample estimates. These results should not, however, be allowed to shatter the beginner's hopes for regression analysis. Ten units from this population
is far too light a sample for fitting even a simple linear regression and the erratic
results are no more than might be expected.
In general, confidence limits on an individual regression coefficient βj are given by

β̂j ± t √[cjj (Residual Mean Square)]

where: cjj = the c-multiplier corresponding to β̂j.

The t would have degrees of freedom equal to those for the residual mean square.

In the section on the analysis of variance, we fitted a regression of Y on X1, X2, and X3 (Problem XI). The estimated value of β2 was β̂2 = -1.9218. The residual mean square was 10.8313 with 9 degrees of freedom, and the c-multiplier was c22 = 0.006 022 9 (Problem XV). The 95-percent confidence limits for β2 would then be given by

-1.9218 ± (2.262) √[(0.006 022 9)(10.8313)] = -1.9218 ± 0.578

or

-2.500 to -1.344.
The confidence limits can be used as a test of some hypothesized value of the
coefficient. Since these limits do not include zero as a possible value, we would reject the hypothesis that β2 = 0. This is the same conclusion that we reached by the
F and t tests. The confidence limit approach is usually more informative than F or
t tests of a hypothesis.
If we wish to place confidence limits on a linear function of the regression coefficients, we must remember the rule for the variance of a linear function. This rule was given previously, but will be repeated here.

If β̂1, β̂2, ..., β̂k are a set of estimated regression coefficients, the variance of the linear function (a1β̂1 + a2β̂2 + ... + akβ̂k) is

Σ aj² Variance (β̂j) + 2 Σ ajak Covariance (β̂j, β̂k)

where: the aj are constants, and the second sum is over all pairs of coefficients.

This is nowhere near as difficult as it appears. For example, the variance of the function 2β̂1 - 3β̂2 would be

4 Variance (β̂1) + 9 Variance (β̂2) - 12 Covariance (β̂1, β̂2)

or

[4c11 + 9c22 - 12c12] (Residual Mean Square).

If L̂ is the estimated value of such a linear function, the confidence limits on the true value of the function are

L̂ ± t √(variance of L̂)

where t has degrees of freedom equal to the df for the residual mean square and is selected for the desired level of confidence.
Confidence Limits on Ŷ (Predicted Mean Y)

The preceding rule may be applied to the problem of determining confidence limits for Ŷ (a predicted value of mean Y for a given set of values for the X variables). If the predicted value of mean Y is

Ŷ = β̂0 + β̂1X1 + β̂2X2 + ... + β̂kXk

then the confidence limits can be obtained by treating the specified values of the X's as constants and applying the preceding rule. Thus,

Confidence limits = Ŷ ± t √(variance of Ŷ)

Note that in the case where the fitted model has no constant term, we will have no c-multipliers with a zero in the subscript.

The reader who has always used corrected sums of squares and products in regression work may find that the confidence limit equations given are not quite the same as those with which he is familiar. They will, however, give exactly the same confidence limits. The somewhat more familiar equation (for unweighted regression) for the confidence limits is ...
In the following problem, we will see that these equations lead to the same result.
Problem XVI - Confidence Limits In Multiple Regression
In Problem I, we fitted a regression of Y as a linear function of X1, X2, and X3.
The fitted equation was
The same result was obtained whether we used corrected or uncorrected sums of
squares and products. The residual mean square (Problem XI) was 10.8313 with
9 degrees of freedom.
Suppose now that we predicted the mean value of Y associated with the values
X1 = 6, X2 = 8, and X3 = 12. We would have
The method of computing the confidence interval for this estimate will depend on
whether corrected or uncorrected sums of squares and products were used in the
fitting. We will assume first that the fitting was done with uncorrected terms. In
this instance the normal equations were
The inverse is
Thus, unless a 1-in-20 chance occurred in sampling, we can say that the true mean of Y associated with X1 = 6, X2 = 8, and X3 = 12 is somewhere between 4.0948 and ... . Note that this does not imply that individual values of Y will be found between these limits. This is a confidence interval on regression Ŷ, which is the mean value of Y associated with a specified combination of X values.
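The computation can be sketched in Python as follows; this is not from the original report, and the fitted coefficients and c-multiplier matrix shown are hypothetical stand-ins for the values actually obtained in Problem XVI.

    import numpy as np

    x0 = np.array([1.0, 6.0, 8.0, 12.0])                 # 1 for the constant term, then X1, X2, X3
    beta_hat = np.array([5.0, 2.7, -1.9, -0.1])          # hypothetical fitted coefficients
    C = np.array([[ 0.90, -0.05, -0.04, -0.03],          # hypothetical c-multiplier matrix
                  [-0.05,  0.01,  0.00,  0.00],
                  [-0.04,  0.00,  0.006, 0.00],
                  [-0.03,  0.00,  0.00,  0.004]])
    residual_ms = 10.8313
    t_05 = 2.262                                          # tabular t for 9 df, .05 level

    y_hat = x0 @ beta_hat
    var_y_hat = (x0 @ C @ x0) * residual_ms               # variance of the predicted mean
    half_width = t_05 * var_y_hat ** 0.5
    print(y_hat - half_width, y_hat + half_width)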
Suppose now that in fitting this equation, we had used the corrected sums of squares and products so that the normal equations were

...

The inverse is
Note that this is the same as the inverse of the matrix of uncorrected sums of
squares and products with the first row and column deleted.
The confidence limits can be computed by the equation
The normal equation for the simple linear regression fitted with corrected sums of squares and products involves only the single corrected sum of squares of X, so its inverse is simply the reciprocal of that sum of squares. For a value of X = 7, we would have the predicted mean and its confidence limits as computed by the preceding equations.
If regression Y and the confidence limits are computed for several values of X,
the confidence limits can be displayed graphically. In the above example, we would
have
In figure 10 these points have been plotted and connected by smooth curves.
The formula that can be used when the corrected sums of squares and products have been used in the fitting would be

    Ŷ ± t √( s² [ 1/n + (X - X̄)² / Σx² ] )

where Σx² is the corrected sum of squares of X.
COVARIANCE ANALYSIS
It frequently happens that the unit observations to be analyzed can be classified into
two or more groups. A set of tree heights and diameters might, for example, be
grouped according to tree species. This raises the question of whether separate prediction equations should be used for each group or whether some or all of the groups could be represented by a single equation. Covariance analysis provides a means of answering this question.
In the case of simple linear equations, group regressions may differ either because
they have different slopes or, if the slopes are the same, because they differ in level
(fig. 11).
The standard covariance analysis first tests the hypothesis of no difference in slopes. Then, if there is no evidence of a difference in slopes, the hypothesis of no difference in levels is tested. If no significant difference is found in either the slopes or the levels, then a single regression may be fitted, ignoring the differences between groups.
The following set of data will be used in the problems illustrating the analysis of
covariance.
              Group A              Group B              Group C
            Y       X            Y       X            Y       X
           5.9     0.8          5.2     1.6          7.8     0.6
          10.7     3.1         13.4     5.8         12.4     3.4
          11.4     4.4         10.0     3.6         10.9     1.5
           9.6     1.6          7.5     2.0                  0.7
          12.6     4.6         10.1     4.3         16.8     4.5
           8.0     2.6         11.9     5.8         13.9     4.1
          12.8     5.5         10.7     4.8         11.4     2.3
           7.5     1.1          6.8     3.3          8.9     1.3
          12.5     3.9          9.0     2.6         13.7     3.1
          14.2     4.9                              16.0     4.6
           8.4     1.4
Sums     113.6    33.9         84.6    33.8        121.7    26.1
Means   10.3273  3.0818         9.4   3.7556       12.17    2.61

             n      ΣY²        ΣX²       ΣXY        Σy²           Σx²           Σxy
Group A     11    1242.72     132.73    390.91    69.541 819    28.256 364    40.815 455
Group B      9     848.20     145.98    346.69    52.96         19.042 222    28.97
Group C     10    1559.73      89.47    356.57    78.641        21.349        38.933

For all 30 observations combined (ignoring groups): n = 30; ΣY = 319.9; ΣX1 = 93.8; ΣY² = 3650.65; Σy² = 239.449 667; ΣX1Y = 1094.17; Σx1y = 93.949 334; ΣX1² = 368.18; Σx1² = 74.898 667.
Using corrected sums of squares and products, the normal equation for a linear
regression with constant term is:
Group A:

    28.256 364 β̂1 = 40.815 455
    β̂1 = 1.444 469

    Reduction = β̂1(Σx1y) = (1.444 469)(40.815 455) = 58.956 659, with 1 df
    Residual = Σy² - Reduction = 69.541 819 - 58.956 659 = 10.585 160, with 9 df.

Group B:

    19.042 222 β̂1 = 28.97
    β̂1 = 1.521 356

    Reduction = (1.521 356)(28.97) = 44.073 683, with 1 df
    Residual = 52.96 - 44.073 683 = 8.886 317, with 7 df.

Group C:

    21.349 β̂1 = 38.933
    β̂1 = 1.823 645

    Reduction = (1.823 645)(38.933) = 71.000, with 1 df
    Residual = 78.641 - 71.000 = 7.641, with 8 df.

All groups combined (ignoring groups):

    74.898 667 β̂1 = 93.949 334
    β̂1 = 1.254 353

    Reduction = (1.254 353)(93.949 334) = 117.845 629, with 1 df
    Residual = 239.449 667 - 117.845 629 = 121.604 038, with 28 df.
Two approaches to the analysis of covariance will be illustrated. The first method
is that given by Snedecor (7), while the second is a general method involving the
introduction of dummy variables.
Problem XVIII - Covariance Analysis
Snedecor (7) presents the analysis of covariance in a very neat form. The steps in
this procedure are summarized in table 1.
Table 1.--Analysis of covariance
(Footnote to table: not significant at the 0.05 level.)
The first three lines in this table summarize the results of the fitting of separate
linear regressions for each group. In line 4, the residuals about the separate
regressions and the associated degrees of freedom are pooled. This pooled term can
be thought of as the sum of squared residuals about the maximum model; it represents
the smallest sum of squares that can be obtained by fitting straight lines to these
observations.
Skipping to line 6 for the moment, the first four columns are the pooled degrees of
freedom and corrected sums of squares and products for the groups. The last three
columns summarize the result of using the pooled sums of squares and products to
fit a straight line. The normal equation and solution for this fitting would be:
    68.647 586 β̂1 = 108.718 455
    β̂1 = 1.583 719

Thus the reduction sum of squares with 1 degree of freedom is

    Reduction = (1.583 719)(108.718 455) = 172.179 483
    Residual = Σy² - Reduction = 201.142 819 - 172.179 483 = 28.963 336, with 26 df.
The significant value of F suggests that the group regressions are different. The difference is mostly due to a difference in levels. There is no evidence of a real difference in slopes.
If we are not interested in finding out whether the difference (if any) in the group
regressions is in the slopes or the levels, an overall test could be made using the
difference between lines 8 and 4. We would have:
It is possible to test more complex hypotheses than these. We could, for example,
test for difference in slope and level between groups A and C, or for the average of
groups A and B versus group C. We could also deal with multiple or curvilinear
regressions and test for differences between specified coefficients or sets of
coefficients. It is probably safe to say that readers who have sufficient understanding
of regression to derive meaningful interpretations of such tests will usually know
how to make them.
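As a rough computational companion to the procedure summarized in table 1, the following Python sketch (with assumed function names, using only corrected sums of squares and products) reproduces the two F tests: difference in slopes, and difference in levels given common slopes.

import numpy as np
from scipy import stats

def ancova_tests(groups):
    """Slope and level tests for simple linear regressions in several groups.

    'groups' is a list of (x, y) pairs of 1-D arrays, one pair per group.
    Returns F and P for (1) difference in slopes and (2) difference in
    levels assuming common slopes."""
    def corrected(x, y):
        x, y = np.asarray(x, float), np.asarray(y, float)
        return (np.sum((x - x.mean()) ** 2),
                np.sum((x - x.mean()) * (y - y.mean())),
                np.sum((y - y.mean()) ** 2),
                len(x))

    k = len(groups)
    per_group = [corrected(x, y) for x, y in groups]

    # Residuals about the separate group regressions (the "maximum model").
    sep_ss = sum(yy - xy ** 2 / xx for xx, xy, yy, _ in per_group)
    sep_df = sum(n - 2 for _, _, _, n in per_group)

    # Residual when one common slope is fitted to the pooled within-group sums.
    xx_p = sum(s[0] for s in per_group)
    xy_p = sum(s[1] for s in per_group)
    yy_p = sum(s[2] for s in per_group)
    common_ss = yy_p - xy_p ** 2 / xx_p
    common_df = sep_df + (k - 1)

    # Residual when a single regression is fitted ignoring groups.
    x_all = np.concatenate([np.asarray(x, float) for x, _ in groups])
    y_all = np.concatenate([np.asarray(y, float) for _, y in groups])
    xx_a, xy_a, yy_a, _ = corrected(x_all, y_all)
    single_ss = yy_a - xy_a ** 2 / xx_a

    f_slopes = ((common_ss - sep_ss) / (k - 1)) / (sep_ss / sep_df)
    f_levels = ((single_ss - common_ss) / (k - 1)) / (common_ss / common_df)
    return ((f_slopes, stats.f.sf(f_slopes, k - 1, sep_df)),
            (f_levels, stats.f.sf(f_levels, k - 1, common_df)))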
Covariance Analysis With Dummy Variables
Since X0 is equal to 1 for all observations, the normal equations are equivalent to:
So, the end result will be the same as that given by the methods previously described.
The idea of a dummy variable comes in quite handy in dealing with the problem of
group regressions. There are several ways of applying the idea, but the most easily
understood is to introduce a dummy variable for each group. The dummy variable
would be defined as equal to 1 for every observation in that particular group, and
equal to zero for any observation that is in a different group. As we are interested in
linear regressions of Y on X1, the dummy variables could (in the case of three groups) be labeled X2, X3, and X4, where X2 = 1 for observations in group A and 0 otherwise, X3 = 1 for observations in group B and 0 otherwise, and X4 = 1 for observations in group C and 0 otherwise.
With these variables, we can now express the idea of separate linear regressions for each group by a general linear model in which each group has its own level and its own slope coefficient on X1.
After solving for the coefficients, the regression for any group can be obtained by assigning the appropriate value to each dummy variable. Thus for Group A, X2 = 1, X3 = 0, and X4 = 0. The equation obtained in this way will be exactly the same as the one we would get by fitting a simple linear regression of Y on X1 for the observations in group A only.
FPL 17
-87-
Under the hypothesis that the three groups have regressions that differ in level but not in slope, the model would be

    Y = β1X1 + β2X2 + β3X3 + β4X4

In this model, β1 is the common slope, while β2, β3, and β4 represent the different levels.
The difference in reduction sum of squares for these two models could then be used in a test of the hypothesis of common slopes.
Under the hypothesis that there is no difference in either slope or level, the model becomes

    Y = β0 + β1X1
The difference in reduction between this model and the model assuming common
slopes can be used to test the hypothesis that there is no difference in levels.
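These comparisons can be phrased as reductions in sums of squares for the three dummy-variable models just described. The Python sketch below builds the three design matrices and returns their reduction sums of squares; the function names and the use of a library least-squares routine are illustrative assumptions.

import numpy as np

def reduction_ss(X, y):
    """Reduction sum of squares for a least-squares fit of y on the columns of X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y @ X @ b

def dummy_variable_ancova(x, y, group):
    """Reductions for (separate regressions, common slope with different
    levels, single regression), given 1-D arrays x and y and a group label
    for each observation."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    labels = sorted(set(group))
    D = np.column_stack([[1.0 if g == lab else 0.0 for g in group] for lab in labels])

    X_separate = np.column_stack([D, D * x[:, None]])   # own level and slope per group
    X_common = np.column_stack([x, D])                   # common slope, different levels
    X_single = np.column_stack([np.ones_like(x), x])     # one line for all groups

    return (reduction_ss(X_separate, y),
            reduction_ss(X_common, y),
            reduction_ss(X_single, y))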
Problem XIX - Covariance Analysis With Dummy Variables
The use of dummy variables for a covariance analysis will lead to exactly the same
result as the method of Snedecor (7). Applying the procedure to the data of the
previous example, the values of the variables would be as follows:
The solutions, which are easily obtained by working with pairs of equations involving the same coefficients, are the same as the equations that would have been obtained by fitting separate regressions in each group.
The reduction due to this maximum model would be:
Under the hypothesis of common slopes but different levels, the model becomes

    Y = β1X1 + β2X2 + β3X3 + β4X4

giving a reduction equal to that obtained before.
Now to test the hypothesis of no difference in levels (assuming no difference in slopes), we must fit the model

    Y = β0 + β1X1

Then the difference in reduction between this model and the model assuming common slopes is the same as before.
Although the two procedures will lead to exactly the same results, the computational routine of Snedecor's procedure (7) is probably easier to follow and, therefore, better for the beginner. The advantage (if any) of the dummy variable approach might be that it gives a somewhat clearer picture of the hypotheses being tested. Once the dummy variable approach has been learned, it may be easier to work out its extension to the testing of more complex hypotheses.
Dummy variables are also useful in introducing the relationship between regression
analysis and the analysis of variance for various experimental designs. Those
interested in this subject will find a brief discussion in Appendix D.
DISCRIMINANT FUNCTION
Closely related to the methods of regression analysis is the problem of finding a
linear function of one or more variables which will permit us to classify individuals
as belonging to one of two groups. The function is known as a discriminant. In
forestry, it might be used, for example, to find a function of several measurements
which would enable us to assign fire- or insect-damaged trees to one of two classes: will live or will die.
The methods will be illustrated by the intentionally trivial example of classifying an individual as male or female by means of the individual's height (X1), weight (X2), and age (X3). To develop the discriminant, measurements of these three variables were made on 10 men and 10 women.
The first step is to compute the difference in the group means for each variable and
the corrected sums of squares and products for each group.
Mean differences:
The next step is to compute the pooled variances and covariances. The pooled variance for Xj will be symbolized by sjj and computed as

    sjj = (corrected sum of squares of Xj for males + corrected sum of squares of Xj for females) / (nm + nf - 2)

where:
    nm = number of males
    nf = number of females
The object is to find the linear function of X1, X2, and X3 that will best classify an individual as male or female. Fisher has shown that the coefficients of this function may be determined by solving the normal equations
Thus, F = 8.908, significant at the .01 level.
This test tells us that there is a significant difference in the mean values of the
discriminant between males and females. Looking at it another way, we have shown
a significant difference between the two groups using measurements on several
characteristics. This is analogous to the familiar t and F tests where a significant
difference is shown between two groups using only a single variable. In fact, if we
fit and test a discriminant function using just a single variable, for example, weight,
we will get the same F value (29.824) as we would by testing the difference in weight
between male and female using an F test of a completely randomized experimental
design.
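For readers who prefer to see the computation laid out, the following Python sketch solves the equations S d = (difference in group means), where S is the pooled within-group covariance matrix, and also returns the generalized distance D². The function name and array layout are illustrative assumptions.

import numpy as np

def fisher_discriminant(males, females):
    """Coefficients of a two-group linear discriminant.

    'males' and 'females' are arrays with one row per individual and one
    column per variable (here height, weight, and age)."""
    males = np.asarray(males, float)
    females = np.asarray(females, float)
    n_m, n_f = len(males), len(females)
    diff = males.mean(axis=0) - females.mean(axis=0)
    # Pooled within-group variances and covariances.
    S_m = (males - males.mean(axis=0)).T @ (males - males.mean(axis=0))
    S_f = (females - females.mean(axis=0)).T @ (females - females.mean(axis=0))
    S = (S_m + S_f) / (n_m + n_f - 2)
    d = np.linalg.solve(S, diff)       # discriminant coefficients
    D2 = diff @ d                      # generalized distance D^2
    return d, D2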
Testing the Contribution of Individual Variables or Sets of Variables
To test the contribution of any set of q variables in the presence of some other set of p variables, first fit a discriminant to the p variables and compute D²p. Then fit all p + q variables and compute D²p+q. The test of the contribution of the q variables in the presence of the p variables is an F test with q and N - p - q - 1 degrees of freedom.
Thus, to test the contribution of weight and age in the presence of height, we first fit a discriminant function for height alone; this fitting involves only a single normal equation. Fitting all three variables gives D²3 = 6.0127, and the test is made with the F ratio just described. Hence, weight and age do not make a significant contribution to the discrimination between male and female when used after height.
    D²2 = 5.9660        D²3 = 6.0127
For a discriminant involving height and weight but not age, the probability of a
misclassification would be about 0.11096. For the discriminant involving weight
alone, the probability of a misclassification is about 0.11098. Thus we see (as previous tests indicated) that our classifications using weight alone would be almost
as reliable as those using weight, height, and age. For a discriminant involving
height alone, the probability of a misclassification is about 0.18, and with a discriminant using age alone, about 0.368 of our classifications would be in error.
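The misclassification probabilities quoted above follow from the large-sample normal approximation, in which the probability of an error is the area of the normal curve beyond D/2. A one-line Python sketch (illustrative only):

import numpy as np
from scipy import stats

def misclassification_probability(D2):
    """Probability of misclassifying an individual, using the normal
    approximation: the normal-curve area beyond D/2."""
    return stats.norm.cdf(-np.sqrt(D2) / 2.0)

# For example, misclassification_probability(5.9660) is about 0.11,
# in line with the values quoted above.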
Reducing the Probability of a Misclassification
There are two possible procedures for reducing the proportion of misclassifications.
One of these is to look for more or better variables to be used in the discriminant
function. The second possibility is to set up a doubtful region within which no
classification will be made. This requires determining two values, Ym and Yf. Individuals whose Y falls beyond one of these limits will be classified as males, those beyond the other as females, and for individuals whose Y falls between Yf and Ym no classification will be made.
To determine Yf and Ym it is first necessary to decide the proportion of misclassifications we are willing to tolerate. Suppose, for example, we will use our three-variable discriminant function but we wish to make no more than 5-percent misclassifications. The procedure is to look in a table of the cumulative normal for a value z such that the probability of getting a standard normal deviate greater than z is 0.05. The value meeting this requirement is z = 1.645. Then the appropriate limit values can be computed.
Basic Assumptions
The methods described here assume that for each variable, the within-group
variance is the same in both groups. Different variables can, of course, have different
variances. Also, any given pair of variables is assumed to have the same within-group
covariance in each group. All variables must follow (within the group) a multivariate
normal distribution. Since the methods are based on large sample theory, it is
ordinarily desirable to have at least 30 observations in each group.
ELECTRONIC COMPUTERS
The present popularity of regression analysis is due in no small way to the
advances that have been made in electronic computers. The computations involved in
fitting regressions with more than two or three independent variables are quite
tedious,
and with a large number of observations, fitting even a simple linear
regression may be an unpleasant task.
Also, the possibilities for simple but
devastating
arithmetical
mistakes
are
great.
Modern electronic computers have
overcome both of these obstacles. They can handle huge masses of raw data and subject them to numerous mathematical operations in a matter of minutes, and they
seldom make mistakes.
Nearly every phase of regression analysis can be handled by one or more of the
computers and almost every computing center has programs for obtaining sums of
squares and products, fitting multiple regressions, inverting matrices, computing
reduction and residual sums of squares, etc. Despite the high per hour rental on
these computers, the cost of doing a particular regression computation will usually
be a small fraction of what it would cost to do the same job with a desk calculator,
and the work will rarely contain serious errors.
Because of the numerous variations in these programs and the rate at which new
ones are being produced, no attempt will be made to list everything that is available
or to describe the use of such programs. This information can best be obtained by
first learning what regression is, how and why it works, and then discussing your
needs with a computer specialist.
To merely indicate what can be done by a computer, a brief description will be
given of a few of the existing programs.
TV REM is the designation of a regression program for the IBM 704 computer.
It will take up to 586 sets of observations on a Y and up to 9 independent (X) variables
and compute the mean of each variable and the corrected sums of squares and products
FPL
17
-101-
for all variables. It will also fit the regressions of Y on all possible linear
combinations of up to nine independent variables (a total of 511 different equations)
and the reduction sum of squares associated with each fitted equation. The cost of
this may vary from $40 to $200, depending largely on the machine rental rate and
to a lesser extent on the volume of data. This program is described in a publication
by L. R. Grosenbaugh (6).
SS XXR is another program for the IBM 704. It will take up to 999,999 sets of
up to 41 variables and compute their means, all possible uncorrected sums of squares
and products, the corrected sums of squares and products, and the simple correlation
coefficients for all possible pairs of variables. The cost may run from $5 to $50,
depending again on machine rental rates and on the number of observations and
variables.
These give just a faint idea of what is available. Other programs will compute
sums of squares and products and give the inverse matrix for 40 or 50 variables, fit
regressions for as many as 60 independent variables, or fit weighted regressions
and regressions subject to various constraints. One program will follow what is known
as a stepwise fitting procedure (see Appendix A, Method III), in which the best single
independent variable will be fitted and tested first; then from the remaining variables,
the program will select and fit the variable that will give the greatest reduction in
the residual sum of squares. This will continue until a variable is encountered that
does not make a significant reduction. The program can also be altered so as to
introduce a particular variable at any stage of the fitting.
No space need be devoted in this Research Paper to encouraging the reader to look into the computer possibilities, for he will be a convert the first time he has occasion to fit a four-variable regression and compute the c-multipliers--if not sooner.
CORRELATION COEFFICIENTS
General
In earlier literature, there is frequent reference to and use of various forms of
correlation coefficients. They were used as a guide in the selection of independent
variables to be fitted, and many of the regression computations were expressed in
terms of correlation coefficients. In recent years, however, their role has been
considerably diminished, and in at least one of the major texts on regression analysis,
correlation is mentioned less than a half-dozen times. The subject will be touched
upon lightly here so that the reader will not be entirely mystified by references to
correlation in the earlier literature.
The Simple Correlation Coefficient
The correlation coefficient can take values from -1 to +1 and is computed from the corrected sums of squares and products as r = Σxy / √((Σx²)(Σy²)). A value approaching +1 would indicate a strong direct linear association between the two variables, a value approaching -1 a strong inverse association, and a value near 0 would suggest little or no linear association.
In regression work we will seldom be dealing with strictly random samples. Usually
we try to get a wide range of values of the independent variable (X) in order to have
more precise estimates of the regression coefficients or to spot the existence of
curvilinear relationships. In addition, the data may not be from a normal population.
For these reasons, the sample correlation coefficient computed from regression data
will usually not be a valid estimate of the population correlation coefficient.
It will, however, give a measure of the degree of linear association between the
sample values Y and X and this has been one of its primary uses in regression. If we
have observations on a Y and several X variables, the X variable having the strongest
(nearest to +1 or -1) correlation with Y will give the best association with Y in a
simple linear regression. That is, a linear regression of Y on this X will have a
smaller residual than that of the simple linear regression of Y on any of the other
X variables.
In this use of the correlation coefficient, it must be remembered that it is a measure
of linear association. A low correlation coefficient may suggest that there is little
or no linear relationship between the observed values of the two variables. There may,
however, be a very strong curvilinear relationship. The simple correlation between
Y and the X variables and among the X variables themselves may also be used as a
somewhat confusing guide in the selection of independent variables to be used in the
fitting of a multiple regression. In general, when two independent variables are
highly correlated with each other, it is unlikely that a linear regression involving both
of these variables will be very much better than a linear regression involving only
one of them. If we had, for example, r_y1 = .84, r_y2 = .78, and a substantial correlation r_12 (= r_21) between X1 and X2, then a regression containing both variables would offer little improvement over the better of the two simple regressions Y = β0 + β1X1 or Y = β0 + β2X2. Of the two simple regressions, Y = β0 + β1X1 would give the better fit, since the correlation of Y and X1 is greater than the correlation of Y and X2. In practice, the correlations usually are not so large or the indications so clearcut. When a number of X variables are under consideration for use in a multiple regression, inspection of the simple correlation coefficients between Y and each X and between pairs of X's provides little more than a rough screening.
After fitting the linear regression of Y on X1, there would be little further association between Y and X2. A more exact way of putting this is that the correlation between X2 and the residuals about the regression of Y on X1 is very low (.034).
The general equation for the partial correlation between Y and Xj after fitting the linear regression of Y on Xk is

    r_yj.k = (r_yj - r_yk r_jk) / √( (1 - r_yk²)(1 - r_jk²) )

The process can be extended to the extent of the individual's inclination and energy. The correlation of X4 with the residuals of Y after fitting the regression of Y on X1, X2, and X3 would be computed in the same fashion.
The general equation for the correlation between Xj and the residuals of Y after fitting the regression of Y on X1, X2, ..., Xk is
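Where the formulas themselves are not reproduced, the partial correlation can always be obtained numerically by correlating residuals. The Python sketch below residualizes both Y and Xj on the remaining X variables and correlates the two sets of residuals; the function name and argument layout are assumptions for illustration.

import numpy as np

def partial_correlation(y, xj, others):
    """Correlation between y and xj after the linear effect of the other
    X variables has been removed from each (residuals about the fitted
    regressions).  'others' holds the conditioning variables, one column
    per variable."""
    y = np.asarray(y, float)
    xj = np.asarray(xj, float)
    Z = np.column_stack([np.ones(len(y)), np.asarray(others, float)])

    def residuals(v):
        b, *_ = np.linalg.lstsq(Z, v, rcond=None)
        return v - Z @ b

    ry, rx = residuals(y), residuals(xj)
    return (ry @ rx) / np.sqrt((ry @ ry) * (rx @ rx))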
The use of partial correlation coefficients as an aid in the selection of the best independent variables to be fitted in a multiple regression has lost much of its popularity since the advent of electronic computers. With these machines it has become practical simply to fit and compare the alternative regressions.
A commonly used measure of how well a regression fits a set of data is the
coefficient of determination, symbolized by r² if the regression involves only one independent variable and by R² if it involves more than one independent variable. For the common case of a regression with a constant term (β0) which has been fitted with corrected sums of squares and products, the coefficient of determination is calculated as

    R² = Reduction / Σy²

Thus, R² represents the proportion of the variation in Y that is associated with the regression on the independent variables.
If the regression has been fitted and the reduction computed with uncorrected sums of squares, the formula for R² is

    R² = (Reduction - (ΣY)²/n) / (ΣY² - (ΣY)²/n)

Then, since the reduction sum of squares is equal to the sum of each estimated coefficient times the right-hand side of its normal equation, we have an equivalent computing form for R².
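A short Python sketch of this computation, under the assumption that the model contains a constant term (the function name is illustrative):

import numpy as np

def r_squared(X, y):
    """Coefficient of determination for a regression of y on the columns of X
    (include a column of ones for the constant term).  Computed as the
    reduction in the corrected sum of squares of Y."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    reduction = y @ X @ b                 # uncorrected reduction sum of squares
    ct = y.sum() ** 2 / len(y)            # correction term (ΣY)²/n
    return (reduction - ct) / (y @ y - ct)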
Then, the test criterion is found to be not significant at the .05 level.
Thus, the linear regression of Y on X2 is not significantly better (from the standpoint of precision) than the regression of Y on X1. If X2 were significantly better than X1 but X1 were more easily measured, then selecting the better of the two regressions becomes a matter of deciding how much the extra precision of X2 is worth.
By working with partial correlation coefficients it is possible to extend this test
to the problem of which of two variables is better, when fitted after some specified
set of independent variables. The test cannot, unfortunately, be extended to the
comparison of two sets of independent variables.
SELECTED REFERENCES
In each of these methods, and throughout this Paper, more digits are carried than are warranted by the rules for significant digits. Unless this is done it is usually impossible to get any sort of check on the computations. After the computations have been checked, the coefficients should be rounded off to a number of digits commensurate with the precision of the original data.
Method I. --Basic Procedure. Basically, all methods involve manipulating the equations
so as to eliminate all but one unknown, and then solving for this unknown. Solutions
for the other unknowns are then obtained by substitution in the equations that arise
at intermediate stages. This may be illustrated by the following direct approach
which may be applied to any set of simultaneous equations.
Step 1. Divide through each equation by the coefficient of β̂1, giving
Step 2. Eliminate β̂1 by subtracting any one of the equations (say the first) from each of the others.
Step 3. Divide through each equation by the coefficient of β̂2, giving
Step 4. Subtract either equation (say the first) from the other to eliminate β̂2.
Step 5. Solve for β̂3.
Step 6. To solve for β̂2, substitute the solution for β̂3 in one of the equations (say the second) of Step 3.
Step 7. To solve for β̂1, substitute for β̂2 and β̂3 in one of the equations (say the third) of Step 1.
Check. Now substitute the solutions for β̂1, β̂2, and β̂3 in one of the original equations and verify that it is satisfied.
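The same elimination-and-back-substitution idea can be written compactly in Python; the sketch below assumes the normal-equation matrix is well conditioned and uses no pivoting, which is ordinarily safe for sums-of-squares matrices. Function and variable names are illustrative.

import numpy as np

def solve_normal_equations(A, g):
    """Solve the normal equations A b = g by straightforward elimination,
    as in Method I: eliminate one unknown at a time, solve for the last
    one, and back-substitute for the others."""
    A = np.array(A, dtype=float)
    g = np.array(g, dtype=float)
    n = len(g)
    for i in range(n):
        pivot = A[i, i]
        A[i] = A[i] / pivot            # divide the equation through by its leading coefficient
        g[i] = g[i] / pivot
        for k in range(i + 1, n):
            factor = A[k, i]
            A[k] = A[k] - factor * A[i]  # eliminate the unknown from the remaining equations
            g[k] = g[k] - factor * g[i]
    b = np.zeros(n)
    for i in range(n - 1, -1, -1):       # back-substitution
        b[i] = g[i] - A[i, i + 1:] @ b[i + 1:]
    return b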
Method II.--Forward Solution. A systematic procedure for solving the normal
equations is the so-called Forward Solution. It is a more mechanical routine and
perhaps a bit more difficult to learn and remember, but it has the advantage of
providing some supplementary information along the way. The steps will be described
using the symbols of table 2. The numerical results of these steps will be presented
In the last column are the reduction sums of squares that would be obtained by fitting the linear regression. These are computed as
Step 2. Rewrite the sums of squares and products from the X1 row.
Step 3. Divide each element in row 2 by the first element in that row (a11). Thus, q12 = a12/a11.
Step 4. Compute the matrix of sums of squares and products adjusted for the regression of Y on X1. The general equation is
The coefficients obtained at this stage are those that would be obtained for X2 or X3 when fitted along with X1. To indicate this, the symbols often used are b21 (or sometimes bY2.1) and b31 (or sometimes bY3.1). In the last columns are the reductions that would be attributable to X2 (or X3) when fitted after X1. In this example the reduction due to X1 alone is 1102.7690 and the reduction due to X2 after X1 (i.e., the gain due to X2) is 629.5758; so the total reduction due to fitting X1 and X2 would be the sum of these, or 1732.3448.
At this stage we could, if desired, compute a residual sum of squares and mean square and test whether X2 or X3 made a significant reduction when fitted after X1. If neither did, we might not wish to continue the fitting.
Step 5. Copy the adjusted sums of squares and products in the first row of Step 4.
Step 6. Divide each element of Step 5 by the value of the first element (a22.1).
Step 7. Compute the matrix of sums of squares and products adjusted for the regression of Y on X1 and X2.
The coefficient obtained at this stage is the same as β̂3, but we use b3.12 here to distinguish it from b31 and b3. The other two terms b2.13 (or β̂2) and b1.23 (or β̂1) are easily obtained from lines 6 and 3. Thus, from line 6
The stepwise procedure is helpful for screening a large number of independent
variables in order to select those that are likely to give a good fit to the sample
data. It should be noted, however, that the procedure is strictly exploratory. The
probabilities associated with tests of hypotheses that are selected by examination of
the data are not what they seem to be. Significance tests made in this way do not have
the same meaning that they have when applied to a single preselected hypothesis.
It might also be noted that though the stepwise procedure will frequently lead to
the linear combination of the independent variables that will result in the smallest
residual mean square, it does not always do so. This can only be done by fitting all
possible combinations and then comparing their residuals. Here again, tests of
significance may be informative, but the exact probabilities are unknown.
The elements of the matrix to be inverted will be symbolized by aij. Since we are working with the uncorrected sums of squares and products we will let i and j start at zero. If we were working with
corrected sums of squares and products we would usually let i and j start at one.
The results of each step in the method will be shown symbolically in table 4 and
numerically in table 5. In following these steps it is important to notice the pattern of
the computations. Once this has been recognized, the extension to a matrix of any
size will be obvious.
Step 1. In the A Columns write the upper right-half of the matrix to be inverted.
Step 2. In the I Columns write a complete identity matrix of the same dimensions as the matrix to be inverted.
Step 3. In the check column perform the indicated summations. For row 0 the sum will be a00 + a01 + a02 + a03 + 1. For row 1 the sum will be a10 + a11 + a12 + a13 + 1, and so forth. Note that a10 = a01 (the matrix is symmetrical).
Step 4. Copy the entries from row 0. In table 4, the entry in the first I Column (=1) has been symbolized by d00.
Step 5. Divide each element (including the check sum) of line 4 by the first element (a00) in that line. The sum of all of the elements in the A and I Columns will equal the value in the check column if no error has been made.
Step 6. The elements in this line (including the check) are obtained by multiplying each element of line 4 (except the first) by b01 and subtracting this quantity from the corresponding elements of row 1. Thus, a11.0 = a11 - b01 a01 and a12.0 = a12 - b01 a02. The sum of these elements must equal the value in the check column.
Step 7. Divide each element in line 6 by the first element in that line (a11.0). Check.
Step 8. The elements in this line are obtained by subtracting two quantities from each element of row 2. The two quantities are b02 times (the element in line 4 below the row 2 element) and b12.0 times (the element in line 6 below the row 2 element). The elements in line 8 must equal the value computed for the check column.
Step 9. Divide each element of line 8 by the first element in this line. Check.
Step 10. The elements of this line are obtained by subtracting three quantities from each element of row 3. The three quantities are b03 times (the line 4 element below the row 3 element), b13.0 times (the line 6 element below the row 3 element), and b23.01 times (the line 8 element below the row 3 element). The sum of the elements in line 10 must equal the computed value in the check column.
Step 11. Divide each element in line 10 by the first element in that line (a33.012).
Step 13. As a final check, multiply the original matrix by the inverse. The product should be the identity matrix.
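In practice the entire forward routine can be delegated to a library inversion, keeping only the final identity check. A minimal Python sketch (the function name and tolerance are assumptions):

import numpy as np

def invert_with_check(A, tol=1e-8):
    """Invert a symmetric matrix of sums of squares and products and verify,
    as in the final step above, that the product with the original matrix
    is the identity matrix."""
    A = np.asarray(A, dtype=float)
    C = np.linalg.inv(A)                  # the matrix of c-multipliers
    if not np.allclose(A @ C, np.eye(len(A)), atol=tol):
        raise ArithmeticError("inverse failed the identity-matrix check")
    return C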
I. Y = a + bX -- Straight line
    Linear Model: Y = b0 + b1X1

II. (Y - a) = k(X - b)²
    Linear Model: Y = b0 + b1X + b2X²
    Y-intercept is at Y = kb² + a
    X-intercepts are at X = b ± √(-a/k) (complex if -a/k is negative)
    Estimates:

III. (Y - a) = k/X -- Hyperbola
    Estimates:

IV.
    Estimates:

V.
    Estimates:

VI.
    Estimates: â = anti-log b0, b̂ = b1, ĉ = anti-log b2

VII. Y = ab^(X - c)
    Linear Model: log Y = b0 + b1X

VIII. 10^Y = aX^b
    Linear Model: Y = b0 + b1(log X)
    Estimates: â = anti-log b0, b̂ = b1
                         Yields
Treatment                                        Sum
    I          12     17     16     15            60
    II         14      9     13     12            48
    III        11     20     18     13            62
                                        Total    170
For the standard analysis of variance of this design we first calculate the correction
term and the sums of squares for total, treatment, and error.
    Correction term (CT) = (170)²/12 = 2408.3333
    Total SS = ΣY² - CT = 2518 - 2408.3333 = 109.6667
    Treatment SS = (60² + 48² + 62²)/4 - CT = 28.6667
    Error SS = 109.6667 - 28.6667 = 81.0000
Source          df        SS         MS        F
Treatments       2      28.6667    14.3333    1.593
Error            9      81.0000     9.0000
Total           11     109.6667
For any plot receiving treatment I, the independent variables will have values X'1 = 1 and X'2 = 0; for treatment II plots, the values are X'1 = 0 and X'2 = 1; and for treatment III plots, X'1 = -1 and X'2 = -1. Thus the study data can be listed as follows:

Treatment      Y = Yield     X'1     X'2
    I              12          1       0
                   17          1       0
                   16          1       0
                   15          1       0
    II             14          0       1
                    9          0       1
                   13          0       1
                   12          0       1
    III            11         -1      -1
                   20         -1      -1
                   18         -1      -1
                   13         -1      -1
Sums              170          0       0
The normal equations for fitting the revised model (with uncorrected sums of squares and products) are:

    12 β̂0                 = 170
          8 β̂1 + 4 β̂2   =  -2
          4 β̂1 + 8 β̂2   = -14

The solution is β̂0 = 14.1667, β̂1 = 0.8333, β̂2 = -2.1667.

    Reduction = (14.1667)(170) + (0.8333)(-2) + (-2.1667)(-14) = 2437.0062, with 3 df
    Residual = ΣY² - Reduction = 2518 - 2437.0062 = 80.9938, with 9 df

Source                          df        SS          MS
Reduction (β̂0, β̂1, β̂2)          3     2437.0062
Reduction (β̂0 alone)             1     2408.3390
Gain due to β̂1 and β̂2           2       28.6672    14.3336
Residual                          9       80.9938     8.9993
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Total                            12     2518

    F (2/9 df) = 14.3336 / 8.9993 = 1.593
Except for rounding errors and differences in terminology, this is the same result
as the standard test procedure.
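As a check on the equivalence just demonstrated, the following Python sketch carries out the completely randomized analysis as a regression on effect-coded dummy variables. The function name and coding choices are illustrative assumptions, not part of the original computation.

import numpy as np
from scipy import stats

def anova_by_regression(yields):
    """Completely randomized ANOVA carried out as a regression on
    effect-coded dummy variables; 'yields' is a list of 1-D arrays,
    one array per treatment."""
    y = np.concatenate([np.asarray(g, float) for g in yields])
    k = len(yields)
    n = len(y)
    # Effect coding: treatment i gets 1 in column i; the last treatment gets -1 everywhere.
    cols = []
    for i in range(k - 1):
        col = np.concatenate([np.full(len(g), 1.0 if j == i else (-1.0 if j == k - 1 else 0.0))
                              for j, g in enumerate(yields)])
        cols.append(col)
    X_full = np.column_stack([np.ones(n)] + cols)
    X_mean = np.ones((n, 1))

    def reduction(X):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return y @ X @ b

    treat_ss = reduction(X_full) - reduction(X_mean)
    resid_ss = y @ y - reduction(X_full)
    F = (treat_ss / (k - 1)) / (resid_ss / (n - k))
    return F, stats.f.sf(F, k - 1, n - k)

# With the yields above, anova_by_regression([[12,17,16,15],[14,9,13,12],[11,20,18,13]])
# gives F close to 1.59 with 2 and 9 degrees of freedom.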
where the Y value is the yield of the j-th plot of the i-th treatment, the treatment effects are expressed as departures from the overall mean (so that their sum is 0), and the error term is the random element associated with the j-th plot of the i-th treatment.
For the randomized block design with one replication of each treatment in each
block, the model is
where the block effect for block i is expressed as a departure from the overall mean, so that the sum of the block effects is 0.
Other terms are as previously defined.
Thus, each experimental design is defined by some linear model. The analysis of
variance for the design involves a least-squares fitting of the model under various
hypotheses and testing the differences in residuals. As in any regression analysis,
the hypothesis to be tested should be specified prior to examination of the data.
APPENDIX E-Tables
Table 6.--The distribution of F
Reproduced by permission of the author and publishers from table 10.5.3 of Snedecor's Statistical
Methods (ed. 5), 1956, Iowa State University Press, Ames, Iowa. Permission has also been
granted by the literary executor of the late Professor Sir Ronald A. Fisher and Oliver and Boyd
Ltd., publishers, for the portion of the table computed from Dr. Fisher's table VI in Statistical
Methods for Research Workers.
Table reproduced in part from table III of Fisher and Yates' Statistical Tables for Biological, Agricultural, and Medical Research, published by Oliver and Boyd Ltd., Edinburgh, Scotland. Permission has been given by Dr. F. Yates, by the literary executor of the late Professor Sir Ronald A. Fisher, and by the publishers.
Table 8.--Areas of the normal curve (cumulative areas from the mean for standard normal deviates of 0.00 through 3.9, tabled by hundredths).