9/6/2017 Excel's LINEST() Function Deconstructed | InformIT

Excel's LINEST() Function Deconstructed
By Conrad Carlberg
Feb 25, 2013

Microsoft Excel has for many years included a worksheet function called LINEST(), which returns a multiple regression analysis of a single outcome or predicted variable on one or more predictor variables. LINEST() returns a regression equation, standard errors of regression coefficients, and goodness-of-fit statistics. In the first of three articles, Excel expert Conrad Carlberg, author of Predictive Analytics: Microsoft Excel, discusses issues regarding LINEST() that have not been covered sufficiently, or even accurately, in the past.

Microsoft Excel's LINEST() worksheet function has a long and checkered history. It is
capable of returning a multiple regression analysis with up to 64 predictor variables
and one outcome or "predicted" variable. (Early versions permitted up to 16 predictor
variables.)

LINEST() performs quite well in most situations. It returns accurate regression coefficients and intercepts, the standard errors of the coefficients and of the intercept, and six summary statistics regarding the regression: R2, the standard error of estimate, the F ratio for the full regression, the degrees of freedom for the residual, and the sums of squares for the regression and for the residual.

But LINEST() has some drawbacks, ranging from the inconvenient to the potentially disastrous.

http://www.informit.com/articles/article.aspx?p=2019171
Note

Before continuing with the article, please download the Excel workbook on which
this article is based.

An Inconvenient Problem
One difficulty is that the regression coefficients and their standard errors are shown in the reverse of the order in which their associated underlying variables appear on the worksheet. See Figure 1.

Figure 1 LINEST() returns coefficients in reverse order of the worksheet.

In Figure 1, the predictor variables are years of education and years of age. Education
data is in column A, and Age data is in column B. The predicted variable, Income, is in
column C.

The formula that uses the LINEST() function is array-entered (with Ctrl+Shift+Enter) in the range E5:G9. The formula in this example is:

=LINEST(C2:C21,A2:B21,TRUE,TRUE)

Note

LINEST()'s third argument, called const, is set to TRUE in the example just given.
You can also omit the argument and Excel regards that as setting it to TRUE:

=LINEST(C2:C21,A2:B21,,TRUE)

Only by setting the third argument to FALSE can you force LINEST() to remove
the constant from the regression equation.

From the author of

The problem is that the regression coefficient for Age is in cell E5, and the coefficient for Education is in cell F5: in left-to-right order, the coefficient for Age comes before the coefficient for Education. But in the underlying data set, the Education data (column A) precedes the Age data (column B).

(The intercept, in cell G5 in Figure 1, always appears rightmost in the LINEST() results.)

So if you wanted to use the regression equation to estimate the income of the first person in Row 2, you would need to use this formula (parentheses included for clarity only):

=(E5*B2)+(F5*A2)+G5

instead of the more natural and more easily interpreted:

=(E5*A2)+(F5*B2)+G5

With just two variables, this is a really minor issue. But with 5, 10, perhaps 20 variables, it becomes exasperating. To complete the regression equation, you need to proceed left-to-right for the variables and right-to-left for the coefficients. With 20 of each, it's tedious and error-prone.

And there is absolutely no good reason for it: statistical, theoretical, or programmatic.


I recognize that one could use the TREND() function instead of assembling the regression formula, coefficient by coefficient and variable by variable, but there are often times when you need to see the result of modifying one variable or coefficient; the only way to do that is to call them out separately in the full equation.

Nevertheless, this is principally a matter of convenience. The issues that I'm going to discuss in subsequent papers are more serious, particularly if you're still using a version of Excel prior to 2003.

This paper continues with a discussion of how the results provided by LINEST() can be calculated, and how you can replicate those results using Excel's native worksheet functions. A little matrix algebra is needed, and it will be necessary for you to be familiar with the concepts behind the worksheet functions MMULT(), MINVERSE(), and TRANSPOSE().

Once you've seen how to replicate the LINEST() results using straightforward matrix algebra, you'll be in a position to see how Microsoft got it badly wrong when it offered LINEST()'s third option, const. That option calculates regression statistics "without the constant," also known as "forcing the intercept through zero." While the associated problems have been fixed, anyone who is still using a version of Excel prior to 2003 is in trouble if that option is selected, whether in LINEST(), TREND(), or the Regression tool in the Data Analysis add-in.

Note

In fairness, I should note that Microsoft was in good company. In 1986, well before LINEST() came along, Lee Wilkinson wrote in the manual for Systat, in its discussion of the MGLH program, "The total sum of squares must be redefined for a regression model with zero intercept. It is no longer centered about the mean of the dependent variable. Other definitions of sums of squares can lead to strange results like negative squared multiple correlations." Alas, Microsoft's code developers were not expert in statistical theory, any more than were the other developers Wilkinson was referring to.

You will see in a subsequent paper how Microsoft has changed its algorithm to avoid returning a negative R2, and how it came about in the first place. This is necessary information for anyone needing to migrate a regression analysis from, say, Excel 2002 to Excel 2010, or to understand how Excel 2002's results can be so different from Excel 2010's. Even if you're using a version subsequent to Excel 2003, the problems still show up in the R2 values associated with chart trendlines.

Microsoft has also included in the code for LINEST() a method for dealing with severe multicollinearity in the X matrix. (Multicollinearity is just a hifalutin word for two or more predictor variables that are perfectly correlated, or virtually so.) Microsoft deserves kudos for recognizing that the problem existed. But the way that the solution is manifested in the results of LINEST() since Excel 2003 is potentially disastrous. With the information in this paper, you'll be in a position to avoid that particular LINEST() bug (well, call it a feature).


Assembling LINEST() Results from Other Functions


In this section, I'm going to show you how to assemble the different results you get from LINEST() using other worksheet functions. Some of these methods will be clear, even obvious. Others will seem unclear, and they aren't at all intuitively rich. But by taking things apart, I think you'll find it much easier to understand the way they work together.

Getting the Regression Coefficients


The first step is to lay out the data as shown in Figure 2.

Figure 2 Add a column that contains nothing but 1's to the range of predictor
variables.

Figure 2 shows that a column containing 1's is included with the other predictor, or X, values. This column enables the matrix operations described below to calculate an intercept and its standard error. Although you don't see that column of 1's when you run LINEST() directly on your input data, Excel adds it (invisibly) on your behalf.

Note

If you add the column of 1's and then call LINEST() without the constant (setting
LINEST()'s third argument to FALSE), Excel doesn't add the 1's for you, and
you'll get the same regression coecients and standard errors as you would if
you omitted the 1's and used LINEST() with the third argument set to TRUE.
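The same layout can be mirrored in a short NumPy sketch (NumPy assumed available; the data values are invented for illustration): the column of 1's becomes the leftmost column of the X matrix, which is what lets the matrix algebra below produce an intercept.

```python
import numpy as np

# Hypothetical predictor data mirroring the worksheet layout
education = np.array([12, 16, 12, 18, 14], dtype=float)
age       = np.array([33, 41, 28, 50, 37], dtype=float)

# Prepend a column of 1's, as Figure 2 does explicitly on the worksheet
X = np.column_stack([np.ones(len(education)), education, age])
```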

Getting the Sum of Squares and Cross-Products (SSCP)


You'll need access to what's called the transpose of the data in B3:E22. You can do
that explicitly on the worksheet using Excel's TRANSPOSE() function. In Figure 2, the
range H2:AA5 contains this array formula:

=TRANSPOSE(B3:E22)

(Recall that you enter an array formula using Ctrl+Shift+Enter instead of simply Enter.)

With those two matrices set up, you can get what's called the sum of squares and cross-products matrix, often called the SSCP matrix. Use this array formula:

=MMULT(H2:AA5,B3:E22)


Note

In the notation used by matrix algebra, it's conventional to show in boldface a symbol such as "X" that represents a matrix. Matrix transposition is denoted with an apostrophe, so X' means the transposition (or simply the transpose) of X. And the inverse of a matrix is indicated by a "-1" superscript: the inverse of the matrix Y is written Y^-1.

If you don't want to bother putting the transpose of the X matrix directly on the
worksheet, you could use this array formula instead to get the SSCP matrix:

=MMULT(TRANSPOSE(B3:E22),B3:E22)

Excel's MMULT() function performs matrix multiplication. Here, the transpose of the X
matrix (B3:E22) is postmultiplied by the X matrix.

Note

Unlike regular algebra, matrix multiplication is not commutative. If X and Y are


both matrices, XY does not necessarily give the same result as YX.
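Continuing the NumPy sketch (hypothetical data), X'X produces the 3-by-3 SSCP matrix, and reversing the order of multiplication produces a different matrix entirely, which is the point of the note above:

```python
import numpy as np

education = np.array([12, 16, 12, 18, 14], dtype=float)
age       = np.array([33, 41, 28, 50, 37], dtype=float)
X = np.column_stack([np.ones(5), education, age])

sscp = X.T @ X    # X'X: the SSCP matrix, 3 x 3
other = X @ X.T   # XX': a 5 x 5 matrix -- multiplication order matters
```

Note that the top-left element of the SSCP matrix is just the number of observations, because it sums the squared 1's.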

Getting the Inverse of the SSCP Matrix


The next step is to get the inverse of the SSCP matrix. A matrix's inverse is analogous to an inverse in simple arithmetic. The inverse of the number 4 is 1/4: When you multiply a number by its inverse, you get 1.

Similarly, when you multiply a matrix by its inverse, you get a new matrix with 1's in
its main diagonal and 0's everywhere else. Figure 3 shows the SSCP matrix in G3:J6,
its inverse in G10:J13, and the result of the multiplication of the two matrices in
L10:O13.
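In the NumPy sketch (hypothetical data), MINVERSE()'s role is played by np.linalg.inv(), and multiplying the SSCP matrix by its inverse recovers the identity matrix, up to floating-point noise:

```python
import numpy as np

education = np.array([12, 16, 12, 18, 14], dtype=float)
age       = np.array([33, 41, 28, 50, 37], dtype=float)
X = np.column_stack([np.ones(5), education, age])

sscp = X.T @ X
sscp_inv = np.linalg.inv(sscp)   # what MINVERSE() returns on the worksheet

identity = sscp @ sscp_inv       # 1's on the main diagonal, 0's elsewhere
```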

Figure 3 The matrix in L10:O13 is called an identity matrix.


Calculating the Regression Coefficients and Intercept
I mentioned earlier that much of the derivation of the results that LINEST() returns is not intuitively rich. The inverse of the SSCP matrix is an example of that. There's much information buried in the matrix inverse, but no flash of intuition will tell you that it's hidden there, or even why it's there. For example, see Figure 4.

Figure 4 The SSCP matrix and its inverse, combined with the X and Y matrices, return the regression coefficients and the intercept.

In Figure 4, notice the range G18:J18. It contains this array formula:

=TRANSPOSE(MMULT(G10:J13,MMULT(TRANSPOSE(B3:E22),A3:A22)))

In words, the formula uses matrix multiplication via the MMULT() function to combine the transposed X matrix (B3:E22) and the Y matrix (A3:A22) with the inverse of the SSCP matrix (G10:J13). The result in G18:J18 is the intercept (G18) and the regression coefficients (H18:J18). The coefficients are in the same order that the underlying values appear on the worksheet; that is, columns C, D, and E contain the values for variables X1, X2, and X3, respectively, and cells H18, I18, and J18 contain the associated regression coefficients.

Cells G21:J21 contain the first row of the LINEST() results for the same underlying data set (except that the 1's in column B are omitted from the LINEST() arguments because LINEST() supplies them for you). Notice that the values for the intercept and the coefficients are identical to those in row 18. The only difference is that LINEST() has returned them out of order.

In sum, to get the intercept and regression coecients using matrix algebra instead
of using LINEST(), take the following general steps:

1. Get the SSCP matrix using X'X. Use MMULT() and TRANSPOSE() to
postmultiply the transpose of the X matrix by the X matrix.
2. Use MINVERSE() to calculate the inverse of the SSCP matrix.
3. Use the array formula given above and repeated here to calculate the
intercept and coefficients:

=TRANSPOSE(MMULT(G10:J13,MMULT(TRANSPOSE(B3:E22),A3:A22)))
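Those three steps correspond to the textbook formula b = (X'X)^-1 X'y. A NumPy sketch of the same recipe (hypothetical data, with a made-up outcome column standing in for Income):

```python
import numpy as np

# Hypothetical data: two predictors plus an outcome column
education = np.array([12, 16, 12, 18, 14], dtype=float)
age       = np.array([33, 41, 28, 50, 37], dtype=float)
income    = np.array([40, 65, 45, 80, 55], dtype=float)
X = np.column_stack([np.ones(5), education, age])

# Steps 1-3: SSCP matrix, its inverse, then (X'X)^-1 X'y
b = np.linalg.inv(X.T @ X) @ (X.T @ income)
# b[0] is the intercept; b[1] and b[2] are the coefficients,
# in the same left-to-right order as the predictors
```

As a check, the same vector comes back from NumPy's dedicated least-squares routine.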

Getting the Sum of Squares Regression and Residual


It probably seems a little perverse to go from the calculation of regression coefficients to sums of squares, skipping over standard errors, R2, F tests, and so on. But you need the sums of squares to calculate those other statistics.

Before getting to the matter of calculating the sums of squares, it's helpful to review
the meaning of the sum of squares regression and the sum of squares residual.

A sum of squares, in most statistical contexts, is the sum of the squares of the differences (or deviations) between individual values and the mean of the values. So if our values are 2 and 4, the mean is 3. 2 - 3 is -1, and the squared deviation is +1. 4 - 3 is +1, and the squared deviation is +1. Therefore, the sum of squares is 1 + 1, or 2.


Note

The term "sum of squares" dates to the early part of the 20th century and is something of a misnomer. The term suggests that the task is to find the sum of the squared values, not the sum of the squared deviations from the mean. In this case, Excel's function names are more descriptive than the statistical jargon. Excel uses the function DEVSQ() to sum the squared deviations, and the function SUMSQ() to sum the squares of the raw values.
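The distinction is easy to state in code. A minimal Python sketch of the two functions (re-implementations for illustration, not Excel's own code), using the 2-and-4 example from the text:

```python
def devsq(values):
    """Sum of squared deviations from the mean -- what Excel's DEVSQ() does."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def sumsq(values):
    """Sum of the squares of the raw values -- what Excel's SUMSQ() does."""
    return sum(v ** 2 for v in values)

# For the values 2 and 4: deviations from the mean are -1 and +1,
# so devsq gives 2, while sumsq gives 4 + 16 = 20.
```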

Our purpose in calculating those two sums of squares is to divide the total sum of
squares into two parts:

The sum of squares regression is the sum of the squared deviations of the
Y values that are predicted by the regression coefficients and intercept,
from the mean of the predicted values.
The sum of squares residual is the sum of the squared deviations of the
differences between the actual Y values and the predicted Y values, from
the mean of those deviations.

Calculating the Predicted Values


Those two definitions of sums of squares are fairly dense when written in English. It's usually easier to understand what's going on if you think about them in the context of an Excel worksheet. See Figure 5.

Figure 5 Calculating the sums of squares

In Figure 5, I have repeated the regression coefficients and the intercept, as calculated using the matrix algebra discussed earlier, in the range G3:J3. Because they appear in the correct order, you can easily use them to calculate the predicted Y values as shown in the range L3:L22. This is the formula that's used in cell L3:

=$G$3+SUMPRODUCT(C3:E3,$H$3:$J$3)

The intercept and coefficients in G3:J3 are identified using dollar signs and therefore absolute addressing. The X values in C3:E3 are identified using relative addressing. Therefore, you can drag and drop or copy and paste from cell L3 into the range L4:L22.

Just as a check, Figure 5 also shows the predicted Y values in M3:M22, using this array formula in that range:

=TREND(A3:A22,C3:E22)

You'll note that the predicted values using matrix algebra are identical to the predicted values using TREND(). There are actually slight differences, but they do not begin to show up until the 14th decimal place. (For example, the difference between cell L8 and cell M8 is 0.000000000000057.)

Note

These minuscule differences are due to the fact that Excel 2003 and subsequent versions do not use traditional matrix algebra in LINEST(), but instead use an approach called QR decomposition. The third article in this series has a brief discussion of that approach and the rationale for its usage.

Calculating the Prediction Errors


The values shown in Figure 5, in the range O3:O22, are the errors in the predicted values. They are simply the differences between the actual Y values in A3:A22 and the predicted values in L3:L22. So, for example, the formula in cell O3 is =A3-L3.

Calculating the Sums of Squares


With the predicted values and the errors of prediction, we're in a position to calculate
the sums of squares. The sum of squares regression is found with this formula in cell
G24:

=DEVSQ(L3:L22)

and the sum of squares residual is found with a similar formula in cell H24:

=DEVSQ(O3:O22)

Notice that the two sums of squares total to 21612.95. This is the same value as appears in cell G26. The formula in G26 is:

=DEVSQ(A3:A22)

which is the sum of the squared deviations of the original Y values. So, the process
described in this section has accomplished the following:

Predicted Y values on the basis of the combination of the X values and the
regression coecients and intercept.
Obtained the sum of squared deviations of the predicted Y values (the sum
of squares regression).
Calculated the errors of prediction by subtracting the predicted Y values
from the actual Y values.
Obtained the sum of squared deviations of the errors of prediction (the sum
of squares residual).
Demonstrated that the total sum of squares of the actual Y values has been
divided into two portions: the sum of squares regression and the sum of
squares residual.
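The whole process just summarized can be replayed in a NumPy sketch (hypothetical data as before). The key fact the worksheet demonstrates, that the regression and residual sums of squares add up to the total, holds whenever the model includes a constant:

```python
import numpy as np

education = np.array([12, 16, 12, 18, 14], dtype=float)
age       = np.array([33, 41, 28, 50, 37], dtype=float)
income    = np.array([40, 65, 45, 80, 55], dtype=float)
X = np.column_stack([np.ones(5), education, age])

b = np.linalg.inv(X.T @ X) @ (X.T @ income)
predicted = X @ b              # the L3:L22 column in Figure 5
errors = income - predicted    # the O3:O22 column

def devsq(v):                  # Excel's DEVSQ()
    return ((v - v.mean()) ** 2).sum()

ss_regression = devsq(predicted)
ss_residual   = devsq(errors)
ss_total      = devsq(income)
```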

Calculating the Regression Diagnostics


Now that we have the sum of squares regression and the sum of squares residual, it's
easy to get the results that help you diagnose the accuracy of the regression
equation.

Calculating R2
The R2 is simply the proportion of variability in the Y values that can be attributed to variability in the best combination of the X variables. That best combination is the result of applying the regression coefficients to the X variables; that is, the best combination is represented by the predicted Y values.

Therefore, the R2 is calculated by this ratio:

(Sum of Squares Regression) / (Sum of Squares Total)

Because the sum of squares total is the sum of the regression and the residual sums
of squares, you can easily calculate R2 on the worksheet as shown in Figure 6.

Figure 6 Calculating the goodness-of-fit statistics

In Figure 6, cell G14 contains this formula:

=G12/(G12+H12)

which returns the ratio of the regression sum of squares to the total sum of squares.
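In the running NumPy sketch (hypothetical data), the same ratio is one line, and it agrees with the equivalent 1 minus SSresidual/SStotal form:

```python
import numpy as np

education = np.array([12, 16, 12, 18, 14], dtype=float)
age       = np.array([33, 41, 28, 50, 37], dtype=float)
income    = np.array([40, 65, 45, 80, 55], dtype=float)
X = np.column_stack([np.ones(5), education, age])

b = np.linalg.inv(X.T @ X) @ (X.T @ income)
predicted = X @ b

def devsq(v):
    return ((v - v.mean()) ** 2).sum()

ss_regression = devsq(predicted)
ss_residual   = devsq(income - predicted)

# Mirrors the worksheet formula =G12/(G12+H12)
r_squared = ss_regression / (ss_regression + ss_residual)
```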

Calculating the Standard Error of Estimate


At this point, you need to keep in mind the way that you've set up your inputs. In Figure 6, I've set things up so that the column of 1's is shown explicitly on the worksheet. That's because the column is needed if you're going to obtain the value of the regression equation's intercept by means of matrix algebra: the instances of MMULT(), TRANSPOSE(), and MINVERSE() that I've discussed in this paper.

In that case, if you're showing the column of 1's explicitly, you get the degrees of freedom for the sum of squares residual by subtracting the number of X variables on the worksheet from the number of observations, or rows, in the matrix of X values.

In the example shown in Figure 6, the number of observations is 20, found in rows 3 through 22. The number of variables is 4, found in columns B through E. Therefore, the number of degrees of freedom for the sum of squares residual is 16: 20 - 4. You can confirm this from the LINEST() results in Figure 6, cells G6:J10, where the degrees of freedom shows up in cell H9.

On the other hand, if you want to use LINEST() directly, you don't need to supply the column of 1's on the worksheet: Excel supplies the 1's for you and you never see them. But if you're going to determine the degrees of freedom residual for yourself, then subtract the number of X variables (in this case, 3) from the number of observations (20) and then subtract 1 from the result to get 16.

In fact, you'll find that most intermediate statistics texts tell you that the degrees of freedom for the residual sum of squares is N - k - 1, where N is the number of observations, k is the number of predictor variables, and 1 is for the column of 1's that you never see unless you arrange for them yourself.
So, to get the standard error of estimate, divide the sum of squares residual by the degrees of freedom for the residual, and take the square root of the result. The formula used in cell G15 of Figure 6 is:

=SQRT(H12/16)

The result is identical to that provided in the LINEST() results in cell H8.
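In the NumPy sketch (hypothetical data: 5 observations and 2 visible predictors, so N - k - 1 = 2 residual degrees of freedom), the standard error of estimate is:

```python
import numpy as np

education = np.array([12, 16, 12, 18, 14], dtype=float)
age       = np.array([33, 41, 28, 50, 37], dtype=float)
income    = np.array([40, 65, 45, 80, 55], dtype=float)
X = np.column_stack([np.ones(5), education, age])

b = np.linalg.inv(X.T @ X) @ (X.T @ income)
errors = income - X @ b

df_residual = len(income) - X.shape[1]   # N minus all columns of X, 1's included
ss_residual = ((errors - errors.mean()) ** 2).sum()

std_error_of_estimate = np.sqrt(ss_residual / df_residual)
```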

Calculating the F Ratio for the Regression


http://www.informit.com/articles/article.aspx?p=2019171 9/12
9/6/2017 Excel's LINEST() Function Deconstructed | | InformIT
There are a couple of ways to go about calculating the F ratio for the full regression. Both involve using the degrees of freedom for the residual and the degrees of freedom for the regression.

The prior section discussed how to get the degrees of freedom for the residual. The
degrees of freedom for the regression is the number of X variables minus 1. So, if you
have supplied the column of 1's explicitly on the worksheet, as in Figure 6, there are
four X variables, and the degrees of freedom for the regression is 3.

If you use LINEST() and do not supply a column of 1's to it as an X variable, because Excel does that on your behalf, you still have four X variables; it's just that you're not looking at one of them. So the degrees of freedom for the regression is still 4 - 1, or 3, in this example.

One way to calculate the F ratio is to use the R2 value. Figure 6 does that in cell G17, where the formula is:

=(G14/3)/((1-G14)/16)

In words, the numerator is the R2 value divided by the regression degrees of freedom. The denominator is (1 - R2) divided by the residual degrees of freedom.

Another way uses the sums of squares instead of the R2 value. It's mathematically
equivalent because we use the sums of squares to calculate the R2 value. The formula
used in cell G18 of Figure 6 is:

=(G12/3)/(H12/16)

The numerator is the sum of squares regression divided by its degrees of freedom.
The denominator is the sum of squares residual divided by its degrees of freedom.

You may know that a sum of squared deviations divided by its degrees of freedom is a
variance, often termed a mean square. That's what we have in cell G18: one variance
divided by another. And the ratio of two variances is an F ratio.

Here, we have the variance of the Y scores as predicted by the regression equation, divided by the variance of the errors in those predictions. If the resulting ratio is meaningfully larger than 1.0, we regard the regression as a reliable one: an outcome that we expect to be similar if we repeat this research with a different but similarly obtained sample of observations. And you can test the reliability of the observed F ratio by using Excel's F.DIST() function.
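Both routes to the F ratio can be checked against each other in the NumPy sketch (hypothetical data; k = 2 visible predictors and 2 residual degrees of freedom):

```python
import numpy as np

education = np.array([12, 16, 12, 18, 14], dtype=float)
age       = np.array([33, 41, 28, 50, 37], dtype=float)
income    = np.array([40, 65, 45, 80, 55], dtype=float)
X = np.column_stack([np.ones(5), education, age])

b = np.linalg.inv(X.T @ X) @ (X.T @ income)
predicted = X @ b

def devsq(v):
    return ((v - v.mean()) ** 2).sum()

ss_reg, ss_resid = devsq(predicted), devsq(income - predicted)
df_reg, df_resid = 2, 2          # k, and N - k - 1, for this tiny data set
r2 = ss_reg / (ss_reg + ss_resid)

f_from_r2 = (r2 / df_reg) / ((1 - r2) / df_resid)
f_from_ss = (ss_reg / df_reg) / (ss_resid / df_resid)  # mean square over mean square
```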

Getting the Standard Errors


The final task in deconstructing the LINEST() function is to calculate the values of the standard errors of the intercept and the regression coefficients. These values are returned in the second row of the LINEST() results. Figure 7 shows the required calculations.



Figure 7 Calculating the standard errors

Figure 7 shows the SSCP matrix and its inverse, shown earlier in Figure 4. To get the
standard errors of the regression coecients and the intercept, we need to multiply
the inverse of the SSCP matrix by the mean square for the residual.

Figure 7 shows the inverse of the SSCP matrix in cells G12:J15.

The prior section showed how to calculate the mean square residual: simply divide the
sum of squares residual by the residual degrees of freedom. Figure 7 does that for
this example in cell M14, using this formula:

=L14/16

Note that L14 contains the sum of squares residual, and 16 is the degrees of freedom
for the residual.

Note

Cell L14 in Figure 7 calculates the sum of squares residual in a more concise fashion than is done in Figures 5 and 6, where the errors of prediction (the residuals) are shown explicitly and the DEVSQ() function is used to get the sum of squares. Cell L14 in Figure 7 uses this array formula instead:

=SUM(((A3:A22)-(MMULT(B3:E22,TRANSPOSE(G3:J3))))^2)

which accomplishes the same result within the formula instead of showing the
intermediate calculations on the worksheet.

The matrix shown in Figure 7, cells G18:J21, is the result of multiplying the inverse of
the SSCP matrix by the mean square residual. The array formula is:

=G12:J15*M14

The square roots of the elements in the main diagonal of the matrix in G18:J21 are
the standard errors for the regression equation. They are shown in Figure 7, in cells
G24:J24. The formulas are as follows:

G24: =SQRT(G18)
H24: =SQRT(H19)
I24: =SQRT(I20)
J24: =SQRT(J21)

The relevant portion of the LINEST() results is also shown in Figure 7, in cells L24:O24. Note that the values in that range are identical to those in G24:J24, but of course LINEST() returns them in reverse of the order in which the original variables are entered on the worksheet.
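The whole second row of LINEST() can be reproduced in the NumPy sketch (hypothetical data): scale the inverse of the SSCP matrix by the mean square residual and take square roots down the main diagonal:

```python
import numpy as np

education = np.array([12, 16, 12, 18, 14], dtype=float)
age       = np.array([33, 41, 28, 50, 37], dtype=float)
income    = np.array([40, 65, 45, 80, 55], dtype=float)
X = np.column_stack([np.ones(5), education, age])

sscp_inv = np.linalg.inv(X.T @ X)
b = sscp_inv @ (X.T @ income)
errors = income - X @ b

df_residual = len(income) - X.shape[1]
ms_residual = (errors ** 2).sum() / df_residual   # the mean square residual

covariance = sscp_inv * ms_residual               # the G18:J21 matrix of Figure 7
standard_errors = np.sqrt(np.diag(covariance))
# standard_errors[0] is the intercept's SE; the rest follow worksheet order
```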
