Linear regression is used extensively in the application of data mining techniques. This article provides an overview of linear regression and, more importantly, of how to interpret the results provided by linear regression. We will discuss understanding regression in an intuitive sense, and also how to practically interpret the output of a regression analysis. In particular, we will look at the different values, such as the p-value, the t-stat and other output provided by regression analysis in Excel. We will also look at how regression is connected to beta and correlation.
Imagine you have data on a stock's daily return and the market's daily return in a spreadsheet, and you know instinctively that they are related. How do you figure out how related they are? And what can you do with the data in a practical sense? The first thing to do is to create a scatter plot. That provides a visual representation of the data.
Consider the figure below. This is a scatter plot of Novartis's returns plotted against the S&P 500's returns (data downloaded from Yahoo Finance).
Here is the spreadsheet with this data, in case you wish to see how this graph was built.
A regression model expresses a dependent variable as a function of one or more independent variables, generally in the form:

y = α + βx + ε
What we also see above in the Novartis example is the fitted regression line, ie the line that expresses the relationship between the y variable, called the dependent variable (in this case the returns for the Novartis stock), and the x variable (in this case the S&P 500 returns), which is considered the 'independent' or regressor variable. What we are going to do next is go deeper into how regression calculations work. (For this article, I am going to limit myself to one independent variable, but the concepts discussed apply equally to regressing on multiple independent variables.)
Regression with a single dependent variable y whose value is dependent upon the independent variable x is expressed as

y = α + βx + ε

where α is a constant, and so is β; x is the independent variable and ε is the error term (more on the error term later). Given a set of data points, it is fairly easy to calculate alpha and beta, and while it can be done manually, it can also be done using Excel, using the SLOPE (for calculating β) and the INTERCEPT (for calculating α) functions.

β = covariance of x and y / variance of x
α = mean of the dependent variable (ie y) − β × mean of the independent variable (ie x)
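These two formulas are easy to check by hand. Here is a minimal Python sketch of what Excel's SLOPE and INTERCEPT compute; the two return series are made up purely for illustration:

```python
# Sketch of what Excel's SLOPE and INTERCEPT compute, on made-up return data.
def slope(ys, xs):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # beta = covariance(x, y) / variance(x)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

def intercept(ys, xs):
    # alpha = mean(y) - beta * mean(x)
    return sum(ys) / len(ys) - slope(ys, xs) * sum(xs) / len(xs)

market = [0.01, -0.02, 0.015, 0.03, -0.01]     # x: market returns (made up)
stock = [0.012, -0.015, 0.02, 0.025, -0.008]   # y: stock returns (made up)

beta = slope(stock, market)       # 0.8625
alpha = intercept(stock, market)  # roughly 0.0025
```

Putting the same two columns into Excel and using =SLOPE and =INTERCEPT should give the same numbers.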
Now that we have seen how to calculate α and β (ie either using the formulae, or using Excel), it is probably possible to say that we can predict y if we know the value of x. The predicted value of y is provided to us by the regression equation. This is unlikely to be exactly equal to the actual observed value of y. The difference between the two is explained by the error term, ε. This is a random 'error', not in the sense of being a mistake but in the sense that the value predicted by the regression equation is not equal to the actual observed value. This error is random and not biased, which means that if you sum up ε across all data points, you get a total of zero. Some observations are farther away from the predicted value than others, but the sum of all the differences adds up to zero. (If it weren't zero, the model would be biased, in the sense that it would be likely to either overstate or understate the value of y.)
Intuitively, the smaller the individual observed values of ε, even though they add up to zero, the better our regression model is. How do we measure how small the values of ε are? One obvious way would be to add them up and divide by the number of observations to get an average value per data point, but that would just be zero, as just explained. So what we do is the next best thing: take the sum of the squares of ε and divide by the number of observations. For a variable whose mean is zero, this is nothing but its variance. The square root of this quantity (computed with a degrees-of-freedom adjustment, as we will see below) is called the standard error of the regression, and you may find it referred to as the standard error of the regression line, the standard error of the estimate, or even the standard error of the line, the last phrase being from the PRMIA handbook.
This error variable is considered normally distributed with a mean of zero, and a variance equal to σ².
The standard error can be used to calculate confidence intervals around an estimate provided by our
regression model, because using this we can calculate the number of standard deviations either side of
the predicted value and use the normal distribution to compute a confidence interval. We may need to use
a t-distribution if our sample size is small.
(RMS error: We can also take the square root of this variance to get the standard deviation equivalent, called the RMS error. RMS stands for Root Mean Square, which is exactly what we did: we squared the errors, took their mean, and then took the square root of the result. This takes care of the problem that the variance is expressed in squared units.)
Coming back to the standard error: what do we compare the standard error to in order to determine how good our regression is? How big is big? This takes us to the next step, understanding the sums of squares: TSS, RSS and ESS. TSS, the total sum of squares, measures the total variation of y around its mean; ESS, the explained sum of squares, is the part of that variation explained by the regression; and RSS, the residual sum of squares, is the unexplained remainder. The three are related by TSS = ESS + RSS.
The value of interest to us is RSS = Σ(yᵢ − ŷᵢ)². Since this value will change as the number of observations changes, we divide by n to get a per-observation number. (Since this is a square, we take the root to get a more intuitive number, ie the RMS error explained a little while earlier. Effectively, RMS gives us the standard deviation of the variation of the actual values of y when compared to the predicted values.) The standard error of the regression adjusts for degrees of freedom:

s = sqrt(RSS / (n − 2))

(where n is the number of observations, and we subtract 2 to take away 2 degrees of freedom, one for each of the two estimated coefficients, α and β.)
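These quantities are straightforward to compute. A short sketch, reusing the made-up return data and the α and β fitted from it earlier:

```python
import math

# Made-up data, with the alpha and beta fitted to it by the OLS formulas.
xs = [0.01, -0.02, 0.015, 0.03, -0.01]
ys = [0.012, -0.015, 0.02, 0.025, -0.008]
alpha, beta = 0.0024875, 0.8625

# Residual for each observation: actual y minus predicted y.
residuals = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
assert abs(sum(residuals)) < 1e-12  # residuals sum to zero, as the text notes

n = len(xs)
rss = sum(e ** 2 for e in residuals)   # residual sum of squares
rms = math.sqrt(rss / n)               # RMS error (per-observation root)
s = math.sqrt(rss / (n - 2))           # standard error of the regression
```

Note that s is always slightly larger than the RMS error, because dividing by n − 2 instead of n inflates the estimate to compensate for the two fitted coefficients.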
Now
R2 = ESS/TSS
R2 is also the same as the square of the correlation between x and y (stated without proof, but you can verify it in Excel). This means that our initial intuition, that the quality of our regression model depends upon the correlation of the variables, was correct. (Note that in the ratio ESS/TSS, both the numerator and denominator are squares of some sort, which means this ratio tells us how much of the variance, not the standard deviation, is explained. Variance is always in terms of the square of the units, which makes it slightly difficult to interpret intuitively, which is why we have standard deviation.)
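Both claims, that TSS = ESS + RSS and that R2 equals the squared correlation, can be verified numerically. A self-contained sketch on made-up data:

```python
import math

# Verify R^2 = ESS/TSS = (correlation of x and y)^2 on made-up data.
xs = [0.01, -0.02, 0.015, 0.03, -0.01]
ys = [0.012, -0.015, 0.02, 0.025, -0.008]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Fit the regression line.
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)
beta = sxy / sxx
alpha = my - beta * mx
fitted = [alpha + beta * x for x in xs]

tss = syy                                              # total sum of squares
ess = sum((f - my) ** 2 for f in fitted)               # explained sum of squares
rss = sum((y - f) ** 2 for y, f in zip(ys, fitted))    # residual sum of squares

r2 = ess / tss
corr = sxy / math.sqrt(sxx * syy)   # Pearson correlation of x and y
```

On any data set, tss comes out equal to ess + rss (up to floating-point noise), and r2 equals corr squared.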
We can calculate the standard deviations of both alpha and beta, but the formulae are pretty complex if calculated manually. Excel does a great job of providing these standard deviations as part of its Data Analysis, Regression functionality, as we shall see in a moment.
Once the standard deviations, or the standard errors of the coefficients are known, we can determine
confidence levels to determine the ranges within which these estimated values of the coefficients lie at a
certain level of significance.
Intuitively, we know that alpha and beta are meaningless if their value is zero; if beta is zero, it means the independent variable does not impact the dependent variable at all. Therefore, one test often performed is determining the likelihood that the true value of these coefficients is zero.
This can be done fairly easily; consider this completely made-up example. Assume that the value of beta is 0.5, and the standard error of this coefficient is 0.3. We want to know if, at the 95% confidence level, this value is different from zero. Think of it this way: if the real value were zero, how likely is it that we ended up estimating it to be 0.5? Well, if the real value were zero, and our estimate were distributed according to a normal distribution, then 95% of the time we would have estimated it to be in the range within which the normal distribution covers 95% of the area under the curve around zero. This area extends from -1.96 standard deviations to +1.96 standard deviations on either side of zero, ie from -0.59 (= 0.3*1.96) to +0.59. Since the value we estimated was 0.5, which is within the range -0.59 to +0.59, it is possible that the real value was indeed zero, and that our estimate of 0.5 was just a statistical fluke. (What we did just now was hypothesis testing in plain English.)
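The plain-English test above boils down to a couple of lines of arithmetic:

```python
# Hypothesis test from the made-up example: estimated beta = 0.5, SE = 0.3.
beta_hat = 0.5
se = 0.3
z_95 = 1.96   # two-sided 95% point of the standard normal distribution

half_width = z_95 * se   # 0.588, rounded to 0.59 in the text
# If the true value were zero, 95% of estimates would land in
# (-half_width, +half_width); an estimate inside that band cannot
# be distinguished from zero at the 95% confidence level.
consistent_with_zero = abs(beta_hat) < half_width   # True here
```

Had the standard error been, say, 0.2 instead of 0.3, the band would shrink to ±0.392 and the estimate of 0.5 would be judged significantly different from zero.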
Determining the goodness of the regression - The
significance of R2 & the F statistic
Now the question arises: how significant is any given value of R2? When we speak of significance in statistics, we mean determining whether an estimate can be distinguished from chance. We believe that the variable or parameter in question has a distribution, and we want to determine if the given value falls within the confidence interval (95%, 99%, etc) that we are comfortable with.
Estimates follow distributions, and often we see statements such as 'a particular variable follows the normal or lognormal distribution'. The value of R2 follows what is called an F-distribution. (The answer to why R2 would follow an F-distribution is beyond the mathematical abilities of this author, so just take it as a given.) The F-distribution has two parameters: the degrees of freedom for each of the two sums of squares, ESS and RSS, that go into calculating R2. The F-distribution has a minimum of zero, and its density approaches zero in the right tail. In order to test the significance of R2, one needs to calculate the F statistic as follows:
F statistic = ESS / (RSS / (T − 2)), where T is the number of observations; we subtract 2 to account for the loss of two degrees of freedom. This F statistic can then be compared to the critical value of the F distribution at the desired level of confidence to determine its significance.
All of the above about the F statistic is best explained with an example:
Imagine ESS = 40, TSS = 100, and T = 10 (all made-up numbers), so that RSS = TSS − ESS = 60. The F statistic is then 40 / (60/8) = 5.33.
Assume we want to know if this F statistic is significant at 95%; in other words, could the true value have been zero, with our sample accidentally giving us an estimate of 5.33? We find out what the critical F value is at 95% and compare it to the 5.33 we just calculated. If 5.33 is greater than the critical value at 95%, we conclude that R2 is significant at the 95% level of confidence (or, equivalently, is significant at 5%). If 5.33 is less than the critical F value at the 95% level of confidence, we conclude the opposite.
The critical value of the F distribution at the desired level of confidence (= 1 − level of significance) can be calculated using the Excel function =FINV(significance, 1, T − 2). In this case, =FINV(0.05, 1, 8) = 5.318. Since 5.33 > 5.318, we conclude that our R2 is significant at 5%.
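The same test, sketched in Python. The critical value 5.318 is hard-coded from Excel's =FINV(0.05, 1, 8), since Python's standard library has no F-distribution; the ESS/RSS figures used are the made-up ones that reproduce an F statistic of 5.33:

```python
# F test for the significance of R^2, using made-up sums of squares.
ess, tss, T = 40.0, 100.0, 10   # note: F = 5.33 requires ESS = 40 here
rss = tss - ess                 # 60
f_stat = ess / (rss / (T - 2))  # 40 / 7.5 = 5.333...

F_CRIT_95 = 5.318               # Excel: =FINV(0.05, 1, 8), hard-coded
significant = f_stat > F_CRIT_95   # True: R^2 is significant at 5%
```

The margin here is razor thin (5.333 vs 5.318), which is why the next step, asking at exactly what level the statistic becomes critical, is informative.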
We can then go one step further and determine at what significance level this F statistic becomes critical, using the FDIST function in Excel. In this case, =FDIST(5.33, 1, 8) = 0.0498, which happens to be quite close to 5%. The larger the value of the F statistic, the lower the value the FDIST function returns, which means a higher value of the F statistic is more desirable.
Summing up the above:
1. We figured out how to get estimates of alpha and beta, and therefore how to calculate the regression equation,
2. We calculated the Residual Sum of Squares, which gives us an idea of how scattered around our regression line the real observations are,
3. We took the root of the per-observation RSS to get the RMS error, which gives us an idea of the 'average' error in each estimate; the larger this is, the poorer our regression model,
4. We calculated R2, the square of the correlation, which tells us how good our regression model is by giving us the ratio between explained variance and total variance in the dependent variable (note: R2 talks about variance, not standard deviation),
5. We tested the significance of R2 using the F statistic,
6. We then figured out how to calculate confidence limits relating to our estimates of alpha and beta.
We then perform the regression analysis and get the results as follows. I have provided explanations of all the parameters that Excel provides as output, either in the picture below or in the notes referenced therein.
Note 1: Coefficient of determination (cell E5)
The coefficient of determination is nothing but R^2, which we discussed in detail earlier (ie ESS/TSS), and is equal to the square of the correlation.
The adjusted R2 takes into account the number of independent variables in the model, and the sample
size, and provides a more accurate assessment of the reliability of the model.
Adjusted R^2 is calculated as 1 − (1 − R^2) × ((n − 1)/(n − p − 1)), where n is the sample size and p the number of regressors in the model.
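Plugging in the sums of squares from the worked example in this output (ESS = 64, RSS = 56.1, so TSS = 120.1, with n = 10 observations and p = 1 regressor), the formula works out as:

```python
# Adjusted R^2 for the worked example's figures: ESS = 64, RSS = 56.1.
ess, rss = 64.0, 56.1
tss = ess + rss   # 120.1
n, p = 10, 1

r2 = ess / tss                                    # about 0.533
adj_r2 = 1 - (1 - r2) * ((n - 1) / (n - p - 1))   # about 0.474
```

Adjusted R^2 is always at or below R^2, and the penalty grows with the number of regressors relative to the sample size.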
The F statistic is explained earlier in this article. It is calculated as ESS / (RSS / (T − 2)); in this case that is = 64 / (56.1/8) = 9.13 (where 8 was obtained as T − 2 = 10 − 2 = 8).
Note 5: Significance F
Significance F gives us the significance level at which the F statistic becomes critical, ie the level below which the regression is no longer significant. This is calculated (as explained in the text above) as =FDIST(F statistic, 1, T − 2), where T is the sample size. In this case, =FDIST(9.1266, 1, 8) = 0.0165.
Note 6: t Stat
The t Stat describes how many standard deviations away from zero the calculated value of the coefficient is. It is therefore nothing but the coefficient divided by its standard error. In this case, these work out to 3.86667/1.38517 = 2.7914 and 0.6667/0.22067 = 3.0210 respectively.
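The two t Stats are just the coefficient estimates from the output divided by their standard errors:

```python
# t Stat = coefficient / standard error, per the Excel output figures.
intercept_coef, intercept_se = 3.86667, 1.38517
slope_coef, slope_se = 0.6667, 0.22067

t_intercept = intercept_coef / intercept_se   # about 2.791
t_slope = slope_coef / slope_se               # about 3.021
```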
Why is this important? Because if the coefficient for a variable is zero, then the variable doesn't really affect the predicted value. Though our regression may have returned a non-zero value for a variable, the difference of that value from zero may not be significant. The t Stat helps us judge how far the estimated value of the coefficient is from zero, measured in terms of standard deviations. Since the value of the coefficients follows the t-distribution, we can check, at a given level of confidence (eg 95%, 99%), whether the estimated value of the coefficient is significant. All of Excel's regression calculations are made at the 95% level of confidence by default, though this can be changed using the initial dialog box when the regression is performed.
Note 7: p value
In the example above, the t Stat is 2.79 for the intercept. If the value of the intercept were to be depicted on a t-distribution, how much of the area would lie beyond 2.79 standard deviations, counting both tails? We can get this number using the formula =TDIST(2.79, 8, 2) = 0.0235. That gives us the p value for the intercept.
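For the curious, Excel's =TDIST(2.79, 8, 2) can be reproduced without a statistics library by integrating the t density numerically. This is purely an illustrative sketch, not how Excel computes it:

```python
import math

def t_pdf(t, df):
    # Density of Student's t-distribution with df degrees of freedom.
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t_stat, df, steps=200_000):
    # Area in both tails beyond |t_stat|: 1 minus the trapezoid-rule area
    # under the density between -|t_stat| and +|t_stat|.
    a, b = -abs(t_stat), abs(t_stat)
    h = (b - a) / steps
    area = (t_pdf(a, df) + t_pdf(b, df)) / 2
    area += sum(t_pdf(a + i * h, df) for i in range(1, steps))
    return 1 - area * h

p_value = two_tailed_p(2.79, 8)   # close to Excel's 0.0235
```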
Note 8: Lower 95% and Upper 95%
These columns give the range within which the true value of the coefficient lies at the 95% level of confidence. If zero lies within this range, we cannot rule out, at that level of confidence, that the true value of the coefficient is zero. These ranges therefore allow us to judge whether the values of the coefficients are different from zero at the given level of confidence.
How is this calculated? In the given example, we first calculate the number of standard deviations we can go either side of zero for the given confidence level, assuming a t-distribution. In this case, there are 8 degrees of freedom, and therefore the number of SDs is TINV(0.05, 8) = 2.306. We multiply this by the standard error of the coefficient in question, and add and subtract the result from the estimate. For example, for the intercept, we get the upper 95% limit as follows:
Upper 95% = 3.866667 + (TINV(0.05, 8) × 1.38517) = 7.0608 (where 3.866667 is the estimated value of the coefficient per our model, and 1.38517 is its standard error)
We do the same thing for the other coefficient to get its upper and lower 95% limits.
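The whole calculation for the intercept fits in a few lines; the 2.306 is hard-coded from Excel's =TINV(0.05, 8), since the Python standard library has no t quantile function:

```python
# 95% confidence interval for the intercept coefficient from the output.
coef = 3.866667   # estimated intercept
se = 1.38517      # its standard error
t_crit = 2.306    # Excel: =TINV(0.05, 8), hard-coded critical value

lower_95 = coef - t_crit * se   # about 0.672
upper_95 = coef + t_crit * se   # about 7.061
```

Since zero is outside the interval (0.672, 7.061), the intercept is significantly different from zero at the 95% level, consistent with its p value of 0.0235 being below 0.05.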