
Regression and correlation analysis:

Regression analysis involves identifying the relationship between a dependent variable and one
or more independent variables. A model of the relationship is hypothesized, and estimates of the
parameter values are used to develop an estimated regression equation. Various tests are then
employed to determine if the model is satisfactory. If the model is deemed satisfactory, the
estimated regression equation can be used to predict the value of the dependent variable given
values for the independent variables.

Regression model.
In simple linear regression, the model used to describe the relationship between a single
dependent variable y and a single independent variable x is y = β0 + β1x + ε. β0 and β1 are
referred to as the model parameters, and ε is a probabilistic error term that accounts for the
variability in y that cannot be explained by the linear relationship with x. If the error term were
not present, the model would be deterministic; in that case, knowledge of the value of x would be
sufficient to determine the value of y.

Least squares method.
Either a simple or multiple regression model is initially posed as a hypothesis concerning the
relationship among the dependent and independent variables. The least squares method is the
most widely used procedure for developing estimates of the model parameters.
As an illustration of regression analysis and the least squares method, suppose a university
medical centre is investigating the relationship between stress and blood pressure. Assume that
both a stress test score and a blood pressure reading have been recorded for a sample of 20
patients. The data are shown graphically in the figure below, called a scatter diagram. Values of
the independent variable, stress test score, are given on the horizontal axis, and values of the
dependent variable, blood pressure, are shown on the vertical axis. The line passing through the
data points is the graph of the estimated regression equation: y = 42.3 + 0.49x. The parameter
estimates, b0 = 42.3 and b1 = 0.49, were obtained using the least squares method.
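As a sketch of this calculation: the 20 individual stress scores and blood pressure readings are
not reproduced in the text, so the Python example below generates illustrative data consistent
with the reported line and recovers least squares estimates with numpy.polyfit.

```python
import numpy as np

# Illustrative data only: the 20 patient records from the example are not
# reproduced in the text, so we simulate scores consistent with the fitted
# line y = 42.3 + 0.49x plus random noise.
rng = np.random.default_rng(0)
stress = rng.uniform(40, 100, size=20)          # stress test scores
blood_pressure = 42.3 + 0.49 * stress + rng.normal(0, 5, size=20)

# Least squares fit of a first-degree polynomial: returns (b1, b0).
b1, b0 = np.polyfit(stress, blood_pressure, deg=1)
print(f"estimated regression equation: y = {b0:.1f} + {b1:.2f}x")
```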

Correlation.
Correlation and regression analysis are related in the sense that both deal with relationships
among variables. The correlation coefficient is a measure of linear association between two
variables. Values of the correlation coefficient are always between -1 and +1. A correlation
coefficient of +1 indicates that two variables are perfectly related in a positive linear sense, a
correlation coefficient of -1 indicates that two variables are perfectly related in a negative linear
sense, and a correlation coefficient of 0 indicates that there is no linear relationship between the
two variables. For simple linear regression, the sample correlation coefficient is the square root of
the coefficient of determination, with the sign of the correlation coefficient being the same as the
sign of b1, the coefficient of x in the estimated regression equation.
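A minimal numpy sketch of this relationship, using made-up data: the coefficient of determination
is computed from the regression residuals, and its signed square root reproduces the sample
correlation coefficient.

```python
import numpy as np

# Illustrative data (any two roughly linearly related variables will do).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

# Coefficient of determination: 1 - SSE/SST.
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r_squared = 1 - sse / sst

# Sample correlation coefficient, computed independently.
r = np.corrcoef(x, y)[0, 1]

# r equals the square root of R^2, carrying the sign of the slope b1.
assert np.isclose(r, np.sign(b1) * np.sqrt(r_squared))
```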

Neither regression nor correlation analyses can be interpreted as establishing cause-and-effect
relationships. They can indicate only how or to what extent variables are associated with each
other. The correlation coefficient measures only the degree of linear association between two
variables. Any conclusions about a cause-and-effect relationship must be based on the judgment
of the analyst.

Excerpt from the Encyclopedia Britannica without permission.
Chapter 9 - Correlation and Regression
Terminology: Correlation vs. Regression
This chapter will speak of both correlations and regressions
both use similar mathematical procedures to provide a measure of relation: the degree to which
two continuous variables vary together ... or covary.
The term correlation is used when 1) both variables are random variables, and 2) the end goal is
simply to find a number that expresses the relation between the variables.
The term regression is used when 1) one of the variables is a fixed variable, and 2) the end goal is
to use the measure of relation to predict values of the random variable based on values of the
fixed variable.

Examples
In this class, height and ratings of physical attractiveness (both random variables) vary across
individuals. We could ask, "What is the correlation between height and these ratings in our class?"
Essentially, we are asking "As height increases, is there any systematic increase (positive
correlation) or decrease (negative correlation) in one's rating of their own attractiveness?"
Alternatively, we could do an experiment in which the experimenter compliments a subject on
their appearance one to eight times prior to obtaining a rating (note that number of
compliments is a fixed variable). We could now ask "Can we predict a person's rating of their
attractiveness, based on the number of compliments they were given?"

Scatterplots
The first way to get some idea about a possible relation between two variables is to do a
scatterplot of the variables

Let's consider the first example discussed previously, where we were interested in the possible
relation between height and ratings of physical attractiveness.

The Regression Line
Often scatterplots will include a regression line that overlays the points in the graph:
The regression line represents the best prediction of the variable on the Y axis (Weight) for each
point along the X axis (Height)
For example, my (Steve's) data is not depicted in the graph. But if I tell you that I am 75 inches
tall, you can use the graph to predict my weight

Computing the Regression Line
Going back to your high school days, you perhaps recall that any straight line can be depicted by
an equation of the form:

Y = a + bX

Since the regression line is supposed to be the line that provides the best prediction of Y, given
some value of X, we need to find values of a & b that produce a line that will be the best-fitting
linear function (i.e., the predicted values of Y will come as close as possible to the obtained
values of Y)
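The best-fitting values of a & b minimize the sum of squared prediction errors and have a closed
form. The sketch below computes them directly; the height and weight pairs are invented for
illustration.

```python
import numpy as np

# Hypothetical height (inches) and weight (pounds) pairs for illustration.
height = np.array([63, 66, 68, 70, 72, 75], dtype=float)
weight = np.array([127, 140, 152, 160, 168, 182], dtype=float)

# Least squares solution: b (slope) and a (intercept) minimize
# sum((Y - (a + b*X))**2) over all choices of a and b.
b = np.sum((height - height.mean()) * (weight - weight.mean())) \
    / np.sum((height - height.mean()) ** 2)
a = weight.mean() - b * height.mean()

# Predicted weight for someone 75 inches tall.
print(f"Y = {a:.1f} + {b:.2f}X; predicted weight at X=75: {a + b * 75:.0f}")
```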
Correlation or Regression?

Correlation makes no a priori assumption as to whether one variable is dependent on the other(s),
and is not concerned with describing the functional relationship between variables; instead it gives
an estimate of the degree of association between the variables. In fact, correlation analysis tests
for interdependence of the variables.

Regression, by contrast, attempts to describe the dependence of a variable on one (or more)
explanatory variables; it implicitly assumes that there is a one-way causal effect from the
explanatory variable(s) to the response variable, regardless of whether the path of effect is direct
or indirect. There are advanced regression methods that allow a non-dependence-based relationship
to be described (e.g. Principal Components Analysis, or PCA), and these will be touched on later.

The best way to appreciate this difference is by example.

Take for instance samples of the leg length and skull size from a population of elephants. It
would be reasonable to suggest that these two variables are associated in some way, as elephants
with short legs tend to have small heads and elephants with long legs tend to have big heads. We
may, therefore, formally demonstrate an association exists by performing a correlation analysis.
However, would regression be an appropriate tool to describe a relationship between head size
and leg length? Does an increase in skull size cause an increase in leg length? Does a decrease in
leg length cause the skull to shrink? As you can see, it is meaningless to apply a causal regression
analysis to these variables: they are interdependent, and one is not wholly dependent on the
other but more likely on some other factor that affects them both (e.g. food supply, genetic makeup).

Consider two variables: crop yield and temperature. These are measured independently, one by
the weather station thermometer and the other by Farmer Giles' scales. While correlation analysis
would show a high degree of association between these two variables, regression analysis would
be able to demonstrate the dependence of crop yield on temperature. However, careless use of
regression analysis could also demonstrate that temperature is dependent on crop yield: this
would suggest that if you grow really big crops you'll be guaranteed a hot summer!

Abstract
The present review introduces methods of analyzing the relationship between two quantitative
variables. The calculation and interpretation of the sample product moment correlation coefficient
and the linear regression equation are discussed and illustrated. Common misuses of the
techniques are considered. Tests and confidence intervals for the population parameters are
described, and failures of the underlying assumptions are highlighted.

Introduction
The most commonly used techniques for investigating the relationship between two quantitative
variables are correlation and linear regression. Correlation quantifies the strength of the linear
relationship between a pair of variables, whereas regression expresses the relationship in the form
of an equation. For example, in patients attending an accident and emergency unit (A&E), we
could use correlation and regression to determine whether there is a relationship between age and
urea level, and whether the level of urea can be predicted for a given age.

Scatter diagram
When investigating a relationship between two variables, the first step is to show the data values
graphically on a scatter diagram. Consider the data given in Table 1. These are the ages (years)
and the logarithmically transformed admission serum urea (natural logarithm [ln] urea) for 20
patients attending an A&E. The reason for transforming the urea levels was to obtain a more
Normal distribution [1]. The scatter diagram for ln urea and age (Fig. 1) suggests there is a
positive linear relationship between these variables.

Correlation
On a scatter diagram, the closer the points lie to a straight line, the stronger the linear relationship
between two variables. To quantify the strength of the relationship, we can calculate the
correlation coefficient. In algebraic notation, if we have two variables x and y, and the data take
the form of n pairs (i.e. [x1, y1], [x2, y2], [x3, y3] ... [xn, yn]), then the correlation coefficient is
given by the following equation:

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$

where $\bar{x}$ is the mean of the x values, and $\bar{y}$ is the mean of the y values.
This is the product moment correlation coefficient (or Pearson correlation coefficient). The value
of r always lies between -1 and +1. A value of the correlation coefficient close to +1 indicates a
strong positive linear relationship (i.e. one variable increases with the other; Fig. 2). A
value close to -1 indicates a strong negative linear relationship (i.e. one variable decreases as the
other increases; Fig. 3). A value close to 0 indicates no linear relationship (Fig. 4);
however, there could be a nonlinear relationship between the variables (Fig. 5).

For the A&E data, the correlation coefficient is 0.62, indicating a moderate positive linear
relationship between the two variables.
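A direct transcription of the formula above into Python. The actual 20 A&E records are not
reproduced in the text, so the (age, ln urea) pairs here are stand-ins.

```python
import math

# Stand-in (age, ln urea) pairs; the actual 20 A&E records are not given
# in the text, so these values are for illustration only.
age =     [25, 34, 41, 50, 58, 63, 70, 77]
ln_urea = [1.1, 1.3, 1.4, 1.5, 1.8, 1.7, 2.0, 2.1]

n = len(age)
mean_x = sum(age) / n
mean_y = sum(ln_urea) / n

# The three sums appearing in the formula for r.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, ln_urea))
sxx = sum((x - mean_x) ** 2 for x in age)
syy = sum((y - mean_y) ** 2 for y in ln_urea)

r = sxy / math.sqrt(sxx * syy)
print(f"sample correlation coefficient r = {r:.2f}")
```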

Regression
In the A&E example we are interested in the effect of age (the predictor or x variable) on ln urea
(the response or y variable). We want to estimate the underlying linear relationship so that we can
predict ln urea (and hence urea) for a given age. Regression can be used to find the equation of
this line. This line is usually referred to as the regression line.

Note that in a scatter diagram the response variable is always plotted on the vertical (y) axis.
Equation of a straight line

The equation of a straight line is given by y = a + bx, where the coefficients a and b are the
intercept of the line on the y axis and the gradient, respectively. The equation of the regression
line for the A&E data (Fig. 7) is as follows: ln urea = 0.72 + (0.017 × age) (calculated
using the method of least squares, which
is described below). The gradient of this line is 0.017, which indicates that for an increase of 1
year in age the expected increase in ln urea is 0.017 units (and hence the expected increase in urea
is 1.02 mmol/l). The predicted ln urea of a patient aged 60 years, for example, is 0.72 + (0.017 ×
60) = 1.74 units. This transforms to a urea level of e^1.74 = 5.70 mmol/l. The y intercept is 0.72,
meaning that if the line were projected back to age = 0, then the ln urea value would be 0.72.
However, this is not a meaningful value, because age = 0 is a long way outside the range of the
data and therefore there is no reason to believe that the straight line would still be appropriate.
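The prediction step uses only the two quoted coefficients; this short sketch reproduces the worked
example for a patient aged 60 years.

```python
import math

# Coefficients of the fitted regression line quoted in the text.
intercept, gradient = 0.72, 0.017

def predict_urea(age: float) -> float:
    """Predict urea (mmol/l) from age, undoing the ln transformation."""
    ln_urea = intercept + gradient * age
    return math.exp(ln_urea)

# Worked example from the text: age 60 -> ln urea 1.74 -> urea ~5.70 mmol/l.
print(f"{predict_urea(60):.2f} mmol/l")
```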


Conclusion

Both correlation and simple linear regression can be used to examine the presence of a linear
relationship between two variables providing certain assumptions about the data are satisfied. The
results of the analysis, however, need to be interpreted with care, particularly when looking for a
causal relationship or when using the regression equation for prediction. Multiple and logistic
regression will be the subject of future reviews.
Correlation Coefficient
A correlation coefficient is a number between -1 and 1 which measures the degree to which two
variables are linearly related. If there is a perfect linear relationship with positive slope between the
two variables, we have a correlation coefficient of 1; if there is positive correlation, whenever one
variable has a high (low) value, so does the other. If there is a perfect linear relationship with
negative slope between the two variables, we have a correlation coefficient of -1; if there is
negative correlation, whenever one variable has a high (low) value, the other has a low (high)
value. A correlation coefficient of 0 means that there is no linear relationship between the
variables.

There are a number of different correlation coefficients that might be appropriate depending on
the kinds of variables being studied.

Spearman Rank Correlation Coefficient
The Spearman rank correlation coefficient is one example of a correlation coefficient. It is usually
calculated on occasions when it is not convenient, economical, or even possible to give actual
values to variables, but only to assign a rank order to instances of each variable. It may also be a
better indicator that a relationship exists between two variables when the relationship is
nonlinear.

Commonly used procedures, based on the Pearson's Product Moment Correlation Coefficient, for
making inferences about the population correlation coefficient make the implicit assumption that
the two variables are jointly normally distributed. When this assumption is not justified, a non-
parametric measure such as the Spearman Rank Correlation Coefficient might be more
appropriate.
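A short sketch of when the two coefficients diverge, using scipy.stats on made-up data: for a
monotone but nonlinear relationship, Spearman's coefficient stays at 1 while Pearson's falls
below it.

```python
from scipy.stats import pearsonr, spearmanr

# Monotone but strongly nonlinear relationship: y = x**3.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [xi ** 3 for xi in x]

r, _ = pearsonr(x, y)      # measures linear association only
rho, _ = spearmanr(x, y)   # rank-based; 1.0 for any strictly increasing relation
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```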
Least Squares
The method of least squares is a criterion for fitting a specified model to observed data. For
example, it is the most commonly used method of defining a straight line through a set of points
on a scatterplot.
Regression Equation
A regression equation allows us to express the relationship between two (or more) variables
algebraically. It indicates the nature of the relationship between two (or more) variables. In
particular, it indicates the extent to which you can predict some variables by knowing others, or
the extent to which some are associated with others.

A linear regression equation is usually written
Y = a + bX + e
where
Y is the dependent variable
a is the intercept
b is the slope or regression coefficient
X is the independent variable (or covariate)
e is the error term

The equation will specify the average magnitude of the expected change in Y given a change in X.
The regression equation is often represented on a scatterplot by a regression line.
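As one concrete way to obtain these quantities, scipy.stats.linregress returns the estimated
intercept a, slope b, and correlation coefficient in a single call; the (X, Y) values below are
invented for illustration.

```python
from scipy.stats import linregress

# Invented (X, Y) data for illustration.
X = [1, 2, 3, 4, 5, 6, 7]
Y = [2.3, 2.9, 3.9, 4.4, 5.2, 5.8, 6.9]

fit = linregress(X, Y)
# fit.intercept is a, fit.slope is b; fit.rvalue is the correlation r.
print(f"Y = {fit.intercept:.2f} + {fit.slope:.2f}X (r = {fit.rvalue:.2f})")
```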
Regression Line
A regression line is a line drawn through the points on a scatterplot to summarise the relationship
between the variables being studied. When it slopes down (from top left to bottom right), this
indicates a negative or inverse relationship between the variables; when it slopes up (from bottom
left to top right), a positive or direct relationship is indicated.

The regression line often represents the regression equation on a scatterplot.

11. Correlation and regression

The word correlation is used in everyday life to denote some form of association. We might say
that we have noticed a correlation between foggy days and attacks of wheeziness. However, in
statistical terms we use correlation to denote association between two quantitative variables. We
also assume that the association is linear, that one variable increases or decreases a fixed amount
for a unit increase or decrease in the other. The other technique that is often used in these
circumstances is regression, which involves estimating the best straight line to summarise the
association.
Correlation coefficient
The degree of association is measured by a correlation coefficient, denoted by r. It is sometimes
called Pearson's correlation coefficient after its originator and is a measure of linear association.
If a curved line is needed to express the relationship, other and more complicated measures of the
correlation must be used.
The correlation coefficient is measured on a scale that varies from +1 through 0 to -1. Complete
correlation between two variables is expressed by either +1 or -1. When one variable increases as
the other increases the correlation is positive; when one decreases as the other increases it is
negative. Complete absence of correlation is represented by 0. Figure 11.1 gives some graphical
representations of correlation.

Figure 11.1 Correlation illustrated.
Looking at data: scatter diagrams

When an investigator has collected two series of observations and wishes to see whether there is a
relationship between them, he or she should first construct a scatter diagram. The vertical scale
represents one set of measurements and the horizontal scale the other. If one set of observations
consists of experimental results and the other consists of a time scale or observed classification of
some kind, it is usual to put the experimental results on the vertical axis. These represent what is
called the "dependent variable". The "independent variable", such as time or height or some other
observed classification, is measured along the horizontal axis, or baseline.

The words "independent" and "dependent" could puzzle the beginner because it is sometimes not
clear what is dependent on what. This confusion is a triumph of common sense over misleading
terminology, because often each variable is dependent on some third variable, which may or may
not be mentioned. It is reasonable, for instance, to think of the height of children as dependent on
age rather than the converse but consider a positive correlation between mean tar yield and
nicotine yield of certain brands of cigarette.' The nicotine liberated is unlikely to have its origin in
the tar: both vary in parallel with some other factor or factors in the composition of the cigarettes.
The yield of the one does not seem to be "dependent" on the other in the sense that, on average,
the height of a child depends on his age. In such cases it often does not matter which scale is put
on which axis of the scatter diagram. However, if the intention is to make inferences about one
variable from the other, the observations from which the inferences are to be made are usually put
on the baseline. As a further example, a plot of monthly deaths from heart disease against
monthly sales of ice cream would show a negative association. However, it is hardly likely that
eating ice cream protects from heart disease! It is simply that the mortality rate from heart disease
is inversely related - and ice cream consumption positively related - to a third factor, namely
environmental temperature.
Calculation of the correlation coefficient

A paediatric registrar has measured the pulmonary anatomical dead space (in ml) and height (in
cm) of 15 children. The data are given in table 11.1 and the scatter diagram is shown in figure 11.2.
Each dot represents one child, and it is placed at the point corresponding to the measurement of
the height (horizontal axis) and the dead space (vertical axis). The registrar now inspects the
pattern to see whether it seems likely that the area covered by the dots centres on a straight line or
whether a curved line is needed. In this case the paediatrician decides that a straight line can
adequately describe the general trend of the dots. His next step will therefore be to calculate the
correlation coefficient.


When making the scatter diagram (figure 11.2 ) to show the heights and pulmonary anatomical
dead spaces in the 15 children, the paediatrician set out figures as in columns (1), (2), and (3) of
table 11.1 . It is helpful to arrange the observations in serial order of the independent variable
when one of the two variables is clearly identifiable as independent. The corresponding figures
for the dependent variable can then be examined in relation to the increasing series for the
independent variable. In this way we get the same picture, but in numerical form, as appears in
the scatter diagram.

Figure 11.2 Scatter diagram of relation in 15 children between height and pulmonary anatomical
dead space.

The calculation of the correlation coefficient is as follows, with x representing the values of the
independent variable (in this case height) and y representing the values of the dependent variable
(in this case anatomical dead space). The formula to be used is:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$$
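Translated into code, the column totals described above become three sums. The height and dead
space values below are illustrative stand-ins, since table 11.1 is not reproduced here.

```python
import numpy as np

# Stand-in data: table 11.1 (height in cm, anatomical dead space in ml
# for 15 children) is not reproduced here, so these values are illustrative.
height = np.array([110, 116, 124, 129, 131, 138, 142, 150,
                   153, 155, 156, 159, 164, 168, 174], dtype=float)
dead_space = np.array([44, 31, 43, 45, 56, 79, 57, 56,
                       58, 92, 78, 64, 88, 112, 101], dtype=float)

# Centre both variables, then form the three sums used in the formula.
dx = height - height.mean()
dy = dead_space - dead_space.mean()
r = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
print(f"r = {r:.2f}")
```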

Spearman rank correlation

A plot of the data may reveal outlying points well away from the main body of the data, which
could unduly influence the calculation of the correlation coefficient. Alternatively the variables
may be quantitative discrete such as a mole count, or ordered categorical such as a pain score. A
non-parametric procedure, due to Spearman, is to replace the observations by their ranks in the
calculation of the correlation coefficient.

This results in a simple formula for Spearman's rank correlation, rho:

$$\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$

where d is the difference in the ranks of the two variables for a given individual. Thus we can
derive table 11.2 from the data in table 11.1.

In this case the value is very close to that of the Pearson correlation coefficient. For n > 10, the
Spearman rank correlation coefficient can be tested for significance using the t test given earlier.
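A direct implementation of this rank-difference formula, assuming no tied observations (ties need
averaged ranks, which a library routine such as scipy.stats.spearmanr handles automatically):

```python
def spearman_rho(xs, ys):
    """Spearman's rho via 1 - 6*sum(d^2)/(n(n^2-1)); assumes no ties."""
    n = len(xs)
    # Rank each observation: 1 for the smallest value, n for the largest.
    rank_x = {v: i + 1 for i, v in enumerate(sorted(xs))}
    rank_y = {v: i + 1 for i, v in enumerate(sorted(ys))}
    d_squared = sum((rank_x[x] - rank_y[y]) ** 2 for x, y in zip(xs, ys))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Illustrative use on made-up paired observations.
print(spearman_rho([3, 1, 4, 2, 5], [30, 12, 25, 35, 40]))  # 0.6
```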
The regression equation

Correlation describes the strength of an association between two variables, and is completely
symmetrical: the correlation between A and B is the same as the correlation between B and A.
However, if the two variables are related it means that when one changes by a certain amount the
other changes on average by a certain amount. For instance, in the children described earlier,
greater height is associated, on average, with greater anatomical dead space. If y represents the
dependent variable and x the independent variable, this relationship is described as the regression
of y on x.

The relationship can be represented by a simple equation called the regression equation. In this
context "regression" (the term is a historical anomaly) simply means that the average value of y is
a "function" of x, that is, it changes with x.

The regression equation representing how much y changes with any given change of x can be
used to construct a regression line on a scatter diagram, and in the simplest case this is assumed to
be a straight line. The direction in which the line slopes depends on whether the correlation is
positive or negative. When the two sets of observations increase or decrease together (positive)
the line slopes upwards from left to right; when one set decreases as the other increases the line
slopes downwards from left to right. As the line must be straight, it will probably pass through
few, if any, of the dots. Given that the association is well described by a straight line we have to
define two features of the line if we are to place it correctly on the diagram. The first of these is
its distance above the baseline; the second is its slope. They are expressed in the following
regression equation:

$$y = a + bx$$

where a is the intercept (the distance of the line above the baseline at x = 0) and b is the slope
(the gradient of the line).
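One convenient route to these two quantities reuses the sums already formed for the correlation
coefficient: the slope is b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², and the intercept is a = ȳ - b x̄. A
minimal sketch, continuing with the stand-in height and dead space data used earlier:

```python
import numpy as np

# Same stand-in data as in the correlation sketch above (not the actual
# table 11.1 values).
height = np.array([110, 116, 124, 129, 131, 138, 142, 150,
                   153, 155, 156, 159, 164, 168, 174], dtype=float)
dead_space = np.array([44, 31, 43, 45, 56, 79, 57, 56,
                       58, 92, 78, 64, 88, 112, 101], dtype=float)

dx = height - height.mean()
b = np.sum(dx * (dead_space - dead_space.mean())) / np.sum(dx ** 2)  # slope
a = dead_space.mean() - b * height.mean()                            # intercept
print(f"dead space = {a:.1f} + {b:.2f} * height")
```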