Correlation

1. Introduction
Much of statistics is concerned with relationships among variables and
whether observed relationships are real or simply due to chance. In
particular, the simplest case deals with the relationship between two
variables.
When analyzing two variables, one question becomes important as it
determines the type of analysis that will be done. Is the purpose to explore
the nature of the relationship, or is the purpose to use one variable to explain
variation in another variable? For example, there is a difference between
examining height and weight to see if there is a strong relationship, as
opposed to using height to predict weight.
Consequently, you need to distinguish between a correlational analysis in
which only the strength of the relationship will be described, and regression
where one variable will be used to predict the values of a second variable.
The two variables are often called either a response variable or an
explanatory variable. A response variable (also known as a dependent or Y
variable) measures the outcome of a study. An explanatory variable (also
known as an independent or X variable) is the variable that attempts to
explain the observed outcomes.

2. Graphical displays
2.1 Scatterplots
The scatterplot is the primary graphical tool used when exploring the
relationship between two interval or ratio scale variables. This is obtained in
KYPLOT using the Create Graph -> Same X -> Line/Scatter/Area menu; be sure
that both variables have a continuous scale.
In graphing the relationship, the response variable is usually plotted along
the vertical axis (the Y axis) and the explanatory variable is plotted along the
horizontal axis (the X axis). It is not always perfectly clear which is the
response and which is the explanatory variable. If there is no distinction
between the two variables, then it doesn't matter which variable is plotted on
which axis; this usually happens only when finding the correlation between
the variables is the primary purpose.
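
Outside KYPLOT, the same plot takes only a few lines. A minimal Python/matplotlib
sketch, with made-up height and weight values for illustration:

import matplotlib.pyplot as plt

height = [150, 155, 160, 165, 170, 175, 180, 185]   # explanatory (X), made up
weight = [52, 57, 60, 63, 68, 74, 80, 85]           # response (Y), made up

plt.scatter(height, weight)
plt.xlabel("Height (cm)")    # explanatory variable on the horizontal axis
plt.ylabel("Weight (kg)")    # response variable on the vertical axis
plt.show()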

What to look for in a scatterplot


Overall pattern. What is the direction of association? A positive
association occurs when above-average values of one variable tend to be
associated with above-average values of another; the plot will have an
upward slope. A negative association occurs when above-average values of
one variable are associated with below-average values of another variable;
the plot will have a downward slope. What happens when there is no
association between the two variables?
Form of the relationship. Does a straight line seem to fit through the
middle of the points? Is the relationship linear (the points seem to cluster
around a straight line) or curvilinear (the points seem to form a curve)?

Strength of association. Are the points clustered tightly around the
curve? If the points have a lot of scatter above and below the trend line, then
the association is not very strong. On the other hand, if the amount of scatter
above and below the trend line is very small, then there is a strong
association.

Outliers. Are there any points that seem to be unusual? Outliers are values
that are unusually far from the trend curve, i.e., they are further away from
the trend curve than you would expect from the usual level of scatter. There
is no formal rule for detecting outliers; use common sense. [If you set the
role of a variable to be a label and click on points in a linked graph, the label
for the point will be displayed, making it easy to identify such points.] One's
usual initial suspicion about any outlier is that it is a mistake, e.g.,
a transcription error. Every effort should be made to trace the data back to its
original source and correct the value if possible. If the data value appears to
be correct, then you have a bit of a quandary. Do you keep the data point in
even though it doesn't follow the trend line, or do you drop the data point
because it appears to be anomalous? Fortunately, with computers it is
relatively easy to repeat an analysis with and without an outlier; if there is
very little difference in the final outcome, don't worry about it.
In some cases, the outliers are the most interesting part of the data. For
example, for many years the ozone hole in the Antarctic was missed because
the computers were programmed to ignore readings that were so low that
they must be in error!

Lurking variables. A lurking variable is a third variable that is related to
both variables and may confound the association. For example, the amount
of chocolate consumed in Egypt and the number of automobile accidents are
positively related, but most people would agree that this is coincidental and
that each variable is independently driven by population growth. Sometimes
the lurking variable is a grouping variable of sorts. This is often examined by
using a different plotting symbol to distinguish between the values of the
third variable. For example, consider the following plot of the relationship
between salary and years of experience for nurses.

The individual lines show a positive relationship, but the overall pattern, when
the data are pooled, shows a negative relationship. It is easy in KYPLOT to
assign different plotting symbols to different points, from the Graph menu.

2.2 Smoothers
Once the scatterplot is plotted, it is natural to try and summarize the
underlying trend line. For example, consider the following data:

There are several common methods available to fit a line through this data.

By eye. The eye has remarkable power for providing a reasonable
approximation to an underlying trend, but it needs a little education. A trend
curve is a good summary of a scatterplot if the differences between the
individual data points and the underlying trend line (technically called
residuals) are small. As well, a good trend curve tries to minimize the total of
the residuals. And the trend line should try and go through the middle of most
of the data.
Although the eye often gives a good fit, different people will draw slightly
different trend curves. Several automated ways to derive trend curves are in
common use; bear in mind that the best ways of estimating trend curves will
try and mimic what the eye does so well.

Median or mean trace. The idea is very simple. We choose a window
width of size w, say. For each point along the bottom (X) axis, the smoothed
value is the median or average of the Y-values for all data points with X
values lying within the window centered on this point. The trend curve is
then the trace of these medians or means over the entire plot. The result is
not exactly smooth. Generally, the wider the window chosen, the smoother
the result. However, wider windows make the smoother react more slowly to
changes in trend. Smoothing techniques are too computationally intensive to
be performed by hand. Unfortunately, KYPLOT is unable to compute the trace
of data, but splines are a very good alternative (see below).
The mean or median trace is too unsophisticated to be a generally useful
smoother. For example, the simple averaging causes it to under-estimate the
heights of peaks and over-estimate the heights of troughs. (Can you see why
this is so? Draw a picture with a peak.) However, it is a useful way of trying to
summarize a pattern in a weak relationship for a moderately large data set.
In a very weak relationship it can even help you to see the trend.
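
Since KYPLOT cannot compute the trace directly, here is a minimal sketch of it
in Python/NumPy; the window width w and the synthetic data are illustrative
choices only:

import numpy as np

def trace(x, y, w, use_median=True, n_grid=50):
    # At each grid point, the smoothed value is the median (or mean) of the
    # Y values whose X values lie in the window of width w centred there.
    grid = np.linspace(x.min(), x.max(), n_grid)
    stat = np.median if use_median else np.mean
    out = np.full(n_grid, np.nan)
    for i, g in enumerate(grid):
        in_win = np.abs(x - g) <= w / 2
        if in_win.any():
            out[i] = stat(y[in_win])
    return grid, out

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)                # synthetic noisy sine data
y = np.sin(x) + rng.normal(0, 0.3, 200)
grid, smooth = trace(x, y, w=1.0)          # wider w: smoother, slower to react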

Box plots for strips. The following gives a conceptually simple method
which is useful for exploring a weak relationship in a large data set. The X
axis is divided into equal-sized intervals, and separate box plots of the
values of Y are found for each strip. The box plots are plotted side-by-side
and the means or medians are joined. Again, we are able to see what is
happening to the variability as well as the trend. There is even more detailed
information available in the box plots about the shape of the Y-distribution,
etc. Again, this is too tedious to do by hand. It is possible to make this plot in
KYPLOT by creating a new variable that groups the values of the X variable
into classes. This is illustrated below:
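
A sketch of the same idea in Python (NumPy/matplotlib), with synthetic data
standing in for a real data set:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.uniform(0, 12, 300)                     # synthetic data
y = 0.5 * x + rng.normal(0, 1.5, 300)

n_strips = 6
edges = np.linspace(x.min(), x.max(), n_strips + 1)
idx = np.digitize(x, edges[1:-1])               # strip index, 0 .. n_strips-1
groups = [y[idx == k] for k in range(n_strips)]
centres = (edges[:-1] + edges[1:]) / 2

plt.boxplot(groups, positions=centres, widths=0.6 * np.diff(edges))
plt.plot(centres, [np.median(g) for g in groups], "o-")   # join the medians
plt.show()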

Spline methods. A spline is a series of short smooth curves that are joined
together to create a larger smooth curve. The computational details are
complex, but can be done in KYPLOT. The stiffness of the spline indicates how
straight the resulting curve will be. The following shows two spline fits to the
same data with different stiffness measures:
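
In Python, scipy's UnivariateSpline gives an analogous fit: its smoothing
factor s plays the role of the stiffness setting, with larger s forcing a
straighter curve. The data here are synthetic:

import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 100))     # x must be increasing for the fit
y = np.sin(x) + rng.normal(0, 0.3, 100)

stiff = UnivariateSpline(x, y, s=100)    # large s: stiff, nearly straight
flexible = UnivariateSpline(x, y, s=5)   # small s: follows the data closely

grid = np.linspace(0, 10, 200)
y_stiff, y_flex = stiff(grid), flexible(grid)   # the two trend curves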

3. Correlation
WARNING: Correlation is probably the most abused concept in statistics.
Many people use the word correlation to mean any type of association
between two variables, but it has a very strict technical meaning: the
strength of the apparent linear relationship between two interval or
ratio scaled variables.

The correlation measure does not distinguish between explanatory and
response variables; it treats the two variables symmetrically. This means
that the correlation between Y and X is the same as the correlation between
X and Y.
Correlations are computed in KYPLOT using the Statistics-> Parametric tests ->
Linear correlation platform. Each cell in the table shows the correlation of the
two corresponding variables. Because of symmetry (the correlation between
variable1 and variable2 is the same as between variable2 and variable1), only
part of the complete matrix will be shown. As well, the correlation between any
variable and itself is always 1.
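
The same kind of matrix can be produced in Python/NumPy for comparison;
the three variables here are made up:

import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(size=(100, 3))          # 100 rows, 3 hypothetical variables
data[:, 1] += data[:, 0]                  # make variable 2 related to variable 1

corr = np.corrcoef(data, rowvar=False)    # 3 x 3 correlation matrix
print(corr)                               # symmetric, with 1s on the diagonal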

3.1 Correlation coefficient


It is possible to quantify the strength of association between two variables. As
with all statistics, the way the data are collected influences the meaning of
the statistics.
The population correlation coefficient between two variables is denoted
by the Greek letter ρ (rho) and is computed as:

\rho = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \, \sigma_Y}

The corresponding sample correlation coefficient, denoted r, has a similar
form:

r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}

(Note that this formula SHOULD NOT be used for the actual computation of r;
it is numerically unstable and there are better computing formulae available.)
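
In practice r is computed with a library routine or with a centred two-pass
formula rather than the one-pass definitional form. A minimal sketch in
Python/NumPy (the data values are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # made-up paired samples
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Library routine (recommended): r is entry [0, 1] of the correlation matrix.
r = np.corrcoef(x, y)[0, 1]

# Equivalent stable two-pass computation: centre first, then accumulate.
dx, dy = x - x.mean(), y - y.mean()
r_two_pass = (dx @ dy) / np.sqrt((dx @ dx) * (dy @ dy))

print(r, r_two_pass)                           # both print the same value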

If the sampling scheme is a simple random sample from the corresponding
population, then r is an estimate of ρ. This is a crucial assumption. If
the sampling is not a simple random sample, the above definition of the
sample correlation coefficient should not be used! It is possible to find a
confidence interval for ρ and to perform statistical tests of the hypothesis
that ρ is zero. However, for the most part, these are rarely done in chemical
engineering research and so will not be pursued further in this course.

The form of the formula does provide some insight into interpreting its value.
ρ and r (unlike other population parameters) are unitless measures.
The sign of ρ and r is largely determined by the pairing of each of the
(X, Y) values with their respective means, i.e. if both X and Y are above
their means, or both are below their means, the pair contributes a positive
value towards ρ or r, while if X is above and Y is below (or X is below and
Y is above) its respective mean, the pair contributes a negative value
towards ρ or r.
ρ and r range from -1 to 1. A value of ρ or r equal to -1 implies a perfect
negative correlation; a value of ρ or r equal to 1 implies a perfect positive
correlation; a value of ρ or r equal to 0 implies no correlation. A perfect
correlation (i.e. ρ or r equal to 1 or -1) implies that all points lie
exactly on a straight line, but the slope of the line has NO effect on the
correlation coefficient. This latter point is IMPORTANT and is often
wrongly interpreted.
ρ and r are unaffected by linear transformations of the individual variables,
e.g. unit changes such as converting from imperial to metric units.
ρ and r measure only the linear association; they are not affected by the
slope of the line, only by the scatter about the line.

3.3 Cautions
Random Sampling Required. Sample correlation coefficients are only
valid under simple random samples. If the data were collected in a haphazard
fashion or if certain data points were oversampled, then the correlation
coefficient may be severely biased.
There are examples of high correlation but no practical use and low
correlation but great practical use. These will be presented in class. This
illustrates why I almost never talk about correlation.
Correlation measures the strength of a linear relationship; a curvilinear
relationship may have a correlation of 0, but there will still be a good
relationship.
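
A quick demonstration of this caution in Python/NumPy: a perfect quadratic
relationship whose linear correlation is essentially zero.

import numpy as np

x = np.linspace(-3, 3, 101)         # symmetric about zero
y = x ** 2                          # Y is completely determined by X

print(np.corrcoef(x, y)[0, 1])      # ≈ 0: r misses the nonlinear relationship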

Effects of lurking variables. For example, suppose there is a positive
association between the wages of male nurses and years of experience, and
between the wages of female nurses and years of experience, but males are
generally paid more than females. There is a positive correlation within each
group, but an overall negative correlation when the data are pooled together.
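
A synthetic demonstration in Python/NumPy; all numbers are invented, and the
two groups are deliberately constructed so that the higher-paid group also has
less experience, which is what drives the reversal:

import numpy as np

rng = np.random.default_rng(0)

exp_a = rng.uniform(0, 10, 50)                     # group A: less experience...
wage_a = 60 + 1.5 * exp_a + rng.normal(0, 2, 50)   # ...but higher wages
exp_b = rng.uniform(10, 20, 50)                    # group B: more experience...
wage_b = 30 + 1.5 * exp_b + rng.normal(0, 2, 50)   # ...but lower wages

print(np.corrcoef(exp_a, wage_a)[0, 1])            # positive within group A
print(np.corrcoef(exp_b, wage_b)[0, 1])            # positive within group B

exp_all = np.concatenate([exp_a, exp_b])
wage_all = np.concatenate([wage_a, wage_b])
print(np.corrcoef(exp_all, wage_all)[0, 1])        # negative when pooled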

Correlation does not imply causation. This is the most frequent mistake
made by people. There is a set of principles of causal inference that need to
be satisfied in order to imply cause and effect.

3.4 Principles of Causation


Types of association
An association may be found between two variables for several reasons:
there may be direct causation, e.g. smoking causes illness;
there may be a common cause, e.g. ice cream sales and the number of
drownings both increase with temperature;
there may be a confounding factor, e.g. highway fatalities decreased
when the speed limits were reduced to 55 mph at the same time that the oil
crisis caused supplies to be reduced and people drove fewer miles;
there may be a coincidence, e.g. the population of Egypt has increased at
the same time as the moon has gotten closer by a few miles.


3.5 Establishing cause-and-effect.


How do we establish a cause-and-effect relationship? Bradford Hill (Hill, A. B.
1971. Principles of Medical Statistics, 9th ed. New York: Oxford University
Press) outlined 7 criteria that have been adopted by many researchers. It is
generally agreed that most or all of the following must be considered before
causation can be declared.

Strength of the association. The stronger an observed association
appears over a series of different studies, the less likely this association is
spurious because of bias.

Dose-response effect. The value of the response variable changes in a
meaningful way with the dose (or level) of the suspected causal agent.

Lack of temporal ambiguity. The hypothesized cause precedes the
occurrence of the effect. The ability to establish this time pattern will depend
upon the study design used.

Consistency of the findings. Most, or all, studies concerned with a given
causal hypothesis produce similar findings. Of course, studies dealing with a
given question may all have serious bias problems that can diminish the
importance of observed associations.

Engineering or theoretical plausibility. The hypothesized causal
relationship is consistent with current engineering or theoretical knowledge.
Note that the current state of knowledge may be insufficient to explain
certain findings.

Coherence of the evidence. The findings do not seriously conflict with
accepted facts about the outcome variable being studied.

Specificity of the association. The observed effect is associated with
only the suspected cause (or few other causes that can be ruled out).

Examples:
Discuss the above in relation to:
amount of studying vs grades in a course.
amount of industrial growth and sediments in water.
fossil fuel burning and the greenhouse effect.
