
Chapter 9: Linear Correlation

A: Conceptual Foundation
Concept of correlation is used in dealing w/ individual people
o If a student scores better than most others in the class on a midterm exam, we
expect similar relative performance from that student on the final
Don't expect correlation to be perfect
o Some factors can change unexpectedly
o Ex: many factors will affect each student's performance on each exam
Perfect Correlation
Less obvious that 2 variables can be perfectly correlated when they are measured in
different units & 2 numbers are not simply related
o Ex: theoretically, height measured in inches can be perfectly correlated w/ weight
measured in pounds
Although numbers for each will be quite different
Correlation is not about an individual having the same number on both variables
o It is about an individual being in the same position on both variables relative to
the rest of the group
o Ex: to have perfect correlation, someone slightly above average in height should
also be slightly above average in weight
If height & weight are correlated
To quantify correlation, need to transform original score on some variable to a number
that represents the score's position w/ respect to the group
o Using z scores, perfect correlation can be defined: if both variables are normally
distributed, perfect positive correlation means each person in the group
has the same z score on both variables
Negative Correlation
Perfect negative correlation: each person in the group has the same z score in magnitude
for both variables but the z scores are opposite in sign
o Ex: correlation b/w score on exam & number of points taken off
If there are 100 points on exam, student w/ 85 has 15 points taken off
whereas student w/ 97 points has only 3 points taken off
Correlation doesn't have to be based on individual people, each measured twice
o Individuals being measured can be schools, cities, or even entire countries
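A quick check of the exam example above: points taken off is a negative linear transformation of the score, so r comes out to exactly -1. A minimal NumPy sketch (the scores are hypothetical):

```python
import numpy as np

score = np.array([85.0, 97.0, 73.0, 90.0, 64.0])   # points earned on a 100-point exam
points_off = 100.0 - score                          # points taken off for each student

# Each pair of z scores is equal in magnitude but opposite in sign,
# so the correlation is perfectly negative
print(np.corrcoef(score, points_off)[0, 1])         # -1.0
```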
Correlation Coefficient
Correlation coefficient: used to measure amount of correlation
o Coefficient of +1: perfect positive correlation
o Coefficient of -1: perfect negative correlation
o Coefficient of 0: total lack of correlation
Numbers b/w 0 and 1 represent relative amount of correlation, w/ sign (+ or -)
representing direction of correlation

Pearson's correlation coefficient (r): often referred to as Pearson's r


$r = \frac{\sum z_X z_Y}{N}$

When the 2 z scores are always equal, the largest z scores get multiplied by the largest z
scores, which makes up for the fact that the smallest z scores are being multiplied together
If z scores randomly paired:
o For some of the cross products, 2 z scores will have the same sign so cross
product will be positive
o For just about as many cross products, 2 z scores would have opposite signs,
producing negative cross products
o Positive & negative cross products would cancel each other out, leading to a
coefficient near 0
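A minimal NumPy sketch of the definitional formula above, using made-up exam scores; it also verifies the result against np.corrcoef:

```python
import numpy as np

# Hypothetical midterm & final exam scores for 6 students
x = np.array([62.0, 75.0, 80.0, 68.0, 91.0, 85.0])
y = np.array([65.0, 78.0, 79.0, 70.0, 94.0, 88.0])

# z scores using the biased (N-denominator) SD, matching r = sum(zx*zy)/N
zx = (x - x.mean()) / x.std()       # np.std divides by N by default
zy = (y - y.mean()) / y.std()

r = (zx * zy).sum() / len(x)        # mean of the cross products
print(r, np.corrcoef(x, y)[0, 1])   # the two values agree
```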

Linear Transformations
Linear transformation: transformation that doesn't change z scores
o Ex: rule used to convert Celsius (centigrade) temperatures into Fahrenheit
temperatures

$F = \frac{9}{5}C + 32$

Formula resembles general formula for straight line: Y = mX + b


It is the relative positions of measures on the 2 variables (& not the absolute numbers) that are important
o Relative positions are reflected in z scores, which don't change w/ simple (ex:
linear) changes in scale of measurement
o Any time 1 variable is a linear transformation of the other, 2 variables will be
perfectly correlated
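A quick check of the temperature example; because Fahrenheit is a linear transformation of Celsius, r is exactly +1:

```python
import numpy as np

celsius = np.array([-5.0, 0.0, 12.0, 21.0, 30.0])
fahrenheit = 9.0 / 5.0 * celsius + 32.0       # linear transformation: Y = mX + b

# The z scores are identical for both variables, so r = +1.0
print(np.corrcoef(celsius, fahrenheit)[0, 1])
```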

Graphing Correlation
Correlation coefficient: useful number for characterizing relationship b/w 2 variables
Describing a group of numbers in terms of mean for group can be misleading if numbers are
very spread out
o Even adding SD still doesn't tell us if distribution is strongly skewed or not
o Only by drawing distribution can we be sure if mean & SD were a good way to
describe distribution
Scatterplot (scatter diagram): graph in which 1 of the variables is plotted on the X axis &
the other variable is plotted on the Y axis
o Each individual is represented as a single dot on the graph
o Can be used to illustrate characteristics & limitations of correlation coefficient
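A minimal matplotlib sketch of such a plot; the height/weight data are simulated purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
height = rng.normal(68, 3, size=50)                   # hypothetical heights (inches)
weight = 4.5 * height + rng.normal(0, 15, size=50)    # positively related weights (lb)

plt.scatter(height, weight)                           # one dot per individual
plt.xlabel("Height (inches)")
plt.ylabel("Weight (pounds)")
plt.title(f"r = {np.corrcoef(height, weight)[0, 1]:.2f}")
plt.show()
```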
Dealing w/ Curvilinear Relationships
Important property of Pearson's r: coefficient measures only degree of linear correlation

Linear relationships are common in nature


o If relationship is curvilinear, simple transformation of 1 or both of variables can sometimes make relationship linear
When psychologists use term correlation w/o specifying linear or curvilinear, can
assume that linear correlation is being discussed

Problems in Generalizing from sample correlations


Population correlation coefficient (ρ): Pearson's r that would be calculated if entire
population had been measured
o Almost never practical to calculate this value directly
o Instead, psychologists use the r that is calculated for sample to draw some
inference about ρ
Truly random sample should yield sample r that reflects ρ for the population
One of most common types of biased samples: narrow (truncated) range on one (& often
both) of variables
Restricted (or truncated) ranges
Correlation is based not on absolute numbers but on relative numbers (ex: z scores) w/n a
particular group
Truncated (restricted) range: when there are no large differences w/n group
o Usually leads to sample r that is lower than ρ for the population
o In some cases, truncated range can cause r to be considerably higher than ρ
Another sampling problem that can cause sample r to be much higher or lower than
correlation for entire population: outliers
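A small simulation of the restricted-range effect just described (the population ρ of about .6 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 10_000, 0.6
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)   # population rho ~ .6

r_full = np.corrcoef(x, y)[0, 1]

keep = x > 0                          # truncate: keep only above-average x scores
r_truncated = np.corrcoef(x[keep], y[keep])[0, 1]

print(r_full, r_truncated)            # truncated r is noticeably smaller
```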
Bivariate outliers
Potential problem w/ use of Pearson's r: sensitivity to outliers (outriders)
o Both the mean & SD are sensitive to outliers
Correlation is especially sensitive to bivariate outliers
o Since measurement of correlation depends on pairs of numbers
Bivariate outlier need not have extreme value on either variable but the combination it
represents must be extreme
If you can find independent basis for eliminating an outlier from data, should remove
outlier before calculating Pearson's r
Sometimes have to deal w/ outliers
o Outlier sometimes represents very unlucky event that is not likely to appear in the
next sample
o Outlier may represent influence of yet unknown factors that are worthy of further
study
Scatterplot should always be inspected before any attempt is made to interpret correlation
o Unexpected curvilinear trend can be discovered
o Always possible that Pearson's r has been raised or lowered in magnitude or even
reversed in direction (by truncated range or single outlier)
o Spread of points may not be even as you go across scatterplot

Correlation doesn't imply causation


Large sample r suggests that ρ for population may be large as well or at least larger than 0
Possible that some unnoticed third variable is responsible for correlation
Sometimes the causal link underlying strong correlation seems too obvious to require
experimental evidence, but true scientist must resist temptation of relying solely on theory &
logic
o Scientist should make every effort to obtain confirmatory evidence by means of
true experiment
True experiments involving correlation

Summary
1. If each individual has the same score on 2 different variables, 2 variables will be perfectly
correlated
a. Condition is not necessary
b. Perfect correlation can be defined as all individuals having the same z score (ex:
same relative position in distribution) on both variables
2. Negative correlation: tendency for high scores on 1 variable to be associated w/ low
scores on a second variable (& vice versa)
a. Perfect negative correlation occurs when each individual has the same magnitude
z score on the 2 variables but z scores are opposite in sign
3. Pearson's correlation coefficient (r): can be defined as average of cross products of z
scores on 2 variables
a. Pearson's r ranges from -1.0 for perfect negative correlation to 0 when there's no
linear relationship b/w variables to +1.0 when correlation is perfectly positive
4. Linear transformation: conversion of 1 variable into another by only arithmetic
operations (ex: adding, subtracting, multiplying, dividing) involving constants
a. If 1 variable is linear transformation of another, each individual will have the
same z score on both variables & 2 variables will be perfectly correlated
b. Changing units of measurement on either or both of variables will not change
correlation as long as change is linear one (as is usually the case)
5. Scatterplot (scattergram): graph of 1 variable plotted on X axis vs. second variable
plotted on Y axis
a. Scatterplot of perfect correlation will be a straight line that slopes up to the right
for positive correlation & down to the right for negative correlation
6. One important property of Pearson's r: assesses only degree of linear relationship b/w 2
variables
a. 2 variables can be closely related by a very simple curve & yet produce a
Pearson's r near 0
7. Problems can occur when Pearson's r is measured on subset of population but you wish
to extrapolate results to estimate correlation for entire population (ρ)
a. Most common problem: having truncated/restricted range on 1 or both of
variables
i. Problem usually causes r for sample to be considerably < ρ
ii. In rare cases, opposite can occur (ex: when measuring one portion of
curvilinear relationship)
8. Another potential problem w/ Pearson's r: a few bivariate outliers in sample can drastically
change magnitude (in rare instances, even the sign) of the correlation coefficient
a. Important to inspect scatterplot for curvilinearity, outliers, & other aberrations
before interpreting meaning of correlation coefficient
9. Correlation (even if high in magnitude & statistically significant) doesn't prove that
there's any causal link b/w 2 variables
a. Always the possibility that some 3rd variable is separately affecting each of the 2
variables being studied
b. Experimental design (w/ random assignment of participants to conditions or
different quantitative levels of some variable) is required to determine whether 1
particular variable is causing changes in 2nd variable
B: Basic Statistical Procedures
For mean, it doesn't matter whether referring to population mean (μ) or sample mean (X̄)
o Calculation is the same
There's a difference b/w σ and s
o s is calculated w/ n - 1
o σ is calculated w/ N in denominator
Covariance
Covariance gets larger in magnitude as 2 variables show a greater tendency to covary
(vary togethereither positively or negatively)
o Dividing by product of 2 SDs ensures that correlation coefficient will never get
larger than +1.0 or smaller than -1.0
Denominator can be considered biased (when extrapolating to larger population) just as the
numerator has a corresponding bias
o 2 biases cancel each other out so that value of r is not affected
If you wish to use unbiased SDs in denominator, you need to calculate an unbiased
covariance in the numerator
Unbiased Covariance
Bias in numerator is removed in same way we remove bias from formula for SD:
o Divide by n - 1 instead of N
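A short NumPy sketch (made-up numbers) showing that the biased & unbiased versions give the same r because the two biases cancel:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 7.0, 9.0])
n = len(x)
cross = ((x - x.mean()) * (y - y.mean())).sum()

# Biased: covariance and SDs both divide by N
r_biased = (cross / n) / (x.std(ddof=0) * y.std(ddof=0))

# Unbiased: covariance and SDs both divide by n - 1
r_unbiased = (cross / (n - 1)) / (x.std(ddof=1) * y.std(ddof=1))

print(r_biased, r_unbiased)   # identical -- the biases cancel out
```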
Example of calculating Pearson's r

Which formula to use

Testing Pearson's r for Significance

Using t Distribution
When ρ = 0 & sample size is quite large, sample rs will be normally distributed w/ mean
of 0 & standard error of about $\frac{1}{\sqrt{n}}$

For sample sizes that aren't large, standard error can be estimated by:

$\sqrt{\frac{1 - r^2}{n - 2}}$
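A minimal sketch of the resulting t test (t = r divided by the estimated standard error, w/ df = n - 2); the r & n values are hypothetical:

```python
import numpy as np
from scipy import stats

r, n = 0.45, 30                           # hypothetical sample r and sample size

t = r / np.sqrt((1 - r**2) / (n - 2))     # t test for r, with df = n - 2
p = 2 * stats.t.sf(abs(t), df=n - 2)      # two-tailed p value
print(t, p)
```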

Testing null hypothesis other than ρ = 0 or constructing a confidence interval around a
sample r requires transformation of r: Fisher Z transformation

Using table of critical values for Pearson's r


Performing t tests to create convenient table allows you to look up critical value for r as
function of alpha & df
Critical value for r is a function of sample size (through df)
Critical values for r become larger as alpha gets smaller
If sample size is large enough, any sample r (unlike z scores or t values) no matter how
small can be statistically significant
o Conversely, even correlation coefficient close to 1.0 can fail to attain significance
if sample is too small
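These critical values follow directly from the t test above; a sketch, assuming the standard identity r_crit = t_crit / sqrt(t_crit² + df):

```python
from scipy import stats

def critical_r(n, alpha=0.05):
    """Two-tailed critical value of r, derived from the critical t."""
    df = n - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return t_crit / (t_crit**2 + df) ** 0.5

for n in (10, 30, 100, 1000):
    print(n, round(critical_r(n), 3))   # critical r shrinks as n grows
```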

Sample rs are clustered more tightly around population ρ as sample size increases
o Becomes more & more unlikely to obtain sample r far from ρ as n gets larger
Calculated r for sample is just as likely to get smaller as larger when you increase sample
size
As sample gets larger, sample r tends to be closer to ρ

Understanding degrees of freedom


If you're sampling only 2 cases from population, correlation you calculate will be perfect
(any 2 points fall on a straight line) regardless of which 2 variables are being measured &
regardless of magnitude of the correlation for those variables
Only as your sample size gets large does your sample r begin to accurately reflect
correlation in population
Inflation of r when degrees of freedom are few becomes problem when calculating
multiple regression
Assumptions associated w/ Pearson's r
Pearson's r is sometimes used purely for descriptive purposes

o More often sample r is used to draw some inference concerning population
correlation: deciding whether or not ρ = 0
Independent random sampling
Assumption applies to all hypothesis tests in text
In case of correlation, means that even though relation may exist b/w 2 numbers of a pair,
each pair should be independent of the other pairs
o All of the pairs in population should have an equal chance of being selected

Normal distribution
Each of 2 variables should ideally be measured on interval or ratio scale & be
normally distributed in population
Bivariate normal distribution
If assumption of bivariate normal distribution is satisfied, can be certain that preceding
assumption will also be satisfied
Possible for each variable to be normally distributed separately w/o 2 variables jointly
following bivariate normal distribution
In univariate distribution, 1 variable is placed along horizontal axis & (vertical) height of
distribution represents likelihood of each value
In bivariate distribution, 2 axes are needed to represent 2 variables
o Third axis is needed to represent relative frequency of each possible pair of values
o Common bivariate distribution: smooth, round hill
Values near the middle are more common
Values that are far away in any direction (including diagonally) are less
common
Bivariate normal distribution can involve any degree of linear relationship b/w 2
variables from r=0 to r=+1.0 or -1.0
o However, curvilinear relationship wouldn't be consistent w/ bivariate normal
distribution
Spearman rank-order correlation formula: commonly used when correlation is
calculated for ranked data (see the sketch after this list)
o Resulting coefficient is Spearman rho (rs)
Correlation coefficient is interpreted in same way as any other Pearson's r
but critical values for testing significance are different
Assumption of bivariate normal distribution becomes less important as sample size
increases
o For very large sample sizes, assumption can be grossly violated w/ very little error
o Generally, if sample size is above about 30 or 40 & bivariate distribution doesn't
deviate a great deal from bivariate normality, assumption is usually ignored
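A quick illustration of the Spearman alternative mentioned above, using simulated data w/ a monotonic but curvilinear relation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=40)
y = np.exp(x) + rng.normal(scale=0.3, size=40)   # monotonic but curvilinear

r_pearson = stats.pearsonr(x, y)[0]       # attenuated by the curvilinearity
r_spearman = stats.spearmanr(x, y)[0]     # Pearson's r computed on the ranks
print(r_pearson, r_spearman)
```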
Uses of Pearson correlation coefficient

Reliability & validity


Common use of Pearson's r: measurement of reliability
o Occurs in a variety of contexts
o Correlation of 2 scores is calculated to determine test-retest reliability
Split-half reliability: when separate subscores for odd- & even-numbered items can be
correlated to quantify tendency for all items in questionnaire to measure same trait
A variable is sometimes measured by having someone act as a judge to rate the behavior
of a person (ex: aggressive, cooperative)
o To be confident about ratings, researcher may have 2 judges rate same behavior so
that correlation of ratings can be assessed
o Important to have high interrater reliability to trust ratings
o In general, correlation coefficients for reliability that are below .7 lead to a good
deal of caution & rethinking
Frequent use of correlation: establish criterion validity of self-report measure
Common to measure degree of correlation b/w 2 questionnaires that are supposed to be
measuring same variable (ex: 2 measures of anxiety, 2 measures of depression)
Relationships b/w variables
Most interesting use for correlation: measure degree of association b/w 2 variables that
arent obviously related but predicted by some theory or past research to have an
important connection
o Ex: correlation found b/w mathematical ability & ability to imagine how objects
would look if rotated in 3D space supports some notions about cognitive basis for
mathematical operations
Some observed correlations werent predicted but can provide basis for future theories
o Ex: correlations sometimes found b/w various eating & drinking habits and
particular health problems
Correlations can be used to evaluate results of experiment when levels of manipulated
variable come from interval or ratio scale
o Ex: experimenter may vary number of times particular words are repeated in a list
to be memorized & then find the correlation b/w number of repetitions &
probability of recall
Publishing results of correlational studies

Excerpt from psychological literature


Large df (hence, sample size) makes it relatively easy to attain statistical significance w/o
large correlation coefficient
Power associated w/ correlational tests
Null hypothesis for two-group t test is almost always μ₁ − μ₂ = 0
Null hypothesis for correlational study is almost always ρ = 0

To study power, it is necessary to hypothesize a particular value for population correlation
coefficient
Given that ρA ≠ 0, sample rs will be distributed around whatever value ρA has
o How narrowly sample rs are distributed around expected r depends on sample
size
To understand power analysis for correlation, important to appreciate fundamental
difference b/w Pearson's r & t value for two-group test
o Expected r (ρA) is measure similar to d (effect size associated w/ t test)
Doesn't depend on sample size

o Expected r describes size of effect in population & by itself its size doesn't tell you
whether particular test of null hypothesis is likely to come out statistically
significant
In contrast, expected t value is reflection of both d & sample size
Does give you a way to determine likelihood of attaining statistical
significance
o When performing power analysis for correlation, expected r plays same role as d
in power analysis of t test & not the role of expected t
Power analysis can be used to determine maximum number of participants that should be
used
o Determine based on desired levels for power & α
o Plug in smallest correlation of interest for variables you're dealing w/
n given by calculation is largest sample size you should use
Larger n will have unnecessarily high chance of giving you statistical
significance when true correlation is so low that you wouldn't care about it
In place of ρA can plug in largest correlation that can reasonably be
expected
o n that is calculated is minimum sample size you should employ
Any smaller sample size will give you a less than desirable chance of
obtaining statistical significance
Magnitude of correlation can sometimes be predicted based on previous studies (as w/ d)
o At other times, expected correlation is characterized roughly as small, medium, or
large
o Conventional guideline for Pearson's r: .1=small, .3=medium, .5=large
Correlations much larger than .5 usually involve 2 variables that are
measuring same thing (ex: various types of reliability described earlier or
2 different questionnaires that are both designed to assess patient's current
level of depression)
ρ and d are conceptually similar in terms of what they are assessing in the population
o Are measured on very different scales

Correlation can be used as alternative measure of effect size but it is more often referred
to as a measure of the strength of association b/w 2 variables
Once have determined that particular sample r is statistically significant, are permitted to
rule out ρ = 0 (that 2 variables have no linear relationship in population)
o Sample r provides point estimate of ρ
Accuracy of point estimate depends on sample size
An interval estimate would be more informative than point estimate
o Provides clearer idea of what values are likely for population correlation
coefficient
o Interval estimation for ρ is not performed as often as it is for μ
Probably since it has fewer practical implications
But should be performed more often than it is
Constructing confidence interval for ρ requires Fisher Z transformation
o Best handled by statistical software
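A minimal sketch of that CI, using the standard Fisher Z formulas (z' = arctanh(r), SE = 1/sqrt(n - 3)); the r & n values are hypothetical:

```python
import numpy as np
from scipy import stats

r, n = 0.45, 50                     # hypothetical sample r and sample size

z = np.arctanh(r)                   # Fisher Z transformation of r
se = 1 / np.sqrt(n - 3)             # standard error of Fisher Z
z_crit = stats.norm.ppf(0.975)      # for a 95% confidence interval

lo, hi = np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)
print(lo, hi)                       # CI for rho, back on the r scale
```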

Summary
1. To calculate Pearson's r w/o transforming to z scores: calculate means & biased SDs of
both variables as well as cross products (ex: X times Y) for all individuals
a. Pearson's r is equal to mean of cross products minus product of 2 means, divided
by product of 2 biased SDs
i. Numerator of ratio is biased covariance
ii. If wishing to use unbiased SDs in denominator, numerator must be
adjusted to correct bias
2. Calculation method is not accurate unless retaining at least 4 digits past decimal point for
means, SDs, & mean of cross products
3. Correlation coefficient can be tested for significance w/ t test or by looking up critical
value for r
a. As sample size increases, smaller rs become significant
b. Any r other than 0 can become significant w/ large enough sample size
i. However, sample r doesn't tend to get larger just b/c sample size is
increased
1. Gets closer to ρ
4. Following assumptions are required for testing significance of Pearson correlation
coefficient:
a. Both variables are measured on interval or ratio scales
b. Pairs of scores have been sampled randomly & independently
i. Ex: w/n pair, scores may be related but 1 pair shouldn't be related to any
of the others
c. 2 variables jointly follow bivariate normal distribution
i. Implies that each variable separately will be normally distributed
5. Possible uses for correlation coefficients include:
a. Reliability (ex: test-retest, split-half, interrater)
b. Validity (ex: construct)

c. Observing relations b/w variables as they naturally occur in the population (as
reflected in your sample)
d. Measuring causal relation b/w 2 variables after assigning participants randomly to
different quantitative levels of 1 of the variables
6. Calculation of power for testing significance of correlation coefficient is similar to
calculation corresponding to one-sample t test
a. Hypothesized population correlation coefficient (ρA) plays same role played by d
i. ρA must be combined w/ proposed sample size to find value for δ (which
can then be used to look up power); see the sketch below
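A rough sketch of that calculation, assuming the common convention δ = ρA·sqrt(n − 1) & a normal approximation for power (the ρA value is illustrative):

```python
from scipy import stats

rho_a, alpha = 0.30, 0.05                   # hypothesized rho and two-tailed alpha
z_crit = stats.norm.ppf(1 - alpha / 2)

for n in (25, 50, 100, 200):
    delta = rho_a * (n - 1) ** 0.5          # delta = rho_A * sqrt(n - 1)
    power = stats.norm.sf(z_crit - delta)   # normal approximation to power
    print(n, round(power, 2))
```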
