A: Conceptual Foundation
Concept of correlation is used in dealing w/ individual people
o If student scores better than most others in the class on a midterm exam, we
expect similar relative performance from that student on the final
Don't expect correlation to be perfect
o Some factors can change unexpectedly
o Ex: many factors will affect each student's performance on each exam
Perfect Correlation
Less obvious that 2 variables can be perfectly correlated when they are measured in
different units & 2 numbers are not simply related
o Ex: theoretically, height measured in inches can be perfectly correlated w/ weight
measured in pounds
Although numbers for each will be quite different
Correlation is not about an individual having the same number of both variables
o It is about an individual being in the same position on both variables relative to
the rest of the group
o Ex: to have perfect correlation, someone slightly above average in height should
also be slightly above average in weight
If height & weight are correlated
To quantify correlation, need to transform original score on some variable to a number
that represents score's position w/ respect to group
o Using z scores, perfect correlation can be defined: if both variables are normally
distributed, perfect positive correlation can be defined as each person in the group
having the same z score on both variables
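The z-score definition of perfect correlation can be sketched with made-up height/weight numbers (weight defined as an exact linear function of height, purely for illustration): every person gets the same z score on both variables, and r comes out as 1.

```python
# Sketch (hypothetical data): perfect positive correlation means
# every person has the same z score on both variables, even though
# the raw numbers (inches vs. pounds) are quite different.
import numpy as np

height = np.array([60.0, 64.0, 68.0, 72.0, 76.0])   # inches
weight = 2.5 * height - 40                          # pounds (assumed linear rule)

def z_scores(x):
    """Convert raw scores to z scores (position relative to the group)."""
    return (x - x.mean()) / x.std()   # biased (divide-by-N) SD

z_h, z_w = z_scores(height), z_scores(weight)
print(np.allclose(z_h, z_w))              # True: same z on both variables
print(np.corrcoef(height, weight)[0, 1])  # 1.0
```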
Negative Correlation
Perfect negative correlation: each person in the group has the same z score in magnitude
for both variables but z scores are opposite in sign
o Ex: correlation b/w score on exam & number of points taken off
If there are 100 points on exam, student w/ 85 has 15 points taken off
whereas student w/ 97 points has only 3 points taken off
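The exam-score example above is easy to verify numerically (exam scores here are made up): score and points-taken-off give equal-magnitude, opposite-sign z scores, so r = −1.

```python
# Sketch: on a 100-point exam, score and points-taken-off are
# perfectly negatively correlated.
import numpy as np

scores = np.array([85.0, 97.0, 70.0, 88.0, 60.0])   # hypothetical exam scores
points_off = 100 - scores                           # e.g. 85 -> 15 off, 97 -> 3 off

z = lambda v: (v - v.mean()) / v.std()
print(z(scores))       # the z for 97 is positive...
print(z(points_off))   # ...and the matching z here is equal in size, negative
print(np.corrcoef(scores, points_off)[0, 1])   # -1.0
```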
Correlation doesn't have to be based on individual people, each measured twice
o Individuals being measured can be schools, cities, or even entire countries
Correlation Coefficient
Correlation coefficient: used to measure amount of correlation
o Coefficient of +1: perfect positive correlation
o Coefficient of -1: perfect negative correlation
o Coefficient of 0: total lack of correlation
Numbers b/w 0 and 1 represent relative amount of correlation, w/ sign (+ or -)
representing direction of correlation
When 2 z scores are always equal, largest z scores get multiplied by the largest z scores
which makes up for the fact that smallest z scores are being multiplied together
If z scores randomly paired:
o For some of the cross products, 2 z scores will have the same sign so cross
product will be positive
o For just about as many cross products, 2 z scores would have opposite signs,
producing negative cross products
o Positive & negative cross products would cancel each other outleading to
coefficient near 0
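The cross-product logic can be checked by simulation (the data here are randomly generated, not from the text): r equals the mean of the z-score cross products, and shuffling one variable makes the positive and negative cross products cancel.

```python
# Sketch: Pearson's r is the mean of the cross products of z scores.
# Matched pairs: big z * big z products pile up positively.
# Random pairing: positive and negative cross products roughly cancel.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = x + rng.normal(scale=0.5, size=10_000)   # y strongly related to x

z = lambda v: (v - v.mean()) / v.std()
r_matched = np.mean(z(x) * z(y))                      # near +0.9
r_shuffled = np.mean(z(x) * z(rng.permutation(y)))    # near 0
print(r_matched, r_shuffled)
```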
Linear Transformations
Linear transformation: transformation that doesn't change z scores
o Ex: rule used to convert Celsius (centigrade) temperatures into Fahrenheit
temperatures: F = (9/5)C + 32
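The Celsius-to-Fahrenheit rule (F = (9/5)C + 32) makes a quick check of this point (temperature readings below are made up): a linear transformation leaves every z score unchanged, so the two variables are perfectly correlated.

```python
# Sketch: a linear transformation (Celsius -> Fahrenheit) leaves
# every z score unchanged, so r between the two sets of readings is 1.
import numpy as np

celsius = np.array([-10.0, 0.0, 15.0, 25.0, 40.0])   # hypothetical readings
fahrenheit = 9 / 5 * celsius + 32

z = lambda v: (v - v.mean()) / v.std()
print(np.allclose(z(celsius), z(fahrenheit)))    # True
print(np.corrcoef(celsius, fahrenheit)[0, 1])    # 1.0
```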
Graphing Correlation
Correlation coefficient: useful number for characterizing relationship b/w 2 variables
Describing a group of numbers in terms of the mean for the group can be misleading if numbers are
very spread out
o Even adding SD still doesn't tell us if distribution is strongly skewed or not
o Only by drawing distribution can we be sure if mean & SD were a good way to
describe distribution
Scatterplot (scatter diagram): graph in which 1 of the variables is plotted on the X axis &
the other variable is plotted on the Y axis
o Each individual is represented as a single dot on the graph
o Can be used to illustrate characteristics & limitations of correlation coefficient
Dealing w/ Curvilinear Relationships
Important property of Pearson's r: coefficient measures only degree of linear correlation
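This property is easy to demonstrate with an artificial curvilinear example: y is perfectly predictable from x below (y = x²), yet Pearson's r is about 0 because there is no linear trend.

```python
# Sketch: Pearson's r picks up only the *linear* part of a relationship.
# y = x**2 is a perfect (curvilinear) relationship, but it is symmetric
# about x = 0, so the linear correlation is essentially zero.
import numpy as np

x = np.linspace(-3, 3, 101)
y = x ** 2
r = np.corrcoef(x, y)[0, 1]
print(r)   # approximately 0
```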
Summary
1. If each individual has the same score on 2 different variables, 2 variables will be perfectly
correlated
a. Condition is not necessary
b. Perfect correlation can be defined as all individuals having the same z score (ex:
same relative position in distribution) on both variables
2. Negative correlation: tendency for high scores on 1 variable to be associated w/ low
scores on a second variable (& vice versa)
a. Perfect negative correlation occurs when each individual has the same magnitude
z score on the 2 variables but z scores are opposite in sign
3. Pearson's correlation coefficient (r): can be defined as average of cross products of z
scores on 2 variables
a. Pearson's r ranges from -1.0 for perfect negative correlation to 0 when there's no
linear relationship b/w variables to +1.0 when correlation is perfectly positive
4. Linear transformation: conversion of 1 variable into another by only arithmetic
operations (ex: adding, subtracting, multiplying, dividing) involving constants
a. If 1 variable is linear transformation of another, each individual will have the
same z score on both variables & 2 variables will be perfectly correlated
b. Changing units of measurement on either or both of variables will not change
correlation as long as change is linear one (as is usually the case)
5. Scatterplot (scattergram): graph of 1 variable plotted on X axis vs. second variable
plotted on Y axis
a. Scatterplot of perfect correlation will be a straight line that slopes up to the right
for positive correlation & down to the right for negative correlation
6. One important property of Pearson's r: assesses only degree of linear relationship b/w 2
variables
a. 2 variables can be closely related by a very simple curve & yet produce a
Pearson's r near 0
7. Problems can occur when Pearson's r is measured on subset of population but you wish
to extrapolate results to estimate correlation (ρ) for entire population
a. Most common problem: having truncated/restricted range on 1 or both of
variables
i. Problem usually causes r for sample to be considerably less than ρ
ii. In rare cases, opposite can occur (ex: when measuring one portion of
curvilinear relationship)
8. Another potential problem w/ Pearson's r: a few bivariate outliers in sample can drastically
change magnitude (in rare instances, even the sign) of the correlation coefficient
a. Important to inspect scatterplot for curvilinearity, outliers, & other aberrations
before interpreting meaning of correlation coefficient
9. Correlation (even if high in magnitude & statistically significant) doesn't prove that
there's any causal link b/w 2 variables
a. Always the possibility that some 3rd variable is separately affecting each of the 2
variables being studied
b. Experimental design (w/ random assignment of participants to conditions or
different quantitative levels of some variable) is required to determine whether 1
particular variable is causing changes in 2nd variable
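The truncated-range problem from point 7 can be simulated with hypothetical data (the population correlation of .7 and the cutoff at the median are arbitrary choices for illustration): restricting x to its top half noticeably shrinks the sample r.

```python
# Sketch: truncating the range of one variable usually shrinks r.
# Hypothetical data: population correlation near .7, then keep only
# the top half of x (e.g. studying only admitted applicants).
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=20_000)
y = 0.7 * x + rng.normal(scale=(1 - 0.7 ** 2) ** 0.5, size=20_000)

r_full = np.corrcoef(x, y)[0, 1]
keep = x > np.median(x)                        # restrict the range of x
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]
print(r_full, r_restricted)                    # restricted r is clearly smaller
```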
B: Basic Statistical Procedures
For mean, it doesn't matter whether referring to population mean (μ) or sample mean (X̄)
o Calculation is the same
There's a difference b/w σ and s
o s is calculated w/ n − 1 in denominator
o σ is calculated w/ N in denominator
Covariance
Covariance gets larger in magnitude as 2 variables show a greater tendency to covary
(vary togethereither positively or negatively)
o Dividing by product of 2 SDs ensures that correlation coefficient will never get
larger than +1.0 or smaller than -1.0
Denominator can be considered biased (when extrapolating to larger population) just like
numerator has corresponding bias
o 2 biases cancel each other out so that value of r is not affected
If you wish to use unbiased SDs in denominator, you need to calculate an unbiased
covariance in the numerator
Unbiased Covariance
Bias in numerator is removed in same way we remove bias from formula for SD:
o Divide by n − 1 instead of N
Example of calculating Pearson's r
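A sketch of the calculation with made-up numbers: the biased covariance is the mean of the cross products minus the cross product of the means, and using biased or unbiased versions consistently in numerator and denominator gives the same r.

```python
# Sketch: r = covariance / (SD_x * SD_y). Using divide-by-N (biased)
# or divide-by-(N-1) (unbiased) versions *consistently* gives the same r.
import numpy as np

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])    # made-up scores
y = np.array([1.0, 3.0, 4.0, 8.0, 10.0])
n = len(x)

cov_biased = np.mean(x * y) - x.mean() * y.mean()   # divide by N
r_biased = cov_biased / (x.std() * y.std())         # biased SDs

cov_unbiased = cov_biased * n / (n - 1)             # divide by N - 1
r_unbiased = cov_unbiased / (x.std(ddof=1) * y.std(ddof=1))

print(r_biased, r_unbiased, np.corrcoef(x, y)[0, 1])   # all equal
```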
Using t Distribution
When ρ = 0 & sample size is quite large, sample rs will be normally distributed w/ mean
of 0 & standard error of about 1/√n
For sample sizes that aren't large, standard error can be estimated by:
s_r = √[(1 − r²)/(n − 2)]
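The standard error above leads to the usual t test for r (the values r = .45, n = 30 below are made up for illustration): t = r / s_r = r·√(n − 2)/√(1 − r²), with n − 2 degrees of freedom when ρ = 0.

```python
# Sketch: testing a sample r for significance with the t distribution.
import math
from scipy import stats

r, n = 0.45, 30                                   # hypothetical sample values
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # t with df = n - 2
p = 2 * stats.t.sf(abs(t), df=n - 2)              # two-tailed p value
print(t, p)                                       # significant at .05
```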
Sample rs are clustered more tightly around population ρ as sample size increases
o Becomes more & more unlikely to obtain sample r far from ρ as n gets larger
Calculated r for sample is just as likely to get smaller as larger when you increase sample
size
As sample gets larger, sample r tends to be closer to ρ
Normal distribution
Each of 2 variables should ideally be measured on interval or ratio scale & be
normally distributed in population
Bivariate normal distribution
If assumption of bivariate normal distribution is satisfied, can be certain that preceding
assumption will also be satisfied
Possible for each variable to be normally distributed separately w/o 2 variables jointly
following bivariate normal distribution
In univariate distribution, 1 variable is placed along horizontal axis & (vertical) height of
distribution represents likelihood of each value
In bivariate distribution, 2 axes are needed to represent 2 variables
o Third axis is needed to represent relative frequency of each possible pair of values
o Common bivariate distribution: smooth, round hill
Values near the middle are more common
Values that are far away in any direction (including diagonally) are less
common
Bivariate normal distribution can involve any degree of linear relationship b/w 2
variables from r=0 to r=+1.0 or -1.0
o However, curvilinear relationship wouldn't be consistent w/ bivariate normal
distribution
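A sketch of sampling from a bivariate normal distribution (the chosen correlation of .6 is arbitrary): any degree of linear relationship can be built in, and the sample r lands close to it.

```python
# Sketch: draw from a bivariate normal "hill" with a chosen correlation
# and check that the sample r is close to the value built in.
import numpy as np

rng = np.random.default_rng(7)
rho = 0.6                                   # assumed population correlation
cov = [[1.0, rho], [rho, 1.0]]              # unit variances, correlation rho
x, y = rng.multivariate_normal([0, 0], cov, size=50_000).T

r = np.corrcoef(x, y)[0, 1]
print(r)   # close to 0.6
```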
Spearman rank-order correlation formula: commonly used when correlation is
calculated for ranked data
o Resulting coefficient is Spearman's rho (r_s)
Correlation coefficient is interpreted in same way as any other Pearson's r
but critical values for testing significance are different
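A quick sketch of that point (the data values are made up): Spearman's rho is just Pearson's r computed on the ranks.

```python
# Sketch: Spearman's rho equals Pearson's r applied to the rank-transformed data.
import numpy as np
from scipy import stats

x = np.array([3.0, 1.0, 4.0, 1.5, 9.0, 2.0])   # hypothetical raw scores
y = np.array([2.0, 0.5, 5.0, 1.0, 7.0, 3.0])

rho, p = stats.spearmanr(x, y)
r_on_ranks = np.corrcoef(stats.rankdata(x), stats.rankdata(y))[0, 1]
print(rho, r_on_ranks)   # identical values
```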
Assumption of bivariate normal distribution becomes less important as sample size
increases
o For very large sample sizes, assumption can be grossly violated w/ very little error
o Generally, if sample size is above about 30 or 40 & bivariate distribution doesn't
deviate a great deal from bivariate normality, assumption is usually ignored
Uses of Pearson correlation coefficient
o Expected r describes size of effect in population & by itself its size doesn't tell you
whether particular test of null hypothesis is likely to come out statistically
significant
In contrast, expected t value is reflection of both d & sample size
Does give you a way to determine likelihood of attaining statistical
significance
o When performing power analysis for correlation, expected r plays same role as d
in power analysis of t test & not the role of expected t
Power analysis can be used to determine maximum number of participants that should be
used
o Determine based on desired levels for power & α
o Plug smallest correlation of interest for variables youre dealing w/
n given by calculation is largest sample size you should use
Larger n will have unnecessarily high chance of giving you statistical
significance when true correlation is so low that you wouldn't care about it
In place of ρ_A can plug in largest correlation that can reasonably be
expected
o n that is calculated is minimum sample size you should employ
Any smaller sample size will give you a less than desirable chance of
obtaining statistical significance
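One common way to sketch this sample-size calculation uses the Fisher Z transformation (an approximation; not necessarily the table-based method the notes describe): n ≈ ((z_α + z_β)/Z_ρ)² + 3, where Z_ρ is the transformed hypothesized correlation.

```python
# Sketch (Fisher-Z approximation, assumed here for illustration):
# sample size needed to detect a hypothesized population correlation
# with a given two-tailed alpha and desired power.
import math
from scipy import stats

def n_for_correlation(rho, alpha=0.05, power=0.80):
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # critical z for two-tailed alpha
    z_beta = stats.norm.ppf(power)            # z corresponding to desired power
    z_rho = math.atanh(rho)                   # Fisher Z transformation of rho
    return math.ceil(((z_alpha + z_beta) / z_rho) ** 2 + 3)

print(n_for_correlation(0.30))   # roughly mid-80s for a "medium" correlation
print(n_for_correlation(0.10))   # a "small" correlation needs far more participants
```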
Magnitude of correlation can sometimes be predicted based on previous studies (w/ d)
o At other times, expected correlation is characterized roughly as small, medium, or
large
o Conventional guidelines for Pearson's r: .1 = small, .3 = medium, .5 = large
Correlations much larger than .5 usually involve 2 variables that are
measuring same thing (ex: various types of reliability described earlier or
2 different questionnaires that are both designed to assess patient's current
level of depression)
ρ and d are conceptually similar in terms of what they are assessing in the population
o Are measured on very different scales
Correlation can be used as alternative measure of effect size but it is more often referred
to as a measure of the strength of association b/w 2 variables
Once you have determined that particular sample r is statistically significant, you are permitted to
rule out ρ = 0 (that 2 variables have no linear relationship in population)
o Sample r provides point estimate of ρ
Accuracy of point estimate depends on sample size
An interval estimate would be more informative than point estimate
o Provides clearer idea of what values are likely for population correlation
coefficient
o Interval estimation for ρ is not performed as often as it is for μ
Probably since it has fewer practical implications
But should be performed more often than it is
Constructing confidence interval for ρ requires Fisher Z transformation
o Best handled by statistical software
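A sketch of the Fisher Z interval (reusing the made-up r = .45, n = 30 from the t-test example): transform r to Z = atanh(r), build the interval in Z units with standard error 1/√(n − 3), then back-transform the endpoints with tanh.

```python
# Sketch: 95% confidence interval for rho via the Fisher Z transformation.
import math
from scipy import stats

r, n = 0.45, 30                          # hypothetical sample values
z = math.atanh(r)                        # Fisher Z transformation
se = 1 / math.sqrt(n - 3)                # standard error in Z units
z_crit = stats.norm.ppf(0.975)           # two-tailed 95% critical z
lo = math.tanh(z - z_crit * se)          # back-transform endpoints
hi = math.tanh(z + z_crit * se)
print(lo, hi)   # interval excludes 0, consistent with the significant r
```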
Summary
1. To calculate Pearson's r w/o transforming to z scores: calculate means & biased SDs of
both variables as well as cross products (ex: X times Y) for all individuals
a. Pearson's r is equal to mean of cross products minus cross product of 2 means, divided
by product of 2 biased SDs
i. Numerator of ratio is biased covariance
ii. If wishing to use unbiased SDs in denominator, numerator must be
adjusted to correct bias
2. Calculation method is not accurate unless retaining at least 4 digits past decimal point for
means, SDs, & mean of cross products
3. Correlation coefficient can be tested for significance w/ t test or by looking up critical
value for r
a. As sample size increases, smaller rs become significant
b. Any r other than 0 can become significant w/ large enough sample size
i. However, sample r doesn't tend to get larger just b/c sample size is
increased
1. Gets closer to ρ
4. Following assumptions are required for testing significance of Pearson correlation
coefficient:
a. Both variables are measured on interval or ratio scales
b. Pairs of scores have been sampled randomly & independently
i. Ex: w/n pair, scores may be related but 1 pair shouldnt be related to any
of the others
c. 2 variables jointly follow bivariate normal distribution
i. Implies that each variable separately will be normally distributed
5. Possible uses for correlation coefficients include:
a. Reliability (ex: test-retest, split-half, interrater)
b. Validity (ex: construct)
c. Observing relations b/w variables as they naturally occur in the population (as
reflected in your sample)
d. Measuring causal relation b/w 2 variables after assigning participants randomly to
different quantitative levels of 1 of the variables
6. Calculation of power for testing significance of correlation coefficient is similar to
calculation corresponding to one-sample t test
a. Hypothesized population correlation coefficient (ρ_A) plays same role played by d
i. ρ_A must be combined w/ proposed sample size to find value for δ (which
can then be used to look up power)