
Correlation

We take two measurements of two different physical properties; are they related?
What affects the degree (or amount)
of correlation?

• number of observations;
• strength of relationship (slope);
• strength of correlation (scatter);
• significance;
• confidence.
How many points are in each quadrant?

[Scatter plot divided into four quadrants; counts per quadrant:]

  2 | 5
  6 | 2
Simple case: centred on (0, 0)

  X negative, Y positive:  |  X positive, Y positive:
  X × Y is negative        |  X × Y is positive
  -------------------------+-------------------------
  X negative, Y negative:  |  X positive, Y negative:
  X × Y is positive        |  X × Y is negative
Porosity, φ, and permeability, K, are both always positive:

[Scatter plot of K against φ: every point lies in the all-positive quadrant]
But the difference between φ and mean(φ) plotted against the difference between K and mean(K) centres the plot on (0, 0):

[Scatter plot of (K − K̄) against (φ − φ̄), centred on (0, 0)]
So the difference between φ and mean(φ) and the difference between K and mean(K) give us the basis for the measure of correlation we want:

[Scatter plot of (K − K̄) against (φ − φ̄): 4 points in each of the two quadrants where the product is positive, 1 point in each of the two quadrants where it is negative]

8 points for which (K − K̄)(φ − φ̄) is positive versus 2 points for which it is negative.
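This 8-versus-2 count can be reproduced in a few lines of Python. The φ and K values below are made up purely for illustration, chosen to give the same split as the slide's example:

```python
# Hypothetical data: phi in %, K in mD, invented so that 8 of the
# mean-centred products are positive and 2 are negative.
phi = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
K = [2, 1, 4, 3, 10, 5, 12, 9, 14, 20]

phi_mean = sum(phi) / len(phi)   # 5.5
K_mean = sum(K) / len(K)         # 8.0

# Count points whose mean-centred product (phi - mean)(K - mean) is positive.
positive = sum(1 for p, k in zip(phi, K)
               if (p - phi_mean) * (k - K_mean) > 0)
negative = len(phi) - positive
print(positive, "positive,", negative, "negative")
```

A clear excess of positive products over negative ones is what signals a positive correlation.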
A formula for calculating the correlation:

  r = ∑(xi − x̄)(yi − ȳ) / √( ∑(xi − x̄)² ∑(yi − ȳ)² )
Notation
∑X = ∑xi = x1 + x2 + … + xn
∑Y = ∑yi = y1 + y2 + … + yn
But a better formula for calculating the correlation is:

  r = ( ∑xiyi − ∑xi ∑yi / n ) / √( (∑xi² − (∑xi)² / n) (∑yi² − (∑yi)² / n) )
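The same computation using only the running sums can be sketched as follows (names are illustrative):

```python
import math

def pearson_r_sums(x, y):
    """Computational form of r: five running sums, combined at the
    end -- exactly the terms a spreadsheet cell would hold."""
    n = len(x)
    s_x, s_y = sum(x), sum(y)
    s_xy = sum(xi * yi for xi, yi in zip(x, y))
    s_xx = sum(xi * xi for xi in x)
    s_yy = sum(yi * yi for yi in y)
    num = s_xy - s_x * s_y / n
    den = math.sqrt((s_xx - s_x ** 2 / n) * (s_yy - s_y ** 2 / n))
    return num / den

print(pearson_r_sums([1, 2, 3, 4], [8, 6, 4, 2]))  # exactly anti-linear data
```

No means are needed until the end, so each sum can be accumulated in a single pass over the data.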
Why is this a better formula for calculating the correlation?

Because each of the terms can be calculated relatively simply.
Calculating the correlation coefficient (1)

Set up one cell of a spreadsheet for each of the terms in the equation for the correlation coefficient, r:

  ∑xiyi
  ∑xi ∑yi / n
  ∑xi²   ∑yi²
  (∑xi)² / n   (∑yi)² / n
Notation: ∑xi
∑xi = x1 + x2 + x3 + x4 + … + xn

In Excel, if the x data are in rows 7 – 232 (n = 226) of column M and the y data in column P, then:

∑xi = SUM(M7:M232)

∑xi² = SUMSQ(M7:M232)

∑xiyi = SUMPRODUCT(M7:M232,P7:P232)
Notation: ∑xi
∑xi = x1 + x2 + x3 + x4 + … + xn
∑yi = y1 + y2 + … + yn

In Excel, if the data are in rows 7 – 232 (n = 226) of columns M and P, then:

∑xiyi = SUMPRODUCT(M7:M232,P7:P232)
Calculating the correlation coefficient (2)

  r = ( ∑xiyi − ∑xi ∑yi / n ) / √( (∑xi² − (∑xi)² / n) (∑yi² − (∑yi)² / n) )

• ∑xi = SUM(M7:M232)
• ∑xi² = SUMSQ(M7:M232)
• ∑xiyi = SUMPRODUCT(M7:M232,P7:P232)
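Outside Excel, the same assembly of sums can be checked in Python. The data below are randomly generated stand-ins for columns M and P, not the USGS values:

```python
import math
import random

random.seed(42)
n = 226                                      # rows 7-232
x = [random.random() for _ in range(n)]      # stand-in for column M
y = [2 * xi + random.random() for xi in x]   # stand-in for column P

# The spreadsheet building blocks (SUM, SUMSQ, SUMPRODUCT):
s_x, s_y = sum(x), sum(y)
s_xx = sum(xi * xi for xi in x)
s_yy = sum(yi * yi for yi in y)
s_xy = sum(xi * yi for xi, yi in zip(x, y))

r = (s_xy - s_x * s_y / n) / math.sqrt((s_xx - s_x ** 2 / n)
                                       * (s_yy - s_y ** 2 / n))

# Cross-check against the definition (mean-centred) form:
xm, ym = s_x / n, s_y / n
r_def = (sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
         / math.sqrt(sum((xi - xm) ** 2 for xi in x)
                     * sum((yi - ym) ** 2 for yi in y)))
print(r, r_def)
```

The two forms agree to floating-point precision, as the algebra says they must.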
Calculating the correlation coefficient (3)

• Calculate the correlation coefficient for the porosity and permeability data in USGS_poroperm_data\37-Lindquist-1988.xls
Alternatively, the whole expression can be evaluated in a single function call: PEARSON takes 2 arguments, the arrays of x and of y. Hence in the above example:

  = PEARSON(M7:M232, P7:P232)
Calculating the correlation coefficient (4)

• Calculate the correlation coefficient for the porosity and permeability data in USGS_poroperm_data\37-Lindquist-1988.xls

• Calculate the correlation coefficient for the porosity and log10(permeability) data in USGS_poroperm_data\37-Lindquist-1988.xls

• Calculate the correlation coefficient for the porosity and ln(permeability) data in USGS_poroperm_data\37-Lindquist-1988.xls

• Calculate the correlation coefficient for the porosity and (cube root of permeability) data in USGS_poroperm_data\37-Lindquist-1988.xls
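The effect of these transformations can be previewed with hypothetical poro-perm pairs (the exercises themselves use the Lindquist spreadsheet; the numbers below are invented):

```python
import math

def corr(x, y):
    """Pearson correlation from the definition form."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    num = sum((a - xm) * (b - ym) for a, b in zip(x, y))
    den = math.sqrt(sum((a - xm) ** 2 for a in x)
                    * sum((b - ym) ** 2 for b in y))
    return num / den

# Hypothetical data: porosity in %, permeability in mD, with K growing
# roughly exponentially in phi (a common poro-perm pattern).
phi = [5, 8, 10, 12, 15, 18, 20, 22, 25]
K = [0.1, 0.5, 1.2, 3.0, 8.0, 20.0, 45.0, 110.0, 260.0]

for name, f in [("K", lambda k: k),
                ("log10(K)", math.log10),
                ("ln(K)", math.log),
                ("K^(1/3)", lambda k: k ** (1 / 3))]:
    print(f"{name:>8}: r = {corr(phi, [f(k) for k in K]):+.3f}")
```

Note that log10(K) and ln(K) give identical correlations, because one is a constant multiple of the other and r is unchanged by linear rescaling.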
Cautions:
1. False positives:
A high sample correlation coefficient does not, by itself, mean the variables are genuinely correlated. It is the nature of random processes that, if we take a number of uncorrelated variables and plot them against each other, the computed correlations will spread around zero and, if we take enough pairs, the highest and lowest among them will appear significant at any pre-defined percentage point.
Experiment
• Create 10 sets of 10 pairs of random numbers.
• Tabulate each set.
• Calculate the correlation coefficient for each set.
A larger number of pairs reduces this risk: two data sets, both with r = 0.84:

[Two scatter plots, each with r = 0.84: one with only a few points, the other with many more]
Cautions:

2. Causality:
Just because two variables are correlated does not mean there is a causal relationship between them, even once we have ruled out the "false positive" effect. The statistical literature abounds with counter-examples, mostly accidental and some hilarious, at least to those who weren't involved.
Does this show what the authors
think it shows?
