
Correlation and Regression

SCATTER DIAGRAM
The simplest method to assess the relationship between two
quantitative variables is to draw a scatter diagram.

From this diagram we notice that as age increases there is a
general tendency for the BP to increase. But this does not
give us a quantitative estimate of the degree of the relationship.

CORRELATION COEFFICIENT
The correlation coefficient is an index of the degree of
association between two variables. It can also be used for
comparing the degree of association in different groups.

For example, we may be interested in knowing whether the degree of
association between age and systolic BP is the same (or different) in
males and females.

The correlation coefficient is denoted by the symbol r.
r ranges from -1 to +1.

When high values of one variable tend to occur with high
values of the other (and low with low),
we say that there is a positive correlation.

When high values of one variable tend to occur with low values of the other
(and vice versa),
we say that there is a negative correlation.

A NOTE OF CAUTION
The correlation coefficient is purely a measure of the degree of
association and does not provide any evidence of
a cause-effect relationship.
It is valid only in the range of values studied.
Extrapolation of the association may not always be valid.

E.g.: Age & grip strength

r measures the degree of linear relationship.

r = 0 does not necessarily mean that there is no
relationship between the two characteristics under
study; the relationship could be curvilinear.
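The point about curvilinear relationships can be sketched with a few lines of Python. The data below are hypothetical (made up for illustration, not taken from any study): y is completely determined by x through y = x^2, yet r comes out as zero because the relationship is not linear. The helper name `pearson_r` is our own.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y depends exactly on x, but along a curve (y = x^2).
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r_curve = pearson_r(x, y)
print(f"r = {r_curve:.2f}")
```

A scatter diagram of these points would show the U-shaped pattern immediately, which is one reason to plot the data before computing r.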

Spurious correlation:
The production of steel in the UK and the population of India
over the last 25 years may be highly correlated, although
neither has any bearing on the other.

r does not give the rate of change in one variable
for changes in the other variable.

E.g.: Age & Systolic BP
Males   : r = 0.7
Females : r = 0.5

From this one should not conclude that Systolic BP increases
at a higher rate among males than females.
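A small sketch in Python may help separate the two ideas. The two groups below are hypothetical (illustrative numbers only): the first lies exactly on a shallow line, the second is scattered around a much steeper line. The group with the larger r is the one with the smaller rate of change. The helper name `slope_and_r` is our own.

```python
import math

def slope_and_r(xs, ys):
    """Return (least-squares slope of y on x, Pearson r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

# Hypothetical groups sharing the same x values.
x = [1, 2, 3, 4, 5, 6]
tight_shallow = [1.0 * v for v in x]                    # exact line, slope 1
noise = [2, -2, 2, -2, 2, -2]                           # alternating scatter
loose_steep = [3.0 * v + e for v, e in zip(x, noise)]   # steeper but noisier

slope_a, r_a = slope_and_r(x, tight_shallow)
slope_b, r_b = slope_and_r(x, loose_steep)

print(f"Group A: slope = {slope_a:.2f}, r = {r_a:.2f}")
print(f"Group B: slope = {slope_b:.2f}, r = {r_b:.2f}")
```

The rate of change is a question for regression (the coefficient b), not for r.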

PROPERTY OF CORRELATION COEFFICIENT
The correlation coefficient is unaffected by addition / subtraction
of a constant, or multiplication / division by a positive constant,
applied to all the values of X and Y.
(Multiplying only one of the two variables by a negative constant
reverses the sign of r but leaves its magnitude unchanged.)

Corr. coeff. between X & Y        = 0.7
Corr. coeff. between X+10 & Y-6   = 0.7
Corr. coeff. between 5X & 2Y      = 0.7

If the correlation coefficient between height in inches and
weight in pounds is, say, 0.6, the correlation coefficient between
height in cm and weight in kg will also be 0.6.
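This property is easy to check numerically. The sketch below uses hypothetical values for X and Y (so r is not 0.7 here) and applies the same transformations as the slide: X+10 and Y-6, then 5X and 2Y. A last line, added as our own illustration, multiplies X by a negative constant to show the sign flip. The helper name `pearson_r` is an assumption of this sketch.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only.
x = [2, 4, 5, 7, 9]
y = [1, 3, 2, 6, 5]

r_xy = pearson_r(x, y)
r_shifted = pearson_r([v + 10 for v in x], [v - 6 for v in y])  # X+10 & Y-6
r_scaled = pearson_r([5 * v for v in x], [2 * v for v in y])    # 5X & 2Y
r_negated = pearson_r([-5 * v for v in x], y)                   # -5X & Y

print(f"X, Y        : {r_xy:.4f}")
print(f"X+10, Y-6   : {r_shifted:.4f}")
print(f"5X, 2Y      : {r_scaled:.4f}")
print(f"-5X, Y      : {r_negated:.4f}")
```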

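COMPUTATION OF THE CORRELATION COEFFICIENT

-------------------------------------------------------------
        X      Y    (X - X̄)   (Y - Ȳ)   (X - X̄)(Y - Ȳ)
-------------------------------------------------------------
        8     12       1         0            0
        3      9      -4        -3           12
        4     10      -3        -2            6
       10     15       3         3            9
        6     11      -1        -1            1
        7     12       0         0            0
       11     15       4         3           12
-------------------------------------------------------------
Sum    49     84       0         0           40
-------------------------------------------------------------

n = 7

\[ \bar{x} = \frac{\sum x}{n} = \frac{49}{7} = 7, \qquad \bar{y} = \frac{\sum y}{n} = \frac{84}{7} = 12 \]

\[ \mathrm{Covariance}(XY) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1} = \frac{40}{6} = 6.67 \]

\[ r = \frac{\mathrm{Cov}(xy)}{\mathrm{S.d.}(x)\,\mathrm{S.d.}(y)} = \frac{6.67}{2.94 \times 2.31} = 0.98 \]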
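The hand computation above can be reproduced step by step in Python. This is a minimal sketch using only the seven (X, Y) pairs from the table; the variable names are our own.

```python
import math

# The seven (X, Y) pairs from the worked example.
x = [8, 3, 4, 10, 6, 7, 11]
y = [12, 9, 10, 15, 11, 12, 15]

n = len(x)
mean_x = sum(x) / n   # 49 / 7
mean_y = sum(y) / n   # 84 / 7

# Sum of cross-products of deviations (last column of the table).
sum_products = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Sample covariance and standard deviations, all with divisor n - 1.
cov_xy = sum_products / (n - 1)
sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))

r = cov_xy / (sd_x * sd_y)

print(f"mean x = {mean_x:.0f}, mean y = {mean_y:.0f}")
print(f"covariance = {cov_xy:.2f}")
print(f"sd(x) = {sd_x:.2f}, sd(y) = {sd_y:.2f}")
print(f"r = {r:.2f}")
```

Note that the divisor (n - 1) cancels between the covariance and the two standard deviations, so r depends only on the sums of squares and cross-products.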

UNIVARIATE REGRESSION
Regression : Method of describing the relationship
between two variables

Use : To predict the value of one variable given the other

SAMPLE DATA SET


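------------------------------------------
Patient No.    Age (X)    Sys BP (Y)
------------------------------------------
     1           45          150
     2           48          153
     3           46          148
     4           45          150
     5           46          147
     6           48          153
     7           46          149
     8           55          159
     9           51          157
    10           56          160
    11           53          158
    12           60          165
    13           53          157
    14           54          158
    15           49          154
------------------------------------------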

BP = Response (dependent) variable; Age = Predictor (independent) variable

REGRESSION MODEL
We can perform a regression of BP on age,
to derive a straight line that gives an estimated value of BP
for any given age.
The general equation of a linear regression line is

Y = a + bX + e
Where,

a = Intercept
b = Regression coefficient
e = Statistical error

CALCULATIONS
a and b are estimated from the observed values of
Age (X) and BP (Y) by the least squares method:

\[ b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{\mathrm{Covariance}(X, Y)}{\mathrm{Variance}(X)} \]

\[ a = \bar{Y} - b\bar{X} \]

b gives the change in Y for a unit change in X.

a is the value of Y when X = 0, which may not always be meaningful.
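As a rough check, the least squares formulas can be applied to the 15 patients of the sample data set. The sketch below uses plain Python with variable names of our own choosing; the estimates should land close to the coefficients reported in the results table (small differences in the last decimal are possible because of rounding).

```python
# Sample data set: age (X) and systolic BP (Y) for 15 patients.
age = [45, 48, 46, 45, 46, 48, 46, 55, 51, 56, 53, 60, 53, 54, 49]
sbp = [150, 153, 148, 150, 147, 153, 149, 159, 157, 160, 158, 165, 157, 158, 154]

n = len(age)
mean_x = sum(age) / n
mean_y = sum(sbp) / n

# Sum of cross-products and sum of squares of deviations.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, sbp))
sxx = sum((x - mean_x) ** 2 for x in age)

b = sxy / sxx              # regression coefficient (slope)
a = mean_y - b * mean_x    # intercept

print(f"b = {b:.2f}")
print(f"a = {a:.2f}")
```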

TEST OF SIGNIFICANCE FOR b


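Null hypothesis: \( \beta = 0 \) (the regression coefficient in the population, estimated by b, is zero)

\[ \text{Test statistic } t = \frac{b - 0}{SE(b)} \qquad \ldots (1) \]

Where,

\[ SE(b) = \sqrt{\frac{\sum (Y - \bar{Y})^2 - b^2 \sum (X - \bar{X})^2}{(n - 2) \sum (X - \bar{X})^2}} \]

Under the null hypothesis, the value given under (1) follows a t-distribution with (n - 2) df.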
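The test statistic can be computed directly from the formula for SE(b). The sketch below applies it to the sample data set of 15 patients; variable names are our own. It does not compute an exact P-value (that needs a t-distribution routine from a statistics library); instead it compares t with 2.160, the two-sided 5% critical value of t with 13 df from standard tables.

```python
import math

# Sample data set: age (X) and systolic BP (Y) for 15 patients.
age = [45, 48, 46, 45, 46, 48, 46, 55, 51, 56, 53, 60, 53, 54, 49]
sbp = [150, 153, 148, 150, 147, 153, 149, 159, 157, 160, 158, 165, 157, 158, 154]

n = len(age)
mean_x = sum(age) / n
mean_y = sum(sbp) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, sbp))
sxx = sum((x - mean_x) ** 2 for x in age)
syy = sum((y - mean_y) ** 2 for y in sbp)

b = sxy / sxx

# SE(b) = sqrt( [Syy - b^2 * Sxx] / [(n - 2) * Sxx] )
se_b = math.sqrt((syy - b ** 2 * sxx) / ((n - 2) * sxx))
t_stat = (b - 0) / se_b

T_CRIT_13_DF = 2.160  # two-sided 5% critical value, t with 13 df (from tables)

print(f"SE(b) = {se_b:.2f}")
print(f"t = {t_stat:.2f} on {n - 2} df")
print("significant at 5% level" if abs(t_stat) > T_CRIT_13_DF else "not significant")
```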

ASSUMPTIONS
1. The relation between the two variables should be linear

2. The residuals should follow a Normal distribution with
zero mean and constant variance

PRECAUTIONS
1. Adequate sample size should be ensured
2. Prediction should be made within the range of the
observed values. No extrapolation should be attempted
3. The equation Y = a + bX should not be used
to predict X for a given Y
4. Model adequacy should be verified
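Precaution 3 can be illustrated with the sample data set. The sketch below fits both regressions (BP on age, and age on BP) and compares the age obtained by algebraically inverting Y = a + bX with the age predicted from the regression of X on Y. The target BP of 160 mm Hg is a hypothetical value chosen for illustration, and the variable names are our own.

```python
# Sample data set: age (X) and systolic BP (Y) for 15 patients.
age = [45, 48, 46, 45, 46, 48, 46, 55, 51, 56, 53, 60, 53, 54, 49]
sbp = [150, 153, 148, 150, 147, 153, 149, 159, 157, 160, 158, 165, 157, 158, 154]

n = len(age)
mean_x = sum(age) / n
mean_y = sum(sbp) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, sbp))
sxx = sum((x - mean_x) ** 2 for x in age)
syy = sum((y - mean_y) ** 2 for y in sbp)

# Regression of Y on X: BP = a + b * Age
b_y_on_x = sxy / sxx
a_y_on_x = mean_y - b_y_on_x * mean_x

# Regression of X on Y: Age = a' + b' * BP (a different line)
b_x_on_y = sxy / syy
a_x_on_y = mean_x - b_x_on_y * mean_y

target_bp = 160  # hypothetical BP value, within the observed range
age_by_inversion = (target_bp - a_y_on_x) / b_y_on_x
age_by_regression = a_x_on_y + b_x_on_y * target_bp

print(f"slope of X on Y = {b_x_on_y:.3f}, 1 / (slope of Y on X) = {1 / b_y_on_x:.3f}")
print(f"age by inverting Y = a + bX : {age_by_inversion:.1f}")
print(f"age from regression of X on Y: {age_by_regression:.1f}")
```

The two lines coincide only when r is exactly +1 or -1, so in practice the regression of X on Y has to be fitted separately.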

RESULTS OF REGRESSION ANALYSIS

--------------------------------------------------------------------
Ind. variable    Reg. Coeff. b    SE(b)      t       P-value
--------------------------------------------------------------------
Age                  1.08         0.08     14.16     < 0.0001
Constant           100.34
--------------------------------------------------------------------
R² = 93.99% ≈ 94%
Systolic BP = 100.34 + 1.08 Age
95% CI for b = b ± 1.96 SE(b) = 1.08 ± 1.96 x 0.08
             = (0.92, 1.24)

INTERPRETATIONS
1. b = 1.08: An increase in age of one year is associated with an average
increase of 1.08 mm Hg in Sys. BP

2. a = 100.34: When age = 0, BP = 100.34, which is absurd
(age 0 lies far outside the observed range of 45 to 60 years)

3. The estimated BP of a 50-year-old individual is
100.34 + 1.08 x 50 = 154.34 ≈ 154 mm Hg

4. R² ≈ 94%: 94% of the variation in BP is explained by age alone
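The arithmetic behind the prediction and the confidence interval can be sketched as follows, using the coefficients reported in the results table. The function name `predict_sbp` and the range guard are our own additions: the guard refuses ages outside the observed range of 45 to 60 years, in line with the precaution against extrapolation.

```python
# Coefficients as reported in the results table.
A = 100.34     # constant (intercept)
B = 1.08       # regression coefficient for age
SE_B = 0.08    # standard error of b

AGE_MIN, AGE_MAX = 45, 60  # observed age range in the sample data set

def predict_sbp(age):
    """Estimated systolic BP (mm Hg) for a given age, within the observed range."""
    if not AGE_MIN <= age <= AGE_MAX:
        raise ValueError(f"age {age} is outside the observed range; no extrapolation")
    return A + B * age

# 95% confidence interval for b, using the Normal multiplier 1.96.
ci_low = B - 1.96 * SE_B
ci_high = B + 1.96 * SE_B

print(f"Estimated BP at age 50: {predict_sbp(50):.2f} mm Hg")
print(f"95% CI for b: ({ci_low:.2f}, {ci_high:.2f})")
```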

MULTIPLE LINEAR REGRESSION


The response variable is expressed as a linear combination of
several predictor variables.
E.g.

PEmax = 47.35 + 0.147 ht. + 1.024 wt.

0.147 and 1.024 are the regression coefficients for ht. and wt.
They indicate the increase in PEmax for an increase of 1 cm in ht.
and 1 kg in wt., respectively, with the other predictor held constant.
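To make the reading of the coefficients concrete, here is a minimal sketch of the fitted equation as a Python function. It assumes both coefficients enter with a plus sign, as the word "increase" in the text implies. The function name `predict_pemax` and the example height and weight (150 cm, 40 kg) are hypothetical, chosen only for illustration.

```python
def predict_pemax(height_cm, weight_kg):
    """Estimated PEmax from the fitted equation (signs assumed positive)."""
    return 47.35 + 0.147 * height_cm + 1.024 * weight_kg

# Hypothetical individual: 150 cm tall, weighing 40 kg.
base = predict_pemax(150, 40)

# Raising one predictor by one unit, holding the other fixed,
# changes the estimate by exactly that predictor's coefficient.
effect_of_1cm = predict_pemax(151, 40) - base
effect_of_1kg = predict_pemax(150, 41) - base

print(f"PEmax at 150 cm, 40 kg: {base:.2f}")
print(f"Effect of +1 cm: {effect_of_1cm:.3f}")
print(f"Effect of +1 kg: {effect_of_1kg:.3f}")
```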

LOGISTIC REGRESSION
Response variable - Presence or absence of some condition
We predict a transformation of the response variable
instead of the actual value of the variable
Data : Hypertension, Smoking (X1) , Obesity(X2) & Snoring (X3)
Which of the factors are predictors of hypertension?

Logit (p) = -2.378 - 0.068 X1 + 0.695 X2 + 0.872 X3


The probability can be estimated for any combination of the three variables.
Also, we can compare the predicted probability for different groups,
e.g., smokers and non-smokers.
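The step from the logit to a probability uses the inverse transformation p = 1 / (1 + exp(-logit)). The sketch below applies it to the fitted equation above. It assumes each of X1, X2 and X3 is coded 1 when the factor is present and 0 when absent (the slides do not state the coding); the function name `hypertension_probability` is our own.

```python
import math

def hypertension_probability(smoking, obesity, snoring):
    """Estimated probability of hypertension from the fitted logit.

    Each argument is assumed to be coded 1 = present, 0 = absent.
    """
    logit = -2.378 - 0.068 * smoking + 0.695 * obesity + 0.872 * snoring
    return 1.0 / (1.0 + math.exp(-logit))

# Probability for every combination of the three factors.
for smoking in (0, 1):
    for obesity in (0, 1):
        for snoring in (0, 1):
            p = hypertension_probability(smoking, obesity, snoring)
            print(f"smoking={smoking} obesity={obesity} snoring={snoring}: p = {p:.3f}")

# Comparing groups: smokers vs non-smokers, other factors absent.
p_non_smoker = hypertension_probability(0, 0, 0)
p_smoker = hypertension_probability(1, 0, 0)
```

Because the coefficient for smoking is small and negative, the estimated probabilities for smokers and non-smokers differ only slightly, whereas obesity and snoring shift the probability much more.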