
Correlation and Linear Regression

Microbiology 3053 Microbiological Procedures

Correlation

Correlation analysis is used when you have measured two continuous variables and want to quantify how consistently they vary together. The stronger the correlation, the more accurately the value of one variable can be estimated from the other. The direction and magnitude of a correlation are quantified by Pearson's correlation coefficient, r

r ranges from perfectly negative (-1.00) to perfectly positive (1.00); 0.00 indicates no relationship
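As a minimal sketch of computing Pearson's r (assuming Python with SciPy; the paired measurements below are invented for illustration):

```python
from scipy import stats

# Hypothetical paired measurements of two continuous variables;
# the values are invented for illustration only
x = [20, 22, 25, 28, 30, 33, 35, 37]
y = [1.1, 1.3, 1.9, 2.2, 2.8, 3.1, 3.4, 3.9]

# pearsonr returns the correlation coefficient and a two-sided p-value
# for the null hypothesis of no association (r = 0)
r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p_value:.4f}")
```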

Correlation

The closer |r| is to 1, the stronger the relationship

r = 0 means that knowing the value of one variable tells us nothing about the value of the other

Correlation analysis uses data that has already been collected


Archival data: not produced by experimentation

Correlation does not show cause and effect but may suggest such a relationship

Correlation ≠ Causation

There is a strong, positive correlation between:

- the number of churches and bars in a town
- smoking and alcoholism (consider the relationship between smoking and lung cancer)
- students who eat breakfast and school performance
- marijuana usage and heroin addiction (vs. heroin addiction and marijuana usage)

Visualizing Correlation

Scatterplots are used to illustrate correlation analysis

- Assignment of axes does not matter (no independent and dependent variables)
- Order in which data pairs are plotted does not matter
- In strict usage, lines are not drawn through correlation scatterplots

Correlations

[Scatterplots: Weak Positive Correlation (r = 0.266), Strong Negative Correlation (r = -0.9960), and No Correlation (r = 0.00)]

Linear Regression

Used to measure the relationship between two variables for prediction and cause-and-effect relationships

- Does one variable change in a consistent manner with another variable?
- x = independent variable (cause)
- y = dependent variable (effect)

If it is not clear which variable is the cause and which is the effect, linear regression is probably an inappropriate test

Linear Regression

Calculated from experimental data

- Independent variable is under the control of the investigator (its values are exact)
- Dependent variable is normally distributed
- Differs from correlation, where both variables are normally distributed and selected at random by the investigator

Regression analysis with more than one independent variable is termed multiple (linear) regression

Linear Regression
Best-fit line is based on the sum of the squares of the distances of the data points from the values predicted by the line
[Scatterplot: Dependent Variable vs. Independent Variable with best-fit line y = 1.0092x + 8.6509, R² = 0.8863]

Linear Regression

y = a + bx where

- a = y-intercept (the value of y where x = 0, i.e., where the line crosses the y-axis)
- b = slope of the line, (y2 - y1)/(x2 - x1)
  - Positive: y increases as x increases
  - Negative: y decreases as x increases
  - 0: no relationship

The sign of the slope indicates the nature of the correlation: it matches the sign of Pearson's correlation coefficient, and a slope of 0 means no relationship between the variables
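A minimal sketch of fitting y = a + bx in Python with scipy.stats.linregress (the x and y values are invented for illustration):

```python
from scipy import stats

# x chosen by the investigator, y measured; values invented for illustration
x = [0, 10, 20, 30, 40, 50]
y = [9.2, 18.5, 29.1, 38.8, 49.9, 58.3]

fit = stats.linregress(x, y)
print(f"y = {fit.intercept:.3f} + {fit.slope:.3f}x")   # a + bx
print(f"r = {fit.rvalue:.4f}, p = {fit.pvalue:.4g}")
```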

Correlation Coefficient (r)


- Shows the strength of the linear relationship between two variables, symbolized by r
- The closer the data points are to the line, the closer r is to 1 or -1

r varies from -1 (perfect negative correlation) to 1 (perfect positive correlation)


- 0.0 - 0.2: no or very weak association
- 0.2 - 0.4: weak association
- 0.4 - 0.6: moderate association
- 0.6 - 0.8: strong association
- 0.8 - 1.0: very strong to perfect association
- The null hypothesis is no association (r = 0)

Salkind, N. J. (2000). Statistics for People Who Think They Hate Statistics. Thousand Oaks, CA: Sage.

Coefficient of Determination (r²)

Used to estimate the extent to which the dependent variable (y) is under the influence of the independent variable (x); r² is the square of the correlation coefficient

- Varies from 0 to 1
- r² = 1 means that the value of y is completely dependent on x (no error or other contributing factors)
- r² < 1 indicates that the value of y is also influenced by factors other than x

Coefficient of Determination

A measurement of the proportion of the variance of y explained by its dependence on x

- The remainder (1 - r²) is the variance of y that is not explained by x (i.e., error or other factors)
- e.g., r² = 0.84 indicates a strong relationship in which the value of x predicts 84% of the variability of y (the remaining 16% is due to other factors)

r² can also be calculated for correlation analysis by squaring r, but there it is

- not a measure of the variation of y explained by variation in x
- only a measure of the variation in y associated with the variance of x (and vice versa)
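To make the variance decomposition concrete, here is a sketch (same invented data as above) showing that r² from the fit matches the proportion of the variance of y explained by the regression:

```python
import numpy as np
from scipy import stats

x = np.array([0, 10, 20, 30, 40, 50], dtype=float)
y = np.array([9.2, 18.5, 29.1, 38.8, 49.9, 58.3])

fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)

# 1 - Var(residuals)/Var(y) is the proportion of variance explained,
# which equals r^2 for a least-squares fit
r_squared = fit.rvalue ** 2
explained = 1 - residuals.var() / y.var()
print(f"r^2 = {r_squared:.4f}, explained variance = {explained:.4f}")
```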

Assumptions of Linear Regression

- Independent variable (x) is selected by the investigator (not random) and has no associated variance
- For every value of x, the values of y have a normal distribution
- Observed values of y differ from the mean value of y by an amount called a residual (residuals are normally distributed)
- The variances of y for all values of x are equal (homoscedasticity)
- Observations are independent (each individual in the sample is measured only once)

Linear Regression Data


The numbers alone do not guarantee that the data have been fitted well!

Anscombe, F. J. 1973. Graphs in Statistical Analysis. The American Statistician 27(1):17-21.
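Anscombe's point is easy to reproduce. The sketch below assumes seaborn is installed (it bundles a copy of the quartet), fits each of the four data sets, and prints their nearly identical regression statistics:

```python
import seaborn as sns
from scipy import stats

# Anscombe's quartet: four data sets with nearly identical regression
# statistics but very different scatterplots
df = sns.load_dataset("anscombe")

for name, group in df.groupby("dataset"):
    fit = stats.linregress(group["x"], group["y"])
    print(f"data set {name}: slope = {fit.slope:.2f}, "
          f"intercept = {fit.intercept:.2f}, r = {fit.rvalue:.2f}")
```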

Linear Regression Data


Figure 1: Acceptable regression model with observations distributed evenly around the regression line

Figure 2: Strong curvature suggests that linear regression may not be appropriate (an additional variable may be required)

Linear Regression Data


Figure 3: A single outlier alters the slope of the line. The point may be erroneous, but if not, a different test may be necessary

Figure 4: Actually a regression line connecting only two points; if the rightmost point were different, the regression line would shift

What if we're not sure whether linear regression is appropriate?

Residuals

- Homoscedastic: variance appears random; a good regression model
- Heteroscedastic: funnel-shaped and possibly bowed; suggests that a transformation and the inclusion of additional variables may be warranted

Helsel, D.R., and R.M. Hirsch. 2002. Statistical Methods in Water Resources. USGS (http://water.usgs.gov/pubs/twri/twri4a3/)

[Residual plots for Data Sets 1-4: Residuals vs. X Variable 1]
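A residual plot like the ones above can be produced with a few lines of Python (matplotlib assumed; data invented for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

x = np.array([0, 10, 20, 30, 40, 50], dtype=float)
y = np.array([9.2, 18.5, 29.1, 38.8, 49.9, 58.3])

fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)

# Random scatter around zero suggests homoscedasticity; a funnel or
# bowed shape suggests a transformation may be needed
plt.scatter(x, residuals)
plt.axhline(0, linestyle="--", color="gray")
plt.xlabel("X Variable")
plt.ylabel("Residuals")
plt.show()
```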

Outliers

Values that appear very different from the others in the data set

- Rule of thumb: an outlier is more than three standard deviations from the mean

Three causes:

- Measurement or recording error
- Observation from a different population
- A rare event from within the population

May indicate an important phenomenon

- e.g., ozone hole data (outliers were removed automatically by the analysis program, delaying the observation by about 10 years)

Outliers need to be considered and not simply dismissed
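A sketch of the three-standard-deviation rule of thumb (values invented; a real analysis would investigate flagged points rather than delete them):

```python
import numpy as np

# Invented measurements; the 9.8 is suspiciously far from the rest
values = np.array([4.1, 3.9, 4.3, 4.0, 4.2, 4.1, 3.8, 4.0, 4.2, 3.9,
                   4.1, 4.0, 4.3, 3.9, 4.1, 4.0, 4.2, 4.1, 3.9, 4.0,
                   9.8])

mean, sd = values.mean(), values.std(ddof=1)

# Flag points more than three standard deviations from the mean;
# flagged values should be examined, not automatically discarded
outliers = values[np.abs(values - mean) > 3 * sd]
print(outliers)   # [9.8]
```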


Outliers

[Figure from Helsel, D.R., and R.M. Hirsch. 2002. Statistical Methods in Water Resources. USGS (http://water.usgs.gov/pubs/twri/twri4a3/)]

When is Linear Regression Appropriate?

- Data should be interval or ratio
- The dependent and independent variables should be identifiable
- The relationship between the variables should be linear (if not, a transformation might be appropriate)
- Have you chosen the values of the independent variable?
- Does the residual plot show a random spread (homoscedastic), and does the normal probability plot display a straight line (or does a histogram of the residuals show a normal distribution)?

(Normal Probability Plot of Residuals)


The normal probability plot indicates whether the residuals follow a normal distribution, in which case the points will follow a straight line. Expect some moderate scatter even with normal data. Look only for definite patterns, such as an "S-shaped" curve, which indicates that a transformation of the response may provide a better analysis. (from Design-Expert 7.0, Stat-Ease)
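A normal probability plot of residuals takes one call in SciPy (residual values invented for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Invented residuals for illustration
residuals = np.array([-0.4, 0.1, 0.3, -0.2, 0.0, 0.5, -0.3, 0.2, -0.1, 0.4])

# Points that fall near the reference line suggest normally
# distributed residuals; an S-shaped curve suggests a transformation
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
```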

(Histogram of Residuals Distribution)

Lineweaver-Burk Plot
The Michaelis-Menten equation describes enzyme activity:

v0 = Vmax[S] / (Km + [S])

and is linearized by taking its reciprocal:

1/v0 = (Km/Vmax)(1/[S]) + 1/Vmax

where: y = 1/v0, x = 1/[S], a = 1/Vmax, b = Km/Vmax
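A sketch of estimating Km and Vmax from the double-reciprocal fit (the (S, v) pairs are invented for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical substrate concentrations and initial velocities
S = np.array([10, 20, 40, 60, 80, 120], dtype=float)
v = np.array([13.3, 22.9, 35.6, 43.6, 49.2, 56.5])

# Fit 1/v = (Km/Vmax)(1/S) + 1/Vmax
fit = stats.linregress(1 / S, 1 / v)

Vmax = 1 / fit.intercept     # a = 1/Vmax
Km = fit.slope * Vmax        # b = Km/Vmax
print(f"Vmax = {Vmax:.1f}, Km = {Km:.1f}, r^2 = {fit.rvalue**2:.4f}")
```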

Mock Enzyme Experiment

[Michaelis-Menten plot: v (pennies/min) vs. S (pennies/m^2)]

Mock Enzyme Experiment

[Lineweaver-Burk plot: 1/v (pennies/min)^-1 vs. 1/S (pennies/m^2)^-1; fitted line y = 0.7053x + 0.0076, R² = 0.9785]

Mock Enzyme Experiment

[Eadie-Hofstee plot: v (pennies/min) vs. v/S (m^2/min); fitted line y = -85.671x + 124.48, R² = 0.8543]

Mock Enzyme Experiment


[Residual plot: Residuals vs. X Variable]
Mock Enzyme Experiment


[Normal probability plot of residuals vs. Sample Percentile]
