Chapter 2

Looking at Data: Relationships

We are interested in studying the relationship between two variables by measuring both variables on the same individuals. For example, the relationship between stress and lack of sleep.
Before proceeding, we define two classifications of variables:
Quantitative: takes numerical values for which arithmetic operations such as sums and averages make sense. E.g., duration of a song.
Categorical: places a case into one of several groups. E.g., name of a song.
A variable of one type can sometimes be converted into the other.

Explanatory and Response Variables

When studying the relationship between two variables:
a response variable (dependent variable) measures an outcome of a study
an explanatory variable (independent variable) explains or influences changes in a response variable
sometimes there is no distinction between the two

Question
In a study to determine whether surgery or
chemotherapy results in higher survival
rates for a certain type of cancer, whether
or not the patient survived is one variable,
and whether they received surgery or
chemotherapy is the other. Which is the
explanatory variable and which is the
response variable?
Scatterplot
Graphs the relationship between two
quantitative (numerical) variables
measured on the same individuals.
If a distinction exists, plot the
explanatory variable on the horizontal (x)
axis and plot the response variable on
the vertical (y) axis.

Scatterplot
Relationship between mean SAT verbal score and percent of high school grads taking the SAT

Scatterplot
Look for overall pattern and deviations from this pattern
Describe pattern by form, direction, and strength of the relationship
Look for outliers

Linear Relationship
Some relationships are such that the
points of a scatterplot tend to fall along
a straight line -- linear relationship

Direction

Positive association
above-average values of one variable tend
to accompany above-average values of the
other variable, and below-average values
tend to occur together

Negative association
above-average values of one variable tend
to accompany below-average values of the
other variable, and vice versa
Examples
From a scatterplot of college students,
there is a positive association between
verbal SAT score and GPA.

Examples of Relationships

Extensions
Adding a categorical variable to the scatterplot.
Taking a log transformation when the data are tightly clustered.

Scatterplot
To add a categorical variable, use a different plot color or symbol for each category

Southern states highlighted

Measuring Strength & Direction of a Linear Relationship
How closely does a non-horizontal straight line fit the points of a scatterplot?
The correlation coefficient (often referred to as just correlation): r
measure of the strength of the relationship: the stronger the relationship, the larger the magnitude of r.
measure of the direction of the relationship: positive r indicates a positive relationship, negative r indicates a negative relationship.

Correlation Coefficient

special values for r:
a perfect positive linear relationship would have r = +1
a perfect negative linear relationship would have r = -1
if there is no linear relationship, or if the scatterplot points are best fit by a horizontal line, then r = 0
Note: r must be between -1 and +1, inclusive

both variables must be quantitative; no distinction between response and explanatory variables
r has no units; does not change when measurement units are changed (e.g., ft. or in.)
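
The unit-free and symmetric properties of r can be checked numerically. The following is a minimal sketch, not part of the original slides: the `correlation` helper and the height and weight values are made up purely for illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical heights (feet) and weights (pounds); illustrative values only.
height_ft = [5.0, 5.5, 5.8, 6.0, 6.2]
weight_lb = [110, 150, 155, 180, 175]
height_in = [12 * h for h in height_ft]  # same heights, different unit

r_ft = correlation(height_ft, weight_lb)
r_in = correlation(height_in, weight_lb)      # units changed
r_swapped = correlation(weight_lb, height_ft)  # roles of x and y swapped

print(r_ft, r_in, r_swapped)
```

All three printed values should agree up to floating-point rounding, and each lies between -1 and +1.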
Examples of Correlations

Examples of Correlations

Husband's versus Wife's ages: r = .94

Husband's versus Wife's heights: r = .36

Professional Golfer's Putting Success: Distance of putt in feet versus percent success: r = -.94

Not all Relationships are Linear


Miles per Gallon versus Speed

Linear relationship?

Correlation is close
to zero.

Not all Relationships are Linear


Miles per Gallon versus Speed

Curved relationship.

Correlation is
misleading.

Problems with Correlations


Outliers can inflate or deflate correlations, just as they distort the mean and standard deviation (see next slide)
Groups combined inappropriately may mask relationships (a third variable)
groups may have different relationships when separated
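
To make the grouping problem concrete, here is a small sketch with invented numbers (not from the slides). Within each hypothetical group y falls as x rises, yet pooling the two groups produces a positive correlation.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def r_of(points):
    """Correlation of a list of (x, y) pairs."""
    xs, ys = zip(*points)
    return correlation(xs, ys)

# Hypothetical groups: each one alone shows a perfect negative relationship.
group_1 = [(1, 5), (2, 4), (3, 3)]
group_2 = [(6, 10), (7, 9), (8, 8)]

r_group_1 = r_of(group_1)
r_group_2 = r_of(group_2)
r_combined = r_of(group_1 + group_2)  # pooled data hides the group structure

print(r_group_1, r_group_2, r_combined)
```

The group label plays the role of the third variable: ignoring it reverses the apparent direction of the relationship.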

Outliers and Correlation


[Figure: scatterplots A and B, each containing an outlier]

For each scatterplot above, how does the outlier affect the correlation?
A: outlier decreases the correlation
B: outlier increases the correlation
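
Both effects can be reproduced with toy data. This is an illustrative sketch only: the points below are invented and are not the data behind scatterplots A and B.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Case A (hypothetical): a tight linear pattern, then one point far off the line.
xs_a = [1, 2, 3, 4, 5]
ys_a = [1.1, 1.9, 3.2, 3.9, 5.1]
r_a_without = correlation(xs_a, ys_a)
r_a_with = correlation(xs_a + [6], ys_a + [0.0])

# Case B (hypothetical): a patternless cluster, then one point far away from it.
xs_b = [1, 1, 3, 3, 2]
ys_b = [1, 3, 1, 3, 2]
r_b_without = correlation(xs_b, ys_b)
r_b_with = correlation(xs_b + [10], ys_b + [10])

print("A:", r_a_without, "->", r_a_with)  # outlier weakens the correlation
print("B:", r_b_without, "->", r_b_with)  # outlier strengthens the correlation
```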
Correlation Calculation

Suppose we have data on variables X and Y for n individuals:
x_1, x_2, ..., x_n and y_1, y_2, ..., y_n

Each variable has a mean and std dev:
(\bar{x}, s_x) and (\bar{y}, s_y) (see ch. 2 for s)

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

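
The formula translates almost line for line into code. The sketch below is one possible implementation using Python's standard `statistics` module; the function name and the test sequences are assumptions chosen for illustration.

```python
from statistics import mean, stdev

def correlation_from_z_scores(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Hypothetical data lying exactly on a line, so r should be +1 or -1.
xs = [1, 2, 3, 4, 5]
r_up = correlation_from_z_scores(xs, [2 * x + 1 for x in xs])
r_down = correlation_from_z_scores(xs, [10 - 3 * x for x in xs])

print(r_up, r_down)
```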

Case Study
Per Capita Gross Domestic Product
and Average Life Expectancy for
Countries in Western Europe

Case Study
Country          Per Capita GDP (x)   Life Expectancy (y)
Austria          21.4                 77.48
Belgium          23.2                 77.53
Finland          20.0                 77.32
France           22.7                 78.63
Germany          20.8                 77.17
Ireland          18.6                 76.39
Italy            21.5                 78.51
Netherlands      22.0                 78.15
Switzerland      23.8                 78.99
United Kingdom   21.2                 77.37

Case Study
x_i     y_i      (x_i - x-bar)/s_x   (y_i - y-bar)/s_y   product
21.4    77.48    -0.078              -0.345               0.027
23.2    77.53     1.097              -0.282              -0.309
20.0    77.32    -0.992              -0.546               0.542
22.7    78.63     0.770               1.102               0.849
20.8    77.17    -0.470              -0.735               0.345
18.6    76.39    -1.906              -1.716               3.271
21.5    78.51    -0.013               0.951              -0.012
22.0    78.15     0.313               0.498               0.156
23.8    78.99     1.489               1.555               2.315
21.2    77.37    -0.209              -0.483               0.101

x-bar = 21.52, s_x = 1.532
y-bar = 77.754, s_y = 0.795
sum of products = 7.285

Case Study
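$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) = \frac{1}{10-1}(7.285) = 0.809 $$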
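
The hand calculation can be reproduced from the ten data pairs in the case study table. This is a sketch using Python's standard library; the variable names are chosen here and do not come from the slides.

```python
from statistics import mean, stdev

# Case study data: per capita GDP (x) and life expectancy (y), in table order.
gdp = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

n = len(gdp)
x_bar, y_bar = mean(gdp), mean(life)
s_x, s_y = stdev(gdp), stdev(life)

# Product of the standardized values for each country.
products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(gdp, life)]
r = sum(products) / (n - 1)

print(round(x_bar, 2), round(s_x, 3), round(y_bar, 3), round(s_y, 3))
print(round(sum(products), 3), round(r, 3))
```

Small differences from the slide in the last digit of the sum are possible, because the slide adds products that were already rounded to three decimals.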

Linear Regression
Objective: To quantify the linear relationship between an explanatory variable (x) and response variable (y).

We can then predict the average response for all subjects with a given value of the explanatory variable.

Linear Regression
Case Study
Number of new birds and Percent returning
One of nature's patterns
connects the percent of
adult birds in a colony
that return from the
previous year and the
number of new adults
that join the colony.
Prediction via Regression Line

Number of new birds and Percent returning


Example: predicting
number (y) of new
adult birds that join
the colony based on
the percent (x) of
adult birds that
return to the colony
from the previous
year.

Prediction via Regression Line

Number of new birds and Percent returning

Least Squares

Used to determine the best line
We want the line to be as close as possible to the data points in the vertical (y) direction (since that is what we are trying to predict)
Least Squares: use the line that minimizes the sum of the squares of the vertical distances of the data points from the line
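
The minimizing property can be seen by comparing the least-squares line with a few nearby lines. The sketch below uses invented data and the standard closed-form least-squares solution, which the slides have not introduced at this point; treat the function names as assumptions.

```python
def sum_squared_errors(a, b, xs, ys):
    """Sum of squared vertical distances from the points to the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Intercept a and slope b of the least-squares line (closed form)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data, roughly linear.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(xs, ys)
best = sum_squared_errors(a, b, xs, ys)

# Shift the intercept and slope slightly; every shifted line should fit worse.
nearby = [sum_squared_errors(a + da, b + db, xs, ys)
          for da in (-0.5, 0.0, 0.5)
          for db in (-0.2, 0.0, 0.2)
          if (da, db) != (0.0, 0.0)]

print(best, min(nearby))
```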

Least Squares

Least Squares Regression Line

Regression equation: y-hat = a + bx

x is the value of the explanatory variable
y-hat is the average value of the response variable (predicted response for a value of x)
note that a and b are just the intercept and slope of a straight line
a is the predicted average value of y when x is 0, and b measures the rate at which y-hat changes with respect to x
note that r and b are not the same thing, but their signs will agree
Regression Line Calculation

Regression equation: y-hat = a + bx

$$ b = r \frac{s_y}{s_x} $$

$$ a = \bar{y} - b\bar{x} $$

where s_x and s_y are the standard deviations of the two variables, and r is their correlation
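
As a sanity check, the slope obtained from r, s_x and s_y should match the slope obtained by minimizing squared errors directly. The sketch below assumes invented data and hypothetical function names.

```python
from statistics import mean, stdev

def line_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Intercept a and slope b from b = r * s_y / s_x and a = y_bar - b * x_bar."""
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data.
xs = [2, 4, 5, 7, 9]
ys = [10.0, 14.5, 15.0, 21.0, 24.5]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)) / (n - 1)

a, b = line_from_summary(x_bar, s_x, y_bar, s_y, r)

# Direct least-squares slope for comparison.
b_direct = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))

print(a, b, b_direct)
```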
Prediction via Regression Line

Number of new birds and Percent returning

The regression equation is
y-hat = 31.9343 - 0.3040x

y-hat is the average number of new birds for all colonies with percent x returning

For all colonies with 60% returning, we predict the average number of new birds to be 13.69:
31.9343 - (0.3040)(60) = 13.69 birds
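
The prediction is a single evaluation of the fitted line. A minimal sketch, with a function name chosen here for illustration:

```python
def predict_new_birds(percent_returning):
    """Average number of new adult birds predicted by the slide's fitted line."""
    return 31.9343 - 0.3040 * percent_returning

prediction = predict_new_birds(60)
print(round(prediction, 2))
```

The negative slope means colonies with a higher percent returning are predicted to gain fewer new adults.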

Regression Calculation
Case Study
Per Capita Gross Domestic Product
and Average Life Expectancy for
Countries in Western Europe

Regression Calculation
Case Study
Country          Per Capita GDP (x)   Life Expectancy (y)
Austria          21.4                 77.48
Belgium          23.2                 77.53
Finland          20.0                 77.32
France           22.7                 78.63
Germany          20.8                 77.17
Ireland          18.6                 76.39
Italy            21.5                 78.51
Netherlands      22.0                 78.15
Switzerland      23.8                 78.99
United Kingdom   21.2                 77.37

Regression Calculation
Case Study
Linear regression equation:
x-bar = 21.52, s_x = 1.532
y-bar = 77.754, s_y = 0.795
r = 0.809

$$ b = r \frac{s_y}{s_x} = (0.809) \frac{0.795}{1.532} = 0.420 $$

$$ a = \bar{y} - b\bar{x} = 77.754 - (0.420)(21.52) = 68.716 $$

y-hat = 68.716 + 0.420x
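
The same line can be fitted directly from the raw case study data rather than from rounded summaries. This is a sketch with variable names chosen here; the intercept may differ from the slide in the third decimal because the slide uses the rounded slope 0.420.

```python
# Case study data: per capita GDP (x) and life expectancy (y), in table order.
gdp = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

n = len(gdp)
x_bar = sum(gdp) / n
y_bar = sum(life) / n

# Least-squares slope and intercept.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(gdp, life))
sxx = sum((x - x_bar) ** 2 for x in gdp)
b = sxy / sxx
a = y_bar - b * x_bar

print(round(b, 3), round(a, 2))
```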
Coefficient of Determination (R^2)

Measures usefulness of regression prediction

R^2 (or r^2, the square of the correlation): measures what fraction of the variation in the values of the response variable (y) is explained by the regression line

r = 1: R^2 = 1: regression line explains all (100%) of the variation in y

r = .7: R^2 = .49: regression line explains almost half (49%) of the variation in y
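
For the GDP and life expectancy case study, the "fraction of variation explained" can be computed as 1 minus the ratio of leftover squared error to total variation in y, and compared with the square of r. This is a sketch; the variable names are assumptions.

```python
from math import sqrt

# Case study data: per capita GDP (x) and life expectancy (y), in table order.
gdp = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

n = len(gdp)
x_bar = sum(gdp) / n
y_bar = sum(life) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(gdp, life))
sxx = sum((x - x_bar) ** 2 for x in gdp)
syy = sum((y - y_bar) ** 2 for y in life)  # total variation in y

b = sxy / sxx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(gdp, life))  # unexplained part

r = sxy / sqrt(sxx * syy)
r_squared = 1 - sse / syy  # fraction of variation explained by the line

print(round(r, 3), round(r_squared, 3))
```

Here the line explains roughly two thirds of the variation in life expectancy.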
Residuals
A residual is the difference between an observed value of the response variable and the value predicted by the regression line:

$$ \text{residual} = y - \hat{y} $$

Residuals
A residual plot is a scatterplot of the regression residuals against the explanatory variable
used to assess the fit of a regression line
look for a random scatter around zero; if the scatter is not random, a linear fit is not appropriate for the data
The mean of the least-squares residuals is always equal to 0.
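
The zero-mean property of least-squares residuals is easy to confirm numerically. The sketch below uses invented data and hypothetical names; the pairs in `residual_plot_points` are what a residual plot would display.

```python
# Hypothetical data.
xs = [1, 2, 3, 4, 5, 6]
ys = [3.2, 4.1, 6.8, 7.9, 9.7, 12.5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

# residual = observed y minus predicted y-hat
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
mean_residual = sum(residuals) / n

# (x, residual) pairs: the points of a residual plot.
residual_plot_points = list(zip(xs, residuals))

print(mean_residual)  # zero up to floating-point rounding
```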

Residual Plot:
Case Study
Number of new birds and Percent
returning

Outliers and Influential Points


An outlier is an observation that lies far away from the other observations
outliers in the y direction have large residuals
outliers in the x direction are often influential for the least-squares regression line, meaning that the removal of such points would markedly change the equation of the line

Outliers:
Case Study
Gesell Adaptive Score and Age at First Word

After removing child 18: r^2 = 11%
From all the data: r^2 = 41%

Cautions about Correlation and Regression

only describe linear relationships
are both affected by outliers
always plot the data before interpreting
beware of extrapolation: predicting outside of the range of x
beware of lurking variables: variables that have an important effect on the relationship among the variables in a study, but are not included in the study
association does not imply causation


Caution:
Beware of Extrapolation

Sarah's height was plotted against her age
Can you predict her
height at age 42
months?
Can you predict her
height at age 30
years (360 months)?
Caution:
Beware of Extrapolation

Regression line:
y-hat = 71.95 + .383 x
height at age 42
months? y-hat = 88
height at age 30
years? y-hat = 209.8
She is predicted to be 6' 10.5" at age 30.
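
The arithmetic behind these two predictions is sketched below. The slide does not state units; the conversion assumes heights in centimeters and ages in months, which is consistent with 209.8 corresponding to roughly 6 ft 10.5 in. The function names are hypothetical.

```python
CM_PER_INCH = 2.54  # exact by definition

def predicted_height(age_months):
    """Height predicted by the slide's line y-hat = 71.95 + 0.383x."""
    return 71.95 + 0.383 * age_months

def to_feet_and_inches(height_cm):
    """Convert centimeters to (feet, inches). Assumes the input is in cm."""
    total_inches = height_cm / CM_PER_INCH
    feet = int(total_inches // 12)
    return feet, total_inches - 12 * feet

height_42 = predicted_height(42)    # 42 months
height_360 = predicted_height(360)  # 30 years, an extrapolation
feet, inches = to_feet_and_inches(height_360)

print(round(height_42, 1), round(height_360, 1), feet, round(inches, 1))
```

The line has no way of knowing that growth slows down, so it keeps adding 0.383 per month no matter how large x gets.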

Caution:
Correlation Does Not Imply Causation
Even very strong correlations may
not correspond to a real causal
relationship (changes in x actually
causing changes in y).
(correlation may be explained by a
lurking variable)

Caution:
Correlation Does Not Imply Causation
Social Relationships and Health
House, J., Landis, K., and Umberson, D. "Social Relationships and Health," Science, Vol. 241 (1988), pp. 540-545.

Does lack of social relationships cause people to become ill? (there was a strong correlation)
Or, are unhealthy people less likely to establish and
maintain social relationships? (reversed relationship)
Or, is there some other factor that predisposes people
both to have lower social activity and become ill?
