
STAT Course Notes Set 10

Chi-Squared Tests
Analyzing association between 2 Categorical Variables
In this section we will study the hypotheses used to test whether or not an association exists between 2
categorical variables.
EX 1: Researchers wanted to test the theory that women who went to work shortly after giving birth were more likely to experience postpartum depression compared to those who stayed home.

A random sample of women giving birth at a Dallas hospital was queried six months after the birth of their first child. The researchers recorded whether or not each woman worked outside the home and whether or not she experienced postpartum depression.
What is the explanatory variable?

Work Status

What is the response variable?

Whether or not the woman has postpartum depression

Descriptive Statistics:

Contingency Table: Work Status By Mental State
(each cell shows Count, Row %, Expected)

             Depressed                          Not Depressed                       Total
At Home      17 (n11)   25.37%   23.7635 (E11)    50 (n12)   74.63%   43.2365 (E12)    67 (R1)
Working      55 (n21)   40.44%   48.2365 (E21)    81 (n22)   59.56%   87.7635 (E22)   136 (R2)
Total        72 (C1)                             131 (C2)                             203 (n)

Reading a contingency table:

The rows are the groups: one row for each group defined by the explanatory variable.

The columns are divided up by response variable value: one column for each response variable value.

The first number in a cell (box) is the number of subjects in that row group taking that column's response value. Ex 1: In the first cell the first number is 17. This tells us that 17 of the 67 stay-at-home moms in the sample were depressed.

The second number in a cell is the % of subjects in the row group taking that response value. Ex 1: In the first cell the 2nd number is 25.37, which tells us that 25.37% of the stay-at-home moms in the sample were depressed. This is a conditional percent.


The 3rd number in the cell gives the number of counts we'd expect to see if H0 were true and the % of depressed moms were the same in both the working and stay-at-home groups.

How to calculate this 3rd number: Expected count = (row total × column total) / n.
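The expected-count formula can be sketched in a couple of lines of Python (this is my own helper, not the notes' software; the numbers come from the EX 1 table):

```python
# Expected count for a cell under H0: E_ij = (row total * column total) / n
def expected_count(row_total, col_total, n):
    """Count we'd expect in a cell if the null hypothesis of no association held."""
    return row_total * col_total / n

# EX 1: stay-at-home row (total 67), Depressed column (total 72), n = 203
e11 = expected_count(67, 72, 203)
print(round(e11, 4))  # 23.7635, matching E11 in the table
```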

Hypotheses:

Observational study: In the population there is no association between the 2 variables if the probability of having a particular response value is the same across all groups, and this holds for all response values.

o The null hypothesis is always that there is no association (the explanatory and response are independent).

In the population there is an association, or dependence, between the 2 variables if the explanatory variable has some value for predicting the response.

o This is not to say the explanatory causes the response but only that they are associated. For example, we can predict that the likelihood of a shark attack in Florida is greater when more ice cream cones are sold. The reason for this being that more ice cream cones are sold on warm days, and on warm days more people swim, making them shark bait.

General form of the hypotheses in observational studies

H0: There is no association between the explanatory and response variable
o Alternative statement of the null hypothesis: The explanatory and response are independent.

HA: There is an association between the explanatory and response variable
o Alternative statement of the alternative hypothesis: The explanatory and response are not independent.

If there are only 2 values for both the explanatory and response variable, then we can write the hypotheses in the following mathematical form:

H0: p1 = p2   versus   HA: p1 ≠ p2


p1 = proportion of successes in group 1. This parameter equals the conditional proportion of successes in group 1.

p2 = proportion of successes in group 2. This parameter equals the conditional proportion of successes in group 2.

Ex: 1 continued: What are the hypotheses both written out and in mathematical form?

Hypotheses when the data comes from a good Comparative Randomized Experiment:
In the case of the good randomized experiment, we can make stronger conclusions when we reject the
null hypothesis.
One standard conclusion is to say the chance of a particular experimental outcome occurring depends, in

part, on the treatment received.

Other good hypotheses statements

H0: There is no difference in response to the different treatments.
Note: a control is considered a treatment.

HA: Different treatments cause a difference in response.
o Alternative statement: The different treatments affect the response differently.

We use a statistical test called a chi-squared test to determine if there is statistical evidence of association (or cause and effect) between 2 categorical variables.
Output from the Chi-squared test (example 1):

Test      ChiSquare   Prob>ChiSq
Pearson   4.453       0.0348*

What is the p-value?

What can we conclude based on this output if α = .05?


Below is a discussion of the chi-squared test in the context of this example.

Recall that test statistics in some manner measure the distance between the null hypothesis and the data. We call the data that has been collected the observed counts. The observed counts are the number of counts from the data in each cell.

The expected counts for a cell are the # of counts that would have been observed if the conditional proportions were the same for each group, that is, if the percents in a column were identical. This is what we would expect if the null hypothesis were true and there was no sampling variability.

We will form a test statistic from these observed and expected counts to test if the proportion of stay-at-home moms who suffer from postpartum depression is different from the proportion of working moms who suffer from postpartum depression after the birth of the first child.

The test statistic has a chi-squared distribution IF the null hypothesis is true. A chi-squared distribution is written χ².
The test statistic is:

TS = Σ (ni - Ei)² / Ei

We sum over every cell in the table.

ni = actual # of subjects in cell i. For i = 1, ni = 17.

Ei = # of subjects we'd expect to be in cell i if H0 were true. For cell 1 this value is 23.76. We interpret this to mean that if working and stay-at-home moms have exactly the same chance of experiencing postpartum depression, then in a group of 203 new moms in which 67 stayed at home, we'd expect 23.76 of the stay-at-home moms to be depressed.

The larger the value of (ni - Ei)², the stronger the evidence is that H0 is not true.
For our data set the TS is calculated as

TS = (17 - 23.7635)²/23.7635 + (50 - 43.2365)²/43.2365 + (55 - 48.2365)²/48.2365 + (81 - 87.7635)²/87.7635 = 4.4527
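The whole calculation above can be sketched from scratch in Python (a sketch of the arithmetic, not the JMP software that produced the output; variable names are mine):

```python
# Chi-squared test statistic computed from the observed counts alone.
observed = [[17, 50],   # at home: depressed, not depressed
            [55, 81]]   # working: depressed, not depressed

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

ts = 0.0
for i, row in enumerate(observed):
    for j, n_ij in enumerate(row):
        e_ij = row_totals[i] * col_totals[j] / n   # expected count under H0
        ts += (n_ij - e_ij) ** 2 / e_ij            # one term per cell

print(round(ts, 4))  # 4.4527, matching the hand calculation
```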

If there is no association between postpartum depression and work status, how many of the 136 working moms in our study would you expect to have been depressed?

How many working moms in our study were actually found to be depressed?

The larger the value of the test statistic, the greater the difference between the data counts and the expected number of counts.


Do larger test statistics result in larger or smaller p-values?

Conditions that must be satisfied in order to run a chi-squared test

1. Independent samples: all observations are independent of one another. If the data comes from an SRS or randomly assigned treatments, then this is OK.
2. Large sample sizes: the expected number of observations in each cell is ≥ 5.

Note: if neither variable can be considered to be explanatory, choose one as the explanatory.

Are the conditions met to use the chi-squared test to analyze the data from example 1?
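A quick way to verify the large-sample condition is to compute every expected count and check the smallest one (a sketch using the EX 1 counts; the helper name is mine):

```python
# Large-sample condition for the chi-squared test: every expected count >= 5.
def expected_counts(observed):
    """Expected count for every cell of a contingency table under H0."""
    row_totals = [sum(r) for r in observed]
    col_totals = [sum(c) for c in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

observed = [[17, 50], [55, 81]]           # EX 1 counts
expected = expected_counts(observed)
smallest = min(min(row) for row in expected)
print(smallest >= 5)                       # True: the condition is met
```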

EX 2: The Physicians' Health Study is a very famous study which looked at the effects of aspirin on heart attack rates. In that study, the male subjects (all doctors) were randomly assigned to either take an aspirin or a placebo over a 5-year period. At the end of the study, the proportion of men who had heart attacks in each group was reported. Many spin-off studies have resulted from that one. In one study, men were randomly assigned to one of 3 groups (aspirin, ibuprofen, or placebo), which they took for a period of 5 years. The number of heart attacks in each group was recorded at the end of the study. The proportion of men in each group who had a heart attack was then compared. The data is below. The theory they are testing: does the type of drug a man takes affect his chances of having a heart attack?
Contingency Table: Drug By Health Status
(each cell shows Count, Row %, Expected)

            Heart Attack                   None                           Total
Aspirin     104   0.94%   151.664       10933   99.06%   10885.3        11037
Ibuprofen    81   1.57%    70.7133       5065   98.43%    5075.29        5146
Placebo     189   1.71%   151.623       10845   98.29%   10882.4        11034
Total       374                         26843                           27217

Test      ChiSquare   Prob>ChiSq
Pearson   26.048      <.0001*

What are the variables and variable types?

Explanatory variable: Type of drug with values: aspirin, ibuprofen and placebo

Response variable: Whether or not a man experienced a heart attack during the time of the study.

What hypotheses can we test with this data set?

Are the conditions met to use the chi-squared test to analyze this data set?
What is the p-value and what is your decision?
What do you conclude based on the chi-square test?
In the example above, the proportions in the Heart Attack column are all very small. In cases like this, a useful way to compare how much the groups' conditional proportions differ from each other is a measure called relative risk.

Relative risk = p̂1 / p̂2

p̂1 = sample proportion of successes in group 1
p̂2 = sample proportion of successes in group 2
Properties of Relative Risk
1. Relative risk can equal any number ≥ 0.
2. When the conditional proportions being compared are equal, the relative risk equals 1.
3. A relative risk greater than 1 indicates that the proportion of successes is larger in the first group than in the 2nd group.
4. A relative risk less than 1 indicates that the proportion of successes is larger in the second group.
5. Values farther from 1 (either less than or greater than 1) represent stronger associations.
Continuing EX 2: Below, the table gives just the counts.

            Heart Attack    None     Totals
Aspirin        104         10933     11037
Ibuprofen       81          5065      5146
Placebo        189         10845     11034
Totals         374         26843     27217

What is the relative risk of having a heart attack among men who take aspirin compared to men who take a placebo?

We interpret this as: Men in the study who took aspirin were about _____ as likely to have a heart attack
as those men who took a placebo.
What is the relative risk of having a heart attack among men who take Ibuprofen compared to men who
take a placebo?
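Both relative-risk calculations can be sketched with a small helper (my own function name; the counts come from the table above):

```python
# Relative risk = (proportion of successes in group 1) / (proportion in group 2).
def relative_risk(x1, n1, x2, n2):
    """Ratio of the two sample proportions x1/n1 and x2/n2."""
    return (x1 / n1) / (x2 / n2)

# Heart-attack risk: aspirin vs placebo, then ibuprofen vs placebo
rr_aspirin = relative_risk(104, 11037, 189, 11034)
rr_ibuprofen = relative_risk(81, 5146, 189, 11034)
print(round(rr_aspirin, 2), round(rr_ibuprofen, 2))  # 0.55 0.92
```

So men on aspirin had roughly half the heart-attack risk of men on placebo, while the ibuprofen group's risk was close to the placebo group's.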

Confidence intervals for the difference of 2 proportions:

Linear Regression
Inferential statistics for 2 numerical variables

Now we want to make statistical inferences about a population based on the data when there appears to be a linear relationship between the explanatory variable (numerical) and the response variable (numerical).

EX 1: We are interested in determining if there is statistical evidence of a linear relationship between height and weight in the population. In particular, in the Spring 02 STAT 302 class (population of interest), were taller people, in general, heavier? I took a random sample of people in this class. This data is plotted below.
Explanatory variable: height

Response variable:

weight

The first step in answering this question is to make a scatter plot (see below). It is clear that there is a general positive linear trend and that as height increases, on average, weight also increases. There are no outrageous outliers, and the correlation, R = .660, is a good measure of the strength of the linear association.


From this data, I can estimate the equation for the regression line. The line used to estimate the true population line is:

μ̂Y|x = -247 + 6.0x

μ̂Y|x = the estimated average weight of all individuals who are x inches tall.

For example, the estimated average weight of people 70.0 inches tall is -247 + 6.0 × 70.0 = 173.0 lbs.

[Scatter plot of weight (100 to 250 lbs) vs. height (60 to 75 inches) with the fitted regression line]

Definitions of the parameters of interest

μY|x = β0 + β1x is the equation of the population regression line.

μY|x is the average response value for all individuals in the population defined by the explanatory variable value x. This is a population parameter.

EX 1: Explanatory variable = height. Response variable = weight.
μY|64 = average weight of all people in the population who are 64 inches tall.

The slope, β1, of the regression line measures the change in μY|x for every unit change in the explanatory variable x. This is a population parameter.

EX 1: β1 = average change in weight when height increases by 1 inch for this population.

β1 is estimated by b1 = slope of the line which is the best fit through the data.

Using computer output, the value of b1 is calculated from the data set. b1 is a statistic calculated from the data.


[Three sketches of scatter plots with fitted lines: an increasing trend (b1 > 0), a decreasing trend (b1 < 0), and no trend (b1 = 0)]

The true intercept, β0, of the mean function tells the value of μY|X when X = 0. β0 is a population parameter.

We estimate the value of β0 with b0. b0 is a statistic calculated from the data.

The estimated value of μY|x is written μ̂Y|x. We use the expression b0 + b1x = μ̂Y|x to calculate the value of μ̂Y|x from the values of x, b1 and b0.

EX 1 revisited:

μ̂Y|x = -247 + 6.0x

b0 =

b1 =

μ̂Y|70 = -247 + 6.0 × 70.0 = 173.0 lbs.
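The fitted line above is just a small prediction function; a sketch using the estimated coefficients from EX 1 (function name is mine):

```python
# Estimated regression line from EX 1: average weight as a function of height.
b0, b1 = -247.0, 6.0   # intercept and slope estimated from the sample

def estimated_avg_weight(height_in):
    """Estimated average weight (lbs) of people who are height_in inches tall."""
    return b0 + b1 * height_in

print(estimated_avg_weight(70.0))  # 173.0 lbs, as computed above
```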


Simple Linear Regression
Statistical inferences for 2 numerical variables

In linear regression the idea is to test if there is a linear relationship between the explanatory and response variable. The way we tell if there is a linear relationship is to test if the slope of the least squares line is not zero. Of course, this only makes sense when the conditions are met (discussed below).

The 3 possible hypotheses that can be tested using linear regression methods are:

There is a positive linear relationship:   H0: β1 = 0 vs HA: β1 > 0   (1-sided)
There is a negative linear relationship:   H0: β1 = 0 vs HA: β1 < 0   (1-sided)
There is a linear relationship:            H0: β1 = 0 vs HA: β1 ≠ 0   (2-sided)

EX 2: Theory: the number of times a TAMU student goes out per week is negatively linearly related to their GPR. An SRS was taken of 43 STAT 302 students in Fall 02. Below is a scatter plot of their data. We want to test this theory at the α = .05 level.

Explanatory variable:
Response variable:

Hypotheses:

Summary
Multiple R   R-Square   Adjusted R-Square   Std Err of Estimate
0.5722       0.3275     0.3111              0.4831625

Regression Table
               Coefficient   Standard Error   t-Value   p-Value
Constant       3.678         0.181            20.310    < 0.0001
# NightsOut    -0.239        0.072                      0.002

b0 = estimated intercept (3.678); b1 = estimated slope (-0.239); 0.002 is the p-value for the 2-sided hypotheses for β1.


How do we interpret the value of b1?

b1 is a slope estimate which estimates the change in average response when the explanatory variable increases by 1. From this data set, we estimate the average GPR drops by 0.24 when the # of nights out increases by 1.

Calculating the p-value:

Case 1: You have 2-sided hypotheses, H0: β1 = 0 vs HA: β1 ≠ 0. Then JMP gives the correct p-value.

Case 2: You have 1-sided hypotheses.
a) The data supports HA. Then the correct p-value = (1/2) × (p-value in table).
How to tell if the data supports HA:
1. If HA: β1 > 0 then we must have b1 > 0.
2. If HA: β1 < 0 then we must have b1 < 0.
b) The sign of b1 doesn't match the HA statement. Then FTR H0.
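The two cases can be folded into one small helper (a sketch; the function and the 1 - p/2 value for the wrong-sign case are my own additions, consistent with the symmetry of the t distribution):

```python
# One-sided p-value for H0: beta1 = 0, starting from software's two-sided p-value.
def one_sided_p(two_sided_p, b1, alternative):
    """alternative is '>' (HA: beta1 > 0) or '<' (HA: beta1 < 0)."""
    sign_matches = (b1 > 0) if alternative == '>' else (b1 < 0)
    if sign_matches:
        return two_sided_p / 2       # data lean the way HA claims: halve it
    return 1 - two_sided_p / 2       # wrong sign: p-value is large, FTR H0

# EX 2: b1 = -0.239, two-sided p = 0.002, testing HA: beta1 < 0
print(one_sided_p(0.002, -0.239, '<'))  # 0.001
```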


What is the correct p-value for testing the hypotheses H0: β1 = 0 versus HA: β1 < 0?

What is the decision?

What is the conclusion?

Predicting GPR from # of nights out per week:

The predicted response of an individual whose explanatory value is x is written as ŷ. The formula for calculating ŷ is given by the regression line: ŷ = b0 + b1x.

What is the predicted GPR of a person who goes out 3 times per week?


Can we predict the GPR of a person who goes out 0 times per week?

As usual, there are conditions that must be met before we can make statistical inferences.
Below is a discussion of those conditions.
We start with defining residuals. These are very important to statisticians, but all you need to be able to do is understand enough of the plots to decide if the conditions are met.

Residuals:
There are various methods for estimating an equation for the best straight line through a set of data points. The most commonly used method results in a line called the Least Squares Line. The least squares line is the line that minimizes the sum of the squared sample residuals.

For a data point with response value y, the residual of this data point is: y - μ̂Y|x

A data point's residual tells us how much a subject's response value differs from the average response value.

A plot of the residuals (see below) tells us about the variation of the data values about the predicted regression line.

[Scatterplot of Residual vs Fit: residuals (-1.5 to 1.5) plotted against fitted values (2.0 to 3.4)]


Estimated average GPR = 3.68 - 0.24 × (nights out)

Estimate the residual for the person who goes out 6 times per week and whose GPR = 1.5.
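The arithmetic for this fill-in can be sketched as (rounded coefficients from the regression output above; variable names are mine):

```python
# Residual = observed response - predicted response from the fitted line.
b0, b1 = 3.68, -0.24            # rounded estimates from the regression table

nights_out = 6
observed_gpr = 1.5
predicted_gpr = b0 + b1 * nights_out     # 3.68 - 0.24*6 = 2.24
residual = observed_gpr - predicted_gpr
print(round(residual, 2))                # -0.74
```

A negative residual means this student's GPR is below the average predicted for people who go out 6 times per week.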

Assumptions inherent in the model of linear regression:

For theoretical reasons, we divide the target population into many populations according to the explanatory variable value. For example, if x = height and y = weight, then we have a separate population for each height. This assumption is needed because of the mathematical model assumptions we make below. These mathematical assumptions were used when someone devised the hypothesis test for 2 numerical variables. We use these model assumptions to come up with the requirements (conditions) that must be met in order to use a linear regression analysis to analyze our data. If the requirements aren't met, then we can't use a linear regression analysis on our data because the results will be nonsense.

For each explanatory variable value, we have a population of y's. Moreover, we have the following 4 conditions:

1. Independence: All the response values are independent. This is assured if the data comes from a random sample and there is exactly 1 response value for each randomly selected subject.

2. Linearity: The individuals' x and y values are linearly related by the equation μY|x = β1x + β0. The quantity μY|x is the average response (average y value) in the population of individuals taking explanatory value x.

3. Normality: For each explanatory variable value x, the response values, y, associated with that x are normally distributed. That means that for each x value, the y values (associated with that x value) are normally distributed. For example, we expect the weights of everyone who is 70.0 inches tall to be normally distributed.

4. Equal variances: For each explanatory variable value x, the response values, y, associated with that x all have the same variance. That means the y values associated with each x value have the same variance, regardless of the x value.


How to check model assumptions

Checking independence:
o Make sure there is only 1 response value per subject.

Checking the assumption of a linear relationship between μY|X and the value of x:
o Look at the scatter plot and make sure that the pattern of the points looks linear or like a shotgun pattern.

Checking the assumption of equal variances:
o Look at the scatter plot of residuals; you want to see a horizontal band of points or a shotgun pattern. You do NOT want to see a wedge shape.

Checking the assumption that the responses are normally distributed about their means:
o Look at the normal QQ plot of the residuals and make sure you don't see a C shape.

Also need to check that there are no extreme outliers, as they mess up everything, just like with correlation.

Example where all the conditions are met:

[Three plots for a data set meeting all conditions: a scatter plot used to check linearity, a Normal P-P Plot of Regression Standardized Residuals to check normality, and a scatterplot of standardized residuals vs. standardized predicted values to check equal variances]


EXAMPLES WHERE THE MODEL ASSUMPTIONS DON'T HOLD

Data not linear:

[Scatter plot, Normal P-P Plot of Regression Standardized Residual, and residual plot for a data set that is not linear]

In the case above, the explanatory and response are not linearly related, but everything else is OK, although the lack of linearity messes up the scatter plot of the residuals.
Data not normally distributed:

[Scatterplot, Normal P-P Plot of Regression Standardized Residual, and residual plot for a data set whose responses are not normally distributed]


Data doesn't have equal variances:

In the example below, the responses for each X value are normal BUT the variances increase as X increases. As a result, the constant variance assumption is violated.

[Scatterplot, Normal P-P Plot of Regression Standardized Residual, and residual plot for this data set]

More examples of the right-side plot used to check equal variances:

[Two scatterplots of standardized residuals vs. standardized predicted values: one showing equal variances, one showing unequal variances]


EX 3: Doctors would like a way to predict a premature infant's weight at birth based on the infant's gestational age. They wanted to test their theory that gestational age (in weeks) and weight (in grams) are positively linearly related. To test their theory, a research group selected a random sample of 100 premature infants and recorded the gestational age at birth and the birth weight of each baby. Assume all conditions are met to do the analysis.
Explanatory variable:

Response variable:

Hypotheses: H0: β1 = 0 vs HA: β1 > 0

[Scatterplot of birthweight (0 to 2000 grams) vs. gestational age (20 to 40 weeks)]

[Q-Q Normal Plot of Residuals]

[Scatterplot of Residuals vs Fit]

Are the conditions met to analyze this data set using linear regression?

1. Independence: This condition is met because the data comes from a random sample and, most importantly, each baby's response value (birthweight) was only measured once.
2. Linear relationship: Yes, visual inspection of the scatter plot shows gestational age at birth and birth weight are linearly related.
3. Normality: Condition met; the Q-Q normal plot of the residuals doesn't have a C shape.
4. Equal variances: Condition met; the scatter plot of residuals shows a shotgun pattern and not a wedge pattern.

Summary
Multiple R   R-Square   Adjusted R-Square   Std Err of Estimate
0.66         0.44       0.43                203.89

Recall the definition of R² from the Set 2 notes. It is the % of variability in the data's response values that can be explained by differences in the explanatory values.

What % of the variability in birth weights in this data set can be explained by differences in gestational age?

Regression Table
                  Coeff.     Standard Error   t-Value   p-Value
Intercept         -932.40    234.49           -3.976    0.0001
gestational age   70.31      22.87            3.176     0.0010

How should you interpret the number 70.31 given above?

o The average birth weight of premature babies increases by approximately 70.3 grams when the gestational age at birth increases by 1 week.

What is the correct p-value?

What decision should you make based on the above analysis?

What is your conclusion?

o The data provides very strong statistical evidence that for premature babies, gestational age at birth and birth weight are positively linearly related.

Discussion on when we can estimate average response and calculate a predicted response:

Now that we have completed our analysis, we can use the values of the coefficients to form a linear equation relating gestational age and birth weight. Based on this data set we can both estimate the average weight at birth and predict the birth weight of a baby yet to be born after x weeks.

μ̂Y|x = -932.40 + 70.31x

ŷ = -932.40 + 70.31x

x is the gestational age and ŷ is the predicted birth weight when the gestational age is x. By looking at the data, I determined that 23 is the minimum age (minimum x value in the data set) and 35 is the maximum age (maximum x value in the data set).

[Scatterplot of birthweight vs gestational age]


To find the minimum and maximum value of x, look at the scatter plot of the data.

Therefore, this equation can only be used to estimate average birth weight or predict the weights of infants whose gestational age is between 23 and 35 weeks. It is very important that once a regression is done, the estimates of β0 and β1 are only used to estimate averages or predict response values for values of x (gestational age) between the minimum and maximum x values of your data. In other words, we can interpolate between our explanatory data values, but we can't extrapolate to values outside of the range of x values.
Using our equation, we predict that a baby born at 30 weeks will weigh -932.404 + 70.310(30) =
1,176.896 grams at birth.
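Because the data only covers ages 23 to 35 weeks, a prediction function should refuse to extrapolate. A sketch (the range guard and function name are my own additions):

```python
# Predict birth weight from gestational age, refusing to extrapolate
# outside the observed range of x values (23 to 35 weeks in this data set).
B0, B1 = -932.40, 70.31        # estimated coefficients from the output
X_MIN, X_MAX = 23, 35          # min and max gestational age in the data

def predicted_weight(age_weeks):
    """Predicted birth weight (grams); None if age is outside the data range."""
    if not X_MIN <= age_weeks <= X_MAX:
        return None            # extrapolation: the fitted line doesn't apply
    return B0 + B1 * age_weeks

print(predicted_weight(30))    # about 1176.9 grams
print(predicted_weight(40))    # None: 40 weeks is outside 23-35 weeks
```

This is exactly why the question about babies born at 40 weeks has no answer from this regression: 40 weeks lies outside the data's x range.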
What is the estimated average weight of babies born at 40 weeks?

