
Two-sample tests

Binary or categorical outcomes (proportions)

Outcome variable: binary or categorical (e.g., fracture, yes/no)

Are the observations correlated?

Independent:
Chi-square test: compares proportions between two or more groups
Relative risks: odds ratios or risk ratios
Logistic regression: multivariate technique used when outcome is binary; gives multivariate-adjusted odds ratios

Correlated:
McNemar's chi-square test: compares binary outcome between correlated groups (e.g., before and after)
Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)

Alternatives to the chi-square test if sparse cells:
Fisher's exact test: compares proportions between independent groups when there are sparse data (some cells <5).
McNemar's exact test: compares proportions between correlated groups when there are sparse data (some cells <5).

Recall: The odds ratio (two samples = cases and controls)

                  Smoker (E)   Non-smoker (~E)   Total
Stroke (D)            15             35            50
No Stroke (~D)         8             42            50

$$OR = \frac{ad}{bc} = \frac{15 \times 42}{35 \times 8} = 2.25$$

Interpretation: the odds of stroke are 2.25-fold higher in smokers than in non-smokers.
Inferences about the odds
ratio
Does the sampling distribution follow a
normal distribution?
What is the standard error?
Simulation
1. In SAS, assume infinite population of cases and
controls with equal proportion of smokers
(exposure), p=.23 (UNDER THE NULL!)
2. Use the random binomial function to randomly
select n=50 cases and n=50 controls each with
p=.23 chance of being a smoker.
3. Calculate the observed odds ratio for the
resulting 2x2 table.
4. Repeat this 1000 times (or some large number
of times).
5. Observe the distribution of odds ratios under
the null hypothesis.
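
The five steps above can be sketched in code. This is a minimal sketch in Python rather than SAS (a substitution, not the slides' own program); the function name, the seed, and the choice to redraw the rare tables that contain an empty cell are all assumptions made for illustration.

```python
import math
import random
import statistics

def simulate_null_odds_ratios(n_cases=50, n_controls=50, p_exposed=0.23,
                              n_sims=1000, seed=1):
    """Draw 2x2 tables under the null and return the simulated odds ratios."""
    rng = random.Random(seed)
    odds_ratios = []
    while len(odds_ratios) < n_sims:
        # Binomial draws: number of smokers among cases (a) and controls (c),
        # both with the same exposure probability (the null hypothesis).
        a = sum(rng.random() < p_exposed for _ in range(n_cases))
        c = sum(rng.random() < p_exposed for _ in range(n_controls))
        b, d = n_cases - a, n_controls - c
        if 0 in (a, b, c, d):
            continue  # assumption: redraw rare tables with an empty cell
        odds_ratios.append((a * d) / (b * c))
    return odds_ratios

ors = simulate_null_odds_ratios()
log_ors = [math.log(x) for x in ors]
print("median OR:", statistics.median(ors))
print("mean OR:  ", statistics.mean(ors))    # pulled above the median by the right skew
print("SD of lnOR:", statistics.stdev(log_ors))
```

Comparing the mean and median of the simulated ORs is one quick way to see the skew; the standard deviation of the simulated lnOR values serves as an empirical standard error.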

Properties of the OR (simulation)
(50 cases/50 controls/23% exposed)
Under the null, this is the expected variability of the sample OR; note the right skew.
Properties of the lnOR

Normal!

From the simulation, can get the empirical standard error (~0.5) and p-value (~.10).

Or, in general, standard error of lnOR =

$$\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}$$
Inferences about the ln(OR)

                  Smoker (E)   Non-smoker (~E)   Total
Stroke (D)            15             35            50
No Stroke (~D)         8             42            50

$$OR = 2.25, \qquad \ln(OR) = 0.81$$

$$Z = \frac{\ln(2.25) - 0}{\sqrt{\frac{1}{8}+\frac{1}{15}+\frac{1}{35}+\frac{1}{42}}} = \frac{0.81}{0.494} = 1.64$$

p=.10
Confidence interval

                  Smoker (E)   Non-smoker (~E)   Total
Stroke (D)            15             35            50
No Stroke (~D)         8             42            50

$$95\%\ \text{CI for } \ln OR = 0.81 \pm 1.96 \times 0.494 = (-0.16,\ 1.78)$$

$$95\%\ \text{CI for } OR = (e^{-0.16},\ e^{1.78}) = (0.85,\ 5.92)$$

Final answer: 2.25 (0.85, 5.92)
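
The Z test and confidence interval for the odds ratio can be checked with a few lines of code. This is a sketch in Python (the slides use SAS); the function name `odds_ratio_inference` is hypothetical, and the cell labels a, b, c, d follow the 2x2 table above (a = exposed cases, b = unexposed cases, c = exposed controls, d = unexposed controls).

```python
import math
from statistics import NormalDist

def odds_ratio_inference(a, b, c, d, z_crit=1.96):
    """Odds ratio, SE of ln(OR), Z statistic, two-sided p, and 95% CI."""
    or_hat = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = (math.log(or_hat) - 0) / se_log            # null value of ln(OR) is 0
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    # Build the interval on the log scale, then exponentiate the limits.
    lower = math.exp(math.log(or_hat) - z_crit * se_log)
    upper = math.exp(math.log(or_hat) + z_crit * se_log)
    return or_hat, se_log, z, p_two_sided, (lower, upper)

or_hat, se_log, z, p, ci = odds_ratio_inference(15, 35, 8, 42)
print(or_hat, round(se_log, 3), round(z, 2), round(p, 2), ci)
```

Working on the log scale is what makes the interval asymmetric around 2.25 once it is transformed back.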
Practice problem:
Suppose the following data were collected in a case-control study of brain tumor and cell phone usage:

                         Brain tumor   No brain tumor
Own a cell phone              20             60
Don't own a cell phone        10             40

Is there sufficient evidence for an association between cell phones and brain tumor?
Answer
1. What is your null hypothesis?
Null hypothesis: OR = 1.0; lnOR = 0
Alternative hypothesis: OR ≠ 1.0; lnOR ≠ 0 (two-sided)

2. What is your null distribution?

$$\ln OR \sim N(0, \sigma); \qquad \sigma = SD(\ln OR) = \sqrt{\frac{1}{10}+\frac{1}{20}+\frac{1}{60}+\frac{1}{40}} = .44$$

3. Empirical evidence: OR = (20*40)/(60*10) = 800/600 = 1.33
lnOR = .288

4. Z = (.288-0)/.44 = .65
p-value = P(Z>.65 or Z<-.65) = .26*2 = .52

TWO-SIDED TEST: it would be just as extreme if the sample lnOR were .65 standard deviations or more below the null mean.

5. Not enough evidence to reject the null hypothesis of no association
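Key measures of relative risk: 95% CIs for OR and RR

For an odds ratio, 95% confidence limits:

$$\left( OR \times \exp\left[-1.96\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}\right],\ \ OR \times \exp\left[+1.96\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}\right] \right)$$

For a risk ratio, 95% confidence limits:

$$\left( RR \times \exp\left[-1.96\sqrt{\frac{1-a/(a+b)}{a}+\frac{1-c/(c+d)}{c}}\right],\ \ RR \times \exp\left[+1.96\sqrt{\frac{1-a/(a+b)}{a}+\frac{1-c/(c+d)}{c}}\right] \right)$$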
Continuous outcome (means)

Outcome variable: continuous (e.g., pain scale, cognitive function)

Are the observations independent or correlated?

Independent:
T-test: compares means between two independent groups
ANOVA: compares means between more than two independent groups
Pearson's correlation coefficient (linear correlation): shows linear correlation between two continuous variables
Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

Correlated:
Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size): non-parametric statistics
Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
Wilcoxon rank-sum test (= Mann-Whitney U test): non-parametric alternative to the t-test
Kruskal-Wallis test: non-parametric alternative to ANOVA
Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient
The two-sample t-test
Is the difference in means that we observe between two groups more than we'd expect to see based on chance alone?


The standard error of the difference of two means

$$\sigma_{\bar{x}-\bar{y}} = \sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}}$$

**First add the variances and then take the square root of the sum to get the standard error.

Recall, Var(A-B) = Var(A) + Var(B) if A and B are independent!
Shown by simulation:

One sample of 30 (with SD=5): $$SE = \frac{5}{\sqrt{30}} = .91$$

One sample of 30 (with SD=5): $$SE = \frac{5}{\sqrt{30}} = .91$$

Difference of the two samples: $$SE(\text{diff}) = \sqrt{\frac{25}{30}+\frac{25}{30}} = 1.29$$
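
A short simulation makes the "add the variances" rule concrete. This is a sketch in Python, not the original simulation; it assumes normally distributed data with SD=5, and the function name, seed, and number of repetitions are arbitrary choices.

```python
import math
import random
import statistics

def simulate_mean_differences(n=30, m=30, sd=5.0, n_sims=2000, seed=2):
    """Repeatedly draw two independent samples and record the difference in means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_sims):
        x_bar = statistics.fmean([rng.gauss(0.0, sd) for _ in range(n)])
        y_bar = statistics.fmean([rng.gauss(0.0, sd) for _ in range(m)])
        diffs.append(x_bar - y_bar)
    return diffs

diffs = simulate_mean_differences()
empirical_se = statistics.stdev(diffs)              # spread of the simulated differences
theoretical_se = math.sqrt(5.0**2 / 30 + 5.0**2 / 30)  # sqrt of the summed variances
print(round(empirical_se, 2), round(theoretical_se, 2))
```

The two printed values should agree closely, which is the point of the slide.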
Distribution of differences
If $\bar{X}_n$ and $\bar{Y}_m$ are the averages of n and m subjects, respectively:

$$\bar{X}_n - \bar{Y}_m \sim N\left(\mu_x - \mu_y,\ \sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}}\right)$$
But...
As before, you usually have to use the sample SD, since you won't know the true SD ahead of time.
So, again, this becomes a T-distribution.

Estimated standard error of the difference:

$$\sigma_{\bar{x}-\bar{y}} \approx \sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}$$

Just plug in the sample standard deviations for each group.
Case 1: un-pooled variance
Question: What are your degrees of freedom here?
Answer: Not obvious!

Case 1: ttest, unpooled variances
It is complicated to figure out the degrees of freedom here! A good approximation is given as df ≈ harmonic mean of n and m (or SAS will tell you!):

$$T = \frac{\bar{X}_n - \bar{Y}_m}{\sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}} \sim t_\nu, \qquad \nu \approx \frac{2}{\frac{1}{n} + \frac{1}{m}}$$
Case 2: pooled variance
If you assume that the standard deviation of the characteristic (e.g., IQ) is the same in both groups, you can pool all the data to estimate a common standard deviation. This maximizes your degrees of freedom (and thus your power).

Pooling variances:

$$s_x^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}{n-1} \quad\text{and}\quad (n-1)s_x^2 = \sum_{i=1}^{n}(x_i - \bar{x}_n)^2$$

$$s_y^2 = \frac{\sum_{i=1}^{m}(y_i - \bar{y}_m)^2}{m-1} \quad\text{and}\quad (m-1)s_y^2 = \sum_{i=1}^{m}(y_i - \bar{y}_m)^2$$

$$s_p^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2 + \sum_{i=1}^{m}(y_i - \bar{y}_m)^2}{n+m-2}$$

$$s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2}$$

The denominator, n+m-2, is the degrees of freedom!
Estimated standard error (using pooled variance estimate)

$$\sigma_{\bar{x}-\bar{y}} \approx \sqrt{\frac{s_p^2}{n} + \frac{s_p^2}{m}}$$

where:

$$s_p^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2 + \sum_{i=1}^{m}(y_i - \bar{y}_m)^2}{n+m-2}$$

The degrees of freedom are n+m-2.
Case 2: ttest, pooled variances

$$T = \frac{\bar{X}_n - \bar{Y}_m}{\sqrt{\frac{s_p^2}{n} + \frac{s_p^2}{m}}} \sim t_{n+m-2}$$

$$s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2}$$
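Alternate calculation formula: ttest, pooled variance

$$T = \frac{\bar{X}_n - \bar{Y}_m}{s_p\sqrt{\frac{m+n}{mn}}} \sim t_{n+m-2}$$

because

$$\frac{s_p^2}{m} + \frac{s_p^2}{n} = s_p^2\left(\frac{1}{m}+\frac{1}{n}\right) = s_p^2\left(\frac{n}{mn}+\frac{m}{mn}\right) = s_p^2\left(\frac{n+m}{mn}\right)$$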
Pooled vs. unpooled variance
Rule of Thumb: Use pooled unless you have a
reason not to.
Pooled gives you more degrees of freedom.
Pooled has extra assumption: variances are
equal between the two groups.
SAS automatically tests this assumption for you
(Equality of Variances test). If p<.05, this
suggests unequal variances, and better to use
unpooled ttest.

Example: two-sample t-test
In 1980, some researchers reported that men have more mathematical ability than women as evidenced by the 1979 SATs, where a sample of 30 random male adolescents had a mean score ± 1 standard deviation of 436 ± 77 and 30 random female adolescents scored lower: 416 ± 81 (genders were similar in educational backgrounds, socio-economic status, and age). Do you agree with the authors' conclusions?
Data Summary

                   n    Sample Mean   Sample Standard Deviation
Group 1: women    30       416                 81
Group 2: men      30       436                 77
Two-sample t-test
1. Define your hypotheses (null, alternative)
H0: M - F mean math SAT = 0
Ha: M - F mean math SAT ≠ 0 [two-sided]

Two-sample t-test
2. Specify your null distribution:
F and M have similar standard deviations/variances, so make a pooled estimate of variance.

$$s_p^2 = \frac{(n-1)s_m^2 + (m-1)s_f^2}{n+m-2} = \frac{(29)77^2 + (29)81^2}{58} = 6245$$

$$\bar{M}_{30} - \bar{F}_{30} \sim T_{58}\left(0,\ \sqrt{\frac{6245}{30} + \frac{6245}{30}}\right)$$

$$\sqrt{\frac{6245}{30} + \frac{6245}{30}} = 20.4$$
Two-sample t-test
3. Observed difference in our experiment = 20
points

Two-sample t-test
4. Calculate the p-value of what you observed

$$T_{58} = \frac{20 - 0}{20.4} = .98$$

data _null_;
pval=(1-probt(.98, 58))*2;
put pval;
run;
0.3311563454
5. Do not reject null! No evidence that men are better
in math ;)
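
The SAT example can be reproduced from the summary statistics alone. The sketch below is in Python rather than SAS, and both function names are hypothetical. Because the Python standard library has no t-distribution function, the two-sided p-value is obtained here by numerically integrating the t density with Simpson's rule; that is a swapped-in technique for illustration, not what SAS's `probt` does internally.

```python
import math

def pooled_t_from_summary(mean_x, sd_x, n, mean_y, sd_y, m):
    """Pooled variance, standard error, t statistic, and df from summary statistics."""
    pooled_var = ((n - 1) * sd_x**2 + (m - 1) * sd_y**2) / (n + m - 2)
    se = math.sqrt(pooled_var / n + pooled_var / m)
    return pooled_var, se, (mean_x - mean_y) / se, n + m - 2

def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the area of the t density between -|t| and |t|."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    a, b = -abs(t), abs(t)
    h = (b - a) / steps                      # steps must be even for Simpson's rule
    total = pdf(a) + pdf(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(a + i * h)
    return 1 - total * h / 3

pooled_var, se, t_stat, df = pooled_t_from_summary(436, 77, 30, 416, 81, 30)
p_value = t_two_sided_p(t_stat, df)
print(pooled_var, round(se, 1), round(t_stat, 2), df, round(p_value, 3))
```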
Example 2: Difference in means
Example: Rosenthal, R. and Jacobson, L. (1966) Teachers' expectancies: Determinants of pupils' I.Q. gains. Psychological Reports, 19, 115-118.
The Experiment
(note: exact numbers have been altered)
Grade 3 students at Oak School were given an IQ test at the beginning of the academic year (n=90).
Classroom teachers were given a list of names of students in their classes who had supposedly scored in the top 20 percent; these students were identified as "academic bloomers" (n=18).
BUT: the children on the teachers' lists had actually been randomly assigned to the list.
At the end of the year, the same I.Q. test was re-administered.


Example 2
Statistical question: Do students in the
treatment group have more improvement
in IQ than students in the control group?

What will we actually compare?
One-year change in IQ score in the treatment
group vs. one-year change in IQ score in the
control group.
Results:

                        "Academic bloomers" (n=18)   Controls (n=72)
Change in IQ score:            12.2 (2.0)               8.2 (2.0)

Difference = 12.2 points - 8.2 points = 4 points
The standard deviation
of change scores was
2.0 in both groups. This
affects statistical
significance
What does a 4-point
difference mean?
Before we perform any formal statistical
analysis on these data, we already have
a lot of information.
Look at the basic numbers first; THEN
consider statistical significance as a
secondary guide.
Is the association statistically
significant?
This 4-point difference could reflect a
true effect or it could be a fluke.
The question: is a 4-point difference
bigger or smaller than the expected
sampling variability?
Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no difference between "academic bloomers" and normal students (= the difference is 0 points).
Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true.
These predictions can be made by mathematical theory or by computer simulation.
Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true; math theory:

$$s_p^2 = 4.0$$

$$\overline{\text{"gifted"}} - \overline{\text{control}} \sim T_{88}\left(0,\ \sqrt{\frac{4}{18} + \frac{4}{72}} \approx 0.52\right)$$

Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true; computer simulation:
In computer simulation, you simulate taking repeated samples of the same size from the same population and observe the sampling variability.
I used computer simulation to take 1000 samples of 18 treated and 72 controls.
Computer Simulation Results
Standard error is
about 0.52
3. Empirical data
Observed difference in our experiment =
12.2-8.2 = 4.0

4. P-value

$$t_{88} = \frac{12.2 - 8.2}{.52} = \frac{4}{.52} \approx 8$$

p-value <.0001

A t-curve with 88 df has slightly wider cut-offs for 95% area (t=1.99) than a normal curve (Z=1.96).

If we ran this study 1000 times, we wouldn't expect to get even 1 result as big as a difference of 4 (under the null hypothesis).
5. Reject null!
Conclusion: I.Q. scores can bias expectancies in the teachers' minds and cause them to unintentionally treat "bright" students differently from those seen as "less bright."

Confidence interval (more information!!)
95% CI for the difference: 4.0 ± 1.99(.52) = (3.0, 5.0)

A t-curve with 88 df has slightly wider cut-offs for 95% area (t=1.99) than a normal curve (Z=1.96).
What if our standard deviation had been higher?
The standard deviation for change scores in treatment and control were each 2.0. What if change scores had been much more variable, say a standard deviation of 10.0 (for both)?

Std. dev. in change scores = 2.0: standard error is 0.52
Std. dev. in change scores = 10.0: standard error is 2.58

With a std. dev. of 10.0: LESS STATISTICAL POWER!
Standard error is 2.58. If we ran this study 1000 times, we would expect to get ≥ +4.0 or ≤ -4.0 about 12% of the time.
P-value=.12
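
The computer-simulation route to the standard error and p-value can be sketched as follows. This is Python, not the program used for the slides; it assumes normally distributed change scores with a common mean under the null, and the function names and seed are hypothetical. Simulated values will differ slightly from run to run and from the figures quoted above.

```python
import random
import statistics

def simulate_null_differences(n_treated=18, n_control=72, sd=2.0,
                              n_sims=1000, seed=3):
    """Differences in mean change score when both groups share one population."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_sims):
        treated = [rng.gauss(0.0, sd) for _ in range(n_treated)]
        control = [rng.gauss(0.0, sd) for _ in range(n_control)]
        diffs.append(statistics.fmean(treated) - statistics.fmean(control))
    return diffs

def empirical_two_sided_p(diffs, observed):
    """Share of simulated differences at least as extreme as the observed one."""
    return sum(abs(d) >= abs(observed) for d in diffs) / len(diffs)

for sd in (2.0, 10.0):
    diffs = simulate_null_differences(sd=sd)
    print(f"SD={sd}: SE ~ {statistics.stdev(diffs):.2f}, "
          f"p ~ {empirical_two_sided_p(diffs, 4.0):.3f}")
```

The same 4-point difference is far out in the tail when the SD is 2.0 but well inside the bulk of the null distribution when the SD is 10.0.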
Don't forget: The paired T-test
Did the control group in the previous experiment improve at all during the year?
Do not apply a two-sample ttest to answer this question!
After-Before yields a single sample of differences: a within-group rather than a between-group comparison.
Did the control group in the previous experiment improve at all during the year?

Data Summary

                    n    Sample Mean   Sample Standard Deviation
Group 1: Change    72       +8.2                2.0

$$t_{71} = \frac{8.2 - 0}{\sqrt{\frac{2^2}{72}}} = \frac{8.2}{0.236} \approx 35$$

p-value <.0001
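
The paired (one-sample) t statistic needs only the mean and standard deviation of the change scores. A minimal Python sketch, with a hypothetical function name, using the control-group summary statistics given above (n=72, mean change +8.2, SD 2.0):

```python
import math

def one_sample_t(mean_change, sd_change, n, null_value=0.0):
    """Standard error, t statistic, and df for a single sample of differences."""
    se = sd_change / math.sqrt(n)           # SE of the mean change
    t = (mean_change - null_value) / se
    return se, t, n - 1

se, t_stat, df = one_sample_t(8.2, 2.0, 72)
print(round(se, 3), round(t_stat, 1), df)
```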
Normality assumption of ttest
If the distribution of the trait is normal, fine to use a t-test.
But if the underlying distribution is not normal and the sample size is small, the Central Limit Theorem has not yet kicked in, and you cannot use a ttest. Rule of thumb for a large-enough sample: n>30 per group if not too skewed; n>100 if the distribution is really skewed.
Note: ttest is very robust against the normality assumption!
Alternative tests when normality
is violated: Non-parametric tests
Non-parametric tests

t-tests require your outcome variable
to be normally distributed (or close
enough), for small samples.
Non-parametric tests are based on
RANKS instead of means and standard
deviations (=population parameters).
Example: non-parametric tests
10 dieters following Atkins diet vs. 10 dieters following
Jenny Craig

Hypothetical RESULTS:
Atkins group loses an average of 34.5 lbs.

J. Craig group loses an average of 18.5 lbs.

Conclusion: Atkins is better?


Example: non-parametric tests
BUT, take a closer look at the individual data

Atkins, change in weight (lbs):
+4, +3, 0, -3, -4, -5, -11, -14, -15, -300

J. Craig, change in weight (lbs)
-8, -10, -12, -16, -18, -20, -21, -24, -26, -30


[Figure: two histograms of weight change (x-axis: weight change in lbs; y-axis: percent). Jenny Craig: values between -30 and -5. Atkins: values near 0 plus a single outlier at -300.]
t-test inappropriate
Comparing the mean weight loss of the
two groups is not appropriate here.
The distributions do not appear to be
normally distributed.
Moreover, there is an extreme outlier
(this outlier influences the mean a great
deal).

Wilcoxon rank-sum test
RANK the values, 1 being the least weight
loss and 20 being the most weight loss.
Atkins
+4, +3, 0, -3, -4, -5, -11, -14, -15, -300
1, 2, 3, 4, 5, 6, 9, 11, 12, 20
J. Craig
-8, -10, -12, -16, -18, -20, -21, -24, -26, -30
7, 8, 10, 13, 14, 15, 16, 17, 18, 19

Wilcoxon rank-sum test
Sum of Atkins ranks:
1 + 2 + 3 + 4 + 5 + 6 + 9 + 11 + 12 + 20 = 73
Sum of Jenny Craig's ranks:
7 + 8 + 10 + 13 + 14 + 15 + 16 + 17 + 18 + 19 = 137

Jenny Craig clearly ranked higher!
P-value *(from computer) = .018
*For details of the statistical test, see appendix of these slides
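
The ranking and rank sums can be reproduced in a few lines. This Python sketch assumes there are no tied values (true for these data) and uses the U statistic and normal approximation described in the appendix of these slides; the function name is hypothetical. The normal approximation will not exactly match the computer p-value of .018 quoted above, which may come from an exact or continuity-corrected calculation.

```python
import math
from statistics import NormalDist

atkins = [4, 3, 0, -3, -4, -5, -11, -14, -15, -300]
jenny_craig = [-8, -10, -12, -16, -18, -20, -21, -24, -26, -30]

def rank_sums(group_a, group_b):
    """Rank the pooled values (rank 1 = least weight loss) and sum ranks per group."""
    pooled = sorted(group_a + group_b, reverse=True)
    # Assumption: no ties, so each value maps to exactly one rank.
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    return sum(rank[v] for v in group_a), sum(rank[v] for v in group_b)

t_atkins, t_jc = rank_sums(atkins, jenny_craig)
n1, n2 = len(atkins), len(jenny_craig)
u_atkins = n1 * n2 + n1 * (n1 + 1) / 2 - t_atkins
u_jc = n1 * n2 + n2 * (n2 + 1) / 2 - t_jc
u0 = min(u_atkins, u_jc)
# Normal approximation: U0 has mean n1*n2/2 and variance n1*n2*(n1+n2+1)/12.
z = (u0 - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
p_approx = 2 * NormalDist().cdf(z)   # z is negative because u0 is the smaller U
print(t_atkins, t_jc, u0, round(z, 2), round(p_approx, 3))
```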
Binary or categorical outcomes
(proportions)


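Difference in proportions (special case of chi-square test)

Standard error of a proportion =

$$\sqrt{\frac{p(1-p)}{n}}$$

Standard error of the difference of two proportions =

$$\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \quad\text{or}\quad \sqrt{\frac{\bar{p}(1-\bar{p})}{n_1} + \frac{\bar{p}(1-\bar{p})}{n_2}}, \quad\text{where } \bar{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1+n_2}$$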
Null distribution of a difference in proportions
Standard error can be estimated by (still normally distributed):

$$\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Analogous to pooled variance in the ttest.
The variance of a difference is the sum of variances (as with difference in means).
Null distribution of a difference in proportions

Difference of proportions:

$$\hat{p}_1 - \hat{p}_2 \sim N\left(p_1 - p_2,\ \sqrt{\frac{\bar{p}(1-\bar{p})}{n_1} + \frac{\bar{p}(1-\bar{p})}{n_2}}\right)$$
Difference in proportions test

Null hypothesis: The difference in proportions is 0.

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{\bar{p}(1-\bar{p})}{n_1} + \frac{\bar{p}(1-\bar{p})}{n_2}}}$$

$$\bar{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1+n_2} \quad\text{(just the average proportion)}$$

$\hat{p}_1$ = proportion in group 1
$\hat{p}_2$ = proportion in group 2
$n_1$ = number in group 1
$n_2$ = number in group 2

Recall, variance of a proportion is p(1-p)/n.
Use the average (or pooled) proportion in the standard error formula, because under the null hypothesis, groups have equal proportions.
Z follows a normal distribution because the binomial can be approximated with a normal.
Recall case-control example:

                  Smoker (E)   Non-smoker (~E)   Total
Stroke (D)            15             35            50
No Stroke (~D)         8             42            50

Absolute risk: Difference in proportions exposed

$$P(E/D) - P(E/{\sim}D) = 15/50 - 8/50 = 30\% - 16\% = 14\%$$

Difference in proportions exposed

$$Z = \frac{14\% - 0\%}{\sqrt{\frac{.23 \times .77}{50} + \frac{.23 \times .77}{50}}} = \frac{.14}{.084} = 1.67$$

$$95\%\ \text{CI}: 0.14 \pm 1.96 \times .084 = -0.03 \text{ to } .31$$
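
The difference-in-proportions test can be packaged as a small function. This is a sketch in Python with a hypothetical function name; it uses the pooled proportion for the standard error in both the Z statistic and the interval, which matches the .084 used in the calculation above.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2, z_crit=1.96):
    """Z test and 95% CI for p1 - p2 using the pooled (average) proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)                 # pooled proportion under the null
    se = math.sqrt(p_bar * (1 - p_bar) / n1 + p_bar * (1 - p_bar) / n2)
    diff = p1 - p2
    z = diff / se
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, se, z, p_two_sided, (diff - z_crit * se, diff + z_crit * se)

# 15 of 50 cases and 8 of 50 controls were smokers.
diff, se, z, p, ci = two_proportion_z(15, 50, 8, 50)
print(round(diff, 2), round(se, 3), round(z, 2), round(p, 3), ci)
```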
Example 2: Difference in proportions
Research Question: Are antidepressants a risk factor for suicide attempts in children and adolescents?
Example modified from: "Antidepressant Drug Therapy and Suicide in Severely Depressed Children and Adults"; Olfson et al. Arch Gen Psychiatry. 2006;63:865-872.
Example 2: Difference in
Proportions
Design: Case-control study
Methods: Researchers used Medicaid records
to compare prescription histories between
263 children and teenagers (6-18 years) who
had attempted suicide and 1241 controls who
had never attempted suicide (all subjects
suffered from depression).
Statistical question: Is a history of use of
antidepressants more common among cases
than controls?
Example 2
Statistical question: Is a history of use of antidepressants more common among suicide-attempt cases than controls?

What will we actually compare?
Proportion of cases who used antidepressants in the past vs. proportion of controls who did

Results

                                No. (%) of cases (n=263)   No. (%) of controls (n=1241)
Any antidepressant drug ever          120 (46%)                   448 (36%)

Difference = 46% - 36% = 10%
Is the association statistically
significant?
This 10% difference could reflect a true
association or it could be a fluke in this
particular sample.
The question: is 10% bigger or smaller
than the expected sampling variability?
Hypothesis testing
Null hypothesis: There is no association
between antidepressant use and suicide
attempts in the target population (= the
difference is 0%)
Step 1: Assume the null hypothesis.
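Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true

$$\hat{p}_{cases} - \hat{p}_{controls} \sim N\left(0,\ \sigma = \sqrt{\frac{\frac{568}{1504}\left(1-\frac{568}{1504}\right)}{263} + \frac{\frac{568}{1504}\left(1-\frac{568}{1504}\right)}{1241}} = .033\right)$$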
Also: Computer Simulation Results
Standard error is
about 3.3%
Hypothesis Testing
Step 3: Do an experiment
We observed a difference of 10% between
cases and controls.
Hypothesis Testing
Step 4: Calculate a p-value

$$Z = \frac{.10}{.033} = 3.0; \qquad p = .003$$
P-value from our simulation
When we ran this study 1000 times, we got 1 result as big or bigger than 10%.
We also got 3 results as small or smaller than -10%.
From our simulation, we estimate the p-value to be: 4/1000 or .004
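
A simulation of this kind might look like the sketch below. It is written in Python and is not the program behind the slides; the function name and seed are hypothetical, and the counts of extreme results will vary with the random seed, so they will not necessarily equal the 4 in 1000 reported above.

```python
import random
import statistics

def simulate_null_proportion_differences(n_cases=263, n_controls=1241,
                                         p_null=568 / 1504, n_sims=1000, seed=4):
    """Differences in sample proportions when cases and controls share p_null."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_sims):
        cases_exposed = sum(rng.random() < p_null for _ in range(n_cases))
        controls_exposed = sum(rng.random() < p_null for _ in range(n_controls))
        diffs.append(cases_exposed / n_cases - controls_exposed / n_controls)
    return diffs

diffs = simulate_null_proportion_differences()
se_sim = statistics.stdev(diffs)
# Two-sided: count results at least as far from 0 as the observed 10%.
p_sim = sum(abs(d) >= 0.10 for d in diffs) / len(diffs)
print(round(se_sim, 3), p_sim)
```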
Hypothesis Testing
Step 5: Reject or do not reject the null hypothesis.
Here we reject the null.
Alternative hypothesis: There is an association between antidepressant use and suicide in the target population.
What would a lack of statistical significance mean?
If this study had sampled only 50 cases and 50 controls, the sampling variability would have been much higher, as shown in this computer simulation:
263 cases and 1241 controls: standard error is about 3.3%.
50 cases and 50 controls: standard error is about 10%.

With only 50 cases and 50 controls
Standard error is about 10%. If we ran this study 1000 times, we would expect to get values of 10% or higher 170 times (or 17% of the time).

Two-tailed p-value = 17% x 2 = 34%
Practice problem

An August 2003 research article in
Developmental and Behavioral Pediatrics
reported the following about a sample of UK
kids: when given a choice of a non-branded
chocolate cereal vs. CoCo Pops, 97% (36) of
37 girls and 71% (27) of 38 boys preferred
the CoCo Pops. Is this evidence that girls are
more likely to choose brand-named products?

Answer
1. Hypotheses:
H0: $p_f - p_m = 0$
Ha: $p_f - p_m \neq 0$ [two-sided]

2. Null distribution of difference of two proportions:

$$\hat{p}_f - \hat{p}_m \sim N\left(0,\ \sigma = \sqrt{\frac{\frac{63}{75}\left(1-\frac{63}{75}\right)}{37} + \frac{\frac{63}{75}\left(1-\frac{63}{75}\right)}{38}} = \sqrt{\frac{.84(.16)}{37} + \frac{.84(.16)}{38}} = .085\right)$$

The null says the p's are equal, so estimate the standard error using the overall observed p.

3. Observed difference in our experiment = .97-.71 = .26

4. Calculate the p-value of what you observed:

$$Z = \frac{.26 - 0}{.085} = 3.06$$

data _null_;
pval=(1-probnorm(3.06))*2;
put pval;
run;
0.0022133699

5. p-value is sufficiently low for us to reject the null; there does appear to be a difference in gender preferences here.
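Key two-sample Hypothesis Tests

Test for H0: $\mu_x - \mu_y = 0$ ($\sigma^2$ unknown, but roughly equal):

$$t_{n-2} = \frac{\bar{x} - \bar{y}}{\sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}}; \qquad s_p^2 = \frac{(n_x-1)s_x^2 + (n_y-1)s_y^2}{n-2}, \quad n = n_x + n_y$$

Test for H0: $p_1 - p_2 = 0$:

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{(\bar{p})(1-\bar{p})}{n_1} + \frac{(\bar{p})(1-\bar{p})}{n_2}}}; \qquad \bar{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1+n_2}$$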
Corresponding confidence intervals

For a difference in means, 2 independent samples ($\sigma^2$'s unknown but roughly equal):

$$(\bar{x} - \bar{y}) \pm t_{n-2,\,\alpha/2} \times \sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}$$

For a difference in proportions, 2 independent samples:

$$(\hat{p}_1 - \hat{p}_2) \pm Z_{\alpha/2} \times \sqrt{\frac{(\bar{p})(1-\bar{p})}{n_1} + \frac{(\bar{p})(1-\bar{p})}{n_2}}$$
Appendix: details of rank-sum
test

Wilcoxon Rank-sum test

Rank all of the observations in order from 1 to n.
$T_1$ is the sum of the ranks from the smaller population ($n_1$).
$T_2$ is the sum of the ranks from the larger population ($n_2$).

$$U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - T_1$$

$$U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - T_2$$

$$U_0 = \min(U_1, U_2)$$

For $n_1 \geq 10$, $n_2 \geq 10$:

$$Z = \frac{U_0 - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}}$$

Otherwise, find $P(U \leq U_0)$ in Mann-Whitney U tables, with $n_2$ = the bigger of the 2 populations.
Example
For example, if team 1 and team 2 (two gymnastic
teams) are competing, and the judges rank all the
individuals in the competition, how can you tell if
team 1 has done significantly better than team 2 or
vice versa?

Answer
Intuition: under the null hypothesis of no difference between the two groups:
If $n_1 = n_2$, the sums $T_1$ and $T_2$ should be equal.
But if $n_1 \neq n_2$, then $T_2$ ($n_2$ = bigger group) should automatically be bigger. But how much bigger under the null?

For example, if team 1 has 3 people and team 2 has 10, we could rank all 13 participants from 1 to 13 on individual performance. If team 1 (X) and team 2 don't differ in talent, the ranks ought to be spread evenly among the two groups, e.g.

1 2 X 4 5 6 X 8 9 10 X 12 13 (exactly even distribution if team 1 ranks 3rd, 7th, and 11th)

$T_1$ = sum of ranks of group 1 (smaller)
$T_2$ = sum of ranks of group 2 (larger)

$$T_1 + T_2 = \sum_{i=1}^{n_1+n_2} i = \frac{(n_1+n_2)(n_1+n_2+1)}{2} = \frac{n_1^2 + 2n_1n_2 + n_2^2 + n_1 + n_2}{2} = \frac{n_1(n_1+1)}{2} + \frac{n_2(n_2+1)}{2} + n_1 n_2$$
Remember this?

Sum of within-group ranks for the smaller group:

$$\sum_{i=1}^{n_1} i = \frac{n_1(n_1+1)}{2}$$

Sum of within-group ranks for the larger group:

$$\sum_{i=1}^{n_2} i = \frac{n_2(n_2+1)}{2}$$

e.g., here:

$$T_1 + T_2 = \sum_{i=1}^{13} i = \frac{(13)(14)}{2} = 91 = 55 + 6 + 30$$

Take-home point:

$$T_1 + T_2 = \frac{n_1(n_1+1)}{2} + \frac{n_2(n_2+1)}{2} + n_1 n_2$$
$$\sum_{i=1}^{10} i = \frac{10(11)}{2} = 55, \qquad \sum_{i=1}^{3} i = \frac{3(4)}{2} = 6, \qquad 55 - 6 = 49$$

T1 = 3 + 7 + 11 = 21
T2 = 1 + 2 + 4 + 5 + 6 + 8 + 9 + 10 + 12 + 13 = 70
70 - 21 = 49. Magic!
The difference between the sums of the ranks within each individual group is 49.
The difference between the sums of the ranks of the two groups is also equal to 49 if ranks are evenly interspersed (null is true).
It turns out that, if the null hypothesis is true, the difference between the larger-group and the smaller-group sums of within-group ranks is exactly equal to the difference between $T_2$ and $T_1$.

Under the null,

$$T_2 - T_1 = \frac{n_2(n_2+1)}{2} - \frac{n_1(n_1+1)}{2}$$
Sum of all ranks:

$$T_2 + T_1 = \frac{n_2(n_2+1)}{2} + \frac{n_1(n_1+1)}{2} + n_1 n_2$$

Difference, if the null is true:

$$T_2 - T_1 = \frac{n_2(n_2+1)}{2} - \frac{n_1(n_1+1)}{2}$$

Adding and subtracting these two equations gives, under the null:

$$T_2 = \frac{n_2(n_2+1)}{2} + \frac{n_1 n_2}{2}, \qquad T_1 = \frac{n_1(n_1+1)}{2} + \frac{n_1 n_2}{2}$$

Define new statistics:

$$U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - T_2$$

$$U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - T_1$$

Their sum should equal $n_1 n_2$.

Here, under null:
U2 = 55 + 30 - 70 = 15
U1 = 6 + 30 - 21 = 15
U2 + U1 = 30
Under the null hypothesis, $U_1$ should equal $U_2$:

$$E(U_2 - U_1) = E\left[\left(\frac{n_2(n_2+1)}{2} - \frac{n_1(n_1+1)}{2}\right) - (T_2 - T_1)\right] = 0$$

The U's should be equal to each other, and each will equal $n_1 n_2/2$:

$U_1 + U_2 = n_1 n_2$
Under the null hypothesis, $U_1 = U_2 = U_0$
$E(U_1 + U_2) = 2E(U_0) = n_1 n_2$
$E(U_1) = E(U_2) = E(U_0) = n_1 n_2 / 2$

So, the test statistic here is not quite the difference in the sum-of-ranks of the 2 groups.
It's the smaller observed U value: $U_0$.

For small n's, take $U_0$, and get the p-value directly from a U table.

For large enough n's (>10 per group):

$$Z = \frac{U_0 - E(U_0)}{\sqrt{Var(U_0)}} = \frac{U_0 - \frac{n_1 n_2}{2}}{\sqrt{Var(U_0)}}$$

$$E(U_0) = \frac{n_1 n_2}{2}$$

$$Var(U_0) = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12}$$
Add observed data to the example
Example: If the girls on the two gymnastics teams were ranked as follows:
Team 1: 1, 5, 7; observed T1 = 13
Team 2: 2, 3, 4, 6, 8, 9, 10, 11, 12, 13; observed T2 = 78

Are the teams significantly different?
Total sum of ranks = 13*14/2 = 91; n1n2 = 3*10 = 30

Under the null hypothesis: expect U1 - U2 = 0 and U1 + U2 = 30 (each should equal about 15 under the null) and U0 = 15

U1 = 30 + 6 - 13 = 23
U2 = 30 + 55 - 78 = 7
U0 = 7

Not quite statistically significant in U table: p=.1084 (see attached); x2 for two-tailed test.


Example problem 2
A study was done to compare the Atkins Diet (low-carb) vs. Jenny Craig
(low-cal, low-fat). The following weight changes were obtained; note
they are very skewed because someone lost 100 pounds; the mean loss
for Atkins is going to look higher because of the bozo, but does that
mean the diet is better overall? Conduct a Mann-Whitney U test to
compare ranks.


Atkins    Jenny Craig
-100        -11
  -8        -15
  -4         -5
  +5         +6
  +8        -20
  +2





Answer
Corresponding ranks (lower is more weight loss!):

Atkins    Jenny Craig
   1          4
   5          3
   7          6
   9         10
  11          2
   8

Sum of ranks for JC = 25 (n=5)
Sum of ranks for Atkins = 41 (n=6)

n1n2 = 5*6 = 30

Under the null hypothesis: expect U1 - U2 = 0 and U1 + U2 = 30 and U0 = 15

U1 = 30 + 15 - 25 = 20
U2 = 30 + 21 - 41 = 10

U0 = 10; n1=5, n2=6
Go to Mann-Whitney chart: p = .2143 x 2 = .43
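
In place of a printed Mann-Whitney table, the one-tailed table probability P(U ≤ U0) can be obtained for small groups by listing every possible assignment of ranks to the smaller group. The sketch below is in Python with hypothetical function names; it assumes no tied ranks and is only practical for small samples, since the number of assignments grows quickly.

```python
from itertools import combinations

def u_statistics(ranks_small, ranks_large):
    """U1 and U2 as defined above, from the observed ranks of each group."""
    n1, n2 = len(ranks_small), len(ranks_large)
    u1 = n1 * n2 + n1 * (n1 + 1) // 2 - sum(ranks_small)
    u2 = n1 * n2 + n2 * (n2 + 1) // 2 - sum(ranks_large)
    return u1, u2

def exact_one_tailed_p(ranks_small, ranks_large):
    """P(U <= U0) under the null, enumerating all rank sets for the smaller group."""
    n1, n2 = len(ranks_small), len(ranks_large)
    u0 = min(u_statistics(ranks_small, ranks_large))
    hits = total = 0
    for subset in combinations(range(1, n1 + n2 + 1), n1):
        u1 = n1 * n2 + n1 * (n1 + 1) // 2 - sum(subset)
        hits += u1 <= u0          # each rank set is equally likely under the null
        total += 1
    return hits / total

# Gymnastics example: team 1 ranked 1, 5, 7.
team1, team2 = [1, 5, 7], [2, 3, 4, 6, 8, 9, 10, 11, 12, 13]
print(u_statistics(team1, team2), round(exact_one_tailed_p(team1, team2), 4))

# Diet example: Jenny Craig (n=5) is the smaller group.
jc, atkins = [4, 3, 6, 10, 2], [1, 5, 7, 9, 11, 8]
print(u_statistics(jc, atkins), round(exact_one_tailed_p(jc, atkins), 4))
```

Doubling either probability gives the two-tailed p-value, as in the worked answers.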