
Statistical Inference I:
Hypothesis Testing and Sample Size
Statistics Primer
Statistical Inference
Hypothesis testing
P-values
Type I error
Type II error
Statistical power
Sample size calculations
What is a statistic?
A statistic is any value that can be
calculated from the sample data.
Sample statistics are calculated to give
us an idea about the larger population.
Examples of statistics:
mean
The average cost of a gallon of gas in the US is
$2.65.
difference in means
The difference in the average gas price in Los
Angeles ($2.96) compared with Des Moines, Iowa
($2.32) is 64 cents.
proportion
67% of high school students in the U.S. exercise regularly.
difference in proportions
The difference in the proportion of Republicans who approve of George W. Bush (66%) and Democrats who do (11%) is 55%.
What is a statistic?
Sample statistics are estimates of
population parameters.
Sample statistics estimate population parameters:

Truth (not observable): mean IQ of some population of 100,000 people = 100.
Sample (observation): mean IQ of 5 subjects (110, 105, 96, 124, 115) = 110.
From the sample, we make guesses about the whole population.
Sampling Distributions
Most experiments are one-shot deals. So, how do we know if
an observed effect from a single experiment is real or is just an
artifact of sampling variability (chance variation)?

Answering this requires a priori knowledge about how sampling variability works.

Question: Why have I made you learn about probability distributions and how to calculate and manipulate expected value and variance?
Answer: Because they form the basis of describing the distribution of a sample statistic.
What is sampling variation?
Statistics vary from sample to sample due to
random chance.

Example:
A population of 100,000 people has an
average IQ of 100 (If you actually could
measure them all!)
If you sample 5 random people from this
population, what will you get?
Sampling Variation

Truth (not observable): mean IQ = 100.
Each random sample of 5 people gives a different sample mean, for example:
120, 160, 180, 95, 95 → mean = 130
90, 85, 95, 92, 88 → mean = 90
100, 105, 86, 104, 95 → mean = 98
110, 105, 96, 124, 115 → mean = 110
Sampling Variation and
Sample Size
Do you expect more or less sampling
variability in samples of 10 people?
Of 50 people?
Of 1000 people?
Of 100,000 people?
Standard error
Standard error is the standard deviation
of a sample statistic.
It's a measure of sampling variability.
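To make standard error concrete, here is a minimal simulation sketch (my own illustration, not from the slides; a population of 100,000 people with mean IQ 100 and SD 15 is assumed). It shows the standard error of the sample mean shrinking as sample size grows:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical population: 100,000 people, mean IQ 100, SD 15 (assumed)
population = rng.normal(loc=100, scale=15, size=100_000)

for n in [5, 10, 50, 1000]:
    # Draw 1000 random samples of size n and record each sample's mean
    means = [rng.choice(population, size=n).mean() for _ in range(1000)]
    # The SD of the sample means is the (simulated) standard error
    print(f"n={n:>4}: standard error ~ {np.std(means):.2f}")
```

With n = 5 the simulated standard error is about 6.7 IQ points (15/√5); with n = 1000 it falls below 0.5.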
What is statistical inference?
The field of statistics provides guidance
on how to make conclusions in the face
of this chance variation.
Example 1: Difference in
proportions
Research Question: Are antidepressants
a risk factor for suicide attempts in
children and adolescents?

Example modified from: Olfson M, et al. Antidepressant drug therapy and suicide in severely depressed children and adults. Arch Gen Psychiatry. 2006;63:865-872.
Example 1
Design: Case-control study
Methods: Researchers used Medicaid records
to compare prescription histories between
263 children and teenagers (6-18 years) who
had attempted suicide and 1241 controls who
had never attempted suicide (all subjects
suffered from depression).
Statistical question: Is a history of use of
antidepressants more common among cases
than controls?
Example 1
Statistical question: Is a history of use of antidepressants more common among cases than controls?

What will we actually compare?
The proportion of cases who used antidepressants in the past vs. the proportion of controls who did.
Results:

                                 Cases (n=263)   Controls (n=1241)
Any antidepressant drug, ever    120 (46%)       448 (36%)

Difference = 46% − 36% = 10%
What does a 10% difference
mean?
Before we perform any formal statistical
analysis on these data, we already have
a lot of information.
Look at the basic numbers first; THEN
consider statistical significance as a
secondary guide.
Is the association statistically
significant?
This 10% difference could reflect a true
association or it could be a fluke in this
particular sample.
The question: is 10% bigger or smaller
than the expected sampling variability?
What is hypothesis testing?
Statisticians try to answer this question
with a formal hypothesis test
Hypothesis testing
Step 1: Assume the null hypothesis.

Null hypothesis: There is no association between antidepressant use and suicide attempts in the target population (i.e., the difference is 0%).
Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true (math theory):

The standard error of the difference in two proportions is:

$$SE = \sqrt{\frac{\bar{p}(1-\bar{p})}{n_1} + \frac{\bar{p}(1-\bar{p})}{n_2}}$$

Here, with pooled proportion $\bar{p} = 568/1504$:

$$SE = \sqrt{\frac{\frac{568}{1504}\left(1-\frac{568}{1504}\right)}{263} + \frac{\frac{568}{1504}\left(1-\frac{568}{1504}\right)}{1241}} \approx 0.033$$

We expect to see differences between the groups as big as about 6% (2 standard errors) just by chance.
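As a quick check, here is a sketch of the same pooled standard-error calculation (the counts 120/263 and 448/1241 come from the results table above):

```python
from math import sqrt

n1, n2 = 263, 1241
p_pooled = (120 + 448) / (n1 + n2)   # 568/1504, pooled under the null
se = sqrt(p_pooled * (1 - p_pooled) / n1 +
          p_pooled * (1 - p_pooled) / n2)
print(round(se, 3))                  # 0.033, i.e., about 3.3%
```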
Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true (computer simulation):

In a computer simulation, you simulate taking repeated samples of the same size from the same population and observe the sampling variability.
Here, I used computer simulation to take 1000 samples of 263 cases and 1241 controls.
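A sketch of what such a simulation might look like (the binomial sampling model, with both groups sharing the pooled null proportion 568/1504, is my assumption about the setup):

```python
import numpy as np

rng = np.random.default_rng(42)
p_null, n_cases, n_controls = 568 / 1504, 263, 1241

# 1000 simulated studies: difference in proportions when the null is true
diffs = (rng.binomial(n_cases, p_null, size=1000) / n_cases
         - rng.binomial(n_controls, p_null, size=1000) / n_controls)

print("standard error ~", round(float(diffs.std()), 3))  # about 0.033
print("p-value ~", (np.abs(diffs) >= 0.10).mean())       # about .003
```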
Computer Simulation Results

[Histogram of the 1000 simulated differences in proportions under the null hypothesis.]
Standard error: a measure of the variability of sample statistics. Here, the standard error is about 3.3%.
Hypothesis Testing
Step 3: Do an experiment

We observed a difference of 10% between cases and controls.
Hypothesis Testing
Step 4: Calculate a p-value

P-value = the probability of your data, or something more extreme, under the null hypothesis.
Hypothesis Testing
Step 4: Calculate a p-value (mathematical theory):

$$Z = \frac{0.10}{0.033} \approx 3.0; \qquad p = .003$$
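The same arithmetic in scipy, as a sketch (the slide rounds Z to 3.0; the unrounded Z gives a slightly smaller p):

```python
from scipy.stats import norm

z = 0.10 / 0.033          # observed difference / standard error
p = 2 * norm.sf(z)        # two-sided p-value
print(round(z, 2), round(p, 4))   # 3.03, 0.0024 (about .003 with z = 3.0)
```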
What is a P-value?

When we ran this study 1000 times, we got 1 result as big as or bigger than 10%. We also got 2 results as small as or smaller than −10%.
P-value

P-value = the probability of your data, or something more extreme, under the null hypothesis.
From our simulation, we estimate the p-value to be 3/1000, or .003.
Hypothesis Testing
Step 5: Reject or do not reject the null hypothesis.

Here we reject the null.
Alternative hypothesis: There is an association between antidepressant use and suicide in the target population.
What does a 10% difference
mean?
Is it statistically significant? YES
Is it clinically significant?
Is this a causal association?
What does a 10% difference
mean?
Is it statistically significant? YES
Is it clinically significant? MAYBE
Is this a causal association? MAYBE

Statistical significance does not necessarily imply clinical significance.
Statistical significance does not necessarily imply a cause-and-effect relationship.
What would a lack of
statistical significance mean?
If this study had sampled only 50 cases and 50 controls, the sampling variability would have been much higher, as shown in this computer simulation:
[Simulated sampling distributions:]
With 263 cases and 1241 controls, the standard error is about 3.3%.
With 50 cases and 50 controls, the standard error is about 10%.
With only 50 cases and 50 controls (standard error about 10%): if we ran this study 1000 times, we would expect to get values of 10% or higher 170 times (or 17% of the time).
Two-tailed p-value
Two-tailed p-value = 17% × 2 = 34%
What does a 10% difference
mean (50 cases/50 controls)?
Is it statistically significant? NO
Is it clinically significant? MAYBE
Is this a causal association? MAYBE

No evidence of an effect ≠ evidence of no effect.


Example 2: Difference in means
Example: Rosenthal, R. and Jacobson, L. (1966). Teachers' expectancies: Determinants of pupils' IQ gains. Psychological Reports, 19, 115-118.
The Experiment
(note: exact numbers have been altered)

Students in Grade 3 at Oak School were given an IQ test at the beginning of the academic year (n=90).
Classroom teachers were given a list of names of students in their classes who had supposedly scored in the top 20 percent; these students were identified as "academic bloomers" (n=18).
BUT: the children on the teachers' lists had actually been randomly assigned to the list.
At the end of the year, the same IQ test was re-administered.
Example 2
Statistical question: Do students in the
treatment group have more improvement
in IQ than students in the control group?

What will we actually compare?
One-year change in IQ score in the treatment group vs. one-year change in IQ score in the control group.
Note: The standard deviation of change scores was 2.0 in both groups. This affects statistical significance.

Results:

                       Academic bloomers (n=18)   Controls (n=72)
Change in IQ score     12.2 (2.0)                 8.2 (2.0)

Difference = 12.2 − 8.2 = 4 points
What does a 4-point
difference mean?
Before we perform any formal statistical
analysis on these data, we already have
a lot of information.
Look at the basic numbers first; THEN
consider statistical significance as a
secondary guide.
Is the association statistically
significant?
This 4-point difference could reflect a
true effect or it could be a fluke.
The question: is a 4-point difference
bigger or smaller than the expected
sampling variability?
Hypothesis testing
Step 1: Assume the null hypothesis.

Null hypothesis: There is no difference between academic bloomers and normal students (i.e., the difference in IQ change is 0).
Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true (math theory):

The standard error of the difference in two means is:

$$SE = \sqrt{\frac{s^2}{n_1} + \frac{s^2}{n_2}} = \sqrt{\frac{4}{18} + \frac{4}{72}} \approx 0.52$$

We expect to see differences between the groups as big as about 1.0 (2 standard errors) just by chance.
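The same calculation as a quick sketch:

```python
from math import sqrt

s, n1, n2 = 2.0, 18, 72
se = sqrt(s**2 / n1 + s**2 / n2)
print(round(se, 2))   # 0.53 (the slide rounds to 0.52)
```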
Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true (computer simulation):

In a computer simulation, you simulate taking repeated samples of the same size from the same population and observe the sampling variability.
Here, I used computer simulation to take 1000 samples of 18 treated subjects and 72 controls.
Computer Simulation Results

[Histogram of the 1000 simulated differences in means under the null hypothesis.]
Standard error: a measure of the variability of sample statistics. Here, the standard error is about 0.52.
Hypothesis Testing
Step 3: Do an experiment

We observed a difference of 4 points between treated and controls.
Hypothesis Testing
Step 4: Calculate a p-value

P-value = the probability of your data, or something more extreme, under the null hypothesis.
Hypothesis Testing
Step 4: Calculate a p-value (mathematical theory):

A t-curve with 88 df has slightly wider cut-offs for 95% area (t = 1.99) than a normal curve (Z = 1.96).

$$t_{88} = \frac{4}{0.52} \approx 8; \qquad p < .0001$$
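Checking the t calculation with scipy (a sketch; 88 df = 18 + 72 − 2):

```python
from scipy.stats import t

t_stat = 4 / 0.52
p = 2 * t.sf(t_stat, df=88)   # two-sided p-value
print(round(t_stat, 1), p)    # 7.7, far below .0001
```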
What is the P-value?

If we ran this study 1000 times, we wouldn't expect to get even 1 result as big as a difference of 4 (under the null hypothesis).
P-value

P-value = the probability of your data, or something more extreme, under the null hypothesis.
Here, p-value < .0001.
Hypothesis Testing
Step 5: Reject or do not reject the null hypothesis.

Here we reject the null.
Alternative hypothesis: There is an association between being labeled as gifted and subsequent academic achievement.
What does a 4-point
difference mean?
Is it statistically significant? YES
Is it clinically significant?
Is this a causal association?
What does a 4-point
difference mean?
Is it statistically significant? YES
Is it clinically significant? MAYBE
Is this a causal association? MAYBE

Statistical significance does not necessarily imply clinical significance.
Statistical significance does not necessarily imply a cause-and-effect relationship.
What if our standard deviation
had been higher?
The standard deviation for change scores in both treatment and control was 2.0. What if change scores had been much more variable, say, a standard deviation of 10.0?
[Simulated sampling distributions:]
With a std. dev. in change scores of 2.0, the standard error is 0.52.
With a std. dev. in change scores of 10.0, the standard error is 2.58.
With a std. dev. of 10.0 (standard error 2.58): if we ran this study 1000 times, we would expect to get +4.0 or −4.0 about 12% of the time. P-value = .12.
What would a 4.0 difference
mean (std. dev=10)?
Is it statistically significant? NO
Is it clinically significant? MAYBE
Is this a causal association? MAYBE

No evidence of an effect ≠ evidence of no effect.


Hypothesis Testing Summary
The Steps:
1. Define your hypotheses (null, alternative)
2. Specify your null distribution
3. Do an experiment
4. Calculate the p-value of what you observed
5. Reject or fail to reject the null hypothesis

Follows the logic: If A then B; not B; therefore, not A.


Hypothesis testing summary
Null hypothesis: the hypothesis of no effect (usually
the opposite of what you hope to prove). The straw
man you are trying to shoot down.
Example: antidepressants have no effect on suicide risk
P-value: the probability of your observed data if the
null hypothesis is true.
Example: The probability that the study would have found 10%
higher suicide attempts in the antidepressant group (compared
with control) if antidepressants had no effect (i.e., just by
chance).
If this probability is low enough (i.e., if our data are very
unlikely given the null hypothesis), this is evidence that the null
hypothesis is wrong.
If p-value is low enough (typically <.05), we reject the null
hypothesis and conclude that antidepressants do have an
effect.
Summary: The Underlying
Logic of hypothesis tests
Follows this logic:
Assume A.
If A, then B.
Not B.
Therefore, Not A.

But throw in a bit of uncertainty: If A, then probably B.
Error and power
Type I error rate (or significance level): the probability of finding an effect that isn't real (a false positive).
If we require p-value<.05 for statistical significance, this means
that 1/20 times we will find a positive result just by chance.
Type II error rate: the probability of missing an effect
(false negative).
Statistical power: the probability of finding an effect if
it is there (the probability of not making a type II
error).
When we design studies, we typically aim for a power of 80%
(allowing a false negative rate, or type II error rate, of 20%).
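As a sanity check on these definitions, here is a simulation sketch of the type I error rate (the two-sample t-test setup is my own choice of illustration): when the null is true and we reject at p < .05, we should be wrong about 5% of the time.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
false_positives = 0
for _ in range(2000):
    # Both groups come from the SAME population, so the null is true
    a = rng.normal(100, 15, size=30)
    b = rng.normal(100, 15, size=30)
    if ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1
print(false_positives / 2000)   # close to 0.05, the type I error rate
```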
Type I and Type II Error in a box

Your Statistical Decision   True state of null hypothesis
                            H0 True             H0 False
Reject H0                   Type I error (α)    Correct
Do not reject H0            Correct             Type II error (β)
Reminds me of... Pascal's Wager

                The TRUTH
Your Decision   God Exists             God Doesn't Exist
Reject God      BIG MISTAKE            Correct
Accept God      Correct, Big Pay Off   MINOR MISTAKE
Type I and Type II Error in a box

Your Statistical Decision   True state of null hypothesis
                            H0 True             H0 False
Reject H0                   Type I error (α)    Correct
Do not reject H0            Correct             Type II error (β)
Review Question 1
If we have a p-value of 0.03 and so decide that our effect is statistically significant, what is the probability that we're wrong (i.e., that the hypothesis test gave us a false positive)?

a. .03
b. .06
c. Cannot tell
d. 1.96
e. 95%
Review Question 1
If we have a p-value of 0.03 and so decide that our effect is statistically significant, what is the probability that we're wrong (i.e., that the hypothesis test gave us a false positive)?

a. .03
b. .06
c. Cannot tell (correct answer)
d. 1.96
e. 95%
Review Question 2
Standard error is:

a. For a given variable, its standard deviation divided by the square root of n.
b. A measure of the variability of a sample statistic.
c. The inverse of sample size.
d. A measure of the variability of a characteristic.
e. All of the above.
Review Question 2
Standard error is:

a. For a given variable, its standard deviation divided by the square root of n.
b. A measure of the variability of a sample statistic. (correct answer)
c. The inverse of sample size.
d. A measure of the variability of a characteristic.
e. All of the above.
Review Question 3
A randomized trial of two treatments for depression failed to show a statistically significant difference in improvement from depressive symptoms (p-value = .50). It follows that:

a. The treatments are equally effective.
b. Neither treatment is effective.
c. The study lacked sufficient power to detect a difference.
d. The null hypothesis should be rejected.
e. There is not enough evidence to reject the null hypothesis.
Review Question 3
A randomized trial of two treatments for depression failed to show a statistically significant difference in improvement from depressive symptoms (p-value = .50). It follows that:

a. The treatments are equally effective.
b. Neither treatment is effective.
c. The study lacked sufficient power to detect a difference.
d. The null hypothesis should be rejected.
e. There is not enough evidence to reject the null hypothesis. (correct answer)
Review Question 4
Following the introduction of a new treatment regime in a rehab facility, alcoholism cure rates increased. The proportion of successful outcomes in the two years following the change was significantly higher than in the preceding two years (p-value < .005). It follows that:

a. The improvement in treatment outcome is clinically important.
b. The new regime cannot be worse than the old treatment.
c. Assuming that there are no biases in the study method, the new treatment should be recommended in preference to the old.
d. All of the above.
e. None of the above.
Review Question 4
Following the introduction of a new treatment regime in a rehab facility, alcoholism cure rates increased. The proportion of successful outcomes in the two years following the change was significantly higher than in the preceding two years (p-value < .005). It follows that:

a. The improvement in treatment outcome is clinically important.
b. The new regime cannot be worse than the old treatment.
c. Assuming that there are no biases in the study method, the new treatment should be recommended in preference to the old.
d. All of the above.
e. None of the above. (correct answer)
Statistical Power
Statistical power is the probability of finding an effect if it's real.
Can we quantify how much power we have for given sample sizes?

Study 1: 263 cases, 1241 controls
Null distribution: difference = 0. For a 5% significance level, the one-tail area is 2.5% (Z_α/2 = 1.96), so the rejection region is any value ≥ 6.5 (0 + 3.3 × 1.96).
Clinically relevant alternative: difference = 10%.
Power = the chance of being in the rejection region if the alternative is true = the area of the alternative distribution beyond the critical value.
Study 1: 263 cases, 1241 controls
Rejection region: any value ≥ 6.5 (0 + 3.3 × 1.96).
Power = the chance of being in the rejection region if the alternative is true = the area of the alternative distribution beyond the critical value.
Power here is >80%.
Study 1: 50 cases, 50 controls
Critical value = 0 + 10 × 1.96 ≈ 20 (Z_α/2 = 1.96, 2.5% in the upper tail).
Power is closer to 20% now.
Study 2: 18 treated, 72 controls, std. dev. = 2
Critical value = 0 + 0.52 × 1.96 ≈ 1.
Clinically relevant alternative: difference = 4 points.
Power is nearly 100%!
Study 2: 18 treated, 72 controls, std. dev. = 10
Critical value = 0 + 2.59 × 1.96 ≈ 5.
Power is about 40%.
Study 2: 18 treated, 72 controls, effect size = 1.0
Critical value = 0 + 0.52 × 1.96 ≈ 1.
Clinically relevant alternative: difference = 1 point.
Power is about 50%.
Factors Affecting Power
1. Size of the effect
2. Standard deviation of the characteristic
3. Sample size
4. Desired significance level
1. Bigger difference from the null mean
[Figure: null vs. clinically relevant alternative distributions of average weight from samples of 100.]

2. Bigger standard deviation
[Figure: average weight from samples of 100.]

3. Bigger sample size
[Figure: average weight from samples of 100.]

4. Higher significance level
[Figure: rejection region shown; average weight from samples of 100.]
Sample size calculations
Based on these elements, you can write
a formal mathematical equation that
relates power, sample size, effect size,
standard deviation, and significance
level
Simple formula for difference in proportions:

$$n = \frac{2\,\bar{p}(1-\bar{p})\,(Z_\beta + Z_{\alpha/2})^2}{(p_1 - p_2)^2}$$

where:
n = sample size in each group (assumes equal-sized groups)
$\bar{p}(1-\bar{p})$ = a measure of variability (similar to standard deviation)
$p_1 - p_2$ = effect size (the difference in proportions)
$Z_\beta$ = the desired power (typically .84 for 80% power)
$Z_{\alpha/2}$ = the desired level of statistical significance (typically 1.96)
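Plugging in numbers as a sketch (using the antidepressant example's 46% vs. 36% as a hypothetical design target):

```python
p1, p2 = 0.46, 0.36              # hypothetical design target
p_bar = (p1 + p2) / 2
z_beta, z_alpha = 0.84, 1.96     # 80% power; 5% two-sided significance

n = 2 * p_bar * (1 - p_bar) * (z_beta + z_alpha)**2 / (p1 - p2)**2
print(round(n))                  # about 379 per group
```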
Simple formula for difference in means:

$$n = \frac{2\,\sigma^2\,(Z_\beta + Z_{\alpha/2})^2}{(\text{difference})^2}$$

where:
n = sample size in each group (assumes equal-sized groups)
$\sigma$ = standard deviation of the outcome variable
difference = effect size (the difference in means)
$Z_\beta$ = the desired power (typically .84 for 80% power)
$Z_{\alpha/2}$ = the desired level of statistical significance (typically 1.96)
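And the means formula, plugged into the teacher-expectancy example (SD = 2.0, detectable difference = 4 points; my choice of illustration):

```python
sigma, difference = 2.0, 4.0     # SD of change scores; detectable difference
z_beta, z_alpha = 0.84, 1.96     # 80% power; 5% two-sided significance

n = 2 * sigma**2 * (z_beta + z_alpha)**2 / difference**2
print(round(n, 1))               # about 3.9 per group
```

The tiny n reflects an assumed effect of 2 standard deviations; real studies usually chase much smaller effects.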
Sample size calculators on the
web
http://biostat.mc.vanderbilt.edu/twiki/bi
n/view/Main/PowerSampleSize
http://calculators.stat.ucla.edu
http://hedwig.mgh.harvard.edu/sample
_size/size.html
These sample size calculations are
idealized
They do not account for losses to follow-up (prospective studies)
They do not account for non-compliance (for
intervention trial or RCT)
They assume that individuals are independent
observations (not true in clustered designs)

Consult a statistician!
Review Question 5
Which of the following elements does not increase statistical power?

a. Increased sample size
b. Measuring the outcome variable more precisely
c. A significance level of .01 rather than .05
d. A larger effect size.
Review Question 5
Which of the following elements does not increase statistical power?

a. Increased sample size
b. Measuring the outcome variable more precisely
c. A significance level of .01 rather than .05 (correct answer)
d. A larger effect size.
Review Question 6
Most sample size calculators ask you to input a value for σ. What are they asking for?

a. The standard error
b. The standard deviation
c. The standard error of the difference
d. The coefficient of variation
e. The variance
Review Question 6
Most sample size calculators ask you to input a value for σ. What are they asking for?

a. The standard error
b. The standard deviation (correct answer)
c. The standard error of the difference
d. The coefficient of variation
e. The variance
Review Question 7
For your RCT, you want 80% power to detect a
reduction of 10 points or more in the
treatment group relative to placebo. What is
10 in your sample size formula?

a. Standard deviation
b. Mean change
c. Effect size
d. Standard error
e. Significance level
Review Question 7
For your RCT, you want 80% power to detect a reduction of 10 points or more in the treatment group relative to placebo. What is 10 in your sample size formula?

a. Standard deviation
b. Mean change
c. Effect size (correct answer)
d. Standard error
e. Significance level
Homework
Problem Set 3
Continue reading textbook
Journal article
