
Biostatistics Notes and Exercises

Jeff Longmate
Department of Biostatistics
ext. 62478
February 5, 2014

Preface

0.1 Course Organization

These are lecture notes and problem sets for a seven-week course on
biostatistics at the City of Hope's Irell and Manella Graduate School of
Biological Science.
Instructor: Jeff Longmate (ext 62478)
Meetings: Mondays, Wednesdays, and Fridays, 10:45 am. Fridays will be
used for problem set discussion, computing exercises, and exams.
Evaluations: 60% P-sets; 20% Mid-term exam; 20% Final exam.
Texts:

1. Lecture notes & handouts. (All that's really needed)


2. Statistics, 3rd edition; Freedman, Pisani & Purves. On reserve
at Lee Graff Library. Some optional readings from this book will
be given.

Course Website:
http://www.infosci.coh.org/jal/class/index.html
Computing tools:
1. Excel (or another spreadsheet)
2. GraphPad/Prism (ITS can install)
3. R & R Commander (to be self-installed in tutorial)


Contents

Preface
  0.1 Course Organization

1 About Statistics
  1.1 Comments About Statistics
  1.2 First Example: A Bioassay for Stem Cells
    1.2.1 The Model
    1.2.2 Perspective
  1.3 The taxi problem
    1.3.1 Estimators
    1.3.2 Simulation of estimator performance
  1.4 Philosophical Bearings
    1.4.1 Science as selection
    1.4.2 Examples of hypothesis rejection
    1.4.3 A broad model
  1.5 The notion of population
  1.6 Homework: A Computing Tutorial
    1.6.1 Creating a Data File in Excel
    1.6.2 Installation of R
    1.6.3 Trying it out
    1.6.4 Data Analysis in R
    1.6.5 Documenting the analysis

2 Data Summary
  2.1 Summary Statistics
    2.1.1 Kinds of Variables
    2.1.2 The Mean
    2.1.3 Other Notions of Typical
    2.1.4 Measuring Variation
    2.1.5 Linear Transformation
    2.1.6 Quantiles & Percentiles
  2.2 Graphical Summaries
    2.2.1 Histograms
    2.2.2 Stem-and-leaf plots
    2.2.3 Boxplots
  2.3 Graphical Principles
  2.4 Logarithms
  2.5 Homework Exercises (Problem Set 1)

3 Probability
  3.1 Example: Mendel's Peas
    3.1.1 The choice of characters
    3.1.2 Hybrids and their offspring
    3.1.3 Odds and Probabilities
    3.1.4 Subsequent generations
    3.1.5 An explanation
    3.1.6 Multiple Characters
  3.2 Probability formalism
    3.2.1 Conditional Probability
    3.2.2 Marginal, Joint and Conditional Probabilities
    3.2.3 Independence
    3.2.4 Complementary Events
  3.3 More Examples
  3.4 Bayes' Rule and Prediction
    3.4.1 A Diagnostic Test
    3.4.2 Bayes' Rule
    3.4.3 Example: The ELISA test for HIV
    3.4.4 Positive by Degrees
    3.4.5 Perspective
  3.5 Problem Set 2 (part 1 of 2)

4 Estimating and Testing a Probability
  4.1 Example: MEFV Gene and Fibromyalgia Syndrome
  4.2 The Binomial Distribution
    4.2.1 Calculating the tail probability
    4.2.2 Computing
  4.3 Estimating a probability
  4.4 Random Variables
    4.4.1 Expected Value
  4.5 The Law of Averages
    4.5.1 Mean and standard deviation of a binomial
  4.6 The Normal Distribution
    4.6.1 Standardized scale
    4.6.2 Some motivation
    4.6.3 Central Limit Theorem
    4.6.4 Standard error of the mean
    4.6.5 Areas under the normal curve
    4.6.6 Computing
  4.7 Summary
  4.8 Homework Exercises

5 Estimation & Testing using Student's t-Distribution
  5.1 A Single Sample
  5.2 A Paired Experiment
  5.3 The t-distribution
  5.4 Two Independent Samples
    5.4.1 Standard Error of a Difference
    5.4.2 Pooled Standard Deviation
    5.4.3 Example: Diet Restriction
  5.5 One-sided versus two-sided
  5.6 Computing
  5.7 Exercises (not turned in)
  5.8 Homework Exercises (Problem Set 3)

6 Comparison Examples
  6.1 Example: Genetics of T-cell Development
    6.1.1 Computing in R Commander
    6.1.2 Interpretation
    6.1.3 Computing with Prism
  6.2 Tests in General
    6.2.1 A Lady Tasting Tea
    6.2.2 Tests and Confidence Intervals
  6.3 t Test
    6.3.1 Interpretation of α
    6.3.2 Type I and Type II errors
  6.4 Example: Inference v. Prediction
  6.5 Assumptions
  6.6 Exercises

7 Contingency Tables
  7.1 Chi-square goodness-of-fit test
  7.2 Comparing 2 Groups: Testing independence of rows and columns
    7.2.1 One-sample versus two-sample chi-square tests
    7.2.2 Multiple testing
  7.3 A 2 by 2 Table
  7.4 Decomposing tables
  7.5 Fisher's exact test (small samples)
  7.6 Exercises

8 Power, Sample size, Non-parametric tests
  8.1 Sample Size
    8.1.1 Sample Size for Confidence Intervals
    8.1.2 Sample Size and the Power of a Test
    8.1.3 Computing
    8.1.4 Other Situations
  8.2 Paired Design
    8.2.1 Sign Test
    8.2.2 The Wilcoxon Signed-Rank Test
  8.3 Two Independent Groups
  8.4 Exercises

9 Correlation and Regression
  9.1 Example: Galton's height data
  9.2 Correlation Coefficient
  9.3 Regression
    9.3.1 Example: Spouse education
  9.4 Uses of Linear Regression
  9.5 The simple linear regression model
  9.6 Computing
    9.6.1 Transformed variables
  9.7 Exercises

10 Comparing several means
  10.1 Calorie Restriction and Longevity
    10.1.1 Global F-test
    10.1.2 Pairwise t-tests
  10.2 A Genetics Example
  10.3 Why ANOVA?
  10.4 Example: Non-transitive Comparisons
  10.5 Exercises

11 Context Issues
  11.1 Observational & Experimental studies
    11.1.1 Association versus Cause and Effect
    11.1.2 Randomization
    11.1.3 The Role of Randomized Assignment
    11.1.4 Simpson's Paradox
  11.2 Hypothesis-Driven Research v. High-Throughput Screening
    11.2.1 Testing One Hypothesis (Review)
    11.2.2 Multiple Testing Situations
    11.2.3 Type I Error Control
    11.2.4 Error Rate Definitions
    11.2.5 FWE
    11.2.6 FDR
    11.2.7 q-values
    11.2.8 Summary of multiple testing
  11.3 A Review Problem

Chapter 1

About Statistics

1.1 Comments About Statistics

Statistics (plural) are summary numbers, like your grade-point
average, or the USA population. The root of the word reflects its ancient
connection with matters of state.
Statistics (singular) is a field of study, widely applicable to science
and other rational endeavors, concerned with drawing conclusions from data
that are subject to variation or uncertainty.
Data are numbers in context. Conceiving of data as numerical is not
very limiting, as we can classify objects into categories and count them,
digitize images, ask experts to score specimens, and so on. Context,
however, is crucial. For example, data from an experimental intervention
will often permit much stronger conclusions than numerically identical
results from passive observation. The correct handling of data depends on
the context.
Variation is ubiquitous. Any measurement has limited accuracy, and the
limitations are often large enough to be important. Biological variation
may be seen, even when measurement errors are negligible. Populations of
plants, animals, and even populations of cells within a tissue, all exhibit
many kinds of variation. A genetic cross may produce many types of
offspring. Cellular immunity in mammals involves the random
rearrangement of receptors. Genes may be transcribed in bursts, leading to
highly variable levels of transcripts in individual cells. A stem cell may shift
between multiple expression profiles, without committing to differentiation.
Genetically identical NOD mice, kept in specific pathogen-free cages, will
usually develop diabetes, but some will not, and for no discernible reason.
All of these involve an element of randomness, or stochastic behavior.
Dealing with variation takes effort. It is human nature to do much of
our thinking using examples and stereotypes. Statistics and biology both
require population thinking, i.e. going beyond what is typical, and
considering how individuals vary. This might be as simple as reporting a
standard deviation in addition to an average, but it does take more effort
(e.g. two numbers instead of one) to keep track of variation. On a deeper
level, the prominent evolutionary biologist, Ernst Mayr, wrote that
population thinking is essential to modern biological thinking, and that it
was a relatively recent innovation in the history of ideas, explaining why
nearly two centuries elapsed from the work of Newton to that of Darwin,
even though the problem addressed by Newton seems more difficult.
Probability is the mathematical language for describing
uncertainty. Simple probabilities describe the chances of discrete events
that may or may not happen. Distributions of random variables describe
the probabilities that a numerical value will fall in various intervals. The
distribution of a random variable is a model for the process of sampling and
observing an individual from a population, so we often speak of
distributions and populations almost interchangeably.
Observation versus experiment is a major dichotomy in study
design. Variation that is passively observed and variation in response to
intervention are profoundly different things. Associations between variables
can be observed without intervention, but establishing a cause-and-effect
relationship requires something more: either experimental manipulation,
or strong assumptions.
The broad objectives of statistics are to summarize, infer, and
predict. Summary focuses on the data at hand. We may calculate
summary statistics and graphical displays to better appreciate the
information in a large dataset. Inference involves drawing conclusions
about a whole population based on a limited sample from that population.
Sometimes the sample is quite small. We calculate a statistic (e.g. the
sample mean) from the available sample, in order to infer the approximate
value of a parameter (e.g. the population mean) which is a characteristic of
the entire population. This is a recurring pattern: using sample statistics to
estimate the parameters that characterize a population. Prediction
attempts something even more ambitious. Instead of trying to estimate,
say, an average value for a large population, we attempt to predict the
specific value for a given member of the population. This often involves
observing additional variables for the individual of interest. While inference
problems often yield to increasing amounts of data, some things are
inherently unpredictable.
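The statistic/parameter distinction above can be made concrete with a tiny sketch (in Python here; the "population" and all numbers are invented purely for illustration):

```python
import random

random.seed(1)

# A made-up population of 100,000 measurements.
population = [random.gauss(100.0, 15.0) for _ in range(100_000)]

# The parameter: the population mean, normally unknown to us.
mu = sum(population) / len(population)

# A limited sample, and the statistic computed from it.
sample = random.sample(population, 25)
xbar = sum(sample) / len(sample)

# The sample mean is used to infer the approximate value of mu.
print(mu, xbar)
```

With only 25 observations, the sample mean approximates the population mean, but with appreciable sampling error; later chapters quantify that error.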
The rest of the lecture. Our first example will be a form of prediction (or
calibration), called bioassay, in which we will estimate the number of stem
cells in a specific culture, based on the engraftment success rate when the
cells are used in a mouse model of bone marrow transplantation. We will
then consider a toy problem of inference. Given a very small sample of
taxicab numbers, we will estimate the total number of taxis in a city. In the
remaining time, we will discuss the relationship of statistics to science and
mathematics, with a nod to a few famous philosophers. Having briefly
illustrated inference and prediction, we will take up data summary in the
following chapter.

1.2 First Example: A Bioassay for Stem Cells

Let's start with an example from City of Hope. Shih et al.¹ describe the
expansion of transplantable human hematopoietic stem cells in ex vivo
culture. Documenting this achievement to the satisfaction of referees,
however, presented a problem. It was not possible to identify hematopoietic
stem cells (HSCs) by direct observation. HSCs are defined by their capacity
for both self-renewal and for differentiation into multiple lineages. While
HSCs are found among cells expressing the CD34 and Thy-1 antigens, most
cells expressing these markers are already committed to a specific lineage.
The authors could show that they could expand a population of cells
bearing markers associated with stem cells, and they could also show that
the expanded culture still contained cells that could support engraftment in
the SCID-hu mouse model of HSC transplantation. Referees, however,
pointed out that the engraftment might be supported by a subpopulation of
stem cells that were maintained in culture, but not expanded.

The problem

What was needed was a demonstration that they had expanded the unseen
subpopulation of cells that can support engraftment. A quantitative
assessment of this sort, based on biological function as opposed to direct
measurement, is called a bioassay.

Bioassay (noun): Measurement of the concentration or
potency of a substance by its effect on living cells or tissues.

¹ Blood 1999, 94, 1623-1636

The Result

The investigators did a dilution experiment using fresh CD34+ Thy-1+ cells
in the SCID-hu mouse model. Four different cell doses (10,000, 3,000, 1,000,
and 300 cells per graft) were evaluated. Each dose was used in 60 mice
from each of two model systems (thymus/liver and bone model). For each
model, the number of mice with long-term engraftment was observed to
increase with the cell dose, and calibration curves were fit to the
engraftment rates, as shown for the bone model in figure 1.1 below.

[Figure 1.1: Calibration curve, bone model: 10,000 cultured cells are
equivalent to approximately 16,000 fresh cells. A lower bound on this
estimate is 7,900 fresh cells. The figure reproduces Fig 5B of Shih et al.,
whose caption notes that lower confidence bounds were found by applying
the exact binomial lower confidence bound for the cultured-cell
engraftment probability.]

The calibration curves permitted the investigators to use the engraftment
rate from cultured cells to estimate an equivalent dose of fresh cells. For
the bone model, the result was that 10,000 cultured cells were equivalent to
16,350 fresh cells. Contrary to the worry that the culture would merely
maintain stem cells, the stem cells seemed to increase somewhat faster than
the culture as a whole.

A lower bound on the equivalent cell dose was also calculated. This involves
two sources of variation. If we imagine that each dose of fresh or cultured
cells has a true underlying engraftment rate, which we might discover if
we could do a very large number of experiments, then our actual results
with a modest number of animals will approximate the true rate with some
error. To get a lower bound, we consider that the true engraftment rate
with cultured cells might be smaller than the rate we observed, and that
the true calibration curve might be somewhat further to the left than our
estimated curve. We won't go into the details at this point, but the
estimated lower bound was 7,900 for the bone model (fig 1.1). Because the
original number of cells grown to 10,000 was much smaller than this, the
implication is that the stem cells were indeed expanded in culture, even
after allowing for experimental variation. However, our initial estimate that
10,000 cultured cells may be equivalent to more than 16,000 fresh cells must
be tempered by realizing that they may also be equivalent to as few as 7,900
fresh cells. The expansion of stem cells in culture is established with high
confidence. The initial impression of selective growth of stem cells versus
non-stem cells in culture appears to be merely an impression, and not a
reliable conclusion.
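The lower-bound idea can be sketched generically. The notes do not reproduce the engraftment counts or the exact procedure, so the counts below are hypothetical; the function computes a one-sided exact (Clopper-Pearson-style) lower confidence bound for an engraftment probability, the kind of bound applied to the cultured-cell data (a Python sketch):

```python
from math import comb

def upper_tail(k, n, p):
    # P(X >= k) for X ~ Binomial(n, p)
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def exact_lower_bound(k, n, alpha=0.025):
    # Smallest p0 for which observing k or more successes out of n
    # is not too surprising (upper tail >= alpha); found by bisection,
    # since the upper tail increases with p.
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if upper_tail(k, n, mid) < alpha:
            lo = mid  # p0 too small to plausibly yield k successes
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical: 45 of 60 mice engraft. The observed rate is 0.75, but
# the true rate could plausibly be somewhat lower.
print(exact_lower_bound(45, 60))
```

The bound answers the question in the text: how small could the true engraftment rate be and still make the observed result unsurprising?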

1.2.1 The Model

The type of calibration curve that was fit to the reconstitution data is a
standard model called a logistic regression. It would not make sense to fit a
straight line to the data, because the reconstitution rate must always be
between 0 and 1 (equivalently, 0% and 100%). Instead, we assume that the
engraftment rate increases from zero to one as the cell dose increases,
according to some relatively simple function. The function that was used is
called the logit function, which is the logarithm of the odds of engraftment.
If we let p be the proportion of mice that engraft, then p/(1 - p) is the odds
of engraftment, and our model is

    log( p / (1 - p) ) = α + βx

where x is the dose, while α and β are parameters that we choose to make
the curve as close as possible to the data. This model defines a family of
S-shaped curves, one for each combination of α and β. Three members of
this family of curves are depicted in figure 1.2. Changing β makes the curve
steeper or flatter, i.e. more or less responsive to dose. Changing α moves
the curve to the left or right, changing the dose that yields 50%
engraftment. We take "as close as possible to the data" to mean that the
choice of α and β should maximize the likelihood of the data that were
actually observed. The precise definition of likelihood is a theoretical matter
that we won't take up here, but it is worth noting that maximizing a
likelihood is a general principle, and there is a body of theory stating that
estimation based on maximum likelihood delivers some desirable properties.

[Figure 1.2: Logistic regression curves. The two curves with broken lines
illustrate the effect of varying each of the two parameters. Horizontal
axis: dose; vertical axis: probability of engraftment, from 0.0 to 1.0.]
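The fitting idea can be sketched in a few lines (Python here; the engraftment counts below are invented, since the notes do not reproduce the data table, and a crude grid search stands in for a proper maximum-likelihood algorithm):

```python
import math

def p_engraft(alpha, beta, x):
    # Inverse of the logit: engraftment probability at dose x.
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def log_lik(alpha, beta, data):
    # Binomial log-likelihood over (dose, n mice, n engrafted) triples.
    total = 0.0
    for x, n, k in data:
        p = p_engraft(alpha, beta, x)
        total += k * math.log(p) + (n - k) * math.log(1.0 - p)
    return total

# Hypothetical counts: 60 mice per dose, as in the experiment.
data = [(300, 60, 3), (1000, 60, 5), (3000, 60, 11), (10000, 60, 53)]

# Crude maximum likelihood: search a grid of (alpha, beta) pairs and
# keep the pair that makes the observed data most likely.
grid = [(a / 10.0, b / 10000.0) for a in range(-50, 1) for b in range(1, 20)]
alpha_hat, beta_hat = max(grid, key=lambda ab: log_lik(ab[0], ab[1], data))
print(alpha_hat, beta_hat)
```

Real software maximizes the same likelihood with a smarter algorithm (and also reports the uncertainty in α and β), but the principle is exactly the one described above.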

1.2.2 Perspective

There are several points worth noting about this example.

- The bioassay allowed the investigators to study a class of cells that
  they could not observe directly.
- The need for a calibration curve required additional laboratory work.
- The estimate of an equivalent number of fresh cells came with a
  measure of uncertainty in the form of a lower bound.
- The uncertainty permitted a forceful conclusion about the main point:
  the numbers of cells supporting engraftment were clearly expanded in
  culture.
- The calibration also suggested that stem cells were preferentially
  expanded, but this was only a hint. The point estimate was consistent
  with faster growth for stem cells, but the lower bound indicates that
  the same, or slightly slower, growth cannot be ruled out.
- And finally, adding a small amount of statistical analysis addressed a
  major concern of referees, and got the paper through peer review.

The points above indicate why one would want to do the dilution
experiment and the necessary statistical calculations. Actually doing the
statistical work involves understanding a number of distinct concepts and
tools. Among these are the following.
Probability. In the SCID-hu mouse model, transplants of 1000 cells per
graft sometimes engrafted, but often did not. The tendency for
engraftment to happen more reliably as the cell dose increases is the
basis for asserting that the culture actually expanded hematopoietic
stem cells. We suppose that the cell dose determines the probability
of engraftment, but this probability is an unseen parameter that we
can only estimate, using a finite (and rather limited) number of
SCID-hu mice.

Logarithms. The logit function takes the unit interval (from zero to one),
in which probabilities must lie, and maps it onto the entire real
number line. The logit function is the (natural) logarithm of the odds
of engraftment. Logarithms are often useful in statistics, and are
worth at least some refresher-level attention.

Linear Model. The logistic regression model relates the logit-transformed
probability to two parameters, which can be thought of as a slope and
intercept. Linear models, of the form y = α + βx, are extremely
useful, despite their simplicity, and their usefulness can be extended
further by replacing y with a logarithm, logit, or some other function.


Experimental Design. In order to be useful, the experiment needed to
use enough SCID-hu mice, at enough different doses, spread across a
big enough range. While there are some strategies and tools to help
with the planning, there is also a lot of guesswork and biological
intuition. In this example, essentially everything but the engraftment
response was controlled by the experimenter. Studies that need to
rely on naturally occurring variation in more variables can be much
harder to design and analyze.

The experiment itself. The conclusions depend on maintaining
well-defined experimental conditions and procedures. There is a
substantial body of statistical methods for quality control and process
optimization that are much-used in industry, and sometimes used in
laboratories.

Model Fitting. The model supposes that probabilities of engraftment at
the four cell doses fall on a smooth, S-shaped curve, which is
determined by just two parameters. This assumption may be wrong
in detail, but it is probably good enough for our purposes.

    "All models are wrong. Some are useful." (George Box)

We can use a computer program to find the pair of values of α and β
that best fit the data. We can also find a range of values for α and β
that adequately fit the data. Here "adequately" depends on how
much chance of an error we are willing to tolerate.

Computing. To actually get the computing done, we have to organize the
data in a suitable form, and use a computer program. We will go
through the details in a tutorial exercise.
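To connect the Logarithms and Linear Model points above: the logit and its inverse map back and forth between probabilities in (0, 1) and the whole real line, which is what lets a linear model act on a probability (a Python sketch; the course computing itself will use the tools listed in the preface):

```python
import math

def logit(p):
    # log-odds: maps a probability in (0, 1) to the whole real line
    return math.log(p / (1.0 - p))

def inv_logit(z):
    # inverse logit: maps any real number back into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

print(logit(0.5))               # even odds give log-odds of 0.0
print(logit(0.9), logit(0.1))   # symmetric about zero
print(inv_logit(logit(0.73)))   # the round trip recovers the probability
```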

1.3 The taxi problem

Let's consider a simple problem that involves inferring something about a
population from a small sample. This is a toy problem described by
Gottfried Noether.²

² Introduction to Statistics, a Fresh Approach, Houghton Mifflin (1971)

Dr. Noether was traveling, and trying to hail a taxicab. Several went by,
but all were already hired. He started to wonder how many taxicabs
operated in this city (clearly not enough). He noted the numbers on the
taxi medallions that each cab displayed, which were

    97, 234, 166, 7, 65, 17, 4.

These are the data. He formulated a simple model, in which the taxicabs
were numbered sequentially, starting with 1, the highest number, say N,
being the total number of taxicabs in the city. N is the parameter whose
value we would like to know. It describes a feature of the population of
taxicabs in the city.

We need a model relating the parameter to the probability of observing
different possible samples. If we assume that all cabs are equally likely to
be in operation, then each number from 1 to N has an equal chance of
showing up among the data. We can now think about ways to estimate N
from the observed data.

1.3.1 Estimators

Under our assumptions, N must be at least as large as the largest
observation, so we could simply use the largest observation to estimate N.
Since this estimate is probably smaller than N, we might add some
increment in the hope of getting closer, but how much should we add? We
might suppose the largest falls short of N by about the same amount that
the smallest exceeds 0. This suggests adding the smallest observation to the
largest observation. Let's give these two estimators of N some names to
distinguish them. Let's call the maximum N̂_a, the hat designating an
estimator of N, and the subscript indicating simply that it is our first idea
for an estimator. Let's then use N̂_b to denote the maximum plus minimum
estimator.

Can you think of other ideas for a good estimator?

Here's one more idea. We might consider averaging all of the gaps between
observed numbers to estimate the likely gap between the largest and N. So
we let N̂_c be the maximum plus the average gap. (With a little thought, we
can come up with an easy way to calculate N̂_c without actually calculating
all the individual gaps.)
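The three estimators are easy to write down and apply to Noether's sample (a Python sketch; the simplification used for the average gap, (max - k)/k for k observations, is one answer to the "easy way" hinted at above):

```python
def n_hat_a(sample):
    # first idea: the largest observation
    return max(sample)

def n_hat_b(sample):
    # maximum plus minimum
    return max(sample) + min(sample)

def n_hat_c(sample):
    # maximum plus the average gap between ordered observations,
    # counting the gap below the smallest; the k gaps among k
    # observations contain max - k unobserved numbers in total
    k = len(sample)
    return max(sample) + (max(sample) - k) / k

taxis = [97, 234, 166, 7, 65, 17, 4]
print(n_hat_a(taxis), n_hat_b(taxis), n_hat_c(taxis))  # 234, 238, about 266.4
```

The three ideas give noticeably different answers on the same data, which is exactly why the next step is to compare their performance over many simulated samples.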


How should we choose among possible estimates?


Any one of them might turn out to be closest to N in a single sample, so
let's think of applying our ideas to many future samples. Let's use the term
estimator to refer to the method of computing an estimate, the latter term
referring to the actual number we get when we apply the estimator to a
particular set of observations. Since we don't know the value of the
parameter N, we can't pick the best estimate with certainty, but if we
study a set of hypothetical populations where we do know N, we may be
able to identify the best estimator, i.e. the estimator that tends to be
closer to N than the other estimators, on average in repeated use.
We can easily use a computer to simulate samples from 1, . . . , N, for any
value of N we like. We can then compare the results for the different
estimators. We can summarize the simulations graphically, or we can use
summary statistics. We'll do both. One particularly compelling summary is
the mean square error between estimators and N . If we square the
difference before averaging over simulations, we penalize large differences,
which we would like to avoid. The mean square error can be decomposed
into a measure of variation (or precision) and a measure of bias (or
accuracy). The standard deviation is the root mean square deviation of an
estimator from its mean (which might be different from N ). This can be
computed without knowing the target parameter, N , and it measures
variation, without regard to accuracy. In analogy with archery, it would
measure how tightly grouped the arrows are, without regard to the
bulls-eye. Bias, on the other hand, is the difference between the average
value of the estimator, and the target, N . It is analogous to the distance
from the middle of the cluster to the bulls-eye. These are related by
MSE = SD² + Bias²

which essentially says that standard deviation and bias are related to MSE
by the Pythagorean theorem. To be a little more explicit, let's suppose we
have n simulations, numbered 1, . . . , n, yielding estimates N̂_1, . . . , N̂_n.
Note that we are now considering only one estimator, say the maximum plus
minimum, and using the subscript to denote the result from different
simulated samples. Let N̄ = (1/n) Σ N̂_i be the average value of our
estimator over all of the simulations. Then we can write the decomposition
of the MSE as

(1/n) Σ (N̂_i - N)² = (1/n) Σ (N̂_i - N̄)² + (N̄ - N)²
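As a quick numerical sanity check (a sketch, not part of the original notes), we can confirm the decomposition in R with a handful of made-up estimates:

```r
# Check MSE = SD^2 + Bias^2 on a small, made-up set of estimates of N
N.true <- 240                       # hypothetical true parameter value
est <- c(230, 250, 245, 225, 260)   # pretend results from 5 simulations

mse  <- mean((est - N.true)^2)
bias <- mean(est) - N.true
sdev <- sqrt(mean((est - mean(est))^2))  # root mean square deviation from the mean

all.equal(mse, sdev^2 + bias^2)     # TRUE: the decomposition holds exactly
```

Note that `sdev` here uses the divisor n, matching the decomposition above, rather than the n - 1 used by R's built-in `sd` function.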

1.3.2 Simulation of estimator performance

We will do some simulations in class and calculate these summaries. To do
the simulations we need a computer, a programming language, and the
actual program code. We will use the R programming language
(http://cran.cnr.berkeley.edu/) which will also be encountered in the
homework tutorial.
Some program code for simulated sampling and calculation of estimators is
shown below. R is an interpreted language, so we can simply start R, and
type in the commands, executing one line at a time, or, with this file
displayed in Acrobat, we can copy and paste the code into R. This will be
done in class.
options(digits=3)
# sample data:
y = c(97, 234, 166, 7, 65, 17, 4)
y
# Note that in R, when you type the name of an object
# (y, in this case) the object is printed.

# Some estimators
est1 = function(x) max(x)
est2 = function(x) max(x) + min(x)
est3 = function(x) max(x) + mean(diff(sort(c(1,x))))
est4 = function(x) 2*mean(x)

# package them together for convenience


est.all = function(x) c(max=est1(x), maxmin=est2(x),
gap=est3(x), dblmean=est4(x))
# try them all on the data
round(est.all(y))
# Which is best?
# We decide by comparing performance


# Hypothetical situation
N = 240   # the number of taxis (which we want to learn from the data)
n = 7     # the number of observations
# A simulated sample
y.eg = sample(1:N,n)
y.eg
round(est.all(y.eg))
# Now repeat the simulation many times
reps = 200
yrep = matrix(data=sample(1:N, reps*n, replace=TRUE), nrow=reps, ncol=n)
# Look at the first 20 simulations
yrep[1:20,]
# and apply all the estimators to each sample
r = apply(yrep, 1, est.all)
r = t(r)
# Look at the estimators from the first 20 simulations
r[1:20,]
# Graphically examine the results for each estimator
par(mfrow=c(2,2))
for(i in 1:4){
  x = r[,i]                      # get data for one of the estimators
  x.name = dimnames(r)[[2]][i]   # get the name of the estimator
  hist(x, breaks=seq(0,450,by=20), main=x.name, xlab="")
}
# A different look using boxplots
par(mfrow=c(1,1))
boxplot(as.data.frame(r))
abline(h=N, lty=3)
# Summarize the performance of each


rmse = apply(r, 2, function(x) sqrt( mean((x - N)^2) ))
bias = apply(r, 2, function(x) mean(x) - N)
sd   = apply(r, 2, function(x) sqrt( mean((x - mean(x))^2) ))
tbl = rbind(rmse, bias, sd)
tbl
tbl^2

1.4 Philosophical Bearings

Why should a student of science, with much to learn and limited time,
study statistics? The answer that seems obvious to some of us is that
statistics is an important part of the scientific method. That role is not
exclusive, though: much of science does not involve statistics, and
statistical methods are often used in government, finance, and industry, as
well as in science. In terms of methodology, however, statistics lies at the
interface of mathematics and science, which employ quite distinct methods.
Mathematics is deductive. We start with axioms, which are assumed to be
true, and we deduce their consequences. These consequences may be
proven, i.e. shown to be definitely true, without uncertainty. However, the
whole enterprise is concerned with an idealized Platonic world, rather than
the real world. Science moves in the opposite direction. We observe
outcomes and try to infer the causes that gave rise to them. The
conclusions cannot be proven with mathematical certainty, but they do deal
directly with the real world.
Statisticians use the methods of mathematics to help advance the methods
of science. Mathematical statisticians propose methods for designing
studies and analyzing data, and they work out the accuracy and
vulnerabilities of these methods. Applied statisticians, or statistically
knowledgeable scientists, apply these methods to draw conclusions. The
application of statistical methods does not necessarily require great skill in
mathematics, but it does require an awareness of the accuracy and
vulnerabilities of the methods to be applied.
The Speed Berkeley Research Group has a website about their work on
statistical methods in functional genomics that carries this relevant

statement³:
What has statistics to offer? A tradition of dealing with
variability and uncertainty; a framework for devising, studying
and comparing approaches to questions whose answers involve
data analysis. In statistics we have horses for courses, not one
for all weathers. That is why guidance is needed.
Because statistics courses seem to be one of the few places in science
curricula where scientific methods get any sort of formal treatment, it
seems fitting to start with a broad view of science, to get our bearings.

1.4.1 Science as selection

We can draw an analogy between science and natural selection. The
analogy amounts to an extremely terse history of life on earth.
1. Life: Information is passed between generations in genes.
2. Evolution: Natural selection of genes generates diverse species
adapted to their environments.
3. Culture: Information is passed between generations in minds and
literature.
4. Science: Data-based selection of ideas increases understanding of the
natural world.
Aside from making science (and teaching) look remarkably important, this
analogy has two major points:
Science is a cultural activity.
Science often involves selection among hypotheses.
The fact that science is cultural means that different people will require
different degrees of evidence for what they believe. This problem is greatly
reduced by the notion of selection, i.e. by focusing on what can be ruled
out with compelling force.
³ http://www.stat.berkeley.edu/~terry/Group/home.html



Our belief in some hypotheses can have no stronger basis than
our repeated unsuccessful critical attempts to refute it. Karl
Popper (1961, The Logic of Scientific Discovery)

According to Popper, we can never prove any hypothesis, but we make
useful progress by trying to disprove them. The surviving hypotheses
constitute a useful, but provisional, view of the world. This approach to
science is sometimes called the hypothetico-deductive method. In plain
words, this means guess and test, but the guessing involves a lot of
knowledge of one's subject.

1.4.2 Examples of hypothesis rejection

Discovery of viruses: Loeffler and Frosch demonstrated the presence of
ultra-microscopic infectious organisms, now known to be viruses.
They passed lymph from an animal suffering from foot-and-mouth
disease through a filter, which ruled out bacteria (of ordinary size) as
the infectious agent. They also infected animals serially, which ruled
out any non-replicating poison.
Finding promoters: More recently, the regulatory regions of genes are
frequently dissected by sequentially shortening the sequence under
study and testing for expression, an approach known in some circles
as promoter bashing.
Genetic exclusion mapping: In an experimental genetic cross involving
a classic Mendelian phenotype, variation in the location of meiotic
cross-overs in a backcross allows the investigator to rule out most of
the genome, leaving a plausible interval that decreases in size as the
data accumulate.
Rejecting chance as an explanation:
For complex (incompletely penetrant) genetic traits, failure to see a
phenotype does not rule out the genotype as a contributing (non-sufficient)
cause; it just makes it less likely. Consider a backcross experiment using the
NOD mouse model of diabetes.


1. We genotype the diabetic mice and look for regions of the genome
that depart from Mendelian ratios. If the departure is big, we rule out
chance variation and conclude that there is something to interpret.
2. We include genotypes for non-diabetic mice to rule out a lethal
recessive allele.
Having ruled out both chance and lethal recessives as competing
hypotheses, we can then conclude that the genetic pattern is related to
diabetes.
We will return to this example when we study specific methods for
statistical hypothesis testing. At this point, the thing to notice is that the
ability to reject chance as an explanation for patterns in our data can rescue
the hypothesis-rejection strategy, allowing us to investigate situations
involving noise and uncertainty. The bioassay described in section 1.2
enabled Dr. Shih to study stem cells quantitatively, despite the problem
that they could not be identified by markers. Using breeding experiments,
mouse geneticists have identified regions of the genome responsible for a
disease even before any genes within those regions have been identified.
The extension of hypothesis testing into noisy realms comes at a price, in
that many observations are needed. The feasible size of the study may not
permit rejecting the chance hypothesis to the satisfaction of everyone. The
different degrees of evidence required by different individuals can re-enter
the situation when statistical methods are needed. This is particularly true
in medical research, where the accumulation of evidence that one therapy is
superior to another may, at some point, preclude further research on the
putatively inferior therapy, on ethical grounds.

1.4.3 A broad model

The foregoing are all examples where hypotheses are proposed and
compared to data in a single project. Sometimes there is a protracted
debate.
Gregor Mendel proposed a particulate model of inheritance. This was in
contrast to the notion, current in his day, that inheritance involved some
sort of blending.


Mendel started with true-breeding lines, e.g. a line that always produces
yellow peas, and a line that always produces green peas. When he
cross-fertilized these two lines, he did not obtain any blending of the colors.
Instead, all of the first generation offspring (the F1 generation in modern
notation) had peas of the same color. When this generation was allowed to
self-fertilize, the resulting generation produced both of the original colors of
peas in a highly repeatable 3:1 ratio, but without any blending of the
characters.
Describing what we would now call F2 generations (the result of
hybridization of two inbred lines followed by self-fertilization), Mendel
wrote⁴:
If now the results of the whole of the experiments be brought
together, there is found, as between the number of forms with
the dominant and recessive characters, an average ratio of
2.98:1, or 3:1.
In passing from 2.98 to 3, he was asserting that his model fit the data, and,
in a sense, rejecting the need for any further explanation of the variation.
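To see how the particulate model generates the 3:1 ratio, we can simulate an F2 generation in R (a sketch with hypothetical allele labels, not part of the original notes): each F2 plant receives one allele from each Aa parent, and shows the dominant phenotype unless both alleles are recessive.

```r
set.seed(1)
n <- 10000
# each F1 parent is Aa, so each transmitted allele is A or a with equal chance
a1 <- sample(c("A", "a"), n, replace = TRUE)
a2 <- sample(c("A", "a"), n, replace = TRUE)
dominant <- (a1 == "A") | (a2 == "A")  # phenotype is dominant unless genotype is aa
mean(dominant) / mean(!dominant)       # close to 3, as in Mendel's 2.98:1
```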
Mendel's choice of experimental plants was crucial. The seven traits he
studied were each under the control of a single gene. Traits like height and
weight, however, seemed more a matter of blending than a transfer of
discrete genes (a word coined later). For more than a decade after the
rediscovery of Mendel's paper, those who studied such traits regarded this
as evidence against Mendel's model. These investigators were described as
the biometric school, in contrast to the Mendelian school of thought.
The tension was resolved, largely by Ronald Fisher, who explained the
inheritance of quantitative traits as the result of contributions from many
genes.
This is an example of progress on the deductive part of the
hypothetico-deductive method to explain existing data. It was not a
revision of Mendel's notion of particulate inheritance; rather it was an
improved understanding of what it implied. Far-reaching ideas like
Mendel's don't necessarily stand or fall on a single fact.
⁴ translated; see e.g. http://www.mendelweb.org/


Description versus hypothesis testing


Not everything in science fits easily into the hypothetico-deductive mold.
Sequencing the human genome, for example, was done for much broader
reasons than testing any specific hypothesis. A recent pair of opinion pieces
in Nature⁵ debate the merits of a data-first approach versus putting
hypotheses first. Of course part of the debate is over what sort of science
should be funded. (Almost all grant proposals need to couch their
objectives in terms of hypotheses to be tested.) More constructive, perhaps,
is a news article⁶ in the same issue of Nature, with the title "Life is
Complicated", which addresses recent efforts to integrate rapidly
accumulating molecular information with more traditional
hypothesis-driven research. Sequencing, and many other large-scale data
collection activities are essentially observational. But sometimes the
data-first approach can involve experiments. The article cites the example
of Eric Davidsons lab, at Caltech, which works on the sea urchin as a
model system. The lab has been systematically knocking out transcription
factors to build a map of how they work together in development. This
approach combines assessment of the whole transcriptome with highly
specific experimental manipulations, and it has uncovered a modular
structure of regulatory networks.
More examples from genetics
Even though everything need not fit into a neat philosophy, having some
philosophical bearings can be useful when things get difficult. In particular,
the key feature that makes a hypothesis scientific is that it is potentially
falsifiable, i.e. if it is wrong, it can be shown to be wrong. This principle is
sometimes useful for identifying ill-posed questions. Sometimes, however,
data precede any clear hypotheses.
Exercise 1.4.1 Several brief descriptions of landmark results in genetics
are given below. For each description, can you identify a hypothesis that has
been rejected, or does it seem to be more a matter of observing and
explaining?
⁵ Nature 2010, 464:678-679
⁶ Nature 2010, 464:664-667



1. The early ideas about genes treated them as discrete and constant. In
the 1930s, Hermann Joseph Muller showed that exposing fruit flies to
X-rays produced flies with novel features that were inherited. This
meant that genes could be altered.
2. In the early 1940s chromosomes had long been recognized as the
cellular location of genes, but it was still unclear whether it was the
proteins or the DNA in chromosomes that carried the information of
heredity. Oswald Avery showed that a strain of an attenuated
bacterium could be exposed to extracts from a virulent strain and
recover virulence, which could then be passed on through cell divisions.
If the DNA in the extract was destroyed, no virulence ensued. If the
protein was destroyed, the bacteria continued to become virulent.
3. In 1953, Watson and Crick published the structure of DNA based on
the data of Franklin and Wilkins. They wrote,
It has not escaped our notice that the specific pairing we
have postulated immediately suggests a possible copying
mechanism for the genetic material.
4. Shortly after the structure of DNA had been published, there was
speculation as to the nature of the genetic code for amino acid sequences.
A particularly neat hypothesis was that the code was "comma-free",
so that only one reading frame would yield valid codons. This reduced
the 64 possible three-base codons to 20 feasible codons, exactly the right
number for encoding the amino acids in proteins. This hypothesis was
soon discredited by an experiment that fed poly-U RNA to the cellular
translation machinery and obtained a monotonous peptide. UUU was
one of the forbidden codons in the comma-free hypothesis.
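The counting behind the comma-free hypothesis can be reproduced in a few lines of R (a sketch, not part of the original notes): homopolymer codons such as UUU are unchanged by a cyclic shift, so they can never belong to a comma-free code, and the remaining 60 codons fall into cyclic classes of three, from which at most one codon per class can be used.

```r
bases <- c("A", "C", "G", "U")
# all 64 three-base codons
codons <- as.vector(outer(outer(bases, bases, paste0), bases, paste0))
rot <- function(s) paste0(substr(s, 2, 3), substr(s, 1, 1))  # cyclic shift
# homopolymers are fixed by rotation, hence forbidden in a comma-free code
homo <- codons[codons == sapply(codons, rot)]
length(homo)                 # 4: AAA, CCC, GGG, UUU
(64 - length(homo)) / 3      # 20 cyclic classes of size 3 remain
```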

Paradigms
While the hypothetico-deductive method holds a prominent place in
science, there is no single scientific method. In 1962, Thomas Kuhn
published a short but very influential book, The Structure of Scientific
Revolutions⁷, that introduced the notion that scientists tend to follow
established paradigms, which are previously successful approaches to an
area of science. Ordinary science involves working within a paradigm. A
scientific revolution is the replacement of a paradigm. And paradigms are
replaced, not necessarily because they are contradicted by experiment, but
when they cease to be productive.

⁷ see book review, Science, Vol. 338, 16 November 2012
A 50th anniversary review of Kuhn's book is among the handouts.
One reason for taking notice of Kuhn is that, as a scientist, you will
certainly encounter the term paradigm, used in this sense. Another reason is
that the field of statistics is somewhat unusual in having two paradigms
that have long coexisted, without replacing each other. The frequentist
paradigm focuses on the evaluation of statistical methods by their
performance in repeated use. The simplest statistical methods tend to be
frequentist procedures, hence the frequentist paradigm dominates
elementary statistics textbooks. The Bayesian paradigm uses probabilities
to simultaneously model both the distribution of data and the truth of
inferential statements. It has a more coherent theory, and some advantages
of interpretation, but it requires a little more expertise in theory and
computing. It is particularly useful in problems that require combining
data from multiple sources. You might encounter mention of Bayesian
methods in statistical treatments of genetics and functional genomics.

1.5

The notion of population

In the bioassay example, the probability of engraftment was thought to be


primarily determined by the number of stem cells injected into a mouse.
This assumes that a lot of other things are fixed parts of the situation. We
can think of the whole experimental set-up as generating a population of
potential transplants, with the probability of engraftment being a property
of this population.
In the taxi problem, the population was more definite. We assumed a fixed
number of taxicabs, N, working in the city. We sample from this
population by trying to hail a cab and noting the medallion number. We
might worry that our sampling method leaves something to be desired, but
the notion of population is pretty clear.
The hypothetico-deductive method can be related to the notion of a
population. In general, we can deduce what kind of sample to expect under
each of several competing hypotheses, and then try to obtain enough data
that we can reject some of the hypotheses. If we want to estimate a
numerical quantity, like the speed of light, or the probability of a mutation,
we can consider each possible numeric value as a distinct hypothetical
value, characteristic of the population. Given a sample from the
population, the set of hypothesized values that we cannot reject as a
plausible source of the data will form an interval, which probably contains
the actual value for the population.
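As a concrete illustration of an interval of non-rejected values (a sketch using R's built-in binom.test; the numbers are borrowed from the bioassay's highest dose, 53 engraftments in 60 mice):

```r
# the confidence interval is the set of hypothesized engraftment
# probabilities that the observed data do not rule out
bt <- binom.test(53, 60)
bt$estimate   # the observed proportion, 53/60
bt$conf.int   # an interval of plausible population values around it
```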
In the reading assignment, Ernst Mayr notes that biological populations not
only consist of non-identical individuals, but they may not be constant over
time. The length of a rabbit's ears varies somewhat from rabbit to rabbit,
but the typical length characteristic of a species has probably varied during
the evolution of rabbits. We need to keep this sort of variation in mind
when we consider whether a statistical inference problem is well-posed.
Exercise 1.5.1 Figure 1.3 gives results from several studies all attempting
to measure the same physical quantity, the angular deflection of light due to
the gravity of the sun. Taken as a whole, are the observations consistent
with the general relativity value? How many experiments produced intervals
that exclude the value γ = 1.0? What do you make of the intervals from
these experiments?
Exercise 1.5.2 Figure 1.4 gives results from several studies, all attempting
to measure the risk of non-febrile seizures following febrile seizures. The
studies fall into two classes. The clinic-based studies sampled patients who
were seen at a variety of specialty clinics and hospitals. The
population-based studies attempted to follow all reports of febrile seizures
within a defined population. Are the clinic-based studies consistent with any
single value for the risk of subsequent non-febrile seizures? Are the
population-based studies consistent with a single numerical risk of
subsequent non-febrile seizures applicable to all patients with febrile
seizures?

Figure 1.3: Multiple estimates of a physical constant.

Figure 1.4: Multiple estimates based on samples from two different kinds of
population.

1.6 Homework: A Computing Tutorial

In this exercise we will:


1. enter data from Dr. Shih's bioassay experiment into a spreadsheet
and save it as a comma-separated (.csv) file;
2. download and install the R program, which we will use again later;
3. read the data into R and produce a plot, and related calculations;
4. save our computing instructions in a file to document the calculations.
The point of this tutorial is primarily to get a better idea of what is involved
in carrying out the statistical analysis of an actual experiment, and to get
the R program installed, so that we can use it for some simpler calculations
later.

1.6.1 Creating a Data File in Excel

Let's use Microsoft Excel for entering the bone system engraftment data
into a file. Excel is a very common tool for organizing raw data. If you
don't have a copy of Excel, you can use any plain text editor, such as
Notepad, to create a file with columns of numbers separated by commas.
Don't use commas for any other purpose, however, e.g. don't use commas
within large numbers. Other alternatives, in the absence of Excel, are to
download either LibreOffice from http://www.libreoffice.org/ or
OpenOffice from http://www.openoffice.org/.
Create three columns of numbers, with a header row at the top to label the
columns. The data should be organized as in Table 1.1.
Aside from the header row, each cell should contain one number and
nothing else. You should start at the top, and leave no blank rows, nor any
blank fields within the rectangular table of data. The space between Cell
and Dose in one of the variable names is ok. When we read the data into R,
this will get converted into a period, yielding the name Cell.Dose.
Save the table as a comma-separated (.csv) file. Let's assume the file name
is bone.csv, and the file is in the folder C:/biostat.



Cell Dose    r    n
    10000   53   60
     3000   12   60
     1000    4   60
      300    0   60

Table 1.1: Spreadsheet layout for the bioassay data.

The bone.csv file can be read by the R program, which can easily do
further calculations. If we want to use Prism, instead of R, to plot the
engraftment rates against the cell dose, we would need to calculate the
rates outside of the Prism program, e.g. in Excel, so let's look at how that
can be done.
In Excel, label a fourth column (top row) as p (for proportion). Move the
cursor to the second row, fourth column and type an equal sign. This tells
the spreadsheet that a formula is coming. Then click on the cell with the
53. This will write the cell name in the formula. Follow that with the slash
for division, and click on the 60 next to the 53 to put that cell name in the
formula, and complete the formula by hitting the tab key. Now highlight
the cell with your new formula, and the three cells below it. Ctrl-D will copy
the formula down the column, adjusting the cell names in each formula to
refer to the different rows. You now have a column with the calculated
proportions.
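If you prefer, the same proportions can be computed in R in a single vectorized step (a sketch using the counts from Table 1.1):

```r
r <- c(53, 12, 4, 0)     # engrafted mice at each cell dose
n <- c(60, 60, 60, 60)   # mice tested at each dose
p <- r / n               # element-wise division gives all four proportions
round(p, 3)              # 0.883 0.200 0.067 0.000
```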

1.6.2 Installation of R

The following assumes you are using Windows. Installation is similar under
Mac OS X, but some of the link names will be different. It is also assumed
that you have administrator privileges on the computer where you are
installing R.
1. Point your web browser at http://cran.cnr.berkeley.edu/ (or do
a google search on CRAN and use any mirror site you like).
2. Download the installation file;
Follow the Download R for Windows link.


Select base.
Select Download R 3.0.2 for Windows, or whatever newer
version is there. Save it in any convenient folder on your
machine, or just on the desktop.
3. Start the installer program (by double-clicking it). This is a typical
install wizard that will ask a few questions. You can take all of the
defaults, but on Windows I like to install programs to my own progs
folder, leaving Program Files for the things that ITS installs. When
the installation is complete, there should be an R icon on your
desktop and an R folder on your Start menu. On a Mac, there will
be an R application in the Applications folder.

1.6.3 Trying it out

Click on the R icon. You should get an R Console window. In that window,
try typing 2 + 3 then press return. Try sin(pi/2). You can use R as a
scientific calculator.
Try typing these lines, with a return after each line.
F = c(-40,0,32,68,98.6, 212)
F
C = (F - 32)*5/9
C
cbind(F,C)
The first line combines six numbers, representing Fahrenheit temperatures,
into a vector, and assigns it to an object named F. This happens silently,
i.e. without printing anything. In the second line, giving the name of the
object, without an assignment, causes the object to be printed. The third
line does a calculation on each element of F, converting the temperatures to
Celsius, and assigning the result to an object named C. Typing C alone on
the fourth line prints the Celsius values. The final line uses the cbind
function to create a small table by binding two columns of numbers
together. Because there is no assignment, the table is printed in the R
Console window.
To quit R, you can either



1. close the R Console window,

2. choose Quit R from the R pull-down menu, or

3. type the command quit() at the command line, followed by return.

R is a statistical programming language, as opposed to a menu-driven
statistics and graphics application of the sort advertised in the pages of
Science. A very large number of specialized statistical tools have been
implemented in R by the scientific community. The Comprehensive R
Archive Network (CRAN) has a large general collection, and the
Bioconductor site has a more specialized collection, oriented towards
molecular biology. But because it is a language, learning it takes some
effort. Once one has a basic orientation, doing standard statistical
calculations in R is really no more difficult than using a menu-based
program. One simply looks through the help pages, instead of looking
through the menus, to find the appropriate functions.
The help system in R is available from the Help pull-down. Choose Html
help on a Windows system, or R Help on a Mac. The Search Engine &
Keywords link will let you find relevant material by searching, or via a
hierarchical list of topics. The Introduction link provides a manual of
sorts, with a sample session in an appendix.
Help pages may actually be preferable to menus, as help pages are more
informative, and often offer references to help the user understand what is
being calculated. There is much more to data analysis than understanding
how to use a computer program, and it is possible for statistical tools to be
too handy. However, programming is more like playing the piano than like
riding a bicycle, in that the skill fades quickly if you don't practice. For
that reason, future computing exercises will use either Prism, which is a
commercial menu-driven package, or an add-on package for R, called R
Commander, that provides a graphical user interface (GUI) that makes R
work more like a commercial statistics package. This tutorial, however, will
use R in its basic form, which is how someone doing more involved work
would likely use it.

1.6.4 Data Analysis in R

This will necessarily be a rather mechanical exercise in copying commands
to the R console window and executing them.
After starting R, we need to set the working directory to the folder where we
saved the dataset. You can either use the pull-down menus (File / Change
dir... under Windows, Misc / Change Working Directory under Mac OS X)
and navigate to your folder, or you can type the path in a setwd command,
as below.
setwd("C:/biostat")
This just lets the R program know where to look for things like data files.
There is a counterpart to the setwd function, that you can use like so:
getwd()
This will print the full path to the working directory, i.e. it tells you where
you are. Note that you don't need to supply the getwd function with any
argument, but you do need to type the parentheses that usually hold
function arguments. The presence of parentheses tells R that you want to
execute the function. If you just type the function name, without the
parentheses, R tries to print the function object, just like any other object.
This can be handy if you have written functions in R, and want to look at
the code, but in the case of getwd, it only produces some uninteresting
technical information.
To read the data:
bone = read.csv("bone.csv", header=TRUE)
This will be silent. To see the data, type bone (followed by return, of
course). The object bone, created by the read.csv function, is something
called a data frame. It is a rectangular array of variables, of possibly
different types (e.g. numeric, logical, character, factor; we haven't yet
defined these things), but all of the same length. Columns are variables,
and rows are distinct individuals or observations.
Let's make a plot.


plot(r/n ~ Cell.Dose, data=bone)


The ~ character is part of a statistical modeling notation. The command
says to plot r/n as the response, on the vertical axis, and Cell.Dose as the
predictor variable, on the horizontal axis. The variable names are inside the
bone data frame object, so we have to tell the plot function where to find
them with the data= argument. If you used different variable names when
you made the data file, you will need to adjust the plot command to use
those variable names.
The plot actually looks pretty straight. We might have gotten away with a
simpler linear model, rather than the logit model, but the latter is more broadly applicable.
Let's repeat that plot, but with a logarithmic horizontal axis, covering a
wider range of values.
plot(r/n ~ Cell.Dose, data=bone,
log="x", xlim=c(300,20000), ylim=c(0,1))
Don't worry about understanding the details. The point is that many variations on the plot are possible by specifying extra arguments to the plot function. Typing ?plot will bring up a help page, but it's not essential that
you look at it right now.
Now let's fit the logistic regression model, which is the main event here.
bone.fit = glm(r/n ~ log(Cell.Dose),
data=bone, family=binomial, weights=n)
bone.fit
Note that we split the long function call across two lines. If one line isn't complete, R will keep looking on the next line. Our two-line instruction here calls the glm (generalized linear models) function to fit the logistic regression model, and assigns the result to an object called bone.fit. The second instruction simply gives the name of the object, which prints the result, in a brief form. In some programs, a statistical procedure like this would spew a page or more of results. In R, it is typical for a statistical function to wrap its results up into an object which can then be interrogated by other functions to get what you need. Just typing the name

1.6. HOMEWORK: A COMPUTING TUTORIAL


of the object usually produces a rather brief summary. Here we have fit a model of the form

log(p/(1 − p)) = α + βx

and the summary gives us estimates of α and β. It is common in statistics to adorn a parameter with a hat (or similar mark) to distinguish an estimate from the true value of the parameter. Following this convention, we can write

α̂ = −19.357
β̂ = 2.292.
These two parameters determine the calibration curve. We might like to
look at the curve. Here's how.
x = seq(from=300, to=20000, by=20)
y = -19.357 + 2.292 * log(x)
inv.logit = function(z){ exp(z)/(1 + exp(z))}
logit = function(p) { log(p/(1 - p)) }
lines(inv.logit(y) ~ x)
The first line creates a sequence of points along the horizontal axis, from 300 to 20000, one every 20 units. (We can see how many points we created
by typing length(x).) The second line just applies the fitted model to each
of the points we generated. The model, however, links a linear function of
cell dose to the logit of the probability of engraftment, not to the
engraftment rate directly. In order to convert the linear predictor, y to a
probability, we have to apply the inverse of the logit function. The third
line defines the inverse logit as a function. The fourth line defines the logit
function. This really isn't necessary, but it allows us to check that we really did get the inverse right, by a few calculations like these:
> inv.logit(logit(.5))
[1] 0.5
> logit(inv.logit(2.5))
[1] 2.5
Finally, the lines function was used to add a line to the plot. The line is
really a bunch of segments connecting many dots, but it looks pretty
smooth.


Let's use the calibration curve to evaluate the engraftment from cultured cells. According to table 4 of Shih et al., 52 out of 56 mice engrafted after receiving 10000 cultured cells. (This is a slightly different rate from that which led to the estimates quoted in the text, and above, but arguably as relevant.) If we draw a horizontal line at 52/56 ≈ 0.93 and note where that intersects the calibration curve, we can read off the equivalent dose of fresh cells.
abline(h=52/56, lty=2)
The abline function is for drawing lines with intercept a and slope b, but
here we use an extra argument, h for specifying the height of a horizontal
line. The lty=2 argument just specifies a broken line. (Note that
arguments to R functions can be specified by position, useful for required
arguments in the first positions, or by name, which is useful for skipping
over optional arguments.)
We can find where the rate of 52/56 intersects the calibration curve using a
linear interpolation function.
approx(inv.logit(y), x, 52/56)
We see that this is at a cell dose of about 14250 fresh cells. (Slightly lower
than the 16350 in figure 1.)
We can draw a line segment or arrow at that point with the following
instruction. The arguments are, respectively, the x and y coordinates of the
beginning and end of the arrow.
arrows(14250, 52/56, 14250, 0)
The lower bound calculation involves consideration of several sources of
error. For simplicity, we just consider one. The observation that 52 out of
56 animals engrafted may be accidentally optimistic. This is a very
common situation, in which we have a number (56) of independent trials of
an experiment yielding a binary result (engraftment or not) with the same
probability of success on each trial. The total number of successes in such a
situation is said to follow a binomial distribution. We can easily calculate a
95% lower confidence bound. This means that the method of calculating
the bound will, in 19 out of 20 experiments, yield a bound that is in fact
below the probability of engraftment.


binom.test(52, 56, alt="greater")


The lower bound is approximately 0.844. The binom.test function reports a number of other things that we can ignore.
We can do the same calibration exercise using this lower bound.
abline(h=0.844, lty=2)
approx(inv.logit(y), x, 0.844)
arrows(9721, 0.844, 9721, 0)
Note that 93% engraftment is higher than any of the engraftment results
used to make the calibration curve. This is extrapolation, and it is rather
inaccurate due to the flatness of the curve in this region, and even
dangerous, as it depends on the shape of the curve in unexplored territory.
However, if we take a lower bound on a cultured cell engraftment rate, that
will be within the range of the calibration data, where calibration is more
reliable. Fortunately, the argument that culture expanded stem cells rests
on concluding that the lower bound is high enough, and the calibration
curve is adequate for this purpose.

1.6.5 Documenting the analysis

One of the advantages of using a statistical programming language, as opposed to a typical statistics package, is that it is easy to document your calculations in a program. After an interactive session in which we explore the data, correct errors, and so forth, we can collect the key calculations as a list of R instructions, as shown below. The pound signs mark comments that might be useful later. Everything to the right of a pound sign will be ignored by R.
# Calibration calculations for determining equivalent fresh cells
# from engraftment rates, using data from Shih et al., 1999.
# Read the bone system data
bone = read.csv("bone.csv", header=TRUE)
# Fit a logistic regression
bone.fit = glm(r/n ~ log(Cell.Dose),
data=bone, family=binomial, weights=n)


# Print the estimated model:


print(bone.fit)
# Start a plot
plot(r/n ~ Cell.Dose, data=bone,
log="x", xlim=c(300,20000), ylim=c(0,1))
x = seq(from=300, to=20000, by=20)
y = -19.357 + 2.292 * log(x)
### fit results are hard-coded here
inv.logit = function(z){ exp(z)/(1 + exp(z))}
logit = function(p) { log(p/(1 - p)) }
lines(inv.logit(y) ~ x)
# calibrate an engraftment rate
abline(h=52/56, lty=2)
approx(inv.logit(y), x, 52/56)
arrows(14250, 52/56, 14250, 0)
### hard-coded calibration results
# get a lower bound and calibrate that
binom.test(52, 56, alt="greater")
abline(h=0.844, lty=2)
### hard-coded lower-bound
approx(inv.logit(y), x, 0.844)
arrows(9721, 0.844, 9721, 0)
### hard-coded calibration results
These commands can be executed again. Paste these R instructions into a
file, using a plain text editor, like notepad. Save the file as, say,
calibrate.R, in your working folder. In R, you can either go to the File
pull-down, select Source File and navigate to calibrate.R, or you can simply type source("calibrate.R") on the command line. This will execute the instructions, which should produce a plot.
Major caveat: Documenting what you did with one set of data and writing
a program to apply similar steps to new data are two different things. Note
that several lines have been marked with a comment that something is
hard-coded. This means that some result from earlier steps has been
copied directly into the instruction. If one were to run these instructions
with new data, these results would be incorrect. It is possible to turn a
transcript of an interactive session into a re-usable computer program, but
that involves more effort, additional programming techniques, judgement,
and testing. The proper use of code that documents analysis steps is to
read it, in order to understand any questions that arise about the details,
and to re-use it only while reading it and thinking about it.



Finishing up
Look up the title function in the R help pages. Use it to put a title on
your plot, with your name, and turn it in.


Chapter 2
Data Summary


Suggested reading: Samuels and Witmer (SW) Chapter 2.

2.1 Summary Statistics

Summary statistics can be contrasted with inferential statistics. In summarizing a dataset, we are concerned with the data at hand. Inferential
statistics arise when the data at hand are a sample of some larger
population or process. We would like to have a summary of the population,
but we have to settle for estimates based on a smaller sample.
Statistics are numerical summaries of actual observations.
Parameters are features of an unobserved population. The population
may be real and finite, or an indefinite number of potential results
from some process.
We sometimes need to distinguish a sample mean from a population mean,
or a sample standard deviation from a population standard deviation, but
in this chapter, we will focus primarily on summaries of actual data.

2.1.1 Kinds of Variables

We will use the term variable to refer to a well-defined measurement taken on each member of a sample of interest. A dataset is often represented in a
computer file as a rectangular array, with columns representing variables
and rows representing individuals that are observed or measured, and
perhaps experimentally treated.
An often-cited classification of variables is due to Stevens¹:
nominal, e.g. blood type (A, B, AB, O);
ordinal, e.g. pathology scores (-, +, ++, +++);
interval, a variable with a well-defined unit, but without a well-defined origin, e.g. time, or Celsius temperature;
¹ SS Stevens, On the Theory of Scales of Measurement. Science, 1946, 103:677-680

ratio, a positive variable with a unit and an origin, e.g. weight or number.
We often speak of quantitative measurements, without distinguishing
between interval and ratio scales, but some statistics only make sense for
variables that are strictly positive, e.g. the coefficient of variation, which is
a measure of variation expressed as a fraction of the mean. These require a
ratio-scale.
Discrete variables take values from a finite set. Continuous variables could,
in principle, take a value between any other two values. In practice, there is
always a limited resolution, and a finite scale. A categorical variable is a
discrete variable that only takes a few possible values, so many observations
will be in the same category. The pathology scores are an example of
ordered categories. Sometimes observations can be ranked, i.e. put in order,
so that there are few, if any, ties. There are special methods to deal with
ranked data, and sometimes we decide to only pay attention to the rank
order of quantitative data.

2.1.2 The Mean

The mean (average, or arithmetic mean) is a common summary of quantitative measurements. It conveys an idea of what is typical, or central.
The mean of a set of n numbers is the total divided by n. This is usually
what people mean when they say the average.
The mean is an equal share. If you split the bill at a restaurant equally,
each person pays the mean cost.
Given a sample, x1, . . . , xn, the sample mean, usually denoted x̄, can be thought of as an equal share of the total,

x̄ = (Σ xi)/n,

or a weighted sum,

x̄ = Σ wi xi,   where Σ wi = 1.


Example: Calculate the average number of alleles shared identical by descent at the DBQ1 locus in 278 pairs of sisters with cervical cancer, given the following data:

xi    wi
0     0.228 = 63/278
1     0.457 = 127/278
2     0.315 = 88/278

Here the xi are the three possible values for the number of shared alleles, and the wi are the fractions of the sample with the respective value of xi, so

x̄ = Σ_{i=1}^{3} wi xi = (0.228)0 + (0.457)1 + (0.315)2 = 1.087.

Expressing the mean as a weighted sum for grouped data like this amounts
to using the distributive law to reduce our labor.
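The grouped-data arithmetic can be checked in R (the language used in this course's computing tutorials); this is a small sketch using the counts from the table above.

```r
# Values (alleles shared IBD) and counts of sister pairs, from the table above
x <- c(0, 1, 2)
counts <- c(63, 127, 88)
w <- counts / sum(counts)   # weights: fraction of the 278 pairs in each group

xbar <- sum(w * x)          # weighted sum of values
xbar

# The same answer comes from the built-in weighted.mean,
# or from expanding the grouped data back into 278 observations
stopifnot(all.equal(xbar, weighted.mean(x, counts)))
stopifnot(all.equal(xbar, mean(rep(x, counts))))
```

The exact value, 303/278 ≈ 1.090, differs slightly from the 1.087 above only because the table's weights were rounded to three decimals.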
Some things to notice about the mean:
A histogram balances when supported at the mean.
The sum of deviations from the mean is zero:

Σ_{i=1}^{n} (xi − x̄) = 0.

The mean minimizes the sum of squared deviations, i.e.

Σ_{i=1}^{n} (xi − k)²

is minimized by taking k = x̄.
The mean permits recovery of the total, if the number of observations is
known. Given means and sizes of subgroups, a grand mean can be
calculated.
The mean may fail to represent what is typical if there are a few extreme
values. This can be a problem if a measuring instrument occasionally gives
wild results, or if there are genuine but extreme values in the sample or
population.
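Two of these properties are easy to verify numerically; here is a small R check using made-up numbers.

```r
x <- c(2, 3, 7, 10)   # made-up data
xbar <- mean(x)

# Residuals from the mean sum to zero (up to rounding error)
stopifnot(abs(sum(x - xbar)) < 1e-12)

# The mean minimizes the sum of squared deviations:
# any other center k does worse
sse <- function(k) sum((x - k)^2)
stopifnot(sse(xbar) < sse(xbar + 0.5),
          sse(xbar) < sse(xbar - 0.5))
```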


The mean of a zero-one indicator variable is the proportion of ones.
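A one-line R check of this fact, with a made-up indicator:

```r
z <- c(1, 0, 0, 1, 1)                 # indicator of some event in five trials
stopifnot(all.equal(mean(z), 3/5))    # mean = proportion of ones
```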

Sample versus Population Mean


Let's revisit the allele-sharing example, but consider the weights to be the hypothetical neutral probabilities of sharing zero, one, or two alleles IBD.
xi    wi
0     1/4
1     1/2
2     1/4

It would be conventional to use the Greek letter μ to denote the mean, defined as

μ = 0(1/4) + 1(1/2) + 2(1/4) = 1


because this is the mean of a hypothetical population of indefinite size, rather than the mean of a specific set of numbers. Sometimes this notion of
rather than the mean of a specific set of numbers. Sometimes this notion of
a mean is called the expected value.

2.1.3 Other Notions of Typical

The median of a set of numbers has half of the numbers above and half
below. If the number of observations is odd, it is the middle value. If there
is an even number of observations, the convention is to take the average of
the two middle values.
The median is resistant to perturbations. Several large observations can be
made extremely large without affecting the median. The median is a
good summary for things like individual or family incomes, because it
remains representative of many actual values, even when a few values are
extremely large.
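A quick R illustration of this resistance, with hypothetical incomes (in $1000s):

```r
incomes <- c(20, 25, 30, 35, 40)   # made-up values
c(mean = mean(incomes), median = median(incomes))   # both 30

# Make the largest value extreme: the mean chases it, the median does not
incomes[5] <- 4000
c(mean = mean(incomes), median = median(incomes))   # mean 822, median still 30
stopifnot(mean(incomes) == 822, median(incomes) == 30)
```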

Geometric mean

The geometric mean of two numbers, a and b, is √(ab). The geometric mean of three numbers, a, b, c, is ∛(abc).

Another way of thinking about these, is that the geometric mean of a set of
numbers is the mean of the logarithms of those numbers, transformed back
to the original scale. In other words, the geometric mean is the antilog of
the mean of logs.
For example, if a = 10, b = 100 and c = 1000, the mean of (common) logs is
(1 + 2 + 3)/3 = 2, so the geometric mean is 102 = 100. Compare this to the
arithmetic mean, (10 + 100 + 1000)/3 = 370. The geometric mean is less
influenced by very large observations.
Note that the base of logarithms does not matter, so long as the same base
is used for the anti-log. This is because logarithms to one base are constant
multiples of logarithms to a different base, and averaging preserves that
multiple.
Geometric means are common because analysis of data on logarithmic
scales is often useful.
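The antilog-of-the-mean-of-logs recipe is a one-liner in R; this sketch reproduces the example above and confirms that the base does not matter.

```r
x <- c(10, 100, 1000)

gm <- exp(mean(log(x)))   # natural logs, then antilog
stopifnot(all.equal(gm, 100))

# Common (base-10) logs give the same answer
stopifnot(all.equal(10^mean(log10(x)), 100))

mean(x)                   # arithmetic mean: 370, pulled up by the large value
```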


Harmonic mean
The harmonic mean of a set of numbers is the reciprocal of the average of
reciprocals.
If we have a fleet of three vehicles with respective gas mileages of 10, 20, and 30 miles per gallon, and we plan a trip of 30 miles, we expect to consume

30/10 + 30/20 + 30/30 = 5.5

gallons of fuel. If all we knew about the fleet gas mileage was the arithmetic mean of 20 mpg, our estimate would be

3 × (30/20) = 4.5

gallons, which is too small. If, instead, we knew the harmonic mean, 3/(1/10 + 1/20 + 1/30) ≈ 16.36, we could calculate

3 × (30/16.36) = 5.50

gallons, which is as good as having the individual mileage numbers.
Both the harmonic mean and the geometric mean involve transforming
data, calculating a mean, and transforming back to the original scale.
However, the harmonic mean is probably not encountered as often as the
geometric mean.
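The fuel example can be checked in R:

```r
mpg <- c(10, 20, 30)      # mileages of the three vehicles
trip <- 30                # miles driven by each

actual <- sum(trip / mpg)   # 30/10 + 30/20 + 30/30 = 5.5 gallons

hm <- 1 / mean(1 / mpg)     # harmonic mean: reciprocal of the mean reciprocal
stopifnot(all.equal(3 * trip / hm, actual))

3 * trip / mean(mpg)        # the arithmetic mean predicts 4.5 gallons: too small
```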
Some exercises involving means
Exercise 2.1.1 (Restaurant Bill) Alice goes to dinner with three friends
and orders an $18 meal. Her three friends each order the $14 special. Since
Alice drove them all to the restaurant, one of Alices friends proposes that
they split the bill evenly into four equal shares. Suppose neither tip nor tax
apply. (a) What is the average price of the four meals? (b) How much does
each person contribute? (c) How much does Alice save by splitting the bill
equally? (d) How much extra is it costing each of her friends?


Exercise 2.1.2 (Loaves) Here is a well-known², but slightly more complex puzzle about sharing. Three travelers meet on a road and share a campfire.
They also decide to share their evening meal. One of them has five small
loaves of bread. The second has three similar loaves. The third has no food,
but has eight coins. They agree to share the eight loaves equally, and the
third traveler will pay the eight coins for his share of the bread. The second
traveler (who had three loaves) suggests that he be paid three coins, and that
the first traveler be paid five coins. The first traveler says that he should get
more than five coins. Is he right? How should the money be divided up?
Exercise 2.1.3 (Clever Chemist) Long ago, a chemist wanted to weigh
a sample as accurately as possible using an old twin-pan balance. With the
sample in the left pan, the balancing weights in the right pan totaled A
grams. With the sample in the right pan, the balancing weights in the left
pan totaled B grams. Assuming the left and right lever arms of the balance
are of lengths L and R, respectively, the balancing weights satisfy

XL = AR   and   XR = BL,

where X is the unknown weight of the sample. How should the two observations, A and B, be combined?
Exercise 2.1.4 (Sum of residuals, in general) Given sample measurements x1, x2, . . . , xn, their mean is

x̄ = (1/n) Σ_{i=1}^{n} xi.

The sum of residuals (differences from the mean) is always zero, i.e.

Σ_{i=1}^{n} (xi − x̄) = 0.

Give either an intuitive explanation, or a mathematical argument, or a graphical explanation for why this is so.
² Jim Loy, http://www.jimloy.com/puzz/8loaves.htm

Exercise 2.1.5 (Rosner Table 2.4) The following table gives the distribution of minimal inhibitory concentrations (MIC) of penicillin G for N. gonorrhoeae.

Concentration (µg/ml)      Frequency
0.03125 = 2⁰(0.03125)      21
0.0625  = 2¹(0.03125)       6
0.125   = 2²(0.03125)       8
0.250   = 2³(0.03125)      19
0.5     = 2⁴(0.03125)      17
1.0     = 2⁵(0.03125)       3

Calculate the geometric mean. Why might the geometric mean be desirable, compared to the simple mean? Why might any mean be inadequate for summarizing these data?

2.1.4 Measuring Variation

While the mean can convey an idea of a typical value, a second summary
number is needed to provide an idea of the variation around that central
number, and the standard deviation often serves this role.
The standard deviation is the root mean square of deviations around the
mean.
This conceptual definition can be applied directly, if the list of numbers is the entire population of interest. We then call it the population standard deviation. However, if we are using a sample of n observations to estimate the amount of variation in a larger (perhaps infinite) population, we generally compute a sample standard deviation, which is inflated by a factor of √(n/(n − 1)). For large samples, this inflation factor makes very little difference.
Root Mean Square
As a preliminary, let's consider summarizing how large a set of numbers are, not in the usual sense where large means greater than zero, but in the sense of absolute value, where large means far from zero in either direction.
The root mean square of a list of numbers is the square root of the


average of the squares of the numbers.

Let x1, . . . , xn be n numbers. The root mean square is

rms = √( (1/n) Σ_{i=1}^{n} xi² ).
The rms measures how much the numbers spread out around zero.
Example 2.1.1 Consider the following set of numbers: {1, −3, 4, 5, −7}, and compare the mean and the root mean square. The mean is

(1 − 3 + 4 + 5 − 7)/5 = 0.

The rms is

√((1² + 3² + 4² + 5² + 7²)/5) = √((1 + 9 + 16 + 25 + 49)/5) = √20 = 4.472.
Example 2.1.2 Let's add 3 to each of the numbers in the list above, to get: {4, 0, 7, 8, −4}. What are the mean and the rms? (Answers: The mean is 3. The rms is √29.) Suppose we added 5 instead of 3. Can you guess what the mean and rms would be without doing any calculations? What if we had subtracted 3 from each observation instead of adding 3?
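R has no built-in rms function, but the definition is one line; this sketch reproduces the two examples.

```r
rms <- function(x) sqrt(mean(x^2))

x <- c(1, -3, 4, 5, -7)
stopifnot(mean(x) == 0)
stopifnot(all.equal(rms(x), sqrt(20)))   # about 4.472

y <- x + 3                               # 4, 0, 7, 8, -4
stopifnot(all.equal(mean(y), 3),
          all.equal(rms(y), sqrt(29)))
```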
Standard Deviation
The standard deviation is the root mean square of the deviations from the
mean.
Instead of being a measure of how far the numbers are spread from zero, it
is a measure of how far the numbers are from their mean.
Given a set of numbers x1, . . . , xn, the standard deviation, σ, is

σ = √( (1/n) Σ (xi − x̄)² )        (2.1)

where x̄ is the mean. In other words, we subtract the mean from each number, to obtain deviations from the mean, then we take the root mean square of these deviations.


We often make a small-sample correction and multiply by a factor of √(n/(n − 1)) for reasons to be discussed later. This corrected, or sample, standard deviation can be written as

s = √( (1/(n − 1)) Σ (xi − x̄)² ).

The only difference from the population standard deviation is the use of (n − 1) rather than n in the denominator when computing the average squared deviation.
The notation here, i.e. using σ for the population standard deviation, and s for the sample standard deviation, follows the standard practice of using Roman letters for sample statistics, which are calculated from observed data, and Greek letters for population parameters, which are often unobservable. Calculators often use σ to label the button that invokes the 1/n formula, and s for the button using the 1/(n − 1) formula, but the convention of using Greek letters for parameters is more about the concepts of populations versus samples than about formulas.
Example 2.1.3 Let's find the standard deviation of the set of numbers from the example above. The mean is

x̄ = (4 + 0 + 7 + 8 − 4)/5 = 15/5 = 3.

We subtract the mean from each entry to get a deviation, square the deviations, and calculate the average. Let's lay that all out in a table.

        xi   xi − x̄   (xi − x̄)²
         4        1          1
         0       −3          9
         7        4         16
         8        5         25
        −4       −7         49
sum     15        0        100
mean     3        0         20

The average of the squared deviations is 20, so the standard deviation is √20 ≈ 4.47, the root mean square of the deviations. Notice that the average


of the deviations from the mean is zero. If we use the sample correction, we would divide the sum of squared deviations by n − 1 = 4 instead of 5, so the sample standard deviation would be √25 = 5.
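This example can be reproduced in R; note that the built-in sd function uses the 1/(n − 1) sample formula.

```r
x <- c(4, 0, 7, 8, -4)
n <- length(x)

# Population SD: root mean square of deviations from the mean
pop.sd <- sqrt(mean((x - mean(x))^2))
stopifnot(all.equal(pop.sd, sqrt(20)))

# R's sd() divides by n - 1, giving 5 here
stopifnot(all.equal(sd(x), 5))
stopifnot(all.equal(sd(x), pop.sd * sqrt(n / (n - 1))))
```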


It is usually the case that approximately two-thirds of observations will fall
within one standard deviation from the mean, and approximately 19 out of
20 observations will fall within two standard deviations of the mean. If the
variation around the mean is symmetrical, we would expect the 1/3 that
fall more than one SD from the mean to be about equally split above and
below the mean. If the variation is skewed, so that, for example, the largest
values lie quite far from the mean, then we would expect more of the
departures to be on one side of the mean than the other, but we would still
expect roughly 2/3 of the observations to be within 1 SD of the mean. Like
all rules of thumb, there can be exceptions.
There are other possible summaries of variation. For example, the mean
absolute deviation (i.e. the average absolute value of deviations from the
mean), is conceptually simple, but the standard deviation has the advantage
of the above rule of thumb for one and two-standard deviations, and the
fact that the SD plays a natural role in more formal data analysis methods.

2.1.5 Linear Transformation

Adding a constant to each data value will add the same constant to the
mean, but it will not change the standard deviation. Multiplying each
observation by a constant (a scale change) will multiply both the mean and
standard deviation by the same constant.
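These rules are easy to verify in R; the values below are made up.

```r
x <- c(7.43, 7.16, 7.51, 7.30, 7.45)   # hypothetical measurements
a <- -7                                 # shift
b <- 100                                # scale
y <- (x + a) * b

# A shift moves the mean but leaves the SD alone;
# a scale change multiplies both
stopifnot(all.equal(sd(x + a), sd(x)))
stopifnot(all.equal(mean(y), (mean(x) + a) * b))
stopifnot(all.equal(sd(y), abs(b) * sd(x)))
```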
Exercise 2.1.6 (SW prob 2.56) A biologist made certain pH
measurements in each of 24 frogs; typical values were
7.43, 7.16, 7.51, . . .
She calculated a mean of 7.373 and a standard deviation of .129 for these original pH measurements. Next, she transformed the data by subtracting 7 from each observation and then multiplying by 100. For example, 7.43 was transformed to 43. The transformed data are

43, 16, 51, . . .


What are the mean and standard deviation of the transformed data?
Exercise 2.1.7 (Centering a variable) If data x1, x2, . . . , xn have mean x̄, and if we let y1 = x1 − x̄, . . . , yn = xn − x̄, then what is the mean of y1, . . . , yn?
Exercise 2.1.8 (Rescaling a variable) If data x1 , x2 , . . . , xn have
standard deviation s, and
if we let y1 = x1 /s, . . . , yn = xn /s,
then what is the standard deviation of y1 , . . . , yn ?
Exercise 2.1.9 (Standardizing a variable) If data x1, x2, . . . , xn have mean x̄ and standard deviation s, and if, for each i from 1 to n, we let yi = (xi − x̄)/s, what are the mean and standard deviation of y1, . . . , yn?

2.1.6 Quantiles & Percentiles

We can get a fuller description of a large set of numbers by adding a few more summary numbers. Percentiles are often used for this purpose.

The median can also be called the 50th percentile. The 25th percentile is the number with 25% of the observations below, and 75% above. We won't worry about interpolation to account for a sample size that doesn't divide perfectly, other than to note that different statistical software programs may use different conventions, which you might notice if you have a small sample. However, one can really only estimate percentiles well with a large sample, and you need more data as you move away from the 50th percentile.
Quantiles are just percentiles with the percent expressed as a fraction, i.e.
the 75th percentile is the 0.75-quantile.
The 25th and 75th percentiles are also called the first and third quartiles,
since together with the median, they divide the dataset into quarters of
(almost) equal size. The educational testing people like to use quintiles,
which divide data into five equal parts.
Sometimes the difference between two landmark numbers is used to
describe variation.


The range (maximum − minimum) is a poor summary because it varies dramatically among samples from the same population, and it tends to get
larger with increasing sample size. Often the two extrema (minimum and
maximum) are given, rather than their difference. While this is more
informative reporting, a larger sample is still likely to have a larger
maximum and a smaller minimum than a small sample from the same
source population. The inter-quartile range is the difference of the 1st
and 3rd quartiles the length of the box in a boxplot (described below).
This is a more reasonable measure.
A commonly used five number summary consists of the minimum, 1st
quartile, median, 3rd quartile and maximum. These are essentially what is
presented graphically in a boxplot.
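In R, the fivenum and quantile functions compute these summaries; the sample below is made up, and recall that interpolation conventions differ across programs.

```r
x <- c(2, 5, 7, 9, 11, 13, 20, 41)   # made-up sample

fivenum(x)   # min, lower hinge, median, upper hinge, max
stopifnot(all.equal(fivenum(x), c(2, 6, 10, 16.5, 41)))

quantile(x, probs = c(0.25, 0.5, 0.75))   # percentiles, default convention
IQR(x)                                    # inter-quartile range
```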

2.2 Graphical Summaries

2.2.1 Histograms

A histogram is a graphical representation of a frequency distribution. The horizontal dimension is a continuous quantitative measurement. The
vertical dimension represents the density of observations, i.e. the
frequency of observations per horizontal unit.
Figure 2.1 is a histogram showing the distribution of 380 weights of
olfactory bulbs from the brains of mice. The data were collected by
Williams et al. during an investigation to find genes involved in neural
development.³ The data are for mice from several different strains, and
include both males and females of various sizes. The authors write
Bilateral bulb weight in adult mice ranges from 10 to 30 mg. Half of this
remarkable variation can be predicted from differences in brain weight, sex,
body weight, and age. They go on to report 4 genetic loci that collectively
account for 20% of the phenotypic variance. They needed to account for
the explainable variation in order to find the relevant genetic loci, but the
histogram gives a picture of the overall variation of bulb weights.
The little vertical marks at the bottom are an extra feature, called a rug
plot. There is a mark for each observation. The height of the histogram
³ Behav Genet, 31:61-77, 2001

[Figure 2.1: A histogram with equal bars. Title: Histogram of Olfactory Bulb Weight; vertical axis: Frequency; horizontal axis: Total OB weight (mg).]


shows how crowded the data are in a given interval.


In this histogram, all of the bars have equal width, i.e. 2 mg. In this case,
the frequency is proportional to density, so we can label the vertical axis
with the frequency of observations. This provides information about the
sample size.
Sometimes the way the data are obtained does not permit having bars of
equal width. In this case, density and frequency are not proportional. Then
it is important that the height of the histogram bars be the density, so that
the area represents the relative frequency of observations in the interval
covered by a bar.
The heights of the blocks in a histogram represent density. The areas
represent percentages in an interval.
Some comments on histograms:
Histograms depict the density of a continuous measurement.
Density describes the fraction of data falling in various intervals.
Histograms can depict very large datasets.
The histograms from small samples may look quite different from that
of the parent population.
Histograms are often used in high-throughput measuring equipment such as
flow cytometers, although two-dimensional depictions such as scatterplots
and contour plots are perhaps more common.
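As a sketch of the density idea (on simulated, made-up weights), R's hist function with freq=FALSE puts density on the vertical axis, so that bar areas are fractions of the data:

```r
set.seed(2)
w <- rnorm(200, mean = 20, sd = 4)   # 200 simulated weights (mg)

h <- hist(w, freq = FALSE,
          main = "Simulated weights", xlab = "Weight (mg)")
rug(w)                               # rug plot: one tick per observation

# Bar areas (width times density) sum to 1
stopifnot(all.equal(sum(diff(h$breaks) * h$density), 1))
```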
Figure 2.2 gives some more examples of histograms. The first shows
measurements on the width of sepals from 150 iris flowers. Like many
biological measurements, the distribution is approximately hump-shaped, or
bell-shaped. Such data are well approximated by a normal (a.k.a.
Gaussian) distribution, to be discussed later. The data are well summarized
by a mean and standard deviation (defined below).
The second histogram shows the body weights of domestic cats used in an
experiment on the heart drug digitalis. The distribution trails off to the
right, but not to the left. That is because the cats were required to be at
least 2kg in order to be included in the experiment. This is an example of a
truncated distribution.


The third histogram gives the waiting times between eruptions of the Old
Faithful geyser in Yellowstone National Park, Wyoming. The data are
strikingly bimodal, meaning there are two prominent peaks. The mode of
the distribution of a continuous variable is the point of its maximum
density. A distribution may have multiple local maxima, in which case we
call it multi-modal. If we take repeated samples from the same distribution,
we may see minor local maxima at various places, but these may just
represent sampling variation, and they may also depend on the choice of
cut-points for the bars of the histogram. The peaks in the Old Faithful data,
however, are very prominent, and may well indicate some underlying
phenomenon.
The fourth example shows the survival times for mice infected with listeria.
There are two notable features. A substantial number of mice survived to
the end of the observation period. These observations are said to be
censored, i.e. the survival time is known to be longer than the last
observation time, but these mice were not observed to die, and in this
experiment, the censored mice probably cleared their infections. Among the
mice that did die on study, we see a distribution that is skewed to the right,
i.e. there are many short survival times, and a tail consisting of a few mice
that lasted considerably longer. Skewness and censoring are common
features of studies that measure survival, or the time until an event.

Figure 2.2: Some examples of histograms. (Four panels: Iris flowers, sepal
width in cm; Domestic Cats, body weight in kg; Old Faithful Geyser
Eruptions, waiting time in minutes; Listeria infection in mice, survival time
in hours. The vertical axis of each panel shows frequency.)

2.2.2 Stem-and-leaf plots

Stem-and-leaf plots resemble histograms, but exhibit the actual numbers.


They were designed as a quick way to record numbers and make a plot at
the same time, using pencil and paper, but some computer programs
generate them too.
The decimal point is 1 digit(s) to the right of the |
  37 | 5
  38 | 237889999999
  39 | 00001111111111112222222233333344444444455555666666666666777777778888
  40 | 00001222358
  41 | 3
  42 |
  43 | 7
The numbers here range from 375 to 437. The first two digits label the
stems, and the last digit provides the leaves.
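A display in this style (the header line above matches the output of R's stem function) can be produced in a few lines. Here is a simplified Python sketch, using made-up values, that takes the last digit as the leaf:

```python
# Sketch of a stem-and-leaf display in the style shown above.
# The data values are made up for illustration.

def stem_and_leaf(values):
    """Return lines 'stem | leaves', using the last digit as the leaf."""
    stems = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 375 -> stem 37, leaf 5
        stems.setdefault(stem, []).append(str(leaf))
    lo, hi = min(stems), max(stems)
    # include empty stems so the display doubles as a histogram
    return ["%d | %s" % (s, "".join(stems.get(s, []))) for s in range(lo, hi + 1)]

lines = stem_and_leaf([375, 382, 390, 391, 404, 413, 437])
print("\n".join(lines))
```

Turned on its side, the leaf strings form the bars of a histogram, while still showing every recorded value.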

2.2.3 Boxplots

Boxplots are a graphical display of a five-number summary (minimum, 1st
quartile, median, 3rd quartile, maximum). There is a box covering the
middle 50% of the data (between the quartiles), a line in the box at the
median, and whiskers out to the extrema.
Usually, the whiskers will be drawn to a length of no more than 1.5 times the
box length (the interquartile range), with outliers shown individually. Some
authors (e.g. Xiong et al.) use a different convention.
Box plots are for summarizing a large number of observations by a depiction
of 5 numbers (plus individual plotting of outliers). If you don't have many
observations per box, it's better to show the individual observations. You
should not use a box plot to cover up the fact that there isn't much data.
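The five-number summary and the 1.5 x IQR whisker rule can be sketched directly. This is a minimal Python illustration with made-up data; note that the quartile convention used here (median of each half) is only one of several, and statistical packages differ slightly:

```python
# Sketch: the five-number summary behind a boxplot, plus the common
# 1.5 x IQR rule for flagging outliers. Data values are made up.

def median(xs):
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def five_number_summary(xs):
    xs = sorted(xs)
    n = len(xs)
    lower = xs[: n // 2]          # lower half (middle value excluded if n odd)
    upper = xs[(n + 1) // 2 :]    # upper half
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

data = [2.1, 2.4, 2.5, 2.8, 3.0, 3.1, 3.3, 3.6, 3.9, 9.9]
mn, q1, med, q3, mx = five_number_summary(data)
iqr = q3 - q1
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print((mn, q1, med, q3, mx), outliers)
```

In this example the value 9.9 lies beyond the upper fence, so a boxplot would draw the whisker only to 3.9 and plot 9.9 as an individual point.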
The boxplots in Figure 2.3 also show the individual measurements that the
boxplots summarize. The horizontal position of the points is random
"jitter", added to allow us to see points that would otherwise lie on top of
one another. The


data are from a study at City of Hope comparing cellular immunity to
cytomegalovirus (CMV) after hematopoietic stem cell transplant from
CMV-positive and CMV-negative donors. All recipients were
CMV-positive. An interesting feature is that about half of the recipients
from CMV-negative donors had extremely low multi-functional T-cell levels
even after the stimulus of viral reactivation.

2.3 Graphical Principles

It is easy to make bad graphics, i.e. plots that are confusing or misleading.
There are a few general principles that have been proposed to help. The
most important is:
Show the data.
An example above shows boxplots with the individual points superimposed.
This gives the reader a view of all of the individual observations. Sometimes
barplots with error bars disguise the fact that there are only two or three
observations per bar. Edward Tufte suggests trying to maximize the
data-to-ink ratio.
It is also a good idea to avoid distracting features, such as
three-dimensional effects that do not encode data, and hatched or lined fill
patterns that are hard to focus one's eyes on. It should be clear to the
viewer whether values are encoded by linear distances or by area, and the
encoding should usually be linear, aside from special cases like the area
interpretation of histograms, or the use of different sized plotting symbols
to encode one more variable on a plot. Variables with a meaningful origin
should usually be plotted using the full scale, so that relative sizes are
apparent. Logarithmic scales are generally preferable to scale breaks, but if
scale breaks are needed, they should be large and obvious.
There is an art to good graphics that is hard to codify. The books of
Edward Tufte go into details and examples of good graphics. The books of
William Cleveland illustrate more technical or statistical ideas, such as the
use of quantile plots.

Figure 2.3: Multi-marker positive T-cells as a fraction of IFN-γ positive
T-cells in HSC transplant recipients from positive (D+R+) and negative
(D-R+) donors, before and after CMV reactivation. (Boxplots with
individual points superimposed, on a logarithmic scale.)

2.4 Logarithms

Logarithmic scales are often used in graphical presentations of data. Here
we will quickly review a few basic properties of logarithms, and point out a
few simple, but perhaps under-appreciated, properties.

Logarithms are the inverse of exponentials.

Noting that 1,000 = 10^3 and 100,000 = 10^5, we can write
log(1,000) = 3 and log(100,000) = 5.
In general, if y = 10^x, then log(y) = x.

Logarithms convert multiplication to addition.

In general,

    log(XY) = log(X) + log(Y),
    log(X/Y) = log(X) - log(Y).
For example,

    log(100,000 / 1,000) = log(100,000) - log(1,000) = 5 - 3 = 2,

so

    100,000 / 1,000 = 10^2 = 100.

Changing base changes the scale.

If you can calculate logs in one base, you can calculate logs in any base,
because

    log_a(Y) = log_b(Y) / log_b(a),

i.e. to get from base a to base b, we just multiply by 1/log_a(b).
There are three bases in common use:

    log10(Y) = X means 10^X = Y;
    loge(Y) = X means e^X = Y;
    log2(Y) = X means 2^X = Y.

If we know log10(Y) = X, then loge(Y) ≈ 2.303X, and log2(Y) ≈ 3.322X.
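The change-of-base rule and the 2.303 and 3.322 factors are easy to check numerically. A small Python sketch (the course's R tools have the same functions):

```python
# Sketch: verifying the change-of-base rule log_a(Y) = log_b(Y)/log_b(a)
# and the conversion factors quoted above: ln(10) ~ 2.303, log2(10) ~ 3.322.
import math

Y = 1000.0
X = math.log10(Y)            # X = 3, since 10^3 = 1000

ln_Y = X * math.log(10)      # natural log of Y via the 2.303... factor
log2_Y = X * math.log2(10)   # base-2 log of Y via the 3.322... factor

print(ln_Y, math.log(Y))     # both lines give the same value for each base
print(log2_Y, math.log2(Y))
```

The two printed pairs agree, confirming that multiplying a base-10 log by ln(10) or log2(10) converts it to the other scales.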

Figure 2.4: A plot on a logarithmic scale, with the axes labeled several
different ways. (The left axis shows original-scale values from 5e+01 to
5e+05 at logarithmically spaced tick marks; the right side carries three
different log scales.)
Figure 2.4 shows a plot using a logarithmic scale on the vertical axis. (The
plot shows theoretical numbers representing exponential growth the
details are not important.) The thing to notice is the number of different
ways the plot can be labeled. On the left, we have numbers on the original
scale, but with the tick-marks spaced logarithmically. On the right, we have
three different log scales. Once we have used a logarithmic scale to place
the points on the page, we can label them with any other logarithmic scale.
The plot uses the three common log scales, but they are not labeled. Can
you tell which is which?


Logs of ratios represent relative change.

If a measurement grows from 100 to 125, that's a 25 percent increase, but if
it declines from 125 to 100, that is a loss of 20 percent.
Two increases of 20 percent each are not the same as a 40 percent increase.
If we start at 100, the two 20% increases raise our value first to 120, then to
144. However, an increase of 40% from the original 100 would only yield
140. This is rather like compound interest.
If we use logarithms to measure relative change, we have a pleasant
symmetry property,

    log(A/B) = -log(B/A).

Using natural logs in the example above, ln(125/100) = 0.223 and
ln(100/125) = -0.223, so changing the direction only changes the sign.
We also have a nice additivity property,

    log(B/A) + log(C/B) = log(C/A).

In the example,

    ln(120/100) = ln(144/120) = 0.1823

and

    ln(144/100) = 2(0.1823) = 0.3646.
Finally, for natural logarithms (but not other logarithms), if A < B we have

    (B - A)/B  ≤  loge(B/A)  ≤  (B - A)/A,

which means that the natural log of a ratio is always sandwiched in
between the two common ways of expressing relative change. For example,
ln(125/100) = 0.223, which is a compromise between a 25% increase (0.25)
and a 20% decrease (0.20). As the size of the change gets small, the two
common measures converge on the log-ratio.
We can summarize, in somewhat mathematical language, by saying that
the natural log of a ratio is the unique measure of relative change that is
symmetric, additive, and normed.
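The symmetry, additivity, and sandwich properties can all be verified numerically. A small Python sketch using the 100, 120, 144 and 100, 125 examples from above:

```python
# Sketch: numerically checking the three properties of log-ratios
# described above, using the examples from the text.
import math

A, B, C = 100.0, 120.0, 144.0

sym = math.log(125 / 100) + math.log(100 / 125)            # symmetry: 0
add = math.log(B / A) + math.log(C / B) - math.log(C / A)  # additivity: 0

# the sandwich: (B-A)/B <= ln(B/A) <= (B-A)/A whenever A < B
lo = (125 - 100) / 125      # 0.20, the "percent decrease" view
hi = (125 - 100) / 100      # 0.25, the "percent increase" view
mid = math.log(125 / 100)   # about 0.223, in between the two

print(sym, add, (lo, mid, hi))
```

Both `sym` and `add` come out to zero (up to rounding), and the log-ratio 0.223 falls strictly between 0.20 and 0.25, as the sandwich inequality promises.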
In figure 2.5, logarithms were used to measure relative difference in gene
expression, as measured by RNA-seq. The plot shows, for each gene, the

61

2.4. LOGARITHMS

log2(N7/N4)

10

10

15

20

N7 + N4

Figure 2.5:
base 2 logarithm of the ratio of transcript levels in two different samples.
Base 2 was used because it made it easy to interpret the vertical axis as
fold-change, as opposed to percent change.
Note that the log of a ratio is the same as the difference of logs. The
horizontal axis is the sum of the log values for the two samples. Note that
if we convert a scatter plot of Y versus X into a plot of Y - X versus
Y + X, we rotate the scatter through a 45-degree angle. Doing this with
logarithmic values provides a plot of relative difference versus a logarithmic
measure of overall expression level.
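The difference-versus-sum coordinates can be sketched numerically. The counts below are invented for illustration; N7 and N4 simply name the two hypothetical samples, following the axis labels of Figure 2.5:

```python
# Sketch: turning two samples' expression values into the
# difference-versus-sum (rotated) coordinates described above.
# The counts are made up for illustration.
import math

n7 = [16.0, 200.0, 1024.0]   # expression of three genes in sample N7
n4 = [32.0, 100.0, 1024.0]   # the same genes in sample N4

# vertical axis: log ratio = difference of logs = fold change
log_ratio = [math.log2(a / b) for a, b in zip(n7, n4)]
# horizontal axis: sum of logs = overall expression level
log_sum = [math.log2(a) + math.log2(b) for a, b in zip(n7, n4)]

print(log_ratio)   # -1 means 2-fold lower in N7, +1 means 2-fold higher
print(log_sum)
```

Plotting `log_ratio` against `log_sum` reproduces the 45-degree rotation: equal expression falls on the horizontal zero line, and distance along the horizontal axis reflects overall abundance.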

2.5 Homework Exercises Problem Set 1

Exercise 2.5.1 The 10 third-grade students at Lake Wobegone elementary


school took a spelling test with 10 words, each worth one point. Given that
the average score was 9 exactly, what is the maximum number of children
who could have scored above average?
Exercise 2.5.2 [4] Ten people in a room have an average height of 5 feet 6
inches. An 11th person, who is 6 feet 5 inches tall, enters the room. Find
the average height of all 11 people.
Exercise 2.5.3 Twenty-one people in a room have an average height of 5
feet 6 inches. A 22nd person, who is 6 feet 5 inches tall, enters the room.
Find the average height of all 22 people. Compare to the previous exercise.
Exercise 2.5.4 Twenty-one people in a room have an average height of 5
feet 6 inches. A 22nd person enters the room. How tall would he have to
be to raise the average height by 1 inch?
Exercise 2.5.5 (a) Find the average and the r.m.s. size of the numbers on
the list
1, 3, 5, 6, 3. (b) Do the same for the list 11, 8, 9, 3, 15.
Exercise 2.5.6 Guess whether the r.m.s. size of each of the following lists
of numbers is around 1, 10, or 20. No arithmetic is necessary.
(a) 1, 5, 7, 8, 10, 9, 6, 5, 12, 17
(b) 22, 18, 33, 7, 31, 12, 1, 24, 6, 16.
(c) 1, 2, 0, 0, 1, 0, 0, 3, 0, 1
Exercise 2.5.7 (a) Find the r.m.s. size of the list 7, 7, 7, 7.
(b) Repeat, for the list 7, 7, 7, 7.
Exercise 2.5.8 Each of the numbers 103, 96, 101, 104 is almost 100, but is
off by some amount. Find the r.m.s. size of the amounts off.
[4] From Statistics, by Freedman, Pisani and Purves


Exercise 2.5.9 The list 103, 96, 101, 104 has an average. Find it. Each
number in the list is off of the average by some amount. Find the r.m.s.
size of the amounts off.
Exercise 2.5.10 Each of the following lists has an average of 50. For
which one is the spread of the numbers around the average biggest?
smallest?
(i)
0, 20, 40, 50, 60, 80, 100
(ii) 0, 48, 49, 50, 51, 52, 100
(iii) 0, 1, 2, 50, 98, 99, 100
Exercise 2.5.11 Each of the following lists has an average of 50. For each
one, guess whether the SD is closest to 1, 2 or 10. (This does not require
any calculation).
(a) 49, 51, 49, 51, 49, 51, 49, 51, 49, 51
(b) 48, 52, 48, 52, 48, 52, 48, 52, 48, 52
(c) 48, 51, 49, 52, 47, 52, 46, 51, 53, 51
(d) 54, 49, 46, 49, 51, 53, 50, 50, 49, 49
(e) 60, 36, 31, 50, 48, 50, 54, 56, 62, 53
Exercise 2.5.12 Which of the following lists has the larger range? Which
has the larger SD?
(A) 7, 9, 10, 10, 10, 11, 13
(B) 8, 8, 8, 10, 12, 12, 12
Exercise 2.5.13 (a) A company gives a flat raise of $90 per month to all
employees. How does this change the average monthly salary of the
employees? How does it change the SD?
(b) If the company instead gave employees a 3% raise, how would that
change the average monthly salary? How would it change the SD?
Exercise 2.5.14 What is the r.m.s. size of the list 17, 17, 17, 17, 17? What
is the SD?
Exercise 2.5.15 For the list 107, 98, 93, 101, 104, which is smaller: the
r.m.s. size or the SD?
Exercise 2.5.16 Can the SD ever be negative?


Exercise 2.5.17 For a list of positive numbers, can the SD ever be larger
than the average?
Exercise 2.5.18 (Exam scores) Suppose a class at a university has two
lab sections. One section has 18 students, who score an average of 80 on
their exam. The other section has 12 students who score an average of 75.
What is the average for the whole class?

Exercise 2.5.19 A record of litter sizes (live-born pups) for a breeding


colony of mice is shown below.
Pups per litter:   1   2   3   4   5   6   7   8   9  10  11
Litters:           2  10  14  34  36  32  31  24  14  10   4

a) What is the total number of litters?


b) What is the total number of pups?
c) Calculate the mean litter size.
Exercise 2.5.20 Note that we can represent the mean litter size as

    x̄ = (Σ_x x·w_x) / (Σ_x w_x)

where x ∈ {1, 2, . . . , 11} is the number of pups per litter, and w_x is the
number of litters of size x (i.e. the frequency, or weight, of x). Give a
similar expression for the variance of the litter size (variance is the square
of the standard deviation).
Exercise 2.5.21 (mean) The mean temperature in my office is 68 degrees
Fahrenheit, based on five measurements on five different days. If I had
recorded the temperatures in degrees Celsius, (a) what would their mean be?
(b) Can you be sure without the original data? Why or why not?
(c) The standard deviation is 3 degrees F. Is the temperature control in my
office adequate? Explain briefly.


Exercise 2.5.22 (Pasadena Jan 1 Temperature) The following table
gives the high temperature (degrees Fahrenheit) at the Burbank Airport on
New Year's Day for 10 years.

Year:                  2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
Jan 1 High Temp (F):     55   75   64   68   60   54   57   71   73   72

a) Enter the data into a spreadsheet (e.g. Excel) and add a column with the
Celsius temperature, C = (F - 32)/1.8.
b) Add an extra row with the mean of the Fahrenheit and Celsius
temperatures.
c) Add an extra row with the standard deviation of the Fahrenheit and
Celsius temperatures.
d) How many of the 10 observations fall within one standard deviation of
the mean? How many fall within two standard deviations of the mean?
e) Give a formula for converting the mean Fahrenheit temperature (F̄) to
Celsius (C̄).
f) Give a formula for converting the standard deviation (sF) of the
Fahrenheit temperature to the standard deviation (sC) of the Celsius
temperature.
Exercise 2.5.23 A study of college students found that the men had an
average weight of about 66 kg and an SD of about 9 kg. The women had an
average weight of about 55 kg and an SD of about 9 kg.
1. Find the averages and SDs, in pounds (1 kg = 2.2 lb).
2. Just roughly, what percentage of men weighed between 57 kg and 75
kg?



3. If you took the men and women together, would the SD of their
weights be smaller than 9 kg, just about 9 kg, or bigger than 9 kg?
Why?

Exercise 2.5.24 The 1000 Genomes project provides public access to
human genome data, with the goal of finding most genetic variants that
have frequencies of at least 1% in the populations studied. Variant Call
Format (VCF) is one of the ways in which data on genetic variants is
presented. In VCF, one of the standard items of information on each
variant is a quality score, defined as -10 log10(p), where p is the probability
that the variant call is wrong.
(a) What is this score if the probability that the call is wrong is 0.01?
(b) What if the probability of being wrong is 0.2?
(c) A score of 10 is sometimes required to pass a filter. What error
probability does this correspond to?
Exercise 2.5.25 Obtain the data file aconiazide.csv from the course
website. This gives the change in weight (W) in grams, for rats that were
fed one of five DOSES of a drug, aconiazide. Calculate the median, mean,
and standard deviation for each dose. Write down what tool you used (e.g.
Excel, Prism, R, R Commander, calculator, . . . ).
Exercise 2.5.26 (Difference of squares) Here is a useful algebraic
identity about the difference of two squares:

    x^2 - y^2 = (x - y)(x + y).

a) Establish that it is true by multiplying out the right-hand side.

b) (Optional) Can you draw a simple picture illustrating this identity as a
statement about the areas of squares and rectangles? (Hint: Show a large
square with sides of length x, with a smaller square of area y^2 cut out of its
corner, so that the remaining area is x^2 - y^2. Can you imagine cutting
and moving a piece of the clipped square figure to get a rectangle?)
Exercise 2.5.27 (big numbers) Compute this difference in your head:

    (123456789)^2 - (123456787)(123456791)

Hint: Let x = 123456789, and let y = 2. We want x^2 - (x - 2)(x + 2). If
you apply the algebraic identity from the previous exercise to the right-hand
term, you shouldn't need to pick up a pencil.
1. What is the answer you get by the mental algebra above?
2. What do you get when you plug the expression
(123456789)^2 - (123456791)(123456787) into your favorite calculator
(or Excel)?
3. If these disagree, which answer do you believe, and why?
4. What do you get using R?
Exercise 2.5.28 Do the R computing tutorial from chapter 1, and turn in
the plot (with your name on it).


Chapter 3
Probability


Reading: Probability, from Beyond Numeracy, by J.A. Paulos


Optional Reading: An alternative introduction to the same material is in
chapters 13 and 14 of Freedman, Pisani and Purves.
Reference: Experiments in Plant Hybridization, G. Mendel, 1865,
available at http://www.mendelweb.org/MWarchive.html in a variety of
formats.
Objectives: To calculate probabilities of unions and intersections of
events, distinguish between conditional, joint, and marginal probabilities,
and apply Bayes' rule to reverse the order of conditioning events, with
particular attention to the specificity, sensitivity, and predictive value of
diagnostic tests.

3.1 Example: Mendel's Peas

In 1865 the Bohemian monk Gregor Mendel described a theory of
particulate inheritance, which anticipated later discoveries of chromosomes,
genes, meiosis, and much more. Mendel's paper is good science, and well
worth reading for that reason alone, but we will also use Mendel's
experiments as an introduction to probability.

3.1.1 The choice of characters

The key to Mendel's success lies in his choice of characters to investigate.
Mendel chose to work with the garden pea, Pisum sativum, which had
obvious practical advantages. Here is part of what Mendel had to say
(translated) about his choice:
The value and utility of any experiment are determined by
the fitness of the material to the purpose for which it is used,
. . . . The experimental plants must necessarily:
1. Possess constant differentiating characteristics
2. The hybrids of such plants must, during the flowering
period, be protected from the influence of all foreign
pollen, or be easily capable of such protection.


3. The hybrids and their offspring should suffer no marked
disturbance in their fertility in the successive generations.

Garden peas readily self-pollinate, and they have a peculiar flower
structure, called a keel, which makes them well protected from foreign
pollen. Mendel maintained his lines by self-pollination for two years to be
sure that they were in fact true-breeding for contrasting characters, and his
lines did produce fertile hybrids.
But it was his choice of characters with a simple and striking pattern of
inheritance that was crucial, as these turned out to be governed by single
genes. Had he studied, say, the weight of tomato fruit, which is governed by
multiple genes [1], it is unlikely that the particulate nature of inheritance
would have been apparent. In fact, at the time of the rediscovery of
Mendel's work in 1900, it was not apparent how it applied to the
inheritance of human traits such as adult height. It wasn't until 1918 that
Ronald Fisher explained how the seemingly blended inheritance of a
continuous character, such as height, can arise naturally if the trait is
governed by incremental contributions from multiple genes.

3.1.2 Hybrids and their offspring

Mendel presents data on seven pairs of contrasting characters (Table 3.1).


For each character, he started with a pair of true-breeding lines, i.e. lines of
peas that consistently produced only one form of the character when
self-pollinated. He then crossed pairs of true-breeding parental lines with
contrasting characters to obtain the hybrid generation (denoted F1 in
modern terminology). For each character, only one variant was seen in the
hybrid (F1) generation, and Mendel called this the dominant form. When
the hybrid plants were allowed to self-pollinate, the resulting generation
(F2) again exhibited both the dominant and recessive forms. In Mendel's
words:
In this generation there reappear, together with the
dominant characters, also the recessive ones with their
[1] Paterson et al., Resolution of quantitative traits into Mendelian factors
by using a complete linkage map of restriction fragment length
polymorphisms. Nature, 1988, 335, 721.
peculiarities fully developed, and this occurs in the definitely
expressed average proportion of 3:1, so that among each 4 plants
of this generation 3 display the dominant character and one the
recessive. This relates without exception to all the characters
which were investigated in the experiments. . . . all reappear in
the numerical proportion given, without any essential alteration.
Transitional forms were not observed in any experiment.
   Character          Dominant     n     Recessive     m     ratio  proportion
                                                             n/m    n/(n + m)
A  Seed shape         Round        5474  Wrinkled      1850  2.96   0.747
B  Cotyledon color    Yellow       6022  Green         2001  3.01   0.750
C  Seed coat color    Grey-brown    705  White          224  3.15   0.759
D  Pod shape          Inflated      882  Constricted    299  2.95   0.747
E  Unripe pod color   Green         428  Yellow         152  2.82   0.738
F  Flower position    Axial         651  Terminal       207  3.14   0.759
G  Stem length        Long          787  Short          277  2.84   0.740

Table 3.1: Mendel's F2 generation phenotype counts. n plants or seeds had
the dominant form and m had the recessive form. The ratio is n/m and the
fraction is n/(n + m).

After giving the results for each pair of contrasting characters he went on to
combine the results.
If now the results of the whole of the experiments be brought
together, there is found, as between the number of forms with
the dominant and recessive characters, an average ratio of
2.98:1, or 3:1.

3.1.3 Odds and Probabilities

Mendel presented the counts and ratios shown in Table 3.1. The proportion
column has been added simply to illustrate another common way of
presenting such data. The proportions approximate the probability of the
dominant form, while the ratios approximate the odds of the dominant


form. These are equivalent statements, but the fact that the odds can be
expressed as a ratio of small whole numbers (3:1) makes it easier, perhaps,
to appreciate the underlying pattern.
Note that we can calculate the ratios from the proportions, and vice versa.
If m and n are the numbers of the two phenotypes, we have

    p = n/(n + m)

and

    r = n/m = p/(1 - p).

Using the seed shape data, for example,

    p = 5474/(5474 + 1850) = 0.7474,

so

    r = 5474/1850 = 0.7474/0.2526 = 2.959.

If we were only given that r = 2.959, without the individual counts, we
could still calculate

    p = r/(1 + r) = 2.959/3.959 = 0.7474.

It might be a little easier to appreciate these conversion formulas if we
consider the idealized probability and odds. Then we have two general
conversion formulas:

Probability to Odds:

    r = p/(1 - p) = 0.75/0.25 = 3.

Odds to Probability:

    p = r/(1 + r) = 3/4 = 0.75.
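These conversions are easy to wrap as small functions. A minimal Python sketch (the course's R tools would work the same way), applied to Mendel's seed-shape counts:

```python
# Sketch: the odds <-> probability conversions worked out above,
# applied to Mendel's seed-shape counts (5474 round, 1850 wrinkled).

def odds(p):
    """Convert a probability to odds: r = p / (1 - p)."""
    return p / (1 - p)

def prob(r):
    """Convert odds to a probability: p = r / (1 + r)."""
    return r / (1 + r)

n, m = 5474, 1850      # round (dominant) and wrinkled (recessive) seeds
p = n / (n + m)        # proportion, about 0.7474
r = n / m              # ratio, about 2.959

print(p, r)
print(odds(p), prob(r))   # round-trip: recover r from p, and p from r
```

The round trip confirms the two formulas are inverses: `odds(prob(r))` returns r, and `prob(odds(p))` returns p.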


3.1.4 Subsequent generations

After describing the 3:1 ratio in the first generation produced by


self-fertilization of the hybrids (F2), Mendel went on to note that the plants
with the dominant form were of two different types (homozygous AA, or
heterozygous Aa, in a modern notation), which could be distinguished by
self-fertilization of individual plants in the F2 generation, a process we
would now call progeny testing. He notes:
Those forms which in the first generation exhibit the
recessive character do not further vary in the second generation
as regards this character; they remain constant in their offspring.
It is otherwise with those which possess the dominant
character in the first generation. Of these two-thirds yield
offspring which display the dominant and recessive characters in
the proportion of 3:1, and thereby show exactly the same ratio
as the hybrid forms, while only one-third remains with the
dominant character constant. . . . The ratio 3:1, in accordance
with which the distribution of the dominant and recessive
characters results in the first generation, resolves itself therefore
in all experiments into the ratio of 2:1:1, if the dominant
character be differentiated according to its significance as a
hybrid-character or as a parental one.

3.1.5 An explanation

Mendel hypothesized that, for each of his chosen characters, the organism
inherits something discrete from each parent, and passes an unaltered copy
of one of these on to each offspring. He came very close to modern notation
by using the fractions A/A, A/a, a/A and a/a to denote the characteristic
passed on through pollen and egg. A more modern notation is to omit the
fraction bar and ignore the order (which does not typically matter, as
Mendel established through reciprocal crosses). Another modern expository
device is to lay out all the possible gamete types along the sides of a
Punnett square (Table 3.2).
Recognising that both the AA and Aa genotypes will produce the dominant
round seed form, while only the aa combination will produce wrinkled seeds,

                   maternal allele
                    A     a
paternal     A     AA    Aa
allele       a     Aa    aa

Table 3.2: A Punnett square for a single character.

we can immediately see that three of the four combinations of pollen and
egg lead to round seeds, explaining the 3:1 ratio (at least if both types of
gametes are equally probable). The fact that two pollen-egg combinations
yield Aa while only one yields AA explains the 2:1 ratio of dominant plants
with hybrid as opposed to parental behavior after further selfing.

3.1.6 Multiple Characters

One of the strongest features of Mendel's paper is that he does not settle
for an explanation that fits the data, but he goes on to predict the outcome
of additional experimental crosses to further test his theory. These
experiments involved manual pollination in order to cross hybrids with
parental lines in what we would now call a backcross, but they also involved
the simultaneous investigation of multiple characters.
Let's consider two characters: seed shape and cotyledon color. Suppose we
start with hybrids whose genotype is AaBb, and consider what will happen
after self-pollination. Let's consider a more or less arbitrary question and
ask: what is the probability that an offspring is simultaneously
1. homozygous at the seed shape locus (AA or aa) and
2. heterozygous at the cotyledon color locus (Bb)?
One way of answering the question is to construct the Punnett square
shown in Table 3.3. Each parent can produce 4 gametes with haplotypes
AB, Ab, aB, or ab, and each of these appears to be equally likely, as we
might expect if the genes segregate independently. Because the segregation
of genes leading to the gametes of the parents are independent, the
sixteen combinations of haplotypes are equally likely when pollen and egg
come together. Assuming no linkage to genes that might affect viability,
there is an equal chance for each of the combinations to produce a mature
pea plant.

         AB      Ab      aB      ab
AB               AABb
Ab      AABb
aB                              aaBb
ab                      aaBb

Table 3.3: A Punnett square for two characters. (Only the four cells
matching the event of interest, homozygous for seed shape and heterozygous
for cotyledon color, are filled in.)
There are 16 equally likely outcomes of the cross, so we can simply count
how many outcomes yield the type of plant we are looking for. The event
has probability 4/16 = 1/4.
We can avoid constructing the square if we note that:
- the probability of AA OR aa is 1/4 + 1/4 = 1/2
  (because the events are mutually exclusive);
- the probability of Bb is 1/2;
- and the probability of (AA OR aa) AND Bb is 1/2 × 1/2 = 1/4
  (because events on different chromosomes are independent).
Suppose you have pea plants that are heterozygous for three characters
(AaBbCc) and you allow them to self-pollinate. What is the probability of
round peas with yellow cotyledons and a white seed coat? We could draw
up a Punnett square, but there are eight different kinds of gametes to
consider, so the square has 64 cells. This would be tedious. We can save
ourselves some work if we use some simple rules of probability.
The probability of round peas can be computed as

    Pr(AA genotype OR Aa genotype) = Pr(AA) + Pr(Aa) = 1/4 + 1/2 = 3/4.

Adding the probabilities in this way is justified because we are dealing
with mutually exclusive events. A plant cannot be simultaneously


homozygous and heterozygous for the same character. By the same
argument, the probability of yellow cotyledons (another dominant trait) is

    Pr(BB) + Pr(Bb) = 3/4.

The probability of a white seed coat (recessive) is

    Pr(cc) = 1/4.

Because the three characters are independent, we can compute the joint
probability of simultaneously finding round peas and yellow cotyledons and
a white seed coat as

    Pr(round) × Pr(yellow cotyledons) × Pr(white seed coat)
        = 3/4 × 3/4 × 1/4 = 9/64.
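The 9/64 result can be double-checked by brute force, enumerating the 64-cell Punnett square that the product rule lets us avoid. A minimal Python sketch:

```python
# Sketch: brute-force check of the 9/64 probability computed above,
# by enumerating all 64 pollen-egg combinations in an AaBbCc self-cross.
from itertools import product

gametes = ["".join(g) for g in product("Aa", "Bb", "Cc")]   # 8 haplotypes

hits = 0
for pollen, egg in product(gametes, gametes):               # 64 cells
    # genotype at each locus is the (unordered) pair of inherited alleles
    genotype = {locus: {pollen[i], egg[i]} for i, locus in enumerate("ABC")}
    round_seeds = "A" in genotype["A"]          # dominant A allele present
    yellow_cotyledons = "B" in genotype["B"]    # dominant B allele present
    white_coat = genotype["C"] == {"c"}         # recessive: both alleles c
    if round_seeds and yellow_cotyledons and white_coat:
        hits += 1

print(hits, "of 64 equally likely cells")   # 9 of 64, i.e. 9/64
```

Counting cells and multiplying probabilities agree here precisely because the full cross-product of gametes makes the three loci independent, which is the assumption the product rule rests on.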
Question: If we simultaneously consider all seven characters reported by
Mendel, what would be the dimensions of the Punnett square? How many
cells would there be?

3.2 Probability formalism

As things get more complicated, it is obviously easier to apply a few simple


rules than to try and list all possibilities.
Let's reconsider what we have just done, while introducing a little
vocabulary.
Events are the outcomes that have probabilities. Having round seeds is an
event. It has a probability of 3/4.
Unions and intersections of events are also events. The event that a
plant has round seeds is the union of the event that it inherits the AA
genotype, and the event that it inherits the Aa genotype. The event that
the plant has both round seeds and yellow cotyledons is an intersection of
simpler events.
Elementary events are events at the finest resolution, such as the cells in
a Punnett square.
It is fairly conventional in mathematics to use capital letters from the
beginning of the alphabet to denote events. We have just used the letters
A, B, . . . , G to denote dominant alleles, but we will now use A and B to
denote abstract events. When we want to refer to probabilities of
genotypes, the alleles will come in pairs, such as AA or Aa, so this
shouldn't be too confusing. The alphabet is finite, and we shouldn't stray
too far from common conventions.
Now some facts about probability, and notation.
Probabilities can range from zero to one. So if A is an event, we
always have
0 Pr(A) 1.
The intersection of two events, written Pr(A B), or simply AB, means
that both events occur.
The union A B, means A or B or both occur.

Addition of probabilities. If events A and B are mutually exclusive,
then

Pr(A or B) = Pr(A ∪ B) = Pr(A) + Pr(B).

In general (even if not mutually exclusive) we have

Pr(A or B) = Pr(A) + Pr(B) − Pr(AB),

i.e. if A and B have some overlap (intersection), we need to avoid
double-counting it. Mutually exclusive just means there is no overlap, i.e.
Pr(AB) = 0.
Multiplication of Probabilities. If events A and B are independent,
Pr(AB) = Pr(A) Pr(B).
This is just what we mean by independence. If the events are not
independent, then we need the concept of conditional probability.
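Both rules are easy to check by brute-force enumeration. Here is a small Python sketch (not part of the original notes), using the pair of dice that reappear in Problem Set 2:

```python
from fractions import Fraction
from itertools import product

# A brute-force check of the addition and multiplication rules (a sketch,
# not part of the original notes): enumerate all 36 equally likely outcomes
# of tossing a red die and a green die.
outcomes = list(product(range(1, 7), range(1, 7)))

def pr(event):
    """Probability of an event: favorable outcomes over all 36."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def is_A(o): return o[0] % 2 == 0            # A: the red die is even
def is_B(o): return o[0] + o[1] == 7         # B: the dice sum to seven
def is_AB(o): return is_A(o) and is_B(o)     # intersection of A and B
def is_A_or_B(o): return is_A(o) or is_B(o)  # union of A and B

# General addition rule: A and B overlap, so the intersection is subtracted.
print(pr(is_A_or_B) == pr(is_A) + pr(is_B) - pr(is_AB))  # True
# These particular events happen to be independent, so multiplication holds.
print(pr(is_AB) == pr(is_A) * pr(is_B))  # True
```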

3.2.1 Conditional Probability

The essential idea of conditional probability was used quite naturally by
Mendel in one of the quotations above:
It is otherwise with those which possess the dominant character
in the first generation. Of these two-thirds yield offspring which
display the dominant and recessive characters in the proportion
of 3:1, and thereby show exactly the same ratio as the hybrid
forms, while only one-third remains with the dominant
character constant.

Plants with the dominant character constant are those with the AA
genotype. These make up 1/4 of the entire F2 generation, but they make
up 1/3 of the plants exhibiting the dominant trait. He has simply narrowed
his focus (to 3 of the 4 genotypes in Table 3.2), and reduced the
denominator accordingly.
If a plant in Mendel's F2 generation (the result of self-pollinating hybrids)
has round seeds, it might have either the AA or Aa genotype. If it has the
Aa genotype, its offspring (after selfing) will have round seeds with
probability 3/4. If it has the AA genotype, its offspring will have round
seeds with probability 1. We can express these facts as a pair of
conditional probabilities

Pr(R|AA) = 1;    Pr(R|Aa) = 3/4,

where Pr(R|AA) means the probability of round seeds in the offspring,
given that the parental genotype is AA (homozygous dominant).
If we have some round seeds from the F2 generation, we won't (easily)
know which seeds have the AA genotype, and which have the Aa genotype.
If we had a large number of seeds, we could predict the fraction that would
have round seeds in the next generation as

Pr(R) = Pr(R|AA)Pr(AA) + Pr(R|Aa)Pr(Aa)
      = 1 × (1/3) + (3/4) × (2/3) = 5/6.
We can think of this as a branching path. We multiply the probabilities as
we move along the branching path, and add up the results for the paths
leading to R (round seeds in the next generation).


Figure 3.1: A branching path showing conditional probabilities for offspring
of the F2 generation by selfing. [Diagram: round-seeded F2 plants branch
into AA with probability 1/3, whose offspring are round with probability 1;
and Aa with probability 2/3, whose offspring are round with probability 3/4
or wrinkled (aa) with probability 1/4.]
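The arithmetic behind this branching path can be verified exactly. A minimal Python sketch (not part of the original notes):

```python
from fractions import Fraction

# Checking the branching-path calculation of Pr(R) (a sketch): condition on
# the genotype of a round-seeded F2 plant, multiply along each branch, and
# add the branches that lead to round seeds in the next generation.
pr_AA, pr_Aa = Fraction(1, 3), Fraction(2, 3)           # genotype, given round
pr_R_given_AA, pr_R_given_Aa = Fraction(1), Fraction(3, 4)

pr_R = pr_R_given_AA * pr_AA + pr_R_given_Aa * pr_Aa
print(pr_R)  # 5/6
```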


Progeny testing
Mendel noted that, of the plants resulting from self-pollination of hybrids
(F2 generation), those exhibiting the dominant form were of two types: the
true-breeding type (AA genotype) and the hybrid type (Aa genotype).
These could be distinguished by testing their progeny. For the two traits
that could be ascertained by examining seeds directly, Mendel could easily
classify the plants of the F2 generation into the AA or Aa genotype by
examining the many seeds that result from self-fertilization. For each of the
five traits that required growing the seeds, he proceeded as follows:
For each separate trial in the following experiments 100
plants were selected which displayed the dominant character in
the first generation, and in order to ascertain the significance of
this, ten seeds of each were cultivated.
Question: Consider the pod shape character. If 10 seeds from a plant of the
hybrid form (Dd) are cultivated (after self-fertilization), what is the
probability that none of the 10 offspring exhibit the recessive trait, making
the parent plant appear to be of the true breeding (DD) form?
Answer: Each of the 10 seeds will have the dominant phenotype with a
probability of 3/4. The probability that all 10 have the dominant
phenotype is (3/4)^10 ≈ 0.056.

Question: Among the F2-generation plants with inflated pods (either DD
or Dd genotype), the probability of the parental DD genotype is
1/3 ≈ 0.333. What is the probability that an F2 plant with inflated pods
appears to be of the DD genotype by failing to exhibit any recessive
phenotypes in 10 progeny?
Answer:

Pr(DD)Pr(10 dominant|DD) + Pr(Dd)Pr(10 dominant|Dd)
  = (1/3) × 1 + (2/3) × 0.056 ≈ 0.371.
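These progeny-testing calculations are easy to check numerically. A Python sketch (not part of the original notes):

```python
# Numerical check of the progeny-testing calculations above (a sketch, not
# part of the original notes).
p_all_dominant = 0.75 ** 10   # 10 offspring of a Dd plant all show dominance
p_appears_DD = (1/3) * 1 + (2/3) * p_all_dominant

print(round(p_all_dominant, 3))  # 0.056
print(round(p_appears_DD, 3))    # 0.371
```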
Note that of the 500 plants subjected to progeny testing in this way, 166
appeared to be of the parental genotype (homozygous for the dominant
allele). The rate is 166/500 = 0.332, which is rather closer to the
theoretical probability of the parental genotype than to the probability of
appearing so. Is this just a coincidence, or did Mendel tweak the data a
bit? When we study hypothesis testing, we will see how to use the binomial
distribution to calculate the chance of being accidentally this far from the
predicted rate of 0.371.
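Anticipating that chapter, the binomial tail probability can be sketched in a few lines of Python (not part of the original notes; the exact test is developed properly later):

```python
from math import comb

# A preview of the binomial calculation promised above (a sketch): if each
# of the 500 progeny-tested plants independently appears true-breeding with
# probability about 0.371, how likely is a count as low as Mendel's 166?
p = 1/3 + (2/3) * 0.75**10   # probability of appearing true-breeding
tail = sum(comb(500, k) * p**k * (1 - p)**(500 - k) for k in range(167))
print(tail)  # a small lower-tail probability, on the order of a few percent
```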

3.2.2 Marginal, Joint and Conditional Probabilities

In general, if we have events A and B, we talk about

Joint probability: Pr(AB)
Conditional probability: Pr(A|B)
Marginal probability: Pr(A).
Given two events, A and B, with Pr(A) > 0, the equation

Pr(B|A) = Pr(AB) / Pr(A)    (3.1)

is often taken as the definition of conditional probability. It expresses the
idea of restricting attention, and adjusting the denominator accordingly.
Sometimes this adjustment goes by the fancy name of renormalization.
If we multiply equation 3.1 by Pr(A) we get

Pr(AB) = Pr(A)Pr(B|A).    (3.2)

Sometimes it makes sense to think of the conditional probability as the
more fundamental concept, and consider equation 3.2 as the definition of
joint probability. This is more in line with Figure 3.1, especially if we think
of a chain of cause-and-effect events. Regardless of which equation one
regards as more fundamental, equations 3.1 and 3.2 give the numerical
relationship between joint, marginal, and conditional probabilities.

3.2.3 Independence

Events A and B are independent if

Pr(A|B) = Pr(A)


or if
Pr(B|A) = Pr(B)
i.e. if one event has no impact on the probability of the other. If we
substitute this into our previous equation,
Pr(AB) = Pr(A)Pr(B|A),
we get
Pr(AB) = Pr(A)Pr(B),
the multiplication rule for independent events. Conversely, if we know the
multiplication rule holds, we can infer that the events are independent.
Example 3.2.1 (Elston and Johnson) 31 percent of the Caucasian
population of America have the HLA A1 antigen. 21% have the HLA B8
antigen. If these were independent traits, what fraction would you expect to
have both A1 and B8? The actual double positive rate for A1 and B8 is
17%. Do these traits appear to be independent?
If the traits were independent, the probability of a randomly chosen person
having both would be 0.31 × 0.21 = 0.0651. Since the actual double
positive rate is almost three times that, the traits do not appear to be
independent. The HLA A and B loci are very close together on
chromosome 6, and are usually inherited as a unit. This phenomenon is
called linkage disequilibrium.
Independence is a strong assumption.
When nuclei of uranium 235 are well separated, they decay independently,
with a very long half-life. When brought into close proximity, so that one
event affects the probability of another, the result is a nuclear chain
reaction. The departure from independence can be spectacular.
In 2005, Roy Meadow, a prominent British physician and co-founder of
London's Royal College of Paediatrics and Child Health, was found by the
General Medical Council to be guilty of serious medical misconduct for
giving expert testimony that wrongly assumed the independence of events.
He had given expert testimony about the remote chance of two incidents of
sudden infant death syndrome in the same family, which was crucial in
what was later deemed a wrongful conviction of a mother in the deaths of
her two children.²

3.2.4 Complementary Events

If A is an event, the opposite, or complementary, event is denoted A^c, and

Pr(A) + Pr(A^c) = 1.
Either the event or its complement (opposite) must occur. If we are
interested in the probability of an event, A, it is sometimes easier to
calculate the probability of its complement, A^c. Then it is trivial to
calculate

Pr(A) = 1 − Pr(A^c).

An event and its complement form the simplest example of a partition of all
possible outcomes.
A partition is a set of events, A_1, A_2, . . . , A_n, that are:

mutually exclusive, i.e. Pr(A_i A_j) = 0 whenever i ≠ j; and

exhaustive, i.e. the probabilities add to one, Pr(A_1) + . . . + Pr(A_n) = 1.

These are just mathematical ways of saying that exactly one member of the
partition has to happen.
The law of total probability says that we can break an event up into
pieces by considering its intersection with a partition, then add up the
pieces to get the probability of the original event. Events B and B^c form a
partition (because Pr(B ∩ B^c) = 0 and Pr(B ∪ B^c) = 1) so we have

Pr(A) = Pr(A ∩ B) + Pr(A ∩ B^c).

Using the relationship between joint and conditional probabilities, we can
also compute marginal probabilities as

Pr(A) = Pr(A|B)Pr(B) + Pr(A|B^c)Pr(B^c).
This equation corresponds to the branching process idea (e.g. Figure 3.1).
² Science 309 (2005) p543


Example 3.2.2 Suppose we have DNA from humans, some with a
diagnosis of fibromyalgia syndrome, and some from healthy individuals.
Suppose also that we test one genetic marker on each of the 22 autosomes
for association with the disease. In hypothesis-driven research (as opposed
to genome searches), it is common to use statistical tests that would produce
a false impression of an association (an error) with a probability of 0.05,
i.e. 1 chance in 20. If none of these 22 independent markers was in truth
associated with disease, and we applied such a test, with a .05 error
probability, to each marker, what is the probability of at least one erroneous
result?
Let E be the event that we make at least one error. It is easier to calculate
Pr(E^c), the probability of no error in any test. That is because this is an
intersection, and the events are independent, so we get to multiply their
probabilities. The probability of being error-free for one marker is
1 − .05 = .95. The probability of being error-free for the first and the
second marker is .95 × .95 = .9025. The probability of being error-free for
all 22 tests is .95^22 ≈ .3235. The probability of making at least one error is
then 1 − .3235 = .6765, which answers our original question. Obviously, an
error rate of 0.05 on each test is too high if we are going to do 22
independent tests.
Question: What is (obviously) wrong with adding up the error probabilities
from each test?
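The arithmetic in Example 3.2.2 can be checked in a couple of lines of Python (a sketch, not part of the original notes):

```python
# Checking Example 3.2.2 numerically (a sketch): the chance of at least one
# false positive among 22 independent tests, each with error rate 0.05.
p_error_free = 0.95 ** 22
p_any_error = 1 - p_error_free

print(round(p_error_free, 4))  # 0.3235
print(round(p_any_error, 4))   # 0.6765
```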

3.3 More Examples

Example 3.3.1 Hemophilia is an X-linked trait, so a male inheriting the
disease allele on his only X chromosome will have the disease. A woman
inheriting the allele on one of her two X chromosomes will not have the
disease, but will carry the allele, transmitting the disease to half of her sons
(on average), and none of her daughters. Suppose a female carrier of
hemophilia is married to an unaffected male.
1. What is the probability that the first child is affected?
2. Given that an ultrasound has determined that the first child is a son,
what is the probability that he is affected (i.e. has the disease)?

86

CHAPTER 3. PROBABILITY

Answers:
1. Because having a son or daughter are mutually exclusive (barring
twins),

Pr(affected child) = Pr(affected son) + Pr(affected daughter)
                   = Pr(son)Pr(affected|son) + Pr(daughter)Pr(affected|daughter)
                   = 1/2 × 1/2 + 1/2 × 0
                   = 1/4.    (3.3)
2. Pr(affected|son) = 1/2 is given, but we could also calculate it by
renormalizing the answer to the previous question, i.e.

Pr(affected son) / Pr(son) = (1/4) / (1/2) = 1/2.
Example 3.3.2 Suppose we put 10 tickets into a box. The tickets are
numbered 1 through 10 but otherwise identical. We then draw two out at
random (e.g. blindfolded, after thorough mixing), without replacement.
What is the probability that we draw a 6 on the first draw, and a 3 on the
second?
Let's define some notation. Let X1 denote the number drawn on the first
trial, and let X2 denote the number drawn on the second drawing without
replacement. Let A denote the event that X1 = 6 and let B denote the event
that X2 = 3. We want to know Pr(AB), so we can use the equation
Pr(AB) = Pr(A)Pr(B|A). There are 10 equally likely tickets on the first
draw, so Pr(A) = Pr(X1 = 6) = 1/10. There are 9 equally likely tickets on
the second draw, so Pr(B|A) = Pr(X2 = 3|X1 = 6) = 1/9. Finally,

Pr(AB) = 1/10 × 1/9 = 1/90.
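A short simulation confirms the answer. This Python sketch (not part of the original notes) draws the two tickets many times:

```python
import random

# Simulating Example 3.3.2 (a sketch): draw two of ten numbered tickets
# without replacement, many times, and estimate Pr(X1 = 6 and X2 = 3).
rng = random.Random(0)  # seeded so the run is reproducible
n_trials = 200_000
hits = 0
for _ in range(n_trials):
    first, second = rng.sample(range(1, 11), 2)  # draw without replacement
    if first == 6 and second == 3:
        hits += 1

print(hits / n_trials)  # close to 1/90, about 0.0111
```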
You might be able to jump to the right conclusions in these examples, but
the idea is to see that the various ways of working with probabilities lead to
the intuitively correct answer in simple cases. The reason for formality is to
prepare ourselves for problems that aren't so simple. Probability seems to
call for careful analysis because intuition about probability seems
particularly prone to failure.


The Monty Hall Problem:

This notorious probability puzzle is named for the host of an old game
show called Let's Make a Deal. Monty would often show a contestant 3
doors. Behind one of them is a valuable prize, but behind the other two
doors are worthless gag gifts. The host would first let the contestant pick a
door. Then the host would open one of the remaining doors to reveal one of
the gag gifts (he can always do so). He then offered the contestant a choice
between taking whatever was behind the door already chosen, or switching,
and taking the prize behind the remaining door. Does switching improve
the contestant's chance of getting a valuable prize?
Let S denote the event that the contestant switches, and let S^c be the
complementary event that the contestant does not switch. Let W denote
the event that the contestant wins the valuable prize, so W^c represents
getting stuck with a gag gift. We want to compare Pr(W|S), the
probability of winning if we switch doors, to Pr(W|S^c), the probability of
winning if we do not switch doors.
The thing to notice is simply that Pr(W|S) + Pr(W|S^c) = 1. In the end
there are only two doors to choose from, and the contestant wins one way,
but not the other. It is clear (from symmetry) that Pr(W|S^c) = 1/3, the
chance of guessing the correct one of three doors at the outset, so it must
be that Pr(W|S) = 2/3. Switching doubles the probability of winning.

Despite the fact that the solution is trivial when cast as a probability
model, the result is not obvious to many people. Marilyn vos Savant writes
a question-and-answer column for Parade magazine, which is a Sunday
supplement to many newspapers. She once published this puzzle, and
reports that it stimulated an order of magnitude more mail than any other
question, with most of the mail coming from people who asserted that the
wrong answer was correct, even after seeing her argument. It is interesting
that her second-place question, in terms of mail volume, also concerned
probability.
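For the skeptical reader, the argument can also be checked by simulation. A Python sketch (not part of the original notes):

```python
import random

# A Monte Carlo check of the Monty Hall argument (a sketch, not part of
# the original notes): switching should win about 2/3 of the time.
def play(switch, rng):
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize

rng = random.Random(1)
n = 100_000
win_switch = sum(play(True, rng) for _ in range(n)) / n
win_stay = sum(play(False, rng) for _ in range(n)) / n
print(win_switch, win_stay)  # roughly 0.667 and 0.333
```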
Three-card Monte
If we go from Monty Hall's television studio, to a street corner where a
three-card monte dealer is playing his game, we find a similar-looking
problem that is actually very different. The three-card monte guy puts
three cards face down on a table (or box). He shows you which one is the
ace of spades. You bet, and he shuffles the cards around. You point to a
card, and if it is the ace of spades, you win. Now suppose that you point to
a card, and the dealer turns over a different card, which is not the ace of
spades, and offers you the chance of switching your choice to the one
remaining card. Should you switch?
The trouble is that the three-card monte dealer might only offer you a
chance to switch if you have pointed to the ace. If your first guess is wrong,
he just takes your money. Monty Hall always opens one of the doors after
the initial choice. He is in the business of selling advertising, not making
money off of gamblers. This means that he is providing useful information
about which door to switch to. With three-card monte, being offered a
choice probably means that your first guess was right.
Given a well-formulated problem, like the Monty Hall problem, the rules of
probability can provide a clear and easy answer to a question that seems to
mystify most people. Getting the problem formulated correctly is another
matter.

3.4 Bayes' Rule and Prediction

In dealing with conditional probabilities, it is important not to confuse
Pr(A|B) with Pr(B|A). As Paulos points out in the reading assignment, the
probability of speaking Spanish given that one is a citizen of Spain is quite
high, but the probability of being a citizen of Spain, given that one speaks
Spanish, is quite low. Often we know one type of conditional probability,
but want to know the other. That is where Bayes' theorem comes in. But
let's introduce the idea with a story.

3.4.1 A Diagnostic Test

Nate gets a tattoo.³ He is refused as a blood donor. He gets a commercial
test for Hepatitis B. The website claims the sensitivity is 0.99 and
specificity is 0.995. Sensitivity is the probability that a person with
Hepatitis B will test positive. Specificity is the probability that a person
without Hepatitis B will test negative.

³ Reference needed: I lifted the Nate story from another teacher's notes.
The test literature advises:
If you test negative, you may conclude that you are not infected.
If you test positive, do not conclude you are infected, but see a doctor
for further testing and take precautions against infecting others.
This strikes Nate as odd: a negative test is convincing, but a positive test
is not, despite high specificity and high sensitivity.
The reason is that the prevalence is low, about 2 cases per 100,000 non-IV
drug users, and not greatly increased by a tattoo, say 3 per 100,000. High
specificity means that false positives are uncommon, but low prevalence
means that true positives may be even more uncommon, implying that
most positives are false positives.

Bayes' rule compares true positives to all positives, to obtain the predictive
value of a positive finding (PVP). In a population of 10 million with a
prevalence of 3 per 100,000, we would expect about 297 true positives
(300 infected × 0.99 sensitivity) and about 49,998 false positives
(9,999,700 uninfected × 0.005 false-positive rate), so

PVP ≈ 297 / (297 + 49998) ≈ 0.0059.

This says that fewer than one percent of people with a positive test will
turn out to actually be infected with hepatitis B.
More generally, if P is the prevalence of a condition, the predictive value is

PVP = (P × Sens.) / (P × Sens. + (1 − P) × (1 − Spec.)).    (3.4)


The predictive value of a diagnostic test depends not only on the properties
of the test, but also on the prevalence of the condition in the setting in
which the test is used. The context matters.
If we only used the test on people who are free of the Hepatitis B virus, all
of the positives would be false positives. In fact, that is how one estimates
the false-positive rate (and its complement, specificity).
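The counts behind the PVP calculation can be reproduced numerically. A Python sketch (not part of the original notes; the 10-million-person population is just a convenient scale for the expected counts):

```python
# Where the 297 and 49,998 in the PVP calculation come from (a sketch):
# imagine testing 10 million people with a prevalence of 3 per 100,000.
n = 10_000_000
prevalence, sens, spec = 3e-5, 0.99, 0.995

infected = n * prevalence                # 300 infected people
true_pos = infected * sens               # about 297 true positives
false_pos = (n - infected) * (1 - spec)  # about 49,998 false positives
pvp = true_pos / (true_pos + false_pos)

print(round(pvp, 4))  # 0.0059
```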

3.4.2 Bayes' Rule

Let's go through the probability algebra behind this phenomenon.
Recall our definition of conditional probability

Pr(B|A) = Pr(AB) / Pr(A).    (3.5)
We can interchange the roles of A and B to write

Pr(A|B) = Pr(AB) / Pr(B)

and rearrange (multiply both sides by Pr(B)) to get the multiplication rule
for probabilities,

Pr(AB) = Pr(A|B)Pr(B).    (3.6)
We can substitute equation 3.6 into equation 3.5 to get a simple version of
Bayes' rule:

Pr(B|A) = Pr(A|B)Pr(B) / Pr(A).    (3.7)

The useful thing about Bayes' rule is that it allows us to reverse the
conditioning, going from Pr(A|B) to Pr(B|A), with the help of the
marginal probabilities.
If event A is a positive test result, and event B is the presence of virus, we
would like to know Pr(B|A), as that tells us what to make of the test
result. We can learn about Pr(A|B) from experiments with the test, or
from the literature describing such results. Pr(B) is the prevalence of the
condition, which we need to know, guess, or suppose a range of values for.
Pr(A) is the marginal (overall) rate of positive tests, which depends on
both the prevalence and the sensitivity and specificity of the test. Let's see
how to decompose Pr(A) into more basic ingredients that we know.
We can elaborate on Bayes' rule by using the law of total probability. This
simply says that an event can be broken into two parts, namely the part
that intersects a second event, and the part that doesn't. The probability
of the whole is then the sum of the probabilities of the parts. If B is the
second event,

Pr(A) = Pr(A ∩ B) + Pr(A ∩ B^c)
      = Pr(A|B)Pr(B) + Pr(A|B^c)Pr(B^c).

Then Bayes' rule can be written:

Pr(B|A) = Pr(A|B)Pr(B) / [Pr(A|B)Pr(B) + Pr(A|B^c)Pr(B^c)]    (3.8)

Visual Explanation: starting from total probability 1, branch first on B,
then on A, multiplying along each path:

  Pr(B):    Pr(A|B)Pr(B)        (True Pos)
            Pr(A^c|B)Pr(B)      (False Neg)
  Pr(B^c):  Pr(A|B^c)Pr(B^c)    (False Pos)
            Pr(A^c|B^c)Pr(B^c)  (True Neg)
We can state it with a little more generality. If B_1, . . . , B_k form a partition,
i.e. a mutually exclusive but exhaustive set, then

Pr(B_j|A) = Pr(A|B_j)Pr(B_j) / Σ_i Pr(A|B_i)Pr(B_i).

For the hepatitis B example, we can get away with the simpler version,
noting that Pr(A|B) is the sensitivity, and Pr(A|B^c) is the false-positive
rate, i.e. 1 − specificity. If we interpret the probabilities in equation 3.8 as
prevalence, sensitivity, and specificity, we get equation 3.4 above.


3.4.3 Example: The ELISA test for HIV

In the mid 1980s testing for HIV was developed using an ELISA test as an
initial screening. If the test was positive, it was followed by a second test,
using different technology, to confirm the positive result. Here we will
consider only the ELISA screening test. Weiss et al. published the results
from applying the ELISA test to 88 people known to be infected with the
HIV virus, as well as 297 healthy volunteers. The data are shown below.
Weiss et al. (1985 JAMA 253:221-225):

                ELISA +   ELISA −   Total
  With HIV         86         2        88
  Without HIV      22       275       297

Some definitions:
Sensitivity = Probability a patient infected with HIV is correctly
diagnosed (ELISA +).

Estimated Sensitivity = 86/88 = 97.7%

Specificity = Probability a patient without HIV infection is correctly
diagnosed (ELISA −).

Estimated Specificity = 275/297 = 92.6%

Predictive Value of a Positive test (PVP) = Probability that a
patient who tests positive is infected with HIV.
Note that we cannot estimate the predictive value directly from the table,
because the fractions with and without HIV were fixed by design and do
not reflect the prevalence of HIV infection for any population. We can,
however, make some statements about predictive value using Bayes' rule
and some assumptions about prevalence.
Let B be the event that a patient has HIV. Let A be the event that ELISA
is positive. Then the PVP is

Pr(B|A) = Pr(A|B)Pr(B) / [Pr(A|B)Pr(B) + Pr(A|B^c)Pr(B^c)]
        = Pr(Pos|HIV)Pr(HIV) / [Pr(Pos|HIV)Pr(HIV) + Pr(Pos|OK)Pr(OK)]
        = True Positives / (True Positives + False Positives).

The conditional probabilities can be found from sensitivity and specificity.
Pr(A|B) is the sensitivity, and 1 − Pr(A|B^c) is the specificity. But the
predictive value still depends on Pr(B), the prevalence of HIV in the
population where the test is being used.

If the test is used in a population with prevalence P, then

PVP = (P × Sens.) / (P × Sens. + (1 − P) × (1 − Spec.)).

Substituting the estimates of sensitivity and specificity from the table, we
can calculate the predictive value of a positive test for some hypothetical
values of the prevalence.

P = 50% implies PVP = 93%
P = 1% implies PVP = 12%.

If the prevalence is high, the test looks pretty good, but if prevalence is low,
most positives are false positives.
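These predictive values are easy to reproduce from equation 3.4. A Python sketch (not part of the original notes):

```python
# Checking the ELISA predictive values with equation 3.4 (a sketch).
def pvp(prevalence, sensitivity, specificity):
    """Predictive value of a positive test via Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

sens = 86 / 88    # estimated sensitivity from Weiss et al.
spec = 275 / 297  # estimated specificity from Weiss et al.

print(round(pvp(0.50, sens, spec), 2))  # 0.93
print(round(pvp(0.01, sens, spec), 2))  # 0.12
```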

3.4.4 Positive by Degrees

Sometimes the test result is not a simple yes or no outcome. Here are some
data where a radiologist expresses an opinion on an ordinal scale.


If we pick a level that we will regard as a positive test, we can calculate
sensitivity and specificity. If we do this for each level in turn, we can plot
the estimated sensitivity against the estimated specificity for each choice of
a cut-off value. This gives us an idea of the trade-off between sensitivity
and specificity as we change the level that we regard as positive. This kind
of plot is known as an ROC curve. ROC is an acronym for Receiver
Operating Characteristics. The terminology comes from electrical
engineering.

Figure 3.2: ROC curve for the radiology data
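The construction behind such a curve is mechanical. A Python sketch (not part of the original notes; the rating counts below are hypothetical, since the radiologist's data table is not reproduced here):

```python
# Computing ROC points from ordinal ratings (a sketch). Ratings run from
# 1 (definitely negative) to 5 (definitely positive); the counts are
# hypothetical, for illustration only.
diseased = [3, 5, 10, 20, 62]  # hypothetical counts of diseased, by rating
healthy = [60, 20, 10, 7, 3]   # hypothetical counts of healthy, by rating

def roc_points(diseased, healthy):
    """Sensitivity and specificity for each cutoff choice of 'positive'."""
    n_dis, n_hea = sum(diseased), sum(healthy)
    points = []
    for c in range(len(diseased) + 1):
        sens = sum(diseased[c:]) / n_dis  # true positives / all diseased
        spec = sum(healthy[:c]) / n_hea   # true negatives / all healthy
        points.append((sens, spec))
    return points

for sens, spec in roc_points(diseased, healthy):
    print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Calling everything positive gives sensitivity 1 and specificity 0; calling nothing positive gives the reverse; the interesting cutoffs lie in between.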

Exercise 3.4.1 If there were a cut-off that gave perfect sensitivity and
specificity, where would its point be on the chart? If we were to ignore the
data and toss a coin, calling heads a positive test, where would that test plot
on the chart? If we tossed a pair of coins and regarded two heads as positive,
where would that test plot on the chart? What would you make of a point
that fell below the diagonal line running from the lower-left to upper-right?

3.4.5 Perspective

Sensitivity and specificity are often referred to as the operating
characteristics of a diagnostic test. The language we use suggests that
sensitivity and specificity are properties of the test. If the test involves a
quantitative signal, its operating characteristics may be better for people
with strong signals, far from the cut-off, and worse for the borderline cases.
If we apply the test in different settings, both the operating characteristics
and the prevalence may change. Prevalence is usually the bigger worry, but
the accuracy and scope of the estimated operating characteristics is worth
some critical thought.

3.5 Problem Set 2 (part 1 of 2)

Exercise 3.5.1 Prior to Mendel, investigators had reported that hybrids
are inclined to revert to the parental forms. Mendel gives a theoretical
explanation, noting that we should expect the ratio of hybrid types to fixed
parental types to be 2:2 after one generation of selfing, 2:6 after two
generations of selfing, and 2:14 after three generations of selfing. These are
the odds of the hybrid type. What is the probability of the hybrid type after
1, 2, and 3 generations of selfing?

Exercise 3.5.2 If all of Mendel's characters segregate independently, what
is the probability, in the F2 generation, of a plant with short stems,
terminal flowers, and inflated seed pods?

Exercise 3.5.3 Consider a plant of the F1 generation, heterozygous at all
7 genetic loci for Mendel's characters. After self-fertilization, what is the
probability that an offspring is of the recessive phenotype for all 7
characters?

What is the probability that it is of the dominant phenotype for all 7
characters?
Exercise 3.5.4 In Mendel's data set, there are many more observations of
seed shape and cotyledon color than of the other characters.
(a) Out of all seven characters, which gives a dominant:recessive ratio
closest to the hypothesized 3:1?
(b) Which is the next closest?


Exercise 3.5.5 Consider the toss of a pair of standard dice, each with six
sides with one to six spots. Suppose one is red and the other green. We can
lay out all of the combinations as in a Punnett square, with the possible red
numbers on one edge, and green on the other. How many combinations are
there?

How many combinations sum to two?

How many sum to 7?

What is the probability of throwing a two? a seven?


Exercise 3.5.6 If you randomly generate a codon by selecting three bases,
independently, with replacement, from the set {A, T, C, G}, with each base
having equal probability, then what is the probability of generating a stop
codon, i.e. TAA, TAG, or TGA?

What is the probability of a stop codon if Pr(C) = Pr(G) = 0.3 and
Pr(T) = Pr(A) = 0.2?

Exercise 3.5.7 Suppose an investigator sequences DNA from 30
individuals. Further suppose these people are from a population that
approaches random mating, so we can assume that 60 independent alleles
have been screened for mutations. If a given mutation has an allele
frequency of 1%, i.e. 1 in 100 alleles are of this form, what is the probability
that it will be detected in the sample?


Exercise 3.5.8 If a specific genetic variant occurs with a frequency of
10%, is the variant certain to be seen if 10 copies of the gene are
sequenced? What is the probability of detecting the variant at least once in
10 sequences?

Exercise 3.5.9 (Without selection) Consider the Punnett square of
table 3.2. Suppose the pollen parent produces pollen with the A and a alleles
in equal numbers. Suppose eggs are similarly produced with the A and a
alleles in equal numbers. Also suppose, quite plausibly, that the genotypes at
the locus in question have no impact on the chance that a pollen particle
successfully fertilizes an egg and produces a seed. We suppose that out of
100 fertilizations, about 25 plants will correspond to each cell in the Punnett
square. Consider the marginal totals for that square. What is the marginal
probability that a plant received the a allele from the pollen? From the egg?

What is the product of these probabilities?

What is the probability of the aa genotype?

Is transmission of the a allele from the pollen independent of transmission
from the egg?
Exercise 3.5.10 (With selection) Now suppose that out of 100
fertilizations, about 25 will correspond to each cell in the Punnett square,
but that a nearby deleterious recessive allele is always transmitted along
with the a allele, so that of 25 fertilizations yielding the aa genotype, only
10 survive to produce a mature seed. (It will be helpful to make a square
with these numbers: 25 in 3 cells, 10 in the remaining cell, and to write
down the marginal totals.) What is the probability that a mature seed
received the a allele through the pollen? Through the egg?

What is the product of these probabilities (two digits)?

What is the probability of the aa genotype in mature seeds (two digits)?

Is transmission of an allele through pollen to a mature seed independent of
transmission through the egg among the viable seeds?

Exercise 3.5.11 Early in the AIDS epidemic, some politicians proposed
HIV testing for people applying for marriage licenses. This never happened.
What is wrong with this idea (from a statistical perspective, in one brief
sentence)?

Exercise 3.5.12 (a) How would you define the predictive value of a
negative test, in words describing its interpretation?

(b) Define the predictive value of a negative test as a function of sensitivity,
specificity, and prevalence, using mathematical notation. Be sure to define
each symbol that represents an event or rate.

(c) Using the data from Weiss et al., what is the predictive value of a
negative HIV ELISA test when P = 0.50? When P = 0.01?


Exercise 3.5.13 An investigator at COH has devised an array-based
screening test for identifying imprinted genes. In the anticipated setting for
using this method, the sensitivity of the test is about 0.80, and the
specificity is 0.999. The test will be used with an array of 10,000 genes, and
it is expected that 10 genes will really be imprinted.
(a) Of the 10 imprinted genes, how many should we expect to be detected?

(b) Of the 10,000 − 10 non-imprinted genes, how many would you expect to
be incorrectly flagged as imprinted?

(c) How many genes in all should we expect to be flagged?

(d) What is the predictive value of a positive test?

Chapter 4
Estimating and Testing a Probability

Suggested Reading: S&W 3.7, 3.8, 4.1, 4.2, 4.3.

In the previous chapter we considered how to calculate the probabilities of
various events, given that we know the probabilities of other events. Such
problems are essentially mathematical. Now let's turn to the more
statistical problems of how to estimate a probability in the first place, and
how to test whether data are consistent with a hypothesized probability
value.

4.1 Example: MEFV Gene and Fibromyalgia Syndrome
Investigators at COH recently published1 that


Missense Mutations in the MEFV Gene Are Associated with
Fibromyalgia Syndrome and Correlate with Elevated IL-1b
Plasma Levels.
100 fibromyalgia syndrome patients were identified. DNA was obtained
from the patients and their parents. The MEFV gene from each of these
300 people was sequenced. One common and 10 rare missense variants were
identified.
The evidence for a genetic association comes from tracking the transmission
of rare missense mutations from parent to affected child. The data are
shown below as Figure 4.1.
22 parents were identified that were heterozygous for a rare MEFV
mutation. In each instance, the other parent had a common allele (wt in
the table), so we can tell if the rare allele was transmitted from parent to
child.
If a rare allele is unrelated to fibromyalgia syndrome, it should be
transmitted with a probability of 1/2. If rare alleles increase the risk of
Fibromyalgia Syndrome, then a sample of fibromyalgia patients should be
enriched for rare alleles.
[1] Feng et al., PLoS ONE, 2 December 2009, Volume 4, Issue 12, e8480

[Figure 4.1 reproduces Table 2 of Feng et al. (2009). For each rare MEFV
haplotype (e.g. E148Q, L110P/E148Q, R329H, P369S/R408Q, A289V, I591T,
K695R, A744S), the table lists the ID number and genotype (wt = wildtype,
het = heterozygote) of the proband, mother, and father in each of the 22
informative trios, along with whether the rare allele was transmitted; the
total number of transmissions is 17.]

Figure 4.1: Transmission of rare variants from 22 parents in FMS trios (Feng
et al. 2009)


The result was that 17 of 22 parents with a rare allele transmitted that
allele to their affected child, while 5 of the 22 did not.
The null hypothesis is the name we give to the hypothesis that rare MEFV
alleles are not associated with fibromyalgia. It implies that the transmission
probability is 0.5. The alternative hypothesis is that the transmission
probability is greater than 0.5.
Is 17 out of 22 enough to discredit the null hypothesis? While 11
transmissions out of 22 opportunities would be perfectly in line with the
null hypothesis, observing exactly that outcome is not very likely. We
expect a few more or less. We judge 17 to be "large" if 17 or more is
unlikely under the null hypothesis.


Testing a hypothesis
To summarize, if p is the probability of inheriting a rare allele for each of
the 22 offspring, we want to test the null hypothesis

H0 : p = 0.5,

against the alternative hypothesis

HA : p > 0.5,

and we will discredit H0 if

Pr(Y ≥ 17)

is small under the null hypothesis.
We can easily calculate the probability of 17 or more rare-allele
transmissions out of 22 opportunities, assuming that these are independent
events, each with a probability of 0.5. This is the same as the probability
of getting 17 or more heads in 22 tosses of a fair coin. The respective
probabilities of 0, 1, . . . , n successes in n independent trials, each with
the same probability, p, are given by the binomial probability function. We
say that the number of successes follows a binomial distribution.

4.2 The Binomial Distribution

The binomial distribution is a model of a simple and common kind of stable
random process. If Y is the number of successes out of n independent trials,
each with probability p, then we say that Y follows a binomial distribution.
As a mnemonic, we can say that the situation satisfies the BInS
assumptions:
Binary outcomes (e.g. success or failure);
Independent trials;
n is fixed in advance;
Same probability on each trial.


Y can take values 0, 1, . . . , n. For any fixed number, y, from {0, 1, . . . , n},
the probability Pr{Y = y} is determined by the binomial distribution
formula

Pr{Y = y} = C(n, y) p^y (1 − p)^(n−y)

where

C(n, y) = n! / ((n − y)! y!)

is the number of ways one can arrange y successes in a sequence of n trials.


The Binomial Coefficient
C(n, y) is called the binomial coefficient. It is denoted by nCy in some
books. In either notation, it is read as "n choose y", meaning the number
of ways of choosing y objects from a set of n objects.
The binomial coefficients can be calculated recursively using Pascal's
triangle:

1
1  1
1  2  1
1  3  3  1
1  4  6  4  1
1  5  10  10  5  1

For example, C(4, 1) = 4 and C(4, 2) = 6. Except for the ones down the
edges, each number in the triangle is the sum of the two adjacent numbers
from the line above. Imagine you start at the top, and must choose to go
left (zero successes) or right (one success). There is one way of making
each choice. Again go left or right to get to the second row. There are two
ways to get to the middle, namely left-right, or right-left. Going on in
like manner, the numbers give the number of distinct paths to each point,
and each path to the same point requires the same number of right turns,
i.e. it corresponds to a fixed number of successes.
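The recursive rule behind Pascal's triangle is easy to turn into code. The course's tools are Excel and R; the following is an optional Python sketch that builds each row from the one above:

```python
# Build row n of Pascal's triangle: each interior entry is the sum of the
# two adjacent entries in the previous row, and the edges are always 1.
def pascal_row(n):
    row = [1]
    for _ in range(n):
        row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]
    return row

# Row n holds the binomial coefficients C(n, 0), ..., C(n, n).
print(pascal_row(4))  # [1, 4, 6, 4, 1]
print(pascal_row(5))  # [1, 5, 10, 10, 5, 1]
```

The fourth row confirms C(4, 1) = 4 and C(4, 2) = 6, matching the triangle above.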

4.2.1 Calculating the tail probability

Returning to the MEFV example, we can check that the number of families in
which the rare allele was transmitted (given that there was a single rare
allele between the two parents) meets the conditions for a binomial
distribution.
Binary outcome: the transmission of a rare allele, or not.
Independent trials are the independent meioses in parents.
n is 22, the number of opportunities for a rare allele to be transmitted.
Same probability: under the null hypothesis, the probability is 0.5 for
each transmission.
We observed Y = 17 out of 22 trials. We want to know Pr(Y ≥ 17). Let's
start with {Y = 22}, the event that it is the rare allele that is
transmitted in each of the 22 opportunities.

Pr(Y = 22) = 0.5^22 = 2.38 × 10^−7.

Now consider {Y = 21}. The probability of transmitting the non-rare allele
is also 0.5, but it might happen on any of the 22 occasions, so the binomial
probability gives us

Pr(Y = 21) = 22 × 0.5^22 = 5.25 × 10^−6.

If {Y = 20}, there are two parents who fail to transmit the rare allele.
This might happen for the first and second parents on our list, or for the
first and third, or for any combination up to the 21st and 22nd. The number
of possibilities is C(22, 2) = 231, which is of course the same as
C(22, 20), because transmission of 20 rare alleles means 2 non-rare alleles
are transmitted. The probability for each possible number of transmissions
is given in Figure 4.2.
Adding up the probabilities for outcomes of 17 or more gives

Pr(Y = 17) + Pr(Y = 18) + . . . + Pr(Y = 22) = 0.0085,

indicating that the outcome is not very likely under the null hypothesis.
This is called the significance probability, often referred to as the
p-value. A significance probability of 0.0085 suggests rather strongly (but
does not, strictly speaking, prove) that the alternative hypothesis is true.
The alternative hypothesis, p > 0.5, is consistent with the biological
hypothesis that some rare variants of the MEFV gene are associated with
fibromyalgia syndrome, so by selecting probands with this condition, we have
enriched our sample for probands who inherited these rare variants.
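The whole tail calculation can be checked directly from the binomial formula. Here is an optional Python sketch (the course itself uses Excel and R for this) that sums the probabilities for 17 through 22 successes:

```python
from math import comb

# Binomial probability Pr(Y = y) for n trials with success probability p.
def binom_pmf(y, n, p):
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Upper-tail probability Pr(Y >= 17) for n = 22, p = 0.5.
p_value = sum(binom_pmf(y, 22, 0.5) for y in range(17, 23))
print(round(p_value, 4))  # about 0.0085
```

With p = 0.5 every sequence of 22 outcomes has probability 0.5^22, so the sum is just the count of favorable sequences, 35443, divided by 2^22.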



Binomial Distribution: Trials = 22,
Probability of success = 0.5

0.15

0.10

0.05

Probability Mass

0.00

10

15

20

Number of Successes

Figure 4.2: Binomial probability function.

4.2.2 Computing

We don't generally have to work out binomial probabilities from the
formulas given above. There are many simple computing tools (and tables)
from which we can find Pr(Y ≥ 17). However, most are designed to give
the lower tail probability, Pr(Y ≤ a), for a specified value of a, rather
than the upper tail that we require, but it is simple enough to compute

Pr(Y ≥ 17) = 1 − Pr(Y ≤ 16).

An important detail, when calculating a tail probability for any discrete
variable, is to know whether your computing tool is including or excluding
the observed value from the tail.


In Excel, one can go to any cell and enter:

=1 - BINOMDIST(16,22,0.5,TRUE)

Rather than remember the order of the arguments between the parentheses,
you can use the Insert menu, choosing Function, Statistical, and
BINOMDIST, which will present a small form to fill out. You will need to
insert the "1 - " in the cell after filling in the form.
In R you can do much the same thing at the command line by typing

1 - pbinom(16,22,.5)

In R, the functions relating to probability distributions come in four
flavors, distinguished by the first letter p, d, q, or r.
pbinom is the cumulative probability (up to some number).
dbinom is the density (or probability) at a particular value.
qbinom is the quantile with a specified cumulative probability.
rbinom generates random numbers.
In R Commander (start R, then enter library(Rcmdr)), one can use the
Distributions pull-down menu, choosing Discrete distributions, then
Binomial distribution and Binomial tail probabilities. This will
produce a form to fill in with spaces for Variable value (16), Binomial
trials (22), Probability of success (0.5), and a radio button for Lower
tail or Upper tail. An important detail to keep in mind is that the
variable value is always included in the lower tail, and never included in
the upper tail, regardless of which button you click. The button just does
the subtraction from one, just like in Excel or plain R.

4.3 Estimating a probability

In the study of the MEFV gene, we are interested in testing whether
p = 0.5, or p > 0.5, because rejecting the null hypothesis means that we
have evidence that the gene may be associated with fibromyalgia. The exact
value of p is perhaps less interesting, but still of some interest. If p = 1, it
would indicate that having a rare variant of the MEFV gene is the only way
to get fibromyalgia syndrome, and if p were only slightly elevated, it would
suggest that there are many other ways of getting fibromyalgia syndrome.
The observed fraction of transmissions, 17/22 ≈ 0.773, is a reasonable
estimate of p. We would, however, like to have some idea of how accurate
that estimate is. One way of going about that, which is simple in concept,
if not in calculation, is to consider other hypothetical values of p besides
p = 0.5, and test whether we could reject those values. To do this, we need
a fixed criterion for rejection. Let's say that we will reject a
hypothetical value p0 if

Pr(Y ≥ 17 | p = p0) < 0.05.

We already know that we can reject p0 = 0.5 by this criterion. If we try
0.55 we get

Pr(Y ≥ 17 | p = 0.55) = 0.027,

so we can reject 0.55 as well. If we try 0.6 we get

Pr(Y ≥ 17 | p = 0.6) = 0.072,

which is larger than 0.05, so we cannot reject p0 = 0.6. The boundary for
our rejection criterion is between 0.55 and 0.6. A little more searching
will find

Pr(Y ≥ 17 | p = 0.58) = 0.0498,

so any value of 0.58 or smaller can be rejected by our criterion. We can say
that 0.58 is a 95% lower confidence bound on p, the probability of
transmission. This means that values of 0.58 and below are inconsistent
with the data. 95% confidence means that we have arrived at this number
by a method that will in fact put the lower bound below the true value of p
in 95 out of 100 similar experiments. This is not perfect, but perfect
confidence can only be had if we set the lower bound to zero, which would
not be useful.
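The search over hypothetical values can be automated. The sketch below (optional Python, mirroring in spirit what R's binom.test does for us) scans a grid of candidate values p0 and keeps the largest one that our criterion still rejects:

```python
from math import comb

# Upper-tail probability Pr(Y >= y | n, p).
def binom_upper_tail(y, n, p):
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(y, n + 1))

# Scan p0 on a grid of 0.001 steps; the largest rejected value
# approximates the 95% lower confidence bound on p.
bound = max(p0 / 1000 for p0 in range(500, 700)
            if binom_upper_tail(17, 22, p0 / 1000) < 0.05)
print(bound)  # close to 0.58
```

A finer grid would locate the boundary more precisely; this is exactly the kind of routine inversion that canned functions perform internally.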
Of course we don't have to do such a routine calculation by manual
searching. In R, for example, we can call the binom.test function, and get
the following output.


> binom.test(17,22,0.5, alt = "greater")


Exact binomial test
data: 17 and 22
number of successes = 17, number of trials = 22, p-value = 0.00845
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
0.58 1.00
sample estimates:
probability of success
0.773
The first argument is Y, the number of successes. The second is n, the
number of trials. The third is a null hypothesis value to be tested, and the
alt argument allows us to specify the alternative hypothesis that p > p0,
rather than smaller. That results in a one-sided test, and a one-sided lower
confidence bound, rather than a confidence interval.
We can get a 95% confidence interval by breaking our 5% error probability
into two pieces, and calculating bounds such that

Pr(Y ≥ 17 | p = pL) = 0.025

and

Pr(Y ≤ 17 | p = pU) = 0.025.

In the R binom.test function, we can simply omit the alt = "greater"
argument, since two-sided tests and intervals are the default. We could
then find that

(pL, pU) = (0.546, 0.922).

We can see that we have pretty reliably rejected the null hypothesis, but we
have not pinned down p very well.
Mendel's Peas
Let's consider another example where the probability is estimated with
much better accuracy. Mendel observed the pod shape of 1181 offspring
from self-fertilization of hybrids, and reported that 882 were of the
inflated type and 299 were of the constricted type. The 95% confidence
bounds on the probability of the inflated type are (0.721, 0.771). For the
seed shape phenotype, he observed 7324 seeds, of which 5474 were round,
yielding a 95% confidence interval of (0.737, 0.757) for the probability of
round peas.

4.4 Random Variables

When we write an expression like Pr(Y ≤ 16), the symbol Y is a random
variable, while the number 16 is a constant. If we write Pr(Y ≤ q), noting
that it can be computed by the R expression pbinom(q, 22, .5), then q
is a variable, but not a random variable.
A random variable is associated with probabilities. Its value is a random
outcome. An ordinary variable, like q, is a placeholder for a fixed number,
so that we can discuss it in general, without specifying its value.
A random variable represents a perhaps infinite population. There is no
limit to the number of times we can toss a coin. At least there is no limit
in principle.
Mendel's round and wrinkled peas are analogous to coin tosses, except that
we expect round seeds with a probability of 0.75, not 0.50. If we score a
wrinkled seed as 0 and a round seed as 1, the average of these scores is the
fraction of round seeds. We can think of the mean as a weighted sum of the
two possible scores

x̄ = (5474/7324) × 1 + (1850/7324) × 0 = 0.747.

We can think of the mean for an infinite population of possible peas as

μ = (3/4) × 1 + (1/4) × 0 = 0.75.

All we have done is substitute the theoretical probability of a round pea
for the empirical fraction, but the notion is important. The number 0.747 is
a sample mean, i.e. an observation, while the number 0.75 is a population
mean, i.e. a feature of an idealized model.
Bernoulli Trials
A random variable, X, that takes only two values, namely 0 and 1, is called
a Bernoulli trial. Its distribution is completely specified by one parameter,
p = Pr(X = 1).


The sum of n independent Bernoulli trials follows the binomial distribution,
described above. We can think of the individual peas in one of Mendel's F2
generations as independent Bernoulli trials, each yielding 0 or 1 round pea.
The sum is the number of round peas out of n = 7324 trials. This sum
follows a binomial distribution.

4.4.1 Expected Value

The population mean of a random variable is often called its expected value,
and is often denoted by the notation E(X), read as "the expected value of
X". If X is a Bernoulli trial, and p = Pr(X = 1), then E(X) = p. (Note
that this is technical jargon, so we go ahead and say the expected value of
X is p, even though we never expect an individual value of X to be
anything other than 0 or 1.)
If Y is the sum of n independent Bernoulli trials, each with the same
probability of success, then Y follows a binomial distribution, and
E(Y) = np. It follows that E(Y/n) = p, so we can regard Y/n as an
unbiased estimator of p.
Definition: A statistic T is an unbiased estimator of parameter p if
E(T) = p.
A population can also have a standard deviation. Just as with a finite
sample, the standard deviation is the root mean square (RMS) deviation
from the mean. The difference is that the mean in the RMS calculation is a
population mean.
For a Bernoulli trial, the variance, or squared standard deviation, σ², is

σ² = (0 − p)² Pr(X = 0) + (1 − p)² Pr(X = 1).

If we substitute p for Pr(X = 1) and (1 − p) for Pr(X = 0), and apply a
little bit of algebra, we can reduce this formula to something simple:

σ² = (0 − p)²(1 − p) + (1 − p)² p
   = p²(1 − p) + (1 − p)² p
   = p(1 − p){p + (1 − p)}
   = p(1 − p)


so the standard deviation is

σ = √(p(1 − p)).

Note that the standard deviation of a Bernoulli random variable is


completely determined by p, the one parameter that governs everything
about this simple distribution. The standard deviation of a single Bernoulli
trial may not seem very useful by itself. At this point, it is just an exercise
in applying the definition to a population, but it will soon lead us to a
standard error for estimates of probabilities.

4.5 The Law of Averages

Figure 4.3 shows results from 10,000 tosses of a coin, performed by John
Kerrich, a South African mathematician, while he was in a German prison
camp in Denmark during the second world war. We would expect about
half of the tosses to come up heads in the long run, but the figure shows
that we have to be somewhat careful about exactly what we mean by the
long run. The upper part of the figure plots, on the vertical axis, the
number of heads minus half the number of tosses, i.e. the excess number of
heads compared to our expectations. This comparison actually wanders
away from zero as the number of tosses increases. The lower part of the
figure plots the percentage of heads minus 50%, i.e. the excess of heads on a
percentage scale, which approaches zero as the number of tosses increases.
This behavior is often referred to as the Law of Large Numbers, a more
technical name for what is colloquially known as "the law of averages". We
will not bother with a precise statement of the Law of Large Numbers, and
in fact there are several variations of it. With regard to a sequence of
Bernoulli trials, like coin tosses, it says that, as the number of trials
grows large, the percentage of successes will approach the expected
percentage (or proportion), p, with high probability. This happens despite
the fact that the absolute number of successes is likely to take ever larger
excursions away from the expected number of successes. This is a case where
the colloquial name may seem more appropriate than the technical name, as it
is the averages that converge, not the absolute numbers.
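Kerrich's experiment is easy to imitate by simulation. This optional Python sketch (the seed and sample sizes are arbitrary choices for illustration) tosses a virtual fair coin and reports both the absolute excess of heads and the excess as a proportion; the first tends to wander while the second shrinks toward zero:

```python
import random

random.seed(1)  # make the simulation reproducible

n = 100_000
heads = 0
for i in range(1, n + 1):
    heads += random.random() < 0.5  # one fair coin toss
    if i in (100, 10_000, 100_000):
        excess = heads - i / 2
        # absolute excess of heads vs. excess as a fraction of tosses
        print(i, excess, excess / i)

proportion = heads / n
```

Rerunning with different seeds shows the same pattern Kerrich saw: the count excess takes larger excursions as n grows, while the proportional excess settles near zero.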


4.5.1 Mean and standard deviation of a binomial

If Y has a binomial distribution with parameters n and p, then the
population mean, or expected value, of Y is

E(Y) = np.

In the MEFV gene example, we have n = 22 trials, each with probability
p = 0.5 under the null hypothesis, so the expected number of rare allele
transmissions is

np = 22(0.5) = 11.

In Mendel's pea shape data, we would expect

np = 7324(0.75) = 5493

round seeds. The observed number of round seeds was 5474, off by 19. This
is only about one quarter of one percent when compared to the total of
7324 seeds.
The variance of a binomial count, Y, is

V(Y) = np(1 − p)

so the standard deviation is

SD(Y) = √(np(1 − p)).

The standard deviation of the count increases with n. In the MEFV gene
data, the standard deviation is

SD(Y) = √(22(.5)(.5)) = 2.345

so the observation of 17 transmissions was about 2.56 standard deviations
above the 11 expected, a significant difference. We generally expect
observations to be within about 2 standard deviations of the mean about
95% of the time.
In Mendel's seed shape data, the standard deviation of the round seed
count is

√(7324(.75)(.25)) = 37.


The difference of 19 between observation and expectation is only about half
of a standard deviation, the sort of deviation one would expect quite
often under Mendel's model.
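These standardized comparisons are quick to verify. An optional Python sketch, using the same numbers as above:

```python
from math import sqrt

# MEFV example: n = 22 trials, null probability p = 0.5.
mean_mefv = 22 * 0.5                   # expected transmissions: 11
sd_mefv = sqrt(22 * 0.5 * 0.5)         # about 2.345
z_mefv = (17 - mean_mefv) / sd_mefv    # about 2.56 standard deviations

# Mendel's seed shape data: n = 7324, p = 0.75.
mean_peas = 7324 * 0.75                # expected round seeds: 5493
sd_peas = sqrt(7324 * 0.75 * 0.25)     # about 37
z_peas = (5474 - mean_peas) / sd_peas  # about -0.5 standard deviations

print(round(sd_mefv, 3), round(z_mefv, 2), round(sd_peas), round(z_peas, 2))
```

The contrast is the whole point: 2.56 standard deviations is surprising under the null hypothesis, while half a standard deviation is entirely routine.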
We can divide our binomial count, Y, by the sample size, n, to obtain a
fraction that estimates the probability, p. We often write

p̂ = Y/n

to distinguish the estimator, p̂, from the parameter, p, which is the target
we are trying to estimate. We then have the mean

E(Y/n) = p,

variance

V(Y/n) = p(1 − p)/n,

and standard deviation

SD(Y/n) = √(p(1 − p)/n).

The standard deviation of an estimator is usually called a standard error.
Note that the standard error gets smaller as n gets larger. This is the law
of averages at work.
Also note that Y is a sum, so Y/n is a mean, and √(p(1 − p)/n) is the
standard error of the mean, when the observations are random indicator
variables, i.e. Bernoulli trials.
From Mendel's seed shape data, we have the estimate

p̂ = 5474/7324 = .747

with standard error

√(p(1 − p)/n) = √(.75(.25)/7324) = .005.

As noted above, and discussed below, we expect a mean, such as p̂, to be
within two standard errors of its expected value (which is p) in about 95%
of samples. So we expect that the interval

.747 ± 2(.005)


will cover the population parameter, p, about 19 times out of 20. This
interval, (.737, .757), is an approximate confidence interval. Because
Mendel observed so many seeds, the approximation is quite good, and all
three digits match the exact calculation we made above, based on the
binomial distribution.
This simple device of adding and subtracting two standard errors from the
estimate also works well for Mendel's pod type data. In that example,
n = 1181, so

se = (.75 × .25/1181)^(1/2) = 0.0126

and the estimate, p̂, give or take two standard errors, yields the interval

(0.722, 0.772)

which differs from the exact interval only in the third digit.
There is a subtle issue in the above calculation, in that we used the
hypothetical probability of 0.75 rather than the estimate of 0.747 to
estimate the standard error of p̂. It makes no appreciable difference in
this example. When discussing standard errors, we won't typically
distinguish between population values and estimates, because the standard
error is typically just meant to give an approximate idea of accuracy.
In the MEFV example, however, with only 22 observations, it would be
preferable to use the more exact method based on the binomial
distribution, which avoids the question of how to estimate the standard
error. The MEFV example is near the boundary of sample size where a two
standard deviation interval is useful. Adding 2 standard errors to the
estimate keeps one within the unit interval, but adding 2.5 or 3 se takes
one outside the unit interval, which is generally a sign that caution is
needed.
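The two-standard-error interval for the pod data takes only a few lines to check. An optional Python sketch, following the text in using the hypothetical p = 0.75 for the standard error:

```python
from math import sqrt

# Mendel's pod shape data: 882 inflated out of 1181 offspring.
p_hat = 882 / 1181
se = sqrt(0.75 * 0.25 / 1181)  # standard error using hypothetical p = 0.75

# Estimate give or take two standard errors.
lower, upper = p_hat - 2 * se, p_hat + 2 * se
print(round(se, 4), round(lower, 3), round(upper, 3))  # 0.0126 0.722 0.772
```

Swapping in the estimate p̂ for 0.75 in the standard error changes the interval only in the fourth decimal place, which is the point made above.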


4.6 The Normal Distribution

The binomial distribution is an example of a discrete distribution. If Y
has a binomial distribution, it can only take the values 0, 1, . . . , n.
The normal, or Gaussian, distribution is an example of a distribution for
continuous random variables. The normal density curve is bell-shaped,
centered on the population mean, μ, with the scale determined by the
standard deviation, σ. These two parameters, μ and σ, completely
determine the normal distribution. We typically use Greek letters for the
underlying but unobservable parameters (such as mean and standard
deviation) of a population.
Figure 4.4 shows a normal density chosen to fit the histogram of some data
on serum cholesterol measurements. It has mean μ = 176 mg/dL and standard
deviation σ = 30 mg/dL. The lower version of the figure has the scale
marked in mg/dL, but with the marks at intervals of 1 standard deviation.


Figure 4.3: 10,000 tosses of a coin (from Freedman, Pisani, and Purves,
Statistics, 3rd ed.)


Figure 4.4: (figure 4.1 from Samuels and Witmer). A continuous density
curve overlaid on a histogram. In the lower plot, the scale marks are at
intervals of one standard deviation.

4.6.1 Standardized scale

If Y has a normal distribution with mean μ and standard deviation σ, this
is often denoted

Y ∼ N(μ, σ).

If Y has any such normal distribution, then

Z = (Y − μ)/σ

has a normal distribution with mean 0 and standard deviation 1. This is
called the standard normal distribution.
The normal distribution is actually a family of distributions, one for each
combination of μ and σ. All normal density curves have the same shape if
you put them on the standard scale.


Figure 4.5: Figure from Samuels and Witmer, illustrating several normal
distributions and the standardized scale.

4.6.2 Some motivation

The binomial distribution applies in many situations where we believe the
necessary assumptions hold. The normal distribution is more useful as an
approximation.
There are theoretical reasons to expect a variable to follow a normal
distribution if it results from the sum of many small independent
increments. Many physical traits that result from numerous small genetic
and environmental contributions are approximately normal. In chapter 2,
the weights of mouse olfactory bulbs, the width of iris sepals, and the
weights of cats (aside from truncation) all had histograms consistent with a
normal distribution. Human height is approximately normal, and it has
recently been shown to depend on a great many genetic influences.
The normal distribution is often used as an approximation to other
distributions, such as the binomial distribution. Figure 4.6 shows the
density curve for a normal distribution with the same mean and standard
deviation as the binomial distribution that we considered in the MEFV gene
example. The goodness of the normal approximation for large sample sizes
explains why taking p̂ plus or minus two standard errors serves as an
adequate 95% confidence interval when n is large. For a normal
distribution, the mean, give or take two standard deviations, includes about
95% of the area under the curve.
In the MEFV example, with n = 22, the normal distribution is only a
rough approximation to the binomial distribution. The normal density is a
good approximation to the binomial probabilities, but if we are to use areas
under the normal curve to approximate probabilities like Pr(Y ≥ 17), we
need to be careful about the intervals that we integrate over. In this case
we would want the area above 16.5, not 17. This is called a continuity
correction. Even if we take such extra care, with only n = 22, the
approximation would not be at all adequate if the probability were, say,
0.8 instead of 0.5. Generally, we want the mean, give or take 3 standard
errors, to be within the unit interval if we are going to use the normal
approximation.
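The effect of the continuity correction can be seen numerically. This optional Python sketch compares the exact tail probability Pr(Y ≥ 17) with normal approximations using cutoffs of 16.5 and 17; since only the standard library is used, the normal tail is computed from the error function:

```python
from math import comb, erf, sqrt

# Exact binomial tail Pr(Y >= 17) for n = 22, p = 0.5.
exact = sum(comb(22, y) * 0.5**22 for y in range(17, 23))

# Standard normal upper-tail probability via the error function.
def normal_upper_tail(z):
    return 0.5 * (1 - erf(z / sqrt(2)))

mu, sigma = 22 * 0.5, sqrt(22 * 0.25)  # mean 11, sd about 2.345

with_correction = normal_upper_tail((16.5 - mu) / sigma)   # area above 16.5
without_correction = normal_upper_tail((17 - mu) / sigma)  # area above 17

print(round(exact, 4), round(with_correction, 4), round(without_correction, 4))
```

The corrected approximation lands much closer to the exact 0.0085 than the uncorrected one, which is why the half-unit shift matters for discrete variables.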


Figure 4.6: Binomial probabilities (n = 22, p = 0.5) overlaid with a normal
density curve (μ = 11, σ = 2.345).

4.6.3 Central Limit Theorem

Means (and sums) of independent random variables tend to follow a normal
distribution as the sample size increases. We won't attempt to state this in
its full mathematical glory, but it is the basis for taking a mean plus or
minus two standard errors as an approximate confidence interval (because
this covers 95% of a normal distribution). Figure 4.7 illustrates the
normal approximation to the mean in an example where the distribution of
individual observations is far from normal.

Figure 4.7: An illustration of the Central Limit Theorem: even though
the original distribution is far from normal, the distribution of means is
approximately normal as the sample size gets large.

4.6.4 Standard error of the mean

As sample size increases, the sample mean tends to be more closely
approximated by a normal distribution, but as the law of large numbers
implies, sample means also get more precise as sample size increases. We
saw this in the example of p̂ as an estimator of the binomial probability,
p, but it holds generally.
If X is a random variable with population mean, μ, and population
standard deviation, σ, and X̄ is the mean of n independent observations
distributed as X, then the standard deviation of X̄, also called the
standard error of the mean, is

SE(X̄) = σ/√n.

This is an important formula. When estimating a mean, if we want to
cut our uncertainty in half, we need to quadruple our sample size.
Example 4.6.1 (serum cholesterol) The serum cholesterol levels from a
previous figure have a standard deviation, σ = 30 mg/dL. If we were to
estimate the mean, based on 100 observations, the standard error of our
estimate would be

σ/√n = 30/√100 = 3 mg/dL.

Of course it is natural to ask how we would know σ if we need to estimate
the mean. The point here is that whatever the standard deviation, σ, might
be, the standard error of the mean will be only σ/10. The practical matter
of judging the accuracy of a mean estimate when we don't know σ will be
taken up when we study t-statistics.
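The σ/√n formula can also be checked by simulation. This optional Python sketch draws many samples of size 100 from a normal population with σ = 30 (the population values mirror the cholesterol example; the number of repetitions is an arbitrary choice) and looks at the spread of the resulting sample means:

```python
import random
from math import sqrt

random.seed(2)  # reproducible simulation

# Draw 2000 samples of size n = 100 from N(mu = 176, sigma = 30)
# and record the mean of each sample.
n, reps = 100, 2000
means = [sum(random.gauss(176, 30) for _ in range(n)) / n for _ in range(reps)]

# Empirical standard deviation of the sample means.
grand_mean = sum(means) / reps
empirical_se = sqrt(sum((m - grand_mean) ** 2 for m in means) / (reps - 1))
print(round(empirical_se, 2))  # should be close to 30 / sqrt(100) = 3
```

The empirical spread of the means comes out near 3 mg/dL, one tenth of the population standard deviation, as the formula predicts.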

4.6.5 Areas under the normal curve

For a continuous distribution, like the normal distribution, the area under
the density curve over an interval gives the probability that the random
variable will fall in that interval.
For a normal distribution, about 2/3 of the area falls within one standard
deviation of the mean. About 95% falls within 2 standard deviations of the
mean. Roughly 99% falls within 2.5 standard deviations of the mean.

Figure 4.8: Areas under a normal curve

4.6.6 Computing

The serum cholesterol dataset, displayed in a previous figure, has a mean of


176 mg/dL and a standard deviation of 30 mg/dL. Suppose we want to
know the probability of a new observation being over 230. If we take the
sample mean and standard deviation as population values for the sake of
this estimate, then we can easily find the answer using a variety of
computing tools.
In Excel, click on a cell and use the Insert Function dialog to insert the NORMDIST function. Use 230 as X, 176 as the mean, 30 as the standard deviation, and set cumulative to TRUE. This will insert a function for Pr(Y < 230). You can prepend 1 - to obtain
=1 - NORMDIST(230,176,30,TRUE)
You can, of course, just type this in the cell, but then you either need an example, or you need to remember the name of the function and the order of arguments. The cell should evaluate to .036 (using appropriate rounding).
In R the analogous command-line incantation is
1 - pnorm(230,176,30)
which is quite similar to the Excel expression. In R Commander, you can use the Distributions pull-down.
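For comparison (a sketch, not one of the course tools), the same upper-tail probability in Python via scipy.stats:

```python
from scipy.stats import norm

# Pr(Y > 230) for Y ~ Normal(mean=176, sd=30)
p = 1 - norm.cdf(230, loc=176, scale=30)
print(round(p, 4))  # 0.0359, i.e. about .036
```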
For continuous distributions, you don't have to worry about whether the boundary value is included in the interval. Intervals have probabilities, but individual points have zero probability. If that causes you any philosophical distress, remember that real observations have finite accuracy, so they correspond to intervals, not points.
Any of the computing methods should produce the same answer:
Pr(Y > 230) = .036.
Fewer than 4% of the population would be expected to have serum
cholesterol above 230 mg/dL. Whether this should be regarded as an upper
limit of normal would involve various practical concerns as well.
The old-fashioned way of getting these sorts of probabilities is to refer to a table of the standard normal distribution. This requires standardizing the value of interest, i.e. expressing 230 as a number of standard deviations above the mean:

Pr(Y > 230) = Pr((Y − 176)/30 > (230 − 176)/30) = Pr(Z > 1.8) = 1 − Pr(Z ≤ 1.8),

where Z has a standard normal distribution. Looking in a normal table, this is 1 − 0.9641 = 0.0359.

4.7 Summary

Hypothesis Testing. In the MEFV example, we tested the null hypothesis that p = 0.5 against the alternative hypothesis that p > 0.5, where p is the probability of finding a transmitted rare allele in an affected offspring. We observed the result Y = 17, where, prior to the observation, the random variable Y had a binomial distribution. We calculated
Pr(Y ≥ 17 | p = 0.5) = 0.0085
under the null hypothesis. Because the data would be unlikely if the null hypothesis were true, we consider this evidence that the null hypothesis is false.
Confidence Interval. If we set a fixed error probability, say 0.05, the (central) set of parameter values that we could not reject, based on Pr(Y ≥ 17 | p) ≥ 0.05, forms a 95% confidence interval. These are plausible values of p. A confidence interval for p can easily be computed using the binom.test function of R.
Binomial Distribution. The number of successes out of n independent trials, each with probability p, follows a binomial distribution. A binomial random variable satisfies the BInS assumptions. If Y follows a binomial distribution, its population mean or expected value is
E(Y) = np
and its standard deviation is
SD(Y) = √(np(1 − p)).


We can compute tail probabilities for the binomial distribution using the BINOMDIST function of Excel, or using the pbinom function of R.
Estimator of p. The estimator p̂ = Y/n has expected value
E(Y/n) = p
and standard error
SE(Y/n) = √(p(1 − p)/n).
Normal Approximation. The interval
p̂ ± 2 SE(p̂)
is an approximate 95% confidence interval for p, based on the normal approximation. This requires that n be large enough that the confidence interval is well within the unit interval, e.g. still within the unit interval if you replace the 2 by 2.5 or 3.
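A worked sketch of this interval in Python, with hypothetical counts (y = 30 successes in n = 100 trials, chosen only for illustration):

```python
import math

y, n = 30, 100                             # hypothetical counts
p_hat = y / n                              # estimate of p
se = math.sqrt(p_hat * (1 - p_hat) / n)    # SE(p-hat)

lower, upper = p_hat - 2 * se, p_hat + 2 * se
print(round(lower, 3), round(upper, 3))    # 0.208 0.392 -- well inside (0, 1)
```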
The normal distribution can be used to approximate other things besides a
binomial random variable, and we might be interested in points of the
normal distribution other than the 2SD that cover the middle 95%, but
more on that later.

4.8 Homework Exercises

Exercise 4.8.1 In Mendel's F2 generation (the result of self-fertilization of hybrid plants), the probability that a seed is round is 0.75, and the probability that a seed is wrinkled is 0.25.
(1) If we obtain 12 such seeds at random (i.e. without regard to the kind of
seed), what is the expected number of round seeds?
(2) What is the probability that exactly 6 of the seeds are round?
(3) What is the probability that 6 or fewer of the seeds are round?

Exercise 4.8.2 In Mendel's progeny testing experiments (section 3.2.1) we calculated a probability of 0.371 for the event that all 10 tested plants exhibited the dominant form. Mendel subjected 500 plants to progeny testing, and for 166 of those plants, all 10 progeny were of the dominant form. What is the probability of observing 166 or fewer such events?

Exercise 4.8.3 One unit of distance along a chromosome is the Morgan. For small distances, the centiMorgan (cM) corresponds to the expected percentage of cross-over events. For example, if two markers are 5 centiMorgans apart, the probability of a cross-over between them is about 0.05. Suppose we follow a pair of markers on 400 chromosomes, and observe 40 cross-over events. What is the 95% confidence interval for the probability of a cross-over between the markers (to 3 digit accuracy)?

Exercise 4.8.4 Wicker et al.² reported on an experiment in which 63 of 81 NOD mice became diabetic by 7 months of age. Using these data, compute an exact binomial 95% confidence interval for the probability of diabetes under these conditions. Also compute the 99% confidence interval.

²1994 J Exp Med 180:1705–13


Exercise 4.8.5 The Health and Nutrition Examination Study (HANES) reported the mean height of 6,558 women age 18–74 as 63.5 inches, with a standard deviation of 2.5 inches.
(1) What is the standard error of the mean?
(2) Give the mean and standard deviation in cm. (Recall 2.54 cm = 1 in.)
(3) In what range of heights (in cm) would you expect to find the middle two-thirds of the women?
(4) If we randomly selected 1000 women from this database, would you expect the mean to be larger, smaller, or about the same?
(5) Would you expect the standard deviation for this subset to be larger, smaller, or about the same?
(6) Would you expect the standard error of the mean to be larger, smaller, or about the same?
Exercise 4.8.6 Mice of the inbred NOD line (all homozygous) were crossed with B6 mice, another inbred line. The hybrid mice were back-crossed to the parental NOD line, so at any given locus, each resulting mouse could be either homozygous for the NOD allele, or heterozygous. Suppose 60 of the resulting mice were diabetic, and of those diabetic mice, 43 were homozygous for the NOD allele at the IL2 locus. Are these results consistent with the hypothesis that the probability of a diabetic mouse being homozygous is 0.5 at this locus? Provide an appropriate calculation, either approximate or exact, and interpret it.


Exercise 4.8.7 Suppose 8 genes have monoallelic expression in one neural cell type, and we plan to test each of these 8 genes for similar monoallelic expression in a different cell type. Suppose that the reality is that none of the 8 genes have monoallelic expression in the second cell type, but we won't know that when we do the testing. If each of our 8 tests is independent, with a 0.05 probability of a false-positive finding, what is the probability of at least one false-positive finding?

Exercise 4.8.8 The percentage of CD8+ T-cells expressing a given CMV antigen (by flow cytometry, using a tetramer assay) is best displayed on a logarithmic scale. Suppose the geometric mean in controls is 0.5 percent, and you are interested in whether an experimental stimulation doubles the geometric mean. How large a difference are you looking for on a base 10 logarithmic scale?

Exercise 4.8.9 Suppose the standard deviation of a flow cytometry measurement is 0.40 on a base 10 logarithmic scale. How many subjects must you study if you want the standard error of the mean (on the same log scale) to be 0.10?


Chapter 5

Estimation & Testing using Student's t-Distribution

5.1 A Single Sample

Mass is the only unit of measurement that is still defined by a physical object. The Treaty of the Meter defines the kilogram as the mass of an object made of platinum-iridium alloy that is kept in Paris. Various other standard objects are used to calibrate measurements of mass around the world, and figure 5.1 shows the results of 100 weighings of one of them. The object is called NB10, and it resides at the U.S. National Bureau of Standards.
The units labeling the histogram are micrograms below 10 grams, which is
the nominal mass. The curve shows the Gaussian (i.e. normal) density with
the same mean and standard deviation as the 100 measurements, and the
scale without numbers shows increments of one to three standard
deviations above and below the mean.
This histogram is typical of many kinds of measurements, in that they are
approximately Gaussian, but with slightly heavier tails. Notice that two
observations out of 100 are about 5 standard deviations away from the
mean, a distance that is very unlikely under a Gaussian distribution. The
phenomenon of having one long, heavy tail is called skewness. When both
tails are heavy, the phenomenon is called kurtosis. Another aspect of
kurtosis is that there tends to be additional mass in the center of the
distribution, compared to a Gaussian distribution, otherwise we would
simply have a larger standard deviation.
The mean of the 100 measurements is x̄ = 404.59, and the standard deviation is s = 6.47. The standard deviation describes the spread of the individual measurements and is reflected in the width of the normal curve. The standard error of the mean is

SE(X̄) = s/√n = 6.47/√100 = 0.647,
so an approximate 95% confidence interval would be the mean, give or take
two standard errors. The small dark bar on the axis shows the 95%
confidence interval. It is not much wider than one of the histogram bars. It
is important to bear in mind that the standard error and the confidence
interval are statements about how well the mean is estimated. The
standard deviation describes the variation in the measurements. The
standard error does not.

Figure 5.1: 100 weighings of a standard object at the National Bureau of Standards. (Histogram of density against micrograms below 10 g, titled "100 Weighings of NB10".)

We can be a little more accurate about the confidence interval. If we want three-digit accuracy, we can notice that the middle 95% of the area under a Gaussian curve is the area within 1.96 standard errors. For two-digit accuracy, we can use 2.0. If we need more accuracy, or if we want to use a confidence level other than 95%, we can look up quantiles of the normal distribution. In R, the task would look like this.
> options(digits=3)
> qnorm(.025)
[1] -1.96
> qnorm(.975)
[1] 1.96

but we could calculate both ends of the interval in one step, by collecting
the right tail probabilities into a vector:
> qnorm(c(.025,.975))
[1] -1.96 1.96
If we wanted a 90% interval, we would do this
> qnorm(c(.05,.95))
[1] -1.64 1.64
to get the number of standard errors.
The 100 measurements are in a file, NB10wt.txt, but they are organized a little differently than the data we have seen before. In previous examples, we used the read.table or read.csv functions to read data that was organized as a rectangular array, with different columns representing different variables. This data file has five columns of numbers, separated by blank space, but the columns are not separate variables, they are just additional measurements. In R we can use the more primitive scan function to read this simple format.
nb = scan("NB10wt.txt")
We can then calculate a 95% confidence interval for the mean in one step
(aside from resetting the number of digits):
> mean(nb) + qnorm(c(.025,.975)) * sd(nb)/sqrt(100)
[1] 403.3 405.9
In summary, we can say that the NB10 object is between 403 and 406
micrograms shy of 10 grams, but individual measurements will have a
standard deviation of about 6.5 micrograms, and we will occasionally see a
slightly wild measurement that is off by 30 micrograms or so. This is about
as good as it gets for weighing small objects.
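The interval can also be reproduced from the summary statistics alone (a Python sketch; 404.59 and 6.47 are the mean and SD reported above):

```python
import math
from scipy.stats import norm

mean, sd, n = 404.59, 6.47, 100
se = sd / math.sqrt(n)                     # 0.647

# 95% CI: mean plus/minus 1.96 standard errors
lo = mean + norm.ppf(0.025) * se
hi = mean + norm.ppf(0.975) * se
print(round(lo, 1), round(hi, 1))          # 403.3 405.9
```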


        REG    KILN   diff
  1    1903    2009    106
  2    1935    1915    -20
  3    1910    2011    101
  4    2496    2463    -33
  5    2108    2180     72
  6    1961    1925    -36
  7    2060    2122     62
  8    1444    1482     38
  9    1612    1542    -70
 10    1316    1443    127
 11    1511    1535     24
Mean   1841    1875   33.7
SD    342.7   332.9   66.2

Table 5.1: Gosset's corn yield data. REG = Corn yield (lbs/acre) from regular seed; KILN = Corn yield from kiln-dried seed; diff = Difference (KILN − REG).

5.2 A Paired Experiment

William Gosset, who worked for the Guinness brewery, published a paper¹ in 1908 that solved a practical question of inference by inventing a tool that scientists (and many others) have been using ever since. The idea was polished up a bit by Ronald Fisher, and is now known as Student's t-distribution. The name Student refers to Gosset, whose employer didn't want competitors to know that this sort of thing was useful to brewers.
One of the examples addressed by Gosset concerned a question in agriculture, as to whether using kiln-dried seed improved yields. The experiment was done in 11 fields. Each field was divided in two, with regular seed planted in one half, and kiln-dried seed planted in the other half. The results are shown in table 5.1.
Notice that the difference of the means for REG and KILN is the mean of the differences. Notice also that the standard deviation of the differences is much smaller than the standard deviations for the separate seed types.

¹W.S. Gosset, "The Probable Error of a Mean," Biometrika, 6 (1908), pp 1–25.
Question: How many fields had above-average results for one kind of seed,
but below-average results for the other kind? Does this suggest why the
differences have a smaller standard deviation?
Some features to notice about the study design:
1. It is experimental, as opposed to observational. The investigators
decided where to plant the two kinds of seed. This is very different
from comparing farms that use kiln drying or not for their own
reasons.
2. It is comparative. We don't just get an estimate of yield for the
kiln-dried seed. We also get comparable yields for the regular seed.
3. It uses local controls. The regular seed is planted in the same fields as
the kiln-dried seed, not in a different set of fields.
The mean difference is about half of a standard deviation. It doesn't look
like there is much evidence that kiln-dried seed has any advantage. But if
there is some systematic difference, how large might it be?
A confidence interval would answer this question. If the sample were large,
we could add and subtract 2 standard errors to the mean difference,
knowing that this interval would cover the population mean difference
about 95% of the time.
A large-sample confidence interval (with 95% confidence) is of the form
x̄ ± 1.96 SE,
where SE is the standard error of x̄. We can generalize this to other levels of confidence, as
x̄ ± zα/2 SE,
where zα/2 is the quantile of the Normal distribution that has a one-tail probability of α/2. The coverage of the interval is 100(1 − α) percent. So if α = 0.05, α/2 = 0.025, zα/2 = 1.96, and the interval has 95% confidence.

5.3 The t-distribution

The assumption underlying the large-sample confidence interval is that the sample mean will fall within zα/2 standard errors of the population mean with probability 1 − α, i.e.

Pr(|X̄ − μ|/SE < zα/2) = 1 − α.    (5.1)
This in turn relies on two assumptions:
1. X̄ follows a normal distribution;
2. SE is the true population standard error.
If the sample is large, the central limit theorem makes the first assumption work. If the sample is small, but the individual observations are approximately normal, then the first assumption is still fine.
Gosset was concerned with the second assumption. Because equation 5.1 involves division by the SE, underestimating it might have a big effect. Gosset determined that in small samples, the means appear to vary more relative to an empirical standard error than they would relative to the true standard error. The solution was to develop something like the normal distribution, but a little wider to catch the extra variation. This was called Student's t-distribution.
A small-sample confidence interval could now be constructed as
x̄ ± td,α/2 SE
where td,α/2 is a quantile of the t-distribution with d degrees of freedom, and tail probability α/2.
The new ingredient is d, the number of degrees of freedom. In our simple example we have
d = n − 1.
The degrees of freedom are the number of independent observations contributing to the estimation of the standard error. Because this involves deviations from the mean, which must sum to zero, once we know n − 1 deviations, we also know the last.

In more complex problems, we will encounter slightly more involved
expressions for degrees of freedom. Typically, it is the number of
observations less the number of means (or parameters) being estimated.
How much difference does it make? That depends on the degrees of freedom, hence on the sample size. The table below gives the .975 quantile of Student's t-distribution for a variety of degrees of freedom. As the degrees of freedom grow, the t-quantile approaches 1.960, the corresponding quantile for the normal distribution.
 df      t
500  1.965
100  1.984
 50  2.009
 20  2.086
 10  2.228
  5  2.571
  3  3.182
  2  4.303

Table 5.2: .975 quantiles of Student's t-distribution. Compare to 1.960 for the Normal distribution.
In Student's example, we have 10 degrees of freedom, so we need to add and subtract 2.228 standard errors instead of 1.96. Our interval needs to be about 14% larger, due to the limited sample size. The confidence interval for the mean difference, i.e. the effect of kiln-drying, should be
33.7 ± (2.228)(20)
i.e. (−11, 78). The effect might be to add as much as 78 lbs/acre, but it might have no effect, or a slight harm.
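This arithmetic is easy to check (a Python sketch using scipy.stats; 20 is the rounded standard error 66.2/√11):

```python
import math
from scipy.stats import t

mean_diff, sd_diff, n = 33.7, 66.2, 11
se = sd_diff / math.sqrt(n)                  # about 20

# 95% CI using the t quantile with n - 1 = 10 degrees of freedom
q = t.ppf(0.975, df=n - 1)                   # 2.228
lo, hi = mean_diff - q * se, mean_diff + q * se
print(round(lo), round(hi))                  # -11 78
```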

5.4 Two Independent Samples

Samuels & Witmer² present a pair of histograms showing the distribution of hematocrit level (%) in two samples of 17-year-old American youths.

       Males (Y)   Females (X)
Mean      45.8        40.6
SD         2.8         2.9
n          489         469

Table 5.3: Summary statistics for hematocrit measurements (units are percent).
The distribution for males seems to be shifted upward from that of females, but without any increase in variation.
The sample mean for males exceeds that of females by 5.2 (percent). How accurate is that, as an estimate of the difference between the population means for males and females? We could put separate confidence intervals around the two means, but that wouldn't really answer the question about the size of the difference. We need the standard error for the difference of the sample means.

5.4.1 Standard Error of a Difference

We know that the standard error for males is

SE(Ȳ) = 2.8/√489 = 0.1266

and the standard error for females is

SE(X̄) = 2.9/√469 = 0.1339.

²Statistics for the Life Sciences (third ed), 2003
To get the standard error of a difference (or a sum) we can combine the standard errors by adding squares, like in the Pythagorean theorem:

SE(Ȳ − X̄) = √(SE(Ȳ)² + SE(X̄)²).

For the hematocrit data, this is

√(0.1266² + 0.1339²) = 0.184.

We can now construct a 95% confidence interval for the difference,

(45.8 − 40.6) ± 1.96(0.184) = 5.2 ± 0.36 = (4.8, 5.6).

The large number of observations justifies the use of 1.96, the Normal quantile, as the multiplier of the standard error.
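A sketch of the same Pythagorean combination in Python, using the summary statistics from table 5.3:

```python
import math

se_m = 2.8 / math.sqrt(489)      # SE for males
se_f = 2.9 / math.sqrt(469)      # SE for females

# SE of the difference of two independent means: add squares, then take the root
se_diff = math.sqrt(se_m**2 + se_f**2)
print(round(se_diff, 3))                         # 0.184

lo = (45.8 - 40.6) - 1.96 * se_diff
hi = (45.8 - 40.6) + 1.96 * se_diff
print(round(lo, 1), round(hi, 1))                # 4.8 5.6
```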

5.4.2 Pooled Standard Deviation

If we let σx and σy denote the population standard deviations, then the standard errors for the two means are σx/√nx and σy/√ny, and the standard error of the difference of two independent averages is, using this slightly more detailed notation,

SE(Ȳ − X̄) = √(σy²/ny + σx²/nx).    (5.2)

If we can assume that both populations have the same standard deviation, i.e. σx = σy = σ, then we can combine all of the deviations from the group-specific means to form a pooled estimate of σ. Let's call the estimate σ̂. We can use it in equation 5.2, replacing both σx² and σy² by σ̂².


It is easier to work with variances, which are just the squares of standard deviations. The pooled variance estimator is a weighted average of the two variances, with the weights being nx − 1 and ny − 1, the respective degrees of freedom:

σ̂² = [(nx − 1)sx² + (ny − 1)sy²] / (nx + ny − 2).

For the hematocrit data, we get

σ̂² = [(468)2.9² + (488)2.8²] / (468 + 488) = 8.12

so σ̂ = 2.85, which is right between the two sample standard deviations, as we might expect.
The standard error of the difference of means is now a little simpler, since we now get to move the common standard deviation out from under the square root, i.e.

SE(Ȳ − X̄) = σ̂ √(1/nx + 1/ny).

For the hematocrit example, this is

SE(Ȳ − X̄) = 2.85 √(1/469 + 1/489) = 0.184.

This is the same as we got in the previous section, when we did not assume
equal standard deviations. That is because the empirical standard
deviations were almost identical. The pooled variance can be useful, for
example, if you have a large sample from a reference group, and a very
small sample from an experimental group, which seems to have an
implausibly small standard deviation. The idea of a pooled estimate of
standard deviation will come up again in the context of regression models,
and experiments with several groups. It is also common to assume equal
standard deviations when testing the effect of a treatment. Under the null
hypothesis of no treatment effect, we expect that standard deviations as
well as means will be equal.
The confidence interval for μy − μx is simply

(Ȳ − X̄) ± t SE(Ȳ − X̄)

where t is the quantile of the t-distribution with nx + ny − 2 degrees of freedom. In the example, this is

(45.8 − 40.6) ± 1.96(0.184)

or (4.8, 5.6), as before.
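The pooled calculation can be sketched in Python, again using the hematocrit summary statistics:

```python
import math

nx, sx = 469, 2.9     # females
ny, sy = 489, 2.8     # males

# Pooled variance: weighted average of the two sample variances,
# with degrees-of-freedom weights
var_pooled = ((nx - 1) * sx**2 + (ny - 1) * sy**2) / (nx + ny - 2)
sd_pooled = math.sqrt(var_pooled)
print(round(sd_pooled, 2))                       # 2.85

# SE of the difference, with the common SD factored out of the root
se_diff = sd_pooled * math.sqrt(1 / nx + 1 / ny)
print(round(se_diff, 3))                         # 0.184
```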

5.4.3 Example: Diet Restriction

Figure 5.2 gives boxplots for two groups of rats in a study of the effect of calorie restriction on longevity. Both groups received a normal diet before weaning, but one had a calorie-restricted diet after weaning. The sample statistics are tabulated below.

       N/N85   N/R50
n        57      71
mean   32.7    42.3
SD     5.13    7.77
In this case it is probably safer to avoid assuming a common standard deviation. That means calculating the standard error of the difference of means using equation 5.2, but it also means that we should invoke one more conservative measure, and use a reduced number of degrees of freedom. This adjusted version of the t-test to accommodate unequal standard deviations (i.e. unequal variances) goes by the name of Welch's test. In R, it is the default procedure when you call the t.test function.
Rather than give the rather tedious formula for adjusted degrees of freedom, it is easier to just see how to compute the test from raw data. We begin with the data in two columns of a file named diet2.csv.
> df = read.csv("diet2.csv", header=TRUE)
> summary(df)
    LIFETIME        DIET
 Min.   :17.9   N/N85:57
 1st Qu.:32.4   N/R50:71
 Median :37.4
 Mean   :38.0
 3rd Qu.:45.3
 Max.   :51.9

Figure 5.2: Length of life, in months, for rats fed 85 kcal/wk after weaning (N/N85) and for rats fed 50 kcal/wk after weaning (N/R50).
Looking at the data via summary is always a good idea.
The actual test can be invoked like so.
> t.test(LIFETIME ~ DIET, data=df)
Welch Two Sample t-test
data: LIFETIME by DIET
t = -8.39, df = 122, p-value = 1.017e-13
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.87 -7.34
sample estimates:
mean in group N/N85 mean in group N/R50
               32.7                42.3

The tiny p-value tells us that the difference is highly significant, i.e. unlikely
to be due to chance variation. The confidence interval tells us that the
N/N85 group lives, on average, 7 to 12 months less than the N/R50 group.
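The same Welch test can be reproduced from the summary statistics alone (a Python sketch; scipy's ttest_ind_from_stats with equal_var=False performs the Welch version):

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics from the diet-restriction table above
res = ttest_ind_from_stats(mean1=32.7, std1=5.13, nobs1=57,
                           mean2=42.3, std2=7.77, nobs2=71,
                           equal_var=False)      # Welch's test
print(round(res.statistic, 2))                   # about -8.38
print(res.pvalue < 1e-10)                        # True: highly significant
```

The statistic differs from the R output only in the last digit because it is computed from rounded summary statistics rather than the raw data.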

5.5 One-sided versus two-sided

The significance probability, or p-value, computed in the diet example is a two-sided probability. We observed a difference of 42.3 − 32.7 = 9.6 months of average lifetime. The p-value is the probability of observing a 9.6 month advantage for either group in a similar experiment, if in fact there is no real advantage for either group. The two-sided significance probability tells us how easy it would be to erroneously conclude that the diet matters, one way or another, when in fact it doesn't matter at all. For this reason, referees and editors often demand that p-values be two-sided.
Formally, if we want to compare two populations (e.g. differently treated rats), focusing on the difference of means, μx − μy, we can test the null hypothesis,
H0: μx − μy = 0
against either a two-sided alternative,
HA: μx − μy ≠ 0
or against a one-sided alternative,
HA: μx − μy > 0,
or
HA: μx − μy < 0.
For a two-sided test, we calculate the probability that the absolute value of the difference of sample means is as large as our observed difference, i.e.
Pr(|X̄ − Ȳ| > |x̄obs − ȳobs|)
where x̄obs and ȳobs are the observed sample means, while X̄ and Ȳ refer to the corresponding random variables.


We accomplish this by expressing the difference in standard error units, so we can look up the probability using Student's t-distribution,

Pr(|X̄ − Ȳ|/SE > |x̄obs − ȳobs|/SE) = Pr(|T| > |x̄obs − ȳobs|/SE).

In the diet restriction example, we calculate
Pr(|T| > 8.39) = 1.0 × 10⁻¹³
which is so tiny that the one-sided versus two-sided issue doesn't matter.
Although there can be technicalities, it is usually reasonable to think of a two-sided p-value as twice as large as a one-sided p-value. Removing the absolute values, and possibly changing the inequality in the above equations, gives the corresponding formula for one-sided p-values.
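The doubling relationship is easy to see numerically for the diet example's statistic (a Python sketch; 8.39 and df = 122 are taken from the Welch output above):

```python
from scipy.stats import t

one_sided = t.sf(8.39, df=122)       # Pr(T > 8.39), upper tail only
two_sided = 2 * one_sided            # Pr(|T| > 8.39), about 1e-13
print(two_sided / one_sided)         # 2.0
```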
There are two major situations where a one-sided test is appropriate. One
is where there is a clear, a priori intention to test only a one-sided
alternative. The other situation is when the aim of the test is to choose
between one or the other of two alternative treatments or groups, and their
exact equality is not plausible. In that case, we needn't worry about the
chance of concluding that there is a difference when there is none. The
one-sided p-value tells us the chance of making the wrong choice in the
limiting case of a small difference. The confidence interval is likely to be
useful in such cases.

5.6 Computing

Calculating quantiles of Student's t-distribution is similar to calculating quantiles of the Normal distribution.
In Excel, entering =TINV(0.05,10) will produce 2.228, the appropriate quantile for a 95% confidence interval, based on 10 degrees of freedom. Note that the probability argument is the error in both tails of the interval. The second argument is the degrees of freedom.
The qt function in R returns the value of t with a specified probability
below. For example, this code,
> options(digits=4)

150CHAPTER 5. ESTIMATION & TESTING USING STUDENTS TDISTRIBUTION


> qt(.025,10)
[1] -2.228
> qt(.975,10)
[1] 2.228
shows the values of t that cut off probabilities of .025 to the left and to the
right.
In R Commander (we'll come to that in the exercises), you can choose Distributions, Continuous distributions, t distribution, t quantiles, then fill in the desired one-sided tail probability and degrees of freedom.
At the R command line, you can get a little fancier, and calculate a confidence interval from a mean and standard deviation like so:
> options(digits=3)
> 33.7 + 66.2/sqrt(11) * qt(c(.025, .975), 10)
[1] -10.8 78.2
The c(.025, .975) function call just collects the two probabilities into a vector. Giving this to the qt function returns the negative and positive multiples of the standard error:
> qt(c(.025, .975), 10)
[1] -2.23 2.23

5.7 Exercises (not turned in)

Get the file diet2.csv from the course web page, http://www.infosci.coh.org/jal/class/index.html, and save it in a convenient folder.
In GraphPad/Prism
Start Prism, and select column data, choosing whatever kind of plot you prefer. Open the data table. Open diet2.csv in Excel, and paste the data for the two treatments into two separate columns in Prism. A figure should plot automatically when you click on the graphics tab. Your choices on the left should also give you a menu where you can ask for a t-test.


In R Commander
Start R, and type library(Rcmdr) to start R Commander. (Note: If you
have already opened and closed R Commander in this R session, you may
need to type Commander() to start the GUI program a second time.)
In the R Commander window, using the File button in the upper left, select Change working directory, and set the directory to the location where you have saved the data file, diet2.csv.
At the Data pull-down, select import data, then from text file. Change the
delimiter to comma. You can enter a simple dataset name, e.g. diet, if you
like. The rest of the defaults should be fine. This reads the data file into R.
From the Statistics pull-down menu, choose means, then Independent
samples t-test. Look over the form, but the defaults should be appropriate.
Clicking on the OK button should produce the same Welch t-test as in the
example above.
Try the graphics pull-down, and see if you can get a boxplot of the two
groups.
The REG and KILN variables are available in a text file named student.txt on the course web page. Try reading the data as above. You will need to set the delimiter to blank space, not commas, as this is not a .csv file.
From Statistics, summaries, numerical summaries you can get the mean and standard deviation. Try selecting Statistics, means, paired t-test.
The R Commander script window displays the commands that are being
used, in case you would like to see how things work. These commands can
be saved to make a program that documents exactly what was done, and
that can be re-executed.


5.8 Homework Exercises (Problem Set 3)

Exercise 5.8.1 (Darwin) Charles Darwin conducted an experiment comparing cross-fertilized plants to self-fertilized plants. In each of 15 pots, he planted one cross-fertilized plant and one self-fertilized plant, which germinated at the same time. The plants were measured at a fixed time after planting, and the differences in height between the cross- and self-fertilized plants were recorded in eighths of an inch. The 15 differences (diff) are listed below.
diff: 49 -67 8 16 6 23 28 41 14 29 56 24 75 60 -48

(a) Calculate a 95% confidence interval for the mean difference.
(b) Does your interval include zero?
Exercise 5.8.2 (Virus growth) Obtain the file virus.csv from the course website. It contains data (S&W, example 9.10) from a series of experiments in which two strains of mengovirus were grown on mouse cells. Replicate experiments were run on 19 days. Each number represents the growth of the virus in 24 hours. We would like to test the null hypothesis that the two strains have the same underlying growth rate (same population mean), against the alternative hypothesis that there is a difference in growth rates, on average, between the two strains. You can read the data into R Commander, or paste the data into Prism, or use any other tool you like.
1. What computer program(s) did you use?
2. Are these paired data?
3. Should you use a one-sided or two-sided test? Why?
4. What is the significance probability (p-value) for the test?


5. What is the standard deviation for the non-mutant strain? For the mutant strain? For the differences?
6. What is the standard error of the difference of means?
7. Describe your conclusions in one or two clear sentences.

Exercise 5.8.3 (E. coli) E. coli were incubated for 24 hours in sterile
water or in a non-antibacterial soap solution. A fixed volume was plated on
a medium in petri dishes, and colonies counted. The results are given below
for 8 control, and 7 soap solution dishes.
Control: 30 36 66 21 63 38 35 45
Soap:    76 27 16 30 26 46 6
(a) Are these paired data?

(b) Calculate a 95% confidence interval for the mean number of colonies
from the control.

(c) Calculate a 95% confidence interval for the mean number of colonies
from the soap solution.



(d) Compute the standard error of the difference of means.

(e) Calculate a 95% confidence interval for the difference of means.

(f ) Is there convincing evidence that the soap solution reduces the growth of
E. coli? Cite a result calculated from the data to support your answer.

Chapter 6
Comparison Examples

6.1 Example: Genetics of T-cell Development

The non-obese diabetic (NOD) mouse is a model of auto-immune diabetes,
but also relevant to T-cell lymphoma. Mary Yui, at Caltech, has studied
the abrogation of a checkpoint in early T-cell development in an NOD
model. Rag1 knock-out mice lack the ability to rearrange and express a
T-cell receptor. This would ordinarily cause immature T-cells to arrest
development at the β-selection checkpoint. However, the NOD genetic
background disrupts β-selection checkpoint control. In NOD.Rag1-/- (NOD.Rag)
mice, aberrant breakthrough CD4+CD8+ double-positive (DP)
cells spontaneously appear in the thymus between 5 and 8 weeks of age.
When the Rag1 knock-out is bred onto a normal genetic background, such
as that of the C57 Black 6 strain (B6), no such aberrant cells appear. A
genetic cross was used to identify the location of genes in the NOD
background that are responsible for the checkpoint defect.
The file CD4two.csv, available on the course webpage, contains a subset of
data from an experimental backcross. Inbred NOD.Rag and B6.Rag mouse
lines were crossed, and the hybrid (F1) offspring were backcrossed to the
NOD.Rag line. One of the variables is a measure of CD4 positive T-cells
based on flow cytometry. CD4 can serve as a marker for cells that have
inappropriately passed the developmental checkpoint of interest. The file
has three variables that record the genotype at three respective loci. The
names of these variables begin with X04, X13 and X14, indicating the
chromosome. These are a selection from 127 markers spaced throughout the
genome. Each of these variables is coded using N to denote the NOD
homozygous genotype, and H to denote the heterozygous genotype. We will
explore these data, and test the association of each of the three loci with
the level of CD4 expression.

6.1.1 Computing in R Commander

Read the dataset


Obtain the file CD4two.csv from the course website,
http://www.infosci.coh.org/jal/class/index.html.
Save it in a convenient folder, such as C:\Biostat.


Start R, and type library(Rcmdr) to start R Commander.


Set the working directory to your folder, using the File menu.
Read the data using the Data pull-down menu, selecting import data then
from text file . . . . On the data input form, name the dataset CD4, and
select comma as the file separator (.csv files are comma-separated).
Inspect the dataset
Visually inspect the data, using the View data set button. Note the variable
names, and whether each is numeric or categorical. Note that there are a
few missing values, denoted NA.
Summarize each variable: From the Statistics pull-down, select Summaries
/ Active data set. How many animals are represented in the file? Which
variables have missing data?
Plot CD4 levels by each genotype, and by sex. From the Graphs pull-down,
select Strip chart. Select the first genetic marker, (beginning X04) as the
factor, and select CD4 as the response variable. Select Jitter and click on
the OK button. This should produce a plot showing individual points. Do
the same for the other two genetic markers. Can you see any apparent
association between CD4 levels and any of the genotypes? How might the
plot be improved?
A computed variable
Let's repeat these plots, using a logarithmic scale for CD4 levels. One way
to do this is to create a new variable with the logarithms of the CD4
variable. (First, what is the minimum CD4 level? If this were zero, we
would have a problem taking logarithms.)
From the Data pull-down, select Manage variables in active dataset, then
Compute new variable. In the New variable name box, type logCD4. In the
Expression to compute box, type log(CD4), then click OK. This should
create a new variable in the active dataset.
Generate strip plots using the log of CD4 levels.
Alternative ways of using logs
Instead of creating a new variable, we could just go back to the code for a
stripchart in the script window and change CD4 to log(CD4), then highlight
the whole stripchart command, and press the Submit button.


If we want to keep the labeling of the chart in the original units, we can
leave the CD4 variable as is, but add log="y" as an additional argument to
the stripchart command. We then highlight and submit, as before.
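Assembled as a script, the two alternatives look like this. This is only a sketch with made-up data; the real dataset and its column names (abbreviated X04... here) come from CD4two.csv on the course site.

```r
# Toy stand-in for the CD4 dataset; the real column names differ.
set.seed(42)
CD4 <- data.frame(X04 = rep(c("N", "H"), each = 15),
                  CD4 = exp(rnorm(30, mean = 2)))

pdf(tempfile(fileext = ".pdf"))  # draw to a file; omit this when plotting on screen
# Alternative 1: transform the variable in the formula
stripchart(log(CD4) ~ X04, data = CD4, method = "jitter", vertical = TRUE)
# Alternative 2: keep original units, but use a log-scaled axis
stripchart(CD4 ~ X04, data = CD4, method = "jitter", vertical = TRUE,
           log = "y")
dev.off()
```

Both plots show the same comparison; only the axis labeling differs.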
Boxplots
Try producing boxplots using a log scale, with the vertical axis labeled in
original units. (On the boxplot menu, choose No identification of outliers.)
Testing Association
From the Statistics pull-down, select Means then Independent samples
t-test. Select a genetic marker as the groups variable, and logCD4 as the
response. The defaults should be fine, so click on OK. How should we
interpret the p-value? Can you interpret the confidence interval as a range
of multiplicative effects?
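The multiplicative reading works because a difference of log means is the log of a ratio of geometric means, so exponentiating the interval endpoints gives a range of ratios. A toy sketch of the idea (made-up data; the real analysis uses the CD4 dataset):

```r
set.seed(1)
geno <- rep(c("H", "N"), each = 20)                        # hypothetical genotypes
cd4  <- exp(rnorm(40, mean = ifelse(geno == "N", 3, 2.3))) # log-normal CD4 levels

tt <- t.test(log(cd4) ~ geno)   # Welch two-sample t-test on the log scale
tt$p.value                      # evidence of a difference in log means
exp(tt$conf.int)                # back-transformed: a range of multiplicative effects
```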

6.1.2 Interpretation

Each of the genetic loci separates the same CD4 data into two groups. We
can look for an association by comparing the CD4 levels in these two
groups. For a neutral gene, we would expect both genotypes to have an
equal chance of appearing in the mice with high CD4 levels, so we would
expect no difference, on average.
Should we be interested in testing or estimation? In this study, elevated
CD4 levels serve as a flag for abrogation of a developmental checkpoint.
The level of CD4 expression may crudely reflect how many cells have
broken through, and how far through their developmental program they
have progressed, and so on. However, it is the fact that a mouse has cells
breaking through the checkpoint that is of interest, so we are mainly
interested in testing, for each gene, the hypothesis of equal mean CD4 levels
versus unequal levels.
How small must the p-value be to be convincing? The same test will be
applied to 127 genetic markers, but these effectively represent the entire
mouse genome. Because markers within 50 cM on the same chromosome
are not stochastically independent, there is a limit to how much
independent testing can be done. With a very dense set of markers, if we
find an association between genotype and phenotype, we will generally find
that neighboring markers are also associated with the phenotype, because


nearby markers are only occasionally separated by cross-over events.


Lander and Kruglyak[1] suggest p < 1.0 × 10^-4 should be regarded as
statistically significant in the setting of a mouse backcross. In ordinary
scientific work, when a single null hypothesis is being tested against a single
well-motivated alternative, p < 0.05 or p < 0.01 might be convincing. In
case-control genetic studies of humans, where even nearby genes may
exhibit independent segregation, we may need p-values that are smaller by
several orders of magnitude.
Is this really how it's done? Almost, but not quite. Genetics has a lot of
specialized methods. The usual approach to experimental crosses like this is
called interval mapping[2], which has an extensive literature and specialized
software. This refinement essentially accomplishes two things. First, it
permits testing for association at loci in between the actual markers. If the
markers are reasonably dense, the flanking markers will also be associated
with the phenotype, so this is perhaps a modest gain of efficiency. Second,
the same ability to interpolate between markers permits the method to
patch-over missing data by drawing on the information from flanking
markers. This is another modest gain in efficiency, depending on how much
of the genotype data is missing. Beyond these statistical efficiencies, interval
mapping permits the orderly processing of large batteries of markers, and
results in a nice plot that traces the evidence for genetic involvement along
the entire genome. While an investigator would probably turn the analysis
of such data over to a specialist, it is important not to regard the resulting
figures and statistics as anything mysterious. Very simple analysis of key
markers ought to confirm the results, at least in its broad-brush pattern.

6.1.3 Computing with Prism

You can use Prism to make plots and compute t-tests comparing two
genotypes, but there are a few differences. Because Prism expects the data
for the two groups being compared to be in two separate columns, you will
need to sort the data in a spreadsheet, and paste in the data separately for
each genetic marker. There is also a limit to Prisms ability to compute
small p-values, limiting the value of its calculations in this setting.
[1] Lander and Kruglyak, 1995, Nature Genetics, 11:241-7.
[2] Lander and Botstein, 1989, Genetics, 121:185-199.


Open the dataset in Excel, and sort the dataset by the first gene marker.
Be sure to highlight the whole dataset before sorting. If you were to sort
one column separate from the others, the integrity of the dataset would be
ruined.
Open Prism, and select column data.
Paste the CD4 values for the H genotype (of the sorted variable) into a
column of the Prism data table. Label it H04 (assuming it is the marker on
chromosome 4). Paste the CD4 values for the N genotype into a second
column, and label it something like N04.
Transform the data. Use the Analyze button, select Transform and OK.
Click on Transform Y values using, and select Y = log(Y). Click the New
Graph button and click OK.
Go back to the Analyze button, and select t tests (under Column analyses).
In the Parameters window, next to Test Name, select Unpaired t test with
Welch's correction and click OK. You should get the same t-statistic as in
R, but compare the p-values.
Examine the plot.
Repeat for a different marker.

6.2 Tests in General

A hypothesis test involves the following:


1. A null hypothesis;
2. A test statistic;
3. A probability calculation.
The null hypothesis may be
plausible, e.g. a treatment may plausibly have no effect at all, or
dividing, e.g. two vaccines are expected to yield different average
antibody titers, but we can't predict which will yield the larger mean
titer.


The test statistic orders all the possible samples, telling us which samples
are more consistent with the null hypothesis, and which are more in conflict
with it.
The probability calculation is made under the assumption that the null
hypothesis is true. There are two kinds.
A p-value is the probability, under the null hypothesis, of getting data as
extreme as the data we actually observed.
A critical value is a pre-specified value of the test statistic, such that the
probability of a test statistic as large as the critical value is α, under
the null hypothesis. Note that α is specified in advance.

6.2.1 A Lady Tasting Tea

In a version of Ronald Fisher's famous example, a lady is given 5 pairs of
cups, and points out, for each pair, which cup received milk before tea.
Null Hypothesis: She is guessing, so her success probability is 0.5 on
each trial.
Test Statistic: Her number of successes in 5 trials. Bigger counts are
more evidence against the null hypothesis.
P-value: Having observed 5 successes, we calculate the probability of 5
(she can't get more) under the null hypothesis. This is
(1/2)^5 = 0.03125.
If we specify that we would only be impressed by an outcome with less than
5% probability if shes guessing, then we would only be impressed by 5 out
of 5, because there is a 19% chance of getting 4 or more correct by guessing.
So 5 would be our critical value.

6.2.2 Tests and Confidence Intervals

A 95% confidence interval consists of all those values of a parameter that
would not be rejected at the 0.05 level, if each were to be considered as a
null hypothesis.


We can also calculate a one-sided confidence bound, shown here using R:


> binom.test(5, n=5, p=.5, alternative="greater")
Exact binomial test
data: 5 and 5
number of successes = 5, number of trials = 5, p-value = 0.03125
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
0.5492803 1.0000000
sample estimates:
probability of success
1
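The lower confidence bound can be checked by hand: it is the success probability whose chance of producing 5 successes in 5 trials is exactly 5%, i.e. the solution of p^5 = 0.05.

```r
0.05^(1/5)   # matches the lower limit reported by binom.test, 0.5492803
```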

6.3 t Test

Confidence interval:

(ȳ1 − ȳ2) ± t(df, .025) × SE(ȳ1 − ȳ2)

Hypothesis test: we reject H0: μ1 = μ2 at the α = 0.05 level when

|ȳ1 − ȳ2| / SE(ȳ1 − ȳ2) ≥ t(df, .025).

Note that these use the same ingredients.
Note the use of absolute values, and α/2 in the t critical value. This is for a
two-sided test, that will reject the null hypothesis for a large difference in
either direction.
Note that the presence of the hypothesized parameter value in a
(1 − α) × 100% confidence interval is equivalent to a hypothesis test at the
α level of significance (with α selected in advance).
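The duality is easy to demonstrate numerically with any two samples (made-up data here): the 95% confidence interval excludes zero exactly when the two-sided p-value falls below 0.05.

```r
set.seed(2)
y1 <- rnorm(12, mean = 1)   # sample from group 1
y2 <- rnorm(12, mean = 0)   # sample from group 2
tt <- t.test(y1, y2)        # Welch t-test with a 95% CI

excludes.zero <- tt$conf.int[1] > 0 || tt$conf.int[2] < 0
excludes.zero == (tt$p.value < 0.05)   # the same SE and df drive both, so this is TRUE
```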

6.3.1 Interpretation of α

If H0: μ1 = μ2 is true, and many tests were done at the α = 0.05 level,
we would expect


95% would not reject H0,
2.5% would wrongly conclude μ1 > μ2, and
2.5% would wrongly conclude μ1 < μ2.
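This long-run interpretation can be checked by simulation (a sketch; the exact proportion will vary slightly with the seed):

```r
set.seed(3)
# Two samples drawn from the SAME normal population, so H0 is true
pvals <- replicate(2000, t.test(rnorm(10), rnorm(10))$p.value)
mean(pvals < 0.05)   # close to 0.05: the type I error rate
```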

6.3.2 Type I and Type II errors


Decision

Do not Reject H0
Reject H0

State of Nature
H0 True
H0 False
Correct Type II Error
Type I Error
Correct

The test size or significance level, α, is the type I error rate we are willing
to risk.
The probability of not making a type II error is the power of the test.
Power is a function: it depends on μ1 − μ2, as well as the standard
deviation and the sample size.
Effect Size:

|μ1 − μ2| / σ

We will consider the calculation of power in a later lecture.

6.4 Example: Inference v. Prediction

Samuels and Witmer (example 7.25) present an example in which serum
Lactate Dehydrogenase (LD) was measured in healthy young men and
women.
      Males   Females
n     270     264
ȳ     60      57
s     11      10

t test:

t = |60 − 57| / √(11²/270 + 10²/264) = 3.3



p ≈ 0.001

95% confidence interval:

(60 − 57) ± 1.96 × √(11²/270 + 10²/264) = (1.2, 4.8)

Any plausible difference is small compared to the standard deviation of an
individual measurement.
Question: Does it make any sense to say men have higher LD than
women? What is the probability that a randomly selected man will have
higher LD than a randomly selected woman?
This is the probability that Y1 − Y2 > 0.
Suppose Y1 − Y2 is approximately normal, with mean difference of 3.3, and
with σ1 = 11 and σ2 = 10. Then

SD(Y1 − Y2) = √(11² + 10²) = 14.87

so

Pr(Y1 − Y2 > 0) = Pr(Z > (0 − 3.3)/14.87) = 0.59

i.e. 41% of the time, it's the woman who will have the larger LD value.
We quote p ≈ 0.001 to support the inference that men have a larger average
LD.
0.59 is the probability of being correct when we predict that an individual
man has higher LD than an individual woman.
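Both probabilities can be reproduced from the summary statistics alone; a sketch in R, using the numbers above:

```r
se.diff <- sqrt(11^2/270 + 10^2/264)  # standard error of the difference of means
(60 - 57) / se.diff                   # t statistic, about 3.3
2 * pnorm(-3.3)                       # two-sided p-value, about 0.001

sd.diff <- sqrt(11^2 + 10^2)          # SD of a single man-minus-woman difference
pnorm(3.3 / sd.diff)                  # Pr(random man exceeds random woman), about 0.59
```

The contrast between the tiny standard error (about 0.9) and the large individual SD (about 14.9) is exactly the inference-versus-prediction distinction.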

6.5 Assumptions

Student's t-test makes several assumptions.


1. The data are stochastically independent.
2. The random variation in the sample means is approximately normal.
3. Sometimes we assume the variances are equal.
4. It makes sense to compare the population means.


The first assumption is crucial. If we want to compare the mean serum
cholesterol levels of non-obese diabetic mice and another strain, it makes no
sense to select one mouse from each strain and measure each 20 times. We
need 20 separate mice, otherwise peculiarities of a single mouse will
influence all 20 measurements. Such peculiarities can arise despite the fact
that mice of a given strain are genetically identical.
The normality assumption applies to the means, not to the original data. If
the sample sizes are substantial, this is often a good assumption, unless the
original data are particularly prone to wild values. If the sample size is
small, this is more of an issue.
We can avoid the assumption of equal variance by using the Welch test, but
we may get somewhat weaker results, and we need to ask ourselves how we
would interpret a situation of unequal variances but equal means.
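In R, this choice is the var.equal argument of t.test; the Welch version is the default. A sketch with made-up data having unequal spreads:

```r
set.seed(4)
a <- rnorm(15, sd = 1)   # group with small variance
b <- rnorm(15, sd = 3)   # group with large variance

welch  <- t.test(a, b)                    # Welch: does not assume equal variances
pooled <- t.test(a, b, var.equal = TRUE)  # classical pooled-variance t-test
welch$parameter   # Welch degrees of freedom, reduced for unequal variances
pooled$parameter  # n1 + n2 - 2 = 28
```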
Probably the most important assumption is that the whole idea makes
sense. Hypothesis tests have the ring of a definitive answer, but the question
they answer, i.e. can chance account for the difference, is sometimes the
right question, and sometimes not. You have to think about it.

6.6 Exercises

Exercise 6.6.1 (CD4) Test each of the three genetic markers in the
CD4two.csv file for association with CD4 T-cell levels. Use the significance
criterion of Lander and Kruglyak. For each marker that you regard as
significantly related to the CD4 variable, make a plot comparing the CD4
levels for the two genotypes.

Exercise 6.6.2 (Notation) Let μ be a population mean.
Let X̄ be the sample mean, considered as a random variable.
Let x̄ be the observed sample mean.

(a) Which has variation described by a standard error?

(b) Which is contained in a 95% confidence interval with a .95 probability?

(c) Which is in the center of the numerical confidence interval that we
report based on a t-distribution?
Exercise 6.6.3 (From summary data) An experiment was done to test
the effect of the psychoactive drug pargyline on the feeding behavior of black
blowflies. The response that was measured was consumption of a sucrose
solution (mg). A summary is given in the table.
      Control  Pargyline
Mean  14.9     46.5
SD    5.4      11.7
n     900      905
(a) Would a pooled standard deviation estimate be appropriate?
(b) Estimate the standard error of the mean for the control group.

(c) Estimate the standard error of the mean for the Pargyline group.

(d) Give a 95% confidence interval for the difference of the means.


Chapter 7
Contingency Tables


In previous chapters we considered comparing two population means using
confidence intervals and hypothesis tests. For quantitative measurements,
our tests were based on the normal distribution if the samples were large,
or on the t-distribution if the samples were small. For binary outcomes, we
could compare success rates using the normal approximation for large
samples, and using exact binomial calculations for small samples. Here we
generalize from binary outcomes to categorical outcomes with more than
two possibilities on each trial.
With more than two categories, there is a rich collection of possible
patterns, so even the analysis of a single sample can be more complex than
the comparison of two means. We will focus first on hypothesis tests, where
we can, to some extent, cut through the complexity by using a
general-purpose test. Like the previous situations, we will have an
approximate method for large samples, and exact methods for small
samples. The large-sample method is called Pearson's Chi-Square test, and
the small-sample methods are Fisher's exact test and its variants.
Even though t-tests and exact binomial calculations are considered
small-sample methods, they can often be used with large samples. With
categorical data, the exact methods applicable to small samples are often
computationally infeasible for even moderately large samples. Some
situations, such as the large sparse tables of mutation spectra, may require
computer simulation methods.

7.1 Chi-square goodness-of-fit test

Consider an experiment in which mice from a diabetes-prone inbred line
(NOD) were crossed with a non-diabetic strain. Mice from the F1, or
hybrid generation were crossed to each other to generate the F2 generation.
The mice that developed diabetes were then genotyped. The table below
gives the distribution of genotypes at the IL2 locus for those mice that
developed diabetes. (The allele from the diabetic strain is denoted D, while
N denotes the allele from the non-diabetic strain.)
Data

Genotype at IL2:   DD   DN   NN
Diabetic Mice:     19   19    2

Table 7.1: 40 diabetic F2 generation mice, with genotypes at the IL2 locus.

We would like to test whether this is a significant departure from
Mendelian proportions, as that may indicate that the IL2 locus, or a nearby
gene, has an effect on susceptibility to diabetes.
Recall that a hypothesis test involves three ingredients:
1. a null hypothesis;
2. a test statistic, to measure the departure from the null hypothesis;
and
3. a significance probability calculation, to determine how likely such a
large (or larger) departure is under that null hypothesis, i.e. how
easily the results can be explained by the play of chance alone.
Let's consider each ingredient in turn.
Null hypothesis
Mendelian segregation predicts that the genotype probabilities are:
Genotype
Probability

DD DN NN
0.25 0.50 0.25

Test Statistic
Multiplying these fractions by 40, the total number of diabetic mice, yields
the expected counts under the null hypothesis. The table below shows the
observed counts above the expected counts.
                      DD   DN   NN
Observed (Diabetic)   19   19    2
Expected (Diabetic)   10   20   10

A test statistic should be something that we can compute for any possible
dataset that we might observe, and it should put all of those possible
datasets in order by measuring the departure from the null hypothesis.
Datasets that are consistent with the null hypothesis should yield small
values for the statistic, and datasets that are relatively inconsistent with
the null hypothesis should yield large values.


How should we measure the departure from the null hypothesis in this
example?
With three genotypes, there are multiple ways that the observed counts
could differ from the expected counts. Possible departures from the null
hypothesis include:
a recessive effect, i.e. an increase in the DD genotype among diabetic
mice;
a dominant effect, i.e. comparable enrichment of both the DD and
DN genotypes among diabetic animals;
an allele-dose effect i.e. a large enrichment for DD, and a lesser
enrichment for DN genotypes among diabetic mice;
a heterozygous effect, i.e. an enrichment for the DN genotype among
diabetic mice;
any of the above, with the enrichment replaced by a deficit.
While it is reasonable to focus attention on one or another of these
alternatives to the null hypothesis, and to develop a test statistic that is
optimized for detecting such patterns, we will consider a general-purpose
test statistic that is capable of detecting, with enough data, any sort of
departure from the null hypothesis.
Pearson's chi-square statistic uses the sum of squared deviations
between observed and expected counts to measure the overall deviation
from the null hypothesis. The squared deviation for each genotype is
divided by the expected count before summing. This accommodates the fact
that, unlike proportions, which converge on probabilities, counts tend to
drift away from their expected values as the number of observations
increases, and the amount of drift tends to be proportional to the expected
count. We could describe a chi-square test in terms of proportions instead
of counts, but the count-based version is quite general, and easier to
remember.
Pearson's chi-square test statistic has the general form

Σ (observed − expected)² / expected


In this example, the sum is over the three genotype categories. If we let Xi
be the i-th count, and ei the corresponding expected count, we have

χ² = Σ (Xi − ei)² / ei

where the sum is over all counts.
Calculating the chi-square statistic for the example, we get

χ² = (19 − 10)²/10 + (19 − 20)²/20 + (2 − 10)²/10
   = 8.10 + 0.05 + 6.40 = 14.55.     (7.1)

Significance Probability
Under the null hypothesis, i.e. if the counts really do have the expected
values specified by the null hypothesis, and if the sample size is reasonably
large, then the distribution of this statistic is well approximated by a
chi-square distribution, which has tail probabilities and quantiles that we
can easily look up using a computer or a table. Like the t-distributions, the
specific distribution within the chi-square family depends on the number of
degrees of freedom, which in this case is one less than the number of
categories, i.e.
df = 3 − 1 = 2.
This is the number of observed counts we need to know before we can
deduce the rest from the total.
Large sample size justifies the chi-square distribution, and the sample is
generally large enough if all of the expected counts are bigger than 5. Note
that this refers to expected counts, not observed counts. We can often get
away with even smaller expected counts, but there are alternative exact
methods that we should use for small samples, if they are computationally
feasible.
The significance probability for the IL2 locus is

Pr(χ²₂ > 14.55) = .0007

where χ²₂ denotes a chi-square random variable with 2 degrees of freedom.
Computing
We can look up the significance probability using a number of tools.


Excel: =CHIDIST(14.55,2), or select CHIDIST from the Insert menu.


R: 1 - pchisq(14.55, 2) or pchisq(14.55, 2, lower.tail=FALSE)
R Commander: Select Chi-squared probabilities from the Continuous
distributions menu. Enter the value (e.g. 14.55), degrees of freedom
(e.g. 2), and check Upper tail.
The instructions above are for looking up the significance probability after
we have already computed the test statistic. We can of course get a
computer to do the whole job for us, starting with the table of counts. In
R, this can be done from the command line (not an R Commander menu)
like so:
chisq.test(c(19,19,2), p=c(.25,.5,.25))
We can also get a computer to tabulate the three counts from the
individual genotype data. However, that is more a matter of data
management, and it depends on the details of how the raw data were
coded, so we won't go into such things.
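As a minimal sketch of that tabulation step anyway, with hypothetical raw genotype calls (one per mouse) chosen to reproduce the observed counts:

```r
# One genotype call per diabetic mouse
geno <- rep(c("DD", "DN", "NN"), times = c(19, 19, 2))
tab  <- table(geno)                        # tabulates to DD=19, DN=19, NN=2
chisq.test(tab, p = c(0.25, 0.50, 0.25))   # same test as from the counts directly
```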

7.2 Comparing 2 Groups: Testing independence of rows and columns

A gene might cause a departure from expected counts without being related
to diabetes. For example, a gene would not follow Mendelian expectations if
it has an allele that reduces the survival chances of a mouse fetus. Such an
effect could, however, be seen in non-diabetic as well as diabetic mice. For
this reason, the NOD intercross experiment included genotyping of a sample
of non-diabetic mice, in addition to the diabetic mice whose genotypes are
given above. The combined data are given in Table 7.2 below.
We can use these data to compare the genotype distribution of the diabetic
mice to that of the non-diabetic mice, rather than comparing them to
Mendelian expectations. Our null hypothesis in this case is that the same
genotype probabilities, Mendelian or otherwise, apply to both the diabetic
and non-diabetic mice. This null hypothesis encompasses any pattern of


            DD   DN   NN
Diabetic    19   19    2
Non-Diab.   10   26    9

Table 7.2: Observed Counts, cross-classified by diabetes and IL2 genotype.


genotype probabilities, as long as that pattern is independent of diabetes.
Rejecting this broader null hypothesis rules out any effect on viability that
does not relate to diabetes, leaving only associations between genotype and
diabetes.
Under the null hypothesis, we could estimate the probability of the DD
genotype by combining all 85 mice, so the expected number of DD
genotypes among the 40 diabetic mice is (29/85) × 40 = 13.65, and the
expected number among the 45 non-diabetic mice is (29/85) × 45 = 15.35. We
can do a similar calculation for the DN and NN genotypes to get a table
of expected counts.
Expected counts:

            DD     DN     NN     Total
Diabetic    13.65  21.18  5.18   40
Non-Diab.   15.35  23.82  5.82   45
Total       29     45     11     85
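The whole table of expected counts can be computed at once from the margins; a sketch (chisq.test performs the same computation internally):

```r
obs <- rbind(Diabetic = c(19, 19, 2), NonDiab = c(10, 26, 9))
colnames(obs) <- c("DD", "DN", "NN")

# expected count = (row total) x (column total) / (grand total)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 2)
```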

Letting xij be the observed counts from the i-th row and j-th column, and
letting eij be the corresponding expected counts, the chi-square statistic is

χ² = Σij (xij − eij)² / eij = 8.07

In this particular example, the calculation is

(19 − 13.65)²/13.65 + (19 − 21.18)²/21.18 + (2 − 5.18)²/5.18
+ (10 − 15.35)²/15.35 + (26 − 23.82)²/23.82 + (9 − 5.82)²/5.82 = 8.07.
There are two degrees of freedom because, given the marginal totals, when
we have two of the six cells in the table, we can get the rest by subtraction.


In general, if we have r rows and c columns, the number of degrees of
freedom for testing independence of rows and columns is (r − 1)(c − 1).
Referring 8.07 to a chi-square distribution with 2 degrees of freedom, we
can calculate

Pr(χ²₂ > 8.07) ≈ 0.02

i.e. we say that the significance probability, or p-value, is about 0.02.


Computing

The chi-square test is most often computed from data organized as a
rectangular table, like the 3 by 2 table above. The Statistics menu in R
Commander has a choice for contingency tables. From there you can choose
Enter and analyze two-way table . . . . You can adjust the number of rows
and columns in the table, then enter the counts, e.g. from the three-by-two
table of observed counts.
At the R command line, i.e. without R Commander, you can use the
chisq.test function. You don't have to supply a vector of expected
counts, but you do have to put the observations into the form of a table or
matrix. There are many ways of doing so. Here is an example using the
data.frame function, which is a fundamental tool worth knowing, and
which produces nicely labeled rows and columns, so you can figure out what
you were doing when you need to look at your output later.
> IL2data = data.frame(DD=c(19,10), DN=c(19,26), NN=c(2,9),
+ row.names=c("diab","non-diab"))
> IL2data
DD DN NN
diab
19 19 2
non-diab 10 26 9
> chisq.test(IL2data)
Pearson's Chi-squared test
data: IL2data
X-squared = 8.0703, df = 2, p-value = 0.01768
In Excel, you can use the CHITEST function, but this requires that you
construct your own table of expected counts.
In Prism, you can select the Contingency tab under New Table &
Graph. Click continue to create an empty table, then fill in the counts,
and click Analyze in the menu at the top. Under Contingency Table
Analyses you can pick Chi-Square (and Fisher exact) test. Clicking OK
will produce the chi-square statistic, degrees of freedom, and significance
probability. Note that the One- or two-sided field says NA. That is
because there are more than two ways to depart from the null hypothesis.

7.2.1 One-sample versus two-sample chi-square tests

There are several points to notice about these two tests of the IL2 locus.
In both examples the chi-square statistic was a weighted sum of
squared deviations from expected, but when we include the diabetic
mice, there are twice as many observed counts to sum over.
Both examples had statistics with 2 degrees of freedom. If we are
given the margins of the observed table, we can only fill in two counts
before we can start deducing the rest by subtraction from the table
margins.
Finally, the comparison to Mendelian theory yielded a much more
convincing rejection of a simpler null hypothesis, but the comparison
of diabetic to non-diabetic results tested a broader null hypothesis,
the rejection of which has stronger implications.
The last of these has effects on the interpretation of the results.
An important detail of this study is that only a minority of the F2
generation mice developed diabetes. The reason for this is that
homozygosity for the NOD allele at the major histocompatibility (MHC)
locus is required for diabetes to develop. A single copy of the alternative
allele at the MHC locus protects against diabetes. That means that 75% of
the mice in the F2 generation cannot get diabetes, regardless of the
genotype at IL2 or any other locus.
The investigators genotyped all of the F2 mice that developed diabetes, and
they all have the permissive DD genotype at the MHC locus, so we can
readily see the impact of other loci. The investigators also genotyped 45
non-diabetic mice, selected from a much larger number. These non-diabetic
mice consist predominantly of the 75% of mice that lack a permissive MHC
genotype, and for these mice, the genotypes at other loci are irrelevant to
diabetes, although they may be relevant to viability. If NOD alleles at the
IL2 locus promote diabetes, we might expect to see an excess of these
alleles among diabetic mice, but we can probably not detect a deficit of
such alleles among the non-diabetic mice, because the mechanism creating
a deficit would only apply to the minority of mice with the permissive MHC
genotype.
The result of these considerations is that the distribution of IL2 genotypes
among diabetic mice reflects the effect of IL2 on the risk of diabetes, but
the distribution of IL2 genotypes among non-diabetic mice will resemble
Mendelian proportions, even if the effect of IL2 is quite strong. So we
expect that any signal of association between IL2 and diabetes will come
from the diabetic mice.
Question: The IL2 genotypes of the non-diabetic mice do indeed resemble
Mendelian proportions rather closely. Why then is the significance of the
association of IL2 with diabetes so much weaker when we include the data
from both diabetic and non-diabetic animals?
When we compare diabetic mice to expected counts, only one set of counts
is random. When we compare diabetic mice to non-diabetic mice, both sets
of counts are random. Even if the observed counts from non-diabetic mice
were exactly equal to the expected counts from Mendelian proportions, the
significance of the comparative test would be weaker (p = 0.015) than the
significance of the one-sample test (p = 0.0007).
The situation might not be so extreme if the cross produced a larger
number of diabetic mice. The enrichment for a genotype among diabetic
mice would then be better matched by the deficit of the same genotype
among non-diabetic mice. However, the comparison of two groups generally
involves more noise than the comparison of one group to a fixed benchmark,
so there will still be a cost for testing the broader comparative hypothesis.


7.2.2 Multiple testing

IL2 was just one of many loci being tested for association with diabetes.
Because markers were being tested throughout the genome, rejecting the
null hypothesis based on p = 0.02 as evidence would yield a lot of false
positives. As mentioned before, Lander and Kruglyak [1] have proposed
appropriate significance levels for genome-wide linkage searches in humans
and mice. For a 2 d.f. test in a mouse intercross, they require p ≤ 5 × 10⁻⁵
for a significant finding. They regard 1.6 × 10⁻³ as suggestive of linkage,
i.e. of possible interest as an exploratory matter, but not reliable or
convincing. The comparison of diabetic mice to Mendelian proportions
meets this lesser criterion, while the comparison of diabetic mice to
non-diabetic mice meets neither criterion.

Lander and Kruglyak give arguments why these stringent whole-genome
criteria should apply even to investigators that test only a small set of
genes. However, they also comment that "Some backsliding might be
countenanced if strong prior evidence exists to restrict the search to a
region."
Ruling out hypotheses
Should we use the comparison of diabetic mice to theoretical expectations,
or the comparison of diabetic mice to non-diabetic (control) mice? Which
result should we believe?
It may help to consider the competing hypotheses that might explain the
variation from Mendelian proportions. Such variation may be due to:
1. random variation,
2. a deleterious allele, or
3. a diabetes risk allele.
We definitely need to reject random variation at a level of stringency
sufficient for a scan of the entire genome. The comparison of diabetic mice
to Mendelian proportions does this, at least at the more relaxed threshold
for a tentative finding.
[1] Nature Genetics, 1995, 11:241-247

Do we need to reject the hypothesis of a deleterious allele at every locus, or
just at the loci where we have rejected chance? If we detect, say, two loci
for which we can reject chance with a significance appropriate to a genome
scan, the more modest significance of the comparative test may allow us to
reject the hypothesis of a deleterious allele at those two loci.

7.3 A 2 by 2 Table

The general form of the chi-square statistic,

    χ² = Σ (observed − expected)² / expected,

can be applied to contingency tables with arbitrary numbers of rows and
columns, so of course it can also be applied to 2 by 2 tables. Here is an
example from a clinical study.
                   Timolol   Placebo
Angina free            44        19
Not angina free       116       128
Total                 160       147

The result is χ² = 9.98 with df = 1.


For 2 by 2 tables, there is another interesting way to calculate χ² that
relates the procedure to z-tests, and allows for confidence intervals.

    p1 = 44/160 = .28
    p2 = 19/147 = .13
    ppool = 63/307 = .205

    z = (p1 − p2) / SE(p1 − p2)
      = (44/160 − 19/147) / √( (63/307)(1 − 63/307)(1/160 + 1/147) )
      = 3.16

Notice that 3.16² = 9.98.

We would refer z = 3.16 to a normal table, but we refer z² = χ² = 9.98 to a
chi-square table. Either way we obtain p = .0016.
The distribution of the square of a normal random variable is chi-square
with 1 degree of freedom.
With a two-by-two table, it is often natural to summarize the results using
a confidence interval for the difference of proportions. We can take the
numerator of the z statistic, plus or minus 1.96 standard errors (the
denominator), to get an approximate 95% confidence interval for the
difference of angina-free proportions under the two treatments. This is
.146 ± 1.96(.046), or (.056, .236). In other words, the improvement with
timolol is likely between 5 and 24 percentage points, but we can rule out
zero with better than 95% confidence.

7.4 Decomposing tables

With tables larger than two-by-two, summarizing the pattern is more
complicated than calculating a confidence interval on proportions.
Estimates of rates relative to row sums, column sums, and grand sum can
help, but the exercise below considers the problem of focusing on the most
important pattern within a contingency table.
Exercise 7.4.1 The data in the Full Table below are from a study
examining the possible association of blood type and two stomach diseases,
peptic ulcer (p) and gastric cancer (g). Controls are labeled (c). The data
are divided into subtables by either omitting or pooling various rows and
columns.

Full Table
      p     g     c
O   983   383  2892
A   679   416  2625
B   134    84   570
chi-sq = 40.54, df = 4, p ≈ 0

Subtable 1
      p   g+c
A   679  3041
B   134   654
chi-sq = ___, df = ___, p = ___

Subtable 2
      g     c
O   383  2892
A   416  2625
B    84   570
chi-sq = ___, df = ___, p = ___

Subtable 3
        p   g+c
O     983  3275
A+B   813  3695
chi-sq = ___, df = ___, p = ___

1. Fill in the chi-square values, degrees of freedom, and p-values.
2. What is the sum of the degrees of freedom for the subtables?
3. What is the sum of the chi-square statistics for the subtables?


4. The overall null hypothesis is that the three blood types occur in the
same proportions among all three groups of people. The extremely
small p-value for the full table indicates that this hypothesis can be
rejected. In what way is the overall null hypothesis violated?
5. Enter the full table in the R Commander contingency table, and check
both Row percentages and Components of chi-square statistic. Are the
resulting tables consistent with or helpful to your interpretation?

7.5 Fisher's exact test (small samples)

Suppose a small randomized experiment involving 20 animals has the
following results:

             Treated   Control
successes        1         5
failures         9         5

Is this enough evidence to conclude that the treatment affects the success
rate?
Suppose 6 animals were destined for success, and the treatment had nothing
to do with it. The randomization might have distributed these 6 animals
differently. Here are all of the possibilities, with their respective
probabilities. Each column below is one possible table, showing
successes/failures for the 10 treated and 10 control animals:

Treated    0/10    1/9    2/8    3/7    4/6    5/5    6/4
Control     6/4    5/5    4/6    3/7    2/8    1/9   0/10
prob.     0.0054  0.065   0.24   0.37   0.24  0.065  0.0054

We calculate the probability of each table using sampling without
replacement. The chance of assigning any given number of animals destined
for success into the treatment group is analogous to drawing balls from a
box. Suppose
there are 6 red balls (representing success), and 14 green balls in a box. If
we draw 10 at random (assignment to treatment), without replacement, the
probability we get exactly one red ball is 0.065. This is called a
hypergeometric probability. We will not examine the probability function
for the hypergeometric distribution, which is slightly more complicated
than the binomial, but there are computer routines to do the calculations.
The following shows a use of the dhyper function in R.
> dhyper(1,6,14,10)
[1] 0.06501548
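If R is not handy, the same hypergeometric probability follows directly from binomial coefficients; here is a small Python sketch (our own dhyper, written to mimic R's argument order):

```python
from math import comb

def dhyper(x, m, n, k):
    """P(X = x) when k balls are drawn without replacement
    from m red ('success') and n green ('failure') balls."""
    return comb(m, x) * comb(n, k - x) / comb(m + n, k)

# 6 success-destined animals, 14 others, 10 assigned to treatment.
probs = [dhyper(x, 6, 14, 10) for x in range(7)]
print([round(p, 4) for p in probs])
# [0.0054, 0.065, 0.2438, 0.3715, 0.2438, 0.065, 0.0054]

# Fisher's p-value: all tables at least as extreme as the observed x = 1.
p = probs[0] + probs[1] + probs[5] + probs[6]
print(round(p, 4))   # 0.1409
```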
The appropriate p-value is the probability of all the tables that are as
extreme as the observed table. That is

    p = 0.0054 + 0.065 + 0.065 + 0.0054 = 0.1408 ≈ 0.14.

This is called Fisher's exact test.
This test can be computationally intensive, so it obviously calls for a
computer. In R Commander, the contingency table form has a check box for
computing Fishers exact test, as well as expected counts. It is a good idea
to use the exact test whenever some of the expected counts are less than 5.
The Chi-square test actually performs reasonably well at somewhat smaller
sample sizes, but this rule of thumb is simple, safe and generally feasible.
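The expected counts behind that rule of thumb are easy to compute: each one is row total × column total / grand total. A hypothetical Python helper makes the check explicit for the experiment above:

```python
def expected_counts(table):
    """Expected cell counts under independence:
    row total * column total / grand total."""
    rowsums = [sum(row) for row in table]
    colsums = [sum(col) for col in zip(*table)]
    grand = sum(rowsums)
    return [[r * c / grand for c in colsums] for r in rowsums]

# Rows are successes and failures; columns are treated and control.
print(expected_counts([[1, 5], [9, 5]]))   # [[3.0, 3.0], [7.0, 7.0]]
```

Two of the expected counts are 3, below the threshold of 5, so the exact test is the safer choice for this experiment.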
> tbl <- matrix(c(1,9,5,5), nrow=2)
> tbl
     [,1] [,2]
[1,]    1    5
[2,]    9    5
> fisher.test(tbl)

        Fisher's Exact Test for Count Data

data:  tbl
p-value = 0.1409
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.00213 1.56584
sample estimates:
odds ratio
     0.125
There are between one and two chances in 10 of getting data as extreme as
the observed data, so few people would regard this as very compelling
evidence for rejecting the null hypothesis. The result could be due to the
way the randomization came out, and nothing more. With only six successes
in 20 opportunities, all of the successes have to be in the same group to get
p < .05.

7.6 Exercises

Exercise 7.6.1 (S&W 10.1) A cross between white and yellow summer
squash gave progeny of the following colors:

Color               White   Yellow   Green
Number of Progeny     155       40      10

Are these data consistent with the 12:3:1 ratio predicted by a genetic model?
Use a chi-square test at α = .10. Report the chi-square statistic, degrees of
freedom, p-value, and conclusion.

Exercise 7.6.2 Using the timolol versus placebo data from the 2 by 2 table
section of the lecture notes:

(A) Enter the timolol versus placebo counts into the contingency table menu
of R Commander, and verify that χ² = 10 (rounded to two digits). Report
χ² to three digits, and p to two digits. (Note that my calculations above may
have rounding error.)

(B) Check the Print expected frequencies box, and report the table of
expected counts.

(C) Write out, clearly, the calculation of the expected count for the upper
left cell of the table, using the marginal totals.


Exercise 7.6.3 An experimental cross was made between the NOD mouse
line, and a congenic line derived from B6, which had an NOD-derived MHC
locus, but was otherwise identical to B6. (A cross of this sort was done by
Linda Wicker and John Todd, but the data for this exercise are
hypothetical.) Suppose that of 100 F2 mice, half were diabetic, and the
results at the IL2 locus were as shown in the following R transcript:
> tbl = data.frame(DD=c(17,3), DN=c(19,21), NN=c(4,16),
+                  row.names=c("diab","non-diab"))
> tbl
         DD DN NN
diab     17 19  4
non-diab  3 21 16
> chisq.test(tbl)

        Pearson's Chi-squared test

data:  tbl
X-squared = 17.1, df = 2, p-value = 0.0001935

Note that you can select the first row of data, and compare it to Mendelian
proportions like so:

> tbl[1,]
     DD DN NN
diab 17 19  4
> chisq.test(tbl[1,], p=c(.25,.5,.25))

        Chi-squared test for given probabilities

data:  tbl[1, ]
X-squared = 8.55, df = 2, p-value = 0.01391
(A) Calculate the analogous test, comparing the non-diabetic mice to
Mendelian proportions. Report the p-value.

(B) Which gives the larger chi-square: the comparison of diabetic mice to
Mendelian expectations, or the comparison of non-diabetic mice to
Mendelian expectations? Why (briefly)?

(C) Does comparing diabetic to non-diabetic mice give a more significant or
less significant finding compared to the comparison of diabetic mice to
Mendelian proportions? How does this compare to the experiment described
earlier, and why?

(D) If you had twice as many diabetic mice, with the same genotype
proportions (just multiply the top row of the table by 2), what would the
chi-square and p-value be for the comparison of diabetic mice to Mendelian
proportions? Does this match anything else calculated for this problem?

Exercise 7.6.4 For the data in the previous problem (full 2-by-3 table),
how does the p-value from Fisher's exact test compare to the p-value from
the chi-squared test?

Chapter 8

Power, Sample size, Non-parametric tests


In this chapter and the next, we will shift attention from the analysis of
data, to the design of studies.
Three major statistical considerations in study design are:
1. Replication;
2. Randomization; and
3. Local control.
In this chapter we will deal with replication, i.e. sample size, and with the
idea of local control in the form of designs that produce paired data. We
will also consider some alternative data analysis methods that require fewer
assumptions. The chapter following this one will deal with the issue of
randomization, and the distinction between experimental and observational
studies.

8.1 Sample Size

The data below are a subset of data from the aconiazide2.txt file, which
we have worked with before. The experiment involved giving various doses
of aconiazide to rats and measuring their weight change (grams), as an
indication of toxicity. Here we consider only two of the five groups, the
control group, and the lowest dose.
       Ctl    Low
       5.7    8.3
      10.2   12.3
      13.9    6.1
      10.3   10.1
       1.3    6.3
      12.0   12.0
      14.0   13.0
      15.1   13.4
       8.8   11.9
      12.7    9.9
Mean  10.4   10.3
SD     4.3    2.7

8.1.1 Sample Size for Confidence Intervals

The observed mean difference is 0.1 grams, very close to zero. The
confidence interval for the difference of means is (−3.3, 3.5).
The margin of error (the distance from the center of the confidence interval
to one of the bounds) is approximately 3.4 grams. Suppose we want more
accurate results, so that we might assert that we have estimated the
difference of means to a margin of error of 2.5 grams. How many more rats
would we need to study?
Let's be a little more general, and let E denote the desired margin of error,
so

    E ≈ z_(α/2) √(σ²/n₁ + σ²/n₂)

where for simplicity we are using z_(α/2), a quantile of the normal
distribution, instead of Student's t-distribution. If we take n₁ = n₂ = n,
then we can write this as

    E ≈ z_(α/2) √(2σ²/n).

Solving for n, we get

    n = (z_(α/2))² σ² (2/E²),

remembering that this is the number per dose group. If we want an
equation for the total sample size, we need to double n, so

    n_total = (z_(α/2))² σ² (4/E²).

To use the equation, let's estimate σ as 3.6, the pooled sample estimate.
Take α = 0.05, so that z_(α/2) = 1.96, and take E = 2.5 as our target margin
of error. Plugging these numbers into the equation and rounding up, we get
n = 16 per group. The increase in accuracy will cost us about six
additional animals per group.

Since this is still a small enough number that the difference between the t
and normal distributions makes some difference, we can be a little more
accurate by estimating that we will have 30 degrees of freedom (n − 1 from
each sample) and replacing our normal quantile, z_(.05/2) = 1.96, by
t_(.05/2,30) = 2.04. This increases n to a little over 17, which we should round
up to 18. Further updating of the degrees of freedom should not matter.
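The arithmetic above is compact enough to script; a minimal Python version (the function name is ours) reproduces both answers:

```python
import math

def n_per_group(sigma, E, quantile=1.96):
    """Per-group n for a two-sample confidence interval with margin of
    error E: n = quantile^2 * sigma^2 * 2 / E^2, rounded up."""
    return math.ceil(quantile ** 2 * sigma ** 2 * 2 / E ** 2)

print(n_per_group(3.6, 2.5))                 # 16, using z = 1.96
print(n_per_group(3.6, 2.5, quantile=2.04))  # 18, using t(.025, 30) = 2.04
```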
We can easily solve the simpler problem of estimating the mean for a single
sample (perhaps of differences in a paired experiment), rather than the
difference of two means from independent samples. We simply omit the
factor of two (or four in the case of n_total) that arose from the standard
error for the difference of two means. So, the sample size to control the
margin of error is, for a one-sample interval estimate of μ,

    n = (z_(α/2))² σ² (1/E²),

and for a two-sample interval estimate of μ_a − μ_b,

    n_total = (z_(α/2))² σ² (4/E²).

In the two-sample problem, the total sample size is doubled once because
there are two samples, and it is re-doubled because the difference of two
means has two sources of error.
Note that we need to increase the sample size if we want to:
1. reduce the margin of error, E; or
2. reduce the error probability, α (increasing the confidence level); or
3. deal with an increase in the standard deviation, σ.
The actual margin of error will depend on the sample standard deviation
that we actually observe. The estimate of σ that we use in planning is only
a prediction. Under-estimation of σ will lead to an under-estimate of the
necessary sample size.

There are three broad approaches to getting tighter confidence intervals:
1. Reduce σ by improving the experiment;
2. Increase n, i.e. do a bigger study; or
3. Use a paired design, which can reduce σ as well as removing the
factor of two penalty for comparing independent samples.

8.1.2 Sample Size and the Power of a Test

Controlling the margin of error is usually only part of the problem. If we
want to assert that the low dose of aconiazide does not plausibly reduce
weight by more than 2.5 grams, we need to do more. If the true difference
of means is μ_a − μ_b = 2.5, then repeated samples would generate mean
differences that vary around 2.5. Sometimes the difference will be a little
smaller, and sometimes a little larger. If all of the margins of error were
exactly 2.5, then the means that were a little high would generate
confidence intervals that exclude zero, while the means that were a little
low would generate intervals that include zero. About half the time we
would fail to detect a significant difference.

Power is the probability of detecting a given departure from the null
hypothesis. Denote this departure by Δ = μ_a − μ_b. We usually focus
attention on a value of Δ that we want to detect with a reasonably high
probability (power), such as 0.9. If we want to detect a 2.5 gram difference
of population means with more than 50% probability, then we clearly need
a sample size greater than 18 per group. The relevant sample size formula
requires one additional ingredient, namely a second normal quantile
representing the desired power. The motivation for this extra ingredient is
depicted in Figure 8.1.

[Figure 8.1: Basis for sample size formulae; reproduced from Hanley and
Moodie, J Biomet Biostat 2011, 2:5. Shown with the usual orientation is the
sampling distribution of the test statistic under the null scenario; shown
upside down, to distinguish its landmarks more easily, is the sampling
distribution under the alternative (non-null) scenario. Although they are
equal in this instance, the standard errors under the null and alternative
will not necessarily be equal. The two distributions are separated by an
amount Δ, the difference in the (comparative) parameter of interest. For a
two-sided test with a false positive rate of α to have at least 100(1 − β)%
power against a non-zero difference Δ, the two absolute distances
z_(α/2)·SE_null and z_β·SE_alt must sum to less than Δ. In this example,
α = 0.05 (so z_(α/2) = 1.96) and 1 − β = 0.8 (so z_β = −0.84).]
If we assume that the standard errors under the null and alternative
hypotheses are the same, we can use a simple extension of the previous
sample size equations. The sample size for power 1 − β, for a
one-sample test of H₀: μ = μ₀ versus H_A: μ = μ₀ + Δ, is

    n = (z_(α/2) + z_β)² σ² (1/Δ²),

and for a two-sample test of H₀: μ_a − μ_b = 0 versus H_A: μ_a − μ_b = Δ
it is

    n_total = (z_(α/2) + z_β)² σ² (4/Δ²).

If we take Δ = 2.5, and β = 0.1 so we have 90% power, and, as above, take
σ = 3.6 and α = 0.05, we get n_total = 88. Substituting t_(α/2,88) for z_(α/2) and
rounding up to an even number results in n_total = 90. Compare this to the
36 total that we need to control the expected margin of error. Controlling
only the margin of error is equivalent to requiring only 50% power. If we
only demand 80% power, we can get by with n_total = 66. Power can be
costly.

8.1.3 Computing

There is a convenient R function for estimating sample size or power for a
t-test. An example is shown below.
> power.t.test(n=NULL, delta=2.5, sd=3.6, sig.level=0.05, power=0.9)

     Two-sample t test power calculation

              n = 44.55904
          delta = 2.5
             sd = 3.6
      sig.level = 0.05
          power = 0.9
    alternative = two.sided

NOTE: n is number in *each* group



Rounding up gives a requirement of 45 per group, or 90 total.
In order to calculate sample size using the power.t.test function, we enter
n = NULL as an argument. The power.t.test function relates the sample
size, effect size, standard deviation, significance level and power. We need
to specify all but one, the missing value being denoted NULL, and that
quantity will be calculated from the others. This permits us, for example,
to calculate the sample size for a specified power, as above, or calculate the
power that we can achieve with a given sample size. We might also enquire
as to how small the difference of means (delta) can be for a given sample
size, or ask how we might benefit in reduced sample size requirements if we
can use a more accurate measurement method that reduces the standard
deviation.

8.1.4 Other Situations

We have just considered some examples of estimating the sample size
necessary for a confidence interval of specified width, or for a hypothesis
test of a given power. In both cases, we assumed that the data analysis
would be based on a t-statistic. If we need to deal with count data, and
plan on using a chi-square test, we would need different sample size or
power formulas. Here is an example of another power and sample size
function in R, this one designed for comparing proportions with count data.
> power.prop.test(n=NULL, p1=.25, p2=.5, power=.8)

     Two-sample comparison of proportions power calculation

              n = 57.67344
             p1 = 0.25
             p2 = 0.5
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group


When the data are counts, power sometimes depends largely on how many
events are observed. One obviously needs to follow many individual mice,
people, or cells, if one is looking for rare events. When people are studied,
one often uses a retrospective design, identifying cases by some kind of
screening that can be applied very broadly, then obtaining detailed
measurements on those people exhibiting the rare event. For comparison,
one selects a subset of control subjects from the large number without the
event of interest. This is called a case-control design. A similar strategy
was used in an earlier example of the NOD × B6 mouse cross and diabetes.
All diabetic mice were genotyped, but only a subset of the much more
numerous non-diabetic mice were genotyped.
There are many computer programs for statistical sample size calculations.
As with any statistical software, it is important to understand exactly what
the inputs mean, and what is being calculated. When planning a major
research effort, a good reality check is to construct artificial data and try
out the planned analysis method to see if effects of a size that you regard as
interesting are in fact detectable.
Planning is, in general, a very approximate matter that requires estimating
(or guessing) quite a number of quantities. As dubious as it may be,
approximate planning is usually preferable to no planning.

8.2 Paired Design

We have already seen examples of paired experiments. Gosset's analysis of
kiln-dried corn seed, and Darwin's cross-fertilized versus self-fertilized plant
experiment both involved applying the same pair of treatments in each of
several homogeneous units (fields or pots). Now let's look at why this is
often a good design.
Results of an experiment on the growth of a virus (Samuels and Witmer,
Example 9.10) are shown below, followed by an interesting way of looking
at differences. The data are from experiments on wild-type and
interferon-sensitive mengovirus [2]. The measurements are in plaque-forming
units, determined by serial dilution, so the final zeros of the three digit
numbers are not significant.

[2] Journal of General Virology, 64:1543


The plot illustrates the effectiveness of pairing, and how that effectiveness
depends on the correlation of the pair of variables. The variance (and SD)
of the differences is considerably smaller than the variance of either
individual variable. Pairing is primarily used to reduce variation. Pairing
might involve splitting homogeneous specimens, or it might involve finding
pairs of subjects that are well-matched on some set of variables.
With a positive correlation, differences will have small variation, and
pairing is advantageous. If we calculate a paired t-test for the virus growth
data (the correct analysis) and compare that to the incorrect use of an
unpaired t-test, we will see that the reduced variation due to the paired
design permits a highly significant finding, but if the pairing is ignored, no
significant difference is found.


The effectiveness of pairing does depend on the correlation. If there is no
correlation, or a negative correlation between the pair of variables, then a
paired analysis may actually have worse precision than a two-sample
comparison.
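A small simulation (entirely hypothetical numbers, not the virus data) makes the point concrete: when both members of a pair share a large common component, that component cancels in the differences.

```python
import random
import statistics as st

random.seed(1)
# Each pair shares a 'pair effect' with SD 3; each measurement adds its
# own noise with SD 1, and treatment B adds a true shift of 1 unit.
shared = [random.gauss(10, 3) for _ in range(50)]
a = [s + random.gauss(0, 1) for s in shared]
b = [s + 1.0 + random.gauss(0, 1) for s in shared]

d = [y - x for x, y in zip(a, b)]
print(round(st.stdev(a), 1), round(st.stdev(b), 1))  # each typically near 3
print(round(st.stdev(d), 1))  # much smaller: the shared pair effect cancels
```

The SD of the differences is near √2, while each group's SD is dominated by the shared component, which is why the paired analysis is so much more precise here.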

8.2.1 Sign Test

We now shift attention to some simple analysis methods that reduce our
dependency on assumptions, and extend hypothesis testing to situations
where the scale of measurement may be very crude. Even with these
methods, however, we need to assume that the observations (or pairs of
observations) are stochastically independent of each other. This is the BIG
ASSUMPTION for essentially all elementary methods.
The table below shows survival times for pairs of temporary skin grafts
given to burn patients. Each patient received two cadaveric grafts, one with
close HLA-matching, and the other with poor matching. Note that two
observations are censored (indicated with a + next to the number), i.e. we
know that the graft survived beyond the recorded time, but we dont know
how far beyond. One study subject died with a graft still unrejected, and
another was lost to follow-up for unrecorded reasons.
Although the censoring prevents a clear definition of means, we can
establish, for each pair, which of the two kinds of graft lasted longer (the
sign column).


          HL-A Compatibility
          Close (y1)   Poor (y2)   Sign of d = y1 − y2
Patient
   1          37           29              +
   2          19           13              +
   3          57+          15              +
   4          93           26              +
   5          16           11              +
   6          23           18              +
   7          20           26              −
   8          63           43              +
   9          29           18              +
  10          60+          42              +
  11          18           19              −

Null hypothesis: H0 : Close and Poor matches have the same distribution of
survival times.
Alternative: Close matches tend to survive longer.
We will ignore the actual number of days, especially since some of them
were not observed exactly. We will only pay attention to the sign of the
difference.
Let N+ be the number of positive differences. If the null hypothesis held,
each pair would be equally likely to be positive or negative, and N+ would
have a binomial distribution, with p = 1/2 and n = 11 (just as if each
patient tossed a coin). We calculate Pr(N+ ≥ 9) under this binomial
distribution.
> 1 - pbinom(8,11,.5)
[1] 0.033
This is called the sign test.
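The binomial tail is also easy to compute from scratch; this Python sketch mirrors the pbinom call above:

```python
from math import comb

def binom_tail(k, n, p=0.5):
    """Pr(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))

# Nine or more positive signs out of 11 pairs, under the null p = 1/2.
print(round(binom_tail(9, 11), 3))   # 0.033
```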
Hypothetical Example
The sign test can be useful as a simple approach to complex problems.
Consider a hypothetical situation illustrated in figure 8.2. We suppose that
samples from a somewhat variable preparation of cells are given one of two
treatments (open and filled circles on the plot) to be compared, and the
preparations are allowed to incubate for 3, 6, or 10 hours before they are

203

8.2. PAIRED DESIGN

destructively measured. Suppose also that the experiment is replicated on


three separate days, which involves some substantial variation in the whole
system.

[Figure 8.2: Hypothetical example. Response (vertical axis) versus
incubation time (horizontal axis) for the two treatments on each of three
days.]
At each time on each day, we have a pair of observations that received the
two treatments of interest, but share the features peculiar to the day of the
experiment, and the incubation time. It might be rather difficult to model
the effect of time, along with all of the shared random effects that might be
present, but it is simple to notice that the observations are paired, and that
the direction of the difference is consistent across all 9 pairs.
Under the null hypothesis of no systematic treatment difference, both
treatments have equal probability of producing the larger measurement. If
the identity of the winner is independent across all the times and days, we
can treat this very simply, using a sign test. The chance of the perfect
consistency that we observed is p = (0.5)⁹ ≈ 2 × 10⁻³. This is a one-sided
p-value. If either direction might have been of interest, we should double
this, but it is still highly significant.
The assumption of independence is crucial. If the measurements at different
times involved repeated sampling from an animal, then variation in one
animal could have a parallel effect on three measurements. The analogy
would then be with three coin tosses, not nine, and no amount of
consistency would be very impressive.
Any common element shared by all of the treatment A measurements on a
day, but not shared with treatment B on the same day, would create a
dependency problem.
Note that lines connecting the dots across times are often used in published
figures, simply to distinguish which results were from which treatment on
which day, without implying that the same experimental units are being
followed longitudinally. It is important to clearly report whether such a
figure represents independent experimental units (cultures, mice or
whatever) or longitudinal measurements of the same experimental units. The
vertical lines, linking the observations to be compared, are actually the
more important connections in the figure.

8.2.2 The Wilcoxon Signed-Rank Test

In the sign test, we simply count how many pairs have Xi > Yi. It doesn't
matter how big the difference is.
The Wilcoxon test is a refinement in which the bigger differences count for
more than the smaller differences, but only according to the rank order of
the differences. This limits the impact of really huge differences.
Example 9.17 of S&W:


The differences are ranked (assigned numbers 1, 2, . . . , n) with regard to


their absolute value, and the test statistic is simply the sum of the ranks
corresponding to positive differences. Calculating a p-value involves either
special tables, or approximations with somewhat involved formulas, but
there is a handy function in R that does the calculations, starting with the
raw data.



> siteI = c(50.6,39.2,35.2,17.0,11.2,14.2,24.2,37.4,35.2)
> siteII =c(38.0,18.6,23.2,19.0, 6.6,16.4,14.4,37.6,24.4)
> wilcox.test(siteI, siteII, paired=TRUE)
Wilcoxon signed rank test
data: siteI and siteII
V = 39, p-value = 0.05469
alternative hypothesis: true mu is not equal to 0
These data can also be easily analysed in Prism. Enter the data for the two
sites in different columns, and be sure to specify the paired Wilcoxon
(signed rank) test.
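The signed-rank statistic V = 39 can also be reproduced by hand. A Python sketch (there are no ties or zero differences in these data, so the ranking is straightforward):

```python
siteI = [50.6, 39.2, 35.2, 17.0, 11.2, 14.2, 24.2, 37.4, 35.2]
siteII = [38.0, 18.6, 23.2, 19.0, 6.6, 16.4, 14.4, 37.6, 24.4]

# Differences, then rank their absolute values (1 = smallest).
diffs = [a - b for a, b in zip(siteI, siteII)]
order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
ranks = {i: r + 1 for r, i in enumerate(order)}

# V is the sum of the ranks belonging to positive differences.
V = sum(ranks[i] for i, d in enumerate(diffs) if d > 0)
print(V)  # 39, matching the R output above
```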

8.3 Two Independent Groups

Figure 8.3 compares subjects with fibromyalgia syndrome to control


subjects with regard to a selected pair of laboratory measurements. The
data are from the same paper that reported on the MEVF gene. The label
"no guai" refers to the fact that the patients are not taking a commonly
used medication.
The investigators are interested in testing whether the population of FMS
patients differs from the population of control subjects with regard to the
distribution of circulating IL-10 and IL-15.
The distributions are skewed, with many small values and fewer large
values. The skewness can be seen in the figure, and is apparent in the fact
that the means are quite a bit larger than the medians. The median IL-10
level is 19.4, while the mean is 34.4. For IL-15, the median is 18.4 and the
mean is 29.8.
To deal with the concern over the skewness, the comparison can be based
on the ranks of the observations, rather than their numerical values. All of
the data are combined for the ranking, then the ranks of one of the groups
are summed. This rank sum is compared to its expected value under the
null hypothesis.
For IL-10, we get a two-sided p = 0.77, i.e. no significant difference. For
IL-15, we get p = 0.76 (also two-sided). The differences in the sample
distributions are consistent with chance variation, and do not require any
further explanation.

Figure 8.3: Circulating IL-10 and IL-15 levels in the control (CTL) and
FMS ("no guai") groups.
This Wilcoxon rank-sum test is for two independent samples, without pairing.
The Wilcoxon signed-rank test is for a paired design. To make the names of
tests even harder to keep track of, there is something called the
Mann-Whitney U-test, which is identical to Wilcoxon's rank-sum test. The
two names come from two different ways of motivating what turns out to be
the same test. From the Mann-Whitney point of view, we consider all of
the different ways of pairing an observation from one sample with an
observation from another sample, and consider the number of such pairs in
which the member from group A is greater than the member from group B.



It turns out that these two tests will always produce the same p-value, so
they amount to the same thing.
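The equivalence is easy to verify numerically. In this Python sketch (with small, made-up data and no ties), the Mann-Whitney count U and the Wilcoxon rank sum W for group A differ only by the constant nA(nA + 1)/2:

```python
# Hypothetical data, no ties (so index-based ranking is safe).
a = [3.1, 4.5, 2.8, 5.0]
b = [2.5, 3.9, 3.0]

# Mann-Whitney U: count the pairs where the A member exceeds the B member.
U = sum(1 for x in a for y in b if x > y)

# Wilcoxon W: rank all observations together, then sum the ranks of group A.
combined = sorted(a + b)
W = sum(combined.index(x) + 1 for x in a)

nA = len(a)
print(U, W, W - nA * (nA + 1) // 2)  # U always equals W - nA(nA+1)/2
```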

8.4 Exercises

Exercise 8.4.1 (Review) For each of the following situations, suppose
H0: μ1 = μ2 is being tested against HA: μ1 ≠ μ2. State whether or not H0
would be rejected at the specified level of significance (α).

(a) t = 2.6 with 19 degrees of freedom, α = 0.01

(b) t = 2.6 with 5 degrees of freedom, α = 0.10

(c) t = 2.1 with 7 degrees of freedom, α = 0.05

(d) t = 2.5 with 7 degrees of freedom, α = 0.05

Exercise 8.4.2 Cyclic adenosine monophosphate (cAMP) can mediate
cellular response to hormones. In a study of maturation of egg cells in the
frog Xenopus laevis, oocytes from each of four females were divided into
batches; one batch was exposed to progesterone and the other was not. After
two minutes, each batch was assayed for its cAMP content, with the results
given in the table.

        cAMP (pmol/oocyte)
Frog    Control   Progesterone   Difference
1       6.01      5.23           0.78
2       2.28      1.21           1.07
3       1.51      1.40           0.11
4       2.12      1.38           0.74
Mean    2.98      2.31           0.68
SD      2.05      1.95           0.40
1. Calculate the standard error of the mean difference, based on the
standard deviation of the differences.


2. Calculate the 95% CI for the mean difference μ1 − μ2, using the
standard error that you computed.

3. Calculate the standard error of the mean difference as if the data were
not paired. (Don't worry about pooling standard deviations, just
combine them using the Pythagorean relationship.)

4. Calculate the 95% CI for the mean difference μ1 − μ2, as if the data
were not paired.

5. Give the t-statistic and two-sided p-value of the paired t-test for these
data.

6. Give the t-statistic and two-sided p-value of the independent-samples


t-test for these data.

7. What do you conclude about the value of pairing in this experiment?

Exercise 8.4.3 Use the frog egg cAMP data from the previous exercise.
1. What is the p-value for a two-sided Wilcoxon signed-rank
(paired-data) test?

2. What is the p-value for a two-sided Wilcoxon rank-sum (independent


samples) test?


3. What is the p-value for a one-sided signed-rank test?

4. What do you conclude about the relative power of the rank-based tests
versus t-tests in this example?

5. What is the p-value for a one-sided sign test?

6. What is the smallest p-value that a one-sided sign test could
possibly give with 4 pairs of observations?

Exercise 8.4.4 Suppose we have preliminary flow cytometry data on the


percentage of CD4-positive T-cells recognizing a cytomegalovirus epitope
after exposure to viral antigens in a humanized mouse model. We hope to
detect a doubling of these percentages after boosting immunity
experimentally. We will analyze the data on the common log scale, and the
standard deviation on that scale is 0.46. A doubling of the geometric mean
percentage corresponds to an increase of 0.3 on the common log scale.
1. How many mice per group would we have to study, to have 80% power
in a one-sided, 0.05-level t-test comparing boosted and unboosted mice?

2. How many mice per group would we have to study if the standard
deviation could be reduced to 0.3?


Chapter 9
Correlation and Regression


Goals: To estimate a correlation coefficient, compute a regression line,
and articulate the meaning of each.

Handout: "Regression towards the mean", J Martin Bland, Douglas G
Altman.
We have examined ways of summarizing a single variable, and of comparing
the means of two independent samples. Here we consider the association
between a pair of quantitative variables.

9.1 Example: Galton's height data

Sir Francis Galton was interested in heredity in the latter part of the 19th
century, before Mendel's work had become widely known. The reading
assignment includes a short Statistics Note by Bland and Altman, which
shows a historical figure from Galton's study of the heritability of height.
That figure is redrawn in Figure 9.1.
Because the measurements are in whole inches, the figure shows the
numbers of families that plot to the same point. The horizontal (X) axis is
the parental height. To deal with the fact that a person has two parents,
Galton took the average height of the two parents, after adjusting the
heights of mothers upward by 8 percent to account for the fact that women
tend to be shorter than men. The resulting pooled parental measurement is
called the midparent height. All of the children are sons, and their heights
are plotted on the vertical (Y ) axis.
The plot has a dark round mark at the centroid, i.e. the point with the
mean of the parent heights as its x-coordinate and the mean of the child
heights as its y-coordinate. The mean height of the sons is a little over 68
inches, while the average mid-parent height is a little under 68 inches. Bars
showing the standard deviations are shown just inside of the axes. The
standard deviation of the mid-parent heights is a little smaller than that of
the child height, as we would expect, given that the mid-parent height is
the average of two different heights.
The aspect of the plot that we want to focus on is the association between
parent height and child height, i.e. the tendency for tall parents to have tall
sons, and for short parents to have short sons. Notice that the families with

mid-parent height of 65 inches have sons ranging from 64 to 70 inches,


while the families with mid-parent heights of 70 inches have sons ranging
from 67 to 73 inches.
Each gray circle shows a conditional mean height, i.e. the average height for
a subset of families with a given mid-parent height. These fall very close to
a straight line that passes through the centroid, called the regression line.
It is defined as the line that minimizes the sum of squared vertical distances
from the line to the points. This is also called the least-squares line.
In general, if we have two variables, X and Y , the regression of Y on X is
the conditional mean of Y , given X. This could be a non-linear function of
X, but we are often interested in a linear regression function, because we
are willing to assume that the conditional means fall on a straight line, or
because we have little evidence of any departure from linearity, and want a
simple summary of the change in y per unit change in x. In any case, the
least squares line is, in a well-defined sense, the best fitting straight line.
The vertical distances from the points to the regression line are called
residuals, so the regression line minimizes the sum of squared residuals. It
is important that these residuals are vertical distances. The regression line
is optimized for predicting Y from X, i.e. for predicting child height from
mid-parent height, not the other way round.
The slope of the regression line reflects the association of the two variables,
but the slope is in units determined by the particular details of the
example. If we were to use the standard deviations of X and Y as our units
of measurement, then the slope would be a unitless number, called the
correlation coefficient. The correlation coefficient can serve as a general
measure of the degree of association between two variables.

9.2 Correlation Coefficient

The correlation coefficient, r, is the average product of
standardized values.

It is a single, unitless number that measures the association between two
variables, taking values between −1 and 1.


A standardized variable is one that is expressed as the number of
standard deviations above or below the mean. Symbolically, given a
variable with values x1, . . . , xn, the standardized values are defined by

    zi = (xi − x̄) / s

where x̄ is the sample mean, and s is the sample standard deviation (of x).
Let (x1 , y1 ), . . . , (xn , yn ) be a sample of points. The pairing is important.
The sample correlation coefficient is

    r = (1 / (n − 1)) Σ_{i=1}^{n} [(xi − x̄) / sx] [(yi − ȳ) / sy]

where x̄ and ȳ are the respective means of x and y, and where sx and sy are
the respective standard deviations. The use of n − 1 in place of n in the
denominator for the sample correlation is analogous to the n − 1 correction
applied to sample standard deviations.
Example 9.2.1 Calculate the correlation coefficient for the following data set.
i       x      y
1       3.5    20
2       8.0    31
3       5.0    32
4       8.5    39
5       11.1   47
mean    7.22   33.8
s.d.    3      10

For each value of x, we get the corresponding standardized value by
subtracting the mean, x̄ = 7.22, and dividing by the S.D., sx = 3, so for
example, the standardized version of x1 = 3.5 becomes

    (3.5 − 7.22) / 3 = −1.24.

Similarly, for each value of y, we subtract ȳ = 33.8, and divide by sy = 10.
Replacing the data by the standardized values, we compute the (n − 1)-
adjusted mean as in the table below.

i        zx       zy       zx·zy
1        -1.24    -1.37     1.704
2         0.26    -0.28    -0.072
3        -0.74    -0.18     0.133
4         0.43     0.52     0.221
5         1.29     1.32     1.700
Sum                         3.686
Sum/4                       0.921

The sample correlation is r = 0.921.
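The arithmetic of Example 9.2.1 can be checked with a short Python sketch (the statistics module's stdev uses the same n − 1 divisor as the sample standard deviations in the table):

```python
from statistics import mean, stdev  # stdev uses the n - 1 divisor

x = [3.5, 8.0, 5.0, 8.5, 11.1]
y = [20, 31, 32, 39, 47]
n = len(x)

# r = (1 / (n - 1)) * sum of products of standardized values.
zx = [(v - mean(x)) / stdev(x) for v in x]
zy = [(v - mean(y)) / stdev(y) for v in y]
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)
print(round(r, 3))  # 0.921
```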


If X and Y are positively correlated, most of the standardized deviates will
fall in the upper right and lower left quadrants of the scatterplot. For a
positive correlation, if X is above average, then Y tends to also be above
average, and if X is below average, then Y is usually also below average.
Thus positive values will predominate in the product, making the
correlation a positive number. If X and Y are negatively correlated, most
of the standardized deviates will fall in the upper left and lower right
quadrants of the scatterplot. Some examples of data with different
correlation coefficients are shown in figure 9.2.
Here is how Galton describes correlation:
Two variable organs are said to be co-related when the
variation of the one is accompanied on the average by more or
less variation of the other, and in the same direction. Thus the
length of the arm is said to be co-related with that of the leg,
because a person with a long arm has usually a long leg, and
conversely. If the co-relation be close, then a person with a very
long arm would usually have a very long leg; if it be moderately
close, then the length of his leg would usually be only long,
not very long; and if there were no co-relation at all then the
length of his leg would on the average be mediocre. It is easy to
see that co-relation must be the consequence of the variations of
the two organs being partly due to common causes. If they were
wholly due to common causes, the co-relation would be perfect,
as is approximately the case with the symmetrically disposed
parts of the body. If they were in no respect due to common
causes, the co-relation would be nil. Between these two



extremes are an endless number of intermediate cases, and it
will be shown how the closeness of co-relation in any particular
case admits of being expressed by a simple number.¹

¹ Francis Galton, "Co-relations and their measurement, chiefly from Anthropometric
data", Proceedings of the Royal Society, Session of December 20, 1888, volume 45, 1888,
pages 135-145.

Figure 9.1: Galton's data on heights of parents and sons. y = Child Height
(inches) is plotted against x = Mid Parent Height (inches); bars just inside
the axes show SD(x) and SD(y).

Figure 9.2: Illustrations of different levels of correlation (0, 0.4, 0.7, and
0.9). Units are standard deviations.

9.3 Regression

Having defined the regression line as the line that minimizes the root mean
square residuals, how do we compute it? We can figure that out from two
key facts.
1. The regression line passes through the centroid.
2. The slope of the regression line is r (sy / sx), where r is the correlation
coefficient and sx and sy are the standard deviations of the two variables.
In other words, for every increase of one standard deviation in the X
direction, the regression line changes by r standard deviations in the Y
direction. Because −1 ≤ r ≤ 1, the regression line is no steeper than the
diagonal line that rises one s.d. in Y for every one s.d. in X. It is typically
much flatter.
The sample correlation for the 314 families in Galton's dataset is r = 0.388.
For every increase of 1 standard deviation in the X direction, the regression
line increases 0.388 standard deviations in the Y direction.
If we let ŷ denote the expected value of Y given that X = x, we can write
the regression line in the usual form of

    ŷ = a + bx.

We already noted that the slope is

    b = r (sy / sx).

Because the line must pass through the centroid, the intercept is

    a = ȳ − b x̄.


For Galton's data, the sample statistics are:

    x̄  = 67.6,
    sx = 1.49,
    ȳ  = 68.4,
    sy = 2.21,
    r  = 0.388.

Figure 9.3: Regression line for Galton's data. An increase of sx in x = Mid
Parent Height (inches) corresponds to a rise of r·sy in y = Child Height
(inches).

From the sample statistics we can calculate the slope and intercept as

    b = 0.388 × (2.21 / 1.49) ≈ 0.575

and

    a = 68.4 − (0.575)(67.6) = 29.5,

so the equation of the regression line is

    ŷ = 29.5 + (0.575)x.        (9.1)
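The slope, intercept, and a prediction can be reproduced from the five sample statistics alone. A quick Python check:

```python
xbar, sx = 67.6, 1.49   # mid-parent height: mean and SD
ybar, sy = 68.4, 2.21   # child height: mean and SD
r = 0.388

b = r * sy / sx          # slope
a = ybar - b * xbar      # intercept (the line passes through the centroid)
print(round(b, 3), round(a, 1))  # 0.575 29.5

# Predicted son's height for a mid-parent height of 70.5 inches:
print(round(a + b * 70.5, 1))    # 70.1, i.e. about 70 inches
```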


A key fact: The correlation coefficient is the slope of the regression line
when the variables are expressed in standard units.
The regression line consists of points satisfying

    (y − ȳ) / sy = r (x − x̄) / sx

e.g. if x is one standard deviation above x̄, then ŷ only exceeds ȳ by r·sy.
Notice that the regression line does not rise a full SD in child height
with an increase of 1 SD in mid-parent height. Parents that are taller than
average by one standard deviation tend to have sons that are also taller
than average, but not by a full standard deviation. Parents that are a
standard deviation shorter than average tend to have sons that are also
shorter than average, but not by a full standard deviation. Here is how
Galton described the situation.
When Mid-Parents are taller than mediocrity, their children
tend to be shorter than they. When Mid-Parents are shorter
than mediocrity, their children tend to be taller than they.
This phenomenon is called regression to the mean, and the name regression
has stuck to the line for predicting one variable from another.
Note that the correlation is 1 (or −1) only when the points all fall exactly on
the regression line. In that case, the regression line would increase by sy as
x increases by sx. Let's call this the SD line. At the other extreme, if the
correlation is zero, X does not help us predict Y . The regression line is
then flat, with a slope of zero. In between these extremes, the correlation
coefficient tells us the slope of the regression line as a fraction of the slope
of the SD line.
Let's try to predict the average height of sons for some specific values of
midparent height.
Example: Suppose the midparent height for a given family is the average
value of 67.6 inches. What is the regression prediction for the son's height?
Answer: 68.4, which is the mean of all sons. The regression line always goes
through the mean for both variables.
Example: Suppose the midparent value for a family is 70.5, which is 2
standard deviations above the mean. What is the predicted height for the
son?


Answer: Recall that the mean height of sons is 68.4 inches, the s.d. for sons
is 2.21, and the correlation coefficient is 0.388. If X is 2 standard deviations
above its mean, then Y exceeds its mean by 2r standard deviations. The
height of the regression line is thus 2(0.388) = 0.776 standard deviations
above ȳ. The standard deviation of Y is 2.21, so we add (0.776)(2.21) to 68.4
to get 70 inches (to 2 digit accuracy).
Although things are perhaps clearest in standard deviation units, we can of
course simply plug into the equation for the regression line that we
calculated above, yielding 29.5 + (0.575)(70.5) = 70 inches.
There are two regression lines
When we observe pairs of observations, (X, Y ), we can estimate two
regression lines, one for predicting Y from X and the other for predicting X
from Y . The regression line for predicting the heights of sons from the
midparent height has a flatter slope than the SD line, but the regression
line for predicting the midparent height from the height of the son will have
a slope that is steeper than the SD line, but that is only because we are
predicting in the horizontal direction. If we flip the plot around so that we
always plot the predictor on the horizontal axis, and the variable being
predicted on the vertical axis, then the regression line is always flatter than
the SD line. But if we try to put both regression lines on the same plot, one
of them has to have the variables the other way round.
Note that if the correlation is very high (say 0.9), both regression lines are
near the SD line. If the correlation is zero, the regression lines are at right
angles to each other.
Correlation is a symmetric concept. Regression is not. The regression line
depends on which variable is the predictor and which is the response.

9.3.1 Example: Spouse education

Friedman, Pisani and Purves give the following provocative little problem.
Suppose that in some study of education, the correlation between the
educational level of husbands and wives was about 0.50; both averaged 12
years of schooling completed, with an SD of 3 years.
1. Predict the educational level of a woman whose husband has
completed 18 years of schooling.
Answer: The husband's schooling is 6 years above the mean for men,
i.e. 2 SD above the mean. The slope of the regression line is
b = rsy /sx . In this case, where sy = sx , the slope is just the
correlation, 0.50. So we predict the mean schooling for the wives to
be (0.50)6 = 3 years above the mean, or 15 years.

2. Predict the educational level of a man whose wife has completed 15


years of schooling.
Answer: 15 years is 1 SD above the mean, so we predict 0.5 SD above
the mean for the husband. This is 13.5 years of schooling.
3. Apparently, well educated men marry women who are less well
educated than themselves. But the women marry men with even less
education. How is this possible? Answer: It isn't. The wives whose
husbands have 18 years of schooling are not the same set of women as
the wives who have 15 years of schooling themselves. The former
group has a substantial range of years of schooling (with 15 years
being the average), while the latter group all have 15 years. The
phrase "But the women . . ." seems to refer back to the first group.
However it is the second group whose husbands have an average of
13.5 years of schooling.
This isn't just a trick question. The regression phenomenon can seem a
little mysterious. The situation, as described, is completely symmetric.
Husbands with more than the average number of years of schooling have,
on average, wives with less schooling than they (but more than average).
Wives with more than the average number of years of schooling have, on
average, husbands with less schooling than they (but more than average).
How can this be? According to the description of the problem, there should
be about as many couples where the wife has 18 years of schooling as where
the husband has 18 years of schooling. But if we select couples where the
husband has a lot of schooling, and we don't apply any selection to the
wives, we include wives with less schooling, which brings down their
average. We can do the same kind of selection for wives with a lot of
education, and observe a lower average for their husbands. One kind of
selection gives us a vertical strip, the other a horizontal strip. The group we
select to be high comes out higher. Selection is important.
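The selection effect can be seen in a quick simulation. This Python sketch (the couple counts, seed, and cutoff are arbitrary choices, not from the text) generates schooling for couples with correlation 0.5, mean 12, and SD 3 on both sides, then selects a "vertical strip" of couples with highly educated husbands:

```python
import random
random.seed(1)

# Simulate schooling for many couples with correlation 0.5.
rho, mu, sd = 0.5, 12.0, 3.0
husbands, wives = [], []
for _ in range(100_000):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    husbands.append(mu + sd * z1)
    wives.append(mu + sd * (rho * z1 + (1 - rho**2) ** 0.5 * z2))

# Select couples where the husband is far above average (a vertical strip).
sel = [(h, w) for h, w in zip(husbands, wives) if h >= 17]
mh = sum(h for h, _ in sel) / len(sel)
mw = sum(w for _, w in sel) / len(sel)
print(round(mh, 1), round(mw, 1))  # wives average above 12, but below mh
```

Selecting a horizontal strip of highly educated wives gives the mirror-image result, which is the symmetry described above.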

9.4 Uses of Linear Regression

There are two main contexts for linear regression:


1. X is determined by the experimenter, and Y is observed.
2. X and Y are observed as a pair for each individual in a sample from
some population.
Galton's dataset is an example of the latter. Families are sampled, and the
midparent and son's heights are observed. The study of aconiazide toxicity,
that we saw in chapter 2, is an example where X is determined by the
experimenter. The data are shown in Figure 9.4. In an experimental study
like this, regression is used to estimate the rate at which Y (weight gain or
loss), changes per unit increase in X (dose). Samuels & Witmer give
another pair of examples in section 12.1.
Within these two contexts, linear models have many uses. Here are some
examples.
Prediction. In a study² of bacterial clearance in the purple sea urchin, the
investigators needed to deliver a dose of bacteria that was proportional to
coelomic fluid volume in animals of quite variable size. The investigators
first sacrificed some sea urchins to obtain paired weight and coelomic fluid
volume data, then fit a linear model that permitted accurate prediction of
volume from weight.
Calibration. In chapter one we encountered Dr. Shih's study³ of cultured
hematopoetic stem cells. In that study, a linear model was fit to the
logarithm of the odds of engraftment, as a function of fresh cell dose (at
doses determined by the experimenter). The model was then used in the
reverse direction to estimate the dose of fresh cells that would give the
same engraftment rate as a dose of cultured cells.
Estimation. In a study of cocaine metabolism, a linear model was fit to
the logarithms of blood concentrations as a function of time since injection.
The slope of the line was −0.462, which allows us to calculate
(log(1/2)/slope) that the half-life of cocaine in the blood is 1.5 hours (at
least in the subject that was tested).
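The half-life arithmetic is the standard exponential-decay relation: if the log concentration declines linearly with slope b, the half-life is log(1/2)/b in the same time units. A Python check (assuming natural logs, consistent with the 1.5-hour answer):

```python
import math

slope = -0.462  # decline in log concentration per hour
half_life = math.log(0.5) / slope
print(round(half_life, 1))  # 1.5 hours
```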
² Biol Bull (1983) 165: 473-486
³ Blood (1999) 94: 1623-1636


Testing. Fixed doses of drugs are often given to patients (or animals) who
then metabolize the drugs with variable rates, yielding different blood
levels. To address whether this variation is associated with an effect of the
drug, such as methylation of DNA in white blood cells, we can fit a linear
model to the drug levels (X) and methylation levels (Y ), and test whether
the slope differs from zero. We will take up the question of hypothesis
testing below, after a closer look at the linear regression model.

9.5 The simple linear regression model

Let's model the distribution of an outcome Y, conditional on observing or


setting variable X to a specific value, X = x. The model has two parts.
The systematic part of the model specifies the expected value of Y as a
function of x, i.e.

    E(Y | x) = α + βx.

The random part of the model specifies that

    Y = E(Y | x) + ε

where the ε are independent random perturbations, with zero mean and a
common standard deviation, σ. We usually go further and assume that the
ε random variables have Gaussian (normal) distributions.
The major assumptions involved in a simple linear regression are outlined
below.
1. Linearity: the conditional mean response, E(Y | x), is a linear function
of x.

2. Constant variance: violation can make standard errors inaccurate.

3. Normality: needed for prediction intervals, but not as crucial as the
other assumptions.

4. Independence: A BIG ASSUMPTION; dependencies require special
treatment, or standard errors will be too small, and inferences will be
wrong.


Least squares might be used to put a line on a scatterplot whenever you


think it useful, but to justify the statistical statements that often go with
the line you need the assumptions above.
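To make the model concrete, here is a Python sketch that simulates data from Y = α + βx + ε and recovers the least-squares line from the closed-form formulas. The nine design times and the true parameters are made-up values, chosen only to loosely mimic a 9-times-by-3-replicates layout like the calcium experiment below:

```python
import random
random.seed(2)

# Hypothetical true model: Y = 1.0 + 0.25 x + Gaussian noise.
alpha, beta, sigma = 1.0, 0.25, 0.7
times = (0.45, 1.3, 2.4, 4.0, 6.1, 8.05, 11.15, 13.15, 15.0)
xs = [t for t in times for _ in range(3)]  # 9 times x 3 replicates
ys = [alpha + beta * x + random.gauss(0, sigma) for x in xs]

# Closed-form least squares: b = Sxy / Sxx, and the line passes
# through the centroid, so a = ybar - b * xbar.
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
    sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar
print(round(a, 2), round(b, 2))  # estimates near the true 1.0 and 0.25
```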

9.6 Computing

The file calcium.csv contains two columns and 27 rows. The columns are
labeled time and cal. These are data from a biochemical analysis of
intracellular storage and transport of calcium across the plasma membrane.
Cells were suspended in a solution of radioactive calcium for a certain
length of time and then the amount of radioactive calcium that was
absorbed by the cells was measured. The experiment was repeated
independently with 9 different times of suspension, each replicated 3 times.
We can plot the data and fit a least-squares line with many software tools.
In R Commander:
- Change working directory (File menu) to the folder where the data
  file resides.
- Import Data (Data menu), specifying comma-separated.
- Look at the data (View dataset button).
- Make a scatterplot (Graphs menu), choosing time as the X variable,
  and cal as the Y variable, and checking least-squares line in the
  scatterplot menu.
- Fit a linear regression model (Statistics / Fit models) with cal as the
  response variable, and time as the explanatory variable.
The plot is in Figure 9.5.
The numerical results will look like this:
Call:
lm(formula = cal ~ time, data = Dataset)

Residuals:
     Min       1Q   Median       3Q      Max
-1.24196 -0.47607  0.07946  0.53899  1.12549

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.02650    0.23707   4.330 0.000212 ***
time         0.24269    0.02794   8.686 5.08e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.728 on 25 degrees of freedom
Multiple R-squared: 0.7511,    Adjusted R-squared: 0.7411
F-statistic: 75.44 on 1 and 25 DF,  p-value: 5.08e-09
This says that the regression equation is

    cal = 1.02650 + 0.24269(time) + ε

where ε is a random error, with expected value of zero (i.e. it is just a
disturbance, not a bias), and with standard deviation 0.728. This is the
standard deviation of the residuals, i.e. the vertical distances from the
points to the regression line. Note that there are 25 degrees of freedom for
this standard error. That is 2 less than the 27 observations (pairs of
numbers), because the model requires estimating two parameters, i.e. a
slope and an intercept.
Notice that the estimates of the slope (the coefficient of time) and the
intercept, both have standard errors. It is beyond the scope of this
discussion to explain how these are estimated, but dividing the slope
estimate by its standard error produces a t-statistic, and the probability of
such a large absolute t-value, according to the t-distribution with 25
degrees of freedom, is a p-value, which addresses the null hypothesis that
the slope is zero. This is strongly rejected, which is consistent with the
plot. The calcium levels clearly increase with time. The relationship may
have some curvature, but there is definitely an upward trend.
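The t values in the coefficient table are simply estimate divided by standard error. A quick Python check against the printed output:

```python
# Estimates and standard errors copied from the R output above.
est_time, se_time = 0.24269, 0.02794
est_int, se_int = 1.02650, 0.23707

t_time = est_time / se_time
t_int = est_int / se_int
print(round(t_time, 3), round(t_int, 2))  # 8.686 4.33, as in the table
```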
The report on the linear regression includes an F-test, which we will discuss
when we study analysis of variance. In this simple linear regression, with
only one explanatory variable, the p-values from the F-test and from the


t-test for the slope are exactly the same. The F-test, however, generalizes
to more complex models, with multiple explanatory variables.
Computing with Prism
Select the scatterplot tab, and paste the two columns of data into the data
table. The scatterplot should be generated automatically, and a report on the
regression model should provide estimates of slope and intercept, with
standard errors and p-values.

9.6.1  Transformed variables

Figure 9.6 shows four plots of species number versus area, for a set of
islands. Aside from illustrating a general principle of island biogeography,
the figure illustrates that the relationship between variables may be
non-linear on the original scale, but linear when one or both scales are
logarithmic.

Figure 9.4: Aconiazide toxicity study (scatterplot of Weight Gain versus
DOSE).

Figure 9.5: Plot of the calcium data (cal versus time).

Figure 9.6: Island biogeography example, where variables are linearly related
when both scales are logarithmic (four panels plot SPECIES versus AREA on
original and logarithmic scales). This example is discussed in more detail in
The Statistical Sleuth, Chapter 8.
The theoretical model for these data is

  S = C A^β

where S is the predicted number of species, A is area, and C and β are
constants, so

  log(S) = log(C) + β log(A)

or

  log(S) = β0 + β1 log(A).

When the species count and area are expressed on logarithmic scales, the
theoretical model is linear, with β as the slope and log(C) as the intercept.
We can easily estimate the parameters from data. Here are results using R,
where the data have already been read into a dataset named isles.
> fit = lm(log(SPECIES) ~ log(AREA), data=isles)
> coef(fit)
(Intercept)   log(AREA)
  1.9365081   0.2496799
Rounding a bit, we get β = 0.25, and log(C) = 1.9365. Taking the antilog
of the latter, C = exp(1.9365) = 6.93.
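The back-transformation and the log-linear identity can be checked in a few
lines of Python (illustrative only; the notes do this in R):

```python
import math

# Estimates from coef(fit) above
log_C = 1.9365081  # intercept, i.e. log(C)
beta = 0.2496799   # slope

C = math.exp(log_C)  # antilog of the intercept
print(round(C, 2))   # about 6.93

# On the log scale the power law S = C * A^beta is linear:
A = 1000.0  # an arbitrary area, for illustration
S = C * A ** beta
assert math.isclose(math.log(S), log_C + beta * math.log(A))
```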
We can examine the accuracy of these estimates using confidence intervals.
The standard errors for our estimates of β and log(C) are listed by the
summary function, although they are buried among other things.
> summary(fit)

Call:
lm(formula = log(SPECIES) ~ log(AREA), data = isles)

Residuals:
         1          2          3          4          5          6          7
-0.0021358  0.1769753 -0.2154872  0.0009468 -0.0292440  0.0595428  0.0094020

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.93651    0.08813   21.97 3.62e-06 ***
log(AREA)    0.24968    0.01211   20.62 4.96e-06 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 0.1283 on 5 degrees of freedom
Multiple R-Squared: 0.9884,    Adjusted R-squared: 0.9861
F-statistic: 425.3 on 1 and 5 DF,  p-value: 4.962e-06
Since β = β1 is the coefficient of log(AREA), we can find the estimate
(0.24968 ≈ 0.25) and standard error (0.01211 ≈ 0.012) on that line.
We can get the confidence interval for β (a.k.a. β1) by computing

  β̂1 ± t.975,5 SE(β̂1).

Note that we have 5 residual degrees of freedom (7 observations minus 2
parameters estimated). By specifying the two quantiles of the t-distribution
we get the plus or minus as a vector.
> qt(c(.025,.975),5)
[1] -2.570582 2.570582
> .25 + qt(c(.025,.975),5) * 0.012
[1] 0.219153 0.280847
This rounds to (0.219, 0.281).
We can do something similar to get a confidence interval for log(C), which
corresponds to the intercept.
> beta0.limits = 1.93651 + qt(c(.025,.975),5)* 0.08813
> beta0.limits
[1] 1.709965 2.163055
> exp(beta0.limits)
[1] 5.528766 8.697672
The exponential function (exp) was used to translate the limits on log(C)
into limits on C.
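The same interval arithmetic can be reproduced in Python; the t quantile
2.570582 is taken from the qt output above rather than recomputed (base
Python has no built-in t quantile function):

```python
import math

t975 = 2.570582  # 97.5th percentile of the t-distribution with 5 df (R's qt)

# Confidence interval for the slope, beta: estimate +/- t * SE
beta_lo = 0.25 - t975 * 0.012
beta_hi = 0.25 + t975 * 0.012
print(round(beta_lo, 3), round(beta_hi, 3))  # 0.219 0.281

# Confidence interval for log(C) via the intercept, back-transformed to C
b0_lo = 1.93651 - t975 * 0.08813
b0_hi = 1.93651 + t975 * 0.08813
C_lo, C_hi = math.exp(b0_lo), math.exp(b0_hi)
print(round(C_lo, 2), round(C_hi, 2))  # about 5.53 and 8.70
```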
The natural objective in this example is estimation, not testing, because
the general pattern that species number tends to increase with area is well
known.

9.7  Exercises

Exercise 9.7.1 Williams et al. (2001) give data on brain and body weights
of mice. Let x denote the bodyweight in grams, and let y denote the brain
weight in mg. The mean bodyweight is x̄ = 21.1 g and the mean brainweight
is ȳ = 426 mg. The standard deviations are, for bodyweight, sx = 5.3 g, and
for brainweight, sy = 41.4 mg. The correlation coefficient is r = 0.435. Note:
both the standard deviation line and the regression line go through the mean
point (x̄, ȳ) = (21.1, 426). If we add one standard deviation to the mean body
weight, we get 21.1 g + 5.3 g = 26.4 g.
A. How high is the regression line at that point?

B. What is the slope of the regression line in units of mg of brain weight per
gram body weight?

C. What is the equation of the regression line for the mean of brain weight
as a function of body weight?


Exercise 9.7.2 (AF 3.17) Consider the data:

  x:  3  4  5  6  7
  y:  8 13 12 14 16

a. Sketch a scatter plot.

b. If one particular pair of (x, y) values is removed, the correlation for the
remaining pairs equals 1. Which pair is it?

c. If one particular y value is changed, the correlation for the five pairs
equals 1. Identify the y value and how it must be changed for this to happen.


Exercise 9.7.3 Which of the following is not a property of r?


a. r is always between -1 and 1.
b. r depends on which of the two variables is designated as the response
variable.
c. r measures the strength of the linear relationship between x and y.
d. r does not depend on the units of y or x.
e. r has the same sign as the slope of the regression equation.
Exercise 9.7.4 Which one of the following statements is correct?
a. The correlation is always the same as the slope of the regression line.
b. The mean of the residuals from the least-squares regression is 0 only
when r = 0.
c. The correlation is the percentage of points that lie in the quadrants where
x and y are both above the mean or both below the mean.
d. The correlation is inappropriate if a U-shaped relationship exists between
x and y.
Exercise 9.7.5 Indicate the correct completion: The slope of the regression
equation and the correlation are similar in the sense that
a. they do not depend on the units of measurement.
b. they both must fall between -1 and +1.
c. they both have the same sign.
d. neither can be affected by severe regression outliers.
Exercise 9.7.6 Obtain the file aconiazide.txt from the course website
(http://www.infosci.coh.org/jal/Class/index.html) and read the
data into the program of your choice (e.g. R, Prism, Excel).
a. Make a scatter plot.
b. What are the slope and intercept of the regression equation?

c. What does the model predict as the mean weight gain for a dose of 200
mg/kg/day?

d. What is the observed mean weight gain for the rats on a dose of 200
mg/kg/day?


e. Do you think the data look like they are well summarized by a linear
relationship?


R Tutorial
If you would like to do the computing for the exercise above using R, you
can follow the following instructions. You can use another program if you
prefer.
(1) Save the file in a folder that you know the path to, (e.g. c:/stat).
(2) Look at the file in a text editor, such as notepad. Note that there is
both description and data.
(3) Start R, and set the working directory to this folder. You can use the
File pull-down (Misc on a Mac), selecting Change dir. Alternatively, you
can enter setwd("C:/stat") at the command prompt.
(4) Read the data and check it by entering
rats = read.table("aconiazide.txt", header=TRUE, skip=19)
summary(rats)
dim(rats)
The first command reads the data, skipping over the 19 lines of text. The
data now reside in an object called rats. This is a matrix-like object called
a data frame consisting of two columns, respectively named W and DOSE.
The names came from the header line in the file. The dim command tells us
that there are 50 observations on 2 variables.
(5) Try these commands. Anything to the right of a pound sign is just a
comment.
attach(rats)   # allows reference to W or DOSE within rats object
stem(W)        # for a stem-and-leaf display
table(DOSE)    # for a table of frequencies

We see that there are 10 rats at each of 5 doses, and the weight gains
stretch out more in the negative direction, and stack up more in the
positive direction.
(6) Let's plot the data.
plot(W ~ DOSE, data=rats)
(7) Get the regression equation.


fit = lm(W ~ DOSE, data=rats)
fit
(8) Add the line to the plot.
abline(fit)
(9) Get the mean for one dose.
mean(W[DOSE == 200])
The square brackets select observations meeting the criterion. The double
equal sign is the equality comparison operator. Single = is used to assign
values to objects, among other things.
(10) Let's add the mean weight gain for each dose group to our plot.
W.mean <- tapply(W, DOSE, mean)
DOSE.mean <- tapply(DOSE, DOSE, mean)
points(DOSE.mean, W.mean, pch=3, cex=2.5)
The tapply function may look a little mysterious. It applies a statistic (the
third argument, mean) to its first argument (W) for each value of its second
argument (DOSE). Typing help(points) will provide some information on,
among other things, the possible choices for plotting symbols (pch).
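For readers not using R, the group means computed by tapply in step (10)
are just per-group averages. A rough Python equivalent with made-up
(dose, weight gain) pairs (the real values are in aconiazide.txt):

```python
from collections import defaultdict

# Hypothetical (dose, weight gain) pairs standing in for the rats data
rats = [(0, 12.0), (0, 10.0), (100, 9.0), (100, 7.0), (200, 5.0), (200, 3.0)]

# Group the weight gains by dose, then average, like tapply(W, DOSE, mean)
by_dose = defaultdict(list)
for dose, w in rats:
    by_dose[dose].append(w)

w_mean = {dose: sum(ws) / len(ws) for dose, ws in by_dose.items()}
print(w_mean)  # {0: 11.0, 100: 8.0, 200: 4.0}
```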

Chapter 10
Comparing several means


We have previously considered comparing two groups using a t-test with
pooled variance. Here we consider comparisons involving three or more
groups. Let's begin with an example.

10.1  Calorie Restriction and Longevity

Experiments in several species have found that restricting caloric intake can
increase longevity. Caloric restriction seems to be about the only thing that
has such an effect. This has led to recent work trying to shed light on the
mechanism for the effect. We will look at data from Weindruch et al.1 that
is used in a statistics textbook called The Statistical Sleuth.
Data file. The file calrestrict.csv is available at the course webpage. If
you open the file in excel or import the data into R Commander, you will
see a column labeld LIFETIME, giving lifetime, in months, of mice in the
experiment. The second column is labeled DIET, and contains codes for the
six experimental groups. The groups are:
NP: Mice were fed as much as they pleased (ad libitum diet).
N/N85: The mice were fed normally both before and after weaning, but
the ration was controlled to 85 kcal/wk after weaning. This, rather
than the ad libitum diet, is considered the reference group because the
calorie intake is held fairly constant.
N/R50: This group had a normal diet before weaning, and a 50 kcal/wk
diet after weaning. This is the basic calorie-restricted diet.
R/R50: These mice had a restricted 50 kcal/wk diet before and after
weaning.
N/R50 lopro: Normal diet before weaning, and 50 kcal/wk after weaning,
with an age-adjusted limitation of protein.
N/R40: Normal diet before weaning, and a 40 kcal/wk restricted diet
after weaning.
1 Journal of Nutrition, 116 (1986) 641-654.


A stripchart (with jitter) will show that the ad libitum diet clearly produces
the shortest lifetimes, with all of the NP animals below the median for the
calorie restriction groups. It also shows that the distributions are somewhat
skewed, with a few animals in each group being unusually short lived.

10.1.1  Global F-test

A natural starting place for hypothesis testing is the global null hypothesis
that none of the diets have any effect on longevity. This global null
hypothesis can be written as
H0 : μNP = μN/N85 = . . . = μN/R40,

where μ denotes the mean lifetime of the group indicated by the subscript.
To make the notation a little easier, let's number the groups from 1 to 6.
Then we can write

H0 : μ1 = μ2 = . . . = μ6.
The global null hypothesis is typically tested using an F-statistic from an
analysis of variance (AOV or ANOVA) table. The ANOVA table breaks
down the variation around the grand mean (over all groups) into parts that
reflect the variation among (between) the group means, and the variation of
the individuals within the groups. Figure 10.1 shows hypothetical data for
five groups, under the null hypothesis, and under the alternative hypothesis.
Under the null hypothesis, there is variation among the group means, but it
is small compared to the variation of individual observations within each
group. Under the alternative hypothesis, there is more variation among the
group means relative to the variation within groups.
Some notation
If we let i = 1, 2, . . . , I index the I groups, and let j = 1, 2, . . . , ni index the
individuals within a group, then:
xij is an individual observation on the j-th individual in the i-th group;
x̄i is the mean of the observations in the i-th group;
x̄ is the grand mean;
ni is the number of observations in group i; and
n is the grand total number of observations.

Figure 10.1: Hypothetical data for five groups, under the null hypothesis
and under the alternative hypothesis.
With a little algebra (which we'll avoid) it can be shown that

  Σi Σj (xij − x̄)² = Σi Σj (xij − x̄i)² + Σi ni (x̄i − x̄)²

where i runs over the I groups and j runs over the ni observations within
group i. Letting SS denote Sum of Squares (of deviations), we can write
this in words:

  Total SS = Within SS + Between SS.    (10.1)
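The decomposition (10.1) can be verified numerically on any small dataset.
A sketch in Python with made-up data for three groups:

```python
# Made-up observations for three groups
groups = [[10.0, 12.0, 11.0], [14.0, 15.0, 16.0], [9.0, 8.0, 10.0]]

all_x = [x for g in groups for x in g]
grand_mean = sum(all_x) / len(all_x)

total_ss = sum((x - grand_mean) ** 2 for x in all_x)
within_ss = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
between_ss = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

# Total SS = Within SS + Between SS
print(total_ss, within_ss, between_ss)  # 62.0 = 6.0 + 56.0 (up to rounding)
```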
The Total SS is just the sum of squared deviations from the grand mean. If
we divide the Total SS by n − 1 we get something called the Total Mean
Square (Total MS). This is just the sample variance, i.e. the square of the
familiar sample standard deviation, based on lumping all the data together,
ignoring the groups. It could serve as an estimate of σ², the population
variance, provided we could assume the null hypothesis is true. It would
only be a reasonable estimate, however, if the grouping of the data didn't
matter. If some of the groups were shifted up or down relative to others,
the Total MS would increase.
Note that each group contributes ni − 1 degrees of freedom towards its
within-group standard deviation. If we add all of these degrees of freedom
together, we get

  Σi (ni − 1) = n − I

where I is the number of groups. If we divide the Within SS by its degrees
of freedom, (n − I), we call the result the Within-groups Mean Square
(Within MS). This is a pooled estimate of the variance, σ², just like we
used in the t-test. Of course this only makes sense if we assume that the
same variance (i.e. the same standard deviation) applies to all groups. This
usually seems reasonable when the null hypothesis is plausible.
Note that the Within MS is based on deviations from group-specific means,
so if we were to shift the data in some groups by a constant amount,
leaving other groups unchanged, we would not change the Within MS.
Under H0, both of these estimate the same thing:

  Total Mean Square = Total SS / Total df,
  Within Mean Square = Within SS / Within df.
Under the alternative hypothesis, however, the groups have different means,
so the Total MS increases relative to the Within MS. That implies that the
Between MS also increases relative to the Within MS.
The Between SS can be obtained by subtraction, and its degrees of freedom
can also be obtained by subtraction. Dividing the Between SS by its
degrees of freedom yields the Between-groups Mean-Square (Between MS),
which, like the Total MS, grows as the variation between groups increases.
We can test the global null hypothesis by comparing the Between MS to
the Within MS.
Note that the definition of the standard error of a mean applies to each
group mean, i.e.

  SE(X̄j) = σ / √nj.

If all of the groups had the same number of observations, i.e. if the nj were
all equal, we could estimate σ² from the variation among the group means.
We could simply calculate the variance of the means, and multiply by nj.
It's a little more tricky when the groups all have different sizes, but we can
always think of the between-groups sum of squares as the difference between
the total sum of squares and the residual (within groups) sum of squares.
These sums of squares and mean squares are often laid out in an Analysis
of Variance (aov, or ANOVA) table. We can get the analysis of variance
table for the caloric restriction data using the One-way ANOVA choice on
the Statistics/Means menu of R Commander.

            Df Sum Sq Mean Sq F value    Pr(>F)
DIET         5  12734  2546.8  57.104 < 2.2e-16 ***
Residuals  343  15297    44.6

The between-groups sum of squares is labeled DIET, referring to the
variable that determines the groups. The within-groups sum of squares is
labeled Residuals. The F-value is the ratio of the two mean squares
(between:within). Under the null hypothesis, both the within-groups
(residual) mean square and the between-groups (diet) mean square estimate
the same variance, σ², and their ratio follows a well-known family of
distributions called the F-distributions (in honor of Ronald Fisher). The
F-distribution with 5 numerator degrees of freedom and 343 denominator
degrees of freedom determines the probability of observing such a large
F-statistic under the null hypothesis, i.e. Pr(F ≥ 57.104) < 2.2e-16.
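The F value in the table is simply the ratio of the two mean squares, which
we can recompute from the printed (rounded) sums of squares; the result
agrees with the table to rounding error (sketch in Python):

```python
# Rounded values from the ANOVA table above
diet_df, diet_ss = 5, 12734.0
resid_df, resid_ss = 343, 15297.0

ms_diet = diet_ss / diet_df     # between-groups (DIET) mean square
ms_resid = resid_ss / resid_df  # within-groups (Residuals) mean square
F = ms_diet / ms_resid

print(round(ms_diet, 1), round(ms_resid, 1), round(F, 1))  # 2546.8 44.6 57.1
```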

The interpretation is that the variation in mean lifetime between the
various diet groups is so large that it cannot be attributed to chance. The
significance probability is so small that it is really beyond our ability to say
exactly what it is. Chance is effectively ruled out.
This strong F-test result gives us some confidence that the pattern of
variation we are interpreting really does represent consistent differences
that are worth interpreting. This is important because there are many
comparisons that could be made using six treatment groups, so there is a
good chance that at least one of them will appear somewhat large, due to
chance alone.

10.1.2  Pairwise t-tests

Just as we were able to use a pooled estimate of standard deviation (σ) in a
two-sample t-test, we can use the residual mean square as a pooled
estimate of variance (σ²). These can be used to construct t-tests for any
pair of groups.
To do this by hand, you can calculate the standard error of a difference
between two means as

  SE(Ȳ1 − Ȳ2) = √( s² (1/n1 + 1/n2) )

where s² is the Within MS. You can then calculate the observed t-statistic
as

  t = (ȳ1 − ȳ2) / SE(Ȳ1 − Ȳ2)

and refer it to a t-distribution with n − I degrees of freedom. Note that the


degrees of freedom include contributions from all groups. This is a big
advantage, if you are dealing with many small groups.
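These two formulas can be sketched directly. The within MS (44.6) comes
from the calorie-restriction ANOVA above, but the group means and sizes
below are hypothetical, purely for illustration:

```python
import math

s2 = 44.6  # Within MS from the ANOVA table: the pooled variance estimate

def pairwise_t(y1, n1, y2, n2):
    """t-statistic comparing two group means, using the pooled variance s2."""
    se = math.sqrt(s2 * (1.0 / n1 + 1.0 / n2))
    return (y1 - y2) / se

# Hypothetical means (months) and group sizes
t = pairwise_t(42.3, 57, 32.7, 49)
print(round(t, 2))  # refer to a t-distribution with n - I = 343 df
```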
With many possible pair-wise comparisons, there is an increased risk of
spurious findings, if we naively regard a p-value less than 0.05 as
significant. One strategy for preventing spurious findings is to regard the
global F-test as a gate-keeper, only going on to interpret pairwise
comparisons when the F-test is significant. This at least provides some
reassurance that there is some real pattern to discover, although it doesn't
guarantee that you will detect the correct pattern.
Sometimes, there are a few comparisons that are of strong interest a priori.
These might be examined regardless of the F-test result, especially if the
experiment includes a few groups that are expected to differ, and several
that are less likely to differ. In such a case, the F-test may essentially cast
too wide of a net to detect any departure from the null hypothesis.
With a modest number of comparisons of interest, it is often useful to
divide the benchmark for significance, α, by the number of comparisons.
This is called a Bonferroni correction. For example, if we want to keep the
probability of a spurious finding below 0.05, and we are making 5
comparisons, we require p < 0.05/5 = 0.01 before we consider a comparison
statistically significant. This would generally be done instead of relying on
the F-test as a gate-keeper. There are a number of more specialized methods
for multiple comparisons in situations that have more structure. There are
methods for making all pair-wise comparisons, for comparing each treatment
to a control group, for making all possible linear contrasts, and for
structured comparisons when the groups involve the presence or absence of
multiple factors.
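The Bonferroni rule described above is one line of arithmetic; a minimal
sketch (the 0.05/5 example mirrors the text):

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison p-value threshold for a family-wise error rate alpha."""
    return alpha / n_comparisons

# With alpha = 0.05 and 5 planned comparisons, require p < 0.01
print(bonferroni_threshold(0.05, 5))
```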
Planned comparisons should be considered when designing an experiment
that involves several groups. The treatment of multiple comparisons during
analysis calls for considerable thought and judgement. For the sake of
argument, we can ask whether we should attempt to control our overall
chance of spurious findings during our careers, or perhaps if we should
break a study into least publishable units to avoid the multiple testing
issue. More seriously, geneticists have published the opinion that their work
should always be adjusted for scanning the entire genome, because the
genetics community as a whole will eventually do that, even if an individual
investigator focuses on a few candidate genes.
Returning to the calorie restriction example, if you are working with R
Commander, you can check the Pairwise comparison of means box when
requesting the one-way ANOVA, which will generate a larger report that
includes a plot of family-wise confidence intervals for the difference of mean
lifetime between each pair of groups. Each interval has been made more
conservative so as to limit the chance of any false-positive finding if the
global null hypothesis is true.

10.2  A Genetics Example

The data in Table 10.1 came from an intercross (F2 generation) that was
part of a search for non-MHC genes influencing humoral immunity in mice.
The parental strains (denoted A or B) are congenic for the MHC locus, but
differ in their production of antibodies. The summary statistics are for an
antibody titer in response to a common antigen. Mice are grouped
according to three genotypes at the D5Mit122 marker, labeled AA, AS, or
SS.
We will not analyze this example, although both the AS and SS genotypes
appear to have higher titers than the AA genotype, and the difference is
about 4 times the largest standard error (just as a rough notion of scale).



Genotype   mean titer     SE      n     SD
AA              -0.19   0.06     49   0.42
AS               0.06   0.04    100   0.40
SS               0.06   0.06     55   0.445

Table 10.1: Antibody titers to a common antigen in mice from an intercross,
grouped according to genotype at the D5Mit122 marker.
The point of presenting this example is to note that the F-test provides a
convenient statistic for flagging variation among the three genotypes,
whether the pattern is that of dominant, recessive, or allele-dose models, or
something else, like a heterozygote advantage.
Tests with multiple degrees of freedom can detect a great variety of patterns
in data.

10.3  Why ANOVA?

Why not compare each pair of groups with the two-sample t-tests that we
learned about previously?
1. Multiple Comparisons: As the number of groups increases, the
number of pairs increases even faster, and the chance of a
false-positive finding goes up. The F-test that we get from ANOVA
permits a single test of the global null hypothesis.
2. Pooled standard deviation: ANOVA uses data from all groups to
estimate the standard deviation, which is assumed constant. This
permits more accurate inferences, especially when the individual
groups are small.
3. Structure In Groups: We may, for example, want to compare the
average of two similar treatment groups to that in a control group.
We may want to do so even when there is some systematic variation
among the groups being averaged. Simply lumping groups together
would inflate the variation within the aggregated groups. It is more



accurate to construct the appropriate test using the pooled variance
estimator (i.e. the Within Mean Square).

10.4  Example: Non-transitive Comparisons

Figure 10.2: Comparisons of T-cell groups.

Figure 10.2 shows flow cytometry results from Pospori et al. (2011 Blood)
comparing each of three types of T-cells (naive, CM and EM) across three
experimental systems (WT1 or LMP2, in an A2Kb to A2Kb transplant, or
a B6 to A2Kb transplant). Let's focus on just the EM (effector memory)
T-cells. Notice that the WT1 B6 EM group (let's call it group C) is not
significantly different from the LMP2 EM group (which well call group B).
Also note that the WT1 A2Kb EM group (let's call it group A) is not
significantly different from the WT1 B6 EM group (C). Notice, however,
that the WT1 A2Kb EM group (A) has a mean percentage that is significantly
greater than that of the LMP2 group (B). How can that be? Doesn't
C = B and C = A imply that A = B?

The thing to keep in mind is that failing to find any significant difference
between two groups does not mean that their population means are equal.
It just means that we don't have enough evidence to decide if they differ.
The absence of evidence is not evidence of absence (of a difference). We
have reasonably good evidence that groups A and B differ, but we can't
reliably conclude that group C is intermediate between them.

10.5  Exercises

Exercise 10.5.1 Download the WT1.csv dataset from the course page.
Use a program of your choice to do the following for the three groups of
results for EM T-cells.
(A) Make a plot of the data.
(B) Calculate the mean and standard deviation for each group.
(C) Calculate a p-value for each pairwise comparison.
(D) Interpret your findings in a clear, coherent, and brief statement,
limiting the chance of any false positive finding to less than 5 percent.

Exercise 10.5.2 The authors of the calorie restriction paper planned to
compare the N/R50 group to each of the other groups except for the NP
group, which was to be compared to the N/N85 group. Use R commander to
group, which was to be compared to the N/N85 group. Use R commander to
get a plot of family-wise confidence intervals, and use that plot to answer
the following questions. In each case, state the pair of groups that are being
compared, answer the question yes or no according to the confidence interval,
and state the confidence interval (as a pair of numbers).
1. Does reducing from 85 to 50 kcal/wk increase lifespan?

2. Is there an effect of pre-weaning diet restriction on lifespan?

3. Does further reduction from 50 to 40 kcal/wk yield a further increase
in lifespan?

4. Does reduction in protein, with the same calories, change the
distribution of lifespans?


5. Do the control mice, eating 85 kcal/wk, have the same distribution of
lifespan as mice on an ordinary laboratory diet?


Chapter 11
Context Issues


Statistical analysis methods are tools which may be applied in a large
number of contexts. In this chapter we will explore two major dichotomies
which seem to require a lot of thought about context.
1. Observational versus Experimental Comparisons: The same
statistical method, with the same numerical result, may yield a much
stronger conclusion in an experimental study, as opposed to an
observational study.
2. Hypothesis-driven versus High-Throughput Research:
Studies that sift through a large number of tests can easily generate
large numbers of false-positive findings. They require special methods
and stringent standards if the results are to be reliable, and the
strategy depends heavily on context.

11.1  Observational & Experimental studies

In an experiment, the investigator actively intervenes in the system being
studied. In an observational study, the investigator only observes the
system, without manipulating it.
The observational versus experimental dichotomy is perhaps the
single most important distinction to keep in mind when drawing
conclusions from data.
Major Points
Observational studies reveal associations.
Experiments reveal causes.
In observational studies, apparent treatment effects may be due to
confounding factors.
Randomized assignment of treatments prevents confounding.
Observational comparisons can arise in experimental studies.
Without concurrent controls, sampling problems can be huge.
Large numbers cannot overcome bias due to study design.

11.1.1  Association versus Cause and Effect

If we can only observe, without intervention, then drawing conclusions
about cause-effect relationships, if possible at all, will require knowledge or
assumptions in addition to the data. As one statistician1 put it, an
observational study of adult men might show that height, weight and girth
are all positively associated, and to a roughly similar degree. It takes
additional knowledge to conclude that reducing a mans weight will reduce
his girth but not his height.
Two Historical Examples
Here are two big questions from the mid 1950s:
1. Is the Salk Polio Vaccine effective?
2. Does smoking cause lung cancer?
The answers required different efforts:
A modest vaccine effect (risk ratio about 2.5) was firmly established
by a single large experiment.
A very strong smoking effect (relative risk about 10) required a
decade of diverse observational studies.

11.1.2  Randomization

Observational studies often make comparisons, but when we compare
groups as found, the groups may differ in ways that we don't know about.
Experimental studies may involve interventions, but if treatments are
assigned according to some pre-existing feature, then the treatment groups
may differ in important ways that we don't know about.
Randomization means making treatment assignments using some
randomizing device, like a computer program, or a table of random
numbers. If we use a randomizing device, we know how treatments were
assigned, and we know that the assignment is unrelated to anything else.
1 Moses, Chapter 1 in: Medical Uses of Statistics.


Haphazard assignment of treatments is not randomization. With haphazard
assignment, we do not know how treatments were assigned. We only know
that any pattern that might have crept into the assignment went
unobserved.
The Salk Vaccine Trial
The Salk vaccine was the first vaccine against polio, and the field trial of
the vaccine was the largest public health intervention study ever. It is of
special interest for our purpose because it involved two sub-studies. Both
substudies involved experimental intervention, comparing children with and
without the vaccine, but only one substudy involved randomization.
1. The original NFIP plan:
   - vaccine offered to second graders;
   - grades 1 and 3 are unvaccinated controls.
   Problem: Only one group has drop-outs due to lack of consent.
2. The randomized trial, used in some school districts:
   - Consent obtained before treatment assignment.
   - All get an injection, vaccine or placebo, decided at random.
   - Double blind.
The Results[2] (size of groups and rate of polio cases per 100,000):

  The randomized experiment
  Group          Size      Rate
  Treatment      200,000     28
  Control        200,000     71
  No consent     350,000     46

  The NFIP study
  Group                       Size      Rate
  Grade 2 (vaccine)           225,000     25
  Grades 1 and 3 (control)    725,000     54
  Grade 2 (no consent)        125,000     44

[2] Francis (1955) Am J Pub Health 45:1

Questions:

11.1. OBSERVATIONAL & EXPERIMENTAL STUDIES


1. In the randomized experiment, did the vaccinated children fare better
   than the control children? By how much?
2. In the NFIP study, did the vaccinated children fare better than the
   control children? By how much?
3. What was the difference in polio rates between the two studies:
   (a) among treated (i.e. vaccinated) children?
   (b) among children in the control group?
   (c) among children without consent?
Fill in the blank cells of the following table with yes or no as appropriate.

                Randomized experiment       NFIP study
  Group         consent?   vaccinated?      consent?   vaccinated?
  Treatment     yes        yes              yes        yes
  Control
  No consent

Comparing this table to your answers from the previous question, how
would you explain the different results from the two studies?
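The rate comparisons asked about above can be checked directly. Here is a quick Python sketch (Python stands in for any of the course's computing tools; the counts are transcribed from the tables above):

```python
# Polio rates per 100,000, taken from the two tables above:
# group -> (size, rate per 100,000).
randomized = {"Treatment": (200_000, 28), "Control": (200_000, 71),
              "No consent": (350_000, 46)}
nfip = {"Grade 2 (vaccine)": (225_000, 25),
        "Grades 1 and 3 (control)": (725_000, 54),
        "Grade 2 (no consent)": (125_000, 44)}

# Rate ratio in the randomized trial: roughly the 2.5-fold vaccine
# effect mentioned at the start of the chapter.
rate_ratio = randomized["Control"][1] / randomized["Treatment"][1]
print(f"control/vaccine rate ratio: {rate_ratio:.2f}")

# Rate differences (per 100,000) in each sub-study.
print("randomized:", randomized["Control"][1] - randomized["Treatment"][1])
print("NFIP:      ", nfip["Grades 1 and 3 (control)"][1]
                     - nfip["Grade 2 (vaccine)"][1])
```

Note how much smaller the NFIP rate difference is than the randomized one, even though both studies used the same vaccine.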
The Clofibrate Trial
This study has become a standard cautionary tale concerning the
distinction between randomized assignment of subjects to treatments and
self-assignment by the subjects.
The Coronary Drug Project was a large, randomized, placebo-controlled
trial of five drugs for the prevention of heart attacks in men with heart
trouble. For each drug, the mortality rate after five years on the study was
to be compared to that of the control (placebo) group.
Clofibrate, a cholesterol-lowering drug, did not produce any significant
reduction in five-year mortality. (20% with clofibrate versus 21% with
placebo). However, a large number of subjects took considerably less
medicine than they were supposed to. Those taking less than 80% of the
prescribed medicine were classified as non-compliant. Among the patients
taking clofibrate, the non-compliant subjects had 25% mortality at five


years, compared to only 15% for the compliant patients who took most of
their drug.
It is tempting to conclude that the compliant patients had better survival
because they took more of their drug, which protected them. However,
when we make the same comparison of compliant and non-compliant
subjects in the placebo group, we see that taking most of one's placebo
seemed to have a similar or even greater protective effect than taking most
of one's clofibrate, despite the fact that the placebo was designed to have
no plausible effect on survival!
                   Clofibrate               Placebo
                   Number    Deaths         Number    Deaths
  All              1,103     20%            2,789     21%
  Compliant          708     15%            1,813     15%
  Non-compliant      357     25%              882     28%
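A small Python sketch (percentages transcribed from the table above) makes the key contrast explicit: the as-randomized comparison is nearly null, while the compliance "effect" appears in both arms:

```python
# Five-year mortality from the Coronary Drug Project, as percentages.
clofibrate = {"All": 20, "Compliant": 15, "Non-compliant": 25}
placebo    = {"All": 21, "Compliant": 15, "Non-compliant": 28}

# The comparison that respects randomization: all clofibrate vs all placebo.
print("As randomized:", clofibrate["All"], "vs", placebo["All"])

# The observational comparison shows a similar "effect" in BOTH arms,
# including the placebo arm, where no drug effect is possible.
print("Clofibrate arm gap:", clofibrate["Non-compliant"] - clofibrate["Compliant"])
print("Placebo arm gap:   ", placebo["Non-compliant"] - placebo["Compliant"])
```

The compliant/non-compliant gap is at least as large on placebo as on clofibrate, which is the heart of the cautionary tale.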

Compliance is not causing the improvement in survival; it is reflecting a
pre-existing difference in the subjects. No one really knows why compliance
is a marker for better survival, but it might be that people who are in
rapidly declining health are less likely to take all of their medicine.
Which two numbers should you compare to evaluate the value of clofibrate?
Consider two null hypotheses:
1. H01 : Mortality is the same with or without Clofibrate.
2. H02 : Mortality is the same in compliant and non-compliant patients.
Does comparing compliant versus non-compliant patients on the placebo
arm contradict either of these hypotheses?
Does comparing compliant versus non-compliant patients on the clofibrate
arm contradict either of these hypotheses?
Does comparing patients as randomized contradict either of these
hypotheses? Would it contradict either of the hypotheses if the difference
for this comparison were large?
Considering all of the comparisons, which hypothesis is rejected, and which
hypothesis is left standing?



A Lab Experiment with Guinea Pig CMV

The clinical examples above are particularly clear illustrations of the
benefits of randomization, and of the need to compare groups as
randomized. Similar issues arise in laboratory work, and the same
principles apply. In particular, it may be tempting to make observational
comparisons within a randomized study. Such comparisons are dangerous.
Schleiss et al.[3] published a report with the following title:

Preconception vaccination with a glycoprotein B (gB) DNA


vaccine protects against cytomegalovirus (CMV) transmission in
the guinea pig model of congenital CMV infection.
The abstract states:
Preconception vaccination with gB did not decrease overall
pup mortality, although, within the gB-vaccine group, pup
mortality was lower among dams with high ELISA responses.
This statement is based on the data in the following table.

  Group             Dams   Pups   Dead pups   Mortality (%)
  Control            10     39       13            33
  gB vaccine         12     41       14            34
  ELISA titer
    > 3.4 log10       4     13        0             0
    < 3.4 log10       8     28       14            50

The first and second lines compare the mortality of pups from control
dams, and from dams vaccinated with the gB vaccine. The third and fourth
lines break down the vaccination results according to whether the antibody
titer in response to the vaccine was high or low.
The statement that "Preconception vaccination with gB did not decrease
overall pup mortality" is supported by comparing which numbers?
[3] Schleiss (2003) J Infect Dis 188:1868


The statement that "within the gB-vaccine group, pup mortality was lower
among dams with high ELISA responses" is supported by comparing which
numbers?
Is there a feature of this study that is analogous to compliance in the
clofibrate trial?
Consider the following two hypotheses:
H1 : The gB vaccine protects pups (perhaps by increasing ELISA titers).
H2 : Dams capable of mounting a robust immune response (high ELISA
titers) to any vaccine will tend to have lower pup mortality, regardless
of what vaccine is used.
Questions
1. What does each hypothesis predict for the comparison of vaccinated
versus control animals?
2. What does each hypothesis predict for the comparison of high versus
low ELISA titers?
3. Considering both comparisons, which hypothesis is rejected, and
which hypothesis is left standing?
Do you think that the comparison of the two subsets (high versus low titers)
of gB-vaccinated dams is a reasonable thing to report in the abstract?

11.1.3  The Role of Randomized Assignment

In the NFIP sub-study within the Salk vaccine trial we see an example of
experimental intervention that gets confounded with all the factors that are
associated with consent. In the clofibrate trial, we see an example of
observational comparisons within a randomized experiment. The guinea
pig vaccination study is very similar, in that there is an observational
comparison within an experiment. However, Schleiss et al. make no
mention of randomization, so the assignment of animals to treatments was
probably haphazard.


It is noteworthy that the same problems with observational comparisons
that we see in clinical science are also seen in laboratory science. However,
laboratories seem less inclined to use randomization, which is a standard
tool in clinical trials. In the 1980s, medical journals began requiring that
manuscripts describing randomized studies explain how the randomization
was done, in order to distinguish genuinely randomized studies from those
with haphazard treatment assignment. Now some of the more prominent
journals are starting to demand similar standards for pre-clinical research
with animals.
The role of randomization is to ensure that nothing is systematically
associated with treatment assignment. The beauty of randomization is that
it prevents systematic bias with regard to any potentially troublesome
factor, even ones we dont know about or cant measure.

11.1.4  Simpson's Paradox

Simpson's paradox is a phenomenon that illustrates the danger of drawing
causal conclusions from observational data. Understanding it suggests some
things that we can (sometimes) do about it. It is probably best appreciated
by way of an example.
The Berkeley Admissions Data
The data below[4] give the admission rate for men and women applying to
graduate school at U.C. Berkeley in the early 1970s. Only applicants to the
six largest departments are considered.

Gender
Admit
Male Female
Admitted 1198
557
Rejected 1493
1278
The admission rate for men is
1198
= .45
1198 + 1493
4

Bickel et al. 1975 Science, discussed by Freedman et al.


while the admission rate for women is

  557 / (557 + 1278) = 0.30.
The decisions are made at the department level, and the data for each
department are given below.

                     Men                            Women
  Dept   Applicants  Admitted  Percent   Applicants  Admitted  Percent
  A          825        512      62          108         89      82
  B          560        353      63           25         17      68
  C          325        120      37          593        202      34
  D          417        138      33          375        131      35
  E          191         53      28          393         94      24
  F          373         22       6          341         24       7
  total     2691       1198      45         1835        557      30

Questions:
1. Which department has the biggest rate difference (in percent) in favor
of men? What is that difference?
2. Which department has the biggest rate difference in favor of women?
What is that difference?
3. In the overall admission table, there was a 15 percentage-point
difference in favor of men. Where did that big difference come from?
Weighted means
The overall admission rates can be thought of as weighted means of the
departmental rates.
First, let's consider only women, and let's introduce some notation.
Let n_A, ..., n_F be the numbers of women applying to departments
A, ..., F, respectively.



Let N = n_A + n_B + n_C + n_D + n_E + n_F be the total number of women
applying.
Let a_A, ..., a_F be the respective numbers of women admitted to
departments A, ..., F.
Let a = a_A + a_B + a_C + a_D + a_E + a_F be the total number of women
admitted.
The overall, or crude, admission rate for women is

  a/N = 557/1835 = 0.30.

Define r_A = a_A / n_A, the admission rate for women applying to department
A, and define r_B, ..., r_F similarly.
Define w_A = n_A / N, the fraction of women who apply to department A, and
define w_B, ..., w_F similarly. Note that these weights sum to one.
We can then write the crude admission rate for women as a weighted
average of department-specific admission rates,

  a/N = w_A r_A + w_B r_B + ... + w_F r_F.

We can see that this is so by substituting the definitions of the rates and
weights into the weighted mean:

  (n_A / N)(a_A / n_A) + ... + (n_F / N)(a_F / n_F)          (11.1)
    = (1/N)(a_A + a_B + ... + a_F) = a/N.                    (11.2)

In numbers instead of symbols,

  (108/1835)(89/108) + ... + (341/1835)(24/341)
    = (1/1835)(89 + ... + 24)                                (11.3)
    = 557/1835 = 0.30,                                       (11.4)

and this is the crude admission rate for women.


The crude admission rate for women is a weighted sum of the female
admission rates in each department, with the weights determined by the
preferences of the women.


The crude admission rate for men is a weighted sum of the male admission
rates in each department, with the weights determined by the preferences of
the men.
If we calculate the weighted mean of the men's department-specific
admission rates, but using the preferences of the women as weights, we get
a 30% average admission rate, the same as for the women (to two digits).
If we weight the women's admission rates according to the men's
departmental preferences, we get about a 52% admission rate, somewhat
higher than the men's crude rate of 45%, reflecting the additional weight
given to department A, where a small number of female applicants had a
very high success rate.
Comparing the success rates of men and women with a common set of
weights addresses a "what if" question. Suppose that we were to follow the
success of a subgroup of applicants the following year, and suppose that the
men and women in this subgroup had similar departmental preferences. We
can use a weighted mean to predict the overall success of men and women
in such a group, assuming that similar rates for men and women in each
department apply. But we should realize that changes in application
patterns might generate changes in success rates. For example, the
apparent advantage to women applying to department A might not survive
a major increase in popularity among women.
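The weighting calculations above can be reproduced in a few lines of Python (a sketch; `weighted_rate` is a hypothetical helper name, and the code relies on both dictionaries listing departments A-F in the same order):

```python
# Department-specific counts from the Berkeley table:
# department -> (applicants, admitted).
men   = {"A": (825, 512), "B": (560, 353), "C": (325, 120),
         "D": (417, 138), "E": (191, 53),  "F": (373, 22)}
women = {"A": (108, 89),  "B": (25, 17),   "C": (593, 202),
         "D": (375, 131), "E": (393, 94),  "F": (341, 24)}

def weighted_rate(rates_from, weights_from):
    """Weighted mean of one group's departmental admission rates,
    using another group's departmental preferences as weights."""
    total = sum(n for n, _ in weights_from.values())
    return sum((n_w / total) * (a_r / n_r)
               for (n_w, _), (n_r, a_r)
               in zip(weights_from.values(), rates_from.values()))

# A group weighted by its own preferences recovers its crude rate.
print(f"Women, own weights: {weighted_rate(women, women):.2f}")
print(f"Men, own weights:   {weighted_rate(men, men):.2f}")
# Cross-weighting removes the effect of differing departmental preferences.
print(f"Men's rates, women's weights: {weighted_rate(men, women):.2f}")
print(f"Women's rates, men's weights: {weighted_rate(women, men):.2f}")
```

This is exactly the stratified "common set of weights" comparison described in the text.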
We should note that the crude rates are facts. It is true that a smaller
fraction of female applicants were admitted compared to male applicants.
However, if we want to interpret the data in a way that is not influenced by
the departmental preferences of men and women, we need to look beyond the
crude rates and work with the department-specific rates.

Cause versus Association


Let's consider one set of hypothetical data, with two possible stories
attached to it.
Story 1:
Suppose C is something that might be a cause, such as taking a drug, and
suppose that E is an event that might be the effect of the drug, such as
recovery from a symptom, say a headache. Suppose F is a confounding
factor, such as gender. Then the following table is a numerical illustration


of Simpson's paradox.
  Combined            E    ¬E   Total   Recovery Rate
  Drug (C)           20    20     40    50%
  No Drug (¬C)       16    24     40    40%
  Total              36    44     80

  Males               E    ¬E   Total   Recovery Rate
  Drug (C)           18    12     30    60%
  No Drug (¬C)        7     3     10    70%
  Total              25    15     40

  Females             E    ¬E   Total   Recovery Rate
  Drug (C)            2     8     10    20%
  No Drug (¬C)        9    21     30    30%
  Total              11    29     40
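The reversal can be verified mechanically. A short Python sketch, with the counts from the three tables above:

```python
# (recovered, not recovered) counts from the hypothetical drug data.
males   = {"drug": (18, 12), "no_drug": (7, 3)}
females = {"drug": (2, 8),   "no_drug": (9, 21)}

def rate(recovered, not_recovered):
    return recovered / (recovered + not_recovered)

# Within each stratum, the drug looks worse...
for name, tbl in [("males", males), ("females", females)]:
    print(name, rate(*tbl["drug"]), "vs", rate(*tbl["no_drug"]))

# ...but pooling the strata reverses the ordering.
drug    = tuple(m + f for m, f in zip(males["drug"], females["drug"]))
no_drug = tuple(m + f for m, f in zip(males["no_drug"], females["no_drug"]))
print("combined", rate(*drug), "vs", rate(*no_drug))
```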

Note that people who take the drug are, in the aggregate, more likely to
benefit than if they don't take the drug. However, the men who take the
drug are less likely to benefit, and so are the women.
In deciding whether or not to use the drug, should we believe the aggregate
result, or the gender-specific results?
It is important to distinguish seeing from doing. Suppose we hear of a
person taking the drug, but we aren't told that person's sex. The table tells
us that it would be a reasonable guess that the person is male, and hence
likely to recover. This is consistent with the combined table, which shows a
better success rate for people who take the drug. However, the guess that
the person is male is a big part of the reason for expecting success.
The question of whether to take the drug is a question about doing, not
about seeing. If the drug is bad for men, and bad for women, it is bad for
you. Given that it will not change your gender (even in the fictional world
of the example), it will not help your headache.
The answer to our question is to look at the gender-specific subtables. It is
important to realize, however, that the answer might be different in another
situation. What is perhaps more surprising is that the data at hand cannot
answer the question of which table to believe. The answer depends on the

context, not on the data.


Story 2: (same data)
Suppose that the drug is supposed to lower your blood pressure, that being
an important part of its mechanism of action. As before, let C be a putative
cause, i.e. the event that a given person chooses to take the drug, and let E
be the effect, i.e. the event that a person gets relief from their headache.
Now suppose that factor F refers not to gender, but to the event that the
drug actually does lower a given person's blood pressure, as we hope.
In this new scheme, instead of factor F influencing both choice C and
effect E, we have C causing F, which in turn causes E. The two situations
are depicted in the figure below, where the arrows indicate the direction
from cause to effect. Note that the direction of one of the arrows is reversed
in the new situation.
Since the drug acts by way of F , the variation in F is an important step by
which the cause produces its effect. We would not want to artificially
remove that effect by looking within sub-tables that have similar values of
the intermediary, F .


[Figure: two causal diagrams. In Story 1, Sex (F) has arrows to both
Treatment (C) and Recovery (E), and Treatment has a direct arrow to
Recovery. In Story 2, Treatment (C) has an arrow to Blood Pressure (F),
which has an arrow to Recovery (E), and Treatment also has a direct arrow
to Recovery; the arrow between Treatment and F is reversed relative to
Story 1.]

The question we want to ask is whether taking the drug will increase the
chance of relief, i.e. whether there is a positive causal effect from C to E.
In the first story, the influence of F on both C and E creates a positive
association between them. When we look at the aggregate table, we see the
effect of this positive association. When we look at the sub-tables, we see
the effect of the direct causal link between C and E. It is the causal link,
not the association, that tells us what to expect if we persuade someone
who is disinclined toward the drug to take it anyway.
If we were to do an experiment with the drug, assigning subjects to take the
drug or not at random, we would break the influence of F on C. Analyzing


the sub-tables (e.g. taking an average of the effects for men and for women)
in an observational study would accomplish the same thing as a randomized
experiment, provided that gender is the only confounding variable.
In the second story, the drug acts by way of F . We still want to know
whether there is a positive causal effect from C to E, but now there are two
paths from C to E, one direct, and the other through F . Because the
variation in F is an important step by which the cause produces its effect,
we would not want to artificially remove that effect by looking within
sub-tables that have similar values of the intermediary, F . So in the second
story, we should look at the aggregate table, as that reflects both causal
pathways.
Simpson's paradox is only paradoxical if we try to interpret associations as
if they were cause-and-effect relationships.
Probability describes associations, i.e. what we see. With the probability
calculus, we can have

  Pr(E | C, F_i) < Pr(E | ¬C, F_i)   for all i,

yet still have

  Pr(E | C) > Pr(E | ¬C).

Causal relationships tell us what will happen if we do something. The
notation below, due to Pearl[5], distinguishes doing from seeing. If do(C)
represents an action, then

  Pr(E | do(C), F_i) < Pr(E | do(¬C), F_i)   for all i

implies that we must also have

  Pr(E | do(C)) < Pr(E | do(¬C)).

The order reversal of Simpson's paradox can't happen with cause-effect
relationships. So when we see the reversal in associations, it seems
paradoxical.
Implications of Simpson's paradox:
Addressing causes requires looking beyond the data, at the context.
[5] Pearl, J. Causality: Models, Reasoning, and Inference, Cambridge Univ Press, 2001


The same data can lead us to use the sub-tables in the gender story,
but the aggregate table in the blood-pressure story, so the data alone
cannot possibly tell us which table to look at.
Confounding Variables
In an observational study, a confounding variable is one that is associated
with both the putative cause and the effect of interest, thereby creating an
association without a cause-effect relationship.
A confounding variable is
  - associated with the putative cause,
  - associated with the effect of interest, and
  - not on the causal pathway.
In story 1, gender is a confounding variable. It was associated with both
drug use and the desired effect. In story 2, the blood pressure drop was on
the causal pathway, so the association it generates is not spurious, but
causal and intended. The Berkeley admission example is somewhat different.
Gender does cause a difference in admission rates by way of the difference
in departmental preferences. We can think of department as lying on the
causal pathway, but the sub-tables are of interest because the interesting
question is not whether gender has a causal effect on admission, but
whether gender has any direct effect on admission that is not attributable
to department preferences. Departmental preferences appear to be
sufficient to explain essentially all of the difference in crude admission rates.
Dealing with confounding variables
Randomization balances treatment groups with regard to any relevant
factors (preventing association with the putative cause). Randomization
even balances factors that you are not aware of, i.e. you do not need to
know what the potential confounding factors are. This is a huge advantage
over observational designs.


Observational studies require that you know about, and can measure,
potentially confounding variables. If so, you can use any of these
approaches for reducing the confounding problem.

Matching: e.g. Each person responding to therapy is matched to a person
of similar age, disease, and time on therapy. Blood or tumor samples
might then be compared.

Stratification: Comparisons are done within similar groups, then the
comparisons are combined. Using a common set of weights to
summarize the Berkeley admission rates for men and women would be
a stratified analysis.

Statistical Adjustment: The effects of the putative cause and of
confounding variables are estimated together, within a single model.

Each of these is a major subject in its own right, and we will not have time
to explore them. All of them require more complexity in design or analysis
than an experimental study, and the results will usually be less reliable
than a good randomized experiment.

Berkson's paradox
The figure below adds another causal diagram to the diagrams we
considered above. A new phenomenon, called Berkson's paradox, is
relevant to model (c).

"




'

"

"

"

'

"

"

"

"

275

11.1. OBSERVATIONAL & EXPERIMENTAL STUDIES


V

"

"

Treatment
C

Treatment
C

F
Gender

"

Recovery E
(c)

(b)

Treatment
C

Recovery E

(a)


Blood
pressure

Recovery E

"

'

In model (c), should we condition on F or not?




"

"

"

The essence of Berkson's paradox is that observations on a common
consequence of two independent causes render those causes dependent.

Example: I toss a nickel and a quarter.

Q1: I tell you that the nickel came up heads. What is the probability that
the quarter came up heads?

Q2: I tell you that (at least) one of the coins came up heads. What is the
probability that the quarter came up heads?

The possibilities are:

  Nickel   Quarter
    H         H
    H         T
    T         H
    T         T

The information in Q2 rules out the possibility that both are tails. The
probability of heads is then 2/3 for both the nickel and the quarter, but the
probability that both are heads is 1/3, which is less than 2/3 × 2/3, so the
two events are not independent.

In Q1, the nickel and quarter remain independent. In Q2, the fact that at
least one coin came up heads depends on both coins, so the answer creates
a dependency between them.
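The two questions can be answered by brute-force enumeration of the four equally likely outcomes. A short Python sketch (`pr` is an ad hoc helper, not a library function):

```python
from itertools import product

# The four equally likely outcomes for (nickel, quarter).
outcomes = list(product("HT", repeat=2))

def pr(event, given):
    """Conditional probability by counting equally likely outcomes."""
    kept = [o for o in outcomes if given(o)]
    return sum(1 for o in kept if event(o)) / len(kept)

quarter_heads = lambda o: o[1] == "H"

# Q1: conditioning on the nickel leaves the quarter at 1/2.
p_q1 = pr(quarter_heads, given=lambda o: o[0] == "H")
# Q2: conditioning on "at least one head" moves the quarter to 2/3,
# and the joint probability of two heads (1/3) is less than
# (2/3) * (2/3), so the coins are no longer independent.
p_q2 = pr(quarter_heads, given=lambda o: "H" in o)
p_both = pr(lambda o: o == ("H", "H"), given=lambda o: "H" in o)
print(p_q1, p_q2, p_both)
```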

"

"


Berkson's paradox is quite general. If we select the material we study
based on some observed features, we will generate associations between the
factors that cause those features.
The upshot for part (c) of the figure above is that if we select, match,
stratify, or adjust based on F, we will create an apparent association
between the ancestor nodes of F, hence between C and E; i.e., we will
create confounding. This means that it is possible to over-correct for the
nuisance variables in an observational study. Being conservative by
adjusting for all conceivable possibilities is not a viable strategy. Learning
about a given causal connection from observational data requires some
information about the rest of the causal network.
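A small simulation makes the danger concrete. In the sketch below (all names are illustrative), C and E are independent coin flips and F is their common consequence; selecting on F manufactures an association between C and E:

```python
import random

random.seed(1)
n = 100_000
# C and E are independent causes; F is their common consequence (a collider).
C = [random.random() < 0.5 for _ in range(n)]
E = [random.random() < 0.5 for _ in range(n)]
F = [c or e for c, e in zip(C, E)]   # F occurs if either cause occurs

def p_e_given_c(c_val, rows):
    """Proportion of E among rows with the given value of C."""
    sub = [e for c, e in rows if c == c_val]
    return sum(sub) / len(sub)

everyone = list(zip(C, E))
selected = [(c, e) for c, e, f in zip(C, E, F) if f]  # condition on F

print("Unconditionally, C tells us nothing about E:")
print(p_e_given_c(True, everyone), p_e_given_c(False, everyone))
print("Conditioned on the collider F, C and E become associated:")
print(p_e_given_c(True, selected), p_e_given_c(False, selected))
```

In the selected subset, knowing that C did not occur forces E to have occurred, so conditioning on F has created exactly the spurious association the text warns about.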
Networks
For each of the three networks in the previous figure, we have answered the
question of conditioning (sub-tables) or not (aggregate table). For model
(a), we need to condition on F; for model (c), we must not condition on F.
For model (b), it depends on whether we want to estimate the total or the
direct effect of C on E. There is a mathematical algorithm for deciding
such questions quite generally, for very complex networks, but such
analysis of networks is only beginning to see use in cell biology.
It is interesting to notice that recent approaches to complex networks in
biology rely heavily on experimental manipulations, not just appropriate
analysis of observational data. Let's briefly mention two examples.
Causal Protein-Signaling Networks Derived from Multiparameter
Single-Cell Data. Sachs et al., 2005 Science:

  . . . Perturbing these cells with molecular interventions drove the
  ordering of connections between pathway components, wherein
  Bayesian network computational methods automatically
  elucidated most of the traditionally reported signaling
  relationships and predicted novel interpathway network
  causalities, which we verified experimentally. . . .
Life is Complicated. Nature, 2010:
This commentary in a special issue notes that Eric Davidson's lab, at
Caltech, has spent the last decade systematically knocking out genes



involved in the development of sea urchins. They have identified enough of
the gene networks controlling early development to recognize recurring
control circuit modules.

11.2  Hypothesis-Driven Research v. High Through-Put Screening

Statistical hypothesis tests are somewhat analogous to diagnostic screening
tests, with the a priori plausibility of a hypothesis playing the role of
prevalence. Hypothesis-driven research is analogous to a high-prevalence
situation, because we generally start out in the belief that there is likely
something to be detected, i.e. the null hypothesis is likely to be false, and
hopefully rejected. High through-put screening is analogous to a
low-prevalence situation, in that many null hypotheses are tested even
though they are likely to be true, i.e. for most of our tests, there is no
signal, only noise.
The problem is old
The problem with multiple testing was recognized as an important issue
long before modern high through-put testing. In 1843, Cournot[6] described
the problem with respect to an investigation of the chance of male birth:

  One could distinguish first of all legitimate births from those
  occurring out of wedlock, . . . one can also classify births according
  to birth order, according to age, profession, wealth, or religion of
  the parents . . . usually these attempts through which the
  experimenter passed don't leave any traces; the public will only
  know the result that has been found worth pointing out; and as
  a consequence, someone unfamiliar with the attempts which
  have led to this result completely lacks a clear rule for deciding
  whether the result can or cannot be attributed to chance.
The problem is pervasive
John Bailar, a MacArthur Fellow and long-time reviewer for the New
England Journal of Medicine, wrote that the real threat to the integrity of
[6] Translated from French, as quoted by Schaffer, Annu Rev Psychol 1995, 46:561-84


science[7] is not fraud or fabrication of data, but a pervasive failure to
divulge the limitations of research. He writes that

  We all seem to be remarkably tolerant of deliberate deception in
  the selective reporting of data, as long as nobody is lying.

While the ability to notice interesting patterns is valuable, and reporting
interesting observations has its place, it is fraudulent to present the results
of an extensive search for pattern as if it were hypothesis-driven research.
A potential scam
Several authors have described how selective reporting of predictions can be
used in a confidence scam.
The con man starts with a large email list and divides it in two. He sends
half the list a prediction that the stock market will go up, and half a
prediction that it will go down. When there is a noteworthy move, he sends
a follow-up mailing, but only to the half who received the correct
prediction. After three such rounds, he has a 3-for-3 record with one-eighth
of his initial list, so he offers them an investment opportunity.
Scientists are more honest, we hope, but even an honest scientist can
unwittingly engage in something similar. He might have a well-motivated
hypothesis that implicates a signaling pathway in some phenomenon, but
which doesn't point to any one of the many molecules involved. If the
investigator sequentially pursues hunches until he hits something that looks
significant, there is a large chance of a false-positive conclusion, unless the
full potential for searching is recognized and accounted for.
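The arithmetic behind this danger is simple. If k independent true null hypotheses are each tested at level α = 0.05, the chance of at least one spurious "significant" result is 1 − (1 − α)^k, which grows quickly:

```python
# Probability of at least one "significant" result when k independent
# true null hypotheses are each tested at level alpha.
alpha = 0.05
for k in (1, 5, 10, 20, 50):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:3d} tests: P(at least one false positive) = {fwer:.2f}")
```

With only ten to twenty hunches pursued, a false positive becomes more likely than not.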
A curious table
Repeatedly dividing data into subsets, as described by Cournot, can
generate spurious results. The statistician (Richard Peto) involved in
ISIS-2, a major international clinical trial, was well aware of this problem
when referees demanded that the trial report include a table with extensive
sub-group comparisons. The conclusion he wanted to report was that the
interventions (aspirin and streptokinase) could save many lives if used
routinely. If there was a spurious finding that the benefit did not apply to
some subgroup, then physicians might not apply these cheap and safe drugs
[7] The Chronicle of Higher Education, April 21, 1995, B12

as widely, and many patients would miss out on the life-saving benefit. His
response was to include a table in which the largest sub-group difference
was for patients grouped by astrological birth sign. This comparison, listed
first, serves as a warning.


11.2.1  Testing One Hypothesis (Review)

                          State of Nature
  Decision                H0 True         H0 False
  Do not Reject H0        Correct         Type II Error
  Reject H0               Type I Error    Correct

The test size, or significance level, α, is the type I error rate we are willing
to risk.
Given a set of data, the p-value is the smallest value of α that would lead
us to reject the null hypothesis. We can state α before the experiment.
After observing the data, we calculate p. If p ≤ α, we reject.
The probability of not making a type II error is the power of the test.
Power is a function: it depends on μ1 − μ2, as well as the standard
deviation and the sample size.
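As a sketch of power as a function, the normal-approximation power of a two-sided, two-sample z-test can be written in a few lines (this assumes known σ and equal group sizes; it is an illustration, not the only formula in use):

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(mu_diff, sigma, n, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with
    n subjects per group (normal approximation, known sigma)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sigma * sqrt(2 / n)        # standard error of the mean difference
    shift = mu_diff / se
    nd = NormalDist()
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)

# Power grows with the sample size and with the effect size |mu1 - mu2| / sigma.
print(f"{power_two_sample(1.0, 1.0, 16):.2f}")
print(f"{power_two_sample(0.5, 1.0, 16):.2f}")
print(f"{power_two_sample(0.5, 1.0, 64):.2f}")
```

Halving the effect size and quadrupling n leaves the power unchanged, which is why effect size and sample size always have to be considered together.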
Effect Size:  |μ1 − μ2| / σ

11.2.2  Multiple Testing Situations

There are a variety of situations that generate multiple hypothesis tests.
We often refer to the general problem as one of multiple comparisons, but
the hypotheses being tested do not have to be comparisons.

Groups
If we have 6 groups, we can make 6 × 5/2 = 15 pairwise comparisons. We
can test hypotheses such as:

  H_{5,6}: μ5 = μ6,
  H_{1,2,3}: μ1 = μ2 = μ3,

etc. In the situation described by Cournot, the groups are not pre-defined,
as would be the case for different treatments. We might then need to
consider all possible divisions of individuals into groups.

Genetics
Even with only two groups, we may have multiple outcomes to compare. In
genetics, for example, we might regard the expression level (or allelic
proportions) of each gene as a separate outcome to be compared across two
or more groups.
Genetic linkage studies
Genetic linkage studies often use markers spread across the entire genome.
Because nearby loci are linked, there is a limit to the effective number of
tests one can do, no matter how dense the panel of markers might be.
Lander and Kruglyak[8] give significance levels that are required for a finding
of statistically significant linkage to a disease-related locus. The necessary
p-values range from about p = 0.0007 for a mouse intercross to about
p = 0.00002 for allele-sharing studies in humans.
Genome-wide association studies
Genome-wide association studies (GWAS) compare genotypes of people
with a trait (cases) to either parents, siblings, or unrelated controls.
Because these studies involve linkage disequilibrium, which ties together
alleles over much shorter distances than ordinary genetic linkage, there are
many more independent loci that can be compared. An impressive p-value
for such a study would be on the order of 10^-8, unless the p-value was
computed by a method, such as permutation, that calculates the
significance on a whole-genome basis.
Gene expression microarrays
Two groups of mice can be compared with regard to the expression of
essentially every gene, or even with regard to the expression of
non-translated transcripts. These are potentially independent responses,
although they are probably related via unknown interactions.
8

Nature Genetics 1995, 11:241–247

CHAPTER 11. CONTEXT ISSUES

11.2.3 Type I Error Control

We will briefly consider some variations on the idea of controlling type I
error (false rejection of the null hypothesis) when there are many null
hypotheses being tested.
Weak versus Strong Control
Weak control of the type I error rate refers to methods that limit the
probability of rejecting any null hypothesis when all null hypotheses are
true (i.e. when nothing interesting is going on). The most common example
of such a method is called Fisher's protected LSD (least significant
difference). This method is applicable when we have k groups. We first test
the global hypothesis,
H0 : μ1 = μ2 = . . . = μk,
typically by using an F-test, and only if we can reject this at the α level do
we continue to test individual comparisons. If the global hypothesis is true,
i.e. all population means are identical, then this procedure limits the false
rejection rate to α.
Suppose the global null hypothesis is false, because one group has a very
different mean from the rest, which is easily detected. Then we will
probably reject the global null hypothesis, and the chance of mistakenly
finding a difference within the set of identical means can be much larger
than α. For this reason, many regard weak control as inadequate.
Strong control means that we control the probability of false rejections
even when some of the null hypotheses are false and should appropriately
be rejected. The methods below achieve strong control.

11.2.4 Error Rate Definitions

We first need to define the family of hypotheses that we are to consider.
Then we can define several levels of error control with respect to that
family.
PCE The per-comparison error rate, or error rate per hypothesis, is
defined for each hypothesis as simply its type I error rate. Controlling



the PCE means treating each comparison in isolation, without regard
to the multiple comparisons. This may be reasonable for
well-motivated hypothesis-driven research.
FWE The family-wise error rate is the probability of making at least one
error (false rejection). Controlling the FWE is a conservative goal.
We limit the chance of making any false claim.
PFE The per-family error rate bounds the expected number of false
rejections in the family, without regard to how many correctly
detected departures from the null hypothesis there are.
FDR The false discovery rate is the expected proportion of rejections that
are false rejections. Controlling the FDR is less stringent than
controlling the FWE, and it allows more hypotheses to be rejected. It
may be appropriate when the rejected null hypotheses can be verified
by further investigation, and some false rejections can be tolerated in
pursuit of more true rejections (i.e. detections of effect).

11.2.5 FWE

Genetic studies generally require strong control of the FWE, because
pursuing a false lead can be extremely expensive.
Bonferroni method
The Bonferroni method is conservative, but very simple and extremely
general. The idea is that the probability of the union of events is no greater
than the sum of the individual event probabilities. In symbols,
Pr(A1 ∪ A2 ∪ . . . ∪ Ak) ≤ Σi Pr(Ai).
If we let Ai be the event that we reject hypothesis Hi, which has
probability αi assuming the null hypothesis, then the probability of one or
more false rejections is less than the sum of the αi.
If we have n hypotheses, and if we test each hypothesis at nominal level
α/n, then the probability of any (one or more) false rejection is no greater
than α.
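As a concrete sketch of the rule (Python used purely for illustration; the
ten p-values are hypothetical):

```python
# Bonferroni: with n hypotheses, reject H_i when p_i <= alpha / n.
def bonferroni_reject(pvalues, alpha=0.05):
    n = len(pvalues)
    return [p <= alpha / n for p in pvalues]

# Ten hypothetical p-values; only the first clears 0.05/10 = 0.005.
pvals = [0.0008, 0.009, 0.165, 0.205, 0.396,
         0.450, 0.641, 0.781, 0.900, 0.993]
print(bonferroni_reject(pvals))
```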

Holm's sequential method
We can gain some additional power by working in stages.
Assume that the hypotheses are numbered in order of their p-values
(smallest to largest). At the first stage, we reject H1 if p1 ≤ α/n. If we fail
to reject at this stage, we stop and do not reject any hypothesis. If we
reject H1, then we reject H2 if p2 ≤ α/(n − 1), and so on. This sequential
method is somewhat more powerful than the Bonferroni method, and
equally general.
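A sketch of the step-down rule (again Python for illustration, with
hypothetical p-values):

```python
# Holm's method: visit the p-values from smallest to largest, compare the
# one examined at step s (s = 0, 1, ...) to alpha / (n - s), and stop at
# the first failure; all remaining hypotheses are retained.
def holm_reject(pvalues, alpha=0.05):
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    reject = [False] * n
    for step, i in enumerate(order):
        if pvalues[i] <= alpha / (n - step):
            reject[i] = True
        else:
            break  # once we fail to reject, stop testing
    return reject

print(holm_reject([0.004, 0.020, 0.9]))
```

With p-values (0.004, 0.020, 0.9) and α = 0.05, Holm rejects the first two
hypotheses, while Bonferroni's fixed cutoff of α/3 ≈ 0.0167 rejects only
the first; this is the extra power gained by working in stages.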

Refinements
Many more refined multiple comparisons methods have been described to
take advantage of special features of different problems. If the tests are
independent, or have limited dependencies, then some modest power gains
are possible. More substantial gains are possible if there is structure to the
family, such as factorial treatment combinations. Limiting the number of
comparisons can yield large power gains. For example, we may want to
compare each group to a common control or reference group, which reduces
the number of comparisons we need to make.

11.2.6 FDR

Controlling the false discovery rate is a compromise that allows more false
rejections when there are many rejections that are likely to be true. If there
are m hypotheses, m0 of which are true (i.e. null), and m − m0 of which are
false (i.e. effects to be detected), then the outcome of multiple tests in an
experiment can be described by the following table.

True null hypotheses


Non-true null hypotheses

Declared
non-significant
U
T
mR

Declared
Significant
V
S
R

Controlling the FDR controls the expected value of V /R.

Total
m0
m m0
m



One method of FDR control9 also works with the ordered p-values,
p1 ≤ p2 ≤ . . . ≤ pm. Pick a rate q (where 0 ≤ q ≤ 1) at which we want to
control the FDR. Let k be the largest i for which
pi ≤ (i/m) q.
Then we can reject all Hi for i = 1, 2, . . . , k.


FDR controlling procedures are popular for gene expression microarrays,
where the aim is often to get a short list of genes likely to have differential
expression, and subject these to more definitive testing. This might involve
using northern blots instead of microarrays.
There are two main variations on FDR control that you are likely to
encounter.

Benjamini and Hochberg (1995) control the False Discovery Rate (FDR),
FDR = E(V /R | R > 0) Pr(R > 0),
where V is the number of false rejections, and R is the total number
of null hypotheses rejected.
Storey (2001) controls the positive FDR,
pFDR = E(V /R | R > 0).
This controls at a stricter level.

FDR control, B&H procedure

To control the FDR at q = 0.05:
9

Benjamini and Hochberg, 1995, Journal of the Royal Statistical Society, Series B,
57:289–300

Rank (j)   P-value   (j/m)q   Reject H0?
1          0.0008    0.005    1
2          0.009     0.010    1
3          0.165     0.015    0
4          0.205     0.020    0
5          0.396     0.025    0
6          0.450     0.030    0
7          0.641     0.035    0
8          0.781     0.040    0
9          0.900     0.045    0
10 = m     0.993     0.050    0
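The decisions in the table can be reproduced mechanically; here is a sketch
of the step-up rule (Python for illustration):

```python
# Benjamini-Hochberg step-up: with sorted p-values p(1) <= ... <= p(m),
# find the largest k with p(k) <= (k/m) * q and reject hypotheses 1..k.
def bh_reject(sorted_pvalues, q=0.05):
    m = len(sorted_pvalues)
    k = 0
    for i, p in enumerate(sorted_pvalues, start=1):
        if p <= (i / m) * q:
            k = i  # largest index so far meeting the criterion
    return [1] * k + [0] * (m - k)

pvals = [0.0008, 0.009, 0.165, 0.205, 0.396,
         0.450, 0.641, 0.781, 0.900, 0.993]
print(bh_reject(pvals, q=0.05))  # matches the table: 1 1 0 0 0 0 0 0 0 0
```

Note that p2 = 0.009 exceeds the Bonferroni cutoff 0.05/10 = 0.005, so
FDR control at q = 0.05 rejects a hypothesis that FWE control at α = 0.05
would retain.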

11.2.7 q-values

The q-value for a given feature or comparison is defined as the minimum
FDR that can be attained when calling that feature significant.
In a study of differential gene expression, if a gene has a q-value of 0.013, it
means that 1.3 % of genes that show p-values as small as this gene are
expected to be false positives.
The q-value is not an adjusted p-value, but it does depend on the whole
collection of p-values.
q-values have become the standard way of expressing significance in
microarray gene expression studies.
Computing tools for q-values:
- R with the siggenes package (tool of choice)
- Genstat with fdrmixture
- Excel with an add-in (not vanilla Excel)
- a local statistician
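The packages above also estimate the proportion of true null hypotheses
(Storey's refinement), but the basic Benjamini–Hochberg-style q-value is
just a monotone transform of the ordered p-values. A simplified sketch,
taking that proportion to be 1 (Python for illustration):

```python
# Simplified q-values: q(j) = min over k >= j of (m/k) * p(k), computed
# on the sorted p-values. This is the BH adjusted p-value; Storey's
# q-value would additionally scale by an estimate of the proportion of
# true null hypotheses.
def simple_qvalues(sorted_pvalues):
    m = len(sorted_pvalues)
    q = [min(1.0, (m / j) * p) for j, p in enumerate(sorted_pvalues, start=1)]
    # enforce monotonicity with a running minimum from the right
    for j in range(m - 2, -1, -1):
        q[j] = min(q[j], q[j + 1])
    return q

pvals = [0.0008, 0.009, 0.165, 0.205, 0.396,
         0.450, 0.641, 0.781, 0.900, 0.993]
qvals = simple_qvalues(pvals)
```

With the p-values from the worked table, this gives q-values of 0.008 and
0.045 for the first two hypotheses, consistent with rejecting exactly those
two when controlling the FDR at q = 0.05.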

11.2.8 Summary of multiple testing

It is important to recognize the extent of the multiple testing problem, and
to fairly present the extent of searching.
Different approaches are appropriate in different contexts.
A clear a priori hypothesis requires no adjustment.
When comparing a few groups, the ANOVA F-test can serve as a
gate-keeper, and several specialized methods are available.
The Bonferroni method is quite general, and provides strong protection
against false-positive findings.
False Discovery Rates (FDR) and q-values have become standard in gene
expression studies. These methods provide a short list of genes, which may
be somewhat contaminated with false positives. This is suitable when one
has the opportunity to follow-up on the short list.

11.3 A Review Problem

Exercise 11.3.1 The file pipette.csv (on the course webpage) records
pipette calibration data obtained by six COH graduate students. The first
column gives the nominal volume for the pipette (µL). The remaining
columns give the delivered weight of room temperature deionized water (mg)
obtained by each of six students, indicated by initials in the first row. The
first three rows of data are the respective weights of three deliveries of water
with the pipette set to deliver 50 µL. The remaining rows follow a similar
pattern, but with the pipette set to deliver larger volumes.
The purpose of collecting these data is to address the question: How
accurate is a pipette? Use the data to address the question.
An answer should probably include both graphical and numerical summaries
of the data. It would be appropriate to consider natural aspects of the
question, such as how accurate in whose hands? or how accurate at what
target volume? Try not to think of this as a homework problem designed to
illustrate any particular statistical calculations. Instead, try to answer the
question in a way that seems useful to an investigator who uses a pipette.
