
Introduction to Biostatistics

Geert Verbeke
Biostatistical Centre, K.U.Leuven
geert.verbeke@med.kuleuven.be
http://perswww.kuleuven.be/geert verbeke

Bachelor Biomedical Sciences / Bachelor Pharmaceutical Sciences

Contents

I    Introduction, motivation and example
     1  Introductory material
     2  Homeopathy: The test

II   Basic principles of statistical methodology
     3  What is statistics ?
     4  Population versus sample
     5  Causality and randomization

III  Describing and summarizing data
     6  Types of outcomes
     7  Graphical presentation of data
     8  Summary statistics

IV   Basic concepts of statistical inference
     9  Describing the population
     10 From the population to the sample, and back to the population
     11 Estimation, sampling variability, bias, and precision
     12 Confidence intervals
     13 Hypothesis testing

V    Some frequently used tests
     14 The comparison of two means: Unpaired data
     15 The comparison of two proportions: Unpaired data
     16 The comparison of two means: Paired data
     17 The comparison of two proportions: Paired data

VI   Further topics on statistical inference
     18 Errors in statistics: Basic concepts
     19 Errors in statistics: Practical implications
     20 One-sided versus two-sided tests
     21 Describing associations
     22 Non-parametric statistics

Bibliography

Part I
Introduction, motivation and example


Chapter 1
Introductory material

. Motivation
. Course material
. Evaluation system

1.1 Motivation

Master thesis
Statistics in the (bio-)medical literature
Correct analysis of collected data
Correct interpretation of results

1.2 Course material

Copies of the course notes: Toledo


Data sets analysed in the course: Toledo
Papers from biomedical literature, discussed in course: Toledo
Statistica software:

. Available in all K.U.Leuven PC rooms


. Available through LUDIT:
http://ludit.kuleuven.be/software/

. ...


Vestac JAVA applets

. Online:
http://ucs.kuleuven.be/links/index.htm

. Local installation:
http://ucs.kuleuven.be/java/download/download.html
and follow instructions

1.3 Evaluation system

Part A:

. Take-home assignment (individualized)


. Data analysis
. Project initiated during practical sessions

Part B: Open book, individualized, written, multiple choice exam


Chapter 2
Homeopathy: The test

. The controversy
. The movie
. Blinding
. Placebo
. The ultimate experiment
. The statistics
. Errors in statistics

2.1 The controversy

2.2 The movie

Original version: BBC (Horizon)


Dutch version: VRT (Overleven)

2.3 Blinding

Scientists don't always think rationally. . . They can fool themselves.

(J. Randi)

Bias can be introduced if the scientist knows what samples are being investigated.
This can be avoided by blinding
Blinding is obtained by randomly assigning codes to the samples/treatments.
The codes are broken after all data have been collected.
The less objective the measurements are, the more important is blinding:
. Survival of the patient is an objective measure
. Tumour reduction is a semi-objective measure


In some cases it is important that patients themselves do not know what


treatment was received:
. Pain measurements
. Quality of life measurements
One can then use double blinding
Blinding is not always possible:

. Comparison of different bandages


. Comparison of different surgery suture techniques (staples versus thread)


2.4 Placebo

pills that contain no active ingredient at all, just plain sugar

The fact that treated patients improve is no evidence for the efficacy of the
treatment:
. Natural improvement can occur
. Improvement can be the result of the attention given to the patient
Hence, showing efficacy of a treatment requires comparison to placebo
This explains the popularity of placebo-controlled trials in the bio-medical sciences
Is the use of placebo ethical ?

. Pro: The advancement of science is dependent on the sacrifice of a few for the benefit of many others

. Contra: Physicians/investigators are never relieved of their obligations of care towards their patients (Declaration of Helsinki, 1964)


In cases where the use of placebo is considered unethical, the new treatment is
often compared to a standard one.
The aim of the study is then to show that the new treatment is at least as
good as the standard treatment.


2.5 The ultimate experiment

2 × 5 tubes are prepared, with 5C dilution. The first five start from the active substance, the second five from pure water
These 10 tubes are given a random label: blinding
The tubes are diluted further to obtain 2 × 20 dilutions of 18C

New labels are assigned, in order to rule out any form of fraud.
A sample of living human cells is added to a drop from each tube.
How many cells have been activated by the different test solutions ?
Measurements are performed by two different labs in parallel.
Labs were told there were 20 active solutions and 20 placebo solutions. This was done to prevent the researchers from classifying all solutions as non-active.

2.6 The statistics

The solutions have been analysed in parallel by two different labs
Let's focus on the results from the second lab:

The results can also be summarized as a 2 × 2 table:

                           Decision
  Reality         Homeopathy   Placebo
  Homeopathy          11           9        20
  Placebo              9          11        20
                      20          20        40

11 out of the 20 H tubes are scored as being active
Since the labs knew that there were 20 active solutions, this immediately implies that 9 of the 20 P tubes are scored as active, and hence that 11 of the 20 P tubes are correctly scored as non-active

If P and H were really equivalent we would expect:

                           Decision
  Reality         Homeopathy   Placebo
  Homeopathy          10          10        20
  Placebo             10          10        20
                      20          20        40

Since we observed more correct classifications (11/20), the result of the experiment can be considered as some evidence for H efficacy.
On the other hand, 11/20 could have occurred by pure chance

This random variability is also reflected in the results of the first lab:

                           Decision
  Reality         Homeopathy   Placebo
  Homeopathy          11           9        20
  Placebo              9          11        20
                      20          20        40

In general, repeating an experiment would rarely lead to exactly the same results
By random chance, one may obtain results which are slightly different from 10/20
How much is slightly ? 11/20 ? 12/20 ? 13/20 ? . . .

What number x of correct positive test results should have been obtained in order
to consider this sufficient evidence in favour of H ?
The answer should be based on the probability of having at least x correct positive
test results by pure chance
Such probabilities can be calculated using probability theory:
   x     p: Probability of at least x correct positive test results by pure chance
  11     0.3762
  12     0.1715
  13     0.0564
  14     0.0128
  15     0.0019
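These tail probabilities can be reproduced with a few lines of code. The sketch below assumes the set-up described above: 40 tubes of which 20 are truly active, and each lab labels exactly 20 tubes as active, so the number of correctly identified H tubes follows a hypergeometric distribution (this model is implicit in the slides, not stated there explicitly).

```python
# Sketch: P(at least x correct positive test results by pure chance),
# assuming a hypergeometric model: 40 tubes, 20 truly active, 20 labelled active.
from scipy.stats import hypergeom

rv = hypergeom(M=40, n=20, N=20)       # population 40, 20 "successes", 20 draws

for x in range(11, 16):
    print(x, round(rv.sf(x - 1), 4))   # sf(x - 1) = P(X >= x)
# Prints 0.3762, 0.1715, 0.0564, 0.0128, 0.0019, matching the table above.
```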

Note how unlikely it would be to observe, e.g., x = 15 correct positive test results
if H and P would be equivalent (p = 0.0019)
Therefore, observing x = 15 could be considered strong evidence in favour of H
On the other hand, if there is no difference between H and P, one will observe at
least 11 correct positive test results in 37.62% of the cases, by pure random
chance.
Our experiment therefore does not provide any evidence in favour of H

There's absolutely no evidence at all to say that there is a difference. . .

(M. Bland)

2.7 Errors in statistics

Since what is observed in an experiment is subject to random variation, the conclusion is also subject to random variation.
For example, even if H and P are equivalent, one may observe 15 correct positive
test results by pure chance: This will happen in 19 experiments out of 10000.
In such a case one would conclude that there is evidence for a difference between
H and P, even when there is no difference at all.
This shows that statistics will only help in summarizing and expressing the
strength of the evidence, in favour or against some specific statement.
One always has to keep in mind that errors can be made in the conclusions


Conclusion:

Statistics can prove everything

Statistics helps to:

. quantify the errors


. control the errors


Part II
Basic principles of statistical methodology


Chapter 3
What is statistics ?

. Examples
. Conclusion


3.1 Example: Sickness absence

In occupational medicine, one is interested in studying factors that influence


absence due to sickness
The following data were obtained from 585 employees with a similar job:

                    Sickness absence
  Gender            No      Yes
  female            245     184      429
  male               98      58      156
                    343     242      585

Research question:
Is there a relation between absence and gender ?

184/429 = 42.9% of the females, and 58/156 = 37.2% of the males


have been absent
This suggests that females are absent more often than males
However, even if absence due to sickness is equally frequent amongst males and
females, the above results could have occurred by pure chance.
It therefore would be of interest to calculate how likely it would be to observe such
differences, by pure chance


If this would be very unlikely, then the data provide evidence for a relation
between gender and absence
If this would not be unlikely, then the data provide no evidence for such a relation.
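As an illustration of what "by pure chance" means here, the sketch below repeatedly reshuffles the 242 absences at random over the 585 employees (i.e., it assumes no relation between gender and absence) and records how often a female/male difference at least as large as the observed 42.9% versus 37.2% appears. The simulation set-up is an illustrative assumption, not part of the course material.

```python
# Sketch: how often does the observed gender difference in sickness absence
# (42.9% of 429 females versus 37.2% of 156 males) arise by pure chance?
import numpy as np

rng = np.random.default_rng(1)
n_female, n_male, n_absent = 429, 156, 242
observed_gap = 184 / 429 - 58 / 156          # about 0.057

count = 0
for _ in range(10_000):
    absent = np.zeros(n_female + n_male, dtype=int)
    absent[rng.choice(n_female + n_male, size=n_absent, replace=False)] = 1
    if absent[:n_female].mean() - absent[n_female:].mean() >= observed_gap:
        count += 1

print(count / 10_000)   # proportion of reshuffles with a gap at least as large
```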


3.2 Example: Cervical cancer

Graham and Shotz [1]; Hand et al. [2] p. 247


In order to study the relationship between the occurrence of cervical cancer and
the age at first pregnancy, data were collected on 49 cancer cases and 317
non-cancer cases (controls). All women were asked about their age at first
pregnancy, and the data are summarized as:

                      Disease status
  Age          Cervical cancer   Control
  ≤ 25               42            203       245
  > 25                7            114       121
                     49            317       366

Research question:
Is there a relation between cancer and age ?

Note that 7/49 = 14.3% of the cancer cases had their first pregnancy after the
age of 25 years, while this is 114/317 = 35.96% in the control group
This suggests that cancer is more likely to occur when the first pregnancy was
before the age of 25 years.
How likely are the observed differences to occur by pure chance, if there is no
relation at all between cancer and age at first pregnancy ?


If the chance of observing this would be extremely small, then the observed data
provide evidence that there indeed is a relationship
If this chance is high, then the above data do not provide evidence for any relation
at all.


3.3 Example: Weight gain in rats

Armitage & Berry [3] p. 109


Consider the gain in weight (g) of 19 female rats between 28 and 84 days after
birth.
12 were fed on a high protein diet and 7 on a low protein diet:
Research question:
Does the weight gain depend on the diet ?


Dataset Weightgain:

  High protein:  134  146  104  119  124  161  107   83  113  129   97  123
  Low protein:    70  118  101   85  107  132   94

Average (g)
  High protein:  120
  Low protein:   101

The averages suggest differences.
However, the observed differences could have occurred by pure chance.

It would be of interest to know how likely such differences are to occur by pure
chance, i.e., if weight gains would be completely unrelated to protein intake.
If this is very unlikely, the above data provide evidence that weight gains really
depend on the diet.
If such differences are likely to occur by pure chance, then the above data do not
provide evidence that weight gains show any relation with protein intake.


3.4 Example: Survival times of cancer patients

Cameron and Pauling [4]; Hand et al. [2] p. 255


Patients with advanced cancer of the stomach, bronchus, colon, ovary, or breast
were treated (in addition to standard treatment) with ascorbate.
The outcome of interest is the survival time (days)
Research question:
Do survival times differ with organ affected ?


Dataset Cancer:

  Stomach:    124   42   25   45  412   51 1112   46  103  876  146  340  396
  Bronchus:    81  461   20  450  246  166   63   64  155  859  151  166   37  223  138   72  245
  Colon:      248  377  189 1843  180  537  519  455  406  365  942  776  372  163  101   20  283
  Ovary:     1234   89  201  356 2970  456
  Breast:    1235   24 1581 1166   40  727 3808  791 1804 3460  719

Average (days)
  Stomach:     286
  Bronchus:    211.6
  Colon:       457.4
  Ovary:       884.3
  Breast:     1395.9

The average survival times suggest differences.


However, these differences could have occurred by pure chance.
It would be of interest to know how likely such differences are to occur by pure
chance.
If this is very unlikely, the above data provide evidence that survival times really
differ with the organ affected.
If such differences are likely to occur by pure chance, then the above data do not
provide evidence that survival would differ with the organ affected.


3.5 Example: Captopril data

MacGregor et al. [5]; Hand et al. [2] p. 56


15 patients with hypertension
The response of interest is the supine blood pressure, before and after treatment
with CAPTOPRIL
Research question:

How does treatment affect BP ?


Dataset Captopril

              Before            After
  Patient   SBP    DBP        SBP    DBP
     1      210    130        201    125
     2      169    122        165    121
     3      187    124        166    121
     4      160    104        157    106
     5      167    112        147    101
     6      176    101        145     85
     7      185    121        168     98
     8      206    124        180    105
     9      173    115        147    103
    10      146    102        136     98
    11      174     98        151     90
    12      201    119        168     98
    13      198    106        179    110
    14      148    107        129    103
    15      154    100        131     82

Average (mm Hg)
  Diastolic before:  112.3
  Diastolic after:   103.1
  Systolic before:   176.9
  Systolic after:    158.0

It would be of interest to know how likely the observed changes in BP are to occur
by pure chance.
If this is very unlikely, the above data provide evidence that BP indeed decreases
after treatment with Captopril. Otherwise, the above data do not provide evidence
for efficacy of Captopril.

3.6 Example: Prevalence of severe colds in children

Bland [6] p. 246


Study about the prevalence of severe colds in 1319 Kent schoolchildren, measured
at the ages of 12 and 14
The response of interest is whether the child had severe colds during the last 12
months

                               Severe colds at 14 yrs.
                                Yes       No
  Severe colds     Yes          212      144       356
  at 12 yrs.       No           256      707       963
                                468      851      1319

Research question:
Is the prevalence of severe colds different at the two ages ?
At age 12, 356/1319 = 27% of the children reported severe colds.
At age 14, this percentage equals 468/1319 = 35%
These data suggest that the prevalence of severe colds increases with age.
It would be of interest to know how likely the observed change in prevalence is to
occur by pure chance.
If this is very unlikely, the above data provide evidence that the prevalence indeed
changes with age. Otherwise, the above data do not provide evidence for such a
change.

Note that the data structure is similar to the one in the Captopril data, in the
sense that subjects are measured twice at different time points:


3.7 Example: Surgery data

Robertson and Armitage [7]; Hand et al. [2] p. 100


Sometimes, a patient's BP needs to be lowered during surgery, using a hypotensive drug, which is administered continuously during the relevant phases of the operation.
Since the duration of these phases varies, so does the total amount of drug
administered
Patients also vary in the extent to which the drug succeeds in lowering the BP
The sooner the BP rises again to normal after the drug is discontinued, the better
Data on 53 patients, with 3 types of operation

Available measurements:

. Time (min.) before the patient's systolic BP returns to 100 mmHg

. The 10-base log(dose) of the drug in log(mg)


. The average systolic BP while the drug was being administered
Research question:

How is recovery time related to other two variables ?

Note that the potential relation between BP and log(dose) makes it difficult to disentangle their relative relations to the recovery time.


Dataset Surgery

Minor non-thoracic:
  log(dose)   BP   Time
     2.26     66     7
     1.81     52    10
     1.78     72    18
     1.54     67     4
     2.06     69    10
     1.74     71    13
     2.56     88    21
     2.29     68    12
     1.80     59     9
     2.32     73    65
     2.04     68    20
     1.88     58    31
     1.18     61    23
     2.08     68    22
     1.70     69    13
     1.74     55     9
     1.90     67    50
     1.79     67    12
     2.11     68    11
     1.72     59     8

Major non-thoracic:
  log(dose)   BP   Time
     1.74     68    26
     1.60     63    16
     2.15     65    23
     2.26     72     7
     1.65     58    11
     1.63     69     8
     2.40     70    14
     2.70     73    39
     1.90     56    28
     2.78     83    12
     2.27     67    60
     1.74     84    10
     2.62     68    60
     1.80     64    22
     1.81     60    21
     1.58     62    14
     2.41     76     4
     1.65     60    27
     2.24     60    26
     1.70     59    28

Thoracic:
  log(dose)   BP   Time
     2.45     84    15
     1.72     66     8
     2.37     68    46
     2.23     65    24
     1.92     69    12
     1.99     72    25
     1.99     63    45
     2.35     56    72
     1.80     70    25
     2.36     69    28
     1.59     60    10
     2.10     51    25
     1.80     61    44

3.8 Conclusion

The aim of statistics is twofold:

. Descriptive statistics: Summarizing and describing observed data such that


the relevant aspects are made explicit.
. Inferential statistics: Studying to what extent observed trends/effects can
be generalized to a general (infinite) population


Examples of descriptive statistics include tables, graphs, calculation of


averages,. . .
Valid inferential statistics requires a strong link between the sample and the
population about which one wishes to draw conclusions.
Valid inferential statistics requires:
. Correct statistical methodology

. Correct interpretation of results


Chapter 4
Population versus sample

. The population
. The (random) sample
. Statistics versus probability theory
. Types of studies
. Random samples → variability → uncertainty


4.1 Introduction

Observed data can always be considered as taken from some population.


The strength of the evidence in the data, and the validity of the conclusions based
on the data depends entirely on:
. the definition of the population
. the way the sample is drawn from the population


4.2 The population

In practice, the population of interest is defined through inclusion and exclusion


criteria
The inclusion criteria are the characteristics a subject/patient needs to have in
order to belong to the population
Examples of inclusion criteria:
. specific disease
. age range
The exclusion criteria are the characteristics a subject/patient is not allowed to
have in order to belong to the population


Examples of exclusion criteria:

. previous treatment for same disease


. pregnancy

It is important that objective criteria are used:

. Tumour must be beyond hope of surgical eradication


. An expected survival of at least 90 days

The population is not fixed, but changes constantly.


For example, interest is not only in today's patients with a specific condition, but
also in all patients in the (near) future
Therefore, the population is often (considered) infinite


4.3 The (random) sample

4.3.1 Example: Low back pain in nurses

Consider a study on risk factors for the prevalence of low back pain in nurses
Suppose that interest is in the population of all nurses in all Belgian hospitals
Data sets with the following characteristics would be problematic:
. Only female nurses

. Only nurses from university hospitals


. Only nurses aged 40


Observed effects/trends cannot be generalized to the entire population since one


cannot rule out that such effects would only occur in females, in university
hospitals, or in younger nurses.
Ideally, the sample should be a perfect reflection of the total population:
. Same proportion of males and females
. Same types of hospitals
. Same age distribution


4.3.2 Example

Suppose a study is designed to compare 2 antidepressants


Study participants can be obtained from a number of psychiatric hospitals
The so-obtained data set is not necessarily representative for the total population
of depressed people, as only those hospitalized are considered.
Ideally, the sample should be a perfect reflection of the total population of interest


4.3.3 The random sample

Obviously, a data set in which all subjects satisfy the in- and exclusion criteria of
the population will not necessarily allow generalizations of observed effects/trends
to the total population.
Ideally, the sample should be a perfect reflection of the total population of interest
This can only be realized by taking a completely random selection from the total
population
Imbalances for some variables then only occur in small samples, and by pure
chance.


[Figure: a random sample drawn from the population]

Taking a completely random sample is difficult in practice


Even if a completely random sample has been obtained, imbalances may still occur
due to various reasons:
. Selected subjects refuse to participate
. Subjects leave the study due to side effects
. ...


4.4 Statistics versus probability theory

4.4.1 Probability theory

Suppose it is known that a specific treatment is effective in 70% of the patients


receiving the treatment
This implies that the population consists of patients for whom the treatment is
not effective (30%) as well as patients for whom the treatment does have an
effect (70%)
If the treatment is administered to 100 randomly chosen patients, more than 70
may experience improvement, or less than 70


Question:
If 100 patients are given the treatment,
what is the probability that less than 60 of them
will experience an improvement ?
Probability theory aims at predicting the outcome of an experiment, knowing the
population
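As a hedged sketch of such a computation: if the 100 patients respond independently and the treatment is effective for 70% of patients, the number of improvements follows a binomial distribution, and the question asks for P(X < 60). The binomial model is the standard assumption for this situation; it is not named on the slide.

```python
# Sketch: P(fewer than 60 improvements among 100 independent patients),
# assuming a 70% improvement probability per patient.
from scipy.stats import binom

print(binom.cdf(59, n=100, p=0.7))   # P(X <= 59), roughly 0.01
```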


4.4.2 Statistics

Suppose one wants to investigate the effectiveness of a specific treatment in some


specific population
100 subjects, meeting the criteria of the population, receive the treatment
73 out of them experience considerable improvements
Question:
What is the efficacy rate in the total population ?
Statistics aims at drawing conclusions about the population, based on what
has been observed in the experiment


4.4.3 Conclusion

[Figure: statistics reasons from the random sample (73/100 treated patients improved) back to the efficacy rate in the population]

4.5 Types of studies

4.5.1 Introduction

There are various ways data can be collected


Which questions can be answered, and which ones cannot be answered, entirely
depends on how a specific data set arose
Also the strength of evidence depends on the method of data collection


4.5.2 Prospective versus retrospective study

Prospective: A group of people is followed for the occurrence or non-occurrence of specified endpoints or events or measurements
Examples:

. Treated patients are examined 1 month after the start of the treatment

. Cancer patients are followed after chemotherapy and the outcome of interest is the time until disease progression.

. Datasets weightgain, cancer, captopril

Retrospective: Subjects having a particular outcome or endpoint are identified and studied. Often, measurements from the past are of interest


Examples:

. A sample of subjects is questioned about their food intake during the last two days

. Cancer patients are questioned about potential exposure to pollution.

. Datasets on sickness absence, or cervical cancer


4.5.3 Experimental versus observational data

Experimental: The data are collected in a newly designed and conducted experiment
Examples:

. Rats are treated and measured afterwards, at pre-specified time points

. Cancer patients are followed after chemotherapy and the outcome of interest is the time until disease progression.

. Datasets Captopril, weightgain, cancer

Observational: The data are collected on a routine basis, and no new experiment is set up


Examples:

. Hospital records on all patients treated


. Data collected during yearly medical check-ups
. Dataset on sickness absence

It is often not clear from which population observational data can be believed to be sampled.
For example, consider observational data collected on a routine basis on all patients treated in a university hospital
For example, consider the data collected by an occupational health service, on a routine basis, during a specific year


Contrasting the percentages of observed subjects in the various occupational classes with those in the total Flemish population shows that the data cannot be believed to be a random sample from the Flemish population:
Data on females

  Occupational sector                              Collected data (%)   Flemish population (%)
  1. Agriculture / fisheries                              1.1                   2.7
  2. Energy / water                                       0.2                   0.5
  3. Minerals / chemistry                                 0.7                   2.2
  4. Metal / mechanical and optical industry              2.4                   4.8
  5. Other industry                                      10.7                  10.2
  6. Construction                                         0.6                   1.0
  7. Commerce / hotel / restaurant / bar                 16.4                  21.5
  8. Transportation / communication                       1.0                   3.6
  9. Bank / insurance / services w.r.t. companies         6.0                   8.3
  10. Other services                                     60.9                  45.2
  Missing                                                 6.1

4.5.4 Cross-sectional versus longitudinal study

Cross-sectional: Study participants are measured once, at a fixed (pre-specified)


moment in time
Examples:

. Study where blood pressure is related to subject characteristics such as bodyweight, food intake, living habits, . . .

. Study where the height of children at the age of 12 is related to the height of the parents
. Datasets on sickness absence, or cervical cancer
Longitudinal: The outcome of interest is measured repeatedly over time
Example: Captopril and Severe cold data

4.5.5 Clinical trial

A rigorously designed experiment aiming at finding the best treatment for future patients in a specific condition
All aspects are pre-specified in the study protocol
Typically, a group of patients is randomly allocated to one of a number of treatments, after
which the outcome(s) of interest are measured
These are the only studies that are accepted
by regulatory agencies that approve marketing
of treatments
The random allocation allows causal interpretation of observed treatment effects (see later)

Clinical trials are always prospective


Clinical trials are always experimental
Clinical trials can be of a longitudinal or a cross-sectional nature.


4.5.6 Cohort study

A well-defined group of subjects is followed over time, usually until a specific event happens.
Examples:

. Students who graduated in 2005


. All patients who received surgery with a specific technique, during a specific
period of time.

. Dataset cancer, dataset on severe colds


A birth cohort is a cohort of people all born in the same period, and is often used
to exclude effects from the fact that subjects lived in different periods.


For example, 40-year olds can be different from 20-year olds, just because the first 20 years of their lives were lived under completely different circumstances
In 20 years, the 20-year olds of today will not necessarily be equal to the 40-year olds of today.


4.5.7 Case-control study

Suppose we want to investigate the relation between smoking and lung cancer
One may select a group of smokers and a group of non-smokers, and follow them
for a (long) period of time
The outcome of interest is the incidence of lung cancer.
A potential dataset is:

                          Lung cancer
  Smoking status          Yes       No
  Yes                      42      203       245
  No                        7      114       121
                           49      317       366

Such a study would take a very long time to conduct, as one has to wait until a sufficiently high number of cancer cases has been observed.
One therefore often conducts a case-control study, in which a number of cases (cancers) and controls (non-cancers) are selected, who are questioned about their smoking behaviour in the past.
This potentially may lead to the same data.
Note that, since the number of sampled cases and controls is pre-defined, such a study design does not allow the estimation of the prevalence of lung cancer.
However, the case-control study does allow one to study the relation between risk factors and the prevalence of some disease (see later).


4.5.8 Matched case-control study

Suppose a case-control study is conducted to study the relation between smoking


behaviour and the incidence of lung cancer.
Suppose also that a strong relation is observed.
However, how should this association be interpreted if the smokers are much older
than the non-smokers, and/or if the group of smokers contains many more males
than the group of non-smokers ?
Maybe, the observed relation is indirectly induced by the differences in age and
gender.
Matched case-control studies make it possible to guarantee that cases and controls are exactly the same with respect to some important subject-specific characteristics.


For example, matching for age and gender can be done as follows:
. Sample the required number of cases

. For each case, select a control with the same age and gender as the case
In some situations, one may want to match multiple controls to each case, or
multiple cases to each control.
Ideally, one would like to match for as many factors as possible.
However, matching on too many factors complicates the search for appropriate
controls.


4.6 Random samples → variability → uncertainty

A sample needs to be taken randomly such that it well represents the total
population. Only then, valid conclusions can be drawn
Note however, that different random samples will include different subjects, with
different observations
Hence, each new random sample or, equivalently, each new experiment will lead to
(slightly) different conclusions, implying that, sometimes, wrong conclusions will
be drawn
Note that absolute certainty cannot be expected as conclusions are based on only
a small part (the sample) from the total, infinitely large, population.


Conclusion:

Statistics can prove everything

Statistics helps to:

. quantify the errors


. control the errors


Chapter 5
Causality and randomization

. Causal effects
. Methods of randomization
. Randomization not always possible


5.1 Causal effects

Suppose an experiment is set up to compare homeopathy (H) with placebo (P).


Two groups of patients are selected. One receives H, the other receives P.
(Double) blinding is necessary:

. Believers may overestimate the effect of H


. Non-believers may underestimate the effect of H

An observed difference between H and P does not necessarily imply H is (more)


effective, not even under double blinding


What if:

. H-group contains more females ?


. H-group is younger ?
. H-group contains better patients ?

Any difference between both groups, other than the treatment, may explain the
observed difference in efficacy.
In such cases a difference between H and P should not automatically be ascribed
to the treatment.
In general, an observed effect is not necessarily a causal effect in the sense that
the difference in treatment can be interpreted as the cause of the observed
difference in response.


The only way to assure treatment groups are comparable with respect to all
known and unknown factors is to assign treatments to subjects in a completely
random way.
This is randomization
Groups then only differ with respect to
the treatments they received
Small imbalances can occur by pure
chance, in small studies


Randomization is required whenever causal relations are to be shown:

Cause ⇒ Effect

5.2 Methods of randomization

5.2.1 Simple randomization

. Throwing coins or dice, spinning wheels, . . .

. Random number generators

Pre-generated lists should not be made available in advance, in order to guarantee blinding

5.2.2 Block randomization

Simple randomization usually does not lead to equal group sizes


Equal numbers of subjects in all groups can be obtained by simple randomization
as long as the required numbers have not been reached.
Once a group contains sufficient subjects, randomization is done over the
remaining groups only.
There then is a tendency for the last few subjects all to be in the same group,
implying that the assignment for the last subjects is not completely unpredictable
With block randomization, subjects are put in small equal-sized groups and,
within each block, equal numbers are allocated to the groups.
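A minimal sketch of such permuted-block randomization, for two treatments A and B and a block size of 4; the block size, treatment labels, and seed are illustrative choices only.

```python
# Sketch: permuted-block randomization for two treatments A and B,
# block size 4 (two A's and two B's per block, in random order).
import random

def block_randomization(n_subjects, block=("A", "A", "B", "B"), seed=1):
    random.seed(seed)
    allocation = []
    while len(allocation) < n_subjects:
        permuted = list(block)
        random.shuffle(permuted)          # random order within each block
        allocation.extend(permuted)
    return allocation[:n_subjects]

print(block_randomization(10))            # e.g. ['A', 'B', 'B', 'A', ...]
```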


Block randomization also implies approximately equal numbers at each moment


during the study.
This is very useful in situations where a training effect of the physician is to be
expected.


5.2.3 Stratified randomization

With relatively small studies, (serious) imbalance can be obtained by pure chance.
Stratified randomization can be used to ensure complete balance, at least with
respect to some measured important prognostic factors.
For example, suppose gender and age are believed to be strongly related to the
outcome of interest.
Ideally, the two treatment groups would have exactly the same age and gender
distribution.
This can be realized by using separate (block) randomization for each combination
of age with gender.
Hence, separate randomization lists are to be constructed for each combination of
age with gender.

In practice, one often would like to stratify for as many factors as possible.
However, stratification on too many factors may lead to many incomplete blocks
implying that the balance hoped for cannot be realized
Some extreme versions of stratified randomization are:

. Twin studies: both subjects are assigned randomly to the two treatments

. Cross-over studies: subjects receive all treatments but in a random order


. Pre-test post-test studies: all subjects are measured before as well as after
the treatment
(e.g., Captopril data, severe cold data)
. Both hands, feet, treated with different treatments, assigned at random.


5.3 Randomization is not always possible

5.3.1 Example 1

Suppose one wants to study the effect of chemotherapy in women, on the unborn
baby and its evolution after birth.
Ideally, one would randomize pregnant women into two groups
. Group 1: receives chemotherapy

. Group 2: no chemotherapy (placebo)


Obviously, this is ethically not possible.


In practice, for each pregnant woman getting chemotherapy, another pregnant


woman is searched for, who does not get chemotherapy, but who is comparable for
the most important known prognostic factors.
Often, the controls are taken from an earlier collected data set, and are therefore
called historical controls.
After birth, the children are followed and the outcomes of interest are measured
(e.g., IQ level at the age of 5), and the association with chemotherapy can be
studied.
Note that associations detected, should not be interpreted as causal


5.3.2 Example 2

Suppose one wants to study the relation between smoking and lung cancer.
Ideally, one would randomly subdivide subjects into two groups:

. Group 1: subjects have to smoke many cigarettes, daily, during many years.

. Group 2: subjects are not allowed to smoke


Obviously, this is ethically not possible.
As in the previous example, one could select a group of smokers, and a
comparable group of non-smokers
However, in order to be able to measure the occurrence of lung cancer in all these subjects, one would have to wait many years.


In such studies, one will select a group of cancer cases, and a comparable group of
non-cancer cases.
All subjects are questioned about their smoking behaviour in the past.
This still allows one to study the association between smoking and the occurrence of lung cancer
This is an example of a case-control study
Note that associations detected, should not be interpreted as causal


5.3.3 Implications

Imbalance with respect to some important prognostic factors cannot be ruled out
Imbalances with respect to measured known factors can be corrected for by
appropriate statistical techniques
However, as one cannot correct for the imbalance with respect to unknown or
unmeasured factors, causality can still not be concluded from such analyses.
For example, one will never be able to show any causality in the relation between
smoking and lung cancer.


Part III
Describing and summarizing data


Chapter 6
Types of outcomes

. Qualitative data
. Quantitative data


6.1 Qualitative data

Qualitative variables are not characterized by a numerical value


Further subdivision:

. Dichotomous: only 2 possible values: gender, survival


. Nominal: no ordering: color of hair, cause of death
. Ordinal: ordered outcome values: pain score (never to always)


6.2 Quantitative data

Quantitative: variables have values that are intrinsically numerical


Further subdivision:

. Discrete: the possible values are distinct and separated


. Examples: Number of particles emitted by a radio-active source, heart rate

. Continuous: values are within a continuous, uninterrupted range


. Examples: height, age, blood pressure
Note that continuous variables are always measured in a discrete way
Discrete variables with many possible values are often treated as continuous


Chapter 7
Graphical presentation of data

. One variable
. Multiple variables


7.1 Graphs of single qualitative variables

Bar plot or pie chart:


7.2 Graphs of single quantitative variables

Histogram:

Note that the choice of the intervals is crucial



Box (-Whiskers) plot:


7.3 Graphs of multiple qualitative variables

Categorized bar plot (similar for pie chart):


7.4 Graphs of multiple quantitative variables

Scatterplot:


Scatterplot with box plots (also with histograms)


Scatterplot matrix (also with box plots):


7.5 Graphs of mixed quantitative / qualitative variables

Categorized box plots:


Categorized histograms:


Bubble plot:


Chapter 8
Summary statistics

. Introduction
. Measures of location
. Measures of spread
. Percentages
. Geometric mean and standard deviation
. Missing data
. Graphical representation
. Examples from the biomedical literature


8.1 Introduction

[Figure: three distributions A, B, and C]

A and B have the same location but different spread
A and C have the same spread but different location

8.2 Measures of location

Location measures:
Where are the observations more or less located ?
As an example, consider the small sample:

1, 3, 3, 4, 5, 14

Sample average (sample mean):

x̄ = (x1 + x2 + x3 + x4 + x5 + x6)/6 = (1 + 3 + 3 + 4 + 5 + 14)/6
  = (x1 + . . . + xn)/n = (1/n) Σ_{i=1}^{n} xi = 5

The sample median is the middle observation:

1, 3, |3, 4|, 5, 14  →  median = (3 + 4)/2 = 3.5

The sample mode is the value that was observed most often:

1, 3, 3, 4, 5, 14  →  mode = 3
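The three location measures for this small sample can be checked with a few lines of code; a minimal sketch:

```python
# Sketch: location measures for the sample 1, 3, 3, 4, 5, 14
import statistics

sample = [1, 3, 3, 4, 5, 14]
print(statistics.mean(sample))    # 5
print(statistics.median(sample))  # 3.5 (average of the two middle values)
print(statistics.mode(sample))    # 3 (the most frequent value)
```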

Note that the sample average is very sensitive to outliers:

1, 3, 3, 4, 5, 14  →  5
1, 3, 3, 4, 5, 20  →  6
1, 3, 3, 4, 5, 26  →  7

This is not the case with the sample median:

1, 3, 3, 4, 5, 14  →  3.5
1, 3, 3, 4, 5, 20  →  3.5
1, 3, 3, 4, 5, 26  →  3.5

The mode is not always informative:

[Figure: a distribution for which the mode is not a useful location measure]

For symmetric data, the average and the median are the same. In general, they are not:

[Figure: a symmetric distribution (median = mean) and a skewed distribution (median and mean differ)]

With skewed data, the mean can be heavily influenced by the random presence of
a/some extreme observation(s).
In order to still get a good idea about the location of the data, one then prefers
the use of the median over the mean:

Symmetric data ⇒ Mean
Skewed data ⇒ Median

8.3 Measures of spread

Obviously, a measure of location only summarizes one specific aspect of the observed data:
Statistician drowning in a lake of average depth 0.5 m

Measures of spread:
How similar are the observations ?

[Figure: two samples with the same location but different amounts of spread]

As an example, re-consider the small sample:

1, 3, 3, 4, 5, 14

Mean deviation from the mean:

(1/n) Σ_{i=1}^{n} (xi − x̄) = (−4 − 2 − 2 − 1 + 0 + 9)/6 = 0/6 = 0

Mean quadratic deviation from the mean:

(1/n) Σ_{i=1}^{n} (xi − x̄)² = ((−4)² + (−2)² + (−2)² + (−1)² + 0² + 9²)/6 = 106/6 = 17.67

Sample variance:

s² = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)²
   = ((−4)² + (−2)² + (−2)² + (−1)² + 0² + 9²)/5 = 106/5 = 21.2

Note that the units of the sample variance and the mean quadratic deviation are the squared units of the original observations
The sample standard deviation is in the same units as the original observations:

s = √[ (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)² ] = √21.2 = 4.60

Sample range:

R = max_i xi − min_i xi = 14 − 1 = 13

Note that the range strongly depends on the sample size n: larger samples are more likely to contain extreme observations, hence are more likely to have a larger range
Since we hope that our measure of spread reflects the amount of variation in the population, we prefer a measure that does not depend on the sample size.
The sample interquartile range is the range obtained after deletion of the 25% highest and 25% lowest values in the sample:

1, 3, 3, 4, 5, 14  →  3, 3, 4, 5  →  IQR = 5 − 3 = 2
The interquartile range does not depend on the sample size n, since a larger
number of observations is deleted in larger samples.
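The spread measures above can be verified with a short sketch; the interquartile range is computed here by simply dropping the lowest and highest 25% of the observations, exactly as in the definition above.

```python
# Sketch: spread measures for the sample 1, 3, 3, 4, 5, 14
import statistics

sample = sorted([1, 3, 3, 4, 5, 14])
n = len(sample)

print(statistics.variance(sample))       # 21.2  (divides by n - 1)
print(statistics.stdev(sample))          # about 4.60
print(max(sample) - min(sample))         # range: 13

trimmed = sample[n // 4 : n - n // 4]    # drop lowest 25% and highest 25%
print(max(trimmed) - min(trimmed))       # IQR as defined above: 5 - 3 = 2
```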
The variance (hence also the mean quadratic deviation and the standard deviation) and the range are very sensitive to outliers:

1, 3, 3, 4, 5, 14  →  s² = 21.2,  R = 13
1, 3, 3, 4, 5, 20  →  s² = 48.8,  R = 19
1, 3, 3, 4, 5, 26  →  s² = 88.4,  R = 28

This is not the case with the interquartile range:

1, 3, 3, 4, 5, 14  →  IQR = 2
1, 3, 3, 4, 5, 20  →  IQR = 2
1, 3, 3, 4, 5, 26  →  IQR = 2

With skewed data, the standard deviation can be heavily influenced by the random
presence of a/some extreme observation(s).
In order to still get a good idea about the variation in the data, one then prefers
the use of the interquartile range over the standard deviation:
Symmetric data ⇒ Standard deviation
Skewed data ⇒ IQR

8.4 Percentages

Traditionally, measurements are summarized by a measure of location and a


measure of spread
However, suppose the variable of interest is sickness absence
For each subject i in the sample, we define xi as:

xi = 1 if subject i was absent due to illness, and xi = 0 otherwise

The sample average equals

x̄ = (x1 + x2 + . . . + xn)/n = (number of people with sickness absence)/n

Hence, the average equals the observed proportion (percentage) of people with
sickness absence
Note that, once the average is known, the original observations are known, hence
also the variability:
[Figure: three samples of 0/1 observations, with averages x̄ = 0.5, x̄ = 0.16, and x̄ = 0.84]

One can show that the variance is obtained as

s² = (n/(n − 1)) · x̄ · (1 − x̄)

Since the variance follows directly from the average, only the mean is reported, and no measure of spread
In general, measures of location and spread are only used for quantitative
(continuous) variables.
Other variables are described by observed frequencies and percentages.
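A quick numerical check of this identity, on a small illustrative 0/1 sample:

```python
# Sketch: for 0/1 data, the sample variance equals n/(n-1) * mean * (1 - mean)
import statistics

x = [1, 0, 1, 0, 0, 1]                   # illustrative 0/1 observations
n = len(x)
xbar = statistics.mean(x)                # 0.5

print(statistics.variance(x))            # 0.3
print(n / (n - 1) * xbar * (1 - xbar))   # 0.3 as well
```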


For example, the variables sickness absence and cancer type could be summarized as follows:

  Variable                      (n = 256)
  Sickness:      Yes            103 (40.23%)
                 No             153 (59.77%)

  Cancer type:   Breast          79 (30.86%)
                 Stomach         26 (10.16%)
                 Bronchus        83 (32.42%)
                 Colon           58 (22.66%)
                 Ovary           10 (3.90%)

8.5 Geometric mean and standard deviation

The mean and standard deviation are used to describe symmetric data
In case of skewness, alternatives such as median and IQR are used
An alternative is to transform the original data such that the transformed
observations are symmetric
A special, frequently occurring, case is when symmetry is obtained using a
logarithmic transformation
As an example, we consider the survival times of cancer patients, and we restrict
to the patients with stomach cancer


Summary statistics:

However, the histogram of the observations suggests skewness:


Often, skewness in the direction of the large values can be solved with a
logarithmic transformation:
X = survival time  →  Y = ln(X) = ln(survival time)

  Stomach
     X     Y = ln(X)
   124       4.82
    42       3.74
    25       3.22
    45       3.81
   412       6.02
    51       3.93
  1112       7.01
    46       3.83
   103       4.63
   876       6.78
   146       4.98
   340       5.83
   396       5.98

Assessing symmetry is difficult due to the small number of observations. However, the evidence against symmetry is much weaker now, and use of the mean and standard deviation seems justified for the description of the Y-values.
Often this mean and standard deviation are back-transformed to the original units, leading to the geometric mean and standard deviation:

  Outcome                  Stomach cancer, mean (stand. dev.)
  Survival time (days)     144.03 (3.49), i.e., geometric mean = exp(4.97) and geometric standard deviation = exp(1.25)

These geometric means and standard deviations are very different from the arithmetic mean and standard deviation that were reported before (arithmetic mean: 286 days).
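A sketch of this back-transformation for the 13 stomach-cancer survival times listed above:

```python
# Sketch: geometric mean and geometric standard deviation of the
# stomach-cancer survival times (back-transformed mean and SD of ln(X)).
import math
import statistics

x = [124, 42, 25, 45, 412, 51, 1112, 46, 103, 876, 146, 340, 396]
y = [math.log(v) for v in x]

print(math.exp(statistics.mean(y)))    # geometric mean, about 144
print(math.exp(statistics.stdev(y)))   # geometric standard deviation, about 3.5
print(statistics.mean(x))              # arithmetic mean: 286
```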


8.6 Missing data

Sometimes, not all observations are available
For example, for some of the subjects in the sample, sickness absence was not measured.
Summarizing this variable can be done in two ways:
  Variable              (n = 203)
  Sickness:   Yes       103 (50.74%)
              No        100 (49.26%)

  Variable              (n = 256)
  Sickness:   Yes       103 (40.23%)
              No        100 (39.07%)
              Missing    53 (20.70%)

Not accounting for missing observations results in misleading summary statistics



8.7 Graphical representation of summary statistics

Boxplots (with 1%, 25%, median, 75%, and 99% percentiles)


Means and standard deviations


8.8 Examples from the biomedical literature

Boushey et al. [8], Figure 2:

. Bar plot
. Categorized for 3 groups


Marlow et al. [9], Figure 1:

. Scatterplot
. Jittered to avoid overlapping symbols
. Categorized for 4 × 2 groups
. Lines are probably means


Wong et al. [10], Table 1 (first part):

. Means and standard deviations


. Medians and IQRs
. Percentages


Blanchon et al. [11], Table 1 (parts):

. Categorization of continuous
variables
. Explicit acknowledgement of
missing values


Kellett, Kellett, and Nordholm [12], Table 2:

. Means and standard deviations


. Variables are NOT symmetrically distributed


Wu [13], Figure 2:

. Averages for 3 × 3 combinations of box size and frequency of lifting
. No indication of variability


Two completely different hypothetical scenarios for variability:


Nawrot et al. [14], Table 1:

. Geometric means
. IQR instead of geometric
standard deviations


Part IV
Basic concepts of statistical inference


Chapter 9
Describing the population

. Stochastic variable
. Discrete probability distribution
. Continuous probability distribution
. Summary characteristics for probability distributions
. The normal distribution


9.1 Stochastic variable

Suppose a random sample of size n = 321 from a specific population is available,


and interest is in the outcome BMI
The outcome variable BMI is often denoted as X, while the n = 321 observations
are usually denoted as
x1 , x2 , . . . , x321
The variable X is a random or stochastic variable since the value that it takes
is subject to chance
Indeed, if one randomly selects one subject from the population, the BMI of that
subject cannot be predicted, and entirely depends on which subject has been
selected


At most, one can say that, e.g., it is more likely that this subject will have a BMI
between 20 and 25 than a BMI larger than 35
So, the realized value of X depends on random variability
Our sample x1, x2, . . . , x321 can be considered as n = 321 realizations of the same random variable X, one for each subject i, i = 1, 2, . . . , 321.
Drawing the sample can be viewed as performing n = 321 small experiments, each time selecting one subject and measuring this subject's BMI, leading to the realized value xi of X
How likely it is to observe certain values or certain ranges of values is described by
the probability distribution
Similar to the classification of observations, one can classify random variables as
qualitative, quantitative, discrete, continuous,. . .

Other examples:

  Experiment                 Random variable                          Type of variable
  Selecting one Belgian      . Weight                                 . Quantitative, continuous
                             . Height                                 . Quantitative, continuous
                             . Gender                                 . Qualitative, dichotomous
  Throwing a die             . Number of throws until first 6         . Quantitative, discrete
                             . Number of times a 6 was thrown,
                               out of 10 trials                       . Quantitative, discrete
  Selecting n = 321 people   . Percentage of women                    . Quantitative, disc./cont. ?
                             . Average age                            . Quantitative, continuous
                             . Number of cancer cases                 . Quantitative, disc./cont. ?

9.2 Discrete probability distribution

A discrete probability distribution describes how likely it is to observe specific


values for a discrete random variable.
Suppose X is the random variable sickness absence:

X = 1 if absence due to illness, and X = 0 otherwise

X can only take the values 0 and 1


The probability distribution of X describes the probability of observing a 0 or a 1,
respectively


These probabilities are the percentages of 0s and 1s one would observe if the experiment were repeated over and over again.
Hence, we need to describe the observations one would observe in an experiment of size n = +∞
We will do this in exactly the same way as discrete observations were described before, i.e., using the bar plot:

[Figure: bar plot with bars π0 = P(X = 0) and π1 = P(X = 1) for sickness absence X, with π0 + π1 = 1]

π0 is the probability of observing a 0, P(X = 0)
π0 is the proportion of 0s one would observe in a sample of size n = +∞
π1 is the probability of observing a 1, P(X = 1)
π1 is the proportion of 1s one would observe in a sample of size n = +∞
A discrete distribution for a discrete random variable X describes what values X can have, and what the associated probabilities are to observe those values.

For example, if the experiment is to throw a die, and X is the result of one throw,
then the probability distribution of X is given by:

     xi :                 1     2     3     4     5     6
     πi = P (X = xi):    1/6   1/6   1/6   1/6   1/6   1/6

Graphically:

[Figure: bar plot with π1 = π2 = π3 = π4 = π5 = π6 = 1/6, for the result of throwing a die (X)]


Introduction to Biostatistics

159

For example, suppose there is equal probability for a newborn to be male or
female. Let X be the number of boys in a family of 5 children, then the
probability distribution of X is given by:

     xi :                   0        1        2        3        4        5
     πi = P (X = xi):    0.0312   0.1563   0.3125   0.3125   0.1563   0.0312

Graphically:

[Figure: bar plot of these probabilities, for the number of boys (X)]


Introduction to Biostatistics

160

Many frequently used distributions are given a name:


. Sickness absence example: Bernoulli distribution
. Die example: Multinomial distribution
. Gender example: Binomial distribution
Other frequently used discrete distributions are the Poisson, the geometric, the
hypergeometric, the beta-binomial, the negative binomial, . . . , distributions.
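As a quick numerical check of the "number of boys" table above, the binomial probabilities can be reproduced directly. This is only an illustrative sketch; it assumes Python with scipy, which is not part of the course material:

    from scipy.stats import binom

    n, p = 5, 0.5                      # 5 children, equal probability of a boy
    for k in range(6):                 # possible numbers of boys: 0, 1, ..., 5
        print(k, round(binom.pmf(k, n, p), 4))
    # prints 0.0312, 0.1562, 0.3125, 0.3125, 0.1562, 0.0312 (the table values, up to rounding)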

Introduction to Biostatistics

161

9.3

Continuous probability distribution

A continuous probability distribution describes how likely it is that a continuous


random variable takes values within certain ranges
Suppose X is the random variable BMI
How can we describe which ranges of values of X are likely to be observed, and
which ranges of values are less likely to be observed ?
For discrete variables this was done by generalizing the bar plot to an infinitely
large sample (= the population)
The same idea is now used for continuous variables: How can the histogram be
generalized to an infinitely large sample (= the population) ?

Introduction to Biostatistics

162

To study this, we draw samples from our population, and study the behaviour of
the histogram of BMI, when the sample size increases
Six samples will be drawn, with sample sizes:
. n = 10
. n = 20
. n = 50
. n = 100
. n = 500
. n = 5000
For each sample, the histogram of the observed BMI values is constructed
We will use histograms with interval width equal to 1
Introduction to Biostatistics

163

For samples of size n = 10 and n = 20:

Introduction to Biostatistics

164

For samples of size n = 50 and n = 100:

Introduction to Biostatistics

165

For samples of size n = 500 and n = 5000:

Obviously, the obtained histogram becomes smoother as the sample size increases

Introduction to Biostatistics

166

Eventually, the histogram becomes a smooth function, f (x), called the density
function of the random variable X:

Introduction to Biostatistics

167

Since we started from histograms with interval width equal to 1, all histograms had
the property that the total surface of a bar represented the proportion of
observations in the corresponding interval
This property is now carried over to the density function:

The probability of observing a value for X between a and b equals the surface
below f , between a and b:

[Figure: density curve f (x), with the shaded area between a and b representing P (a ≤ X ≤ b)]

Introduction to Biostatistics
168

This also implies that the total surface below f is equal to 1


Note that f completely defines what values are possible for X and how likely
these values are to occur.
Hence, f has the same properties as the discrete distributions discussed before
Many continuous distributions exist, all defined by a specific density function
Moreover, every positive function f with total surface below the function equal to
1 defines a probability distribution
The calculation of probabilities requires computation of surfaces below f , hence of
integrals.
This can be done using tables, or computer packages.
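For instance, such a surface below f can be computed by numerical integration. A minimal sketch, assuming Python with numpy/scipy (any statistical table or package works equally well):

    import numpy as np
    from scipy.integrate import quad

    def f(x):
        # density of a standard normal distribution, used here as an example of a density f
        return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

    a, b = -1.0, 1.0
    prob, _ = quad(f, a, b)        # numerical integral of f between a and b
    print(round(prob, 4))          # approximately 0.6827 = P(a <= X <= b)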
Introduction to Biostatistics

169

For example, the package StaTable can be freely downloaded from:

http://www.cytel.com/Products/StaTable/

Some frequently used continuous distributions are the normal, the t, the
chi-squared (χ²), the F , the gamma, the beta, . . . , distribution.
Some lend their name to specific statistical techniques, such as t-tests, F -tests,
chi-squared tests, . . .

Introduction to Biostatistics

170

9.4

Summary characteristics for probability distributions

The probability distribution can be viewed as an extension of the bar plot and the
histogram to the total population, or equivalently, an infinite sample
It describes how likely specific values are to be observed when randomly drawing
from the population
Similarly, we can now define measures of location and spread for the total
population.
These are the measures of location and spread one would observe if the total
population would be measured, i.e., in an infinite sample

Introduction to Biostatistics

171

These measures are usually denoted with Greek letters, e.g.,

. population average: μ
. population variance: σ²

Note that, similarly to the probability distribution, the population versions of the
measures of location and spread are theoretical concepts, as one will never
observe them, or measure them.
Indeed, in practice, one only observes a finite sample, from which it is possible to
calculate the sample-based versions, such as the sample average x and the sample
variance s²:

                    Population              Sample
                 (never observable)       (observable)

  Location:             μ                      x
  Spread:               σ²                     s²

Introduction to Biostatistics
172

9.5

The normal distribution

The most frequently used distribution in statistics is the Normal or Gaussian


distribution
It has density function f (x) equal to

     f (x) = 1/√(2πσ²) · exp( -(x - μ)² / (2σ²) )

It depends on two parameters μ ∈ IR and σ² > 0, which are the (population)
mean and variance.
If a random variable X is normally distributed with mean μ and variance σ², this
is denoted as X ∼ N (μ, σ²)
Introduction to Biostatistics

173

Always symmetric around the mean μ

For μ = 0 and σ² = 1, we have the standard normal distribution:
X ∼ N (0, 1)

Any normal can be transformed into a standard normal, and vice versa:

     X ∼ N (μ, σ²)  =⇒  (X - μ)/σ ∼ N (0, 1)

     X ∼ N (0, 1)   =⇒  μ + σX ∼ N (μ, σ²)

Introduction to Biostatistics

174

P (μ - σ ≤ X ≤ μ + σ)          = P (-1 ≤ (X - μ)/σ ≤ 1)         = 68.27%

P (μ - 1.96σ ≤ X ≤ μ + 1.96σ)  = P (-1.96 ≤ (X - μ)/σ ≤ 1.96)   = 95%

P (μ - 2σ ≤ X ≤ μ + 2σ)        = P (-2 ≤ (X - μ)/σ ≤ 2)         = 95.45%

P (μ - 3σ ≤ X ≤ μ + 3σ)        = P (-3 ≤ (X - μ)/σ ≤ 3)         = 99.73%
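These four probabilities can be verified numerically with the standard normal distribution. A minimal sketch, assuming Python with scipy (a statistical table gives the same values):

    from scipy.stats import norm

    for z in [1, 1.96, 2, 3]:
        prob = norm.cdf(z) - norm.cdf(-z)     # P(mu - z*sigma <= X <= mu + z*sigma)
        print(z, f"{100 * prob:.2f}%")
    # prints 68.27%, 95.00%, 95.45%, 99.73%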

Introduction to Biostatistics

175

The normal distribution is very popular because:

. Many stochastic processes follow normal distributions, or can be well
  approximated by normals
. Statistical theory shows that the normal is often a reasonable approximation
. The parameters in the normal distribution have a natural interpretation,
  since they represent the mean and the variance in the population.

Introduction to Biostatistics

176

Chapter 10
From the population to the sample, and back to the
population

. From the population to the sample


. From the sample to the population
. Example
. Normal values

Introduction to Biostatistics

177

10.1

From the population to the sample

The probability distribution describes how likely specific values are to be observed
when randomly drawing from the population
Also, the probability distribution summarizes how the data in an infinitely large
sample would be distributed.
Hence, when a sufficiently large random sample is drawn from that population,
one expects the observed histogram to be close to the probability distribution.
This is probability theory

Introduction to Biostatistics

178

[Diagram: POPULATION with arrows to several SAMPLEs, labelled RANDOM, RANDOM, and NOT RANDOM]

Introduction to Biostatistics

179

10.2

From the sample to the population

In statistics, the observations in the sample are used to learn about the
population.
Obviously, in order for the sample to tell us something about the population, the
sample needs to be drawn randomly
This procedure, in which information from the sample is used to draw conclusions
about the population, is called statistical inference or estimation

Introduction to Biostatistics

180

[Diagram: SAMPLE → POPULATION, labelled STATISTICAL INFERENCE AND ESTIMATION]
Introduction to Biostatistics

181

Introduction to Biostatistics

182

[Diagram: POPULATION (distribution of X) and random SAMPLE (histogram of the xi); inference and estimation go from the histogram back to the (unknown) distribution of X]

Introduction to Biostatistics
183

10.3

Example: BMI

Suppose a random sample of n = 2605 Belgian males is available, and the


outcome X of interest is the body mass index (BMI).
Suppose interest is in estimating the percentage of people with overweight
(BMI> 25) and with obesity (BMI> 30) respectively.
Summary statistics are:

One way to proceed is to estimate the distribution of X, from which probabilities


can be calculated
Introduction to Biostatistics

184

Histogram of observed values:

Obviously, BMI is not symmetrically distributed.


Due to the flexibility of the normal distribution, one often approximates
non-normal distributions with a normal one, after appropriate transformation.
Introduction to Biostatistics

185

Which transformation(s) is/are appropriate depends on the shape of the original


histogram
In case of skewness with a tail towards large values, a transformation with
negative curvature is needed, such that large values are transformed towards the
smaller ones, e.g., ln(x), √x, . . .
In case of skewness with a tail towards small values, a transformation with positive
curvature is needed, such that large values are transformed away from the smaller
ones, e.g., exp(x), x², . . .
Sometimes, x needs to be rescaled or shifted before applying any of the above
transformations, e.g., ln(1 + x) in case x = 0 is possible
One should always check whether the transformed values are approximately
normally distributed

Introduction to Biostatistics

186

[Figure: histogram shapes with corresponding possible transformations]

Introduction to Biostatistics

187

For our BMI example, a logarithmic transformation proves helpful:

We can now approximate the distribution of Y = ln(X) by a normal one.

Which normal distribution N (μ, σ²) ?

Introduction to Biostatistics

188

We need to decide what values for μ and σ² will be used.

μ and σ² are the unknown mean and variance of the distribution of Y , i.e., these
are the average and variance, respectively, one would observe in an infinitely large
sample of observations for Y .
We therefore estimate these parameters based on the observed values in the
sample, assuming that the sample was large enough to yield a sufficiently good
approximation for μ and σ²
The summary statistics for the log-BMI observations are:

Introduction to Biostatistics

189

Hence, our estimates for μ and σ² will be 3.21 and 0.15²

If the resulting normal distribution N (3.21, 0.15²) is close to the true one, we
expect the histogram of our random sample of Y values to be close to the normal
density.
This can be graphically checked by adding the density of this normal distribution
to our histogram of Y values:

Introduction to Biostatistics

190

From now on, Y = ln(BMI) will be assumed normally distributed, with mean 3.21
and variance 0.15², and probabilities of interest can be calculated
For example, the proportion of males in the population with overweight
(BMI > 25) equals:

     P (X > 25) = P (Y > ln(25)) = P ( N (3.21, 0.15²) > 3.22 ) = 0.4734

Similarly, the proportion of obese males in the population (BMI > 30) equals:

     P (X > 30) = P (Y > ln(30)) = P ( N (3.21, 0.15²) > 3.40 ) = 0.1026

Note that the last step in the above equations is obtained from using statistical
tables or computer programs.
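These two probabilities can also be reproduced numerically. A minimal sketch, assuming Python with scipy, and using the estimates μ = 3.21 and σ = 0.15 quoted above:

    import numpy as np
    from scipy.stats import norm

    mu, sigma = 3.21, 0.15
    p_overweight = 1 - norm.cdf(np.log(25), loc=mu, scale=sigma)   # P(Y > ln(25))
    p_obese      = 1 - norm.cdf(np.log(30), loc=mu, scale=sigma)   # P(Y > ln(30))
    print(round(p_overweight, 4), round(p_obese, 4))
    # approximately 0.47 and 0.10; the slide values 0.4734 and 0.1026 differ slightly
    # because ln(25) and ln(30) were rounded to 3.22 and 3.40 there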

Introduction to Biostatistics

191

[Diagram: POPULATION (distribution of BMI, with P (X > 25) = 47.34% and P (X > 30) = 10.26%) estimated from the random SAMPLE (histogram of the xi) via inference and estimation]

Introduction to Biostatistics

192

10.4

Example: Normal values

Normal, or reference, values are often used in the reporting of clinical test results
95% normal values are the values c1 and c2 such that 95% of the total population
falls in between those values:

[Figure: distribution with 95% of the total population between the values c1 and c2]

With clinical test results, the normal values are with respect to the normal
(healthy) population
Introduction to Biostatistics

193

Example:

Introduction to Biostatistics

194

The probability that a randomly selected, healthy, patient has a value within the
95% normal values is by definition 95%.
When two independent parameters are measured, the probability that a randomly
selected, healthy, patient has both parameters within the respective 95% normal
value ranges equals
P (Both parameters within the 95% normal range) = 0.95 × 0.95 = 0.9025
Hence, combining two sets of 95% normal values leads to region which contains
only 90.25% of the total population.
In general, one has :
P (k parameters within the 95% normal range) = 0.95k

Introduction to Biostatistics

195

Some values:

    k        0.95^k
    1        0.9500
    2        0.9025
    5        0.7738
   10        0.5987
   20        0.3585
   50        0.0769
  100        0.0059

Hence, for 100 tests, we have almost certainty that at least one parameter will
take a value outside its 95% normal range.
Obviously, one can use higher percentages (e.g., 99% instead of 95%), but the
problem of multiple testing remains.
Note that the above calculations assume the tested parameters to be independent.
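The column 0.95^k above can be reproduced directly; a minimal sketch in Python (illustrative only):

    for k in [1, 2, 5, 10, 20, 50, 100]:
        print(k, round(0.95 ** k, 4))
    # e.g. 10 -> 0.5987 and 100 -> 0.0059: with 100 independent tests it is almost
    # certain that at least one value falls outside its 95% normal range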
Introduction to Biostatistics

196

For example, suppose that a normal value for parameter 1 always leads to a
normal value for parameter 2, and vice versa, we would have that
P (Both parameters within the 95% normal range)
= P (The first parameter within the 95% normal range) = 0.95
Alternatively, suppose that a normal value for parameter 1 always leads to a
value for parameter 2 which is outside its normal range, we would have that
P (Both parameters within the 95% normal range) = 0
Conclusion:
Normal values need to be interpreted with extreme caution

Introduction to Biostatistics

197

Chapter 11
Estimation, sampling variability, bias, and precision

. Estimation
. Example
. Sampling variability
. Bias and precision
. Sampling distribution of the sample average
. Standard error of the mean

Introduction to Biostatistics

198

11.1

Estimation

One does not always have to estimate the complete distribution of a random
variable X
Often, interest is in specific characteristics of the distribution, such as the
population average μ
One can then try to draw conclusions about μ, based on the observed data in the
sample, without having to specify the distribution of X
Since the population characteristics (mean, median, variance, . . . ) are summary
statistics in an infinitely large sample, it is natural to estimate them using the
sample versions.

Introduction to Biostatistics

199

Summarized:

  Population parameter        Estimate from sample

  μ                           μ̂ = x
  σ²                          σ̂² = s²
  population median           sample median
  population IQR              sample IQR
  ...                         ...

Note that it is very unlikely that the estimate is identical to the parameter it is
estimating
How close the estimate will be to the true value depends on various aspects.
Some key results will be explained in the next sections
Introduction to Biostatistics

200

[Diagram: POPULATION (distribution of X, with characteristics of f (x) such as μ and σ²) and a random SAMPLE (histogram of the xi); inference and estimation yield the estimates μ̂ = x and σ̂² = s²]

Introduction to Biostatistics

201

11.2

Example: BMI

Re-consider the random sample of n = 2605 Belgian males, with the outcome X
of interest being the body mass index (BMI):

Introduction to Biostatistics

202

We previously described the distribution of BMI with a normal distribution for the
log-transformed values, and we estimated the percentage of people in the
population with overweight (BMI > 25) to be 47.34%
Note that this percentage equals π = P (X > 25), which is a characteristic of the
BMI distribution in the Belgian male population
We can estimate this by the observed proportion π̂ of males in the sample with
overweight:

     π̂ = (number of males with xi > 25) / 2605 = 46.99%

Note that π̂ is a new estimate for π = P (X > 25), which does not require
estimating the whole distribution of BMI in the total population.
If our new estimate would have been very different from the previous one
(47.34%), this would have been some indication that our estimation of the BMI
distribution was not accurate.
Introduction to Biostatistics

203

[Diagram: POPULATION (distribution of BMI, with characteristic π = P (X > 25)) and a random SAMPLE (histogram of the xi); inference and estimation yield π̂ = observed proportion of males with BMI > 25]

Introduction to Biostatistics

204

11.3

Sampling variability

Suppose interest is in the estimation of some characteristic θ of the distribution of
a specific random variable X
θ could be the mean μ, the variance σ², but for example also the percentage of
people with X > 25.
Based on a random sample, an estimate θ̂ for θ can be obtained, for example:

                      Population              Sample
                   (never observable)       (observable)

  Mean:                   μ                   μ̂ = x
  Variance:               σ²                  σ̂² = s²
  In general:             θ                   θ̂

Introduction to Biostatistics

205

The estimate θ̂ is calculated from the observed data, hence the resulting value for
θ̂ completely depends on the sample that was drawn from the population.
Repeating the experiment would lead to another sample, other observations, thus
also to another estimate θ̂ for θ.
The estimate θ̂ can therefore be interpreted as one realized value of a random
variable θ̂
The distribution of θ̂ is called the sampling distribution of θ̂. It describes what
values of θ̂ are to be expected should the experiment be repeated many times.
In general, the sampling distribution of θ̂ depends on:

. The statistic θ̂: different for mean, median, variance, . . .
. The distribution of the original data (i.e., of X)
. The sample size
Introduction to Biostatistics

206

11.4

Bias and precision

Suppose, as before, that interest is in some characteristic θ of the distribution of
X, and that an estimate θ̂ is available
θ̂ can then be interpreted as one observation of the random variable θ̂
The sampling distribution of θ̂ is important as it reflects how likely it is that an
estimate θ̂ would be obtained which is far away from the true value θ.

Introduction to Biostatistics

207

Example:

Distribution of θ̂

. Asymmetric
. Unlikely to have serious underestimation
. Likely to have serious overestimation
. On average, our estimate will be correct

Example:

Distribution of θ̂

. Symmetric
. Under- and overestimation equally likely
. On average, our estimate will be correct

Introduction to Biostatistics

208

Example:

Distribution of θ̂

. Symmetric
. Under- and overestimation equally likely
. On average, our estimate will be correct
. Very precise estimation of θ

Example:

Distribution of θ̂

. Symmetric
. Under- and overestimation equally likely
. On average, our estimate will be correct
. A lot of uncertainty about θ

Introduction to Biostatistics

209

Example:

Distribution of θ̂

. Symmetric
. On average, our estimate is not correct
. We have a biased estimator θ̂ for θ

In general we prefer unbiased estimators, with as much precision as possible.

We will now investigate the sampling distribution of the sample average θ̂ = X̄

Introduction to Biostatistics

210

11.5

Sampling distribution of the sample average

Suppose interest is in the estimation of the mean μ of some random variable X

Based on a random sample, μ will be estimated by the sample average x, which is
one realized value of the random variable X̄
Questions:

. When is the sample average biased ?
. When is the sample average precise ?

As discussed before, the sampling distribution of X̄ will depend on the distribution
of the original data X, as well as on the sample size n
We will therefore simulate the sampling distribution of X̄, under various settings
Introduction to Biostatistics

211

Simulation steps:

. Randomly generate n observations from the distribution of X: x1, . . . , xn
. Based on this data set, calculate the sample average x
. Repeat the above steps many times (e.g., 1000 times)
. Study the histogram of the (1000) realized averages

  Sample 1     x(1)1, x(1)2, x(1)3, . . . , x(1)n     x(1)
  Sample 2     x(2)1, x(2)2, x(2)3, . . . , x(2)n     x(2)
  Sample 3     x(3)1, x(3)2, x(3)3, . . . , x(3)n     x(3)
  Sample 4     x(4)1, x(4)2, x(4)3, . . . , x(4)n     x(4)
  Sample 5     x(5)1, x(5)2, x(5)3, . . . , x(5)n     x(5)
  ...          ...                                    ...
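A minimal sketch of these simulation steps in Python (the course itself uses the Vestac applet; numpy is assumed here purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n, n_repeats = 10, 1000

    averages = [rng.normal(loc=5, scale=1, size=n).mean()   # one sample of size n from N(5, 1)
                for _ in range(n_repeats)]                  # repeated 1000 times

    print(round(np.mean(averages), 2))   # close to the true mean 5 (unbiasedness)
    print(round(np.std(averages), 2))    # close to sigma/sqrt(n) = 1/sqrt(10), about 0.32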


Introduction to Biostatistics

212

The following scenarios will be used for the distribution of X:

. X ∼ N (5, 1), true mean μ = 5
. X ∼ χ²4, true mean μ = 4
. X ∼ Bernoulli(1/5), true mean μ = 0.2
. X ∼ Poisson(1), true mean μ = 1

The following scenarios will be used for the sample size n:

. n = 2
. n = 5
. n = 10
. n = 50

Calculations: Vestac Java Applet → basics → distribution of mean
Introduction to Biostatistics

213

Results for X ∼ N (5, 1) (μ = 5):

Introduction to Biostatistics

214

Results for X ∼ χ²4 (μ = 4):

Introduction to Biostatistics

215

Results for X ∼ Bernoulli(1/5) (μ = 0.2):

Introduction to Biostatistics

216

Results for X ∼ Poisson(1) (μ = 1):

Introduction to Biostatistics

217

General conclusions: For large samples, the sampling distribution of X̄ becomes . . .

. . . . symmetric around the true value for μ
. . . . more concentrated around the true value for μ
. . . . normally distributed

One can prove the following theoretical result:

     For any random variable X with mean μ and variance σ²,
     and for n sufficiently large,

          X̄ ≈ N (μ, σ²/n)

This is the Central Limit Theorem (CLT), which will be the basis for most
calculations from now on
Introduction to Biostatistics

218

This implies that, for sufficiently large samples, x is an unbiased estimate for μ,
which becomes more precise as the sample gets larger
What is sufficiently large ? The simulation results have shown that this entirely
depends on the distribution of the original data X. Hence, no generally valid
answer can be given.
One can also use similar simulation studies to investigate the sampling distribution
of other statistics such as the median, the variance, . . . . However, no general
results can be derived as in the CLT
For the variance, such simulations (Vestac Java Applet → basics → distribution
of variance) show that . . .

. . . . the sample variance s² is unbiased for σ²
. . . . the precision of s² increases with n

Introduction to Biostatistics

219

These results (and the CLT) are the key motivation for conducting large studies,
since collecting additional information (more observations, larger sample) will lead
to increased precision in the estimation:

One can buy extra precision with extra observations

Introduction to Biostatistics

220

11.6

The standard error of the mean

It follows from the CLT that the standard deviation of X̄ is equal to σ/√n.

This standard deviation is also called the standard error of the mean (s.e.m.)
The s.e.m. reflects the precision in the estimation of μ by x
The s.e.m. is often presented in publications and reports, as an indication of how
precise one's conclusions are.
As an example, consider summarizing/describing the BMI values for a number of
different professions

Introduction to Biostatistics

221

Average ± standard deviation

. Describing location in samples
. Describing spread in samples
. Meaningful for symmetric distributions only

Average ± s.e.m.

. Describing location in samples
. Describing precision of location estimation
. Always meaningful since X̄ normal for sufficiently large n
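The distinction can be made concrete on a small, purely hypothetical sample of BMI values; a minimal sketch assuming Python with scipy (the data below are invented for illustration only):

    import numpy as np
    from scipy import stats

    bmi = np.array([22.1, 25.4, 27.8, 23.0, 26.5, 24.2, 29.1, 21.7])  # hypothetical values

    print(round(bmi.mean(), 2))          # sample average
    print(round(bmi.std(ddof=1), 2))     # sample standard deviation s: spread of the data
    print(round(stats.sem(bmi), 2))      # s.e.m. = s / sqrt(n): precision of the average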

Introduction to Biostatistics

222

Chapter 12
Confidence intervals

. Example
. The confidence interval
. Interpretation
. Properties of confidence intervals
. Example
. Example from the biomedical literature

Introduction to Biostatistics

223

12.1

Example: Captopril data

Consider the Captopril data, where blood pressure was taken in 15 hypertensive
patients, before and after administration of the drug Captopril:

Interest is in estimating the average change in diastolic BP.


Introduction to Biostatistics

224

Let X be the difference in diastolic BP before and after treatment:

     X = BPbefore - BPafter

The observed values xi for X can be calculated from the observed values of the
BP in our sample:

Introduction to Biostatistics

  Patient    Before DBP    After DBP    Change xi

     1          130           125            5
     2          122           121            1
     3          124           121            3
     4          104           106           -2
     5          112           101           11
     6          101            85           16
     7          121            98           23
     8          124           105           19
     9          115           103           12
    10          102            98            4
    11           98            90            8
    12          119            98           21
    13          106           110           -4
    14          107           103            4
    15          100            82           18

225

Note that, in relatively small samples, the histogram can be difficult to interpret.
One therefore prefers not to estimate the complete distribution of X
On the other hand, there does not seem to be strong evidence for severe skewness.
Focus will be on the estimation of the average μ of X. As before, our estimate
will be the sample average:

     μ̂ = x = 9.27

Since every other sample would have led to another estimate μ̂, it is of interest
to know how likely it is that our estimate is far from the true value μ

We want to derive an interval around our estimate μ̂ = 9.27 which is very likely to
contain the true value μ

Introduction to Biostatistics

226

[Diagram: POPULATION (distribution of X, with characteristic μ = average change in diastolic BP) and the random SAMPLE (histogram of the xi); inference and estimation yield μ̂ = x = 9.27]

Introduction to Biostatistics

227

12.2

The confidence interval

The CLT describes what values for x are to be expected if one would repeatedly
draw new samples. If n is sufficiently large, we have that:

     X̄ ≈ N (μ, σ²/n)

So, thanks to the CLT, we can calculate how likely it is to have an estimate far
from the correct value μ, or close to the correct value μ
Introduction to Biostatistics

228

In our example, n = 15 which is rather small. However, since the distribution of X
is relatively symmetric, n = 15 is probably sufficiently large for the CLT to apply.
Let us calculate the probability that a random sample would yield an estimate x
which is less than 1 unit apart from μ:

     P (-1 ≤ X̄ - μ < 1) = P ( -1/√(σ²/n) ≤ (X̄ - μ)/√(σ²/n) < 1/√(σ²/n) )

As always, σ² is estimated by s² = 74.21, and n = 15, so we have:

     P (-1 ≤ X̄ - μ < 1) = P (-0.45 ≤ N (0, 1) < 0.45) = 35%

Hence, a random sample will in 35% of the cases yield an estimate for μ which is
less than 1 unit apart from μ.

Introduction to Biostatistics

229

The above calculations can be repeated for other distances between x and μ:

  Distance |x - μ|     Probability
        1                  35%
        2                  63%
        3                  82%
        4.36               95%
        6.25               99%

For example, 99% of the random samples would yield a sample average that is not
further away from μ than 6.25 units
So, there is 99% chance that the interval [x - 6.25; x + 6.25] contains μ.
The interval [x - 6.25; x + 6.25] is called the 99% confidence interval (C.I.)
for μ.
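The 95% distance of 4.36 units, and hence the 95% C.I., can be reproduced from the CLT approximation. A minimal sketch assuming Python with scipy, using s² = 74.21 and n = 15:

    import numpy as np
    from scipy.stats import norm

    xbar, s2, n = 9.27, 74.21, 15
    half_width = norm.ppf(0.975) * np.sqrt(s2 / n)    # 1.96 * sqrt(s^2/n) = 4.36
    print(round(half_width, 2))
    print(round(xbar - half_width, 2), round(xbar + half_width, 2))   # approximately [4.91; 13.63]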
Introduction to Biostatistics

230

In our example, this interval equals:

     [x - 6.25; x + 6.25] = [9.27 - 6.25; 9.27 + 6.25] = [3.02; 15.52]

The percentage 99% is called the confidence level
Confidence intervals for other confidence levels:

  Level     Confidence interval
  35%       [x - 1; x + 1]              [8.27; 10.27]
  63%       [x - 2; x + 2]              [7.27; 11.27]
  82%       [x - 3; x + 3]              [6.27; 12.27]
  95%       [x - 4.36; x + 4.36]        [4.91; 13.63]
  99%       [x - 6.25; x + 6.25]        [3.02; 15.52]

In biomedical sciences, one traditionally uses 95% confidence levels


Introduction to Biostatistics

231

12.3

Interpretation

Let us focus on the 95% confidence interval. For other confidence levels, the
interpretation is similar.
We derived that 95% of the random samples would yield a sample average x that
is not further away from μ than 4.36 units
So, one can expect that approximately 95 out of 100 samples would lead to an
interval [x - 4.36; x + 4.36] that contains μ.
For a specific data set, such as the Captopril data, the obtained confidence interval
[4.91; 13.63] may or may not contain μ. However, it is very likely to contain μ,
since only 5 out of 100 data sets would lead to an interval not containing μ.
Illustration: Vestac Java Applet → statistical tests → confidence interval for mean
Introduction to Biostatistics

232

Introduction to Biostatistics

233

12.4

Properties of confidence intervals

Ideally, C.I.s are small, as this reflects a very precise estimation of the unknown
population parameter
Hence, a C.I. can be used as an indication of the precision of the estimation:
. short C.I.: precise estimation

. long C.I.: imprecise estimation, much uncertainty


The length of the C.I. increases with the confidence level:

  Level     Confidence interval
  95%       [4.91; 13.63]
  99%       [3.02; 15.52]

Introduction to Biostatistics

234

Intuitively: larger intervals are more likely to contain the unknown population
parameter
The length of the C.I. decreases with the sample size n
Illustration: Vestac Java Applet statistical tests confidence interval for mean

Introduction to Biostatistics

235

Intuitively: More observations lead to more precision:

One can buy extra precision with extra observations

The length of the C.I. increases with the variance σ² of the original data

Intuitively: The more the observations are alike, the more precisely the mean μ can
be estimated:

[Figure: a distribution with small spread (precise estimation of μ) next to one with large spread (imprecise estimation of μ)]

Introduction to Biostatistics

236

What about 100% C.I.s ?

The 100% C.I. for μ equals [-∞; +∞], which is not informative at all
Intuitively: Absolute certainty about population characteristics cannot be
attained based on a finite sample of observations

Introduction to Biostatistics

237

12.5

Example: BMI

The concept of C.I. has been explained in the context of the estimation of a
population average
However, C.I.s can be constructed for any characteristic θ of the distribution of
the random variable X of interest (variance, mean, median, proportion)
As an example, we re-consider the BMI data on n = 2605 Belgian males, where we
estimated the proportion π of males in the population with overweight (BMI > 25)
by the observed proportion π̂ = 46.99%
As an indication of the precision of this estimate, we can calculate, e.g., a 95%
C.I. for π: [0.45; 0.49]
The interval [0.45; 0.49] contains the unknown proportion π with 95% probability
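This interval can be reproduced with the usual normal approximation π̂ ± 1.96 √( π̂(1 - π̂)/n ). A minimal sketch assuming Python with scipy:

    import numpy as np
    from scipy.stats import norm

    n, pi_hat = 2605, 0.4699
    half_width = norm.ppf(0.975) * np.sqrt(pi_hat * (1 - pi_hat) / n)
    print(round(pi_hat - half_width, 2), round(pi_hat + half_width, 2))   # approximately 0.45 and 0.49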
Introduction to Biostatistics

238

12.6

Example from the biomedical literature

Wong et al. [10], Table 2:

C.I.s for differences between


means and medians

Introduction to Biostatistics

239

Chapter 13
Hypothesis testing

. Example
. Null and alternative hypothesis
. The p-value and level of significance
. Possible errors in decision making
. Hypothesis testing versus confidence intervals
. Example
. Example from the biomedical literature

Introduction to Biostatistics

240

13.1

Example

We continue the example with the Captopril data, where blood pressure was taken
in 15 hypertensive patients, before and after administration of the drug Captopril:

Interest is in deciding whether the treatment did affect the diastolic BP


Introduction to Biostatistics

241

As before, X is the difference in diastolic BP before and after treatment:

     X = BPbefore - BPafter

The observed values xi for X can be calculated from the observed values of the
BP in our sample:

Introduction to Biostatistics

  Patient    Before DBP    After DBP    Change xi

     1          130           125            5
     2          122           121            1
     3          124           121            3
     4          104           106           -2
     5          112           101           11
     6          101            85           16
     7          121            98           23
     8          124           105           19
     9          115           103           12
    10          102            98            4
    11           98            90            8
    12          119            98           21
    13          106           110           -4
    14          107           103            4
    15          100            82           18

242

Focus will be on finding evidence that the treatment affected the BP.
In case the treatment would have no effect, the average μ of X would be zero.
So, if one can show that there is (strong) evidence that μ ≠ 0, then this can be
considered as evidence for a treatment effect.
Based on our sample, the estimate for μ is μ̂ = x = 9.27

Obviously, this estimate is relatively far away from 0, suggesting that the
treatment might have affected BP
On the other hand, the observed effect μ̂ = 9.27 could have occurred by pure
chance, even if there would be no treatment effect at all.

In general, one will decide that there is evidence that μ ≠ 0 if our estimate x of μ
is far away from 0, i.e., if |x - 0| is large.
Introduction to Biostatistics

243

13.2

Null and alternative hypothesis

The procedure to decide whether there is sufficient evidence to believe the


treatment did affect BP is called test of hypothesis
In practice, the research question is formulated in terms of a null hypothesis H0
and an alternative hypothesis HA:

     H0 : μ = 0     versus     HA : μ ≠ 0

Based on our observed data, we will investigate whether H0 can be rejected in
favour of HA
If not, the null hypothesis H0 is accepted and one decides that the treatment
was not effective

Introduction to Biostatistics

244

Introduction to Biostatistics

245

[Diagram: POPULATION (distribution of X, with hypotheses H0 : μ = 0 versus HA : μ ≠ 0) and the random SAMPLE (histogram of the xi); inference and estimation yield μ̂ = x = 9.27]

Introduction to Biostatistics

246

13.3

The p-value and level of significance

Intuitively, it is obvious that H0 : μ = 0 will be rejected if the observed sample
average x is too far away from 0
Question:
How far is too far ?
Answers:
If this result is very unlikely to happen by pure chance
If this result is not at all what you expect to see if μ would be 0

Introduction to Biostatistics

247

The CLT will help us in deciding, as it describes what values for x are to be
expected if one would repeatedly draw new samples. If n is sufficiently large, we
have that:

     X̄ ≈ N (μ, σ²/n)

As always, σ² is estimated by s² = 74.21, and n = 15.

Moreover, we are interested in knowing what values for x could be expected if μ
would be 0

Introduction to Biostatistics

248

Hence, the sampling distribution of interest is:

     X̄ ∼ N (0, 74.21/15)

[Figure: this density, centred at 0]

So, if μ = 0, we expect that random samples generate averages that behave
according to the above distribution
Hence, if we observe a random sample with an average that is very extreme
according to this distribution, we should question the validity of the null
hypothesis μ = 0

Introduction to Biostatistics

249

How far should x be from 0 in order to consider this extreme ?

Should x = 1 be considered extreme ?

     X̄ ∼ N (0, 74.21/15)

If μ = 0, the probability of observing a x less than 1 unit away from 0 is:

     P (-1 ≤ X̄ ≤ 1) = P ( (-1 - 0)/√(74.21/15) ≤ (X̄ - 0)/√(74.21/15) ≤ (1 - 0)/√(74.21/15) )

                     = P (-0.45 ≤ N (0, 1) ≤ 0.45) = 35%

Introduction to Biostatistics


250

Hence, if there is no treatment effect, i.e., if μ = 0, then there is only 35% chance
of having a random sample with average x within 1 unit away from 0.
So, there would be 65% chance of observing a sample with an average more than
1 unit away from 0.
Observing x = 1 cannot really be considered a lot of evidence against H0 : μ = 0
Similar calculations can be used for other values, such as 2, 3, . . .

[Figure: the N (0, 74.21/15) density with the points -3, -2, -1, 0, 1, 2, 3 marked]

Introduction to Biostatistics

251

The corresponding probabilities of observing a sample average more than
1, 2, 3, . . . units away from 0 are:

  Distance |x - 0|     Probability
        1                  65%
        2                  37%
        3                  18%

This probability can also be calculated for the distance |x - 0| = 9.27 that was
observed in our experiment:

[Figure: the N (0, 74.21/15) density with the points -9.27, 0, and 9.27 marked]

Introduction to Biostatistics
252

The corresponding probability equals:

  Distance |x - 0|     Probability
        1                  65%
        2                  37%
        3                  18%
        9.27              0.1%

Hence, if μ would indeed equal 0, then it would be very unlikely to observe a
sample with average as extreme as 9.27. This would happen only once every 1000
times a sample would be taken.
We therefore consider the data observed in our experiment sufficient evidence to
reject the null hypothesis and we conclude that the treatment effect is
significantly different from 0, or equivalently, that there is a significant
treatment effect
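The p-value of about 0.1% can be reproduced with a one-sample t-test on the 15 observed changes from the table above. A minimal sketch assuming Python with scipy:

    from scipy import stats

    changes = [5, 1, 3, -2, 11, 16, 23, 19, 12, 4, 8, 21, -4, 4, 18]
    t_stat, p_value = stats.ttest_1samp(changes, popmean=0)   # H0: mu = 0, two-sided
    print(round(t_stat, 2), round(p_value, 4))                # t about 4.17, p about 0.001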
Introduction to Biostatistics

253

The probability 0.1% that expresses how extreme our observations are in case the
null hypothesis would be true is denoted by p, and is called the p-value.
A small p-value indicates that the observed results would be extreme if H0 were
true. One then rejects the null hypothesis
A large p-value indicates that the observed results are perfectly in line with what
one can expect to observe if H0 is true. One then does not reject the
null hypothesis, which is equivalent to accepting the null hypothesis
In practice, one has to decide how small p should get before the null hypothesis is
rejected.
One therefore specifies the so-called level of significance α:

     p < α   =⇒  reject H0
     p ≥ α   =⇒  accept H0
Introduction to Biostatistics

254

α is typically a small value, such as 0.01, 0.05, 0.10

In biomedical sciences, α = 0.05 = 5% is standard.
One then rejects the null hypothesis as soon as the observed result would happen
in fewer than 5 out of 100 experiments, assuming that the null hypothesis would
be correct
Strictly speaking, one should always mention what level of significance has been
used, and the conclusion would have to be formulated as the treatment effect is
significantly different from 0 at the 5% level of significance, or equivalently,
that there is a significant treatment effect at the 5% level of significance.

Introduction to Biostatistics

255

Note that specification of α is only required if a formal decision is preferred
(accept or reject).
It is therefore not meaningful to report borderline significance in examples
where p is only slightly larger than α (e.g., p = 0.06 > α = 0.05)

Introduction to Biostatistics

256

13.4

Possible errors in decision making

In our example about the Captopril treatment, we obtained p = 0.001 leading to


the rejection of the null hypothesis of no treatment effect.
This should not be considered as formal proof that there is a treatment effect
Even if the treatment has no effect at all, a sample like ours would occur once
every 1000 times.
Maybe, our sample was indeed the extreme one that happens once every thousand
experiments.
Alternatively, suppose we would have obtained p = 0.9812. We then would not
have rejected the null hypothesis, and concluded that there is no evidence for any
treatment effect.
Introduction to Biostatistics

257

This should not have been considered as formal proof that any treatment effect
would be absent.
Maybe the treatment effect μ is not 0, but very close to 0. The data one then
would observe would look very similar to data that would be observed if μ = 0,
such that the data do not allow us to detect that μ ≠ 0
Conclusion:
Statistics can prove everything

Intuitively: Absolute certainty about


population characteristics cannot be
attained based on a finite sample of observations
Introduction to Biostatistics

258

Introduction to Biostatistics

259

13.5

Hypothesis testing versus confidence intervals

For the Captopril data, we have drawn conclusions about the average treatment
effect μ in the population, through two different statistical procedures:

. 95% confidence interval: [4.91; 13.63]
. Significance of treatment effect, p = 0.001

We know from the C.I. that the average treatment effect μ is likely to be between
4.91 and 13.63, excluding 0
The significance test has rejected the value 0 as a possible value for μ
So, both procedures agree

Introduction to Biostatistics

260

Question:
Do both procedures always agree ?

Answer:
Yes, provided the levels of significance and confidence are complementary to each
other:

  Level of significance α     Confidence level (1 - α)100%
          0.05                           95%
          0.10                           90%
          0.01                           99%

Introduction to Biostatistics

261

In case of accepting H0 (p ≥ α = 0.05):

[Figure: a 95% C.I. that contains the value specified by H0]

In case of rejecting H0 (p < α = 0.05):

[Figure: a 95% C.I. that does not contain the value specified by H0]

Introduction to Biostatistics

262

An alternative interpretation for the C.I. follows immediately:

     A 95% C.I. is the collection of all null
     hypotheses that would be accepted in a
     statistical test

Statistical tests are to some extent equivalent to C.I.s

However, C.I.s have the advantage of giving an indication of the effect size
(treatment estimate θ̂), as well as of the precision of estimation (width of C.I.)
So, C.I.s should be preferred over statistical tests
Introduction to Biostatistics

Biomedical literature

263

13.6

Example: BMI

The concept of statistical tests has been explained in the context of the
estimation of a population average
However, tests can be constructed for any characteristic θ of the distribution of
the random variable X of interest (variance, mean, median, proportion)
As an example, we re-consider the BMI data on n = 2605 Belgian males, where we
estimated the proportion π of males in the population with overweight (BMI > 25)
by the observed proportion π̂ = 46.99%
Suppose it would be known that 10 years before our sample was taken, only 40%
of the Belgian males suffered from overweight

Introduction to Biostatistics

264

If one wants to investigate whether overweight is occurring more frequently now,
it would be of interest to test the following hypotheses:

     H0 : π ≤ 0.40     versus     HA : π > 0.40

If H0 would be rejected in favour of HA, we would conclude that we found
evidence that nowadays more males suffer from overweight
The corresponding p-value equals p < 0.0001, so we conclude that the proportion
of males in the Belgian population with overweight is significantly larger than
40%, at the 5% level of significance.
This is an example of a one-sided test, since the alternative hypothesis is at one
side of the null hypothesis only.
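As an illustration, a one-sided z-test for a proportion, based on the usual normal approximation, also gives p < 0.0001 here. This is a sketch only; the exact procedure used for the slides is not specified. It assumes Python with scipy:

    import numpy as np
    from scipy.stats import norm

    n, pi_hat, pi_0 = 2605, 0.4699, 0.40
    z = (pi_hat - pi_0) / np.sqrt(pi_0 * (1 - pi_0) / n)   # standardized difference under H0
    p_value = 1 - norm.cdf(z)                              # one-sided: HA: pi > 0.40
    print(round(z, 2), p_value)                            # z about 7.3, p far below 0.0001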

Introduction to Biostatistics

265

In the earlier Captopril example, the hypotheses were

     H0 : μ = 0     versus     HA : μ ≠ 0

This was an example of a two-sided test

Note how for one- and two-sided tests, the conclusions are formulated slightly
differently:

. Two-sided: . . . is (not) significantly different from . . .
. One-sided: . . . is (not) significantly smaller/larger than . . .

Introduction to Biostatistics

266

13.7

Example from the biomedical literature

Wong et al. [10]


. Section on statistical methodology:

. Two-sided tests
. 5% level of significance

Introduction to Biostatistics

267

. Table 2:

. C.I.s for differences


between means and medians
. Corresponding tests for
significance

Introduction to Biostatistics

268

Part V
Some frequently used tests

Introduction to Biostatistics

269

Chapter 14
The comparison of two means: Unpaired data

. Example
. Confidence interval for the difference
of two means
. The unpaired t-test
. Assumptions
. Example: Survival times of cancer patients
. Example from the biomedical literature

Introduction to Biostatistics

270

14.1

Example

Re-consider the example on the weight gain in rats, where interest is in the
comparison between rats fed on a high or low protein diet
Group-specific histograms:

Introduction to Biostatistics

271

Group-specific summary statistics:

On average, there is an observed difference of 19g between the rats on a high


protein diet and those on a low protein diet.
Is this observed difference sufficient evidence to conclude that there indeed is an
effect of diet on the weight gain ?
It would be of interest to know how likely such a difference of 19g is to occur if
weight gain would be completely unrelated to the protein level of the diet.

Introduction to Biostatistics

272

Note that, strictly speaking, we have two populations, with a sample randomly
drawn from each:
. High protein rats: The hypothetical population of all rats that are given a
high protein diet
. Low protein rats: The hypothetical population of all rats that are given a
low protein diet
From the first population, a random sample of n1 = 12 rats was taken. From the
second one, a random sample of n2 = 7 rats was drawn.
The corresponding observed means are x1 = 120 and x2 = 101 respectively.
Because there is no relation between the observations taken from the first
population and those taken from the second, we have unpaired data.

Introduction to Biostatistics

273

14.2

Confidence interval for the difference of two means

Let μ1 and μ2 be the (unknown) mean weight gain in the high and low protein
population, respectively:

[Figure: two densities, low protein centred at μ2 and high protein centred at μ1]

Of interest is to draw inferences about μ1 - μ2


Introduction to Biostatistics

274

As always, our estimate of μ1 - μ2 is

     μ̂1 - μ̂2 = x1 - x2 = 19

Based on the observed data, C.I.s can be constructed for μ1 - μ2

For example, a 95% C.I. for μ1 - μ2 is given by [-2.19; 40.19]
The true difference μ1 - μ2 may or may not be in the interval [-2.19; 40.19].
However, if 100 similar experiments would be conducted, then 95 out of the 100
corresponding C.I.s are expected to contain μ1 - μ2.
Hence, with 95% certainty, we can conclude that we believe μ1 - μ2 to be within
the interval [-2.19; 40.19].

Introduction to Biostatistics

275

This C.I. shows that:

. the estimate (19g) of μ1 - μ2 is a very imprecise estimate:
     the C.I. is very wide
     with 95% chance, the estimate is no more than 21.19 units away from the true difference
. based on our data, it cannot be ruled out that μ1 - μ2 would be zero, i.e., that
  there would be no difference between both populations.

Introduction to Biostatistics

276

14.3

The unpaired t-test

Often, it is of interest to test whether two populations have the same mean.
This is translated into a set of hypotheses of the form:

     H0 : μ1 = μ2     versus     HA : μ1 ≠ μ2

We will reject the null hypothesis if the observed data show too much deviation
from what is expected to be seen if the null hypothesis were correct
Hence, we will reject H0 if x1 is much larger than x2, or vice versa
This is equivalent to rejecting H0 if |x1 - x2| is too large

Introduction to Biostatistics

277

Question:
How large is too large ?
Answer:
If the observed difference |x1 - x2|
is very unlikely to happen by pure chance

We therefore calculate the probability p of observing a similar experiment with a
mean difference between the groups of at least 19g, if μ1 = μ2.

Introduction to Biostatistics

278

In our example, this probability equals p = 0.0757:

So, even if there is no relation at all between the protein content of the diet and
weight gain, then one can still expect to observe a difference of at least 19g in
7.6% of future similar experiments.
Since p = 0.0757 > 0.05 = α, we consider this insufficient evidence to conclude
that the protein level would indeed affect the weight gain

Introduction to Biostatistics

279

Conclusion:
There is no significant difference (p = 0.0757) in weight gain
between rats on a high protein level diet,
and rats on a low protein level diet

The above testing procedure is called the unpaired t-test since unpaired data are
analysed, and since the calculation of the p-value is based on the t-distribution.

Introduction to Biostatistics

280

14.4

Assumptions

The calculation of the C.I., as well as the computation of the p-value, are based on
the sampling distribution of X̄1 - X̄2, which describes what values for x1 - x2
can be expected in case the experiment would be repeated many times.
The sampling distribution of X̄1 - X̄2 is completely determined by the
sampling distributions of X̄1 and X̄2
In case of large samples, those distributions are known to be normal (CLT)
In small samples, this normality of X̄1 and X̄2 is only valid in cases where the
original data are (approximately) normally distributed.

Introduction to Biostatistics

281

Therefore, in case of small samples, one assumes the outcome to be normally
distributed in each group separately:

[Figure: two normal densities, low protein centred at μ2 and high protein centred at μ1]

Introduction to Biostatistics

282

Conclusion:

. Large samples: no assumptions
     [Figure: two arbitrary densities for the low protein (μ2) and high protein (μ1) groups]
. Small samples: Normality in both groups
     [Figure: two normal densities for the low protein (μ2) and high protein (μ1) groups]

Note that the samples in our experiment were small (n1 = 12 and n2 = 7). Hence the
histograms should be explored for any evidence against symmetry

Introduction to Biostatistics

283

The group-specific histograms are:

Note that, given the small sample sizes, assessment of symmetry is difficult
This illustrates another drawback of small samples: Assumptions are often needed,
which are very hard to check based on the observed data.

Introduction to Biostatistics

284

Subject-matter knowledge can often help in deciding whether the underlying
assumptions are realistic
The unpaired t-test also implicitly assumes that both populations have the same
variance
This can be checked with a test for equality of variances, in which the
following hypotheses are tested:

     H0 : σ1² = σ2²     versus     HA : σ1² ≠ σ2²

Most software packages automatically report the results from such a test, and
even provide a corrected unpaired t-test, which corrects for the unequal variances:

Introduction to Biostatistics

285

The variances are not significantly different from each other (p = 0.9788), such
that our original result remains valid.
Note that, since the variances are so similar, the corrected and uncorrected t-tests
yield very similar results (p-values).
Often, non-equality of the variances is associated with non-normality of the data

Introduction to Biostatistics

286

14.5

Example: Survival times of cancer patients

Based on the data on survival times of cancer patients, we want to compare the
survival times of stomach cancer patients with the survival times of colon cancer
patients
Summary statistics:

We observe a large difference of 457.4 - 286 = 171.4 days in average survival time
between both groups.
Introduction to Biostatistics

287

On the other hand, there is a lot of variability between the subjects in both groups.
Hence, it is not clear whether the observed difference of 171 days is sufficient
evidence to conclude that survival times are indeed different for colon cancer
patients and stomach cancer patients
Results of the unpaired t-test:

We do not find a significant difference between both groups, with respect to the
survival time (p = 0.2483).

Introduction to Biostatistics

288

However, the histograms suggest skewness in the data, such that the underlying
assumption of normality becomes questionable:

The skewness in the direction of the large values suggests that a logarithmic (or
similar) transformation might be useful:
X = survival time  →  Y = ln(X) = ln(survival time)
Introduction to Biostatistics

289

[Figure: histogram shapes with corresponding possible transformations]

Introduction to Biostatistics

290

          Stomach                       Colon
      X       Y = ln(X)             X       Y = ln(X)

     124        4.82               248        5.51
      42        3.74               377        5.93
      25        3.22               189        5.24
      45        3.81              1843        7.52
     412        6.02               180        5.19
      51        3.93               537        6.29
    1112        7.01               519        6.25
      46        3.83               455        6.12
     103        4.63               406        6.01
     876        6.78               365        5.90
     146        4.98               942        6.85
     340        5.83               776        6.65
     396        5.98               372        5.92
                                   163        5.09
                                   101        4.62
                                    20        3.00
                                   283        5.65

Introduction to Biostatistics

291

As before, assessing symmetry is difficult due to the small number of observations


in both groups. However, the evidence against symmetry is much weaker now.
Results of unpaired t-test based on transformed data:

The observed difference between both groups is still not significant (p = 0.0671),
but the p-value is very different from what we obtained before the transformation
(p = 0.2483).
This illustrates that:

. assumptions need to be checked


. violation of assumptions can lead to serious errors
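The two unpaired t-tests (raw and log scale) can be reproduced from the survival times listed above. A minimal sketch assuming Python with scipy, and equal variances in both groups as in the slides:

    import numpy as np
    from scipy import stats

    stomach = [124, 42, 25, 45, 412, 51, 1112, 46, 103, 876, 146, 340, 396]
    colon = [248, 377, 189, 1843, 180, 537, 519, 455, 406, 365, 942, 776, 372,
             163, 101, 20, 283]

    print(stats.ttest_ind(stomach, colon))                    # p close to 0.2483 (raw scale)
    print(stats.ttest_ind(np.log(stomach), np.log(colon)))    # p close to 0.0671 (log scale)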

Introduction to Biostatistics

292

Note that this is another example where geometric means and standard
deviations would be useful to describe the location and spread of the survival
times in the two cancer groups separately:

                               Stomach cancer              Colon cancer
  Outcome                      mean (stand.dev.)           mean (stand.dev.)
  Survival time (days)         144.03  (3.49)              314.19  (2.72)
                               = exp(4.97) (= exp(1.25))   = exp(5.75) (= exp(1.00))

                        geometric means and standard deviations

which are very different from the arithmetic means and standard deviations that
were reported before:

Introduction to Biostatistics

293

The fact that the formal test has been performed on the log-transformed survival
times does not change the interpretation of the result
If the log-transformed survival times are different for the two groups, then so are
the untransformed survival times
Hence, although the conclusion, strictly speaking, should be that
there is no significant difference in log survival times,
it will often be formulated as
there is no significant difference in survival times.

Introduction to Biostatistics

294

14.6

Example from the biomedical literature

Nissen et al. [15], Table 1:

. Large samples
. Similar variability in both groups
. p < 0.001 rather than p = 0.000

Introduction to Biostatistics

295

Kellett, Kellett, and Nordholm [12], Table 2:

. Relatively small samples

. Variances NOT equal

. Normality assumption NOT satisfied

. No reporting of the p-values

Introduction to Biostatistics

296

Chapter 15
The comparison of two proportions: Unpaired data

. Example
. The chi-squared test
. Assumptions - The Fisher Exact test
. Rows versus columns
. Example: Case-control data
. Example from the biomedical literature

Introduction to Biostatistics

297

15.1

Example

We re-consider the data on sickness absence, collected on 585 employees with a
similar job:

                      Sickness absence
  Gender             No        Yes
  female            245        184        429
  male               98         58        156
                    343        242        585

Introduction to Biostatistics

298

Research question:
Is there a relation between absence and gender ?

184/429 = 42.9% of the females, and 58/156 = 37.2% of the males, have been
absent
This suggests that females are absent more often than males
However, even if absence due to sickness is equally frequent amongst males and
females, the above results could have occurred by pure chance.
It therefore would be of interest to calculate how likely it would be to observe such
differences, by pure chance

Introduction to Biostatistics

299

Note that we have again two populations, with a sample randomly drawn from
each:
. Males: The hypothetical population of all male employees with similar job
conditions
. Females: The hypothetical population of all female employees with similar
job conditions
From the first population, a random sample of n1 = 156 males was taken. From
the second one, a random sample of n2 = 429 females was drawn.
Let π1 and π2 denote the proportions with sickness absence in the total populations
of males and females, respectively. Then π1 and π2 can be estimated by their sample
versions π̂1 = 0.372 and π̂2 = 0.429
Because there is no relation between the observations taken from the first
population and those taken from the second, we have unpaired data.
Introduction to Biostatistics

300

15.2

The chi-squared test

Often, it is of interest to test whether the two populations have the same percentage
of people with absence due to sickness.
This translates into a set of hypotheses of the form:

H0 : π1 = π2    versus    HA : π1 ≠ π2

We will reject the null hypothesis if the observed data show too much deviation
from what would be expected if the null hypothesis were correct
Hence, we will reject H0 if π̂1 is much larger than π̂2, or vice versa
This is equivalent to rejecting H0 if |π̂1 − π̂2| is too large

Introduction to Biostatistics

301

Question:
How large is too large ?
Answer:
If the observed difference |π̂1 − π̂2|

is very unlikely to happen by pure chance

We therefore calculate the probability p of observing a similar experiment with a


difference between the groups at least equal to
|π̂1 − π̂2| = 0.429 − 0.372 = 0.057, if π1 = π2

Introduction to Biostatistics

302

In our example, this probability equals p = 0.215:

So, even if there is no relation at all between gender and absence, one can
still expect to observe a difference of at least 5.7% in 21.5% of future similar
experiments.
Since p = 0.215 > 0.05 = α, we consider this insufficient evidence to conclude
that the occurrence of sickness absence is related to gender

Introduction to Biostatistics

303

Conclusion:
There is no significant difference (p = 0.215) in prevalence
of sickness absence
between males and females

The testing procedure needed for the comparison of proportions in unpaired data
is called the chi-squared test since the calculation of the p-value is based on the
chi-squared (χ²) distribution.
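As a minimal sketch (assuming SciPy; not part of the course material), the chi-squared test for this 2 × 2 table can be carried out as follows:

```python
from scipy.stats import chi2_contingency

# rows = gender (female, male), columns = sickness absence (No, Yes)
table = [[245, 184],
         [98,  58]]

# correction=False gives the uncorrected chi-squared statistic; its p-value should
# be close to the 0.215 reported above (the default Yates continuity correction
# gives a somewhat larger p-value)
chi2, p, df, expected = chi2_contingency(table, correction=False)
print(f"chi-squared = {chi2:.3f}, df = {df}, p = {p:.3f}")
```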

Introduction to Biostatistics

304

15.3

Assumptions: The Fisher Exact test

The calculation of the p-value is based on the sampling distribution of π̂1 − π̂2,
which describes what values for π̂1 − π̂2 can be expected in case the experiment
would be repeated many times.

Note that π̂1 and π̂2 are the sample averages X̄1 and X̄2 of the binary variable
sickness absence.

Hence, for large samples, the sampling distribution of π̂1 − π̂2 directly follows from
the CLT

In small samples, the normality of π̂1 and π̂2 can be problematic, and an
alternative calculation of the p-value is needed.

Introduction to Biostatistics

305

The Fisher Exact test provides an alternative way to calculate the p-value,
without relying on the CLT, nor on the assumption of large samples.
As an example, we consider again data on sickness absence, but from a second,
much smaller, company:

Sickness absence
Gender

No

Yes

female

male

10

12

11

14

The results based on the chi-squared as well as on the Fisher Exact test are:

Introduction to Biostatistics

306

We observe considerable differences due to the (extremely) small sample sizes in


both groups
In larger samples, chi-squared and Fisher Exact produce much more similar
p-values:
Sickness absence
Company    Males        Females       χ² p-value    Fisher Exact p-value
           58/156       184/429       0.215         0.219
           2/12         1/2           0.287         0.396
           107/330      405/1079      0.091         0.102
           37/97        40/122        0.409         0.477
           3/10         48/150        0.895         1.000
           56/156       1/11          0.070         0.100
           1/12         0/1           0.764         1.000
           53/170       0/1           0.501         1.000
           378/1089     117/269       0.007         0.009

Introduction to Biostatistics

307

The Fisher Exact test is very time-consuming, and cannot be calculated for large
samples, except with special software.
However, note that, for large samples, the chi-squared test remains possible, and
yields results very similar to the ones that would have been obtained with the
Fisher Exact test
In practice, it is often standard to use Fisher Exact, unless computational
restrictions require the use of chi-squared.
Conclusion:
Large samples: Chi-squared test
Small samples: Fisher Exact test
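As a sketch (assuming SciPy), both tests can be applied to the large-company data from the first row of the table above; for such large samples the two p-values are indeed very close:

```python
from scipy.stats import chi2_contingency, fisher_exact

# Large company: males 58/156 absent, females 184/429 absent
# rows = gender (male, female), columns = sickness absence (Yes, No)
table = [[58, 98],
         [184, 245]]

chi2, p_chi2, _, _ = chi2_contingency(table, correction=False)
_, p_fisher = fisher_exact(table)

print(f"chi-squared  p = {p_chi2:.3f}")    # should be close to 0.215
print(f"Fisher exact p = {p_fisher:.3f}")  # should be close to 0.219
```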

Introduction to Biostatistics

308

15.4

Rows versus columns

When comparing two unpaired proportions, the data can always be summarized by
a 2 × 2 table:

Sickness absence
Gender      No       Yes
female      A        B        A+B
male        C        D        C+D
            A+C      B+D      A+B+C+D

in which A, B, C, and D represent the number of observations in each cell.


The hypothesis of interest was to compare the prevalence of sickness absence
between males and females.
Introduction to Biostatistics

309

One can show that this is equivalent to comparing the percentage of males
(females) between the employees with and without sickness absence:

B/(A+B) = D/(C+D)    ⟺    C/(A+C) = D/(B+D)

Proof:

B/(A+B) = D/(C+D)  ⟺  B(C+D) = D(A+B)
                   ⟺  BC = AD
                   ⟺  C(B+D) = D(A+C)
                   ⟺  C/(A+C) = D/(B+D)

This implies that, for the analysis of a 2 × 2 table, rows and columns can be
interchanged.
This is of interest for the analysis of case-control data
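A minimal numerical check of this argument (assuming SciPy), using the sickness-absence table of Section 15.1: the chi-squared test applied to the transposed table gives exactly the same statistic and p-value.

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[245, 184],   # female: No, Yes
                  [98,  58]])   # male:   No, Yes

chi2_rows, p_rows, _, _ = chi2_contingency(table, correction=False)
chi2_cols, p_cols, _, _ = chi2_contingency(table.T, correction=False)  # rows and columns swapped

print(chi2_rows, p_rows)   # identical to the line below
print(chi2_cols, p_cols)
```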

Introduction to Biostatistics

310

15.5

Case-control data

We consider the data on cervical cancer, where the relationship between the
occurrence of cervical cancer and the age at first pregnancy is studied.
Data were collected on 49 cancer cases and 317 non-cancer cases (controls). All
women were asked about their age at first pregnancy, and the data are
summarized as:

Disease status
Age        Cervical cancer    Control
≤ 25       42                 203        245
> 25       7                  114        121
           49                 317        366

Introduction to Biostatistics

311

Research question:
Is there a relation between cancer and age ?

Of interest is to compare the prevalence of cancer between women with first


pregnancy before the age of 25, and those with first pregnancy later.
However, correct estimation of these percentages would have required a sample of
women with first pregnancy before the age of 25, and a sample of women with
first pregnancy later
This was not the setup of the present experiment, where a number of cases and a
number of controls are randomly selected, and where all women are then
questioned about their age at first pregnancy.

Introduction to Biostatistics

312

Such a design only allows correct estimation of the percentage of women with first
pregnancy before the age of 25, for cases and controls separately.
However, since rows and columns can be interchanged, this is sufficient to answer
our research question of interest:

Introduction to Biostatistics

313

For testing purposes, rows and columns can be interchanged, implying that the
analysis of case-control data still answers the research question of interest
For descriptive purposes, however, the choice between row and column
percentages entirely depends on the design of the study.
In the above example on cervical cancer, the row-percentages (i.e., percentage of
women with first pregnancy before the age of 25), for cancer cases and controls
separately, are the only ones that reflect the case-control nature of the experiment.

Introduction to Biostatistics

314

15.6

Example from the biomedical literature

Zuskin et al. [16], p.173 and Table 1:

Introduction to Biostatistics

315

It is not clear when chi-squared is used, and when Fisher Exact is used

Introduction to Biostatistics

316

Chapter 16
The comparison of two means: Paired data

. Example
. Confidence interval for the difference of two means
. The paired t-test
. The paired versus unpaired t-test
. Example
. Assumptions
. Example from the biomedical literature

Introduction to Biostatistics

317

16.1

Example

We re-consider the example with the Captopril data, where blood pressure was
taken in 15 hypertensive patients, before and after administration of the drug
Captopril:

Interest is in deciding whether the treatment did affect the diastolic BP


Introduction to Biostatistics

318

As in the unpaired t-test, we might consider this a two-sample case, where a


sample is taken from each of two populations:
. Population 1: Patients without treatment
. Population 2: Patients after treatment with Captopril
Let μ1 be the population average BP if no treatment is given, and let μ2 denote
the population average BP after treatment.
After treatment                Without treatment
      |                              |
      μ2                             μ1

Introduction to Biostatistics

319

Interest is in inference for the difference μ = μ1 − μ2.


The main difference when compared to the unpaired t-test is that each
observation from the first sample now uniquely corresponds to one observation
from the second sample, and vice versa.
Hence, we have paired data

In the case of unpaired data, μ would be estimated by the difference between the
two sample averages:

μ̂ = μ̂1 − μ̂2 = x̄1 − x̄2

In the case of paired data, μ is estimated by the average of all subject-specific
differences between BPs before and after treatment. More specifically, the
variable of interest becomes the difference X in BP before and after treatment:

X = BPbefore − BPafter
Introduction to Biostatistics

320

As before, the observed values xi for X can be calculated from the observed
values of the BP in our sample:

          Before   After    Change
Patient   DBP      DBP      xi
1         130      125        5
2         122      121        1
3         124      121        3
4         104      106       −2
5         112      101       11
6         101      85        16
7         121      98        23
8         124      105       19
9         115      103       12
10        102      98         4
11        98       90         8
12        119      98        21
13        106      110       −4
14        107      103        4
15        100      82        18

μ is the population mean of the variable X, and inference for μ can be based on
the within-subject differences xi, rather than on the original BP measurements.
Note that this situation has been explained in full detail in the Chapters 12 and 13.
Introduction to Biostatistics

321

16.2

Confidence interval for the difference of two means

In Chapter 12, we derived that a 99% confidence interval for μ is given by


[3.02; 15.52].
Other confidence levels (95%, 90%, . . . ) are possible as well

Introduction to Biostatistics

322

16.3

The paired t-test

The hypothesis of interest is


H0 : μ1 = μ2    versus    HA : μ1 ≠ μ2

This is equivalent with the following test about the mean μ of the difference X in
blood pressure:

H0 : μ = 0    versus    HA : μ ≠ 0

In Chapter 13, we obtained a significant treatment effect (p = 0.001).


The testing procedure used in Chapter 13 is called the paired t-test since paired
data are analysed, and since the calculation of the p-value is based on the
t-distribution.
Introduction to Biostatistics

323

16.4

The paired versus unpaired t-test

What if the Captopril data were analysed using an unpaired t-test ?

Introduction to Biostatistics

324

Results from unpaired and paired t-tests, respectively:


. Unpaired:

. Paired:

Although both tests lead to a significant result, there is a serious difference in


p-values, showing that ignoring the paired nature of the data can lead to wrong
conclusions.
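As a sketch of this comparison (assuming SciPy; not part of the course material), both analyses can be applied to the Captopril table from Section 16.1:

```python
import numpy as np
from scipy import stats

before = np.array([130, 122, 124, 104, 112, 101, 121, 124, 115, 102, 98, 119, 106, 107, 100])
after  = np.array([125, 121, 121, 106, 101,  85,  98, 105, 103,  98, 90,  98, 110, 103,  82])

t_unpaired, p_unpaired = stats.ttest_ind(before, after, equal_var=True)  # ignores the pairing
t_paired, p_paired = stats.ttest_rel(before, after)                      # uses the pairing

print(f"unpaired t-test: p = {p_unpaired:.4f}")   # considerably larger
print(f"paired t-test:   p = {p_paired:.4f}")     # close to the p = 0.001 quoted in Chapter 13
```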

Introduction to Biostatistics

325

Conclusion:

15 × 2 measurements ≠ 30 × 1 measurement
In general, the analysis of an outcome, measured multiple times per subject
(repeated measures), requires different statistical procedures than when the
outcome is measured only once for each subject.

Introduction to Biostatistics

326

16.5

Example

Obviously, it is important to correctly account for the paired nature of the data
In practice, this requires knowledge about the design of the study and the way
data have been collected
As an example, suppose interest is in testing for differences in BMI between males
and females
Suppose that BMI measurements are available for 100 males and 100 females.
The unpaired t-test is the obvious choice for the analysis, provided all assumptions
are satisfied.
Suppose now that the 100 males and females are taken from 100 married couples,
would this change the preferred method for analysis ?
YES !
Introduction to Biostatistics

327

16.6

Assumptions

It has been shown in the Chapters 12 and 13 that the calculation of both the C.I.
and the p-value entirely depends on the sampling distribution of X̄, the sample
average of the differences in BP before and after treatment.
In large samples, this sampling distribution is normal (CLT)
In small samples, this normality is only valid in cases where the difference in BP is
(approximately) normally distributed.
Therefore, in case of small samples, one assumes the difference X to be normally
distributed.
Note that, in this context, the sample size refers to the number of pairs, not the
number of observations in the data set
Introduction to Biostatistics

328

Conclusion:

Large samples: no assumptions            Small samples: Normality for difference X

        Difference X                              Difference X
             |                                         |
           μ = 0 ?                                   μ = 0 ?

In our Captopril example, the sample size was small (n = 15). Hence the
histogram of the observed differences should be explored for any evidence against
symmetry

Introduction to Biostatistics

329

Histogram of observed differences:

Assessment of symmetry is again difficult due to the small sample size, but there
is no strong evidence for severe skewness.
Note that the normality assumption is with respect to the difference X, not the
original measurements.

Introduction to Biostatistics

330

In our example, the original BP measurements (before and after treatment) are
allowed to be skewed, as long as their differences are symmetrically distributed:
After treatment        Before treatment            Difference X
      |                      |                          |
      μ2                     μ1                       μ = 0 ?

Hence, it is useless to check symmetry of the original observations.

Introduction to Biostatistics

331

Note that, in case of skewness, it is often difficult and/or not helpful to transform
the observed differences xi:
. Since often negative differences are observed, several standard transformations
such as ln(·) or √· are not possible
. Even if a transformation such as, e.g., yi = ln(xi + 10) would yield symmetric
observations yi, it is not clear what null hypothesis should be tested.
. Obviously, one can no longer test whether the mean of Y is equal to zero.
In case of skewness, one therefore usually transforms the original data in such way
that the differences become symmetric. This has the advantage that:
. Simple, standard, transformations can often be used
. One can still test for mean zero.

Introduction to Biostatistics

332

For example, a potential transformation for the Captopril data would be:

BPbefore  →  ln(BPbefore)
BPafter   →  ln(BPafter)           X = ln(BPbefore) − ln(BPafter)

instead of:

BPbefore
BPafter                            X = BPbefore − BPafter        Y = ln(X + 5)

Introduction to Biostatistics

333

16.7

Example from the biomedical literature

Chen et al. [17], p. 76 and Tables 1 and 2:

Introduction to Biostatistics

334

Paired t-test to test for time trends (IAC versus AOD)

Introduction to Biostatistics

335

Unpaired t-test to test for group differences (SARS versus Control)

Introduction to Biostatistics

336

Chapter 17
The comparison of two proportions: Paired data

. Example
. Mc Nemar test
. Assumptions
. Remark
. Mc Nemar versus chi-squared
. Example from biomedical literature

Introduction to Biostatistics

337

17.1

Example

Consider the data on the prevalence of severe colds in 1319 children, measured at
the ages of 12 and 14.
The response of interest is whether the child had severe colds during the last 12
months

                            Severe colds at 14 yrs.
                            Yes        No
Severe colds    Yes         212        144        356
at 12 yrs.      No          256        707        963
                            468        851        1319

Introduction to Biostatistics

338

Research question:
Is the prevalence of severe colds different at the two ages ?
At age 12, 356/1319 = 27% of the children reported severe colds.
At age 14, this percentage equals 468/1319 = 35%
These data suggest that the prevalence of severe colds increases with age.
It would be of interest to know how likely the observed change in prevalence is to
occur by pure chance.
If this is very unlikely, the above data provide evidence that the prevalence indeed
changes with age. Otherwise, the above data do not provide evidence for such a
change.
Introduction to Biostatistics

339

Note that the data structure is similar to the one in the Captopril data, in the
sense that subjects are measured twice at different time points:

Hence, we have again paired data.


Introduction to Biostatistics

340

17.2

Mc Nemar test

Let π1 and π2 be the percentage of children in the total population with a severe
cold at the ages 12 and 14 respectively.
Interest is in testing whether π1 and π2 are equal, which would reflect no change
over time in the percentage of children with a severe cold.
The hypothesis of interest is

H0 : π1 = π2    versus    HA : π1 ≠ π2

Note that a change over time in the percentage of severe colds can only occur if
children change their status:
. No severe cold at 12yrs → severe cold at 14yrs

. Severe cold at 12yrs → no severe cold at 14yrs


Introduction to Biostatistics

341

Moreover, in order to have a change over time, more children should change in
one direction than in the other
Our test will therefore reject H0 if the number of changers in one direction is
much larger than the number of changers in the other direction.
In our example, we will reject H0 if |256 − 144| is too large
Question:
How large is too large ?
Answer:
If the observed difference |256 − 144|

is very unlikely to happen by pure chance


Introduction to Biostatistics

342

We therefore calculate the probability p of observing a similar experiment with a


difference between the numbers of changers at least equal to |256 − 144| = 112, if
there would be no change over time in the total population.
In our example, this probability equals p < 0.0001:


So, if severe colds would occur equally frequently at both ages, it would be very
unlikely to observe what has been observed in this particular experiment

We therefore conclude that our data provide evidence that the probability of
having a severe cold at the age of 12 is not the same as the probability of having a
severe cold at the age of 14.

Introduction to Biostatistics

343

Conclusion:
There is a significant difference (p < 0.0001) in the
occurrence of severe colds between the ages 12 and 14

The testing procedure needed for the comparison of proportions in paired data is
called the Mc Nemar test.
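A minimal sketch of this test (assuming statsmodels; not part of the course material), applied to the severe-colds table above:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Paired 2 x 2 table: rows = severe colds at 12 yrs (Yes, No),
#                     columns = severe colds at 14 yrs (Yes, No)
table = np.array([[212, 144],
                  [256, 707]])

# exact=False uses the large-sample chi-squared approximation of the Mc Nemar test
result = mcnemar(table, exact=False, correction=False)
print(f"statistic = {result.statistic:.2f}, p = {result.pvalue:.2e}")  # p far below 0.0001
```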

Introduction to Biostatistics

344

17.3

Assumptions

Similarly to the chi-squared test, the calculation of the p-value is based on the
assumption of a large sample
In case of small samples, the p-value can be calculated without approximations
based on CLT
The exact calculation is similar to the Fisher Exact test for unpaired data.
Many statistical packages only support the large-sample calculations.

Introduction to Biostatistics

345

17.4

Remark

As discussed before, the Mc Nemar test rejects H0 if the off-diagonal elements are
too different from each other, i.e., if there are many more changes in one direction
than in the other direction.
This implies that the testing procedure is independent of the observed diagonal
elements
Examples:

Table:                    20   40          200   40
                          20   50           20  500

McNemar: comparison:      60/130 vs. 40/130        240/760 vs. 220/760
         result:          p = 0.0142               p = 0.0142

Introduction to Biostatistics

346

17.5

Mc Nemar versus chi-squared

There seems to be a lot of confusion about when Mc Nemar test and when
chi-squared test should be used.
As an example, consider the results from a survey in which 75 people were
questioned about their intended vote in the US presidential elections, before and
after a debate on the national television:

                         After TV debate
Before                   Reagan     Carter
TV debate    Reagan      27         7          34
             Carter      13         28         41
                         40         35         75

Introduction to Biostatistics

347

Depending on the research question, this table can be analysed in two different
ways:
. Chi-squared: test for relation between vote before and after debate
. Mc Nemar: test for equal proportion Reagan voters before and after debate
Hence, even when data are paired, the chi-squared test can be used
Note that, in case of continuous data, there is no such choice:
. Unpaired data  ⟹  Unpaired t-test
. Paired data    ⟹  Paired t-test

Introduction to Biostatistics

348

17.5.1

Mc Nemar test
                         After TV debate
Before                   Reagan     Carter
TV debate    Reagan      27         7          34
             Carter      13         28         41
                         40         35         75

Research question:
Is the proportion Reagan voters the same
before and after the debate ?

The observed proportions are 34/75 = 45.3% and 40/75 = 53.3%

Introduction to Biostatistics

349

The p-value obtained from the Mc Nemar test equals p = 0.2636:

Hence the observed difference of 45.3% versus 53.3% would happen in 26.36% of
the cases, even if the percentage of voters for Reagan is the same before and after
the debate.
Conclusion:
The debate has not significantly changed the voting
behaviour (p = 0.2636).

Introduction to Biostatistics

350

17.5.2

Chi-squared test
                         After TV debate
Before                   Reagan     Carter
TV debate    Reagan      27         7          34
             Carter      13         28         41
                         40         35         75

Research question:
Is there a relation between voting behaviour before and
after the debate ?
Or equivalently:
Is the proportion of Reagan voters after the debate the same
amongst those who were in favour of Reagan before the debate as
amongst those who were in favour of Carter before the debate ?
Introduction to Biostatistics

351

The observed proportions are 27/34 = 79.4% and 13/41 = 31.7%


Note that this comes down to comparing the proportion of Reagan voters after
the debate, between two separate groups: Those who were in favour of Reagan
before the debate, and those who were not in favour of Reagan before the debate.
Hence, we now compare unpaired proportions.
The p-value obtained from the Chi-squared test equals p < 0.0001:

The observed difference of 79.4% versus 31.7% is very unlikely to happen if there
would be no relation between the voting behaviour before and after the debate.

Introduction to Biostatistics

352

Conclusion:
There is a significant relation between the voting behaviour
before and after the debate (p < 0.0001).
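Both analyses can be reproduced in a few lines (assuming SciPy and statsmodels; shown as an illustrative sketch only):

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.contingency_tables import mcnemar

# rows = vote before the debate (Reagan, Carter), columns = vote after (Reagan, Carter)
table = np.array([[27,  7],
                  [13, 28]])

mc = mcnemar(table, exact=True)                                  # equal proportion of Reagan voters before/after
chi2, p_chi2, _, _ = chi2_contingency(table, correction=False)   # relation between vote before and after

print(f"Mc Nemar (exact)  p = {mc.pvalue:.4f}")   # should be close to 0.2636
print(f"chi-squared       p = {p_chi2:.2e}")      # far below 0.0001
```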

Introduction to Biostatistics

353

17.5.3

General conclusion

The survey results can be analysed in two different ways, leading to two different
conclusions:
. Mc Nemar: There is no evidence that a TV debate would change the results
of an election (p = 0.2636)
. Chi-squared: There is a strong relation between voting behaviour before and
after the debate (p < 0.0001).
Note that the proportion of Reagan voters before and after a TV debate could
also be compared based on unpaired data.
One then would question 75 people before the debate, and one would question 75
other people after the debate.

Introduction to Biostatistics

354

The resulting 2 × 2 table would then contain 150 subjects:

                    Preference
TV debate           Reagan     Carter
Before              34         41         75
After               40         35         75
                    74         76         150

The chi-squared test would compare the observed proportions 34/75 = 45.3% and
40/75 = 53.3%, which are the same ones as those compared before with the
Mc Nemar test for the experiment with paired observations

Introduction to Biostatistics

355

17.5.4

Some further examples

There is no relation between (non-)significance of the chi-squared test and


(non-)significance of the Mc Nemar test
Examples:

Table:                   25  25     10  40     40  10      5  45
                         25  25     10  40     10  40     20  30

χ²:      comparison:     25/50 vs. 25/50   10/50 vs. 10/50   40/50 vs. 10/50   5/50 vs. 20/50
         result:         p = 1.0000        p = 1.0000        p < 0.0001        p = 0.0291

McNemar: comparison:     50/100 vs. 50/100  50/100 vs. 20/100  50/100 vs. 50/100  50/100 vs. 25/100
         result:         p = 1.0000         p < 0.0001         p = 1.0000         p = 0.0098

Introduction to Biostatistics

356

17.6

Example from biomedical literature

De Clercq et al. [18], Abstract:

Mc Nemar test to compare the presence of symptoms before and after surgery.

Introduction to Biostatistics

357

Part VI
Further topics on statistical inference

Introduction to Biostatistics

358

Chapter 18
Errors in statistics: Basic concepts

. Introduction
. Two types of errors
. Power
. Sample size calculation
. Examples
. Remarks
. Example from the biomedical literature

Introduction to Biostatistics

359

18.1

Introduction

Re-consider the example on the weight gain in rats, where interest is in the
comparison between rats fed on a high or low protein diet
Group-specific histograms:

Introduction to Biostatistics

360

Group-specific summary statistics:

On average, there is an observed difference of 19g between the rats on a high


protein diet and those on a low protein diet.
Based on the unpaired t-test, we obtained before that this observed difference is
not sufficient evidence to believe that the weight gain is really different for the two
diets (p = 0.0757)

Introduction to Biostatistics

361

Conclusion:
There is no significant difference (p = 0.0757) in weight gain
between rats on a high protein level diet,
and rats on a low protein level diet

As indicated before, the result of a statistical test should be interpreted as


evidence in favour or against the null hypothesis, and should not be interpreted as
formal proof.
In our example, the difference in weight gain between a population treated with
one diet and a population treated with the other diet is too small to be detected
based on 12 and 7 animals, respectively.

Introduction to Biostatistics

362

Alternatively, if the t-test would have led to p = 0.001, this would still not
formally prove that there is a difference between both populations.
After all, p = 0.001 would only indicate that the observed difference of 19g occurs
only once every 1000 times, even if there is no difference at all between both
populations.
Maybe, our sample was indeed the extreme one that happens once every thousand
experiments.
Hence, whenever statistical tests are used, one has to be aware that errors in the
conclusions can occur.
It is therefore important to quantify the errors, and to keep them under
control

Introduction to Biostatistics

363

18.2

Two types of errors


Reality
Test result

Accept H0

H0 correct

H0 not correct

No error

Type II error

Reject H0 Type I error

No error

Type I error: H0 is incorrectly rejected


Type II error: H0 is incorrectly accepted

Introduction to Biostatistics

364

18.3

Type I error

A type I error occurs if H0 is correct but the test leads to a significant result.
Question:
How likely is such an error to occur ?

Suppose the test is performed at the α = 5% level of significance


If H0 is correct, then one will observe a significant result in 5% of the cases
Hence, in 5% of the cases, H0 would be incorrectly rejected
Introduction to Biostatistics

365

The probability of making a type I error is therefore equal to the chosen level of
significance.
In practice, the probability of making a type I error is kept under control by
choosing α sufficiently small
In biomedical sciences α = 5% is often used, hereby allowing a type I
error in 5% of the cases.

                      Reality
Test result           H0 correct       H0 not correct

Accept H0             1 − α

Reject H0             α

If H0 is correct, then the probability of making a type I error is α, while the
probability of correctly accepting H0 is 1 − α.
Introduction to Biostatistics

366

18.4

Type II error

A type II error occurs if H0 is incorrect but the test has not detected this, i.e., a
non-significant result is obtained
Question:
How likely is such an error to occur ?

In contrast to the type I error, the probability of making a type II error is not easily
controlled, and depends on various aspects of the sample(s) and population(s)

Introduction to Biostatistics

367

In analogy to the type I error, the type II error rate is denoted by β

                      Reality
Test result           H0 correct       H0 not correct

Accept H0                              β

Reject H0                              1 − β

The power of a statistical test is 1 − β, the probability of correctly rejecting H0

Introduction to Biostatistics

368

18.5

Power

In general, a specific testing procedure is acceptable, only if:

. the chance of making a type I error is sufficiently small


. the power to detect deviations from H0 is sufficiently large

The first condition can be met by specifying α sufficiently small.


The second condition is more difficult to meet, as the power depends on various
aspects of the sample(s) and population(s)
This will be illustrated in the context of the comparison of two groups (such as
the weight gain experiment)

Introduction to Biostatistics

369

As before, let μ1 and μ2 represent the average weight gain in the total population,
under high and low protein diets, respectively.
The null and alternative hypotheses are given by

H0 : μ1 = μ2    versus    HA : μ1 ≠ μ2

The power is the probability of correctly rejecting H0.


In that case, μ1 ≠ μ2, and we denote the true difference between both
populations by Δ = μ1 − μ2
The unpaired t-test assumes the data to be normally distributed in both
populations, with equal variability σ²

Introduction to Biostatistics

370

Graphically:

Low protein                        High protein
(normal, variance σ²)              (normal, variance σ²)
         |                                 |
         μ2                                μ1

Introduction to Biostatistics

371

18.5.1

Power as a function of α

The smaller α, the smaller the power

Intuitively: Type I errors are less likely if the null hypothesis is rejected less
often. However, in cases where H0 is truly wrong, it will then also be rejected less often.
An extreme case is obtained for α = 0:

. α = 0 implies that the null hypothesis is always accepted


. So, in case the null hypothesis is wrong, it is still accepted, leading to power 0

Introduction to Biostatistics

372

18.5.2

Power as a function of the true difference Δ

The smaller Δ, the smaller the power

Intuitively: Large deviations from the null hypothesis are easier to detect

Low protein       High protein             Low protein   High protein
     |                  |                          |   |
     μ2                 μ1                         μ2  μ1
  (large Δ)                                     (small Δ)

Introduction to Biostatistics

373

18.5.3

Power as a function of variability σ²

The smaller σ², the larger the power

Intuitively: Homogeneous groups are easier discriminated than heterogeneous


groups

Low protein       High protein             Low protein       High protein
(large σ²)        (large σ²)               (small σ²)        (small σ²)
     |                  |                       |                  |
     μ2                 μ1                      μ2                 μ1

Introduction to Biostatistics

374

18.5.4

Power as a function of sample size(s)

The more observations, the larger the power

Intuitively: More observations yields more information about the population(s),


therefore implying more precision in the conclusions

Introduction to Biostatistics

375

18.5.5

Conclusion

The power depends on various aspects:


. Level of significance α

. True difference Δ between the populations


. Within-group variance σ²
. Sample size(s)
Note that the sample size is the only aspect under control of the investigator.
In practice, one can calculate the sample size needed to reach a sufficiently high
power.

Introduction to Biostatistics

376

18.6

Sample size calculation

As indicated before, a testing procedure is only acceptable if it has sufficient


power, i.e., if the probability of making a type II error is sufficiently small.
Since the sample size is the only aspect influencing the power, which is under
control of the investigator, it is important that experiments are sufficiently large in
order for the power to be sufficiently large as well
The level of significance α is chosen such that the probability of making a type I
error is sufficiently small
The within-group variance σ² is pre-specified based on earlier, similar experiments,
relevant literature, or a pilot study

Introduction to Biostatistics

377

To be on the safe side, usually an upper bound for σ² is used: In case the


variability would be smaller, the power would be higher, hence still sufficiently high
In practice, Δ is not known. Instead, the smallest Δ which would still be clinically
relevant to detect, is specified.
If sufficient power is attained for the smallest meaningful Δ, we have that:
. Any larger difference will be detected with even larger power

. We are not concerned about small powers for detecting smaller differences, as
such differences are not relevant anyway.
One can then calculate the number(s) of observations needed to reach a desired
level of power.

Introduction to Biostatistics

378

18.7

Example: Weight gain data

In the weight gain data, the observed difference of 19g was found not to be
significant (p = 0.0757)
We can calculate the power that a real difference of 19g would be found
significant if a new experiment were to be conducted, again with 12 and 7
observations in the high and low protein diet groups, respectively.
Group-specific summary statistics, from the current experiment:

Introduction to Biostatistics

379

Power calculations will be based on σ = 21, and α = 0.05


The power to detect a difference of 19g equals 43.45%
Hence, with 12 and 7 observations respectively, there is only 43.45% chance that
a true difference of 19g would be detected.
If a difference of 19g is considered clinically relevant, then the weight gain
experiment was clearly too small, since it is very likely that such a difference would
remain undetected.
We can also calculate the power for other values of Δ

Introduction to Biostatistics

380

Summary:

Δ         Power to detect a difference equal to Δ

0g        5.00%

10g       15.70%

19g       43.45%

30g       80.80%

40g       96.49%

Δ : equal to μ1 − μ2

For example, 12 and 7 observations would be sufficient to show a true difference


of 40g with more than 96% chance.
Alternatively, one can also calculate how large the samples should be to detect a
difference of, e.g., 20g with sufficiently high power.
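Both calculations can be sketched with statsmodels (not part of the course material); σ = 21 and Δ are treated as given:

```python
from statsmodels.stats.power import TTestIndPower

sigma = 21.0                     # assumed common within-group standard deviation
analysis = TTestIndPower()

# Power to detect a true difference of 19 g with 12 and 7 rats (alpha = 0.05)
power_19 = analysis.power(effect_size=19 / sigma, nobs1=12, ratio=7 / 12, alpha=0.05)
print(f"power for a difference of 19 g: {power_19:.3f}")   # should be close to the 43.45% quoted

# Sample size per group needed for 90% power to detect a difference of 20 g
n_per_group = analysis.solve_power(effect_size=20 / sigma, power=0.90, ratio=1.0, alpha=0.05)
print(f"n per group for 90% power at 20 g: {n_per_group:.1f}")  # close to the 25 quoted above
```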
Introduction to Biostatistics

381

Introduction to Biostatistics

382

If a power of 90% is required to detect true effects as small as Δ = 20g, at least


25 observations are needed in each group.
With 30 observations in each group, the probability of making a type II error,
when the true effect is not smaller than 20g, is approximately 5%.

Introduction to Biostatistics

383

18.8

Example: Sickness absence

We re-consider the data on sickness absence, collected on 585 employees with a


similar job:

Sickness absence
Gender      No      Yes
female      245     184     429
male        98      58      156
            343     242     585

The observed difference between the absence rate 42.9% in females and 37.2% in
males was found not significant (chi-squared test, p = 0.215).

Introduction to Biostatistics

384

In case the percentages of sickness absence would be 42% in the total female
population, and 37% in the total male population, and in case a random sample of
429 females and 156 males would be taken, there would be 19.01% chance to
reach a significant effect.
So, if the population proportions are indeed 42% and 37%, an experiment with
429 and 156 employees would detect this difference only 19 times out of 100 experiments.
If a difference of 5% is considered clinically relevant, then the current experiment
was clearly too small, since it is very likely that such a difference would remain
undetected.
We can calculate how large the samples should be in order to detect a difference
between 42% and 37%, with sufficiently high power

Introduction to Biostatistics

385

Introduction to Biostatistics

386

For example, two samples of approximately 2500 observations are needed in order
to show a difference between 37% and 42%, with 95% probability
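A sketch of these two calculations, assuming statsmodels (not part of the course material):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

h = proportion_effectsize(0.42, 0.37)   # standardized difference between 42% and 37%
analysis = NormalIndPower()

# Power with 429 females and 156 males
power = analysis.power(effect_size=h, nobs1=429, ratio=156 / 429, alpha=0.05)
print(f"power: {power:.3f}")            # should be close to the 19.01% quoted

# Sample size per group needed for 95% power
n = analysis.solve_power(effect_size=h, power=0.95, ratio=1.0, alpha=0.05)
print(f"n per group: {n:.0f}")          # roughly 2500 observations per group
```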

Introduction to Biostatistics

387

18.9

Remarks

The earlier examples of power and/or sample size calculations were in the context
of the unpaired t-test and chi-squared test.
Similar calculations can be done in any other statistical testing situation, e.g.,
Fisher Exact test, paired t-test, McNemar test, . . .
Strictly speaking, all experiments should be preceded by a realistic sample size
calculation to avoid experiments with unacceptably high type II error rates, i.e.,
with almost no chance at all to show clinically meaningful effects.

Introduction to Biostatistics

388

18.10

Example from the biomedical literature

Wong et al. [10]


Methodology section, p.658:

Introduction to Biostatistics

389

Table 2 with results:

Discussion, p.664:

Introduction to Biostatistics

390

The difference on which the sample size calculation was based was much larger
than what actually was observed in the experiment
Therefore, the power to reject equality of the groups was (much) lower than the
expected 80%
The current study cannot tell the difference between a 9% increase and a 3%
decrease.
If such differences are considered clinically important, then the current study was
under-powered, due to the fact that the difference was overestimated at the time
of the sample size calculation.

Introduction to Biostatistics

391

Chapter 19
Errors in statistics: Practical implications

. Multiple testing
. Bonferroni correction
. Tests for baseline differences
. Equivalence tests
. Significance versus relevance
. Examples from biomedical literature

Introduction to Biostatistics

392

19.1

Multiple testing

Each time a test is performed, there is a probability α of making a type I error


For example, if α = 0.05, we can expect to incorrectly reject the null hypothesis in
5 out of 100 times.
Implication:
The more tests one performs, the higher the probability
that something is detected by pure chance
This problem of multiple testing occurs very frequently in bio-medical sciences,
in various settings

Introduction to Biostatistics

393

19.1.1

Example: A classroom experiment

On entry in the classroom, assign each student at random to be seated at the left
or at the right side of the classroom
Compare both sides with respect to 100 aspects including weight, height, age,
gender, color of hair, color of eyes,. . .
It is to be expected that for at least 5 of these outcomes, a significant difference is
obtained at the 5% level of significance, by pure chance.

Introduction to Biostatistics

394

19.1.2

Example: Testing many relations

Amin et al. [19], Table 2:

. 18 tests performed
. only 2 significant results

Introduction to Biostatistics

395

19.1.3

Example: Subgroup analyses

Kaplan et al. [20], Table 5:

. Tests based on C.I.s for odds ratios


. C.I. containing 1 is equivalent to a
non-significant test result
. 21 3 = 63 tests performed
. only 5 significant results

Introduction to Biostatistics

396

19.1.4

Example: Searching for the most significant results

This scientific finding was printed in the Belgian newspapers:

It was even stated that those who wake up before 7.21am have a statistically
significant higher stress level during the day than those who wake up after 7.21am.

Introduction to Biostatistics

397

19.1.5

Conclusion

Significant results obtained by multiple testing are often overinterpreted


If the number of tests is reported, the reader knows that such results need to be
interpreted with extreme care
The problem arises when only the significant results are reported, and one does
not know how many tests were performed in total
This leads to reporting results which turn out to be not reproducible
For example, a new study would not find that students seated on the left are taller
than those on the right. Instead, students seated on the left may weigh more than
those seated on the right.

Introduction to Biostatistics

398

For example, a new experiment might show no difference in stress levels between
subjects waking up early and those waking up late. Or maybe a difference would
be found only when waking up is later than 8.12am.

Introduction to Biostatistics

399

19.2

Bonferroni correction

Suppose two tests are performed, both at the 5% level of significance.


The probability that at least one type I error will be made can be shown not to
exceed 2 × 0.05 = 0.10:

P(at least 1 type I error) ≤ 2 × 5% = 10%

In general, if k tests are performed, all at the 5% level of significance, the
probability of making at least one type I error can only be shown not to exceed
k × 5%
Obviously, controlling the overall type I error rate can be done by performing each
separate test at the α/k level of significance.

Introduction to Biostatistics

400

For example, performing 2 tests at the 2.5% level of significance each implies that
the probability of making at least one type I error will not exceed 5%.
In general, when k tests are performed at the α/k level of significance, one is sure
that the overall probability of making at least one type I error will not exceed α.
When confidence intervals are used instead of p-values, the confidence levels can
be corrected in a similar way

Introduction to Biostatistics

401

Some examples:

Number of tests    Significance level    Confidence level

1                  0.05                  95%

2                  0.025                 97.5%

5                  0.01                  99%

k                  0.05/k                (1 − 0.05/k) × 100%

For example, if CI1 , CI2 , . . . , CI5 are 5 intervals with 99% confidence, for 5
unknown parameters θ1 , θ2 , . . . , θ5, then there is at least 95% probability that all
5 C.I.s will contain all 5 unknown parameters:

P(CI1 contains θ1  and  . . .  and  CI5 contains θ5) ≥ 95%

402

Note that, strictly speaking, the Bonferroni correction is an overcorrection, since


the overall type I error rate can only be shown not to exceed 5%, and usually
will be smaller than the required 5%.
In some specific testing situations (e.g., ANOVA analysis), more accurate
corrections are available.
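As a small sketch (assuming statsmodels, and using hypothetical p-values for illustration only), the Bonferroni correction simply multiplies each p-value by the number of tests:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from k = 5 separate tests
pvals = [0.012, 0.030, 0.45, 0.001, 0.20]

reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method='bonferroni')
print(p_adjusted)   # each p-value multiplied by k (capped at 1)
print(reject)       # which tests remain significant after the correction
```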

Introduction to Biostatistics

403

19.3

Examples from the biomedical literature

Baba et al. [21], p.1202 and p.1203:

Introduction to Biostatistics

404

Kellett et al. [12], Table 2 (for example):

Introduction to Biostatistics

405

In the discussion, R.Roy writes:

Note that the reader cannot perform the Bonferroni correction as the exact
p-values have not been reported.
Introduction to Biostatistics

406

19.4

Tests for baseline differences

In order to show causal effects, patients are often randomized into 2 or more
groups
This ensures (at least in large studies) that all treatment groups are identical,
except for the treatment the patients receive
In (relatively) small studies, imbalances can still occur by pure chance
Therefore, one often compares the various groups with respect to important
factors which are believed to be strongly related to the outcome of interest.
This is called testing for baseline differences, as one compares the
characteristics of the patients at the start of the study.

Introduction to Biostatistics

407

As an example, suppose interest is to compare two oral treatments, A and B, for


the treatment of hypertension.
Suppose the change in diastolic BP is the outcome of interest
Age is one of the factors believed to be strongly related to BP. Therefore, it is
important that both treatment groups have the same age distribution
Therefore, one often tests for age differences between A and B, e.g., based on the
two-sample t-test.
The hypothesis tested is

H0 : μA = μB    versus    HA : μA ≠ μB

Note that H0 and HA express properties of the populations, not the samples
Introduction to Biostatistics

408

In the populations (infinitely large), we know that, due to the randomization, A


and B are identical
Conclusion:
It makes no sense at all to perform baseline tests
in randomized studies
No matter how small the resulting p-value would be (e.g., < 10⁻⁸) we know that
the observed difference in age between groups A and B has occurred purely by
chance.
A meaningful alternative is to calculate a C.I. of the average age difference
between both groups, to ensure that the observed difference is sufficiently small to
conclude that it cannot (completely) explain the observed differences in the
outcome of interest.
Introduction to Biostatistics

409

In our example suppose that a 95% confidence interval for the average difference
in age (years) is given by [0.1; 0.3], then we believe that this difference would be
too small to explain why patients in group A show more decrease in BP than
patients in group B.
Note also that testing for baseline differences cannot be used to check whether
the randomization was done properly.

Introduction to Biostatistics

410

19.5

Example from the biomedical literature

Nissen et al. [15], abstract and table 1:

A two-arm randomized study

Introduction to Biostatistics

411

formal tests at baseline

Introduction to Biostatistics

412

19.6

Equivalence tests

Suppose two groups A and B are to be compared, and a two-sample t-test is used
to test

H0 : μA = μB    versus    HA : μA ≠ μB
In case of a non-significant test result, one often concludes that both groups are
identical or equivalent
An alternative interpretation is that the experiment did not have sufficient power
to show an effect which is present.
Conclusion:
Non-significance should not be interpreted as equivalence

Introduction to Biostatistics

413

This can also be seen from the fact that, if the two-sample t-test could be used to
show equivalence, it would be best to collect data on (extremely) small samples,
as this would increase the chance to obtain an non-significant result, due to lack
of power.
Instead, one should reverse H0 and HA:

H0 : |μA − μB| > Δ    versus    HA : |μA − μB| ≤ Δ

where Δ is a pre-specified constant, defining equivalence


Note that HA is equivalent to −Δ ≤ μA − μB ≤ Δ
Hence, in order to reject H0, one needs to show evidence that μA and μB are less
than Δ away from each other
One way to proceed is to construct a C.I. for μA − μB and to check whether it is
entirely within the interval [−Δ; Δ].
Introduction to Biostatistics

414

Graphically, H0 would be rejected if the 95% C.I. for μA − μB lies entirely within [−Δ; Δ]:

     −Δ        [====== 95% C.I. around μ̂A − μ̂B ======]        Δ
                                  |
                                  0

Graphically, H0 would not be rejected if the 95% C.I. extends beyond −Δ or Δ:

 [====== 95% C.I. around μ̂A − μ̂B ======]
     −Δ                           |                            Δ
                                  0

Introduction to Biostatistics

415

Obviously, the result of the equivalence test entirely depends on the choice of Δ
Therefore, Δ needs to be specified prior to the data collection
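A minimal sketch of this C.I.-based rule (assuming NumPy/SciPy, and using hypothetical data and a hypothetical margin Δ = 2, since no real example is analysed here):

```python
import numpy as np
from scipy import stats

def equivalence_ci(x_a, x_b, delta, conf=0.95):
    """Declare equivalence if the C.I. for mu_A - mu_B lies entirely within [-delta, delta]."""
    n_a, n_b = len(x_a), len(x_b)
    diff = np.mean(x_a) - np.mean(x_b)
    # pooled variance, as in the two-sample t-test with equal variances
    sp2 = ((n_a - 1) * np.var(x_a, ddof=1) + (n_b - 1) * np.var(x_b, ddof=1)) / (n_a + n_b - 2)
    se = np.sqrt(sp2 * (1 / n_a + 1 / n_b))
    t_crit = stats.t.ppf(0.5 + conf / 2, df=n_a + n_b - 2)
    lower, upper = diff - t_crit * se, diff + t_crit * se
    return (lower >= -delta) and (upper <= delta), (lower, upper)

# Hypothetical data for groups A and B, with a pre-specified margin delta = 2
rng = np.random.default_rng(2024)
group_a = rng.normal(loc=10.0, scale=2.0, size=40)
group_b = rng.normal(loc=10.3, scale=2.0, size=40)
print(equivalence_ci(group_a, group_b, delta=2.0))
```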

Introduction to Biostatistics

416

19.7

Example from the biomedical literature

Shatari et al. [22]:


. Title:

Introduction to Biostatistics

417

. Table 1:

No significant
differences !

Introduction to Biostatistics

418

. Results and conclusions (abstract):

Introduction to Biostatistics

419

Sripalakit et al. [23], abstract, Table 3, and p.1038


. Title:

Introduction to Biostatistics

420

. Study design:

Aim: equivalence of 2 treatments


Cross-over: all subjects receive both treatments
Washout period of 1 week between both treatments
Treatments given in random order
Introduction to Biostatistics

421

. Definition of equivalence:

Paired data, with skewed distribution for differences


Log transformation of original outcomes: ln(Yi) − ln(Xi) = ln(Yi/Xi)
Equivalence defined as: Δ = 0.22  ⟹  [−Δ; +Δ] = [−0.22; +0.22]
Back-transformed: [exp(−0.22); exp(+0.22)] = [0.80; 1.25]

Introduction to Biostatistics

422

. Table 3 with results, and conclusion (abstract):

Introduction to Biostatistics

423

19.8

Significance versus relevance

We discussed before that the power to detect some effect increases with the
sample size
This implies that any effect, no matter how small, will, sooner or later, be
detected, if the sample is sufficiently large.
For example, consider the Captopril data, where the observed difference of 9.27
mmHg was found significantly different from zero (p < 0.001), based on data
from 15 patients only:

Introduction to Biostatistics

424

The 99% confidence interval for the average change in BP was found to be
[3.02; 15.52].
Suppose that the observed difference would have been 0.1 mmHg.
A p-value as small as 0.001 would be likely to be obtained, provided that the
sample would be sufficiently large.
Obviously, an average change in BP as small as 0.1 mmHg is not relevant from a
clinical point of view.
Conclusion:
Statistical significance  ≠  Clinical relevance

Introduction to Biostatistics

425

A highly significant effect can be a large effect:

p = 0.0001        (wide 95% C.I., located far away from 0)

A highly significant effect can also be a very small effect, but estimated with high
precision, due to a large sample size:

p = 0.0001        (narrow 95% C.I., located close to 0 but not containing it)

Introduction to Biostatistics

426

The p-value cannot distinguish between both situations


It is therefore important not to blindly overinterpret significant results without
knowing the size of the effect
This is another reason why confidence intervals are to be preferred over
significance testing

Introduction to Biostatistics

427

Chapter 20
One-sided versus two-sided tests

. Introduction
. One-sided tests
. Example
. Example from the biomedical literature

Introduction to Biostatistics

428

20.1

Introduction

Re-consider the Captopril data, where the observed difference of μ̂ = 9.27 mmHg
was found significantly different from zero (p < 0.001):

The hypothesis tested is


H0 : μ = 0    versus    HA : μ ≠ 0

This hypothesis is two-sided since it is not pre-specified whether, in case H0 is


rejected, μ is larger or smaller than 0
Introduction to Biostatistics

429

This implies that an observed difference much larger or much smaller than 0
provides evidence against H0
This is also reflected in the calculation of the p-value:
p is the probability of observing an average difference at
least as far away from 0 as 9.27, if μ = 0.

This is equivalent to
p is the probability of observing an average difference larger
than 9.27 or smaller than −9.27, if μ = 0.

Introduction to Biostatistics

430

Graphically:

Sampling distribution of X̄ under H0

    p/2                                             p/2
     |                     |                         |
  −9.27                    0                       9.27

Introduction to Biostatistics

431

20.2

One-sided tests

Sometimes it is of interest to test one-sided hypotheses, e.g.,


H0 : μ ≤ 0    versus    HA : μ > 0

Obviously, observed differences smaller than 0 do not provide any evidence


against H0.
Only differences larger than 0 can be used as evidence in the data against H0
This has implications for the calculation of the p-value:
p is the probability of observing an average difference at
least as large as 9.27, if μ = 0.

Introduction to Biostatistics

432

Graphically:

Sampling distribution of X̄ under μ = 0

                                                    p
     |                     |                        |
  −9.27                    0                      9.27

Introduction to Biostatistics

433

Note that the above distribution is the sampling distribution of X̄ assuming


μ = 0.
Intuitively: If the data provide evidence to reject μ = 0, then also to reject μ ≤ 0
Note that the p-value is now only half the p-value one would obtain when testing
the two-sided hypothesis
As a result, significance is reached more often.
It is therefore tempting to search for arguments justifying one-sided testing rather
than the classical two-sided testing.
Often, this is done after the data have been collected, and after having seen the
direction of the observed effect (positive or negative).

Introduction to Biostatistics

434

However, the study objectives should never be influenced by the data that are
observed.
One-sided testing is justified only if
. it is known that an effect, if any, can only be
in one direction
. only one direction is of scientific interest
. the decision is made prior to the data collection

Introduction to Biostatistics

435

20.3

Example

In the context of the Captopril data, suppose that one is only interested in
treatments which yield an average decrease of at least 5 mmHg in diastolic BP.
This would lead to testing
H0 : μ ≤ 5    versus    HA : μ > 5

Note that only differences larger than 5 can be used as evidence against H0
The p-value is calculated as:
p is the probability of observing an average difference at
least as large as 9.27, if μ = 5.

Introduction to Biostatistics

436

Graphically:

Sampling distribution of X̄ under μ = 5

                                                    p
                           |                        |
                           5                      9.27

Introduction to Biostatistics

437

This p-value is now given by p = 0.038


Conclusion:
The average treatment effect is significantly larger than
5 mmHg (p = 0.038).
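A sketch of this one-sided test (assuming SciPy 1.6 or later for the `alternative` argument), based on the Captopril differences from Chapter 16:

```python
import numpy as np
from scipy import stats

before = np.array([130, 122, 124, 104, 112, 101, 121, 124, 115, 102, 98, 119, 106, 107, 100])
after  = np.array([125, 121, 121, 106, 101,  85,  98, 105, 103,  98, 90,  98, 110, 103,  82])
diff = before - after

# One-sided test of H0: mu <= 5 versus HA: mu > 5
t, p = stats.ttest_1samp(diff, popmean=5, alternative='greater')
print(f"t = {t:.2f}, one-sided p = {p:.3f}")   # should be close to the p = 0.038 quoted
```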

Introduction to Biostatistics

438

20.4

Example from the biomedical literature

Hutchins et al. [24]


Description of methods, p.8315:

Authors in favour of one-sided tests


Journal required two-sided results

Introduction to Biostatistics

439

Results, p.8316:

Results (abstract):

Introduction to Biostatistics

440

Chapter 21
Describing associations

. Introduction
. Pearson correlation
. Relative risk
. Odds ratio
. Examples from biomedical literature

Introduction to Biostatistics

441

21.1

Introduction

All test procedures discussed so far aim at expressing to what extent an observed
relation between two variables can be ascribed to pure chance:
. Unpaired t-test: The relation between a continuous response Y (e.g., weight
gain) and a dichotomous variable X (e.g., protein level) which defines the
groups to be compared.
. Chi-squared test: The relation between a dichotomous response Y (e.g.,
sickness absence) and a dichotomous variable X (e.g., gender) which defines
the groups to be compared.
As discussed before, p-values do not express the size of a relation: A highly
significant effect does not necessarily mean that the effect is clinically relevant, i.e.,
the association between the variables is not necessarily very strong.

Introduction to Biostatistics

442

A number of association measures are available to describe the strength of


association between two variables.
Association measures frequently used in the biomedical literature are:
. the correlation coefficient
. the relative risk
. the odds ratio

Introduction to Biostatistics

443

21.2

Pearson correlation

As an example, we re-consider the surgery data, in which the relation is studied


between the time needed, after surgery, for the BP to recover to a normal level,
and its relation to the BP during the surgery, and the dose of the drug needed to
keep the BP sufficiently low during the surgery.
Data on 53 patients, with 3 types of operation
Available measurements:

. Time (min.) before the patients systolic BP returns to 100 mmHg

. The 10-base log(dose) of the drug in log(mg)


. The average systolic BP while the drug was being administered

Introduction to Biostatistics

444

Introduction to Biostatistics

445

Let us focus on describing the association between the recovery time and the
log(dose), irrespective of the type of operation.
For each patient, we have two measurements:
. The log(dose): xi for the ith patient

. The recovery time: yi for the ith patient


Our data are couples (xi, yi), which can be graphically explored using a
scatterplot.
The scatterplot suggests a positive relation between X and Y
Note that such a relation is an average relation, not a relation at the patient level
Also, the relation is not expected to be very strong: Knowing the dose, one
cannot predict the recovery time very precisely.
Introduction to Biostatistics

446

The Pearson correlation is a quantitative measure for the strength of


association between two variables X and Y , and is defined as:

r = Σi (xi − x̄)(yi − ȳ) / √[ Σi (xi − x̄)² · Σi (yi − ȳ)² ]

where x̄ and ȳ are the sample averages of the observed x-values and y-values,
respectively:

x̄ = (1/n) Σi xi ,        ȳ = (1/n) Σi yi

Insight in the above expression can best be obtained graphically.

Introduction to Biostatistics

447

r = Σi (xi − x̄)(yi − ȳ) / √[ Σi (xi − x̄)² · Σi (yi − ȳ)² ]

(scatterplot of yi versus xi, divided at (x̄, ȳ) into four quadrants; the sign of
(xi − x̄)(yi − ȳ) in each quadrant is:)

    (−,+)   (+,+)
    (−,−)   (+,−)

Introduction to Biostatistics
448

The Pearson correlation coefficient measures to what extent there is a linear


relation between X and Y , and has the following properties:
. −1 ≤ r ≤ 1
. r < 0 : negative linear trend between the xi and the yi
. r > 0 : positive linear trend between the xi and the yi
. r = −1 : the data points (xi, yi) are located on a decreasing straight line
. r = +1 : the data points (xi, yi) are located on an increasing straight line
. r = 0 : there is no LINEAR trend between the xi and the yi

Introduction to Biostatistics

449

Introduction to Biostatistics

450

Note that the correlation r is computed from the observed values (xi, yi), and
only describes the association that has been observed in the sample.
However, this sample correlation r can be considered an estimate for the
population correlation ρ, i.e., the correlation that would be obtained if the
total (infinite) population would be studied.
Usually it is of interest to use the observed sample to test whether ρ can be
considered different from zero
Formally, the following hypothesis is to be tested:

H0 : ρ = 0    versus    HA : ρ ≠ 0

The test procedure assumes X and Y to be jointly normally distributed.


Alternatively to testing a hypothesis about ρ, C.I.s can be computed for ρ as well
Introduction to Biostatistics

451

POPULATION                                RANDOM SAMPLE

H0 : ρ = 0                                Scatterplot of (xi, yi)
HA : ρ ≠ 0                                Estimate for ρ : ρ̂ = r

            ⟵  INFERENCE AND ESTIMATION  ⟵

Introduction to Biostatistics
452

For our example, the correlation matrix for the three variables in the surgery data
set is:


The corresponding scatterplot matrix is:


Note that the normality assumption for the time variable is questionable, implying
that the reported p-values may not be correct
One way to solve this is to transform the variable logarithmically, leading to:

The conclusions do not change qualitatively


21.3 Relative risk

We re-consider the sickness absence example, where the following data were
observed in one of the companies studied:

                 Sickness absence
Gender       Yes      No       Total
female       117      152      269
male         378      711      1089
Total        495      863      1358

The observed proportions of 117/269 = 43.49% and 378/1089 = 34.71% of sickness absence in females and males, respectively, were found to be significantly different (chi-squared: p = 0.007, Fisher Exact: p = 0.009).


The relative risk (RR) quantifies how much more sickness absence occurs in females, compared to males:

RR = Proportion sickness absence in females / Proportion sickness absence in males
   = (117/269) / (378/1089) = 1.25

This implies that sickness absence occurs 1.25 times more often in females than in males.

Alternatively, we can conclude that the risk of sickness absence is 25% larger in females than in males.
As for the correlation coefficient, the RR can be considered an estimate, based on
our sample, for the theoretical relative risk in the total population.


Note that a RR equal to 1 would imply that the risk is the same for both genders,
i.e., that there is no relation between sickness absence and gender.
It is therefore often of interest to test whether the relative risk in the population is
equal to 1. Alternatively, C.I.s for the relative risk can be constructed as well.
For example, a 95% C.I. for the RR in our example is given by [1.0692; 1.4686].

Since 1 ∉ [1.0692; 1.4686], we know that the null hypothesis of no relation between gender and sickness absence is rejected.
Note that formal testing of this hypothesis was done before using the chi-squared
and Fisher Exact test.
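As a sketch of how these numbers can be reproduced, the code below estimates the RR and an approximate 95% C.I. on the log scale (a standard large-sample method; assuming this is the method behind the reported interval, it indeed gives essentially [1.0692; 1.4686]):

import math

a, n1 = 117, 269     # sickness absence yes / total, females
c, n2 = 378, 1089    # sickness absence yes / total, males

rr = (a / n1) / (c / n2)
se_log_rr = math.sqrt(1/a - 1/n1 + 1/c - 1/n2)     # SE of log(RR)
lower = math.exp(math.log(rr) - 1.96 * se_log_rr)
upper = math.exp(math.log(rr) + 1.96 * se_log_rr)
print(round(rr, 3), round(lower, 4), round(upper, 4))   # about 1.253 [1.0692; 1.4687]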

21.4 Odds ratio

We re-consider the data on the relation between the occurrence of cervical cancer
and the age at first pregnancy:

               Disease status
Age        Cervical cancer    Control    Total
≤ 25       42                 203        245
> 25       7                  114        121
Total      49                 317        366

It was shown before that there is a highly significant relation between age at first
pregnancy and the occurrence of cervical cancer (p = 0.002, chi-squared and
Fisher Exact).


The relative risk of interest would indicate how much more likely cervical cancer is to occur when the first pregnancy is at or before the age of 25 years, compared to when the first pregnancy is after the age of 25 years.

Hence, the relative risk of interest is

RR = Proportion cancer cases when first pregnancy ≤ 25 yrs. / Proportion cancer cases when first pregnancy > 25 yrs.

As discussed before, the case-control nature of this study does not allow
estimation of the proportions needed to calculate the above RR.
This is a direct consequence of the fact that the scientist him-/herself decides how
many cancer cases and how many controls will be selected in the sample.


The effect of that decision can be seen from comparing several situations with
different numbers of selected controls:

Table:        ≤ 25 yrs    > 25 yrs
Case          42          7
Control       203         114

RR: [42/(42 + 203)] / [7/(7 + 114)] = 2.96

Table:        ≤ 25 yrs    > 25 yrs
Case          42          7
Control       2030        1140

RR: [42/(42 + 2030)] / [7/(7 + 1140)] = 3.32

This means that the RR can be completely influenced by taking more or less
controls.
Therefore, the RR cannot be used to describe the strength of association in
case-control studies.

An alternative to the RR, which can be used for case-control studies, is the odds ratio, defined as the ratio of the odds of cancer in the ≤ 25 group over the odds of cancer in the > 25 group.

The odds of cancer in the ≤ 25 group is defined as:

Odds≤25 = Proportion cancer cases when first pregnancy ≤ 25 yrs.
          / Proportion non-cancer cases when first pregnancy ≤ 25 yrs.
        = [42/(42 + 203)] / [203/(42 + 203)] = 42/203 = 0.2069

Note that this odds is a measure for the risk of cancer in the ≤ 25 group, since it will be large if there are many cancer cases, and small otherwise.

Similarly, the odds of cancer in the > 25 group is defined as:

Odds>25 = Proportion cancer cases when first pregnancy > 25 yrs.
          / Proportion non-cancer cases when first pregnancy > 25 yrs.
        = [7/(7 + 114)] / [114/(7 + 114)] = 7/114 = 0.0614

This odds is a measure for the risk of cancer in the > 25 group, since it will be large if there are many cancer cases, and small otherwise.

The odds ratio is now defined as:

OR = Odds≤25 / Odds>25 = 0.2069 / 0.0614 = 3.37

Hence the odds of developing cervical cancer are 3.37 times higher when the first pregnancy is at age 25 or younger.

The odds ratio is difficult to interpret, but it clearly gives a general indication of how much more risk there is in one group, compared to another group.

Note that the odds ratio also equals:

OR = (42 × 114) / (203 × 7) = 3.37

In general, we have, for a general 2 × 2 table:

            Group 1    Group 2
Case        A          B
Control     C          D

OR = AD / BC

This shows that, in contrast to the RR, the OR does not depend on the numbers
of selected cases and controls.
This can also be seen in our earlier examples:

Table:        ≤ 25 yrs    > 25 yrs
Case          42          7
Control       203         114

RR: [42/(42 + 203)] / [7/(7 + 114)] = 2.96
OR: (42 × 114) / (7 × 203) = 3.37

Table:        ≤ 25 yrs    > 25 yrs
Case          42          7
Control       2030        1140

RR: [42/(42 + 2030)] / [7/(7 + 1140)] = 3.32
OR: (42 × 1140) / (7 × 2030) = 3.37
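The contrast in this table is easy to verify with a few lines of code (an ad hoc sketch, not taken from any particular package):

def rr(a, b, c, d):
    # a, b: cases in the <= 25 and > 25 groups; c, d: controls in those groups
    return (a / (a + c)) / (b / (b + d))

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

print(rr(42, 7, 203, 114), odds_ratio(42, 7, 203, 114))       # RR 2.96, OR 3.37
print(rr(42, 7, 2030, 1140), odds_ratio(42, 7, 2030, 1140))   # RR changes, OR still 3.37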


As for the correlation coefficient and the RR, the OR can be considered an
estimate, based on our sample, for the theoretical odds ratio in the total
population.
Note that an OR equal to 1 would imply that the risk is the same for both groups,
i.e., that there is no relation between cervical cancer and the age at first
pregnancy.
In that case, one would also have RR = 1.
It is therefore often of interest to test whether the odds ratio in the population is
equal to 1. Alternatively, C.I.s for the odds ratio can be constructed as well.

For example, a 95% C.I. for the OR in our example is given by [1.4658; 7.7457].

Since 1 ∉ [1.4658; 7.7457], we know that the null hypothesis of no relation between cervical cancer and age at first pregnancy is rejected.
Note that formal testing of this hypothesis was done before using the chi-squared
and Fisher Exact test.
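A sketch of how such an interval can be obtained with the standard large-sample (Woolf) method on the log odds ratio; assuming this is the method behind the reported interval, it reproduces [1.4658; 7.7457] up to rounding:

import math

a, b = 42, 7      # cancer cases: first pregnancy <= 25 yrs / > 25 yrs
c, d = 203, 114   # controls:     first pregnancy <= 25 yrs / > 25 yrs

or_hat = (a * d) / (b * c)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)       # Woolf SE of log(OR)
lower = math.exp(math.log(or_hat) - 1.96 * se_log_or)
upper = math.exp(math.log(or_hat) + 1.96 * se_log_or)
print(round(or_hat, 2), round(lower, 4), round(upper, 4))   # about 3.37 [1.47; 7.75]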

21.5 Examples from the biomedical literature

Giantomaso et al. [25]

. Figure 2, p.398: Positive association between the actual distance and the distance estimated by the physician

. Table 1, p.398:
. Negative Pearson correlation (r = -0.139)
. Correlation of patient estimate with physician estimate equals r = 0.349 (r² = 0.12)
. Joint normality of X and Y is questionable (see graph)

Marlow et al. [9], Table 1:

                Classmates      Preterm
Impaired        2 (1.3%)        99 (41%)
Not impaired    158 (98.7%)     142 (59%)

OR = (158 × 99) / (2 × 142) = 56

RR = (99/241) / (2/160) = 33

Chapter 22
Non-parametric statistics

. Introduction
. The principle of ranks
. Wilcoxon test
. Example: Survival times in cancer patients
. Spearman correlation
. Example: Surgery data
. Remarks
. Examples from biomedical literature

22.1 Introduction

Most test procedures commonly used in statistics are based on specific assumptions about the way the outcome Y of interest is distributed in the population. Examples are:
. Normality
. Equal variance
This is why all techniques discussed so far are examples of so-called parametric
statistics
Sometimes, transformations of the data can be used in order to satisfy these
assumptions.
However, this (slightly) complicates the interpretation of the results
Also, in some cases, it is not possible to find a suitable transformation. For example, consider the case where non-normality is caused by multi-modality.

If no transformation can be found, or if transformations are not desired, non-parametric methods can be used.

22.2 The principle of ranks

We re-consider the analysis of the survival times of cancer patients, where a comparison of stomach cancer patients to colon cancer patients was of interest.

[Figure: overlaid histograms of the survival times of the two groups]

This suggests that the colon-cancer cases have longer survival times than the stomach-cancer cases, i.e., that the distribution of the survival times in one group is shifted more to the right from the distribution in the other group.
This implies that, if all observations would be ranked, we expect to see more
observations from the stomach-cancer group in the lower ranks, and more from
the colon-cancer group in the higher ranks.
This suggests that it is sufficient to study the ranks of the observations, i.e.,
which observations are larger/smaller than others, to decide whether the survival
times in both groups can be assumed to be sampled from the same distribution.
The actual location of the observations is not needed, it is sufficient to know their
ranks.
Most non-parametric tests are based on replacing the observations by their ranks.
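As a tiny illustration of this idea, scipy's rankdata function performs the replacement (the nine values below are simply the two samples of the next section pooled together):

from scipy.stats import rankdata

values = [7, 4, 9, 17, 11, 6, 21, 14, 18]
print(rankdata(values))   # [3. 1. 4. 7. 5. 2. 9. 6. 8.]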


This will now be illustrated with two frequently-used non-parametric procedures:


. The Wilcoxon test

. The Spearman correlation coefficient

22.3 Wilcoxon test

The Wilcoxon test is the non-parametric version of the unpaired t-test. Hence, it
allows comparison of two populations, without having to assume the data to be
normally distributed in both populations
The null and alternative hypotheses are:

H0: one distribution (the stomach cancer and colon cancer survival times come from the same distribution)

HA: shifted distributions (one distribution is shifted relative to the other)

[Figure: density curves for the stomach cancer and colon cancer groups, coinciding under H0 and shifted apart under HA]

Hence, the alternative assumes that one distribution is just shifted from the other.
As an example of how the Wilcoxon test proceeds, consider the comparison of two
populations (A and B), on the basis of the following two samples:
A 7 4 9 17
B 11 6 21 14 18
The observations are now sorted, while keeping track of the population from
which they were sampled (group A or B):
4 6 7 9 11 14 17 18 21
A B A A B B A B B


The observed values are now replaced by their rank in the complete data set
(groups A and B together):
1 2 3 4 5 6 7 8 9
A B A A B B A B B
The sum of the ranks of all observations from one group is now calculated. For
example, for group A, this becomes:
WA = 1 + 3 + 4 + 7 = 15
Obviously, if WA is exceptionally large, this means that the observations in group
A are located more to the right, when compared to the observations in group B


Alternatively, if WA is exceptionally small, this means that the observations in


group A are located more to the left, when compared to the observations
in group B
Hence, H0 will be rejected if WA is too large or too small.
Question:
How large/small is too large/small ?
Answer:
If the observed value for WA
is very unlikely to happen by pure chance

We therefore calculate the probability p of observing a value for WA at least as extreme as the one actually observed, if the two populations were identical.

In our example, this probability equals p = 0.2857:

Hence, even if the two samples were drawn from the same population, there would be a 28.57% chance of observing two samples shifted from each other as much as in the current experiment, by pure chance.
Hence, what has been observed in the current experiment is perfectly in line with
what is to be expected, if the two populations are identical.
We therefore conclude that there is no significant difference between the groups A and B (p = 0.2857).

This testing procedure is called the Wilcoxon (rank sum) test or, equivalently, the Mann-Whitney U test.

Note that, alternatively, one can also decide to sum the ranks of the other group (here group B):

WB = 2 + 5 + 6 + 8 + 9 = 30

This would lead to identical results, since WA is large if WB is small, and vice versa. Indeed, we have that

WA + WB = (nA + nB)(nA + nB + 1) / 2

Hence, knowing WA is equivalent to knowledge of WB.
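For reference, the same exact p-value should be reproducible with scipy's Mann-Whitney implementation (a sketch; scipy works with the U statistic, here U_A = W_A - n_A(n_A + 1)/2 = 15 - 10 = 5):

from scipy.stats import mannwhitneyu

A = [7, 4, 9, 17]
B = [11, 6, 21, 14, 18]

result = mannwhitneyu(A, B, alternative="two-sided", method="exact")
print(result.statistic, result.pvalue)   # U = 5.0, p approximately 0.2857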
22.4 Example: Survival times in cancer patients

The survival times of colon cancer patients were compared before with those of stomach cancer patients, using the unpaired t-test, after logarithmic transformation of the survival times.
We can now repeat this non-parametrically, for the original as well as
log-transformed survival times:

                        t-test        Wilcoxon
Original data           p = 0.2483    p = 0.0945
Log-transformed data    p = 0.0671    p = 0.0945


Note that the Wilcoxon test yields a p-value closer to the one obtained from the
t-test based on log-transformed data than to the one obtained from the t-test
based on the original data
Since the Wilcoxon test is based on ranks rather than the original data,
transforming the data will not affect the result, as long as monotonic
transformations are used.
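This is easy to check on the small A/B example used earlier: applying the logarithm to both samples leaves the ranks, and hence the exact p-value, unchanged (a minimal sketch):

import numpy as np
from scipy.stats import mannwhitneyu

A = np.array([7, 4, 9, 17], dtype=float)
B = np.array([11, 6, 21, 14, 18], dtype=float)

p_raw = mannwhitneyu(A, B, method="exact").pvalue
p_log = mannwhitneyu(np.log(A), np.log(B), method="exact").pvalue
print(p_raw == p_log)   # True: a monotone transformation does not change the ranks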

22.5 Spearman correlation

The Pearson correlation coefficient r expresses the strength of linear association


between two variables X and Y
As discussed before, the test for significance of the observed correlation assumes
X and Y to be jointly normally distributed.
In cases where a transformation is not possible or not desired, a non-parametric
version can be derived, leading to the so-called Spearman correlation
coefficient.
As for the Wilcoxon test, the Spearman correlation coefficient will be based on
replacing the observations by their ranks.

As an example of how the calculation of the Spearman correlation proceeds, consider the following 8 observations for the variables X and Y:

xi:   0      2      4      6      8      10     12     13
yi:   0.10   1.05   4.00   5.30   5.75   6.55   6.65   8.17

[Figure: scatterplot of the (xi, yi) pairs]

Each value xi is now replaced by its rank amongst all observed values for X. Similarly, each value yi is now replaced by its rank amongst all observed values for Y.
Graphically:

[Figure: scatterplot of rank(yi) versus rank(xi), together with the table of ranks; the ranked pairs lie on an increasing straight line]

One now calculates a Pearson correlation as a measure of association between the so-obtained ranks.


In the above example, the ranks show a perfect linear relation, implying that the
Spearman correlation will equal 1.
Note that the original data did not show a perfect linear fit, implying that the
Pearson correlation would be less than 1.
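A sketch verifying both statements with scipy, using the eight (xi, yi) pairs introduced above:

from scipy.stats import spearmanr, pearsonr

x = [0, 2, 4, 6, 8, 10, 12, 13]
y = [0.10, 1.05, 4.00, 5.30, 5.75, 6.55, 6.65, 8.17]

print(spearmanr(x, y)[0])   # 1.0: the relation is perfectly monotone
print(pearsonr(x, y)[0])    # below 1: the relation is not perfectly linear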
The Spearman correlation coefficient measures to what extent there is a monotone relation between X and Y, and has the following properties:

. -1 ≤ r ≤ 1
. r < 0 : negative trend between the xi and the yi
. r > 0 : positive trend between the xi and the yi
. r = -1 : there is a perfect negative monotone relation between the xi and the yi
. r = +1 : there is a perfect positive monotone relation between the xi and the yi
. r = 0 : there is no monotone trend between the xi and the yi

A statistical test for significance of the Spearman correlation can be constructed


as well.
This test procedure is not based on any distributional assumptions about X or Y .
Note that, although the Spearman correlation is often interpreted as just the
non-parametric version of the Pearson correlation, it is important to realize that,
strictly speaking, both correlations measure different types of association:
. Pearson: Linear association
. Spearman: Monotone association

22.6 Example: Surgery data

As an example, we re-consider the surgery data, in which we study the relation between the time needed, after surgery, for the BP to recover to a normal level, and both the BP during the surgery and the dose of the drug needed to keep the BP sufficiently low during the surgery.
Data on 53 patients, with 3 types of operation
Available measurements:

. Time (min.) before the patient's systolic BP returns to 100 mmHg

. The 10-base log(dose) of the drug in log(mg)


. The average systolic BP while the drug was being administered


Before, a Pearson correlation analysis was performed, and the variable Time was
log-transformed in order to satisfy the normality assumption.
We compare the previous results with those from a non-parametric Spearman
correlation analysis:

Note that Spearman correlations are not always larger/smaller than Pearson
correlations.
Since the Spearman correlation is based on ranks rather than the original data,
monotone transformations of the data will not affect the result.
22.7 Remarks

For most simple statistical procedures, non-parametric versions are available.


Non-parametric procedures are not based on distributional assumptions for the
data.
Since non-parametric procedures are based on ranks, they are not affected by
monotone transformations of the data. Hence, transforming the data prior to a
non-parametric analysis does not make any sense.
Since non-parametric procedures are based on ranks, they are not influenced by
extreme values (outliers).

In general, the use of non-parametric procedures should be consistent with the summary statistics used to describe the observed data:

. Means and standard deviations go together with parametric tests
. Medians and interquartile ranges go together with non-parametric tests

In case the distributional assumptions of a specific test are satisfied, one has the
choice between the parametric and non-parametric test.
In such cases, the parametric techniques are to be preferred, as they are more
powerful to detect relevant effects.
Unfortunately, many research questions will require more complex statistical tools
for which no non-parametric alternatives are available.

22.8 Examples from the biomedical literature

Choksy et al. [26]

. Statistical methodology, p.647:

. Power analysis does not specify the test


. Parametric and non-parametric data ?

. Figure 3:


Chen et al. [17], Table 3:

. Spearman rank correlations


. Many tests, few significant results, multiple testing
Huang et al. [27], Figure 1:

. Spearman correlation to quantify linear relations
. Spearman correlation not affected by outlier


Bibliography
[1] S. Graham and W. Shotz. Epidemiology of cancer of the cervix in Buffalo, New York. Journal of the National Cancer Institute, 63:23-27, 1979.
[2] D.J. Hand, F. Daly, A.D. Lunn, K.J. McConway, and E. Ostrowski. A handbook of small data sets. Chapman & Hall, first edition, 1989.
[3] P. Armitage and G. Berry. Statistical methods in medical research. Blackwell Scientific Publications, 1987.
[4] E. Cameron and L. Pauling. Supplemental ascorbate in the supportive treatment of cancer: re-evaluation of prolongation of survival times in terminal human cancer. Proceedings of the National Academy of Sciences U.S.A., 75:4538-4542, 1978.
[5] G.A. MacGregor, N.D. Markandu, J.E. Roulston, and J.C. Jones. Essential hypertension: effect of an oral inhibitor of angiotensin-converting enzyme. British Medical Journal, 2:1106-1109, 1979.
[6] M. Bland. An introduction to medical statistics. Oxford University Press, 3rd edition, 2006.
[7] J.D. Robertson and P. Armitage. Comparison of two hypotensive agents. Anaesthesia, pages 53-64, 1959.
[8] H.A. Boushey, C.A. Sorkness, T.S. King, et al. Daily versus as-needed corticosteroids for mild persistent asthma. The New England Journal of Medicine, 352:1519-1528, 2005.
[9] N. Marlow, D. Wolke, M.A. Bracewell, et al. Neurologic and developmental disability at six years of age after extremely preterm birth. The New England Journal of Medicine, 352:9-19, 2005.


[10] C.A. Wong, B.M. Scavone, A.M. Peaceman, et al. The risk of cesarean delivery with neuraxial analgesia given early versus late in labor. The New England Journal of Medicine, 352:655-665, 2005.
[11] F. Blanchon, M. Grivaux, B. Asselain, et al. 4-year mortality in patients with non-small-cell lung cancer: development and validation of a prognostic index. Lancet Oncology, 7:829-836, 2006.
[12] K.M. Kellett, D.A. Kellett, and L.A. Nordholm. Effects of an exercise program on sick leave due to back pain. Physical Therapy, 71:283-293, 1991.
[13] S.P. Wu. Maximum acceptable weight of lift by Chinese experienced male manual handlers. Applied Ergonomics, 28:237-244, 1997.
[14] T. Nawrot, M. Plusquin, J. Hogervorst, et al. Environmental exposure to cadmium and risk of cancer: a prospective population-based study. The Lancet Oncology, 7:119-126, 2006.
[15] S.E. Nissen, E.M. Tuzcu, P. Schoenhagen, et al. Statin therapy, LDL cholesterol, C-reactive protein, and coronary artery disease. The New England Journal of Medicine, 352:29-38, 2005.
[16] E. Zuskin, J. Mustajbegovic, N. Schachter, et al. Longitudinal study of respiratory findings in rubber workers. American Journal of Industrial Medicine, 30:171-179, 1996.
[17] N.H. Chen, P.C. Wang, M.J. Hsieh, et al. Impact of severe acute respiratory syndrome care on the general health status of healthcare workers in Taiwan. Infection Control and Hospital Epidemiology, 28:75-79, 2007.
[18] C.A.S. De Clercq, J.S.V. Abeloos, M.Y. Mommaerts, and L.F. Neyt. Temporomandibular joint symptoms in an orthognathic surgery population. Journal of Cranio-Maxillo-Facial Surgery, 23:195-199, 1995.
[19] A.I. Amin, O. Hallbook, A.J. Lee, R. Sexton, B.J. Moran, and R.J. Heald. A 5-cm colonic J pouch colo-anal reconstruction following anterior resection for low rectal cancer results in acceptable evacuation and continence in the long term. Colorectal Disease, 5:33-37, 2003.
[20] S. Kaplan, S. Etlin, I. Novikov, and B. Modan. Occupational risks for the development of brain tumours. American Journal of Industrial Medicine, 31:15-20, 1997.


[21] Y. Baba, J.D. Putzke, N.R. Whaley, Z.K. Wszolek, and R.J. Uitti. Gender and the Parkinson's disease phenotype. Journal of Neurology, 252:1201-1205, 2005.
[22] T. Shatari, M.A. Clark, T. Yamamoto, A. Menon, C. Keh, J. Alexander-Williams, and M. Keighley. Long strictureplasty is as safe and effective as short strictureplasty in small-bowel Crohn's disease. Colorectal Disease, 6:438-441, 2004.
[23] P. Sripalakit, P. Nermhom, and S. Maphanta. Bioequivalence evaluation of two formulations of Doxazosin tablet in healthy Thai male volunteers. Drug Development and Industrial Pharmacy, 31:1035-1040, 2005.
[24] L.F. Hutchins, S.J. Green, P.M. Ravdin, D. Lew, S. Martino, M. Abeloff, A.P. Lyss, C. Allred, S.E. Rivkin, and C.K. Osborne. Randomized, controlled trial of Cyclophosphamide, Methotrexate, and Fluorouracil versus Cyclophosphamide, Doxorubicin, and Fluorouracil with and without Tamoxifen for high-risk, node-negative breast cancer: Treatment results of intergroup protocol INT-0102. Journal of Clinical Oncology, 23:8313-8321, 2005.
[25] T. Giantomaso, L. Makowsky, N.L. Ashworth, and R. Sankaran. The validity of patient and physician estimates of walking distance. Clinical Rehabilitation, 17:394-401, 2003.
[26] S.A. Choksy, P.L. Chong, C. Smith, M. Ireland, and J. Beard. A randomised controlled trial of the use of a tourniquet to reduce blood loss during transtibial amputation for peripheral arterial disease. European Journal of Vascular and Endovascular Surgery, 31:646-650, 2006.
[27] C.-C.J. Huang, C.-M. Li, C.-F. Wu, S.-P. Jao, and K.-Y. Wu. Analysis of urinary N-acetyl-S-(propionamide)-cysteine as a biomarker for the assessment of acrylamide exposure in smokers. Environmental Research, 104:346-351, 2007.
