
Introduction to Biostatistics

Geert Verbeke
Biostatistical Centre, K.U.Leuven
geert.verbeke@med.kuleuven.be
http://perswww.kuleuven.be/geert verbeke

Bachelor Biomedical Sciences / Bachelor Pharmaceutical Sciences

Contents

I    Introduction, motivation and example
     1  Introductory material
     2  Homeopathy: The test

II   Basic principles of statistical methodology
     3  What is statistics ?
     4  Population versus sample
     5  Causality and randomization

III  Describing and summarizing data
     6  Types of outcomes
     7  Graphical presentation of data
     8  Summary statistics

IV   Basic concepts of statistical inference
     9  Describing the population
     10 From the population to the sample, and back to the population
     11 Estimation, sampling variability, bias, and precision
     12 Confidence intervals
     13 Hypothesis testing

V    Some frequently used tests
     14 The comparison of two means: Unpaired data
     15 The comparison of two proportions: Unpaired data
     16 The comparison of two means: Paired data
     17 The comparison of two proportions: Paired data

VI   Further topics on statistical inference
     18 Errors in statistics: Basic concepts
     19 Errors in statistics: Practical implications
     20 One-sided versus two-sided tests
     21 Describing associations
     22 Non-parametric statistics

Bibliography

Part I
Introduction, motivation and example


Chapter 1
Introductory material

. Motivation
. Course material
. Evaluation system

1.1 Motivation

Master thesis
Statistics in the (bio-)medical literature
Correct analysis of collected data
Correct interpretation of results

1.2 Course material

Copies of the course notes: Toledo


Data sets analysed in the course: Toledo
Papers from biomedical literature, discussed in course: Toledo
Statistica software:

. Available in all K.U.Leuven PC rooms


. Available through LUDIT:
http://ludit.kuleuven.be/software/

. ...


Vestac JAVA applets

. Online:
http://ucs.kuleuven.be/links/index.htm

. Local installation:
http://ucs.kuleuven.be/java/download/download.html
and follow instructions

1.3 Evaluation system

Part A:

. Take-home assignment (individualized)


. Data analysis
. Project initiated during practical sessions

Part B: Open book, individualized, written, multiple choice exam


Chapter 2
Homeopathy: The test

. The controversy
. The movie
. Blinding
. Placebo
. The ultimate experiment
. The statistics
. Errors in statistics

2.1 The controversy

2.2 The movie

Original version: BBC (Horizon)


Dutch version: VRT (Overleven)

2.3 Blinding

Scientists don't always think rationally. . . They can fool themselves.

(J. Randi)

Bias can be introduced if the scientist knows what samples are being investigated.
This can be avoided by blinding
Blinding is obtained by randomly assigning codes to the samples/treatments.
The codes are broken after all data have been collected.
The less objective the measurements are, the more important is blinding:
. Survival of the patient is an objective measure
. Tumour reduction is a semi-objective measure


In some cases it is important that patients themselves do not know what


treatment was received:
. Pain measurements
. Quality of life measurements
One can then use double blinding
Blinding is not always possible:

. Comparison of different bandages


. Comparison of different surgery suture techniques (staples versus thread)


2.4 Placebo

pills that contain no active ingredient at all, just plain sugar

The fact that treated patients improve is no evidence for the efficacy of the
treatment:
. Natural improvement can occur
. Improvement can be the result of the attention given to the patient
Hence, showing efficacy of a treatment requires comparison to placebo
This explains the popularity of placebo-controlled trials in the bio-medical sciences
Is the use of placebo ethical ?

. Pro: The advancement of science is dependent on the sacrifice of a few for the benefit of many others

. Contra: Physicians/investigators are never relieved of their obligations of care towards their patients (Declaration of Helsinki, 1964)


In cases where the use of placebo is considered unethical, the new treatment is
often compared to a standard one.
The aim of the study is then to show that the new treatment is at least as
good as the standard treatment.


2.5 The ultimate experiment

2 × 5 tubes are prepared, with 5C dilution. The first five start from the active substance, the second five from pure water
These 10 tubes are given a random label: blinding
The tubes are diluted further to obtain 2 × 20 dilutions of 18C

New labels are assigned, in order to rule out any form of fraud.
A sample of living human cells is added to a drop from each tube.
How many cells have been activated by the different test solutions ?
Measurements are performed by two different labs in parallel.
Labs were told there were 20 active solutions and 20 placebo solutions. This was done to prevent the researchers from classifying all solutions as non-active.

2.6 The statistics

The solutions have been analysed in parallel by two different labs
Let's focus on the results from the second lab:

The results can also be summarized as a 2 × 2 table:

                           Decision
  Reality         Homeopathy   Placebo
  Homeopathy          11           9        20
  Placebo              9          11        20
                      20          20        40

11 out of the 20 H tubes are scored as being active
Since the labs knew that there were 20 active solutions, this immediately implies that 9 of the 20 P tubes are scored as active, and hence that 11 of the 20 P tubes are correctly scored as non-active

If P and H were really equivalent we would expect:

                           Decision
  Reality         Homeopathy   Placebo
  Homeopathy          10          10        20
  Placebo             10          10        20
                      20          20        40

Since we observed more correct classifications (11/20), the result of the experiment can be considered as some evidence for H efficacy.
On the other hand, 11/20 could have occurred by pure chance

This random variability is also reflected in the results of the first lab:

                           Decision
  Reality         Homeopathy   Placebo
  Homeopathy          11           9        20
  Placebo              9          11        20
                      20          20        40

In general, repeating an experiment would rarely lead to exactly the same results
By random chance, one may obtain results which are slightly different from 10/20
How much is slightly ? 11/20 ? 12/20 ? 13/20 ? . . .

What number x of correct positive test results should have been obtained in order
to consider this sufficient evidence in favour of H ?
The answer should be based on the probability of having at least x correct positive
test results by pure chance
Such probabilities can be calculated using probability theory:
   x     p: Probability of at least x correct positive test results by pure chance
  11     0.3762
  12     0.1715
  13     0.0564
  14     0.0128
  15     0.0019
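These tail probabilities can be reproduced with a few lines of code. The sketch below assumes the set-up described above: 40 tubes of which 20 are truly active, and each lab labels exactly 20 tubes as active, so the number of correctly identified H tubes follows a hypergeometric distribution (this model is implicit in the slides, not stated there explicitly).

```python
# Sketch: P(at least x correct positive test results by pure chance),
# assuming a hypergeometric model: 40 tubes, 20 truly active, 20 labelled active.
from scipy.stats import hypergeom

rv = hypergeom(M=40, n=20, N=20)       # population 40, 20 "successes", 20 draws

for x in range(11, 16):
    print(x, round(rv.sf(x - 1), 4))   # sf(x - 1) = P(X >= x)
# Prints 0.3762, 0.1715, 0.0564, 0.0128, 0.0019, matching the table above.
```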

Note how unlikely it would be to observe, e.g., x = 15 correct positive test results
if H and P would be equivalent (p = 0.0019)
Therefore, observing x = 15 could be considered strong evidence in favour of H
On the other hand, if there is no difference between H and P, one will observe at
least 11 correct positive test results in 37.62% of the cases, by pure random
chance.
Our experiment therefore does not provide any evidence in favour of H

There's absolutely no evidence at all to say that there is a difference. . .

(M. Bland)

2.7 Errors in statistics

Since what is observed in an experiment is subject to random variation, the conclusion is also subject to random variation.
For example, even if H and P are equivalent, one may observe 15 correct positive
test results by pure chance: This will happen in 19 experiments out of 10000.
In such a case one would conclude that there is evidence for a difference between
H and P, even when there is no difference at all.
This shows that statistics will only help in summarizing and expressing the
strength of the evidence, in favour or against some specific statement.
One always has to keep in mind that errors can be made in the conclusions


Conclusion:

Statistics can prove everything

Statistics helps to:

. quantify the errors


. control the errors


Part II
Basic principles of statistical methodology


Chapter 3
What is statistics ?

. Examples
. Conclusion


3.1 Example: Sickness absence

In occupational medicine, one is interested in studying factors that influence


absence due to sickness
The following data were obtained from 585 employees with a similar job:

                    Sickness absence
  Gender            No      Yes
  female            245     184      429
  male               98      58      156
                    343     242      585

Research question:
Is there a relation between absence and gender ?

184/429 = 42.9% of the females, and 58/156 = 37.2% of the males


have been absent
This suggests that females are absent more often than males
However, even if absence due to sickness is equally frequent amongst males and
females, the above results could have occurred by pure chance.
It therefore would be of interest to calculate how likely it would be to observe such
differences, by pure chance


If this would be very unlikely, then the data provide evidence for a relation
between gender and absence
If this would not be unlikely, then the data provide no evidence for such a relation.
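As an illustration of what "by pure chance" means here, the sketch below repeatedly reshuffles the 242 absences at random over the 585 employees (i.e., it assumes no relation between gender and absence) and records how often a female/male difference at least as large as the observed 42.9% versus 37.2% appears. The simulation set-up is an illustrative assumption, not part of the course material.

```python
# Sketch: how often does the observed gender difference in sickness absence
# (42.9% of 429 females versus 37.2% of 156 males) arise by pure chance?
import numpy as np

rng = np.random.default_rng(1)
n_female, n_male, n_absent = 429, 156, 242
observed_gap = 184 / 429 - 58 / 156          # about 0.057

count = 0
for _ in range(10_000):
    absent = np.zeros(n_female + n_male, dtype=int)
    absent[rng.choice(n_female + n_male, size=n_absent, replace=False)] = 1
    if absent[:n_female].mean() - absent[n_female:].mean() >= observed_gap:
        count += 1

print(count / 10_000)   # proportion of reshuffles with a gap at least as large
```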


3.2 Example: Cervical cancer

Graham and Shotz [1]; Hand et al. [2] p. 247


In order to study the relationship between the occurrence of cervical cancer and
the age at first pregnancy, data were collected on 49 cancer cases and 317
non-cancer cases (controls). All women were asked about their age at first
pregnancy, and the data are summarized as:

                      Disease status
  Age          Cervical cancer   Control
  ≤ 25               42            203       245
  > 25                7            114       121
                     49            317       366

Research question:
Is there a relation between cancer and age ?

Note that 7/49 = 14.3% of the cancer cases had their first pregnancy after the
age of 25 years, while this is 114/317 = 35.96% in the control group
This suggests that cancer is more likely to occur when the first pregnancy was
before the age of 25 years.
How likely are the observed differences to occur by pure chance, if there is no
relation at all between cancer and age at first pregnancy ?


If the chance of observing this would be extremely small, then the observed data
provide evidence that there indeed is a relationship
If this chance is high, then the above data do not provide evidence for any relation
at all.


3.3 Example: Weight gain in rats

Armitage & Berry [3] p. 109


Consider the gain in weight (g) of 19 female rats between 28 and 84 days after
birth.
12 were fed on a high protein diet and 7 on a low protein diet:
Research question:
Does the weight gain depend on the diet ?


Dataset Weightgain:

  High protein:  134  146  104  119  124  161  107   83  113  129   97  123
  Low protein:    70  118  101   85  107  132   94

Average (g)
  High protein:  120
  Low protein:   101

The averages suggest differences.
However, the observed differences could have occurred by pure chance.

It would be of interest to know how likely such differences are to occur by pure
chance, i.e., if weight gains would be completely unrelated to protein intake.
If this is very unlikely, the above data provide evidence that weight gains really
depend on the diet.
If such differences are likely to occur by pure chance, then the above data do not
provide evidence that weight gains show any relation with protein intake.


3.4 Example: Survival times of cancer patients

Cameron and Pauling [4]; Hand et al. [2] p. 255


Patients with advanced cancer of the stomach, bronchus, colon, ovary, or breast
were treated (in addition to standard treatment) with ascorbate.
The outcome of interest is the survival time (days)
Research question:
Do survival times differ with organ affected ?


Dataset Cancer:

  Stomach:    124   42   25   45  412   51 1112   46  103  876  146  340  396
  Bronchus:    81  461   20  450  246  166   63   64  155  859  151  166   37  223  138   72  245
  Colon:      248  377  189 1843  180  537  519  455  406  365  942  776  372  163  101   20  283
  Ovary:     1234   89  201  356 2970  456
  Breast:    1235   24 1581 1166   40  727 3808  791 1804 3460  719

Average (days)
  Stomach:     286
  Bronchus:    211.6
  Colon:       457.4
  Ovary:       884.3
  Breast:     1395.9

The average survival times suggest differences.


However, these differences could have occurred by pure chance.
It would be of interest to know how likely such differences are to occur by pure
chance.
If this is very unlikely, the above data provide evidence that survival times really
differ with the organ affected.
If such differences are likely to occur by pure chance, then the above data do not
provide evidence that survival would differ with the organ affected.


3.5 Example: Captopril data

MacGregor et al. [5]; Hand et al. [2] p. 56


15 patients with hypertension
The response of interest is the supine blood pressure, before and after treatment
with CAPTOPRIL
Research question:

How does treatment affect BP ?


Dataset Captopril

              Before            After
  Patient   SBP    DBP        SBP    DBP
     1      210    130        201    125
     2      169    122        165    121
     3      187    124        166    121
     4      160    104        157    106
     5      167    112        147    101
     6      176    101        145     85
     7      185    121        168     98
     8      206    124        180    105
     9      173    115        147    103
    10      146    102        136     98
    11      174     98        151     90
    12      201    119        168     98
    13      198    106        179    110
    14      148    107        129    103
    15      154    100        131     82

Average (mm Hg)
  Diastolic before:  112.3
  Diastolic after:   103.1
  Systolic before:   176.9
  Systolic after:    158.0

It would be of interest to know how likely the observed changes in BP are to occur
by pure chance.
If this is very unlikely, the above data provide evidence that BP indeed decreases
after treatment with Captopril. Otherwise, the above data do not provide evidence
for efficacy of Captopril.

3.6 Example: Prevalence of severe colds in children

Bland [6] p. 246


Study about the prevalence of severe colds in 1319 Kent schoolchildren, measured
at the ages of 12 and 14
The response of interest is whether the child had severe colds during the last 12
months

                               Severe colds at 14 yrs.
                                Yes       No
  Severe colds     Yes          212      144       356
  at 12 yrs.       No           256      707       963
                                468      851      1319

Research question:
Is the prevalence of severe colds different at the two ages ?
At age 12, 356/1319 = 27% of the children reported severe colds.
At age 14, this percentage equals 468/1319 = 35%
These data suggest that the prevalence of severe colds increases with age.
It would be of interest to know how likely the observed change in prevalence is to
occur by pure chance.
If this is very unlikely, the above data provide evidence that the prevalence indeed
changes with age. Otherwise, the above data do not provide evidence for such a
change.

Note that the data structure is similar to the one in the Captopril data, in the
sense that subjects are measured twice at different time points:


3.7 Example: Surgery data

Robertson and Armitage [7]; Hand et al. [2] p. 100


Sometimes, a patient's BP needs to be lowered during surgery, using a hypotensive drug, which is administered continuously during the relevant phases of the operation.
Since the duration of these phases varies, so does the total amount of drug
administered
Patients also vary in the extent to which the drug succeeds in lowering the BP
The sooner the BP rises again to normal after the drug is discontinued, the better
Data on 53 patients, with 3 types of operation

Available measurements:

. Time (min.) before the patient's systolic BP returns to 100 mmHg

. The 10-base log(dose) of the drug in log(mg)


. The average systolic BP while the drug was being administered
Research question:

How is recovery time related to other two variables ?

Note that the potential relation between BP and log(dose) makes it difficult to disentangle their relative relations to the recovery time.


Dataset Surgery

Minor non-thoracic:
  log(dose)   BP   Time
     2.26     66     7
     1.81     52    10
     1.78     72    18
     1.54     67     4
     2.06     69    10
     1.74     71    13
     2.56     88    21
     2.29     68    12
     1.80     59     9
     2.32     73    65
     2.04     68    20
     1.88     58    31
     1.18     61    23
     2.08     68    22
     1.70     69    13
     1.74     55     9
     1.90     67    50
     1.79     67    12
     2.11     68    11
     1.72     59     8

Major non-thoracic:
  log(dose)   BP   Time
     1.74     68    26
     1.60     63    16
     2.15     65    23
     2.26     72     7
     1.65     58    11
     1.63     69     8
     2.40     70    14
     2.70     73    39
     1.90     56    28
     2.78     83    12
     2.27     67    60
     1.74     84    10
     2.62     68    60
     1.80     64    22
     1.81     60    21
     1.58     62    14
     2.41     76     4
     1.65     60    27
     2.24     60    26
     1.70     59    28

Thoracic:
  log(dose)   BP   Time
     2.45     84    15
     1.72     66     8
     2.37     68    46
     2.23     65    24
     1.92     69    12
     1.99     72    25
     1.99     63    45
     2.35     56    72
     1.80     70    25
     2.36     69    28
     1.59     60    10
     2.10     51    25
     1.80     61    44

3.8 Conclusion

The aim of statistics is twofold:

. Descriptive statistics: Summarizing and describing observed data such that


the relevant aspects are made explicit.
. Inferential statistics: Studying to what extent observed trends/effects can
be generalized to a general (infinite) population


Examples of descriptive statistics include tables, graphs, calculation of


averages,. . .
Valid inferential statistics requires a strong link between the sample and the
population about which one wishes to draw conclusions.
Valid inferential statistics requires:
. Correct statistical methodology

. Correct interpretation of results


Chapter 4
Population versus sample

. The population
. The (random) sample
. Statistics versus probability theory
. Types of studies
. Random samples → variability → uncertainty


4.1 Introduction

Observed data can always be considered as taken from some population.


The strength of the evidence in the data, and the validity of the conclusions based
on the data depends entirely on:
. the definition of the population
. the way the sample is drawn from the population


4.2 The population

In practice, the population of interest is defined through inclusion and exclusion


criteria
The inclusion criteria are the characteristics a subject/patient needs to have in
order to belong to the population
Examples of inclusion criteria:
. specific disease
. age range
The exclusion criteria are the characteristics a subject/patient is not allowed to
have in order to belong to the population


Examples of exclusion criteria:

. previous treatment for same disease


. pregnancy

It is important that objective criteria are used:

. Tumour must be beyond hope of surgical eradication


. An expected survival of at least 90 days

The population is not fixed, but changes constantly.


For example, interest is not only in today's patients with a specific condition, but
also in all patients in the (near) future
Therefore, the population is often (considered) infinite


4.3 The (random) sample

4.3.1 Example: Low back pain in nurses

Consider a study on risk factors for the prevalence of low back pain in nurses
Suppose that interest is in the population of all nurses in all Belgian hospitals
Data sets with the following characteristics would be problematic:
. Only female nurses

. Only nurses from university hospitals


. Only nurses aged 40


Observed effects/trends cannot be generalized to the entire population since one


cannot rule out that such effects would only occur in females, in university
hospitals, or in younger nurses.
Ideally, the sample should be a perfect reflection of the total population:
. Same proportion of males and females
. Same types of hospitals
. Same age distribution


4.3.2 Example

Suppose a study is designed to compare 2 antidepressants


Study participants can be obtained from a number of psychiatric hospitals
The so-obtained data set is not necessarily representative for the total population
of depressed people, as only those hospitalized are considered.
Ideally, the sample should be a perfect reflection of the total population of interest


4.3.3 The random sample

Obviously, a data set in which all subjects satisfy the in- and exclusion criteria of
the population will not necessarily allow generalizations of observed effects/trends
to the total population.
Ideally, the sample should be a perfect reflection of the total population of interest
This can only be realized by taking a completely random selection from the total
population
Imbalances for some variables then only occur in small samples, and by pure
chance.


[Figure: a random sample drawn from the population]

Taking a completely random sample is difficult in practice


Even if a completely random sample has been obtained, imbalances may still occur
due to various reasons:
. Selected subjects refuse to participate
. Subjects leave the study due to side effects
. ...


4.4 Statistics versus probability theory

4.4.1 Probability theory

Suppose it is known that a specific treatment is effective in 70% of the patients


receiving the treatment
This implies that the population consists of patients for whom the treatment is
not effective (30%) as well as patients for whom the treatment does have an
effect (70%)
If the treatment is administered to 100 randomly chosen patients, more than 70
may experience improvement, or less than 70


Question:
If 100 patients are given the treatment,
what is the probability that less than 60 of them
will experience an improvement ?
Probability theory aims at predicting the outcome of an experiment, knowing the
population
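As a hedged sketch of such a computation: if the 100 patients respond independently and the treatment is effective for 70% of patients, the number of improvements follows a binomial distribution, and the question asks for P(X < 60). The binomial model is the standard assumption for this situation; it is not named on the slide.

```python
# Sketch: P(fewer than 60 improvements among 100 independent patients),
# assuming a 70% improvement probability per patient.
from scipy.stats import binom

print(binom.cdf(59, n=100, p=0.7))   # P(X <= 59), roughly 0.01
```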


4.4.2 Statistics

Suppose one wants to investigate the effectiveness of a specific treatment in some


specific population
100 subjects, meeting the criteria of the population, receive the treatment
73 out of them experience considerable improvements
Question:
What is the efficacy rate in the total population ?
Statistics aims at drawing conclusions about the population, based on what
has been observed in the experiment


4.4.3 Conclusion

[Figure: statistics reasons from the random sample (73/100 treated patients improved) back to the efficacy rate in the population]

4.5 Types of studies

4.5.1 Introduction

There are various ways data can be collected


Which questions can be answered, and which ones cannot be answered, entirely
depends on how a specific data set arose
Also the strength of evidence depends on the method of data collection


4.5.2 Prospective versus retrospective study

Prospective: A group of people is followed for the occurrence or non-occurrence of specified endpoints or events or measurements
Examples:

. Treated patients are examined 1 month after the start of the treatment

. Cancer patients are followed after chemotherapy and the outcome of interest is the time until disease progression.

. Datasets weightgain, cancer, captopril

Retrospective: Subjects having a particular outcome or endpoint are identified and studied. Often, measurements from the past are of interest


Examples:

. A sample of subjects is questioned about their food intake during the last two days

. Cancer patients are questioned about potential exposure to pollution.

. Datasets on sickness absence, or cervical cancer


4.5.3 Experimental versus observational data

Experimental: The data are collected in a newly designed and conducted experiment
Examples:

. Rats are treated and measured afterwards, at pre-specified time points

. Cancer patients are followed after chemotherapy and the outcome of interest is the time until disease progression.

. Datasets Captopril, weightgain, cancer

Observational: The data are collected on a routine basis, and no new experiment is set up


Examples:

. Hospital records on all patients treated


. Data collected during yearly medical check-ups
. Dataset on sickness absence

It is often not clear from which population observational data can be believed to be sampled.
For example, consider observational data collected on a routine basis on all patients treated in a university hospital
For example, consider the data collected by an occupational health service, on a routine basis, during a specific year


Contrasting the percentages of observed subjects in the various occupational classes with those in the total Flemish population shows that the data cannot be believed to be a random sample from the Flemish population:
Data on females

  Occupational sector                              Collected data (%)   Flemish population (%)
  1. Agriculture / fisheries                              1.1                   2.7
  2. Energy / water                                       0.2                   0.5
  3. Minerals / chemistry                                 0.7                   2.2
  4. Metal / mechanical and optical industry              2.4                   4.8
  5. Other industry                                      10.7                  10.2
  6. Construction                                         0.6                   1.0
  7. Commerce / hotel / restaurant / bar                 16.4                  21.5
  8. Transportation / communication                       1.0                   3.6
  9. Bank / insurance / services w.r.t. companies         6.0                   8.3
  10. Other services                                     60.9                  45.2
  Missing                                                 6.1

4.5.4 Cross-sectional versus longitudinal study

Cross-sectional: Study participants are measured once, at a fixed (pre-specified)


moment in time
Examples:

. Study where blood pressure is related to subject characteristics such as bodyweight, food intake, living habits, . . .

. Study where the height of children at the age of 12 is related to the height of the parents
. Datasets on sickness absence, or cervical cancer
Longitudinal: The outcome of interest is measured repeatedly over time
Example: Captopril and Severe cold data

4.5.5 Clinical trial

A rigorously designed experiment aiming at finding the best treatment for future patients in a specific condition
All aspects are pre-specified in the study protocol
Typically, a group of patients is randomly allocated to one of a number of treatments, after
which the outcome(s) of interest are measured
These are the only studies that are accepted
by regulatory agencies that approve marketing
of treatments
The random allocation allows causal interpretation of observed treatment effects (see later)

Clinical trials are always prospective


Clinical trials are always experimental
Clinical trials can be of a longitudinal or a cross-sectional nature.


4.5.6 Cohort study

A well-defined group of subjects is followed over time, usually until a specific event happens.
Examples:

. Students who graduated in 2005


. All patients who received surgery with a specific technique, during a specific
period of time.

. Dataset cancer, dataset on severe colds


A birth cohort is a cohort of people all born in the same period, and is often used
to exclude effects from the fact that subjects lived in different periods.


For example, 40-year olds can be different from 20-year olds, just because the first 20 years of their lives were lived under completely different circumstances
In 20 years, the 20-year olds of today will not necessarily be equal to the 40-year olds of today.


4.5.7 Case-control study

Suppose we want to investigate the relation between smoking and lung cancer
One may select a group of smokers and a group of non-smokers, and follow them
for a (long) period of time
The outcome of interest is the incidence of lung cancer.
A potential dataset is:

                          Lung cancer
  Smoking status          Yes       No
  Yes                      42      203       245
  No                        7      114       121
                           49      317       366

Such a study would take a very long time to conduct, as one has to wait until a sufficiently high number of cancer cases has been observed.
One therefore often conducts a case-control study, in which a number of cases (cancers) and controls (non-cancers) are selected, who are questioned about their smoking behaviour in the past.
This potentially may lead to the same data.
Note that, since the number of sampled cases and controls is pre-defined, such a study design does not allow the estimation of the prevalence of lung cancer.
However, the case-control study does allow one to study the relation between risk factors and the prevalence of some disease (see later).


4.5.8 Matched case-control study

Suppose a case-control study is conducted to study the relation between smoking


behaviour and the incidence of lung cancer.
Suppose also that a strong relation is observed.
However, how should this association be interpreted if the smokers are much older
than the non-smokers, and/or if the group of smokers contains many more males
than the group of non-smokers ?
Maybe, the observed relation is indirectly induced by the differences in age and
gender.
Matched case-control studies make it possible to guarantee that cases and controls are exactly the same with respect to some important subject-specific characteristics.


For example, matching for age and gender can be done as follows:
. Sample the required number of cases

. For each case, select a control with the same age and gender as the case
In some situations, one may want to match multiple controls to each case, or
multiple cases to each control.
Ideally, one would like to match for as many factors as possible.
However, matching on too many factors complicates the search for appropriate
controls.


4.6 Random samples → variability → uncertainty

A sample needs to be taken randomly such that it well represents the total
population. Only then, valid conclusions can be drawn
Note however, that different random samples will include different subjects, with
different observations
Hence, each new random sample or, equivalently, each new experiment will lead to
(slightly) different conclusions, implying that, sometimes, wrong conclusions will
be drawn
Note that absolute certainty cannot be expected as conclusions are based on only
a small part (the sample) from the total, infinitely large, population.


Conclusion:

Statistics can prove everything

Statistics helps to:

. quantify the errors


. control the errors


Chapter 5
Causality and randomization

. Causal effects
. Methods of randomization
. Randomization not always possible


5.1 Causal effects

Suppose an experiment is set up to compare homeopathy (H) with placebo (P).


Two groups of patients are selected. One receives H, the other receives P.
(Double) blinding is necessary:

. Believers may overestimate the effect of H


. Non-believers may underestimate the effect of H

An observed difference between H and P does not necessarily imply H is (more)


effective, not even under double blinding


What if:

. H-group contains more females ?


. H-group is younger ?
. H-group contains better patients ?

Any difference between both groups, other than the treatment, may explain the
observed difference in efficacy.
In such cases a difference between H and P should not automatically be ascribed
to the treatment.
In general, an observed effect is not necessarily a causal effect in the sense that
the difference in treatment can be interpreted as the cause of the observed
difference in response.


The only way to assure treatment groups are comparable with respect to all
known and unknown factors is to assign treatments to subjects in a completely
random way.
This is randomization
Groups then only differ with respect to
the treatments they received
Small imbalances can occur by pure
chance, in small studies


Randomization is required whenever causal relations are to be shown:

Cause ⇒ Effect

5.2 Methods of randomization

5.2.1 Simple randomization

. Throwing coins or dice, spinning wheels, . . .

. Random number generators

Pre-generated lists should not be made available in advance, in order to guarantee blinding

5.2.2 Block randomization

Simple randomization usually does not lead to equal group sizes


Equal numbers of subjects in all groups can be obtained by simple randomization
as long as the required numbers have not been reached.
Once a group contains sufficient subjects, randomization is done over the
remaining groups only.
There then is a tendency for the last few subjects all to be in the same group,
implying that the assignment for the last subjects is not completely unpredictable
With block randomization, subjects are put in small equal-sized groups and,
within each block, equal numbers are allocated to the groups.
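A minimal sketch of such permuted-block randomization, for two treatments A and B and a block size of 4; the block size, treatment labels, and seed are illustrative choices only.

```python
# Sketch: permuted-block randomization for two treatments A and B,
# block size 4 (two A's and two B's per block, in random order).
import random

def block_randomization(n_subjects, block=("A", "A", "B", "B"), seed=1):
    random.seed(seed)
    allocation = []
    while len(allocation) < n_subjects:
        permuted = list(block)
        random.shuffle(permuted)          # random order within each block
        allocation.extend(permuted)
    return allocation[:n_subjects]

print(block_randomization(10))            # e.g. ['A', 'B', 'B', 'A', ...]
```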


Block randomization also implies approximately equal numbers at each moment


during the study.
This is very useful in situations where a training effect of the physician is to be
expected.


5.2.3 Stratified randomization

With relatively small studies, (serious) imbalance can be obtained by pure chance.
Stratified randomization can be used to ensure complete balance, at least with
respect to some measured important prognostic factors.
For example, suppose gender and age are believed to be strongly related to the
outcome of interest.
Ideally, the two treatment groups would have exactly the same age and gender
distribution.
This can be realized by using separate (block) randomization for each combination
of age with gender.
Hence, separate randomization lists are to be constructed for each combination of
age with gender.

In practice, one often would like to stratify for as many factors as possible.
However, stratification on too many factors may lead to many incomplete blocks
implying that the balance hoped for cannot be realized
Some extreme versions of stratified randomization are:

. Twin studies: both subjects are assigned randomly to the two treatments

. Cross-over studies: subjects receive all treatments but in a random order


. Pre-test post-test studies: all subjects are measured before as well as after
the treatment
(e.g., Captopril data, severe cold data)
. Both hands, feet, treated with different treatments, assigned at random.


5.3 Randomization is not always possible

5.3.1 Example 1

Suppose one wants to study the effect of chemotherapy in women, on the unborn
baby and its evolution after birth.
Ideally, one would randomize pregnant women into two groups
. Group 1: receives chemotherapy

. Group 2: no chemotherapy (placebo)


Obviously, this is ethically not possible.


In practice, for each pregnant woman getting chemotherapy, another pregnant


woman is searched for, who does not get chemotherapy, but who is comparable for
the most important known prognostic factors.
Often, the controls are taken from an earlier collected data set, and are therefore
called historical controls.
After birth, the children are followed and the outcomes of interest are measured
(e.g., IQ level at the age of 5), and the association with chemotherapy can be
studied.
Note that associations detected, should not be interpreted as causal


5.3.2 Example 2

Suppose one wants to study the relation between smoking and lung cancer.
Ideally, one would randomly subdivide subjects into two groups:

. Group 1: subjects have to smoke many cigarettes, daily, during many years.

. Group 2: subjects are not allowed to smoke


Obviously, this is ethically not possible.
As in the previous example, one could select a group of smokers, and a
comparable group of non-smokers
However, in order to be able to measure the occurrence of lung cancer in all these subjects, one would have to wait many years.


In such studies, one will select a group of cancer cases, and a comparable group of
non-cancer cases.
All subjects are questioned about their smoking behaviour in the past.
This still allows one to study the association between smoking and the occurrence of lung cancer
This is an example of a case-control study
Note that associations detected, should not be interpreted as causal


5.3.3 Implications

Imbalance with respect to some important prognostic factors cannot be ruled out
Imbalances with respect to measured known factors can be corrected for by
appropriate statistical techniques
However, as one cannot correct for the imbalance with respect to unknown or
unmeasured factors, causality can still not be concluded from such analyses.
For example, one will never be able to show any causality in the relation between
smoking and lung cancer.


Part III
Describing and summarizing data


Chapter 6
Types of outcomes

. Qualitative data
. Quantitative data


6.1 Qualitative data

Qualitative variables are not characterized by a numerical value


Further subdivision:

. Dichotomous: only 2 possible values: gender, survival


. Nominal: no ordering: color of hair, cause of death
. Ordinal: ordered outcome values: pain score (never to always)


6.2 Quantitative data

Quantitative: variables have values that are intrinsically numerical


Further subdivision:

. Discrete: the possible values are distinct and separated


. Examples: Number of particles emitted by a radio-active source, heart rate

. Continuous: values are within a continuous, uninterrupted range


. Examples: height, age, blood pressure
Note that continuous variables are always measured in a discrete way
Discrete variables with many possible values are often treated as continuous


Chapter 7
Graphical presentation of data

. One variable
. Multiple variables


7.1 Graphs of single qualitative variables

Bar plot or pie chart:


7.2 Graphs of single quantitative variables

Histogram:

Note that the choice of the intervals is crucial



Box (-Whiskers) plot:


7.3 Graphs of multiple qualitative variables

Categorized bar plot (similar for pie chart):


7.4 Graphs of multiple quantitative variables

Scatterplot:


Scatterplot with box plots (also with histograms)


Scatterplot matrix (also with box plots):


7.5 Graphs of mixed quantitative / qualitative variables

Categorized box plots:


Categorized histograms:


Bubble plot:


Chapter 8
Summary statistics

. Introduction
. Measures of location
. Measures of spread
. Percentages
. Geometric mean and standard deviation
. Missing data
. Graphical representation
. Examples from the biomedical literature


8.1 Introduction

[Figure: three distributions A, B, and C]

A and B have the same location but different spread
A and C have the same spread but different location

8.2 Measures of location

Location measures:
Where are the observations more or less located ?
As an example, consider the small sample:

1, 3, 3, 4, 5, 14

Sample average (sample mean):

x̄ = (x1 + x2 + x3 + x4 + x5 + x6)/6 = (1 + 3 + 3 + 4 + 5 + 14)/6
  = (x1 + . . . + xn)/n = (1/n) Σ_{i=1}^{n} xi = 5

The sample median is the middle observation:

1, 3, |3, 4|, 5, 14  →  median = (3 + 4)/2 = 3.5

The sample mode is the value that was observed most often:

1, 3, 3, 4, 5, 14  →  mode = 3
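The three location measures for this small sample can be checked with a few lines of code; a minimal sketch:

```python
# Sketch: location measures for the sample 1, 3, 3, 4, 5, 14
import statistics

sample = [1, 3, 3, 4, 5, 14]
print(statistics.mean(sample))    # 5
print(statistics.median(sample))  # 3.5 (average of the two middle values)
print(statistics.mode(sample))    # 3 (the most frequent value)
```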

Note that the sample average is very sensitive to outliers:

1, 3, 3, 4, 5, 14  →  5
1, 3, 3, 4, 5, 20  →  6
1, 3, 3, 4, 5, 26  →  7

This is not the case with the sample median:

1, 3, 3, 4, 5, 14  →  3.5
1, 3, 3, 4, 5, 20  →  3.5
1, 3, 3, 4, 5, 26  →  3.5

The mode is not always informative:

[Figure: a distribution for which the mode is not a useful location measure]

For symmetric data, the average and the median are the same. In general, they are not:

[Figure: a symmetric distribution (median = mean) and a skewed distribution (median and mean differ)]

With skewed data, the mean can be heavily influenced by the random presence of
a/some extreme observation(s).
In order to still get a good idea about the location of the data, one then prefers
the use of the median over the mean:

Symmetric data ⇒ Mean
Skewed data ⇒ Median

8.3 Measures of spread

Obviously, a measure of location only summarizes one specific aspect of the observed data:
Statistician drowning in a lake of average depth 0.5 m

Measures of spread:
How similar are the observations ?

[Figure: two samples with the same location but different amounts of spread]

As an example, re-consider the small sample:

1, 3, 3, 4, 5, 14

Mean deviation from the mean:

(1/n) Σ_{i=1}^{n} (xi − x̄) = (−4 − 2 − 2 − 1 + 0 + 9)/6 = 0/6 = 0

Mean quadratic deviation from the mean:

(1/n) Σ_{i=1}^{n} (xi − x̄)² = ((−4)² + (−2)² + (−2)² + (−1)² + 0² + 9²)/6 = 106/6 = 17.67

Sample variance:

s² = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)²
   = ((−4)² + (−2)² + (−2)² + (−1)² + 0² + 9²)/5 = 106/5 = 21.2

Note that the units of the sample variance and the mean quadratic deviation are the squared units of the original observations
The sample standard deviation is in the same units as the original observations:

s = √[ (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)² ] = √21.2 = 4.60

Sample range:

R = max_i xi − min_i xi = 14 − 1 = 13

Note that the range strongly depends on the sample size n: larger samples are more likely to contain extreme observations, hence are more likely to have a larger range
Since we hope that our measure of spread reflects the amount of variation in the population, we prefer a measure that does not depend on the sample size.
The sample interquartile range is the range obtained after deletion of the 25% highest and 25% lowest values in the sample:

1, 3, 3, 4, 5, 14  →  3, 3, 4, 5  →  IQR = 5 − 3 = 2
The interquartile range does not depend on the sample size n, since a larger
number of observations is deleted in larger samples.
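The spread measures above can be verified with a short sketch; the interquartile range is computed here by simply dropping the lowest and highest 25% of the observations, exactly as in the definition above.

```python
# Sketch: spread measures for the sample 1, 3, 3, 4, 5, 14
import statistics

sample = sorted([1, 3, 3, 4, 5, 14])
n = len(sample)

print(statistics.variance(sample))       # 21.2  (divides by n - 1)
print(statistics.stdev(sample))          # about 4.60
print(max(sample) - min(sample))         # range: 13

trimmed = sample[n // 4 : n - n // 4]    # drop lowest 25% and highest 25%
print(max(trimmed) - min(trimmed))       # IQR as defined above: 5 - 3 = 2
```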
The variance (hence also the mean quadratic deviation and the standard deviation) and the range are very sensitive to outliers:

1, 3, 3, 4, 5, 14  →  s² = 21.2,  R = 13
1, 3, 3, 4, 5, 20  →  s² = 48.8,  R = 19
1, 3, 3, 4, 5, 26  →  s² = 88.4,  R = 28

This is not the case with the interquartile range:

1, 3, 3, 4, 5, 14  →  IQR = 2
1, 3, 3, 4, 5, 20  →  IQR = 2
1, 3, 3, 4, 5, 26  →  IQR = 2

With skewed data, the standard deviation can be heavily influenced by the random
presence of a/some extreme observation(s).
In order to still get a good idea about the variation in the data, one then prefers
the use of the interquartile range over the standard deviation:
Symmetric data ⇒ Standard deviation
Skewed data ⇒ IQR

8.4 Percentages

Traditionally, measurements are summarized by a measure of location and a


measure of spread
However, suppose the variable of interest is sickness absence
For each subject i in the sample, we define xi as:

xi = 1 if subject i was absent due to illness, and xi = 0 otherwise

The sample average equals

x̄ = (x1 + x2 + . . . + xn)/n = (number of people with sickness absence)/n

Hence, the average equals the observed proportion (percentage) of people with
sickness absence
Note that, once the average is known, the original observations are known, hence
also the variability:
[Figure: three samples of 0/1 observations, with averages x̄ = 0.5, x̄ = 0.16, and x̄ = 0.84]

One can show that the variance is obtained as

s² = (n/(n − 1)) · x̄ · (1 − x̄)

Since the variance follows directly from the average, only the mean is reported, and no measure of spread
In general, measures of location and spread are only used for quantitative
(continuous) variables.
Other variables are described by observed frequencies and percentages.
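A quick numerical check of this identity, on a small illustrative 0/1 sample:

```python
# Sketch: for 0/1 data, the sample variance equals n/(n-1) * mean * (1 - mean)
import statistics

x = [1, 0, 1, 0, 0, 1]                   # illustrative 0/1 observations
n = len(x)
xbar = statistics.mean(x)                # 0.5

print(statistics.variance(x))            # 0.3
print(n / (n - 1) * xbar * (1 - xbar))   # 0.3 as well
```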


For example, the variables sickness absence and cancer type could be summarized as follows:

  Variable                      (n = 256)
  Sickness:      Yes            103 (40.23%)
                 No             153 (59.77%)

  Cancer type:   Breast          79 (30.86%)
                 Stomach         26 (10.16%)
                 Bronchus        83 (32.42%)
                 Colon           58 (22.66%)
                 Ovary           10 (3.90%)

8.5 Geometric mean and standard deviation

The mean and standard deviation are used to describe symmetric data
In case of skewness, alternatives such as median and IQR are used
An alternative is to transform the original data such that the transformed
observations are symmetric
A special, frequently occurring, case is when symmetry is obtained using a
logarithmic transformation
As an example, we consider the survival times of cancer patients, and we restrict
to the patients with stomach cancer


Summary statistics:

However, the histogram of the observations suggests skewness:


Often, skewness in the direction of the large values can be solved with a
logarithmic transformation:
X = survival time  →  Y = ln(X) = ln(survival time)

  Stomach
     X     Y = ln(X)
   124       4.82
    42       3.74
    25       3.22
    45       3.81
   412       6.02
    51       3.93
  1112       7.01
    46       3.83
   103       4.63
   876       6.78
   146       4.98
   340       5.83
   396       5.98

Assessing symmetry is difficult due to the small number of observations. However, the evidence against symmetry is much weaker now, and use of the mean and standard deviation seems justified for the description of the Y-values.
Often this mean and standard deviation are back-transformed to the original units, leading to the geometric mean and standard deviation:

  Outcome                  Stomach cancer, mean (stand. dev.)
  Survival time (days)     144.03 (3.49), i.e., geometric mean = exp(4.97) and geometric standard deviation = exp(1.25)

These geometric means and standard deviations are very different from the arithmetic mean and standard deviation that were reported before (arithmetic mean: 286 days).
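A sketch of this back-transformation for the 13 stomach-cancer survival times listed above:

```python
# Sketch: geometric mean and geometric standard deviation of the
# stomach-cancer survival times (back-transformed mean and SD of ln(X)).
import math
import statistics

x = [124, 42, 25, 45, 412, 51, 1112, 46, 103, 876, 146, 340, 396]
y = [math.log(v) for v in x]

print(math.exp(statistics.mean(y)))    # geometric mean, about 144
print(math.exp(statistics.stdev(y)))   # geometric standard deviation, about 3.5
print(statistics.mean(x))              # arithmetic mean: 286
```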


8.6 Missing data

Sometimes, not all observations are available
For example, for some of the subjects in the sample, sickness absence was not measured.
Summarizing this variable can be done in two ways:
  Variable              (n = 203)
  Sickness:   Yes       103 (50.74%)
              No        100 (49.26%)

  Variable              (n = 256)
  Sickness:   Yes       103 (40.23%)
              No        100 (39.07%)
              Missing    53 (20.70%)

Not accounting for missing observations results in misleading summary statistics



8.7 Graphical representation of summary statistics

Boxplots (with 1%, 25%, median, 75%, and 99% percentiles)


Means and standard deviations


8.8 Examples from the biomedical literature

Boushey et al. [8], Figure 2:

. Bar plot
. Categorized for 3 groups


Marlow et al. [9], Figure 1:

. Scatterplot
. Jittered to avoid overlapping symbols
. Categorized for 4 × 2 groups
. Lines are probably means


Wong et al. [10], Table 1 (first part):

. Means and standard deviations


. Medians and IQRs
. Percentages


Blanchon et al. [11], Table 1 (parts):

. Categorization of continuous
variables
. Explicit acknowledgement of
missing values


Kellett, Kellett, and Nordholm [12], Table 2:

. Means and standard deviations


. Variables are NOT symmetrically distributed


Wu [13], Figure 2:

. Averages for 3 × 3 combinations of box size and frequency of lifting
. No indication of variability


Two completely different hypothetical scenarios for variability:


Nawrot et al. [14], Table 1:

. Geometric means
. IQR instead of geometric
standard deviations


Part IV
Basic concepts of statistical inference


Chapter 9
Describing the population

. Stochastic variable
. Discrete probability distribution
. Continuous probability distribution
. Summary characteristics for probability distributions
. The normal distribution


9.1 Stochastic variable

Suppose a random sample of size n = 321 from a specific population is available,


and interest is in the outcome BMI
The outcome variable BMI is often denoted as X, while the n = 321 observations
are usually denoted as
x1 , x2 , . . . , x321
The variable X is a random or stochastic variable since the value that it takes
is subject to chance
Indeed, if one randomly selects one subject from the population, the BMI of that
subject cannot be predicted, and entirely depends on which subject has been
selected


At most, one can say that, e.g., it is more likely that this subject will have a BMI
between 20 and 25 than a BMI larger than 35
So, the realized value of X depends on random variability
Our sample x1, x2, . . . , x321 can be considered as n = 321 realizations of the same random variable X, one for each subject i, i = 1, 2, . . . , 321.
Drawing the sample can be viewed as performing n = 321 small experiments, each time selecting one subject and measuring this subject's BMI, leading to the realized value xi of X
How likely it is to observe certain values or certain ranges of values is described by
the probability distribution
Similar to the classification of observations, one can classify random variables as
qualitative, quantitative, discrete, continuous,. . .

Other examples:

  Experiment                 Random variable                          Type of variable
  Selecting one Belgian      . Weight                                 . Quantitative, continuous
                             . Height                                 . Quantitative, continuous
                             . Gender                                 . Qualitative, dichotomous
  Throwing a die             . Number of throws until first 6         . Quantitative, discrete
                             . Number of times a 6 was thrown,
                               out of 10 trials                       . Quantitative, discrete
  Selecting n = 321 people   . Percentage of women                    . Quantitative, disc./cont. ?
                             . Average age                            . Quantitative, continuous
                             . Number of cancer cases                 . Quantitative, disc./cont. ?

9.2 Discrete probability distribution

A discrete probability distribution describes how likely it is to observe specific


values for a discrete random variable.
Suppose X is the random variable sickness absence:

X = 1 if absence due to illness, and X = 0 otherwise

X can only take the values 0 and 1


The probability distribution of X describes the probability of observing a 0 or a 1,
respectively


These probabilities are the percentages of 0s and 1s one would observe if the experiment were repeated over and over again.
Hence, we need to describe the observations one would observe in an experiment of size n = +∞
We will do this in exactly the same way as discrete observations were described before, i.e., using the bar plot:

[Figure: bar plot with bars π0 = P(X = 0) and π1 = P(X = 1) for sickness absence X, with π0 + π1 = 1]

π0 is the probability of observing a 0, P(X = 0)
π0 is the proportion of 0s one would observe in a sample of size n = +∞
π1 is the probability of observing a 1, P(X = 1)
π1 is the proportion of 1s one would observe in a sample of size n = +∞
A discrete distribution for a discrete random variable X describes what values X can have, and what the associated probabilities are to observe those values.

For example, if the experiment is to throw a die, and X is the result of one throw,
then the probability distribution of X is given by:

     xi :                 1     2     3     4     5     6
     πi = P (X = xi):    1/6   1/6   1/6   1/6   1/6   1/6

Graphically:

[Figure: bar plot with π1 = π2 = π3 = π4 = π5 = π6 = 1/6, for the result of throwing a die (X)]


Introduction to Biostatistics

159

For example, suppose there is equal probability for a newborn to be male or
female. Let X be the number of boys in a family of 5 children, then the
probability distribution of X is given by:

     xi :                   0        1        2        3        4        5
     πi = P (X = xi):    0.0312   0.1563   0.3125   0.3125   0.1563   0.0312

Graphically:

[Figure: bar plot of these probabilities, for the number of boys (X)]


Introduction to Biostatistics

160

Many frequently used distributions are given a name:


. Sickness absence example: Bernoulli distribution
. Die example: Multinomial distribution
. Gender example: Binomial distribution
Other frequently used discrete distributions are the Poisson, the geometric, the
hypergeometric, the beta-binomial, the negative binomial, . . . , distributions.
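As a quick numerical check of the "number of boys" table above, the binomial probabilities can be reproduced directly. This is only an illustrative sketch; it assumes Python with scipy, which is not part of the course material:

    from scipy.stats import binom

    n, p = 5, 0.5                      # 5 children, equal probability of a boy
    for k in range(6):                 # possible numbers of boys: 0, 1, ..., 5
        print(k, round(binom.pmf(k, n, p), 4))
    # prints 0.0312, 0.1562, 0.3125, 0.3125, 0.1562, 0.0312 (the table values, up to rounding)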

Introduction to Biostatistics

161

9.3

Continuous probability distribution

A continuous probability distribution describes how likely it is that a continuous


random variable takes values within certain ranges
Suppose X is the random variable BMI
How can we describe which ranges of values of X are likely to be observed, and
which ranges of values are less likely to be observed ?
For discrete variables this was done by generalizing the bar plot to an infinitely
large sample (= the population)
The same idea is now used for continuous variables: How can the histogram be
generalized to an infinitely large sample (= the population) ?

Introduction to Biostatistics

162

To study this, we draw samples from our population, and study the behaviour of
the histogram of BMI, when the sample size increases
Six samples will be drawn, with sample sizes:
. n = 10
. n = 20
. n = 50
. n = 100
. n = 500
. n = 5000
For each sample, the histogram of the observed BMI values is constructed
We will use histograms with interval width equal to 1
Introduction to Biostatistics

163

For samples of size n = 10 and n = 20:

Introduction to Biostatistics

164

For samples of size n = 50 and n = 100:

Introduction to Biostatistics

165

For samples of size n = 500 and n = 5000:

Obviously, the obtained histogram becomes smoother as the sample size increases

Introduction to Biostatistics

166

Eventually, the histogram becomes a smooth function, f (x), called the density
function of the random variable X:

Introduction to Biostatistics

167

Since we started from histograms with interval width equal to 1, all histograms had
the property that the total surface of a bar represented the proportion of
observations in the corresponding interval
This property is now carried over to the density function:

The probability of observing a value for X between a and b equals the surface
below f , between a and b:

[Figure: density curve f (x), with the shaded area between a and b representing P (a ≤ X ≤ b)]

Introduction to Biostatistics
168

This also implies that the total surface below f is equal to 1


Note that f completely defines what values are possible for X and how likely
these values are to occur.
Hence, f has the same properties as the discrete distributions discussed before
Many continuous distributions exist, all defined by a specific density function
Moreover, every positive function f with total surface below the function equal to
1 defines a probability distribution
The calculation of probabilities requires computation of surfaces below f , hence of
integrals.
This can be done using tables, or computer packages.
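For instance, such a surface below f can be computed by numerical integration. A minimal sketch, assuming Python with numpy/scipy (any statistical table or package works equally well):

    import numpy as np
    from scipy.integrate import quad

    def f(x):
        # density of a standard normal distribution, used here as an example of a density f
        return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

    a, b = -1.0, 1.0
    prob, _ = quad(f, a, b)        # numerical integral of f between a and b
    print(round(prob, 4))          # approximately 0.6827 = P(a <= X <= b)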
Introduction to Biostatistics

169

For example, the package StaTable can be freely downloaded from:

http://www.cytel.com/Products/StaTable/

Some frequently used continuous distributions are the normal, the t, the
chi-squared (χ²), the F , the gamma, the beta, . . . , distribution.
Some lend their name to specific statistical techniques, such as t-tests, F -tests,
chi-squared tests, . . .

Introduction to Biostatistics

170

9.4

Summary characteristics for probability distributions

The probability distribution can be viewed as an extension of the bar plot and the
histogram to the total population, or equivalently, an infinite sample
It describes how likely specific values are to be observed when randomly drawing
from the population
Similarly, we can now define measures of location and spread for the total
population.
These are the measures of location and spread one would observe if the total
population would be measured, i.e., in an infinite sample

Introduction to Biostatistics

171

These measures are usually denoted with Greek letters, e.g.,

. population average: μ
. population variance: σ²

Note that, similarly to the probability distribution, the population versions of the
measures of location and spread are theoretical concepts, as one will never
observe them, or measure them.
Indeed, in practice, one only observes a finite sample, from which it is possible to
calculate the sample-based versions, such as the sample average x and the sample
variance s²:

                    Population              Sample
                 (never observable)       (observable)

  Location:             μ                      x
  Spread:               σ²                     s²

Introduction to Biostatistics
172

9.5

The normal distribution

The most frequently used distribution in statistics is the Normal or Gaussian


distribution
It has density function f (x) equal to

     f (x) = 1/√(2πσ²) · exp( -(x - μ)² / (2σ²) )

It depends on two parameters μ ∈ IR and σ² > 0, which are the (population)
mean and variance.
If a random variable X is normally distributed with mean μ and variance σ², this
is denoted as X ∼ N (μ, σ²)
Introduction to Biostatistics

173

Always symmetric around the mean μ

For μ = 0 and σ² = 1, we have the standard normal distribution:
X ∼ N (0, 1)

Any normal can be transformed into a standard normal, and vice versa:

     X ∼ N (μ, σ²)  =⇒  (X - μ)/σ ∼ N (0, 1)

     X ∼ N (0, 1)   =⇒  μ + σX ∼ N (μ, σ²)

Introduction to Biostatistics

174

P (μ - σ ≤ X ≤ μ + σ)          = P (-1 ≤ (X - μ)/σ ≤ 1)         = 68.27%

P (μ - 1.96σ ≤ X ≤ μ + 1.96σ)  = P (-1.96 ≤ (X - μ)/σ ≤ 1.96)   = 95%

P (μ - 2σ ≤ X ≤ μ + 2σ)        = P (-2 ≤ (X - μ)/σ ≤ 2)         = 95.45%

P (μ - 3σ ≤ X ≤ μ + 3σ)        = P (-3 ≤ (X - μ)/σ ≤ 3)         = 99.73%
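These four probabilities can be verified numerically with the standard normal distribution. A minimal sketch, assuming Python with scipy (a statistical table gives the same values):

    from scipy.stats import norm

    for z in [1, 1.96, 2, 3]:
        prob = norm.cdf(z) - norm.cdf(-z)     # P(mu - z*sigma <= X <= mu + z*sigma)
        print(z, f"{100 * prob:.2f}%")
    # prints 68.27%, 95.00%, 95.45%, 99.73%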

Introduction to Biostatistics

175

The normal distribution is very popular because:

. Many stochastic processes follow normal distributions, or can be well
  approximated by normals
. Statistical theory shows that the normal is often a reasonable approximation
. The parameters in the normal distribution have a natural interpretation,
  since they represent the mean and the variance in the population.

Introduction to Biostatistics

176

Chapter 10
From the population to the sample, and back to the
population

. From the population to the sample


. From the sample to the population
. Example
. Normal values

Introduction to Biostatistics

177

10.1

From the population to the sample

The probability distribution describes how likely specific values are to be observed
when randomly drawing from the population
Also, the probability distribution summarizes how the data in an infinitely large
sample would be distributed.
Hence, when a sufficiently large random sample is drawn from that population,
one expects the observed histogram to be close to the probability distribution.
This is probability theory

Introduction to Biostatistics

178

[Diagram: POPULATION with arrows to several SAMPLEs, labelled RANDOM, RANDOM, and NOT RANDOM]

Introduction to Biostatistics

179

10.2

From the sample to the population

In statistics, the observations in the sample are used to learn about the
population.
Obviously, in order for the sample to tell us something about the population, the
sample needs to be drawn randomly
This procedure, in which information from the sample is used to draw conclusions
about the population, is called statistical inference or estimation

Introduction to Biostatistics

180

[Diagram: SAMPLE → POPULATION, labelled STATISTICAL INFERENCE AND ESTIMATION]
Introduction to Biostatistics

181

Introduction to Biostatistics

182

[Diagram: POPULATION (distribution of X) and random SAMPLE (histogram of the xi); inference and estimation go from the histogram back to the (unknown) distribution of X]

Introduction to Biostatistics
183

10.3

Example: BMI

Suppose a random sample of n = 2605 Belgian males is available, and the


outcome X of interest is the body mass index (BMI).
Suppose interest is in estimating the percentage of people with overweight
(BMI> 25) and with obesity (BMI> 30) respectively.
Summary statistics are:

One way to proceed is to estimate the distribution of X, from which probabilities


can be calculated
Introduction to Biostatistics

184

Histogram of observed values:

Obviously, BMI is not symmetrically distributed.


Due to the flexibility of the normal distribution, one often approximates
non-normal distributions with a normal one, after appropriate transformation.
Introduction to Biostatistics

185

Which transformation(s) is/are appropriate depends on the shape of the original


histogram
In case of skewness with a tail towards large values, a transformation with
negative curvature is needed, such that large values are transformed towards the
smaller ones, e.g., ln(x), √x, . . .
In case of skewness with a tail towards small values, a transformation with positive
curvature is needed, such that large values are transformed away from the smaller
ones, e.g., exp(x), x², . . .
Sometimes, x needs to be rescaled or shifted before applying any of the above
transformations, e.g., ln(1 + x) in case x = 0 is possible
One should always check whether the transformed values are approximately
normally distributed

Introduction to Biostatistics

186

[Figure: histogram shapes with corresponding possible transformations]

Introduction to Biostatistics

187

For our BMI example, a logarithmic transformation proves helpful:

We can now approximate the distribution of Y = ln(X) by a normal one.

Which normal distribution N (μ, σ²) ?

Introduction to Biostatistics

188

We need to decide what values for μ and σ² will be used.

μ and σ² are the unknown mean and variance of the distribution of Y , i.e., these
are the average and variance, respectively, one would observe in an infinitely large
sample of observations for Y .
We therefore estimate these parameters based on the observed values in the
sample, assuming that the sample was large enough to yield a sufficiently good
approximation for μ and σ²
The summary statistics for the log-BMI observations are:

Introduction to Biostatistics

189

Hence, our estimates for μ and σ² will be 3.21 and 0.15²

If the resulting normal distribution N (3.21, 0.15²) is close to the true one, we
expect the histogram of our random sample of Y values to be close to the normal
density.
This can be graphically checked by adding the density of this normal distribution
to our histogram of Y values:

Introduction to Biostatistics

190

From now on, Y = ln(BMI) will be assumed normally distributed, with mean 3.21
and variance 0.15², and probabilities of interest can be calculated
For example, the proportion of males in the population with overweight
(BMI > 25) equals:

     P (X > 25) = P (Y > ln(25)) = P ( N (3.21, 0.15²) > 3.22 ) = 0.4734

Similarly, the proportion of obese males in the population (BMI > 30) equals:

     P (X > 30) = P (Y > ln(30)) = P ( N (3.21, 0.15²) > 3.40 ) = 0.1026

Note that the last step in the above equations is obtained from using statistical
tables or computer programs.
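These two probabilities can also be reproduced numerically. A minimal sketch, assuming Python with scipy, and using the estimates μ = 3.21 and σ = 0.15 quoted above:

    import numpy as np
    from scipy.stats import norm

    mu, sigma = 3.21, 0.15
    p_overweight = 1 - norm.cdf(np.log(25), loc=mu, scale=sigma)   # P(Y > ln(25))
    p_obese      = 1 - norm.cdf(np.log(30), loc=mu, scale=sigma)   # P(Y > ln(30))
    print(round(p_overweight, 4), round(p_obese, 4))
    # approximately 0.47 and 0.10; the slide values 0.4734 and 0.1026 differ slightly
    # because ln(25) and ln(30) were rounded to 3.22 and 3.40 there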

Introduction to Biostatistics

191

[Diagram: POPULATION (distribution of BMI, with P (X > 25) = 47.34% and P (X > 30) = 10.26%) estimated from the random SAMPLE (histogram of the xi) via inference and estimation]

Introduction to Biostatistics

192

10.4

Example: Normal values

Normal, or reference, values are often used in the reporting of clinical test results
95% normal values are the values c1 and c2 such that 95% of the total population
falls in between those values:

[Figure: distribution with 95% of the total population between the values c1 and c2]

With clinical test results, the normal values are with respect to the normal
(healthy) population
Introduction to Biostatistics

193

Example:

Introduction to Biostatistics

194

The probability that a randomly selected, healthy, patient has a value within the
95% normal values is by definition 95%.
When two independent parameters are measured, the probability that a randomly
selected, healthy, patient has both parameters within the respective 95% normal
value ranges equals
P (Both parameters within the 95% normal range) = 0.95 × 0.95 = 0.9025
Hence, combining two sets of 95% normal values leads to region which contains
only 90.25% of the total population.
In general, one has :
P (k parameters within the 95% normal range) = 0.95k

Introduction to Biostatistics

195

Some values:

    k        0.95^k
    1        0.9500
    2        0.9025
    5        0.7738
   10        0.5987
   20        0.3585
   50        0.0769
  100        0.0059

Hence, for 100 tests, we have almost certainty that at least one parameter will
take a value outside its 95% normal range.
Obviously, one can use higher percentages (e.g., 99% instead of 95%), but the
problem of multiple testing remains.
Note that the above calculations assume the tested parameters to be independent.
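The column 0.95^k above can be reproduced directly; a minimal sketch in Python (illustrative only):

    for k in [1, 2, 5, 10, 20, 50, 100]:
        print(k, round(0.95 ** k, 4))
    # e.g. 10 -> 0.5987 and 100 -> 0.0059: with 100 independent tests it is almost
    # certain that at least one value falls outside its 95% normal range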
Introduction to Biostatistics

196

For example, suppose that a normal value for parameter 1 always leads to a
normal value for parameter 2, and vice versa, we would have that
P (Both parameters within the 95% normal range)
= P (The first parameter within the 95% normal range) = 0.95
Alternatively, suppose that a normal value for parameter 1 always leads to a
value for parameter 2 which is outside its normal range, we would have that
P (Both parameters within the 95% normal range) = 0
Conclusion:
Normal values need to be interpreted with extreme caution

Introduction to Biostatistics

197

Chapter 11
Estimation, sampling variability, bias, and precision

. Estimation
. Example
. Sampling variability
. Bias and precision
. Sampling distribution of the sample average
. Standard error of the mean

Introduction to Biostatistics

198

11.1

Estimation

One does not always have to estimate the complete distribution of a random
variable X
Often, interest is in specific characteristics of the distribution, such as the
population average μ
One can then try to draw conclusions about μ, based on the observed data in the
sample, without having to specify the distribution of X
Since the population characteristics (mean, median, variance, . . . ) are summary
statistics in an infinitely large sample, it is natural to estimate them using the
sample versions.

Introduction to Biostatistics

199

Summarized:

  Population parameter        Estimate from sample

  μ                           μ̂ = x
  σ²                          σ̂² = s²
  population median           sample median
  population IQR              sample IQR
  ...                         ...

Note that it is very unlikely that the estimate is identical to the parameter it is
estimating
How close the estimate will be to the true value depends on various aspects.
Some key results will be explained in the next sections
Introduction to Biostatistics

200

[Diagram: POPULATION (distribution of X, with characteristics of f (x) such as μ and σ²) and a random SAMPLE (histogram of the xi); inference and estimation yield the estimates μ̂ = x and σ̂² = s²]

Introduction to Biostatistics

201

11.2

Example: BMI

Re-consider the random sample of n = 2605 Belgian males, with the outcome X
of interest being the body mass index (BMI):

Introduction to Biostatistics

202

We previously described the distribution of BMI with a normal distribution for the
log-transformed values, and we estimated the percentage of people in the
population with overweight (BMI > 25) to be 47.34%
Note that this percentage equals π = P (X > 25), which is a characteristic of the
BMI distribution in the Belgian male population
We can estimate this by the observed proportion π̂ of males in the sample with
overweight:

     π̂ = (number of males with xi > 25) / 2605 = 46.99%

Note that π̂ is a new estimate for π = P (X > 25), which does not require
estimating the whole distribution of BMI in the total population.
If our new estimate would have been very different from the previous one
(47.34%), this would have been some indication that our estimation of the BMI
distribution was not accurate.
Introduction to Biostatistics

203

[Diagram: POPULATION (distribution of BMI, with characteristic π = P (X > 25)) and a random SAMPLE (histogram of the xi); inference and estimation yield π̂ = observed proportion of males with BMI > 25]

Introduction to Biostatistics

204

11.3

Sampling variability

Suppose interest is in the estimation of some characteristic θ of the distribution of
a specific random variable X
θ could be the mean μ, the variance σ², but for example also the percentage of
people with X > 25.
Based on a random sample, an estimate θ̂ for θ can be obtained, for example:

                      Population              Sample
                   (never observable)       (observable)

  Mean:                   μ                   μ̂ = x
  Variance:               σ²                  σ̂² = s²
  In general:             θ                   θ̂

Introduction to Biostatistics

205

The estimate θ̂ is calculated from the observed data, hence the resulting value for
θ̂ completely depends on the sample that was drawn from the population.
Repeating the experiment would lead to another sample, other observations, thus
also to another estimate θ̂ for θ.
The estimate θ̂ can therefore be interpreted as one realized value of a random
variable θ̂
The distribution of θ̂ is called the sampling distribution of θ̂. It describes what
values of θ̂ are to be expected should the experiment be repeated many times.
In general, the sampling distribution of θ̂ depends on:

. The statistic θ̂: different for mean, median, variance, . . .
. The distribution of the original data (i.e., of X)
. The sample size
Introduction to Biostatistics

206

11.4

Bias and precision

Suppose, as before, that interest is in some characteristic θ of the distribution of
X, and that an estimate θ̂ is available
θ̂ can then be interpreted as one observation of the random variable θ̂
The sampling distribution of θ̂ is important as it reflects how likely it is that an
estimate θ̂ would be obtained which is far away from the true value θ.

Introduction to Biostatistics

207

Example:

Distribution of θ̂

. Asymmetric
. Unlikely to have serious underestimation
. Likely to have serious overestimation
. On average, our estimate will be correct

Example:

Distribution of θ̂

. Symmetric
. Under- and overestimation equally likely
. On average, our estimate will be correct

Introduction to Biostatistics

208

Example:

Distribution of θ̂

. Symmetric
. Under- and overestimation equally likely
. On average, our estimate will be correct
. Very precise estimation of θ

Example:

Distribution of θ̂

. Symmetric
. Under- and overestimation equally likely
. On average, our estimate will be correct
. A lot of uncertainty about θ

Introduction to Biostatistics

209

Example:

Distribution of θ̂

. Symmetric
. On average, our estimate is not correct
. We have a biased estimator θ̂ for θ

In general we prefer unbiased estimators, with as much precision as possible.

We will now investigate the sampling distribution of the sample average θ̂ = X̄

Introduction to Biostatistics

210

11.5

Sampling distribution of the sample average

Suppose interest is in the estimation of the mean μ of some random variable X

Based on a random sample, μ will be estimated by the sample average x, which is
one realized value of the random variable X̄
Questions:

. When is the sample average biased ?
. When is the sample average precise ?

As discussed before, the sampling distribution of X̄ will depend on the distribution
of the original data X, as well as on the sample size n
We will therefore simulate the sampling distribution of X̄, under various settings
Introduction to Biostatistics

211

Simulation steps:

. Randomly generate n observations from the distribution of X: x1, . . . , xn
. Based on this data set, calculate the sample average x
. Repeat the above steps many times (e.g., 1000 times)
. Study the histogram of the (1000) realized averages

  Sample 1     x(1)1, x(1)2, x(1)3, . . . , x(1)n     x(1)
  Sample 2     x(2)1, x(2)2, x(2)3, . . . , x(2)n     x(2)
  Sample 3     x(3)1, x(3)2, x(3)3, . . . , x(3)n     x(3)
  Sample 4     x(4)1, x(4)2, x(4)3, . . . , x(4)n     x(4)
  Sample 5     x(5)1, x(5)2, x(5)3, . . . , x(5)n     x(5)
  ...          ...                                    ...
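A minimal sketch of these simulation steps in Python (the course itself uses the Vestac applet; numpy is assumed here purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n, n_repeats = 10, 1000

    averages = [rng.normal(loc=5, scale=1, size=n).mean()   # one sample of size n from N(5, 1)
                for _ in range(n_repeats)]                  # repeated 1000 times

    print(round(np.mean(averages), 2))   # close to the true mean 5 (unbiasedness)
    print(round(np.std(averages), 2))    # close to sigma/sqrt(n) = 1/sqrt(10), about 0.32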


Introduction to Biostatistics

212

The following scenarios will be used for the distribution of X:

. X ∼ N (5, 1), true mean μ = 5
. X ∼ χ²4, true mean μ = 4
. X ∼ Bernoulli(1/5), true mean μ = 0.2
. X ∼ Poisson(1), true mean μ = 1

The following scenarios will be used for the sample size n:

. n = 2
. n = 5
. n = 10
. n = 50

Calculations: Vestac Java Applet → basics → distribution of mean
Introduction to Biostatistics

213

Results for X ∼ N (5, 1) (μ = 5):

Introduction to Biostatistics

214

Results for X ∼ χ²4 (μ = 4):

Introduction to Biostatistics

215

Results for X ∼ Bernoulli(1/5) (μ = 0.2):

Introduction to Biostatistics

216

Results for X ∼ Poisson(1) (μ = 1):

Introduction to Biostatistics

217

General conclusions: For large samples, the sampling distribution of X̄ becomes . . .

. . . . symmetric around the true value for μ
. . . . more concentrated around the true value for μ
. . . . normally distributed

One can prove the following theoretical result:

     For any random variable X with mean μ and variance σ²,
     and for n sufficiently large,

          X̄ ≈ N (μ, σ²/n)

This is the Central Limit Theorem (CLT), which will be the basis for most
calculations from now on
Introduction to Biostatistics

218

This implies that, for sufficiently large samples, x is an unbiased estimate for μ,
which becomes more precise as the sample gets larger
What is sufficiently large ? The simulation results have shown that this entirely
depends on the distribution of the original data X. Hence, no generally valid
answer can be given.
One can also use similar simulation studies to investigate the sampling distribution
of other statistics such as the median, the variance, . . . . However, no general
results can be derived as in the CLT
For the variance, such simulations (Vestac Java Applet → basics → distribution
of variance) show that . . .

. . . . the sample variance s² is unbiased for σ²
. . . . the precision of s² increases with n

Introduction to Biostatistics

219

These results (and the CLT) are the key motivation for conducting large studies,
since collecting additional information (more observations, larger sample) will lead
to increased precision in the estimation:

One can buy extra precision with extra observations

Introduction to Biostatistics

220

11.6

The standard error of the mean

It follows from the CLT that the standard deviation of X̄ is equal to σ/√n.

This standard deviation is also called the standard error of the mean (s.e.m.)
The s.e.m. reflects the precision in the estimation of μ by x
The s.e.m. is often presented in publications and reports, as an indication of how
precise one's conclusions are.
As an example, consider summarizing/describing the BMI values for a number of
different professions

Introduction to Biostatistics

221

Average ± standard deviation

. Describing location in samples
. Describing spread in samples
. Meaningful for symmetric distributions only

Average ± s.e.m.

. Describing location in samples
. Describing precision of location estimation
. Always meaningful since X̄ normal for sufficiently large n
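The distinction can be made concrete on a small, purely hypothetical sample of BMI values; a minimal sketch assuming Python with scipy (the data below are invented for illustration only):

    import numpy as np
    from scipy import stats

    bmi = np.array([22.1, 25.4, 27.8, 23.0, 26.5, 24.2, 29.1, 21.7])  # hypothetical values

    print(round(bmi.mean(), 2))          # sample average
    print(round(bmi.std(ddof=1), 2))     # sample standard deviation s: spread of the data
    print(round(stats.sem(bmi), 2))      # s.e.m. = s / sqrt(n): precision of the average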

Introduction to Biostatistics

222

Chapter 12
Confidence intervals

. Example
. The confidence interval
. Interpretation
. Properties of confidence intervals
. Example
. Example from the biomedical literature

Introduction to Biostatistics

223

12.1

Example: Captopril data

Consider the Captopril data, where blood pressure was taken in 15 hypertensive
patients, before and after administration of the drug Captopril:

Interest is in estimating the average change in diastolic BP.


Introduction to Biostatistics

224

Let X be the difference in diastolic BP before and after treatment:

     X = BPbefore - BPafter

The observed values xi for X can be calculated from the observed values of the
BP in our sample:

Introduction to Biostatistics

  Patient    Before DBP    After DBP    Change xi

     1          130           125            5
     2          122           121            1
     3          124           121            3
     4          104           106           -2
     5          112           101           11
     6          101            85           16
     7          121            98           23
     8          124           105           19
     9          115           103           12
    10          102            98            4
    11           98            90            8
    12          119            98           21
    13          106           110           -4
    14          107           103            4
    15          100            82           18

225

Note that, in relatively small samples, the histogram can be difficult to interpret.
One therefore prefers not to estimate the complete distribution of X
On the other hand, there does not seem to be strong evidence for severe skewness.
Focus will be on the estimation of the average μ of X. As before, our estimate
will be the sample average:

     μ̂ = x = 9.27

Since every other sample would have led to another estimate μ̂, it is of interest
to know how likely it is that our estimate is far from the true value μ

We want to derive an interval around our estimate μ̂ = 9.27 which is very likely to
contain the true value μ

Introduction to Biostatistics

226

[Diagram: POPULATION (distribution of X, with characteristic μ = average change in diastolic BP) and the random SAMPLE (histogram of the xi); inference and estimation yield μ̂ = x = 9.27]

Introduction to Biostatistics

227

12.2

The confidence interval

The CLT describes what values for x are to be expected if one would repeatedly
draw new samples. If n is sufficiently large, we have that:

     X̄ ≈ N (μ, σ²/n)

So, thanks to the CLT, we can calculate how likely it is to have an estimate far
from the correct value μ, or close to the correct value μ
Introduction to Biostatistics

228

In our example, n = 15 which is rather small. However, since the distribution of X
is relatively symmetric, n = 15 is probably sufficiently large for the CLT to apply.
Let us calculate the probability that a random sample would yield an estimate x
which is less than 1 unit apart from μ:

     P (-1 ≤ X̄ - μ < 1) = P ( -1/√(σ²/n) ≤ (X̄ - μ)/√(σ²/n) < 1/√(σ²/n) )

As always, σ² is estimated by s² = 74.21, and n = 15, so we have:

     P (-1 ≤ X̄ - μ < 1) = P (-0.45 ≤ N (0, 1) < 0.45) = 35%

Hence, a random sample will in 35% of the cases yield an estimate for μ which is
less than 1 unit apart from μ.

Introduction to Biostatistics

229

The above calculations can be repeated for other distances between x and μ:

  Distance |x - μ|     Probability
        1                  35%
        2                  63%
        3                  82%
        4.36               95%
        6.25               99%

For example, 99% of the random samples would yield a sample average that is not
further away from μ than 6.25 units
So, there is 99% chance that the interval [x - 6.25; x + 6.25] contains μ.
The interval [x - 6.25; x + 6.25] is called the 99% confidence interval (C.I.)
for μ.
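The 95% distance of 4.36 units, and hence the 95% C.I., can be reproduced from the CLT approximation. A minimal sketch assuming Python with scipy, using s² = 74.21 and n = 15:

    import numpy as np
    from scipy.stats import norm

    xbar, s2, n = 9.27, 74.21, 15
    half_width = norm.ppf(0.975) * np.sqrt(s2 / n)    # 1.96 * sqrt(s^2/n) = 4.36
    print(round(half_width, 2))
    print(round(xbar - half_width, 2), round(xbar + half_width, 2))   # approximately [4.91; 13.63]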
Introduction to Biostatistics

230

In our example, this interval equals:

     [x - 6.25; x + 6.25] = [9.27 - 6.25; 9.27 + 6.25] = [3.02; 15.52]

The percentage 99% is called the confidence level
Confidence intervals for other confidence levels:

  Level     Confidence interval
  35%       [x - 1; x + 1]              [8.27; 10.27]
  63%       [x - 2; x + 2]              [7.27; 11.27]
  82%       [x - 3; x + 3]              [6.27; 12.27]
  95%       [x - 4.36; x + 4.36]        [4.91; 13.63]
  99%       [x - 6.25; x + 6.25]        [3.02; 15.52]

In biomedical sciences, one traditionally uses 95% confidence levels


Introduction to Biostatistics

231

12.3

Interpretation

Let us focus on the 95% confidence interval. For other confidence levels, the
interpretation is similar.
We derived that 95% of the random samples would yield a sample average x that
is not further away from μ than 4.36 units
So, one can expect that approximately 95 out of 100 samples would lead to an
interval [x - 4.36; x + 4.36] that contains μ.
For a specific data set, such as the Captopril data, the obtained confidence interval
[4.91; 13.63] may or may not contain μ. However, it is very likely to contain μ,
since only 5 out of 100 data sets would lead to an interval not containing μ.
Illustration: Vestac Java Applet → statistical tests → confidence interval for mean
Introduction to Biostatistics

232

Introduction to Biostatistics

233

12.4

Properties of confidence intervals

Ideally, C.I.s are small, as this reflects a very precise estimation of the unknown
population parameter
Hence, a C.I. can be used as an indication of the precision of the estimation:
. short C.I.: precise estimation

. long C.I.: imprecise estimation, much uncertainty


The length of the C.I. increases with the confidence level:

  Level     Confidence interval
  95%       [4.91; 13.63]
  99%       [3.02; 15.52]

Introduction to Biostatistics

234

Intuitively: larger intervals are more likely to contain the unknown population
parameter
The length of the C.I. decreases with the sample size n
Illustration: Vestac Java Applet statistical tests confidence interval for mean

Introduction to Biostatistics

235

Intuitively: More observations lead to more precision:

One can buy extra precision with extra observations

The length of the C.I. increases with the variance σ² of the original data

Intuitively: The more the observations are alike, the more precisely the mean μ can
be estimated:

[Figure: a distribution with small spread (precise estimation of μ) next to one with large spread (imprecise estimation of μ)]

Introduction to Biostatistics

236

What about 100% C.I.s ?

The 100% C.I. for μ equals [-∞; +∞], which is not informative at all
Intuitively: Absolute certainty about population characteristics cannot be
attained based on a finite sample of observations

Introduction to Biostatistics

237

12.5

Example: BMI

The concept of C.I. has been explained in the context of the estimation of a
population average
However, C.I.s can be constructed for any characteristic θ of the distribution of
the random variable X of interest (variance, mean, median, proportion)
As an example, we re-consider the BMI data on n = 2605 Belgian males, where we
estimated the proportion π of males in the population with overweight (BMI > 25)
by the observed proportion π̂ = 46.99%
As an indication of the precision of this estimate, we can calculate, e.g., a 95%
C.I. for π: [0.45; 0.49]
The interval [0.45; 0.49] contains the unknown proportion π with 95% probability
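This interval can be reproduced with the usual normal approximation π̂ ± 1.96 √( π̂(1 - π̂)/n ). A minimal sketch assuming Python with scipy:

    import numpy as np
    from scipy.stats import norm

    n, pi_hat = 2605, 0.4699
    half_width = norm.ppf(0.975) * np.sqrt(pi_hat * (1 - pi_hat) / n)
    print(round(pi_hat - half_width, 2), round(pi_hat + half_width, 2))   # approximately 0.45 and 0.49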
Introduction to Biostatistics

238

12.6

Example from the biomedical literature

Wong et al. [10], Table 2:

C.I.s for differences between


means and medians

Introduction to Biostatistics

239

Chapter 13
Hypothesis testing

. Example
. Null and alternative hypothesis
. The p-value and level of significance
. Possible errors in decision making
. Hypothesis testing versus confidence intervals
. Example
. Example from the biomedical literature

Introduction to Biostatistics

240

13.1

Example

We continue the example with the Captopril data, where blood pressure was taken
in 15 hypertensive patients, before and after administration of the drug Captopril:

Interest is in deciding whether the treatment did affect the diastolic BP


Introduction to Biostatistics

241

As before, X is the difference in diastolic BP before and after treatment:

     X = BPbefore - BPafter

The observed values xi for X can be calculated from the observed values of the
BP in our sample:

Introduction to Biostatistics

  Patient    Before DBP    After DBP    Change xi

     1          130           125            5
     2          122           121            1
     3          124           121            3
     4          104           106           -2
     5          112           101           11
     6          101            85           16
     7          121            98           23
     8          124           105           19
     9          115           103           12
    10          102            98            4
    11           98            90            8
    12          119            98           21
    13          106           110           -4
    14          107           103            4
    15          100            82           18

242

Focus will be on finding evidence that the treatment affected the BP.
In case the treatment would have no effect, the average μ of X would be zero.
So, if one can show that there is (strong) evidence that μ ≠ 0, then this can be
considered as evidence for a treatment effect.
Based on our sample, the estimate for μ is μ̂ = x = 9.27

Obviously, this estimate is relatively far away from 0, suggesting that the
treatment might have affected BP
On the other hand, the observed effect μ̂ = 9.27 could have occurred by pure
chance, even if there would be no treatment effect at all.

In general, one will decide that there is evidence that μ ≠ 0 if our estimate x of μ
is far away from 0, i.e., if |x - 0| is large.
Introduction to Biostatistics

243

13.2

Null and alternative hypothesis

The procedure to decide whether there is sufficient evidence to believe the


treatment did affect BP is called test of hypothesis
In practice, the research question is formulated in terms of a null hypothesis H0
and an alternative hypothesis HA:

     H0 : μ = 0     versus     HA : μ ≠ 0

Based on our observed data, we will investigate whether H0 can be rejected in
favour of HA
If not, the null hypothesis H0 is accepted and one decides that the treatment
was not effective

Introduction to Biostatistics

244

Introduction to Biostatistics

245

[Diagram: POPULATION (distribution of X, with hypotheses H0 : μ = 0 versus HA : μ ≠ 0) and the random SAMPLE (histogram of the xi); inference and estimation yield μ̂ = x = 9.27]

Introduction to Biostatistics

246

13.3

The p-value and level of significance

Intuitively, it is obvious that H0 : μ = 0 will be rejected if the observed sample
average x is too far away from 0
Question:
How far is too far ?
Answers:
If this result is very unlikely to happen by pure chance
If this result is not at all what you expect to see if μ would be 0

Introduction to Biostatistics

247

The CLT will help us in deciding, as it describes what values for x are to be
expected if one would repeatedly draw new samples. If n is sufficiently large, we
have that:

     X̄ ≈ N (μ, σ²/n)

As always, σ² is estimated by s² = 74.21, and n = 15.

Moreover, we are interested in knowing what values for x could be expected if μ
would be 0

Introduction to Biostatistics

248

Hence, the sampling distribution of interest is:

     X̄ ∼ N (0, 74.21/15)

[Figure: this density, centred at 0]

So, if μ = 0, we expect that random samples generate averages that behave
according to the above distribution
Hence, if we observe a random sample with an average that is very extreme
according to this distribution, we should question the validity of the null
hypothesis μ = 0

Introduction to Biostatistics

249

How far should x be from 0 in order to consider this extreme ?

Should x = 1 be considered extreme ?

     X̄ ∼ N (0, 74.21/15)

If μ = 0, the probability of observing a x less than 1 unit away from 0 is:

     P (-1 ≤ X̄ ≤ 1) = P ( (-1 - 0)/√(74.21/15) ≤ (X̄ - 0)/√(74.21/15) ≤ (1 - 0)/√(74.21/15) )

                     = P (-0.45 ≤ N (0, 1) ≤ 0.45) = 35%

Introduction to Biostatistics


250

Hence, if there is no treatment effect, i.e., if μ = 0, then there is only 35% chance
of having a random sample with average x within 1 unit away from 0.
So, there would be 65% chance of observing a sample with an average more than
1 unit away from 0.
Observing x = 1 cannot really be considered a lot of evidence against H0 : μ = 0
Similar calculations can be used for other values, such as 2, 3, . . .

[Figure: the N (0, 74.21/15) density with the points -3, -2, -1, 0, 1, 2, 3 marked]

Introduction to Biostatistics

251

The corresponding probabilities of observing a sample average more than
1, 2, 3, . . . units away from 0 are:

  Distance |x - 0|     Probability
        1                  65%
        2                  37%
        3                  18%

This probability can also be calculated for the distance |x - 0| = 9.27 that was
observed in our experiment:

[Figure: the N (0, 74.21/15) density with the points -9.27, 0, and 9.27 marked]

Introduction to Biostatistics
252

The corresponding probability equals:

  Distance |x - 0|     Probability
        1                  65%
        2                  37%
        3                  18%
        9.27              0.1%

Hence, if μ would indeed equal 0, then it would be very unlikely to observe a
sample with average as extreme as 9.27. This would happen only once every 1000
times a sample would be taken.
We therefore consider the data observed in our experiment sufficient evidence to
reject the null hypothesis and we conclude that the treatment effect is
significantly different from 0, or equivalently, that there is a significant
treatment effect
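The p-value of about 0.1% can be reproduced with a one-sample t-test on the 15 observed changes from the table above. A minimal sketch assuming Python with scipy:

    from scipy import stats

    changes = [5, 1, 3, -2, 11, 16, 23, 19, 12, 4, 8, 21, -4, 4, 18]
    t_stat, p_value = stats.ttest_1samp(changes, popmean=0)   # H0: mu = 0, two-sided
    print(round(t_stat, 2), round(p_value, 4))                # t about 4.17, p about 0.001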
Introduction to Biostatistics

253

The probability 0.1% that expresses how extreme our observations are in case the
null hypothesis would be true is denoted by p, and is called the p-value.
A small p-value indicates that the observed results would be extreme if H0 were
true. One then rejects the null hypothesis
A large p-value indicates that the observed results are perfectly in line with what
one can expect to observe if H0 is true. One then does not reject the
null hypothesis, which is equivalent to accepting the null hypothesis
In practice, one has to decide how small p should get before the null hypothesis is
rejected.
One therefore specifies the so-called level of significance α:

     p < α   =⇒  reject H0
     p ≥ α   =⇒  accept H0
Introduction to Biostatistics

254

α is typically a small value, such as 0.01, 0.05, 0.10

In biomedical sciences, α = 0.05 = 5% is standard.
One then rejects the null hypothesis as soon as the observed result would happen
in fewer than 5 out of 100 experiments, assuming that the null hypothesis would
be correct
Strictly speaking, one should always mention what level of significance has been
used, and the conclusion would have to be formulated as the treatment effect is
significantly different from 0 at the 5% level of significance, or equivalently,
that there is a significant treatment effect at the 5% level of significance.

Introduction to Biostatistics

255

Note that specification of α is only required if a formal decision is preferred
(accept or reject).
It is therefore not meaningful to report borderline significance in examples
where p is only slightly larger than α (e.g., p = 0.06 > α = 0.05)

Introduction to Biostatistics

256

13.4

Possible errors in decision making

In our example about the Captopril treatment, we obtained p = 0.001 leading to


the rejection of the null hypothesis of no treatment effect.
This should not be considered as formal proof that there is a treatment effect
Even if the treatment has no effect at all, a sample like ours would occur once
every 1000 times.
Maybe, our sample was indeed the extreme one that happens once every thousand
experiments.
Alternatively, suppose we would have obtained p = 0.9812. We then would not
have rejected the null hypothesis, and concluded that there is no evidence for any
treatment effect.
Introduction to Biostatistics

257

This should not have been considered as formal proof that any treatment effect
would be absent.
Maybe the treatment effect μ is not 0, but very close to 0. The data one then
would observe would look very similar to data that would be observed if μ = 0,
such that the data do not allow us to detect that μ ≠ 0
Conclusion:
Statistics can prove everything

Intuitively: Absolute certainty about


population characteristics cannot be
attained based on a finite sample of observations
Introduction to Biostatistics

258

Introduction to Biostatistics

259

13.5

Hypothesis testing versus confidence intervals

For the Captopril data, we have drawn conclusions about the average treatment
effect μ in the population, through two different statistical procedures:

. 95% confidence interval: [4.91; 13.63]
. Significance of treatment effect, p = 0.001

We know from the C.I. that the average treatment effect μ is likely to be between
4.91 and 13.63, excluding 0
The significance test has rejected the value 0 as a possible value for μ
So, both procedures agree

Introduction to Biostatistics

260

Question:
Do both procedures always agree ?

Answer:
Yes, provided the levels of significance and confidence are complementary to each
other:

  Level of significance α     Confidence level (1 - α)100%
          0.05                           95%
          0.10                           90%
          0.01                           99%

Introduction to Biostatistics

261

In case of accepting H0 (p ≥ α = 0.05):

[Figure: a 95% C.I. that contains the value specified by H0]

In case of rejecting H0 (p < α = 0.05):

[Figure: a 95% C.I. that does not contain the value specified by H0]

Introduction to Biostatistics

262

An alternative interpretation for the C.I. follows immediately:

     A 95% C.I. is the collection of all null
     hypotheses that would be accepted in a
     statistical test

Statistical tests are to some extent equivalent to C.I.s

However, C.I.s have the advantage of giving an indication of the effect size
(treatment estimate θ̂), as well as of the precision of estimation (width of C.I.)
So, C.I.s should be preferred over statistical tests
Introduction to Biostatistics

Biomedical literature

263

13.6

Example: BMI

The concept of statistical tests has been explained in the context of the
estimation of a population average
However, tests can be constructed for any characteristic θ of the distribution of
the random variable X of interest (variance, mean, median, proportion)
As an example, we re-consider the BMI data on n = 2605 Belgian males, where we
estimated the proportion π of males in the population with overweight (BMI > 25)
by the observed proportion π̂ = 46.99%
Suppose it would be known that 10 years before our sample was taken, only 40%
of the Belgian males suffered from overweight

Introduction to Biostatistics

264

If one wants to investigate whether overweight is occurring more frequently now,
it would be of interest to test the following hypotheses:

     H0 : π ≤ 0.40     versus     HA : π > 0.40

If H0 would be rejected in favour of HA, we would conclude that we found
evidence that nowadays more males suffer from overweight
The corresponding p-value equals p < 0.0001, so we conclude that the proportion
of males in the Belgian population with overweight is significantly larger than
40%, at the 5% level of significance.
This is an example of a one-sided test, since the alternative hypothesis is at one
side of the null hypothesis only.
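As an illustration, a one-sided z-test for a proportion, based on the usual normal approximation, also gives p < 0.0001 here. This is a sketch only; the exact procedure used for the slides is not specified. It assumes Python with scipy:

    import numpy as np
    from scipy.stats import norm

    n, pi_hat, pi_0 = 2605, 0.4699, 0.40
    z = (pi_hat - pi_0) / np.sqrt(pi_0 * (1 - pi_0) / n)   # standardized difference under H0
    p_value = 1 - norm.cdf(z)                              # one-sided: HA: pi > 0.40
    print(round(z, 2), p_value)                            # z about 7.3, p far below 0.0001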

Introduction to Biostatistics

265

In the earlier Captopril example, the hypotheses were

     H0 : μ = 0     versus     HA : μ ≠ 0

This was an example of a two-sided test

Note how for one- and two-sided tests, the conclusions are formulated slightly
differently:

. Two-sided: . . . is (not) significantly different from . . .
. One-sided: . . . is (not) significantly smaller/larger than . . .

Introduction to Biostatistics

266

13.7

Example from the biomedical literature

Wong et al. [10]


. Section on statistical methodology:

. Two-sided tests
. 5% level of significance

Introduction to Biostatistics

267

. Table 2:

. C.I.s for differences


between means and medians
. Corresponding tests for
significance

Introduction to Biostatistics

268

Part V
Some frequently used tests

Introduction to Biostatistics

269

Chapter 14
The comparison of two means: Unpaired data

. Example
. Confidence interval for the difference
of two means
. The unpaired t-test
. Assumptions
. Example: Survival times of cancer patients
. Example from the biomedical literature

Introduction to Biostatistics

270

14.1

Example

Re-consider the example on the weight gain in rats, where interest is in the
comparison between rats fed on a high or low protein diet
Group-specific histograms:

Introduction to Biostatistics

271

Group-specific summary statistics:

On average, there is an observed difference of 19g between the rats on a high


protein diet and those on a low protein diet.
Is this observed difference sufficient evidence to conclude that there indeed is an
effect of diet on the weight gain ?
It would be of interest to know how likely such a difference of 19g is to occur if
weight gain would be completely unrelated to the protein level of the diet.

Introduction to Biostatistics

272

Note that, strictly speaking, we have two populations, with a sample randomly
drawn from each:
. High protein rats: The hypothetical population of all rats that are given a
high protein diet
. Low protein rats: The hypothetical population of all rats that are given a
low protein diet
From the first population, a random sample of n1 = 12 rats was taken. From the
second one, a random sample of n2 = 7 rats was drawn.
The corresponding observed means are x1 = 120 and x2 = 101 respectively.
Because there is no relation between the observations taken from the first
population and those taken from the second, we have unpaired data.

Introduction to Biostatistics

273

14.2

Confidence interval for the difference of two means

Let μ1 and μ2 be the (unknown) mean weight gain in the high and low protein
population, respectively:

[Figure: two densities, low protein centred at μ2 and high protein centred at μ1]

Of interest is to draw inferences about μ1 - μ2


Introduction to Biostatistics

274

As always, our estimate of μ1 - μ2 is

     μ̂1 - μ̂2 = x1 - x2 = 19

Based on the observed data, C.I.s can be constructed for μ1 - μ2

For example, a 95% C.I. for μ1 - μ2 is given by [-2.19; 40.19]
The true difference μ1 - μ2 may or may not be in the interval [-2.19; 40.19].
However, if 100 similar experiments would be conducted, then 95 out of the 100
corresponding C.I.s are expected to contain μ1 - μ2.
Hence, with 95% certainty, we can conclude that we believe μ1 - μ2 to be within
the interval [-2.19; 40.19].

Introduction to Biostatistics

275

This C.I. shows that:

. the estimate (19g) of μ1 - μ2 is a very imprecise estimate:
     the C.I. is very wide
     with 95% chance, the estimate is no more than 21.19 units away from the true difference
. based on our data, it cannot be ruled out that μ1 - μ2 would be zero, i.e., that
  there would be no difference between both populations.

Introduction to Biostatistics

276

14.3

The unpaired t-test

Often, it is of interest to test whether two populations have the same mean.
This is translated into a set of hypotheses of the form:

     H0 : μ1 = μ2     versus     HA : μ1 ≠ μ2

We will reject the null hypothesis if the observed data show too much deviation
from what is expected to be seen if the null hypothesis were correct
Hence, we will reject H0 if x1 is much larger than x2, or vice versa
This is equivalent to rejecting H0 if |x1 - x2| is too large

Introduction to Biostatistics

277

Question:
How large is too large ?
Answer:
If the observed difference |x1 - x2|
is very unlikely to happen by pure chance

We therefore calculate the probability p of observing a similar experiment with a
mean difference between the groups of at least 19g, if μ1 = μ2.

Introduction to Biostatistics

278

In our example, this probability equals p = 0.0757:

So, even if there is no relation at all between the protein content of the diet and
weight gain, then one can still expect to observe a difference of at least 19g in
7.6% of future similar experiments.
Since p = 0.0757 > 0.05 = α, we consider this insufficient evidence to conclude
that the protein level would indeed affect the weight gain

Introduction to Biostatistics

279

Conclusion:
There is no significant difference (p = 0.0757) in weight gain
between rats on a high protein level diet,
and rats on a low protein level diet

The above testing procedure is called the unpaired t-test since unpaired data are
analysed, and since the calculation of the p-value is based on the t-distribution.

Introduction to Biostatistics

280

14.4

Assumptions

The calculation of the C.I., as well as the computation of the p-value, are based on
the sampling distribution of X̄1 - X̄2, which describes what values for x1 - x2
can be expected in case the experiment would be repeated many times.
The sampling distribution of X̄1 - X̄2 is completely determined by the
sampling distributions of X̄1 and X̄2
In case of large samples, those distributions are known to be normal (CLT)
In small samples, this normality of X̄1 and X̄2 is only valid in cases where the
original data are (approximately) normally distributed.

Introduction to Biostatistics

281

Therefore, in case of small samples, one assumes the outcome to be normally
distributed in each group separately:

[Figure: two normal densities, low protein centred at μ2 and high protein centred at μ1]

Introduction to Biostatistics

282

Conclusion:

. Large samples: no assumptions
     [Figure: two arbitrary densities for the low protein (μ2) and high protein (μ1) groups]
. Small samples: Normality in both groups
     [Figure: two normal densities for the low protein (μ2) and high protein (μ1) groups]

Note that the samples in our experiment were small (n1 = 12 and n2 = 7). Hence the
histograms should be explored for any evidence against symmetry

Introduction to Biostatistics

283

The group-specific histograms are:

Note that, given the small sample sizes, assessment of symmetry is difficult
This illustrates another drawback of small samples: Assumptions are often needed,
which are very hard to check based on the observed data.

Introduction to Biostatistics

284

Subject-matter knowledge can often help in deciding whether the underlying
assumptions are realistic
The unpaired t-test also implicitly assumes that both populations have the same
variance
This can be checked with a test for equality of variances, in which the
following hypotheses are tested:

     H0 : σ1² = σ2²     versus     HA : σ1² ≠ σ2²

Most software packages automatically report the results from such a test, and
even provide a corrected unpaired t-test, which corrects for the unequal variances:

Introduction to Biostatistics

285

The variances are not significantly different from each other (p = 0.9788), such
that our original result remains valid.
Note that, since the variances are so similar, the corrected and uncorrected t-tests
yield very similar results (p-values).
Often, non-equality of the variances is associated with non-normality of the data

Introduction to Biostatistics

286

14.5

Example: Survival times of cancer patients

Based on the data on survival times of cancer patients, we want to compare the
survival times of stomach cancer patients with the survival times of colon cancer
patients
Summary statistics:

We observe a large difference of 457.4 - 286 = 171.4 days in average survival time
between both groups.
Introduction to Biostatistics

287

On the other hand, there is a lot of variability between the subjects in both groups.
Hence, it is not clear whether the observed difference of 171 days is sufficient
evidence to conclude that survival times are indeed different for colon cancer
patients and stomach cancer patients
Results of the unpaired t-test:

We do not find a significant difference between both groups, with respect to the
survival time (p = 0.2483).

Introduction to Biostatistics

288

However, the histograms suggest skewness in the data, such that the underlying
assumption of normality becomes questionable:

The skewness in the direction of the large values suggests that a logarithmic (or
similar) transformation might be useful:
X = survival time  →  Y = ln(X) = ln(survival time)
Introduction to Biostatistics

289

[Figure: histogram shapes with corresponding possible transformations]

Introduction to Biostatistics

290

          Stomach                       Colon
      X       Y = ln(X)             X       Y = ln(X)

     124        4.82               248        5.51
      42        3.74               377        5.93
      25        3.22               189        5.24
      45        3.81              1843        7.52
     412        6.02               180        5.19
      51        3.93               537        6.29
    1112        7.01               519        6.25
      46        3.83               455        6.12
     103        4.63               406        6.01
     876        6.78               365        5.90
     146        4.98               942        6.85
     340        5.83               776        6.65
     396        5.98               372        5.92
                                   163        5.09
                                   101        4.62
                                    20        3.00
                                   283        5.65

Introduction to Biostatistics

291

As before, assessing symmetry is difficult due to the small number of observations


in both groups. However, the evidence against symmetry is much weaker now.
Results of unpaired t-test based on transformed data:

The observed difference between both groups is still not significant (p = 0.0671),
but the p-value is very different from what we obtained before the transformation
(p = 0.2483).
This illustrates that:

. assumptions need to be checked


. violation of assumptions can lead to serious errors
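The two unpaired t-tests (raw and log scale) can be reproduced from the survival times listed above. A minimal sketch assuming Python with scipy, and equal variances in both groups as in the slides:

    import numpy as np
    from scipy import stats

    stomach = [124, 42, 25, 45, 412, 51, 1112, 46, 103, 876, 146, 340, 396]
    colon = [248, 377, 189, 1843, 180, 537, 519, 455, 406, 365, 942, 776, 372,
             163, 101, 20, 283]

    print(stats.ttest_ind(stomach, colon))                    # p close to 0.2483 (raw scale)
    print(stats.ttest_ind(np.log(stomach), np.log(colon)))    # p close to 0.0671 (log scale)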

Introduction to Biostatistics

292

Note that this is another example where geometric means and standard
deviations would be useful to describe the location and spread of the survival
times in the two cancer groups separately:

                               Stomach cancer              Colon cancer
  Outcome                      mean (stand.dev.)           mean (stand.dev.)
  Survival time (days)         144.03  (3.49)              314.19  (2.72)
                               = exp(4.97) (= exp(1.25))   = exp(5.75) (= exp(1.00))

                        geometric means and standard deviations

which are very different from the arithmetic means and standard deviations that
were reported before:

Introduction to Biostatistics

293

The fact that the formal test has been performed on the log-transformed survival
times does not change the interpretation of the result
If the log-transformed survival times are different for the two groups, then so are
the untransformed survival times
Hence, although the conclusion, strictly speaking, should be that
there is no significant difference in log survival times,
it will often be formulated as
there is no significant difference in survival times.

Introduction to Biostatistics

294

14.6

Example from the biomedical literature

Nissen et al. [15], Table 1:

. Large samples
. Similar variability in both groups
. p < 0.001 rather than p = 0.000

Introduction to Biostatistics

295

Kellett, Kellett, and Nordholm [12], Table 2:

. Relatively small samples

. Variances NOT equal

. Normality assumption NOT satisfied

. No reporting of the p-values

Introduction to Biostatistics

296

Chapter 15
The comparison of two proportions: Unpaired data

. Example
. The chi-squared test
. Assumptions - The Fisher Exact test
. Rows versus columns
. Example: Case-control data
. Example from the biomedical literature

Introduction to Biostatistics

297

15.1

Example

We re-consider the data on sickness absence, collected on 585 employees with a
similar job:

                      Sickness absence
  Gender             No        Yes
  female            245        184        429
  male               98         58        156
                    343        242        585

Introduction to Biostatistics

298

Research question:
Is there a relation between absence and gender ?

184/429 = 42.9% of the females, and 58/156 = 37.2% of the males, have been
absent
This suggests that females are absent more often than males
However, even if absence due to sickness is equally frequent amongst males and
females, the above results could have occurred by pure chance.
It therefore would be of interest to calculate how likely it would be to observe such
differences, by pure chance

Introduction to Biostatistics

299

Note that we have again two populations, with a sample randomly drawn from
each:
. Males: The hypothetical population of all male employees with similar job
conditions
. Females: The hypothetical population of all female employees with similar
job conditions
From the first population, a random sample of n1 = 156 males was taken. From
the second one, a random sample of n2 = 429 females was drawn.
Let π1 and π2 denote the proportions with sickness absence in the total populations
of males and females, respectively. Then π1 and π2 can be estimated by their sample
versions π̂1 = 0.372 and π̂2 = 0.429
Because there is no relation between the observations taken from the first
population and those taken from the second, we have unpaired data.
Introduction to Biostatistics

300

15.2

The chi-squared test

Often, it is of interest to test whether the two populations have the same percentage
of people with absence due to sickness.
This translates into a set of hypotheses of the form:

H0 : π1 = π2    versus    HA : π1 ≠ π2

We will reject the null hypothesis if the observed data show too much deviation
from what would be expected if the null hypothesis were correct
Hence, we will reject H0 if π̂1 is much larger than π̂2, or vice versa
This is equivalent to rejecting H0 if |π̂1 − π̂2| is too large

Introduction to Biostatistics

301

Question:
How large is too large ?
Answer:
If the observed difference |π̂1 − π̂2|

is very unlikely to happen by pure chance

We therefore calculate the probability p of observing a similar experiment with a


difference between the groups at least equal to
|π̂1 − π̂2| = 0.429 − 0.372 = 0.057, if π1 = π2

Introduction to Biostatistics

302

In our example, this probability equals p = 0.215:

So, even if there is no relation at all between gender and absence, one can
still expect to observe a difference of at least 5.7% in 21.5% of future similar
experiments.
Since p = 0.215 > 0.05 = α, we consider this insufficient evidence to conclude
that the occurrence of sickness absence is related to gender

Introduction to Biostatistics

303

Conclusion:
There is no significant difference (p = 0.215) in prevalence
of sickness absence
between males and females

The testing procedure needed for the comparison of proportions in unpaired data
is called the chi-squared test since the calculation of the p-value is based on the
chi-squared (χ²) distribution.
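As a minimal sketch (assuming SciPy; not part of the course material), the chi-squared test for this 2 × 2 table can be carried out as follows:

```python
from scipy.stats import chi2_contingency

# rows = gender (female, male), columns = sickness absence (No, Yes)
table = [[245, 184],
         [98,  58]]

# correction=False gives the uncorrected chi-squared statistic; its p-value should
# be close to the 0.215 reported above (the default Yates continuity correction
# gives a somewhat larger p-value)
chi2, p, df, expected = chi2_contingency(table, correction=False)
print(f"chi-squared = {chi2:.3f}, df = {df}, p = {p:.3f}")
```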

Introduction to Biostatistics

304

15.3

Assumptions: The Fisher Exact test

The calculation of the p-value is based on the sampling distribution of π̂1 − π̂2,
which describes what values for π̂1 − π̂2 can be expected in case the experiment
would be repeated many times.

Note that π̂1 and π̂2 are the sample averages X̄1 and X̄2 of the binary variable
sickness absence.

Hence, for large samples, the sampling distribution of π̂1 − π̂2 directly follows from
the CLT

In small samples, the normality of π̂1 and π̂2 can be problematic, and an
alternative calculation of the p-value is needed.

Introduction to Biostatistics

305

The Fisher Exact test provides an alternative way to calculate the p-value,
without relying on the CLT, nor on the assumption of large samples.
As an example, we consider again data on sickness absence, but from a second,
much smaller, company:

Sickness absence
Gender

No

Yes

female

male

10

12

11

14

The results based on the chi-squared as well as on the Fisher Exact test are:

Introduction to Biostatistics

306

We observe considerable differences due to the (extremely) small sample sizes in


both groups
In larger samples, chi-squared and Fisher Exact produce much more similar
p-values:
Sickness absence
Company    Males        Females       χ² p-value    Fisher Exact p-value
           58/156       184/429       0.215         0.219
           2/12         1/2           0.287         0.396
           107/330      405/1079      0.091         0.102
           37/97        40/122        0.409         0.477
           3/10         48/150        0.895         1.000
           56/156       1/11          0.070         0.100
           1/12         0/1           0.764         1.000
           53/170       0/1           0.501         1.000
           378/1089     117/269       0.007         0.009

Introduction to Biostatistics

307

The Fisher Exact test is very time-consuming, and cannot be calculated for large
samples, except with special software.
However, note that, for large samples, the chi-squared test remains possible, and
yields results very similar to the ones that would have been obtained with the
Fisher Exact test
In practice, it is often standard to use Fisher Exact, unless computational
restrictions require the use of chi-squared.
Conclusion:
Large samples: Chi-squared test
Small samples: Fisher Exact test
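As a sketch (assuming SciPy), both tests can be applied to the large-company data from the first row of the table above; for such large samples the two p-values are indeed very close:

```python
from scipy.stats import chi2_contingency, fisher_exact

# Large company: males 58/156 absent, females 184/429 absent
# rows = gender (male, female), columns = sickness absence (Yes, No)
table = [[58, 98],
         [184, 245]]

chi2, p_chi2, _, _ = chi2_contingency(table, correction=False)
_, p_fisher = fisher_exact(table)

print(f"chi-squared  p = {p_chi2:.3f}")    # should be close to 0.215
print(f"Fisher exact p = {p_fisher:.3f}")  # should be close to 0.219
```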

Introduction to Biostatistics

308

15.4

Rows versus columns

When comparing two unpaired proportions, the data can always be summarized by
a 2 × 2 table:

Sickness absence
Gender      No       Yes
female      A        B        A+B
male        C        D        C+D
            A+C      B+D      A+B+C+D

in which A, B, C, and D represent the number of observations in each cell.


The hypothesis of interest was to compare the prevalence of sickness absence
between males and females.
Introduction to Biostatistics

309

One can show that this is equivalent to comparing the percentage of males
(females) between the employees with and without sickness absence:

B/(A+B) = D/(C+D)    ⟺    C/(A+C) = D/(B+D)

Proof:

B/(A+B) = D/(C+D)  ⟺  B(C+D) = D(A+B)
                   ⟺  BC = AD
                   ⟺  C(B+D) = D(A+C)
                   ⟺  C/(A+C) = D/(B+D)

This implies that, for the analysis of a 2 × 2 table, rows and columns can be
interchanged.
This is of interest for the analysis of case-control data
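A minimal numerical check of this argument (assuming SciPy), using the sickness-absence table of Section 15.1: the chi-squared test applied to the transposed table gives exactly the same statistic and p-value.

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[245, 184],   # female: No, Yes
                  [98,  58]])   # male:   No, Yes

chi2_rows, p_rows, _, _ = chi2_contingency(table, correction=False)
chi2_cols, p_cols, _, _ = chi2_contingency(table.T, correction=False)  # rows and columns swapped

print(chi2_rows, p_rows)   # identical to the line below
print(chi2_cols, p_cols)
```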

Introduction to Biostatistics

310

15.5

Case-control data

We consider the data on cervical cancer, where the relationship between the
occurrence of cervical cancer and the age at first pregnancy is studied.
Data were collected on 49 cancer cases and 317 non-cancer cases (controls). All
women were asked about their age at first pregnancy, and the data are
summarized as:

Disease status
Age        Cervical cancer    Control
≤ 25       42                 203        245
> 25       7                  114        121
           49                 317        366

Introduction to Biostatistics

311

Research question:
Is there a relation between cancer and age ?

Of interest is to compare the prevalence of cancer between women with first


pregnancy before the age of 25, and those with first pregnancy later.
However, correct estimation of these percentages would have required a sample of
women with first pregnancy before the age of 25, and a sample of women with
first pregnancy later
This was not the setup of the present experiment, where a number of cases and a
number of controls are randomly selected, and where all women are then
questioned about their age at first pregnancy.

Introduction to Biostatistics

312

Such a design only allows correct estimation of the percentage of women with first
pregnancy before the age of 25, for cases and controls separately.
However, since rows and columns can be interchanged, this is sufficient to answer
our research question of interest:

Introduction to Biostatistics

313

For testing purposes, rows and columns can be interchanged, implying that the
analysis of case-control data still answers the research question of interest
For descriptive purposes, however, the choice between row and column
percentages entirely depends on the design of the study.
In the above example on cervical cancer, the row-percentages (i.e., percentage of
women with first pregnancy before the age of 25), for cancer cases and controls
separately, are the only ones that reflect the case-control nature of the experiment.

Introduction to Biostatistics

314

15.6

Example from the biomedical literature

Zuskin et al. [16], p.173 and Table 1:

Introduction to Biostatistics

315

It is not clear when chi-squared is used, and when Fisher Exact is used

Introduction to Biostatistics

316

Chapter 16
The comparison of two means: Paired data

. Example
. Confidence interval for the difference of two means
. The paired t-test
. The paired versus unpaired t-test
. Example
. Assumptions
. Example from the biomedical literature

Introduction to Biostatistics

317

16.1

Example

We re-consider the example with the Captopril data, where blood pressure was
taken in 15 hypertensive patients, before and after administration of the drug
Captopril:

Interest is in deciding whether the treatment did affect the diastolic BP


Introduction to Biostatistics

318

As in the unpaired t-test, we might consider this a two-sample case, where a


sample is taken from each of two populations:
. Population 1: Patients without treatment
. Population 2: Patients after treatment with Captopril
Let μ1 be the population average BP if no treatment is given, and let μ2 denote
the population average BP after treatment.
After treatment                Without treatment
      |                              |
      μ2                             μ1

Introduction to Biostatistics

319

Interest is in inference for the difference μ = μ1 − μ2.


The main difference when compared to the unpaired t-test is that each
observation from the first sample now uniquely corresponds to one observation
from the second sample, and vice versa.
Hence, we have paired data

In the case of unpaired data, μ would be estimated by the difference between the
two sample averages:

μ̂ = μ̂1 − μ̂2 = x̄1 − x̄2

In the case of paired data, μ is estimated by the average of all subject-specific
differences between BPs before and after treatment. More specifically, the
variable of interest becomes the difference X in BP before and after treatment:

X = BPbefore − BPafter
Introduction to Biostatistics

320

As before, the observed values xi for X can be calculated from the observed
values of the BP in our sample:

          Before   After    Change
Patient   DBP      DBP      xi
1         130      125        5
2         122      121        1
3         124      121        3
4         104      106       −2
5         112      101       11
6         101      85        16
7         121      98        23
8         124      105       19
9         115      103       12
10        102      98         4
11        98       90         8
12        119      98        21
13        106      110       −4
14        107      103        4
15        100      82        18

μ is the population mean of the variable X, and inference for μ can be based on
the within-subject differences xi, rather than on the original BP measurements.
Note that this situation has been explained in full detail in the Chapters 12 and 13.
Introduction to Biostatistics

321

16.2

Confidence interval for the difference of two means

In Chapter 12, we derived that a 99% confidence interval for μ is given by


[3.02; 15.52].
Other confidence levels (95%, 90%, . . . ) are possible as well

Introduction to Biostatistics

322

16.3

The paired t-test

The hypothesis of interest is


H0 : μ1 = μ2    versus    HA : μ1 ≠ μ2

This is equivalent with the following test about the mean μ of the difference X in
blood pressure:

H0 : μ = 0    versus    HA : μ ≠ 0

In Chapter 13, we obtained a significant treatment effect (p = 0.001).


The testing procedure used in Chapter 13 is called the paired t-test since paired
data are analysed, and since the calculation of the p-value is based on the
t-distribution.
Introduction to Biostatistics

323

16.4

The paired versus unpaired t-test

What if the Captopril data were analysed using an unpaired t-test ?

Introduction to Biostatistics

324

Results from unpaired and paired t-tests, respectively:


. Unpaired:

. Paired:

Although both tests lead to a significant result, there is a serious difference in


p-values, showing that ignoring the paired nature of the data can lead to wrong
conclusions.
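As a sketch of this comparison (assuming SciPy; not part of the course material), both analyses can be applied to the Captopril table from Section 16.1:

```python
import numpy as np
from scipy import stats

before = np.array([130, 122, 124, 104, 112, 101, 121, 124, 115, 102, 98, 119, 106, 107, 100])
after  = np.array([125, 121, 121, 106, 101,  85,  98, 105, 103,  98, 90,  98, 110, 103,  82])

t_unpaired, p_unpaired = stats.ttest_ind(before, after, equal_var=True)  # ignores the pairing
t_paired, p_paired = stats.ttest_rel(before, after)                      # uses the pairing

print(f"unpaired t-test: p = {p_unpaired:.4f}")   # considerably larger
print(f"paired t-test:   p = {p_paired:.4f}")     # close to the p = 0.001 quoted in Chapter 13
```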

Introduction to Biostatistics

325

Conclusion:

15 × 2 measurements ≠ 30 × 1 measurement
In general, the analysis of an outcome, measured multiple times per subject
(repeated measures), requires different statistical procedures than when the
outcome is measured only once for each subject.

Introduction to Biostatistics

326

16.5

Example

Obviously, it is important to correctly account for the paired nature of the data
In practice, this requires knowledge about the design of the study and the way
data have been collected
As an example, suppose interest is in testing for differences in BMI between males
and females
Suppose that BMI measurements are available for 100 males and 100 females.
The unpaired t-test is the obvious choice for the analysis, provided all assumptions
are satisfied.
Suppose now that the 100 males and females are taken from 100 married couples,
would this change the preferred method for analysis ?
YES !
Introduction to Biostatistics

327

16.6

Assumptions

It has been shown in the Chapters 12 and 13 that the calculation of both the C.I.
and the p-value entirely depends on the sampling distribution of X̄, the sample
average of the differences in BP before and after treatment.
In large samples, this sampling distribution is normal (CLT)
In small samples, this normality is only valid in cases where the difference in BP is
(approximately) normally distributed.
Therefore, in case of small samples, one assumes the difference X to be normally
distributed.
Note that, in this context, the sample size refers to the number of pairs, not the
number of observations in the data set
Introduction to Biostatistics

328

Conclusion:

Large samples: no assumptions            Small samples: Normality for difference X

        Difference X                              Difference X
             |                                         |
           μ = 0 ?                                   μ = 0 ?

In our Captopril example, the sample size was small (n = 15). Hence the
histogram of the observed differences should be explored for any evidence against
symmetry

Introduction to Biostatistics

329

Histogram of observed differences:

Assessment of symmetry is again difficult due to the small sample size, but there
is no strong evidence for severe skewness.
Note that the normality assumption is with respect to the difference X, not the
original measurements.

Introduction to Biostatistics

330

In our example, the original BP measurements (before and after treatment) are
allowed to be skewed, as long as their differences are symmetrically distributed:
After treatment        Before treatment            Difference X
      |                      |                          |
      μ2                     μ1                       μ = 0 ?

Hence, it is useless to check symmetry of the original observations.

Introduction to Biostatistics

331

Note that, in case of skewness, it is often difficult and/or not helpful to transform
the observed differences xi:
. Since often negative differences are observed, several standard transformations
such as ln(·) or √· are not possible
. Even if a transformation such as, e.g., yi = ln(xi + 10) would yield symmetric
observations yi, it is not clear what null hypothesis should be tested.
. Obviously, one can no longer test whether the mean of Y is equal to zero.
In case of skewness, one therefore usually transforms the original data in such way
that the differences become symmetric. This has the advantage that:
. Simple, standard, transformations can often be used
. One can still test for mean zero.

Introduction to Biostatistics

332

For example, a potential transformation for the Captopril data would be:

BPbefore  →  ln(BPbefore)
BPafter   →  ln(BPafter)           X = ln(BPbefore) − ln(BPafter)

instead of:

BPbefore
BPafter                            X = BPbefore − BPafter        Y = ln(X + 5)

Introduction to Biostatistics

333

16.7

Example from the biomedical literature

Chen et al. [17], p. 76 and Tables 1 and 2:

Introduction to Biostatistics

334

Paired t-test to test for time trends (IAC versus AOD)

Introduction to Biostatistics

335

Unpaired t-test to test for group differences (SARS versus Control)

Introduction to Biostatistics

336

Chapter 17
The comparison of two proportions: Paired data

. Example
. Mc Nemar test
. Assumptions
. Remark
. Mc Nemar versus chi-squared
. Example from biomedical literature

Introduction to Biostatistics

337

17.1

Example

Consider the data on the prevalence of severe colds in 1319 children, measured at
the ages of 12 and 14.
The response of interest is whether the child had severe colds during the last 12
months

                            Severe colds at 14 yrs.
                            Yes        No
Severe colds    Yes         212        144        356
at 12 yrs.      No          256        707        963
                            468        851        1319

Introduction to Biostatistics

338

Research question:
Is the prevalence of severe colds different at the two ages ?
At age 12, 356/1319 = 27% of the children reported severe colds.
At age 14, this percentage equals 468/1319 = 35%
These data suggest that the prevalence of severe colds increases with age.
It would be of interest to know how likely the observed change in prevalence is to
occur by pure chance.
If this is very unlikely, the above data provide evidence that the prevalence indeed
changes with age. Otherwise, the above data do not provide evidence for such a
change.
Introduction to Biostatistics

339

Note that the data structure is similar to the one in the Captopril data, in the
sense that subjects are measured twice at different time points:

Hence, we have again paired data.


Introduction to Biostatistics

340

17.2

Mc Nemar test

Let π1 and π2 be the percentage of children in the total population with a severe
cold at the ages 12 and 14 respectively.
Interest is in testing whether π1 and π2 are equal, which would reflect no change
over time in the percentage of children with a severe cold.
The hypothesis of interest is

H0 : π1 = π2    versus    HA : π1 ≠ π2

Note that a change over time in the percentage of severe colds can only occur if
children change their status:
. No severe cold at 12yrs → severe cold at 14yrs

. Severe cold at 12yrs → no severe cold at 14yrs


Introduction to Biostatistics

341

Moreover, in order to have a change over time, more children should change in
one direction than in the other
Our test will therefore reject H0 if the number of changers in one direction is
much larger than the number of changers in the other direction.
In our example, we will reject H0 if |256 − 144| is too large
Question:
How large is too large ?
Answer:
If the observed difference |256 − 144|

is very unlikely to happen by pure chance


Introduction to Biostatistics

342

We therefore calculate the probability p of observing a similar experiment with a


difference between the numbers of changers at least equal to |256 − 144| = 112, if
there would be no change over time in the total population.
In our example, this probability equals p < 0.0001:


So, if severe colds would occur equally frequently at both ages, it would be very
unlikely to observe what has been observed in this particular experiment

We therefore conclude that our data provide evidence that the probability of
having a severe cold at the age of 12 is not the same as the probability of having a
severe cold at the age of 14.

Introduction to Biostatistics

343

Conclusion:
There is a significant difference (p < 0.0001) in the
occurrence of severe colds between the ages 12 and 14

The testing procedure needed for the comparison of proportions in paired data is
called the Mc Nemar test.
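A minimal sketch of this test (assuming statsmodels; not part of the course material), applied to the severe-colds table above:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Paired 2 x 2 table: rows = severe colds at 12 yrs (Yes, No),
#                     columns = severe colds at 14 yrs (Yes, No)
table = np.array([[212, 144],
                  [256, 707]])

# exact=False uses the large-sample chi-squared approximation of the Mc Nemar test
result = mcnemar(table, exact=False, correction=False)
print(f"statistic = {result.statistic:.2f}, p = {result.pvalue:.2e}")  # p far below 0.0001
```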

Introduction to Biostatistics

344

17.3

Assumptions

Similarly to the chi-squared test, the calculation of the p-value is based on the
assumption of a large sample
In case of small samples, the p-value can be calculated without approximations
based on CLT
The exact calculation is similar to the Fisher Exact test for unpaired data.
Many statistical packages only support the large-sample calculations.

Introduction to Biostatistics

345

17.4

Remark

As discussed before, the Mc Nemar test rejects H0 if the off-diagonal elements are
too different from each other, i.e., if there are many more changes in one direction
than in the other direction.
This implies that the testing procedure is independent of the observed diagonal
elements
Examples:

Table:                    20   40          200   40
                          20   50           20  500

McNemar: comparison:      60/130 vs. 40/130        240/760 vs. 220/760
         result:          p = 0.0142               p = 0.0142

Introduction to Biostatistics

346

17.5

Mc Nemar versus chi-squared

There seems to be a lot of confusion about when Mc Nemar test and when
chi-squared test should be used.
As an example, consider the results from a survey in which 75 people were
questioned about their intended vote in the US presidential elections, before and
after a debate on the national television:

                         After TV debate
Before                   Reagan     Carter
TV debate    Reagan      27         7          34
             Carter      13         28         41
                         40         35         75

Introduction to Biostatistics

347

Depending on the research question, this table can be analysed in two different
ways:
. Chi-squared: test for relation between vote before and after debate
. Mc Nemar: test for equal proportion Reagan voters before and after debate
Hence, even when data are paired, the chi-squared test can be used
Note that, in case of continuous data, there is no such choice:
. Unpaired data  ⟹  Unpaired t-test
. Paired data    ⟹  Paired t-test

Introduction to Biostatistics

348

17.5.1

Mc Nemar test
                         After TV debate
Before                   Reagan     Carter
TV debate    Reagan      27         7          34
             Carter      13         28         41
                         40         35         75

Research question:
Is the proportion Reagan voters the same
before and after the debate ?

The observed proportions are 34/75 = 45.3% and 40/75 = 53.3%

Introduction to Biostatistics

349

The p-value obtained from the Mc Nemar test equals p = 0.2636:

Hence the observed difference of 45.3% versus 53.3% would happen in 26.36% of
the cases, even if the percentage of voters for Reagan is the same before and after
the debate.
Conclusion:
The debate has not significantly changed the voting
behaviour (p = 0.2636).

Introduction to Biostatistics

350

17.5.2

Chi-squared test
                         After TV debate
Before                   Reagan     Carter
TV debate    Reagan      27         7          34
             Carter      13         28         41
                         40         35         75

Research question:
Is there a relation between voting behaviour before and
after the debate ?
Or equivalently:
Is the proportion of Reagan voters after the debate the same
amongst those who were in favour of Reagan before the debate as
amongst those who were in favour of Carter before the debate ?
Introduction to Biostatistics

351

The observed proportions are 27/34 = 79.4% and 13/41 = 31.7%


Note that this comes down to comparing the proportion of Reagan voters after
the debate, between two separate groups: Those who were in favour of Reagan
before the debate, and those who were not in favour of Reagan before the debate.
Hence, we now compare unpaired proportions.
The p-value obtained from the Chi-squared test equals p < 0.0001:

The observed difference of 79.4% versus 31.7% is very unlikely to happen if there
would be no relation between the voting behaviour before and after the debate.

Introduction to Biostatistics

352

Conclusion:
There is a significant relation between the voting behaviour
before and after the debate (p < 0.0001).
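Both analyses can be reproduced in a few lines (assuming SciPy and statsmodels; shown as an illustrative sketch only):

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.contingency_tables import mcnemar

# rows = vote before the debate (Reagan, Carter), columns = vote after (Reagan, Carter)
table = np.array([[27,  7],
                  [13, 28]])

mc = mcnemar(table, exact=True)                                  # equal proportion of Reagan voters before/after
chi2, p_chi2, _, _ = chi2_contingency(table, correction=False)   # relation between vote before and after

print(f"Mc Nemar (exact)  p = {mc.pvalue:.4f}")   # should be close to 0.2636
print(f"chi-squared       p = {p_chi2:.2e}")      # far below 0.0001
```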

Introduction to Biostatistics

353

17.5.3

General conclusion

The survey results can be analysed in two different ways, leading to two different
conclusions:
. Mc Nemar: There is no evidence that a TV debate would change the results
of an election (p = 0.2636)
. Chi-squared: There is a strong relation between voting behaviour before and
after the debate (p < 0.0001).
Note that the proportion of Reagan voters before and after a TV debate could
also be compared based on unpaired data.
One then would question 75 people before the debate, and one would question 75
other people after the debate.

Introduction to Biostatistics

354

The resulting 2 × 2 table would then contain 150 subjects:

                    Preference
TV debate           Reagan     Carter
Before              34         41         75
After               40         35         75
                    74         76         150

The chi-squared test would compare the observed proportions 34/75 = 45.3% and
40/75 = 53.3%, which are the same ones as those compared before with the
Mc Nemar test for the experiment with paired observations

Introduction to Biostatistics

355

17.5.4

Some further examples

There is no relation between (non-)significance of the chi-squared test and


(non-)significance of the Mc Nemar test
Examples:

Table:                   25  25     10  40     40  10      5  45
                         25  25     10  40     10  40     20  30

χ²:      comparison:     25/50 vs. 25/50   10/50 vs. 10/50   40/50 vs. 10/50   5/50 vs. 20/50
         result:         p = 1.0000        p = 1.0000        p < 0.0001        p = 0.0291

McNemar: comparison:     50/100 vs. 50/100  50/100 vs. 20/100  50/100 vs. 50/100  50/100 vs. 25/100
         result:         p = 1.0000         p < 0.0001         p = 1.0000         p = 0.0098

Introduction to Biostatistics

356

17.6

Example from biomedical literature

De Clercq et al. [18], Abstract:

Mc Nemar test to compare the presence of symptoms before and after surgery.

Introduction to Biostatistics

357

Part VI
Further topics on statistical inference

Introduction to Biostatistics

358

Chapter 18
Errors in statistics: Basic concepts

. Introduction
. Two types of errors
. Power
. Sample size calculation
. Examples
. Remarks
. Example from the biomedical literature

Introduction to Biostatistics

359

18.1

Introduction

Re-consider the example on the weight gain in rats, where interest is in the
comparison between rats fed on a high or low protein diet
Group-specific histograms:

Introduction to Biostatistics

360

Group-specific summary statistics:

On average, there is an observed difference of 19g between the rats on a high


protein diet and those on a low protein diet.
Based on the unpaired t-test, we obtained before that this observed difference is
not sufficient evidence to believe that the weight gain is really different for the two
diets (p = 0.0757)

Introduction to Biostatistics

361

Conclusion:
There is no significant difference (p = 0.0757) in weight gain
between rats on a high protein level diet,
and rats on a low protein level diet

As indicated before, the result of a statistical test should be interpreted as


evidence in favour or against the null hypothesis, and should not be interpreted as
formal proof.
In our example, the difference in weight gain between a population treated with
one diet and a population treated with the other diet is too small to be detected
based on 12 and 7 animals, respectively.

Introduction to Biostatistics

362

Alternatively, if the t-test would have led to p = 0.001, this would still not
formally prove that there is a difference between both populations.
After all, p = 0.001 would only indicate that the observed difference of 19g occurs
only once every 1000 times, even if there is no difference at all between both
populations.
Maybe, our sample was indeed the extreme one that happens once every thousand
experiments.
Hence, whenever statistical tests are used, one has to be aware that errors in the
conclusions can occur.
It is therefore important to quantify the errors, and to keep them under
control

Introduction to Biostatistics

363

18.2

Two types of errors


Reality
Test result

Accept H0

H0 correct

H0 not correct

No error

Type II error

Reject H0 Type I error

No error

Type I error: H0 is incorrectly rejected


Type II error: H0 is incorrectly accepted

Introduction to Biostatistics

364

18.3

Type I error

A type I error occurs if H0 is correct but the test leads to a significant result.
Question:
How likely is such an error to occur ?

Suppose the test is performed at the α = 5% level of significance


If H0 is correct, then one will observe a significant result in 5% of the cases
Hence, in 5% of the cases, H0 would be incorrectly rejected
Introduction to Biostatistics

365

The probability of making a type I error is therefore equal to the chosen level of
significance.
In practice, the probability of making a type I error is kept under control by
choosing α sufficiently small
In biomedical sciences α = 5% is often used, hereby allowing a type I
error in 5% of the cases.

                      Reality
Test result           H0 correct       H0 not correct

Accept H0             1 − α

Reject H0             α

If H0 is correct, then the probability of making a type I error is α, while the
probability of correctly accepting H0 is 1 − α.
Introduction to Biostatistics

366

18.4

Type II error

A type II error occurs if H0 is incorrect but the test has not detected this, i.e., a
non-significant result is obtained
Question:
How likely is such an error to occur ?

In contrast to the type I error, the probability of making a type II error is not easily
controlled, and depends on various aspects of the sample(s) and population(s)

Introduction to Biostatistics

367

In analogy to the type I error, the type II error rate is denoted by β

                      Reality
Test result           H0 correct       H0 not correct

Accept H0                              β

Reject H0                              1 − β

The power of a statistical test is 1 − β, the probability of correctly rejecting H0

Introduction to Biostatistics

368

18.5

Power

In general, a specific testing procedure is acceptable, only if:

. the chance of making a type I error is sufficiently small


. the power to detect deviations from H0 is sufficiently large

The first condition can be met by specifying α sufficiently small.


The second condition is more difficult to meet, as the power depends on various
aspects of the sample(s) and population(s)
This will be illustrated in the context of the comparison of two groups (such as
the weight gain experiment)

Introduction to Biostatistics

369

As before, let μ1 and μ2 represent the average weight gain in the total population,
under high and low protein diets, respectively.
The null and alternative hypotheses are given by

H0 : μ1 = μ2    versus    HA : μ1 ≠ μ2

The power is the probability of correctly rejecting H0.


In that case, μ1 ≠ μ2, and we denote the true difference between both
populations by Δ = μ1 − μ2
The unpaired t-test assumes the data to be normally distributed in both
populations, with equal variability σ²

Introduction to Biostatistics

370

Graphically:

Low protein                        High protein
(normal, variance σ²)              (normal, variance σ²)
         |                                 |
         μ2                                μ1

Introduction to Biostatistics

371

18.5.1

Power as a function of α

The smaller α, the smaller the power

Intuitively: Type I errors are less likely if the null hypothesis is rejected less
often. However, in cases where H0 is truly wrong, it will then also be rejected less often.
An extreme case is obtained for α = 0:

. α = 0 implies that the null hypothesis is always accepted


. So, in case the null hypothesis is wrong, it is still accepted, leading to power 0

Introduction to Biostatistics

372

18.5.2

Power as a function of the true difference Δ

The smaller Δ, the smaller the power

Intuitively: Large deviations from the null hypothesis are easier to detect

Low protein       High protein             Low protein   High protein
     |                  |                          |   |
     μ2                 μ1                         μ2  μ1
  (large Δ)                                     (small Δ)

Introduction to Biostatistics

373

18.5.3

Power as a function of variability σ²

The smaller σ², the larger the power

Intuitively: Homogeneous groups are easier discriminated than heterogeneous


groups

Low protein       High protein             Low protein       High protein
(large σ²)        (large σ²)               (small σ²)        (small σ²)
     |                  |                       |                  |
     μ2                 μ1                      μ2                 μ1

Introduction to Biostatistics

374

18.5.4

Power as a function of sample size(s)

The more observations, the larger the power

Intuitively: More observations yields more information about the population(s),


therefore implying more precision in the conclusions

Introduction to Biostatistics

375

18.5.5

Conclusion

The power depends on various aspects:


. Level of significance α

. True difference Δ between the populations


. Within-group variance σ²
. Sample size(s)
Note that the sample size is the only aspect under control of the investigator.
In practice, one can calculate the sample size needed to reach a sufficiently high
power.

Introduction to Biostatistics

376

18.6

Sample size calculation

As indicated before, a testing procedure is only acceptable if it has sufficient


power, i.e., if the probability of making a type II error is sufficiently small.
Since the sample size is the only aspect influencing the power, which is under
control of the investigator, it is important that experiments are sufficiently large in
order for the power to be sufficiently large as well
The level of significance α is chosen such that the probability of making a type I
error is sufficiently small
The within-group variance σ² is pre-specified based on earlier, similar experiments,
relevant literature, or a pilot study

Introduction to Biostatistics

377

To be on the safe side, usually an upper bound for σ² is used: In case the


variability would be smaller, the power would be higher, hence still sufficiently high
In practice, Δ is not known. Instead, the smallest Δ which would still be clinically
relevant to detect, is specified.
If sufficient power is attained for the smallest meaningful Δ, we have that:
. Any larger difference will be detected with even larger power

. We are not concerned about small powers for detecting smaller differences, as
such differences are not relevant anyway.
One can then calculate the number(s) of observations needed to reach a desired
level of power.

Introduction to Biostatistics

378

18.7

Example: Weight gain data

In the weight gain data, the observed difference of 19g was found not to be
significant (p = 0.0757)
We can calculate the power that a real difference of 19g would be found
significant if a new experiment were to be conducted, again with 12 and 7
observations in the high and low protein diet groups, respectively.
Group-specific summary statistics, from the current experiment:

Introduction to Biostatistics

379

Power calculations will be based on σ = 21, and α = 0.05


The power to detect a difference of 19g equals 43.45%
Hence, with 12 and 7 observations respectively, there is only 43.45% chance that
a true difference of 19g would be detected.
If a difference of 19g is considered clinically relevant, then the weight gain
experiment was clearly too small, since it is very likely that such a difference would
remain undetected.
We can also calculate the power for other values of Δ

Introduction to Biostatistics

380

Summary:

Δ         Power to detect a difference equal to Δ

0g        5.00%

10g       15.70%

19g       43.45%

30g       80.80%

40g       96.49%

Δ : equal to μ1 − μ2

For example, 12 and 7 observations would be sufficient to show a true difference


of 40g with more than 96% chance.
Alternatively, one can also calculate how large the samples should be to detect a
difference of, e.g., 20g with sufficiently high power.
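Both calculations can be sketched with statsmodels (not part of the course material); σ = 21 and Δ are treated as given:

```python
from statsmodels.stats.power import TTestIndPower

sigma = 21.0                     # assumed common within-group standard deviation
analysis = TTestIndPower()

# Power to detect a true difference of 19 g with 12 and 7 rats (alpha = 0.05)
power_19 = analysis.power(effect_size=19 / sigma, nobs1=12, ratio=7 / 12, alpha=0.05)
print(f"power for a difference of 19 g: {power_19:.3f}")   # should be close to the 43.45% quoted

# Sample size per group needed for 90% power to detect a difference of 20 g
n_per_group = analysis.solve_power(effect_size=20 / sigma, power=0.90, ratio=1.0, alpha=0.05)
print(f"n per group for 90% power at 20 g: {n_per_group:.1f}")  # close to the 25 quoted above
```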
Introduction to Biostatistics

381

Introduction to Biostatistics

382

If a power of 90% is required to detect true effects as small as Δ = 20g, at least


25 observations are needed in each group.
With 30 observations in each group, the probability of making a type II error,
when the true effect is not smaller than 20g, is approximately 5%.

Introduction to Biostatistics

383

18.8

Example: Sickness absence

We re-consider the data on sickness absence, collected on 585 employees with a


similar job:

Sickness absence
Gender      No      Yes
female      245     184     429
male        98      58      156
            343     242     585

The observed difference between the absence rate 42.9% in females and 37.2% in
males was found not significant (chi-squared test, p = 0.215).

Introduction to Biostatistics

384

In case the percentages of sickness absence would be 42% in the total female
population, and 37% in the total male population, and in case a random sample of
429 females and 156 males would be taken, there would be 19.01% chance to
reach a significant effect.
So, if the population proportions are indeed 42% and 37%, an experiment with
429 and 156 employees would detect this difference only 19 times out of 100 experiments.
If a difference of 5% is considered clinically relevant, then the current experiment
was clearly too small, since it is very likely that such a difference would remain
undetected.
We can calculate how large the samples should be in order to detect a difference
between 42% and 37%, with sufficiently high power

Introduction to Biostatistics

385

Introduction to Biostatistics

386

For example, two samples of approximately 2500 observations are needed in order
to show a difference between 37% and 42%, with 95% probability
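A sketch of these two calculations, assuming statsmodels (not part of the course material):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

h = proportion_effectsize(0.42, 0.37)   # standardized difference between 42% and 37%
analysis = NormalIndPower()

# Power with 429 females and 156 males
power = analysis.power(effect_size=h, nobs1=429, ratio=156 / 429, alpha=0.05)
print(f"power: {power:.3f}")            # should be close to the 19.01% quoted

# Sample size per group needed for 95% power
n = analysis.solve_power(effect_size=h, power=0.95, ratio=1.0, alpha=0.05)
print(f"n per group: {n:.0f}")          # roughly 2500 observations per group
```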

Introduction to Biostatistics

387

18.9

Remarks

The earlier examples of power and/or sample size calculations were in the context
of the unpaired t-test and chi-squared test.
Similar calculations can be done in any other statistical testing situation, e.g.,
Fisher Exact test, paired t-test, McNemar test, . . .
Strictly speaking, all experiments should be preceded by a realistic sample size
calculation to avoid experiments with unacceptably high type II error rates, i.e.,
with almost no chance at all to show clinically meaningful effects.

Introduction to Biostatistics

388

18.10

Example from the biomedical literature

Wong et al. [10]


Methodology section, p.658:

Introduction to Biostatistics

389

Table 2 with results:

Discussion, p.664:

Introduction to Biostatistics

390

The difference on which the sample size calculation was based was much larger
than what actually was observed in the experiment
Therefore, the power to reject equality of the groups was (much) lower than the
expected 80%
The current study cannot tell the difference between a 9% increase and a 3%
decrease.
If such differences are considered clinically important, then the current study was
under-powered, due to the fact that the difference was overestimated at the time
of the sample size calculation.

Introduction to Biostatistics

391

Chapter 19
Errors in statistics: Practical implications

. Multiple testing
. Bonferroni correction
. Tests for baseline differences
. Equivalence tests
. Significance versus relevance
. Examples from biomedical literature

Introduction to Biostatistics

392

19.1

Multiple testing

Each time a test is performed, there is a probability α of making a type I error


For example, if α = 0.05, we can expect to incorrectly reject the null hypothesis in
5 out of 100 times.
Implication:
The more tests one performs, the higher the probability
that something is detected by pure chance
This problem of multiple testing occurs very frequently in bio-medical sciences,
in various settings

Introduction to Biostatistics

393

19.1.1

Example: A classroom experiment

On entry in the classroom, assign each student at random to be seated at the left
or at the right side of the classroom
Compare both sides with respect to 100 aspects including weight, height, age,
gender, color of hair, color of eyes,. . .
It is to be expected that for at least 5 of these outcomes, a significant difference is
obtained at the 5% level of significance, by pure chance.

Introduction to Biostatistics

394

19.1.2

Example: Testing many relations

Amin et al. [19], Table 2:

. 18 tests performed
. only 2 significant results

Introduction to Biostatistics

395

19.1.3

Example: Subgroup analyses

Kaplan et al. [20], Table 5:

. Tests based on C.I.s for odds ratios


. C.I. containing 1 is equivalent to a
non-significant test result
. 21 3 = 63 tests performed
. only 5 significant results

Introduction to Biostatistics

396

19.1.4

Example: Searching for the most significant results

This scientific finding was printed in the Belgian newspapers:

It was even stated that those who wake up before 7.21am have a statistically
significant higher stress level during the day than those who wake up after 7.21am.

Introduction to Biostatistics

397

19.1.5

Conclusion

Significant results obtained by multiple testing are often overinterpreted


If the number of tests is reported, the reader knows that such results need to be
interpreted with extreme care
The problem arises when only the significant results are reported, and one does
not know how many tests were performed in total
This leads to reporting results which turn out to be not reproducible
For example, a new study would not find that students seated on the left are taller
than those on the right. Instead, students seated on the left may weigh more than
those seated on the right.

Introduction to Biostatistics

398

For example, a new experiment might show no difference in stress levels between
subjects waking up early and those waking up late. Or maybe a difference would
be found only when waking up is later than 8.12am.

Introduction to Biostatistics

399

19.2

Bonferroni correction

Suppose two tests are performed, both at the 5% level of significance.


The probability that at least one type I error will be made can be shown not to
exceed 2 × 0.05 = 0.10:

P(at least 1 type I error) ≤ 2 × 5% = 10%

In general, if k tests are performed, all at the 5% level of significance, the
probability of making at least one type I error can only be shown not to exceed
k × 5%
Obviously, controlling the overall type I error rate can be done by performing each
separate test at the α/k level of significance.

Introduction to Biostatistics

400

For example, performing 2 tests at the 2.5% level of significance each implies that
the probability of making at least one type I error will not exceed 5%.
In general, when k tests are performed at the α/k level of significance, one is sure
that the overall probability of making at least one type I error will not exceed α.
When confidence intervals are used instead of p-values, the confidence levels can
be corrected in a similar way

Introduction to Biostatistics

401

Some examples:

Number of tests    Significance level    Confidence level

1                  0.05                  95%

2                  0.025                 97.5%

5                  0.01                  99%

k                  0.05/k                (1 − 0.05/k) × 100%

For example, if CI1 , CI2 , . . . , CI5 are 5 intervals with 99% confidence, for 5
unknown parameters θ1 , θ2 , . . . , θ5, then there is at least 95% probability that all
5 C.I.s will contain all 5 unknown parameters:

P(CI1 contains θ1  and  . . .  and  CI5 contains θ5) ≥ 95%

402

Note that, strictly speaking, the Bonferroni correction is an overcorrection, since


the overall type I error rate can only be shown not to exceed 5%, and usually
will be smaller than the required 5%.
In some specific testing situations (e.g., ANOVA analysis), more accurate
corrections are available.
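As a small sketch (assuming statsmodels, and using hypothetical p-values for illustration only), the Bonferroni correction simply multiplies each p-value by the number of tests:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from k = 5 separate tests
pvals = [0.012, 0.030, 0.45, 0.001, 0.20]

reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method='bonferroni')
print(p_adjusted)   # each p-value multiplied by k (capped at 1)
print(reject)       # which tests remain significant after the correction
```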

Introduction to Biostatistics

403

19.3

Examples from the biomedical literature

Baba et al. [21], p.1202 and p.1203:

Introduction to Biostatistics

404

Kellett et al. [12], Table 2 (for example):

Introduction to Biostatistics

405

In the discussion, R.Roy writes:

Note that the reader cannot perform the Bonferroni correction as the exact
p-values have not been reported.
Introduction to Biostatistics

406

19.4

Tests for baseline differences

In order to show causal effects, patients are often randomized into 2 or more
groups
This ensures (at least in large studies) that all treatment groups are identical,
except for the treatment the patients receive
In (relatively) small studies, imbalances can still occur by pure chance
Therefore, one often compares the various groups with respect to important
factors which are believed to be strongly related to the outcome of interest.
This is called testing for baseline differences, as one compares the
characteristics of the patients at the start of the study.

Introduction to Biostatistics

407

As an example, suppose interest is to compare two oral treatments, A and B, for


the treatment of hypertension.
Suppose the change in diastolic BP is the outcome of interest
Age is one of the factors believed to be strongly related to BP. Therefore, it is
important that both treatment groups have the same age distribution
Therefore, one often tests for age differences between A and B, e.g., based on the
two-sample t-test.
The hypothesis tested is

H0 : μA = μB    versus    HA : μA ≠ μB

Note that H0 and HA express properties of the populations, not the samples
Introduction to Biostatistics

408

In the populations (infinitely large), we know that, due to the randomization, A


and B are identical
Conclusion:
It makes no sense at all to perform baseline tests
in randomized studies
No matter how small the resulting p-value would be (e.g., < 10⁻⁸) we know that
the observed difference in age between groups A and B has occurred purely by
chance.
A meaningful alternative is to calculate a C.I. of the average age difference
between both groups, to ensure that the observed difference is sufficiently small to
conclude that it cannot (completely) explain the observed differences in the
outcome of interest.
Introduction to Biostatistics

409

In our example suppose that a 95% confidence interval for the average difference
in age (years) is given by [0.1; 0.3], then we believe that this difference would be
too small to explain why patients in group A show more decrease in BP than
patients in group B.
Note also that testing for baseline differences cannot be used to check whether
the randomization was done properly.

Introduction to Biostatistics

410

19.5

Example from the biomedical literature

Nissen et al. [15], abstract and table 1:

A two-arm randomized study

Introduction to Biostatistics

411

formal tests at baseline

Introduction to Biostatistics

412

19.6

Equivalence tests

Suppose two groups A and B are to be compared, and a two-sample t-test is used
to test

H0 : μA = μB    versus    HA : μA ≠ μB
In case of a non-significant test result, one often concludes that both groups are
identical or equivalent
An alternative interpretation is that the experiment did not have sufficient power
to show an effect which is present.
Conclusion:
Non-significance should not be interpreted as equivalence

Introduction to Biostatistics

413

This can also be seen from the fact that, if the two-sample t-test could be used to
show equivalence, it would be best to collect data on (extremely) small samples,
as this would increase the chance to obtain an non-significant result, due to lack
of power.
Instead, one should reverse H0 and HA:

H0 : |μA − μB| > Δ    versus    HA : |μA − μB| ≤ Δ

where Δ is a pre-specified constant, defining equivalence


Note that HA is equivalent to −Δ ≤ μA − μB ≤ Δ
Hence, in order to reject H0, one needs to show evidence that μA and μB are less
than Δ away from each other
One way to proceed is to construct a C.I. for μA − μB and to check whether it is
entirely within the interval [−Δ; Δ].
Introduction to Biostatistics

414

Graphically, H0 would be rejected if the 95% C.I. for μA − μB lies entirely within [−Δ; Δ]:

     −Δ        [====== 95% C.I. around μ̂A − μ̂B ======]        Δ
                                  |
                                  0

Graphically, H0 would not be rejected if the 95% C.I. extends beyond −Δ or Δ:

 [====== 95% C.I. around μ̂A − μ̂B ======]
     −Δ                           |                            Δ
                                  0

Introduction to Biostatistics

415

Obviously, the result of the equivalence test entirely depends on the choice of Δ
Therefore, Δ needs to be specified prior to the data collection
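A minimal sketch of this C.I.-based rule (assuming NumPy/SciPy, and using hypothetical data and a hypothetical margin Δ = 2, since no real example is analysed here):

```python
import numpy as np
from scipy import stats

def equivalence_ci(x_a, x_b, delta, conf=0.95):
    """Declare equivalence if the C.I. for mu_A - mu_B lies entirely within [-delta, delta]."""
    n_a, n_b = len(x_a), len(x_b)
    diff = np.mean(x_a) - np.mean(x_b)
    # pooled variance, as in the two-sample t-test with equal variances
    sp2 = ((n_a - 1) * np.var(x_a, ddof=1) + (n_b - 1) * np.var(x_b, ddof=1)) / (n_a + n_b - 2)
    se = np.sqrt(sp2 * (1 / n_a + 1 / n_b))
    t_crit = stats.t.ppf(0.5 + conf / 2, df=n_a + n_b - 2)
    lower, upper = diff - t_crit * se, diff + t_crit * se
    return (lower >= -delta) and (upper <= delta), (lower, upper)

# Hypothetical data for groups A and B, with a pre-specified margin delta = 2
rng = np.random.default_rng(2024)
group_a = rng.normal(loc=10.0, scale=2.0, size=40)
group_b = rng.normal(loc=10.3, scale=2.0, size=40)
print(equivalence_ci(group_a, group_b, delta=2.0))
```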

Introduction to Biostatistics

416

19.7

Example from the biomedical literature

Shatari et al. [22]:


. Title:

Introduction to Biostatistics

417

. Table 1:

No significant
differences !

Introduction to Biostatistics

418

. Results and conclusions (abstract):

Introduction to Biostatistics

419

Sripalakit et al. [23], abstract, Table 3, and p.1038


. Title:

Introduction to Biostatistics

420

. Study design:

Aim: equivalence of 2 treatments


Cross-over: all subjects receive both treatments
Washout period of 1 week between both treatments
Treatments given in random order
Introduction to Biostatistics

421

. Definition of equivalence:

Paired data, with skewed distribution for differences


Log transformation of original outcomes: ln(Yi) − ln(Xi) = ln(Yi/Xi)
Equivalence defined as: Δ = 0.22  ⟹  [−Δ; +Δ] = [−0.22; +0.22]
Back-transformed: [exp(−0.22); exp(+0.22)] = [0.80; 1.25]

Introduction to Biostatistics

422

. Table 3 with results, and conclusion (abstract):

Introduction to Biostatistics

423

19.8

Significance versus relevance

We discussed before that the power to detect some effect increases with the
sample size
This implies that any effect, no matter how small, will, sooner or later, be
detected, if the sample is sufficiently large.
For example, consider the Captopril data, where the observed difference of 9.27
mmHg was found significantly different from zero (p < 0.001), based on data
from 15 patients only:

Introduction to Biostatistics

424

The 99% confidence interval for the average change in BP was found to be
[3.02; 15.52].
Suppose that the observed difference would have been 0.1 mmHg.
A p-value as small as 0.001 would be likely to be obtained, provided that the
sample would be sufficiently large.
Obviously, an average change in BP as small as 0.1 mmHg is not relevant from a
clinical point of view.
Conclusion:
Statistical significance  ≠  Clinical relevance

Introduction to Biostatistics

425

A highly significant effect can be a large effect:

p = 0.0001        (wide 95% C.I., located far away from 0)

A highly significant effect can also be a very small effect, but estimated with high
precision, due to a large sample size:

p = 0.0001        (narrow 95% C.I., located close to 0 but not containing it)

Introduction to Biostatistics

426

The p-value cannot distinguish between both situations


It is therefore important not to blindly overinterpret significant results without
knowing the size of the effect
This is another reason why confidence intervals are to be preferred over
significance testing

Introduction to Biostatistics

427

Chapter 20
One-sided versus two-sided tests

. Introduction
. One-sided tests
. Example
. Example from the biomedical literature

Introduction to Biostatistics

428

20.1

Introduction

Re-consider the Captopril data, where the observed difference of μ̂ = 9.27 mmHg
was found significantly different from zero (p < 0.001):

The hypothesis tested is


H0 : μ = 0    versus    HA : μ ≠ 0

This hypothesis is two-sided since it is not pre-specified whether, in case H0 is


rejected, μ is larger or smaller than 0
Introduction to Biostatistics

429

This implies that an observed difference much larger or much smaller than 0
provides evidence against H0
This is also reflected in the calculation of the p-value:
p is the probability of observing an average difference at
least as far away from 0 as 9.27, if μ = 0.

This is equivalent to
p is the probability of observing an average difference larger
than 9.27 or smaller than −9.27, if μ = 0.

Introduction to Biostatistics

430

Graphically:

Sampling distribution of X̄ under H0

    p/2                                             p/2
     |                     |                         |
  −9.27                    0                       9.27

Introduction to Biostatistics

431

20.2

One-sided tests

Sometimes it is of interest to test one-sided hypotheses, e.g.,


H0 : μ ≤ 0    versus    HA : μ > 0

Obviously, observed differences smaller than 0 do not provide any evidence


against H0.
Only differences larger than 0 can be used as evidence in the data against H0
This has implications for the calculation of the p-value:
p is the probability of observing an average difference at
least as large as 9.27, if μ = 0.

Introduction to Biostatistics

432

Graphically:

Sampling distribution of X̄ under μ = 0

                                                    p
     |                     |                        |
  −9.27                    0                      9.27

Introduction to Biostatistics

433

Note that the above distribution is the sampling distribution of X̄ assuming


μ = 0.
Intuitively: If the data provide evidence to reject μ = 0, then also to reject μ ≤ 0
Note that the p-value is now only half the p-value one would obtain when testing
the two-sided hypothesis
As a result, significance is reached more often.
It is therefore tempting to search for arguments justifying one-sided testing rather
than the classical two-sided testing.
Often, this is done after the data have been collected, and after having seen the
direction of the observed effect (positive or negative).

Introduction to Biostatistics

434

However, the study objectives should never be influenced by the data that are
observed.
One-sided testing is justified only if
. it is known that an effect, if any, can only be
in one direction
. only one direction is of scientific interest
. the decision is made prior to the data collection

Introduction to Biostatistics

435

20.3

Example

In the context of the Captopril data, suppose that one is only interested in
treatments which yield an average decrease of at least 5 mmHg in diastolic BP.
This would lead to testing
H0 : μ ≤ 5    versus    HA : μ > 5

Note that only differences larger than 5 can be used as evidence against H0
The p-value is calculated as:
p is the probability of observing an average difference at
least as large as 9.27, if μ = 5.

Introduction to Biostatistics

436

Graphically:

Sampling distribution of X̄ under μ = 5

                                                    p
                           |                        |
                           5                      9.27

Introduction to Biostatistics

437

This p-value is now given by p = 0.038


Conclusion:
The average treatment effect is significantly larger than
5 mmHg (p = 0.038).
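A sketch of this one-sided test (assuming SciPy 1.6 or later for the `alternative` argument), based on the Captopril differences from Chapter 16:

```python
import numpy as np
from scipy import stats

before = np.array([130, 122, 124, 104, 112, 101, 121, 124, 115, 102, 98, 119, 106, 107, 100])
after  = np.array([125, 121, 121, 106, 101,  85,  98, 105, 103,  98, 90,  98, 110, 103,  82])
diff = before - after

# One-sided test of H0: mu <= 5 versus HA: mu > 5
t, p = stats.ttest_1samp(diff, popmean=5, alternative='greater')
print(f"t = {t:.2f}, one-sided p = {p:.3f}")   # should be close to the p = 0.038 quoted
```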

Introduction to Biostatistics

438

20.4

Example from the biomedical literature

Hutchins et al. [24]


Description of methods, p.8315:

Authors in favour of one-sided tests


Journal required two-sided results

Introduction to Biostatistics

439

Results, p.8316:

Results (abstract):

Introduction to Biostatistics

440

Chapter 21
Describing associations

. Introduction
. Pearson correlation
. Relative risk
. Odds ratio
. Examples from biomedical literature

Introduction to Biostatistics

441

21.1

Introduction

All test procedures discussed so far aim at expressing to what extent an observed
relation between two variables can be ascribed to pure chance:
. Unpaired t-test: The relation between a continuous response Y (e.g., weight
gain) and a dichotomous variable X (e.g., protein level) which defines the
groups to be compared.
. Chi-squared test: The relation between a dichotomous response Y (e.g.,
sickness absence) and a dichotomous variable X (e.g., gender) which defines
the groups to be compared.
As discussed before, p-values do not express the size of a relation: A highly
significant effect does not necessarily mean that the effect is clinically relevant, i.e.,
the association between the variables is not necessarily very strong.

Introduction to Biostatistics

442

A number of association measures are available to describe the strength of


association between two variables.
Association measures frequently used in the biomedical literature are:
. the correlation coefficient
. the relative risk
. the odds ratio

Introduction to Biostatistics

443

21.2

Pearson correlation

As an example, we re-consider the surgery data, in which the relation is studied


between the time needed, after surgery, for the BP to recover to a normal level,
and its relation to the BP during the surgery, and the dose of the drug needed to
keep the BP sufficiently low during the surgery.
Data on 53 patients, with 3 types of operation
Available measurements:

. Time (min.) before the patients systolic BP returns to 100 mmHg

. The 10-base log(dose) of the drug in log(mg)


. The average systolic BP while the drug was being administered

Introduction to Biostatistics

444

Introduction to Biostatistics

445

Let us focus on describing the association between the recovery time and the
log(dose), irrespective of the type of operation.
For each patient, we have two measurements:
. The log(dose): xi for the ith patient

. The recovery time: yi for the ith patient


Our data are couples (xi, yi), which can be graphically explored using a
scatterplot.
The scatterplot suggests a positive relation between X and Y
Note that such a relation is an average relation, not a relation at the patient level
Also, the relation is not expected to be very strong: Knowing the dose, one
cannot predict the recovery time very precisely.
Introduction to Biostatistics

446

The Pearson correlation is a quantitative measure for the strength of


association between two variables X and Y , and is defined as:

r = Σi (xi − x̄)(yi − ȳ) / √[ Σi (xi − x̄)² · Σi (yi − ȳ)² ]

where x̄ and ȳ are the sample averages of the observed x-values and y-values,
respectively:

x̄ = (1/n) Σi xi ,        ȳ = (1/n) Σi yi

Insight in the above expression can best be obtained graphically.

Introduction to Biostatistics

447

r = Σi (xi − x̄)(yi − ȳ) / √[ Σi (xi − x̄)² · Σi (yi − ȳ)² ]

(scatterplot of yi versus xi, divided at (x̄, ȳ) into four quadrants; the sign of
(xi − x̄)(yi − ȳ) in each quadrant is:)

    (−,+)   (+,+)
    (−,−)   (+,−)

Introduction to Biostatistics
448

The Pearson correlation coefficient measures to what extent there is a linear


relation between X and Y , and has the following properties:
. −1 ≤ r ≤ 1
. r < 0 : negative linear trend between the xi and the yi
. r > 0 : positive linear trend between the xi and the yi
. r = −1 : the data points (xi, yi) are located on a decreasing straight line
. r = +1 : the data points (xi, yi) are located on an increasing straight line
. r = 0 : there is no LINEAR trend between the xi and the yi

Introduction to Biostatistics

449

Introduction to Biostatistics

450

Note that the correlation r is computed from the observed values (xi, yi), and
only describes the association that has been observed in the sample.
However, this sample correlation r can be considered an estimate for the
population correlation ρ, i.e., the correlation that would be obtained if the
total (infinite) population would be studied.
Usually it is of interest to use the observed sample to test whether ρ can be
considered different from zero
Formally, the following hypothesis is to be tested:

H0 : ρ = 0    versus    HA : ρ ≠ 0

The test procedure assumes X and Y to be jointly normally distributed.


Alternatively to testing a hypothesis about ρ, C.I.s can be computed for ρ as well
Introduction to Biostatistics

451

POPULATION                                RANDOM SAMPLE

H0 : ρ = 0                                Scatterplot of (xi, yi)
HA : ρ ≠ 0                                Estimate for ρ : ρ̂ = r

            ⟵  INFERENCE AND ESTIMATION  ⟵

Introduction to Biostatistics
452

For our example, the correlation matrix for the three variables in the surgery data
set is:


The corresponding scatterplot matrix is:


Note that the normality assumption for the time variable is questionable, implying
that the reported p-values may not be correct
One way to solve this is to transform the variable logarithmically, leading to:

The conclusions do not change qualitatively


21.3 Relative risk

We re-consider the sickness absence example, where the following data were
observed in one of the companies studied:

                 Sickness absence
Gender       Yes      No       Total
female       117      152      269
male         378      711      1089
Total        495      863      1358

The observed proportions of 117/269 = 43.49% and 378/1089 = 34.71% of sickness absence in females and males, respectively, were found to be significantly different (chi-squared: p = 0.007, Fisher Exact: p = 0.009).


The relative risk (RR) quantifies how much more sickness absence occurs in females, compared to males:

RR = Proportion sickness absence in females / Proportion sickness absence in males
   = (117/269) / (378/1089) = 1.25

This implies that sickness absence occurs 1.25 times more often in females than in males.

Alternatively, we can conclude that the risk of sickness absence is 25% larger in females than in males.
As for the correlation coefficient, the RR can be considered an estimate, based on
our sample, for the theoretical relative risk in the total population.


Note that a RR equal to 1 would imply that the risk is the same for both genders,
i.e., that there is no relation between sickness absence and gender.
It is therefore often of interest to test whether the relative risk in the population is
equal to 1. Alternatively, C.I.s for the relative risk can be constructed as well.
For example, a 95% C.I. for the RR in our example is given by [1.0692; 1.4686].

Since 1 ∉ [1.0692; 1.4686], we know that the null hypothesis of no relation between gender and sickness absence is rejected.
Note that formal testing of this hypothesis was done before using the chi-squared
and Fisher Exact test.
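As a sketch of how these numbers can be reproduced, the code below estimates the RR and an approximate 95% C.I. on the log scale (a standard large-sample method; assuming this is the method behind the reported interval, it indeed gives essentially [1.0692; 1.4686]):

import math

a, n1 = 117, 269     # sickness absence yes / total, females
c, n2 = 378, 1089    # sickness absence yes / total, males

rr = (a / n1) / (c / n2)
se_log_rr = math.sqrt(1/a - 1/n1 + 1/c - 1/n2)     # SE of log(RR)
lower = math.exp(math.log(rr) - 1.96 * se_log_rr)
upper = math.exp(math.log(rr) + 1.96 * se_log_rr)
print(round(rr, 3), round(lower, 4), round(upper, 4))   # about 1.253 [1.0692; 1.4687]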

21.4 Odds ratio

We re-consider the data on the relation between the occurrence of cervical cancer
and the age at first pregnancy:

               Disease status
Age        Cervical cancer    Control    Total
≤ 25       42                 203        245
> 25       7                  114        121
Total      49                 317        366

It was shown before that there is a highly significant relation between age at first
pregnancy and the occurrence of cervical cancer (p = 0.002, chi-squared and
Fisher Exact).


The relative risk of interest would indicate how much more likely cervical cancer is to occur when the first pregnancy is at or before the age of 25 years, compared to when the first pregnancy is after the age of 25 years.

Hence, the relative risk of interest is

RR = Proportion cancer cases when first pregnancy ≤ 25 yrs. / Proportion cancer cases when first pregnancy > 25 yrs.

As discussed before, the case-control nature of this study does not allow
estimation of the proportions needed to calculate the above RR.
This is a direct consequence of the fact that the scientist him-/herself decides how
many cancer cases and how many controls will be selected in the sample.


The effect of that decision can be seen from comparing several situations with
different numbers of selected controls:

Table:        ≤ 25 yrs    > 25 yrs
Case          42          7
Control       203         114

RR: [42/(42 + 203)] / [7/(7 + 114)] = 2.96

Table:        ≤ 25 yrs    > 25 yrs
Case          42          7
Control       2030        1140

RR: [42/(42 + 2030)] / [7/(7 + 1140)] = 3.32

This means that the RR can be completely influenced by taking more or less
controls.
Therefore, the RR cannot be used to describe the strength of association in
case-control studies.

An alternative to the RR, which can be used for case-control studies, is the odds ratio, defined as the ratio of the odds of cancer in the ≤ 25 group over the odds of cancer in the > 25 group.

The odds of cancer in the ≤ 25 group is defined as:

Odds≤25 = Proportion cancer cases when first pregnancy ≤ 25 yrs.
          / Proportion non-cancer cases when first pregnancy ≤ 25 yrs.
        = [42/(42 + 203)] / [203/(42 + 203)] = 42/203 = 0.2069

Note that this odds is a measure for the risk of cancer in the ≤ 25 group, since it will be large if there are many cancer cases, and small otherwise.

Similarly, the odds of cancer in the > 25 group is defined as:

Odds>25 = Proportion cancer cases when first pregnancy > 25 yrs.
          / Proportion non-cancer cases when first pregnancy > 25 yrs.
        = [7/(7 + 114)] / [114/(7 + 114)] = 7/114 = 0.0614

This odds is a measure for the risk of cancer in the > 25 group, since it will be large if there are many cancer cases, and small otherwise.

The odds ratio is now defined as:

OR = Odds≤25 / Odds>25 = 0.2069 / 0.0614 = 3.37

Hence the odds of developing cervical cancer are 3.37 times higher when the first pregnancy is at age 25 or younger.

The odds ratio is difficult to interpret, but it clearly gives a general indication of how much more risk there is in one group, compared to another group.

Note that the odds ratio also equals:

OR = (42 × 114) / (203 × 7) = 3.37

In general, we have, for a general 2 × 2 table:

            Group 1    Group 2
Case        A          B
Control     C          D

OR = AD / BC

This shows that, in contrast to the RR, the OR does not depend on the numbers
of selected cases and controls.
This can also be seen in our earlier examples:

Table:        ≤ 25 yrs    > 25 yrs
Case          42          7
Control       203         114

RR: [42/(42 + 203)] / [7/(7 + 114)] = 2.96
OR: (42 × 114) / (7 × 203) = 3.37

Table:        ≤ 25 yrs    > 25 yrs
Case          42          7
Control       2030        1140

RR: [42/(42 + 2030)] / [7/(7 + 1140)] = 3.32
OR: (42 × 1140) / (7 × 2030) = 3.37
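The contrast in this table is easy to verify with a few lines of code (an ad hoc sketch, not taken from any particular package):

def rr(a, b, c, d):
    # a, b: cases in the <= 25 and > 25 groups; c, d: controls in those groups
    return (a / (a + c)) / (b / (b + d))

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

print(rr(42, 7, 203, 114), odds_ratio(42, 7, 203, 114))       # RR 2.96, OR 3.37
print(rr(42, 7, 2030, 1140), odds_ratio(42, 7, 2030, 1140))   # RR changes, OR still 3.37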


As for the correlation coefficient and the RR, the OR can be considered an
estimate, based on our sample, for the theoretical odds ratio in the total
population.
Note that an OR equal to 1 would imply that the risk is the same for both groups,
i.e., that there is no relation between cervical cancer and the age at first
pregnancy.
In that case, one would also have RR = 1.
It is therefore often of interest to test whether the odds ratio in the population is
equal to 1. Alternatively, C.I.s for the odds ratio can be constructed as well.

For example, a 95% C.I. for the OR in our example is given by [1.4658; 7.7457].

Since 1 ∉ [1.4658; 7.7457], we know that the null hypothesis of no relation between cervical cancer and age at first pregnancy is rejected.
Note that formal testing of this hypothesis was done before using the chi-squared
and Fisher Exact test.
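A sketch of how such an interval can be obtained with the standard large-sample (Woolf) method on the log odds ratio; assuming this is the method behind the reported interval, it reproduces [1.4658; 7.7457] up to rounding:

import math

a, b = 42, 7      # cancer cases: first pregnancy <= 25 yrs / > 25 yrs
c, d = 203, 114   # controls:     first pregnancy <= 25 yrs / > 25 yrs

or_hat = (a * d) / (b * c)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)       # Woolf SE of log(OR)
lower = math.exp(math.log(or_hat) - 1.96 * se_log_or)
upper = math.exp(math.log(or_hat) + 1.96 * se_log_or)
print(round(or_hat, 2), round(lower, 4), round(upper, 4))   # about 3.37 [1.47; 7.75]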

21.5 Examples from the biomedical literature

Giantomaso et al. [25]

. Figure 2, p.398: Positive association between the actual distance and the distance estimated by the physician

. Table 1, p.398:
. Negative Pearson correlation (r = -0.139)
. Correlation of patient estimate with physician estimate equals r = 0.349 (r² = 0.12)
. Joint normality of X and Y is questionable (see graph)

Marlow et al. [9], Table 1:

                Classmates      Preterm
Impaired        2 (1.3%)        99 (41%)
Not impaired    158 (98.7%)     142 (59%)

OR = (158 × 99) / (2 × 142) = 56

RR = (99/241) / (2/160) = 33

Chapter 22
Non-parametric statistics

. Introduction
. The principle of ranks
. Wilcoxon test
. Example: Survival times in cancer patients
. Spearman correlation
. Example: Surgery data
. Remarks
. Examples from biomedical literature

22.1 Introduction

Most test procedures commonly used in statistics are based on specific assumptions about the way the outcome Y of interest is distributed in the population. Examples are:
. Normality
. Equal variance
This is why all techniques discussed so far are examples of so-called parametric
statistics
Sometimes, transformations of the data can be used in order to satisfy these
assumptions.
However, this (slightly) complicates the interpretation of the results
Also, in some cases, it is not possible to find a suitable transformation. For example, consider the case where non-normality is caused by multi-modality.

If no transformation can be found, or if transformations are not desired, non-parametric methods can be used.

22.2 The principle of ranks

We re-consider the analysis of the survival times of cancer patients, where a comparison of stomach cancer patients to colon cancer patients was of interest.

[Figure: overlaid histograms of the survival times of the two groups]

This suggests that the colon-cancer cases have longer survival times than the stomach-cancer cases, i.e., that the distribution of the survival times in one group is shifted more to the right from the distribution in the other group.
This implies that, if all observations would be ranked, we expect to see more
observations from the stomach-cancer group in the lower ranks, and more from
the colon-cancer group in the higher ranks.
This suggests that it is sufficient to study the ranks of the observations, i.e.,
which observations are larger/smaller than others, to decide whether the survival
times in both groups can be assumed to be sampled from the same distribution.
The actual location of the observations is not needed, it is sufficient to know their
ranks.
Most non-parametric tests are based on replacing the observations by their ranks.
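As a tiny illustration of this idea, scipy's rankdata function performs the replacement (the nine values below are simply the two samples of the next section pooled together):

from scipy.stats import rankdata

values = [7, 4, 9, 17, 11, 6, 21, 14, 18]
print(rankdata(values))   # [3. 1. 4. 7. 5. 2. 9. 6. 8.]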


This will now be illustrated with two frequently-used non-parametric procedures:


. The Wilcoxon test

. The Spearman correlation coefficient

22.3 Wilcoxon test

The Wilcoxon test is the non-parametric version of the unpaired t-test. Hence, it
allows comparison of two populations, without having to assume the data to be
normally distributed in both populations
The null and alternative hypotheses are:

H0: one distribution (the stomach cancer and colon cancer survival times come from the same distribution)

HA: shifted distributions (one distribution is shifted relative to the other)

[Figure: density curves for the stomach cancer and colon cancer groups, coinciding under H0 and shifted apart under HA]

Hence, the alternative assumes that one distribution is just shifted from the other.
As an example of how the Wilcoxon test proceeds, consider the comparison of two
populations (A and B), on the basis of the following two samples:
A 7 4 9 17
B 11 6 21 14 18
The observations are now sorted, while keeping track of the population from
which they were sampled (group A or B):
4 6 7 9 11 14 17 18 21
A B A A B B A B B


The observed values are now replaced by their rank in the complete data set
(groups A and B together):
1 2 3 4 5 6 7 8 9
A B A A B B A B B
The sum of the ranks of all observations from one group is now calculated. For
example, for group A, this becomes:
WA = 1 + 3 + 4 + 7 = 15
Obviously, if WA is exceptionally large, this means that the observations in group
A are located more to the right, when compared to the observations in group B


Alternatively, if WA is exceptionally small, this means that the observations in


group A are located more to the left, when compared to the observations
in group B
Hence, H0 will be rejected if WA is too large or too small.
Question:
How large/small is too large/small ?
Answer:
If the observed value for WA
is very unlikely to happen by pure chance

We therefore calculate the probability p of observing a value for WA at least as extreme as the one actually observed, if the two populations were identical.

In our example, this probability equals p = 0.2857:

Hence, even if the two samples were drawn from the same population, there would be a 28.57% chance of observing two samples shifted from each other as much as in the current experiment, by pure chance.
Hence, what has been observed in the current experiment is perfectly in line with
what is to be expected, if the two populations are identical.
We therefore conclude that there is no significant difference between the groups A and B (p = 0.2857).

This testing procedure is called the Wilcoxon (rank sum) test or, equivalently, the Mann-Whitney U test.

Note that, alternatively, one can also decide to sum the ranks of the other group (here group B):

WB = 2 + 5 + 6 + 8 + 9 = 30

This would lead to identical results, since WA is large if WB is small, and vice versa. Indeed, we have that

WA + WB = (nA + nB)(nA + nB + 1) / 2

Hence, knowing WA is equivalent to knowledge of WB.
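For reference, the same exact p-value should be reproducible with scipy's Mann-Whitney implementation (a sketch; scipy works with the U statistic, here U_A = W_A - n_A(n_A + 1)/2 = 15 - 10 = 5):

from scipy.stats import mannwhitneyu

A = [7, 4, 9, 17]
B = [11, 6, 21, 14, 18]

result = mannwhitneyu(A, B, alternative="two-sided", method="exact")
print(result.statistic, result.pvalue)   # U = 5.0, p approximately 0.2857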
22.4 Example: Survival times in cancer patients

The survival times of colon cancer patients were compared before with those of stomach cancer patients, using the unpaired t-test, after logarithmic transformation of the survival times.
We can now repeat this non-parametrically, for the original as well as
log-transformed survival times:

                        t-test        Wilcoxon
Original data           p = 0.2483    p = 0.0945
Log-transformed data    p = 0.0671    p = 0.0945


Note that the Wilcoxon test yields a p-value closer to the one obtained from the
t-test based on log-transformed data than to the one obtained from the t-test
based on the original data
Since the Wilcoxon test is based on ranks rather than the original data,
transforming the data will not affect the result, as long as monotonic
transformations are used.
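This is easy to check on the small A/B example used earlier: applying the logarithm to both samples leaves the ranks, and hence the exact p-value, unchanged (a minimal sketch):

import numpy as np
from scipy.stats import mannwhitneyu

A = np.array([7, 4, 9, 17], dtype=float)
B = np.array([11, 6, 21, 14, 18], dtype=float)

p_raw = mannwhitneyu(A, B, method="exact").pvalue
p_log = mannwhitneyu(np.log(A), np.log(B), method="exact").pvalue
print(p_raw == p_log)   # True: a monotone transformation does not change the ranks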

22.5 Spearman correlation

The Pearson correlation coefficient r expresses the strength of linear association


between two variables X and Y
As discussed before, the test for significance of the observed correlation assumes
X and Y to be jointly normally distributed.
In cases where a transformation is not possible or not desired, a non-parametric
version can be derived, leading to the so-called Spearman correlation
coefficient.
As for the Wilcoxon test, the Spearman correlation coefficient will be based on
replacing the observations by their ranks.

As an example of how the calculation of the Spearman correlation proceeds, consider the following 8 observations for the variables X and Y:

xi:   0      2      4      6      8      10     12     13
yi:   0.10   1.05   4.00   5.30   5.75   6.55   6.65   8.17

[Figure: scatterplot of the (xi, yi) pairs]

Each value xi is now replaced by its rank amongst all observed values for X. Similarly, each value yi is now replaced by its rank amongst all observed values for Y.
Graphically:

[Figure: scatterplot of rank(yi) versus rank(xi), together with the table of ranks; the ranked pairs lie on an increasing straight line]

One now calculates a Pearson correlation as a measure of association between the so-obtained ranks.


In the above example, the ranks show a perfect linear relation, implying that the
Spearman correlation will equal 1.
Note that the original data did not show a perfect linear fit, implying that the
Pearson correlation would be less than 1.
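A sketch verifying both statements with scipy, using the eight (xi, yi) pairs introduced above:

from scipy.stats import spearmanr, pearsonr

x = [0, 2, 4, 6, 8, 10, 12, 13]
y = [0.10, 1.05, 4.00, 5.30, 5.75, 6.55, 6.65, 8.17]

print(spearmanr(x, y)[0])   # 1.0: the relation is perfectly monotone
print(pearsonr(x, y)[0])    # below 1: the relation is not perfectly linear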
The Spearman correlation coefficient measures to what extent there is a monotone relation between X and Y, and has the following properties:

. -1 ≤ r ≤ 1
. r < 0 : negative trend between the xi and the yi
. r > 0 : positive trend between the xi and the yi
. r = -1 : there is a perfect negative monotone relation between the xi and the yi
. r = +1 : there is a perfect positive monotone relation between the xi and the yi
. r = 0 : there is no monotone trend between the xi and the yi

A statistical test for significance of the Spearman correlation can be constructed


as well.
This test procedure is not based on any distributional assumptions about X or Y .
Note that, although the Spearman correlation is often interpreted as just the
non-parametric version of the Pearson correlation, it is important to realize that,
strictly speaking, both correlations measure different types of association:
. Pearson: Linear association
. Spearman: Monotone association

22.6 Example: Surgery data

As an example, we re-consider the surgery data, in which we study the relation between the time needed, after surgery, for the BP to recover to a normal level, and both the BP during the surgery and the dose of the drug needed to keep the BP sufficiently low during the surgery.
Data on 53 patients, with 3 types of operation
Available measurements:

. Time (min.) before the patient's systolic BP returns to 100 mmHg

. The 10-base log(dose) of the drug in log(mg)


. The average systolic BP while the drug was being administered


Before, a Pearson correlation analysis was performed, and the variable Time was
log-transformed in order to satisfy the normality assumption.
We compare the previous results with those from a non-parametric Spearman
correlation analysis:

Note that Spearman correlations are not always larger/smaller than Pearson
correlations.
Since the Spearman correlation is based on ranks rather than the original data,
monotone transformations of the data will not affect the result.
22.7 Remarks

For most simple statistical procedures, non-parametric versions are available.


Non-parametric procedures are not based on distributional assumptions for the
data.
Since non-parametric procedures are based on ranks, they are not affected by
monotone transformations of the data. Hence, transforming the data prior to a
non-parametric analysis does not make any sense.
Since non-parametric procedures are based on ranks, they are not influenced by
extreme values (outliers).

In general, the use of non-parametric procedures should be consistent with the summary statistics used to describe the observed data:

. Means and standard deviations go together with parametric tests
. Medians and interquartile ranges go together with non-parametric tests

In case the distributional assumptions of a specific test are satisfied, one has the
choice between the parametric and non-parametric test.
In such cases, the parametric techniques are to be preferred, as they are more
powerful to detect relevant effects.
Unfortunately, many research questions will require more complex statistical tools
for which no non-parametric alternatives are available.

22.8 Examples from the biomedical literature

Choksy et al. [26]

. Statistical methodology, p.647:

. Power analysis does not specify the test


. Parametric and non-parametric data ?

. Figure 3:


Chen et al. [17], Table 3:

. Spearman rank correlations


. Many tests, few significant results, multiple testing
Huang et al. [27], Figure 1:

. Spearman correlation to quantify linear relations
. Spearman correlation not affected by outlier


Bibliography
[1] S. Graham and W. Shotz. Epidemiology of cancer of the cervix in Buffalo, New York. Journal of the National Cancer Institute, 63:23-27, 1979.
[2] D.J. Hand, F. Daly, A.D. Lunn, K.J. McConway, and E. Ostrowski. A handbook of small data sets. Chapman & Hall, first edition, 1989.
[3] P. Armitage and G. Berry. Statistical methods in medical research. Blackwell Scientific Publications, 1987.
[4] E. Cameron and L. Pauling. Supplemental ascorbate in the supportive treatment of cancer: re-evaluation of prolongation of survival times in terminal human cancer. Proceedings of the National Academy of Sciences U.S.A., 75:4538-4542, 1978.
[5] G.A. MacGregor, N.D. Markandu, J.E. Roulston, and J.C. Jones. Essential hypertension: effect of an oral inhibitor of angiotensin-converting enzyme. British Medical Journal, 2:1106-1109, 1979.
[6] M. Bland. An introduction to medical statistics. Oxford University Press, 3rd edition, 2006.
[7] J.D. Robertson and P. Armitage. Comparison of two hypotensive agents. Anaesthesia, pages 53-64, 1959.
[8] H.A. Boushey, C.A. Sorkness, T.S. King, et al. Daily versus as-needed corticosteroids for mild persistent asthma. The New England Journal of Medicine, 352:1519-1528, 2005.
[9] N. Marlow, D. Wolke, M.A. Bracewell, et al. Neurologic and developmental disability at six years of age after extremely preterm birth. The New England Journal of Medicine, 352:9-19, 2005.


[10] C.A. Wong, B.M. Scavone, A.M. Peaceman, et al. The risk of cesarean delivery with neuraxial analgesia given early versus late in labor. The New England Journal of Medicine, 352:655-665, 2005.
[11] F. Blanchon, M. Grivaux, B. Asselain, et al. 4-year mortality in patients with non-small-cell lung cancer: development and validation of a prognostic index. Lancet Oncology, 7:829-836, 2006.
[12] K.M. Kellett, D.A. Kellett, and L.A. Nordholm. Effects of an exercise program on sick leave due to back pain. Physical Therapy, 71:283-293, 1991.
[13] S.P. Wu. Maximum acceptable weight of lift by Chinese experienced male manual handlers. Applied Ergonomics, 28:237-244, 1997.
[14] T. Nawrot, M. Plusquin, J. Hogervorst, et al. Environmental exposure to cadmium and risk of cancer: a prospective population-based study. The Lancet Oncology, 7:119-126, 2006.
[15] S.E. Nissen, E.M. Tuzcu, P. Schoenhagen, et al. Statin therapy, LDL cholesterol, C-reactive protein, and coronary artery disease. The New England Journal of Medicine, 352:29-38, 2005.
[16] E. Zuskin, J. Mustajbegovic, N. Schachter, et al. Longitudinal study of respiratory findings in rubber workers. American Journal of Industrial Medicine, 30:171-179, 1996.
[17] N.H. Chen, P.C. Wang, M.J. Hsieh, et al. Impact of severe acute respiratory syndrome care on the general health status of healthcare workers in Taiwan. Infection Control and Hospital Epidemiology, 28:75-79, 2007.
[18] C.A.S. De Clercq, J.S.V. Abeloos, M.Y. Mommaerts, and L.F. Neyt. Temporomandibular joint symptoms in an orthognathic surgery population. Journal of Cranio-Maxillo-Facial Surgery, 23:195-199, 1995.
[19] A.I. Amin, O. Hallbook, A.J. Lee, R. Sexton, B.J. Moran, and R.J. Heald. A 5-cm colonic J pouch colo-anal reconstruction following anterior resection for low rectal cancer results in acceptable evacuation and continence in the long term. Colorectal Disease, 5:33-37, 2003.
[20] S. Kaplan, S. Etlin, I. Novikov, and B. Modan. Occupational risks for the development of brain tumours. American Journal of Industrial Medicine, 31:15-20, 1997.


[21] Y. Baba, J.D. Putzke, N.R. Whaley, Z.K. Wszolek, and R.J. Uitti. Gender and the Parkinson's disease phenotype. Journal of Neurology, 252:1201-1205, 2005.
[22] T. Shatari, M.A. Clark, T. Yamamoto, A. Menon, C. Keh, J. Alexander-Williams, and M. Keighley. Long strictureplasty is as safe and effective as short strictureplasty in small-bowel Crohn's disease. Colorectal Disease, 6:438-441, 2004.
[23] P. Sripalakit, P. Nermhom, and S. Maphanta. Bioequivalence evaluation of two formulations of Doxazosin tablet in healthy Thai male volunteers. Drug Development and Industrial Pharmacy, 31:1035-1040, 2005.
[24] L.F. Hutchins, S.J. Green, P.M. Ravdin, D. Lew, S. Martino, M. Abeloff, A.P. Lyss, C. Allred, S.E. Rivkin, and C.K. Osborne. Randomized, controlled trial of Cyclophosphamide, Methotrexate, and Fluorouracil versus Cyclophosphamide, Doxorubicin, and Fluorouracil with and without Tamoxifen for high-risk, node-negative breast cancer: Treatment results of intergroup protocol INT-0102. Journal of Clinical Oncology, 23:8313-8321, 2005.
[25] T. Giantomaso, L. Makowsky, N.L. Ashworth, and R. Sankaran. The validity of patient and physician estimates of walking distance. Clinical Rehabilitation, 17:394-401, 2003.
[26] S.A. Choksy, P.L. Chong, C. Smith, M. Ireland, and J. Beard. A randomised controlled trial of the use of a tourniquet to reduce blood loss during transtibial amputation for peripheral arterial disease. European Journal of Vascular and Endovascular Surgery, 31:646-650, 2006.
[27] C.-C.J. Huang, C.-M. Li, C.-F. Wu, S.-P. Jao, and K.-Y. Wu. Analysis of urinary N-acetyl-S-(propionamide)-cysteine as a biomarker for the assessment of acrylamide exposure in smokers. Environmental Research, 104:346-351, 2007.
