Biostatistics
CONTENTS
Chapter 1
Introduction
Background
Definition of Biostatistics
Need for Biostatistics
Application of Biostatistical Methods
Chapter 2
Descriptive Statistics
Introduction
Descriptive Methods for Qualitative Data
Descriptive Methods for Quantitative Data
Chapter 3
Probability
Introduction
Probability Calculation (Addition and Multiplication Rules)
Chapter 4
Chapter 5
Chapter 6
Estimation
Statistic and Parameter
The Standard Error of a Mean
The Standard Error of a Proportion
Chapter 7
Chapter 8
Chapter 9
Chapter 10
Linear Regression
Correlation
Logistic Regression
Chapter 11
FOREWORD
The study of statistics deals with the collection, processing and interpretation of data. The concepts of
statistics are applied in many scientific fields that include agriculture, business, engineering and health.
When focus is on biological and health sciences, the term biostatistics is used. This manual of
biostatistics was written for students of the health sciences and serves as an introduction to the study of
biostatistics. The contents of the manual are based on the requirements for the biostatistics courses
offered at the Muhimbili University College of Health Sciences for both undergraduates and
postgraduates.
Textbooks on mathematical statistics usually include theoretical examples and exercises. The task of
finding relevant data is so enormous that even textbooks on applied statistics rarely include practical
examples and exercises. In particular, a course in biostatistics which is not introduced via numerous
examples of real data renders a restrictive view of the subject and hence tends to discourage the
uninitiated student. This manual is intended to provide substantial contact with a variety of statistical
methods and data sets so that the student can appreciate their application and the contexts in which they
are used. In the process the manual will facilitate the student's learning and provide handy notes and
references for further reading.
The authors have performed a valuable service in compiling the present manual. Many of the examples
and exercises given in this collection are based on health-related data, and the techniques which the
student is expected to apply cover a wide range of commonly used techniques. The manual will be of
great value both as the basis for a taught course and for private study.
ACKNOWLEDGEMENT
This work would have been impossible without the generous financial support of SIDA (SAREC) as
part of Research Capability Strengthening in the Department of Epidemiology/Biostatistics.
Japhet Z. J. Killewo
Associate Professor and Head of Department
Department of Epidemiology/Biostatistics
Chapter 1
INTRODUCTION
BACKGROUND
Biostatistics can be defined as the application of statistics to biological problems. To many
biomedical scientists, however, the term is considered to mean the application of statistics
specifically to medical problems. For this group of people, therefore, biostatistics and medical
statistics are synonymous. Indeed the kind of (bio)statistics taught in University Medical Schools is
medical statistics in which some applications which are specific for agricultural sciences, for
example, are not included.
Conversely, in Universities of Agriculture the term BIOMETRY is preferred to biostatistics.
Biometry (literally meaning measurement of life), refers to the application of statistical methods to
the analysis of biological data. In strict terms this should include analysis of data from (human)
medical sciences as well, but in practice less weight is attached to this.
Whether biometry or biostatistics (and in some places biomathematics is used) the word statistics is
implied. We attempt in the following section to define statistics by describing what it is.
DEFINITION OF BIOSTATISTICS:
We can define statistics in two forms:
First, statistics as a noun is the plural of the word statistic, which simply means numerical statements (i.e. information that is available in numbers). Examples of this include:
(i) hospital data on the number of admissions for some condition in a defined time period;
(ii) how much drug (e.g. chloroquine tablets) is distributed to health units (hospitals, health centres, dispensaries, etc.).
Secondly, statistics as a discipline is a field of study concerned in broad terms with:
(i) Collecting, organizing and summarizing data in a systematic way.
(ii) Drawing inferences about a population on the basis of only a part of the population targeted.
Note: A singular form of statistics in this sense is as meaningless as putting mathematic or physic as singulars for mathematics or physics, respectively.
The first part of the subject is usually referred to as Descriptive Statistics, while the second part,
which provides objective means of drawing conclusions, constitutes Inferential Statistics.
In this course we are concerned mainly with the second sense of the meaning of statistics - that is, as
a discipline. Moreover, from the above background, the kind of BIOSTATISTICS here will be
specifically that of medical statistics.
With these results one may be tempted to conclude that the new treatment is better than the old (standard).
But an analysis which looks at the results for male patients separately from the female patients revealed the
following:
Table 1.2:
FEMALES

TREATMENT    Improved    Did not Improve    Total    % Improved
Standard        32              8             40         80
New             96             64            160         60
Total          128             72            200
From this table, we note that for female patients it is the standard treatment that is doing better. This is
exactly the opposite of what we saw in the overall assessment, and one might expect the new treatment to fare
better among the male patients.
If this holds, the conclusion would be: give the old (standard) treatment to female patients and the new treatment to male patients. In practical terms, the decision following this controversial conclusion would be undesirable. However, when we look at the results relating to the male patients we see the following:
Table 1.3:
MALES

TREATMENT    Improved    Did not Improve    Total    % Improved
Standard        48            112            160         30
New              4             36             40         10
Total           52            148            200
Just as in female patients, it is the standard treatment that produces the higher percentage of improvement. You
should check and verify that all this hangs together; the overall rate of improvement for the standard
treatment, for example, is (32+48)/(40+160) = 80/200 = 40% as shown above. With a proper statistical
method of analysis it becomes clear that the difference in improvement between the two treatments, once
sex has been taken into account, is 20% in favour of the standard treatment. Such features are common in
medical surveys and are a typical aspect of observational studies. The situation would have been kept under
control in an experimental study design. These arguments emphasize the need for biostatistical methods not
only for data analysis but also for study design.
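This reversal (the aggregate comparison contradicting every sex stratum) can be verified with a short computation. The following sketch, not part of the manual itself, uses the counts from Tables 1.2 and 1.3:

```python
# Improvement counts from Tables 1.2 (females) and 1.3 (males):
# each entry is (improved, total) for a treatment within a sex stratum.
data = {
    "female": {"standard": (32, 40), "new": (96, 160)},
    "male":   {"standard": (48, 160), "new": (4, 40)},
}

def pct_improved(improved, total):
    """Percent improved, e.g. 32 out of 40 -> 80.0."""
    return 100.0 * improved / total

# Stratum-specific rates: the standard treatment wins in BOTH sexes...
for sex, arms in data.items():
    std = pct_improved(*arms["standard"])
    new = pct_improved(*arms["new"])
    print(f"{sex}: standard {std:.0f}%, new {new:.0f}%")

# ...yet the pooled (crude) rates point the other way, because sex is
# associated with both treatment received and outcome (Simpson's paradox).
crude = {}
for arm in ("standard", "new"):
    improved = sum(data[sex][arm][0] for sex in data)
    total = sum(data[sex][arm][1] for sex in data)
    crude[arm] = pct_improved(improved, total)
print(f"pooled: standard {crude['standard']:.0f}%, new {crude['new']:.0f}%")
```

The pooled rates (40% standard vs 50% new) favour the new treatment even though each stratum favours the standard one, which is exactly the trap the text warns about.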
Chapter 2
DESCRIPTIVE STATISTICS
INTRODUCTION
Numerical information needs to be summarized before it can be used. The methods of summarizing data
(methods of descriptive statistics) vary with different types of data which are generated from different types
of variables. We first define what a variable is and then distinguish the different types of data. A variable is
an observation, characteristic or phenomenon that can take different values for different persons, times,
places, etc.
Examples of variables
VARIABLE                  POSSIBLE VALUES
Height (cm)               158, 169.3, 170, 200.6, etc.
Weight (kg)               10.2, 50, 69.4, 84, etc.
Parity                    0, 1, 6, 8, 10, etc.
Outcome of disease        Recovery, Chronic illness, Death
Marital status            Single, Married, Widowed, Separated
Age (years)               1, 5, 30, 36, etc.
Haemoglobin (g/dl)        8.9, 14.2, 12.7, etc.
Number of AIDS cases      278, 301, 313, 350, etc.
Types of variables
There are two types of variables:
(1) Qualitative (categorical) variables
(2) Quantitative (numerical) variables.
(a) Nominal measurement
These are used for identifying various categories that make up a given variable.
Example:
(1) Religion: 1= Muslim , 2 = Christian, 3 = Other
(2) Sex: 1=male, 2=Female
Note that the numbering (codes) does not signify ranking.
The categories comprising a nominal variable are mutually exclusive (they cannot occur together) and carry no order.
(b) Ordinal measurement
These are used to reflect a rank order among categories comprising a variable.
Example: perceived level of pain
1=No pain, 2=moderate pain, 3=severe pain
The numbers used have no meaning other than to indicate rank order.
Ordinal measurement enables one to make a qualitative comparison (such as more/less pain) but not a
quantitative comparison such as how much more.
(c) Interval measurement
Numbers used for this level of measurement are more meaningful than in the former levels.
Arithmetic operations (+ and -) can be performed. The distance between any two consecutive
points is the same along the scale.
Examples:
i.
ii.
(d) Ratio measurement
This is the most sophisticated level of measurement. This level has all the characteristics of interval
measurement but it has an absolute zero point that represents an absence of the measured quantity.
Example, weight, length, height, age, etc.
Note: Measurement at ratio level can be converted to lower levels.
Example, the following data shows a qualitative variable "Result of sputum examination".
If: 1 stands for smear -ve, culture -ve.
2 stands for smear -ve, not cultured.
3 stands for smear or culture +ve.
1 1 1 3 1 2 2 1 2 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 3
2 1 1 1 1 1 2 2 1 1 1 1 3 1 3 1 1 3 1 1 3 2 1 3 1
1 1 1 1 2 2 1 1 1 1 1 1 1 3 3 3 2 1 1 1 1 1 2 1 1
1 1 2 1 2 1 2 2 1 3 1 1 1 3 1 1 1 3 2 1 2 3 1 3 3
1 3 1 1 1 3 3 1 2 2 1 1 1 1 3 1 3 1 3 1 3 1 1 3 1
1 3 1 1 3 1 2 3 1 1 1 1 1 1 1 2 1 1 3 3 1 3 2 3 1
1 2 1 1 3 1 1 1 2 2 3 1 3 1 3 2 3 1 1 2 3 1 1 2 1
3 3 2 1 1 3 1 1 1 3 1 1 1 1 1 1 1 3 1 1 1 2 1 1 2
2 1 1 1 1 1 1 3 1 1 1 2 1 1 2 1 1 2 1 3 2 1 3 2 1
1 1 1 1

Frequency: 1 -> 144, 2 -> 40, 3 -> 45 (total 229)
Table 2.1: A frequency distribution for the variable "Result of sputum examination".

VALUE                      Frequency    Relative frequency    Cumulative relative frequency
Smear -ve, culture -ve        144             62.9                      62.9
Smear -ve, not cultured        40             17.5                      80.4
Smear or culture +ve           45             19.6                     100.0
Total                         229            100.0
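A frequency distribution like Table 2.1 can be tallied programmatically. The sketch below uses a short hypothetical sample of the 1/2/3 codes rather than the full 229 observations:

```python
from collections import Counter

def frequency_table(values):
    """Return (value, frequency, relative %, cumulative relative %) rows,
    in sorted order of the coded values."""
    counts = Counter(values)
    n = len(values)
    rows, cum = [], 0.0
    for value in sorted(counts):
        rel = 100.0 * counts[value] / n
        cum += rel
        rows.append((value, counts[value], round(rel, 1), round(cum, 1)))
    return rows

# Hypothetical shortened sample of sputum-examination codes (1/2/3):
sample = [1, 1, 2, 3, 1, 1, 2, 1, 3, 1]
for row in frequency_table(sample):
    print(row)
```

Applied to the full data set above, the same function reproduces the 144/40/45 tally of Table 2.1.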
Use of diagrams:
Frequency distributions can be illustrated visually by means of statistical diagrams. These diagrams serve two main purposes:
(i) Presentation of information/data (e.g. in reports or articles) for ease of appreciation.
(ii) To serve as a private aid for further statistical analysis.
Two types of diagrams are commonly used to illustrate qualitative data. These are pie charts and bar charts.
1. Pie chart
Pie charts are used to express the distribution of individual observations into different categories (Note: The
frequencies should be converted into percentages totalling 100 for a pie chart to be used).
Example, below is a pie chart showing the distribution of first year students at Muhimbili University
College of Health Sciences (MUCHS) by course of study.
Fig 1: Distribution of First Year Students at MUCHS by Course of Study (MD 49.0%, BPharm 25.0%, DDS 14.0%, BSc N 12.0%).
2. Bar chart
These are the simplest and most effective means of illustrating qualitative data. The various categories of a
variable are represented on one axis and the frequency or relative frequency is represented on the other axis.
The length of each bar represents the number of observations (frequency) in each category or the relative
frequency in percentage. Example, consider the following birth control method mix in a certain population:

Abstinence              3%
Oral contraceptive     32%
Depo Provera            9%
Loop                   17%
Spermicides             7%
Condoms                26%
Vasectomy               3%
Hysterectomy            2%
Norplant                1%
In this example, use of a pie chart for this variable would not be suitable because, with nine categories, the diagram would be overcrowded and the smaller segments hard to distinguish.

Fig. 2: Bar chart showing the birth control method mix (percentage on the vertical axis, method on the horizontal axis).
In a two way table data are presented in rows and columns. The format for a table depends upon the data and
the aspects of the data which are important to portray.
A two-way table should include the following:
1. A clear title.
2. A caption for the rows and columns with units of measurement of the variable.
3. Labels for each individual row or column, i.e. the values taken by the variable concerned.
4. Marginal and grand totals.
Consider the following example:
In a study to investigate whether or not HIV1 infection is a risk factor to pulmonary tuberculosis (PTB), a
total of 2165 individuals were examined. Blood samples were also collected from these individuals for
laboratory diagnosis of HIV1 infection.
The following results were obtained:
Of the 2165 individuals examined, 639 were found to be negative for HIV1 infection. Of those who were
negative, 57 were found to have PTB. Of the 1526 who were HIV1 positive, 875 were found to have PTB.
Table 2.2:

                            PTB STATUS
HIV STATUS      Positive         Negative         Total
Positive        875 (57.0)       651 (43.0)       1526 (100.0)
Negative         57 (8.9)        582 (91.1)        639 (100.0)
Total           932 (43.0)      1233 (57.0)       2165 (100.0)
Numbers in brackets show the row percentages.
The cells of a two way table may contain percentages instead of the real counts. Calculation of percentages
may be row-wise or column-wise depending on the purpose of the table.
Example: In the above table our interest is to investigate whether HIV1 infection is a risk factor to PTB. So
our aim is to see whether PTB is higher in HIV1 positives than in HIV1 negatives. Hence, the row
percentages are more appropriate in this case.
(ii) Proportion of girls in the first year at MUCHS = (Number of girls in 1st year) / (Total number of 1st year students)

Proportion of male births = (Number of male births) / (Total number of births)
That is, crude death rate = (Number of deaths in one year / Total population) x 1000
Rates may be expressed per 1000, per 100,000 or per 1,000,000 population depending on convention and
convenience.
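As a small illustration of the formula, the function and the figures below are hypothetical, not from the manual:

```python
def crude_death_rate(deaths, population, per=1000):
    """Deaths in one year per `per` population (per 1000 by default)."""
    return deaths * per / population

# Illustrative figures: 8,250 deaths in a population of 1,500,000.
rate = crude_death_rate(8250, 1_500_000)
print(f"{rate:.1f} per 1000")            # 5.5 per 1000
print(crude_death_rate(8250, 1_500_000, per=100_000))  # same rate per 100,000
```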
Table 2.4: Frequency distribution of number of lesions caused by small pox virus in egg membranes.

NUMBER OF LESIONS    FREQUENCY (NUMBER OF MEMBRANES)
0-                           1
10-                          6
20-                         14
30-                         14
40-                         17
50-                          8
60-                          9
70-                          3
80-                          6
90-                          1
100-                         0
110-119                      1
Total                       80
Note: "-" means up to but not including the next tabulated value. Example, 10- means 10 is the lower limit
while 19 is the upper limit. 14.5 is the mid point for the class interval 10- .
The following rules are used to make frequency distribution for a grouped data.
1. Determine the range, R, of values. (R=largest value -smallest value)
2. Decide on the number, I, of classes. This number depends on the form of data and the requirements of the
frequency distribution. But usually they should be between 5 and 20 for convenience.
3. Determine the width of the class interval, W, such that W=R/I. A constant width for all classes is
preferable.
4. Choose the upper and lower limits of the class intervals carefully to avoid ambiguities.
5. List the intervals in order. Use tallies to allocate each observation into the class in which it falls. Add the
tally marks to obtain class frequencies.
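The five steps above can be sketched in code. The function name and the equal-width, rounded-up choice of W are illustrative assumptions, not part of the manual:

```python
def grouped_frequencies(data, n_classes):
    """Steps 1-5: range, number of classes, width, limits, and tally."""
    lo, hi = min(data), max(data)
    r = hi - lo                          # step 1: the range R
    width = -(-r // n_classes) or 1      # step 3: W = R/I, rounded up
    # step 4: lower limits, starting at the smallest value
    limits = [lo + i * width for i in range(n_classes)]
    freqs = [0] * n_classes
    for x in data:                       # step 5: tally each observation
        i = min((x - lo) // width, n_classes - 1)
        freqs[i] += 1
    return [(limits[i], limits[i] + width, freqs[i]) for i in range(n_classes)]

data = [3, 7, 12, 15, 18, 21, 22, 25, 30, 31, 34, 38]
for lower, upper, f in grouped_frequencies(data, 5):
    print(f"{lower}- (below {upper}): {f}")
```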
Use of diagrams in quantitative data:
A: Histograms:
A histogram is a familiar bar-type diagram. Values of a variable are represented on a horizontal scale and the
vertical scale represents the frequency or relative frequency at each value. Each bar centres at the mid point
of the class. Example, using data on Table 2.3,
Fig 3: Histogram representing the frequency distribution of counts of trypanosomes in the tail blood of a rat.
If the frequency distribution is made of class intervals which are not equal, it is necessary to calculate the
average frequency per standard interval.
Example:
Table 2.5: Frequency distribution of age at loss of last tooth

Age      Frequency    Interval width    Average No/year of age
11-15        1              5                  0.20
16-19        7              4                  1.75
20-24       21              5                  4.20
25-29       35              5                  7.00
30-34       40              5                  8.00
35-44       58             10                  5.80
45-54       28             10                  2.80
55-74       10             20                  0.50
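The adjustment for unequal intervals can be computed directly. The triples below are read from Table 2.5, and the inclusive integer-age convention for interval widths (e.g. "11-15" spans 5 years) is an assumption about how the manual obtained its widths:

```python
# (lower age, upper age, frequency), taken from Table 2.5.
classes = [(11, 15, 1), (16, 19, 7), (20, 24, 21), (25, 29, 35),
           (30, 34, 40), (35, 44, 58), (45, 54, 28), (55, 74, 10)]

for lower, upper, freq in classes:
    width = upper - lower + 1            # inclusive integer-age intervals
    density = freq / width               # average number per year of age
    print(f"{lower}-{upper}: width {width}, avg/year {density:.2f}")
```

The printed densities match the "Average No/year of age" column of the table, and these (not the raw frequencies) are what the adjusted histogram should plot.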
Fig. 4: Histogram of the data in Table 2.5, with the average number per year of age on the vertical axis.
B: Line diagrams:
These are often used to express the change in some quantity over a period of time or to illustrate the
relationship between continuous quantities. Each point on the graph represents a pair of values, i.e. a
value on the x-axis and a corresponding value on the y-axis. The adjacent points are then connected by
straight lines.
Fig. 5 A line diagram showing cumulative number of AIDS cases in Tanzania from 1983 to 1992.
C: Frequency polygons
Frequency polygons are a series of points (located at the mid-point of the interval) connected by straight
lines. The height of these points is equal to the frequency or relative frequency associated with the values of
the variable (or the interval). The end points are joined to the horizontal axis at the mid points of the groups
immediately below and above the lowest and highest non-zero frequencies respectively.
Frequency polygons are not as popular as histograms but are also a visual equivalent of a frequency distribution. They can
easily be superimposed and are therefore superior to histograms for comparing sets of data.
Fig. 6: Frequency polygon for the number of trypanosomes in the tail blood of a rat.
Fig. 7: Cumulative frequency curve for the number of trypanosomes in the tail blood of a rat.
Generally: x̄ = ΣXi / n

where ΣXi = X1 + X2 + X3 + ... + Xn and n is the number of observations.
With grouped data, the class midpoint should be used when calculating the mean. Consider the data in Table
2.4. The mean number of lesions caused by small pox virus in egg membranes is:
[(5x1) + (15x6) + (25x14) + ... + (95x1) + (105x0) + (115x1)] / 80 = 3670/80 = 45.9
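The grouped-mean calculation can be checked with a few lines of code, using the midpoints and frequencies as read from Table 2.4:

```python
# Class midpoints and frequencies from Table 2.4.
midpoints   = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95, 105, 115]
frequencies = [1,  6, 14, 14, 17,  8,  9,  3,  6,  1,   0,   1]

total = sum(m * f for m, f in zip(midpoints, frequencies))  # sum of midpoint x frequency
n = sum(frequencies)                                        # total number of membranes
mean = total / n
print(f"mean number of lesions = {total}/{n} = {mean:.1f}")
```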
The arithmetic mean is a preferred measure since it uses more information from each observation. However it
tends to be pulled by extreme values. Example, the following are duration of stay in hospital (in days) for
some condition.
5 5 5 7 10 20 102
The mean duration of stay is x̄ = 154/7 = 22 days, which does not reflect the typical duration of stay.
2. Median:
The median is the middle observations when all the observations are listed in increasing or decreasing order.
Example, below is a series of duration (in days) of absence from classes due to sickness.
1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6, 6, 7, 8, 10, 10, 38, 80
The median is 5.
Generally, when n (the number of observations) is odd, the median is the ½(n+1)th observation. When n is
even there is no single middle observation, so the median is the mean of the two middle observations, i.e. the
(n/2)th and (n/2 + 1)th observations.
In frequency distributions, the median can be obtained by accumulating the frequencies and noting the value
of the variable which divides the data into two equal halves, i.e. the value below which half (n/2) of the
observations lie.
Note:
1. The median is less efficient than the mean because it takes no account of the magnitude of most of the
observations.
2. If two groups of observations are pooled, the median of the combined group can not be expressed in terms
of the medians of the two component groups.
3. The median is much less amenable than the mean to mathematical treatments and so it is less used in more
elaborate statistical techniques.
However if the data are distributed asymmetrically, the median is more stable than the mean. Consider the
example on the duration of stay in hospital where the median is 7; this is more realistic than the calculated
mean of 22 days.
3. Mode:
The mode is the value with the highest frequency. i.e. The value which occurs most frequently. The modal
value (days) for the duration of stay in hospital, example given above, is 5.
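Python's standard statistics module gives all three measures for the hospital-stay data used above:

```python
from statistics import mean, median, mode

stays = [5, 5, 5, 7, 10, 20, 102]   # duration of stay (days), from the text

print(mean(stays))    # pulled up to 22 by the extreme value 102
print(median(stays))  # 7, more representative for these skewed data
print(mode(stays))    # 5, the most frequently occurring value
```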
Measures of variability:
These measures express the degree of variation or scatter of a series of observations. Common measures of
variation are range, variance and standard deviation.
1. The range:
This is the difference between the maximum value and the minimum value.
Example: if the lowest and highest of a series of diastolic blood pressure are 65 mm Hg and 95 mm Hg.
Then, the range = 95 mm Hg - 65 mm Hg = 30 mm Hg.
The range is seldom used in statistical analysis because:
a) It wastes information, since it uses only the two extreme values.
b) The two extreme values are more likely to be faulty.
c) The range increases with increasing number of observations.
2. Variance and standard deviation:
The variance is a measure of variability which makes use of the difference between each observation and the
mean, i.e. (Xi - x̄). If all the differences were added together and their mean calculated, this might seem to
indicate the overall variability of the observations. But Σ(Xi - x̄) is always zero, since some differences are
positive while others are negative. Because of this, the differences are squared.

The variance is the mean value of the squared deviations from the mean, i.e.

variance = Σ(Xi - x̄)² / n

and the numerator, Σ(Xi - x̄)², is called the sum of squares about the mean.
Since these differences are squared, the variance is measured in the square of the units in which the variable
X is measured. For example, if X is height in cm. The variance will be in cm2.
A measure of variation which is measured in the original units of the variable is the standard deviation
which is the square root of the variance.
Standard deviation = √[Σ(Xi - x̄)² / n]

The standard deviation shows the average deviation of observations from the mean, and the interval
x̄ ± 2SD covers roughly 95% of all the observations.
The population variance is in most cases unknown because data are normally not available for the whole
population. When this is the case, the population variance is estimated by the sample variance, S2.
S² = Σ(Xi - x̄)² / (n - 1)
Note a change in the denominator from n to n-1. When n-1 is used in the denominator, it gives a better
estimate of the population variance than when n is used.
Calculation of variance and standard deviation:
To calculate the variance and standard deviation for the following data:

Xi      Xi - x̄     (Xi - x̄)²
 8         0            0
 5        -3            9
 4        -4           16
12        +4           16
15        +7           49
 5        -3            9
 7        -1            1
56                    100

ΣXi = 56, n = 7, x̄ = 56/7 = 8
Σ(Xi - x̄)² = 100

S² = 100/6 = 16.67
S = √16.67 = 4.08
Variance and standard deviation can be calculated using the shortcut formula for Σ(Xi - x̄)² (don't forget to
divide by n-1 afterwards):

Σ(Xi - x̄)² = ΣXi² - (ΣXi)²/n

So using the same data above:

Xi       Xi²
 8        64
 5        25
 4        16
12       144
15       225
 5        25
 7        49
ΣXi = 56    ΣXi² = 548

Σ(Xi - x̄)² = 548 - (56)²/7 = 548 - 3136/7 = 548 - 448 = 100

S² = 100/6 = 16.67
S = √16.67 = 4.08
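Both the definitional and the shortcut formulas can be verified on the same seven observations:

```python
from math import sqrt

x = [8, 5, 4, 12, 15, 5, 7]
n = len(x)
xbar = sum(x) / n                                    # 56/7 = 8

# Definitional form: sum of squared deviations from the mean.
ss_dev = sum((xi - xbar) ** 2 for xi in x)

# Shortcut form: sum(Xi^2) - (sum Xi)^2 / n.
ss_short = sum(xi * xi for xi in x) - sum(x) ** 2 / n

s2 = ss_dev / (n - 1)        # sample variance (note the n-1 denominator)
s = sqrt(s2)                 # sample standard deviation
print(round(s2, 2), round(s, 2))
```

Both sums of squares come out to 100, giving S² = 16.67 and S = 4.08 as in the worked example.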
We defined the mode as the value of the variable which occurs most frequently. In other words it is the value
at which the frequency curve reaches a peak. When the frequency distribution has one peak (one mode), it is
called a unimodal distribution.
Table 2.6:

NUMBER OF MALES    FREQUENCY (NUMBER OF SIBSHIPS)
0                        161
1                      1,152
2                      3,951
3                      7,603
4                     10,262
5                      8,498
6                      4,948
7                      1,655
8                        264
Total                 38,495
The mode of this distribution is 4 and the distribution is Unimodal as seen in Fig. 8.
In some of the unimodal distributions, the frequency curve is "BELL SHAPED". i.e. the mode is somewhere
between the two extremes of the distribution. Such distributions are said to be symmetric.
In symmetric distributions the mean, mode and median coincide. Other Unimodal distributions are
asymmetric. Asymmetric distributions have the mode (peak) not at the centre of the distribution curve. An
asymmetric distribution is called skew distribution.
The distribution is positively skew if the upper tail is longer than the lower tail and is negatively skew if the
lower tail is longer than the upper tail.
Some distributions have more than one mode. If a distribution has two modes, it is called a bimodal
distribution. If the distribution is symmetric but bimodal, the mean and the median are approximately the
same, but this common value can lie somewhere between the two peaks.
26
12
10
NO.OF MALES
Fig.8:
Normally a bimodal distribution indicates that within the population under study there are two distinct groups
which differ in the variable being measured.
Examples of variables that follow a Bimodal distribution are:
i. Body temperature of malaria patients
ii. Distribution of values of dilution levels of phenylthiourea solution to determine tasters and non-tasters.
The table below shows data for 104 medical students who determined their taste threshold to phenylthiourea
(PTC):
Table 2.7:

Solution number    Concentration (mg/l)    Number of students
1                        1.27                    11
2                        2.54                    16
3                        5.08                    23
4                       10.20                    12
5                       20.30                     3
6                       40.60                     0
7                       81.20                     3
8                      182.00                     5
9                      325.00                     8
10                     650.00                    10
11                    1300.00                     8
12                   >1300.00                     5
Fig. 9: Frequency distribution of taste thresholds to PTC (data of Table 2.7), showing the bimodal pattern.
EXERCISE
1.
The following table shows the numbers of viral infected patients not in hospital and in hospital
subdivided by sex and age.
                 NOT IN HOSPITAL        IN HOSPITAL
Age (years)      Males     Females      Males     Females
0 - 14             43        42           25          9
15 - 29            59        49           55         27
30+                65        28           39         14
Obtain a two way summary table to show how the proportion (in percent) of patients who are in hospital
varies with : i. Age ii. Sex
2.
The following table shows the numbers of accidental deaths by place of death in selected years.

                                 YEAR
PLACE OF DEATH     1971     1976     1980     1982     1983
Transport          8401     7306     6945     6407     6138
Work                860      712      630      457      443
Home               6917     6250     6009     5468     5514
Other              3068     2831     2516     2781     2459
Construct:
a. A bar chart showing accidental deaths by place for each year shown.
b. A pie chart showing accidental deaths by place for 1983.
3.
A sample of 11 patients admitted for diagnosis and evaluation to a newly opened psychiatric ward of
a general hospital experienced the following lengths of stay.
PATIENT NUMBER     1    2    3    4    5    6    7    8    9   10   11
LENGTH OF STAY    29   14   11   24   14   14   14   28   14   18   22

Find:
a. The mean length of stay for these patients.
b. The variance.
c. The mode.
4.
The following are the fasting blood glucose levels of 100 children.
56 57 62 63 64 60 61 67 69 68
65 65 68 65 75 66 69 72 75 81
69 66 65 65 65 68 72 73 73 81
65 65 66 68 66 72 73 75 66 73
73 68 69 67 67 67 61 67 77 65
75 62 55 63 60 57 57 62 67 59
72 61 73 63 80 61 76 57 68 64
64 65 76 58 56 71 58 55 79 71
60 80 80 55 65 73 75 74 68 63
74 59 55 65 59 56 52 75 63 74

5.
The following are the number of babies born during a year in 60 community hospitals.
30 37 32 39 52 55 55 26 56 57 45 43 28 58 46 27 52 40 59 43
56 54 53 49 54 48 42 54 53 31 45 32 29 30 22 49 59 42 53 31
32 35 42 21 24 57 46 54 34 24 47 24 53 28 57 56 57 59 50 29
From these data find:
(a) The mean, (b) The median, (c) The variance, (d) The standard deviation.
6.
The following are the haemoglobin values (g/100ml) of 10 children receiving treatment for
haemolytic anaemia.
9.1 10.1 11.4 12.4 9.8 8.3 9.9 9.1 7.5 6.7
Compute:
The sample mean, median, variance and the standard deviation.
Chapter 3
PROBABILITY
INTRODUCTION
The theory of probability underlies the methods for drawing statistical inferences in medicine. The
knowledge of probability will therefore help you to set the groundwork for the development of statistical
inference.
Definition:
Probability of an event is defined to be the proportion of times the event occurs in a long series of random
trials.
Examples:
1. If an unbiased coin is tossed many times, roughly 50% of the results will be heads. Thus when it is tossed once,
the probability of a Head (H) or a Tail (T) is 1/2.
2. Suppose that in a certain country 10% of the population are HIV positive. If a person is selected from
this population at random, it can be said that the probability that he/she is HIV positive is 1/10, since this
event occurs on average for one person in 10.
3. A die has six sides numbered 1, 2, 3, 4, 5, 6. If an unbiased die is tossed once, the probability of any of the
sides showing is 1/6 i.e. P(1) = 1/6, P(2)= 1/6, P(3)= 1/6, P(4)= 1/6, P(5)= 1/6, P(6)= 1/6.
Note:
(1) Probabilities are proportions and so they take values between 0 and 1.
(2) A probability of 0 means that the event never occurs whereas the probability of 1 means the event
certainly occurs.
(3) The sum of probabilities of all possible outcomes is 1.
Example: in tossing a coin P(H) + P(T) = 1 and in tossing a die,
P(1) + P(2) + P(3) + P(4) + P(5) + P(6) = 1.
The Addition Law:
If A and B are possible events with known probabilities of occurrence, then

P(A or B or both) = P(A) + P(B) - P(A and B),

where P(A and B) is the probability of the double event for non-mutually-exclusive events.
Consider a doctor's name being chosen haphazardly from the Tanzania medical register. If the probability that
this doctor is a male is 0.9 and the probability that the doctor qualified at Muhimbili Medical School is about
0.8,
what is the probability that the doctor is either a male or the doctor qualified at Muhimbili medical
school?
Let A be the event that the doctor is a male and B be the event that the doctor qualified at Muhimbili medical
school.
P(A or B) = P(A) + P(B) - P(A and B)
=0.9 + 0.8 -0.72
=0.98
Note that A and B are not mutually exclusive events because a doctor can be a male and qualified at
Muhimbili Medical School. In this case if the probability of the double event is not subtracted the probability
will exceed 1.
But if the two events are mutually exclusive, the probability of the double event is 0 and so the probability of
either A or B is given by the sum of the probabilities of two events.
That is, P(A or B) =P(A) + P(B)
Example: in a single throw of a die, consider the events "a 3 shows" and "a 5 shows".
These are mutually exclusive because you cannot have a 3 and a 5 at the same time.
So P(3 or 5) = P(3) + P(5).
= 1/6 + 1/6
= 1/3
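Both cases of the addition law, using the figures from the examples above:

```python
def p_a_or_b(p_a, p_b, p_a_and_b=0.0):
    """Addition law; p_a_and_b is 0 for mutually exclusive events."""
    return p_a + p_b - p_a_and_b

# Non-mutually-exclusive: male doctor OR qualified at Muhimbili
# (the double event is taken as 0.72, as in the text).
print(round(p_a_or_b(0.9, 0.8, 0.72), 2))   # 0.98

# Mutually exclusive: a 3 or a 5 on one throw of a die.
print(round(p_a_or_b(1/6, 1/6), 4))         # 1/3, i.e. 0.3333
```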
Multiplication rule:
Suppose there are two random sequences of trials proceeding simultaneously, e.g. at each stage a coin
may be tossed and a die thrown. How can we get the probability of a particular combination of results,
e.g. P(H and 5)? We need to use the multiplication rule.
P(H and 5) = P(H) x P(5, given H) can be written as
P(H and 5) = P(H) x P(5/H).
The second term on the right side, P(5/H) is called Conditional Probability i.e. Probability of a 5 showing on
a die given that a Head appeared on the coin.
Take the example of playing cards. A pack has 52 cards: 13 Spades, 13 Diamonds, 13 Hearts and
13 Clubs. If you draw two cards (one at a time, without replacement) from a pack of cards, what is the probability that both the 1st and
2nd cards will be Spades?
NOTE: P(spade on 1st draw) = 13/52
P(spade on 2nd draw / spade on 1st draw) = 12/51
This is because the first draw has already removed one spade, decreasing both the number of spades and
the pack by 1. So P(spade on 1st and 2nd draw) = 13/52 x 12/51 = 0.0588.
Definition:
Independent events:
Two events are independent if the occurrence of one does not affect in any way the occurrence of the other.
Thus if A and B are independent events, P(B/A) = P(B). When a coin is tossed repeatedly, the outcome of the 1st trial
does not affect the outcome of the 2nd trial.
In independent trials, the multiplication rule assumes a simple form P(A and B) = P(A) P(B).
e.g. P(H and 5) = P(H) x P(5)
= 1/2 x 1/6 = 1/12.
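The multiplication rule for independent events can also be checked by simulation. The sketch below estimates P(H and 5) empirically; the seed and trial count are arbitrary choices:

```python
import random

random.seed(42)
trials = 100_000
hits = 0
for _ in range(trials):
    coin = random.choice("HT")       # fair coin
    die = random.randint(1, 6)       # fair die, independent of the coin
    if coin == "H" and die == 5:
        hits += 1

estimate = hits / trials
print(f"estimated P(H and 5) = {estimate:.4f}, theory = {1/12:.4f}")
```

The relative frequency settles near 1/12 ≈ 0.0833, as the rule predicts.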
EXERCISE
1.
2.
The following table shows 1000 nursing school applicants classified according to scores made on a
college entrance examination and the quality of the high school from which they graduated, as rated
by a group of educators.
SCORE        Low (L)    Medium (M)    High (H)    Total
TOTAL          220         390           390       1000
a)
Calculate the probability that, an applicant picked at random from this group:
i) Made a low score on the examination.
ii) Graduated from a superior high school.
iii) Made a low score on the examination and graduated from a superior high school.
iv) Made a high score or graduated from a superior high school.
b)
(v) P(H/S).
Chapter 4
THE NORMAL DISTRIBUTION
INTRODUCTION
The breakdown of the total probability into the probabilities of each of the possible events is called a probability
distribution. A variable whose different values follow a probability distribution is known as a
random variable.
In a genetics experiment, an example of a probability distribution is obtained when we cross two
heterozygotes with genotypes Aa. The progeny will be homozygotes (aa or AA) or heterozygotes (Aa) with the
probabilities shown below:
No. of A genes in the genotype    Genotype    Probability
0                                    aa           1/4
1                                    Aa           1/2
2                                    AA           1/4
Total                                              1
This probability distribution can be presented graphically as a bar chart, with the number of A genes (genotypes aa, Aa and AA) on the horizontal axis and the probability (from 0 to 0.6) on the vertical axis.

Fig.10: Probability distribution for a random variable: the number of A genes in the genotype of progeny of an Aa x Aa cross.
However, for continuous random variables, the probabilities of particular values of the variable are negligible (indeed zero). So to obtain the probability distribution of a continuous random variable, the concept of probability must be applied to a specified interval on the continuous scale.
For example: while the probability that a man selected at random is exactly 70.2876 inches in height is presumably zero, the probability that his height is between 70 and 72 inches might be 0.12.
Continuous probability distribution.
Different random variables have different probability distributions, but the one which we will discuss here is
the Normal distribution.
The normal distribution:
The Normal (or Gaussian) distribution is the most important continuous probability distribution.
Characteristics of the normal distribution
1. It is bell-shaped and symmetrical about the mean.
2. The mean, median and mode are all equal.
3. It is completely determined by two parameters: the mean (μ) and the standard deviation (σ).
4. The total area under the curve is equal to 1.
5. About 95% of the observations lie within 1.96 standard deviations of the mean, and about 99% within 2.58 standard deviations.
Normally, the probability distribution of the variables we observe are unknown. But if the smooth curve
depicting the probability distribution is bell shaped and reasonably symmetrical about the mean, use can be
made of the normal distribution.
The normal distribution, as we have seen above, is determined by its mean and its standard deviation. These quantities are different for different problems and so it is not possible to make tables of the Normal distribution for all values of μ and σ. So calculations are made by referring to the Standard Normal distribution, which has μ = 0 and σ = 1.
Thus an observation X from the normal distribution with mean μ and standard deviation σ can be related to a standard normal deviate (SND) by calculating:

SND = (X - μ)/σ

Thus, for any normal distribution with mean μ and standard deviation σ, the probability between X1 and X2 is the same as the probability between SND1 and SND2 in the standard Normal distribution, where

SND1 = (X1 - μ)/σ and SND2 = (X2 - μ)/σ
The table showing probabilities of the standard normal distribution is found at the end of this manual.
The first two digits of the SND are shown in the 1st column and the third digit is given by the other column
headings. The figures in the body of the table for particular values of the SND show the area under the
standard normal curve to the right of the SND.
If SND = 0.00 , the area to the right of SND is 0.5.
If SND = 1.14, the area to the right of SND is 0.12714 and the area to the left is given by 1-0.12714 =
0.87286.
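The same right-tail areas can also be computed from the error function rather than read from the printed table; this Python sketch is an illustration only:

```python
import math

def area_right_of(snd):
    """Area under the standard normal curve to the right of snd."""
    return 0.5 * (1.0 - math.erf(snd / math.sqrt(2.0)))

print(round(area_right_of(0.00), 5))  # 0.5
print(round(area_right_of(1.14), 5))  # about 0.12714
```

The identity used is that the standard normal CDF equals (1 + erf(z/√2))/2.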
Examples of applications of the standard normal distribution
1.
A study of blood pressure of Negro schoolboys gave a distribution of systolic blood pressure (SBP) close to Normal with μ = 105.8 mmHg and σ = 13.4 mmHg.
a)
What proportion of boys would be expected to have SBP greater than 120 mmHg?
SND = (120 - 105.8)/13.4 = 1.06; from the table, the area to the right of 1.06 is about 0.145, i.e. 14.5% of the boys.
b)
What proportion of boys would be expected to have SBP less than 120 mmHg?
If 14.5% have SBP greater than 120 mmHg, then 100 - 14.5 = 85.5% will have SBP less than 120 mmHg.
c)
What proportion of boys would be expected to have SBP between 85 and 120 mmHg?
Calculate SND1 = (85 - 105.8)/13.4 = -1.55 and SND2 = (120 - 105.8)/13.4 = 1.06
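The blood-pressure example can be checked numerically with the standard normal cumulative distribution function; this is an illustrative sketch using only Python's math library:

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 105.8, 13.4
snd_120 = (120 - mu) / sigma            # about 1.06
snd_85 = (85 - mu) / sigma              # about -1.55

p_above_120 = 1 - phi(snd_120)          # part a: about 0.145 (14.5%)
p_below_120 = phi(snd_120)              # part b: about 0.855 (85.5%)
p_between = phi(snd_120) - phi(snd_85)  # part c: about 0.79
```

The probability between two values is the difference of the CDF at the two SNDs, exactly as the table-based method prescribes.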
EXERCISE
1.
Suppose the average length of stay in a chronic disease hospital of a certain type of patient is 60 days
with a standard deviation of 15. If it is reasonable to assume an approximately normal distribution of
lengths of stay,
Find the probability that a randomly selected patient from this group will have a length of stay :
a) Greater than 50 days.
b) Less than 30 days.
c) Between 30 and 50.
d) Greater than 90 days.
2.
If the total cholesterol values for a certain population are approximately normally distributed with a
mean of 200 mg/100 ml and a standard deviation of 20mg/100 ml,
Find the probability that an individual picked at random from this population will have a cholesterol
value:
a) Between 180 and 200 mg/100 ml.
b) Greater than 225 mg/100 ml.
c) Less than 150 mg/100 ml
d) Between 190 and 210 mg/100 ml.
Chapter 5
INTRODUCTION TO SAMPLING TECHNIQUES
INTRODUCTION
Often in research work we are dealing with groups which are effectively infinite, such as the underfives in a district. In sampling, part of a group (population) is chosen to provide information which can be generalized to the whole, even though in theory it might be possible to investigate the whole group. Sampling is adopted to reduce labour and hence costs.
Definition:
Sampling is the process of selecting a number of study units from a defined study population. Otherwise, if
the whole population is studied the process is referred to as taking a census. We can illustrate the process of
sampling and the important activities involved with the following diagram:

    (1) Study population (N)
            |
            | (2) draw a sample
            v
        Sample (n)
            |
            | (3) calculate statistics
            v
        Statistics
            |
            | (4) make inferences about the parameters
            v
        Parameters
            |
            | (5) draw conclusions about the study population
            v
    Study population
The diagram depicts drawing a sample of size n using a particular sampling method from a study population
with N units (subjects). Inferential statistics techniques are then used to make inferences about the study
population on the basis of results from the sample.
The steps:
1) Identifying the study population (note: it is possible to have different study populations in one study).
2) Drawing a sample from the study population.
3) Describing the sample (e.g. by calculating relevant statistics).
4) Making inferences about the parameters.
5) Drawing conclusions about the study population.
Random versus biased sampling
Selection of the study units can be purposive or random. When it is purposive, no valid assessment of sampling error can be made, and in many instances this will lead to some bias. We will come back to "bias" in detail later under "other aspects of sampling".
If conclusions that are valid for the whole population are to be drawn on the basis of a sample, then the
sample should be representative of that population. A representative sample is one that has all the important
characteristics of the population from which it is drawn. Selection of the sample on a random basis is a necessary but not always sufficient condition for achieving representativeness.
We shall consider two main aspects of sampling, namely:
i. the sampling methods
ii. sample size.
Moreover, in the discussion, we shall confine ourselves to surveys designed to provide estimates (particularly
the mean and proportion) of certain characteristics of populations as opposed to other study types.
SAMPLING METHODS
The choice of a particular sampling method is influenced by the availability of a list of all the units that
compose the study population. This is called the sampling frame.
Examples could be a list of villages, a list of eligible users of family planning methods, a list of University
students, etc.
Types of sampling:
We can classify sampling methods into two types:
i. non-probability sampling and
ii. probability sampling.
Non-probability sampling:
There are two common methods which fall under this type: (i) convenience sampling and (ii) quota sampling.
i. Convenience sampling - sample is obtained on convenience basis, e.g. the study units that happen
to be available at the time of data collection are selected (many hospital based studies use
convenience samples).
A major limitation of this approach is that the sample drawn may be quite unrepresentative of the
study population.
ii. Quota sampling - a fixed predetermined number of sample units from different categories of the
study population is obtained. Obtaining a sample in this manner ensures that a certain number of
sample units from different categories with specific characteristics (such as sex, religion, age) are
represented in the sample. It is useful when one desires to provide a balance of study units
according to some characteristics of interest. Convenience sampling would not achieve this sort of
balance.
Probability sampling:
In this type of sampling the selection procedure has some element of probability/chance. In particular, a study
unit has some known probability of being selected into the sample. We shall discuss five forms of probability
(also known as random) sampling.
(1)
Simple random sampling:
In simple random sampling every unit in the sampling frame has the same chance of being selected. A common way of drawing such a sample is to use a table of random numbers, as follows:
i. First, determine how many digits you need; that is, see whether the size of the sampling frame is one, two, or more digits long. For example, if your sampling frame consists of 10 units, you must use two digits from the random number table to choose from the numbers 1-10.
If, however, your sampling frame is of three digits in size, then you obviously need to
choose from three digits. For example, the number 43 in columns 10, 11, and row 27, would
become 431. Going down the next numbers would be 107, 365, etc.
You would follow the same reasoning if you needed a four-digit number, for a sampling frame which is four digits in size. In our example of the number 431 in columns 10, 11, 12, row 27, this would now become 4316, the next down being 1075, and so on.
ii. Decide beforehand whether you are going to go across the page to the right, down the page, across the page to the left, or up the page.
iii. Without looking at the table, and using a pencil, pen, or any sharp-ending object, pin-point a
number to establish your starting point.
iv. If this number (in step iii) is among those on your sampling frame, take it. If not, continue to the next number in the direction you decided beforehand in step ii until you find a number that is within the range you need. This process goes on until you have enough units for your sample.
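In practice the table-lookup procedure above is usually delegated to a computer's pseudo-random number generator. A minimal Python sketch, where the frame of 100 numbered units is hypothetical:

```python
import random

random.seed(42)                      # fixed seed so the illustration is reproducible
frame = list(range(1, 101))          # hypothetical sampling frame: units numbered 1-100
sample = random.sample(frame, 10)    # draw n = 10 units without replacement
print(sorted(sample))
```

`random.sample` guarantees distinct units, which mirrors skipping duplicates in the manual table procedure.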
(2)
Systematic sampling:
As the name suggests, this sampling method is such that elements in the sample are obtained in a
systematic way.
In carrying out systematic sampling, the following steps are important:
i) Obtain the sampling frame (and the size of the study population, N say)
ii) Decide on the sample size, n
iii) Calculate the sampling interval, k = N/n
iv) Select the first element at random from the first k units
v) Include every kth unit from the frame into the sample.
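The systematic-sampling steps can be sketched in Python as follows; the frame of 100 units is hypothetical:

```python
import random

def systematic_sample(frame, n):
    """Steps iii-v: compute the interval k, pick a random start within the
    first k units, then take every kth unit from the frame."""
    k = len(frame) // n              # sampling interval k = N/n
    start = random.randrange(k)      # random start within the first k units
    return frame[start::k][:n]

random.seed(1)
frame = list(range(1, 101))          # hypothetical frame, N = 100
sample = systematic_sample(frame, 10)  # interval k = 10
```

Each selected unit is exactly k positions after the previous one, which is the defining property of a systematic sample.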
(3)
Stratified sampling:
In this method the population is divided into subgroups, or strata whereby each stratum is sampled
randomly with a known sample size. Strata may be defined according to some characteristics of
importance in the survey. These could be occupation, religion, age groups or even locality whereby
regions of the country may be taken as strata in a national health survey.
The steps involved in stratified sampling are as follows:
i. Divide the population into subgroups (strata)
ii. Draw a sample (of predetermined size) randomly from each stratum.
An important stratification principle is that the between-strata variability should be as high as possible, or equivalently that each stratum should be as homogeneous as possible (i.e. units within a stratum should be as much alike as possible and units in different strata should be as different as possible).
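The two stratified-sampling steps can be sketched in Python; the strata and sample sizes below are hypothetical:

```python
import random

def stratified_sample(strata, sizes):
    """Draw a simple random sample of the requested size from each stratum."""
    return {name: random.sample(units, sizes[name]) for name, units in strata.items()}

random.seed(7)
strata = {"urban": list(range(1, 61)),    # hypothetical stratum of 60 units
          "rural": list(range(61, 101))}  # hypothetical stratum of 40 units
sample = stratified_sample(strata, {"urban": 6, "rural": 4})
```

Because each stratum is sampled separately, every stratum is guaranteed representation in the final sample.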
(4)
Cluster sampling:
There are situations in which obtaining a complete list of individuals in the study population is
practically not feasible or a complete sampling frame is not available before the investigation starts.
In such cases it would be easy and convenient to talk of a sampling frame in which the sampling units
are a collection (cluster) of study units.
Examples of such clusters would be schools, hospital wards, villages, etc. Since in this case the
sampling unit is a cluster (e.g. a school) the sampling method is known as cluster sampling. The
selection steps will be exactly the same as those for any of the above random sampling methods, but with the cluster as the sampling unit.
Unlike in stratified sampling, an important principle in cluster sampling is that units within a cluster should be as heterogeneous as possible while the between-cluster variability should be as low as possible.
(5)
Multistage sampling:
Multi-stage (originating from the Latin word "multus" meaning "many") sampling is carried out in
many (more than 1) stages, and different sampling techniques can be employed at every stage. In this
method the sampling frame is divided into a population of first-stage sampling units, of which a
first-stage sample is taken. Each first-stage unit selected is subdivided into second-stage sampling
units, which are then sampled. The process continues till it is convenient to stop.
To illustrate multistage sampling consider a health survey of primary school children in Tanzania
mainland. An immediate problem to taking a sample of these children is that it is almost impossible
to construct a complete sampling frame. A multistage sample might be:
(a) to take a sample of regions;
(b) within each selected region take a sample of districts;
(c) within each selected district, take a sample of schools;
(d) within each selected school, take a sample of school children, and carry out the
investigation.
The sampling would thus be accomplished in four stages; notice that the construction of a complete sampling frame for each stage is relatively easy.
Apart from this advantage (of coming up easily with complete sampling frames), multistage sampling
procedure is likely to result in an appreciable saving in cost by concentrating resources at selected
schools instead of a sample made up of children scattered in all parts of the country.
Sometimes, in the final stage of sampling, complete enumeration of the available units is undertaken.
In the above example, once a survey team has reached the level of a school it may cost little extra to
examine all the children in the school; it may indeed be useful to avoid complaints from children not
included in the study within the same school.
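The four-stage school survey can be sketched as nested random selections; the region, district and school names below are purely illustrative:

```python
import random

random.seed(3)
# Hypothetical frame: regions contain districts, districts contain schools.
frame = {
    "Region A": {"District 1": ["School 1", "School 2", "School 3"],
                 "District 2": ["School 4", "School 5"]},
    "Region B": {"District 3": ["School 6", "School 7"],
                 "District 4": ["School 8", "School 9", "School 10"]},
}
selected_schools = []
for region in random.sample(list(frame), 1):                # stage (a): regions
    for district in random.sample(list(frame[region]), 1):  # stage (b): districts
        selected_schools += random.sample(frame[region][district], 1)  # stage (c)
# Stage (d) would then sample (or completely enumerate) children
# within each selected school.
```

Note that a complete frame is only ever needed one level at a time, which is the practical advantage described above.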
Bias in sampling
Bias in sampling refers to the systematic error in sampling procedures that may lead to distortion in
results. Sources of bias in sampling include the following:
i)
Non-response:
This is encountered mainly when subjects refuse to give a reply during interview, or when
they (the subjects) forget to fill in a questionnaire. The non-respondents (particularly those
due to refusal) may differ systematically from those who respond.
ii)
iii)
iv)
v)
vi)
(b)
Ethical considerations
If recommendations from a study are intended for the entire study population (e.g. all relevant individuals in a region), then one is bound ethically to ensure that the sample studied is representative of that population.
Remember that random selection of a sample does not guarantee representativeness.
SAMPLE SIZE
(NOTE: This sub-section can be skipped without loss of continuity until variance and standard deviation have been covered).
In the planning of a study in almost any subject, one of the first and fundamental questions to be considered is the size of the study. The trivial answer to the question 'how big a sample do I need?' would be 'make as large a sample as possible, since in a given study an increase in sample size will increase the precision of the sample results'.
Clearly issues of cost of collection and processing of data come in with a potential limiting effect on the
sample size. We shall discuss the aspect of sample size in the simplest situation whereby a study is designed
to estimate a parameter such as the mean or the proportion and confine ourselves to the statistical problems
involved in the calculations.
(1)
Suppose we want a 95% chance that the sample mean will lie within a distance d of the true mean. Since approximately 95% of sample means lie within 2 standard errors of the mean, we require d = 2σ/√n. Thus d² = 4σ²/n. Hence the required sample size, n, is given by:

n = 4σ²/d²
This formula implies knowledge of the population standard deviation σ, and in almost all surveys this is unknown. It is necessary to replace σ with an estimate. This estimate may be obtained from the results of previous studies on the variable, or alternatively as a direct result from a pilot study.
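The formula n = 4σ²/d² can be wrapped in a small function; the σ value below is a hypothetical pilot-study estimate, not a figure from the text:

```python
import math

def sample_size_for_mean(sigma, d):
    """n = 4*sigma^2 / d^2: sample size giving roughly a 95% chance that
    the sample mean lies within d of the true mean."""
    return math.ceil(4 * sigma ** 2 / d ** 2)

# Hypothetical example: sigma estimated at 510 g from a pilot study,
# tolerated error d = 50 g.
n = sample_size_for_mean(510, 50)   # 417
```

Rounding up with `math.ceil` is conventional, since a fractional subject cannot be sampled and rounding down would fall short of the required precision.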
(2)
When no such formula can be applied, the following considerations are important in coming up with a reasonable sample size:-
(a) The type of study: -in exploratory studies, you usually have relatively small
samples.
(b) The number of variables to be used in the study: -the more variables, the smaller
the sample size, for practical reasons.
(c) The expected variation in the study population with respect to the most important
variables: -the bigger the variation, the larger the sample one needs.
(d) The scale on which the findings and recommendations from the study will be
used: -the larger the scale, the larger the sample.
Finally, we wish to point out that it is not generally true that the bigger the sample size, the better the study becomes! In general, it is much better to increase the accuracy of data collection (e.g. through careful pre-testing of the tools, or improving the quality of interviewers, if any) than to increase the sample size.
EXERCISE:
1.
A study is being planned to determine the mean birthweight of babies born at Muhimbili
Medical Centre. Birthweights are approximately normally distributed and 95% of the
weights are probably between 2000g and 4000g.
Determine the required sample size so that there is a 95% chance that the estimated mean
birthweight does not differ from the true value by more than 50g. (Hint: calculate the
standard deviation of the birthweights, first).
2.
You have been assigned to conduct a study in order to determine the prevalence (i.e.
proportion of people affected with) of bancroftian filaria infection in Dar-es-Salaam region.
A review of literature on the subject reveals that, studies done along the East African coastal
strip some years back, showed the prevalence to be in the order of 30%. What sample size do
you require in order to come up with a reasonable estimate in your study? Give a complete
answer including describing any assumptions or prior decisions that you undertake.
Chapter 6
ESTIMATION
In Chapter 5 we mentioned that we study a sample with the view to learning something about the
population as a whole.
In general, we wish to estimate characteristics of the population such as:
i. the mean value of some measurement;
ii. the proportion of the population with some characteristic.
            Sample (Statistic)    Population (Parameter)
mean        x̄                     μ
variance    s²                    σ²
proportion  p                     π

Thus, the sample mean x̄ estimates the population mean μ, for example.
In general, the sample mean or sample proportion is unlikely to be exactly equal to the mean or
proportion in the population, although the former is intended to estimate the latter. If the two are
exactly equal to one another, it is just by coincidence.
This amounts to saying that almost always our conclusion about a population on the basis of the
sample we have taken will have some error.
We distinguish between two sorts of error:
(i) Sampling errors and
(ii) non-sampling errors
Sampling errors are those which arise due to the fact that we have observed only part of the whole
population, and they get less important as the sample size increases.
For example, an estimate of the mean number of children per household in a certain district based on
two households only (in the district) will certainly be poorer than that based on a sample of say 100
households.
We say there is less sampling error in the latter situation than in the former. If we investigated the
whole population (i.e. all households in the district) the sampling error would be zero because we
would know the population mean exactly.
Non-sampling errors are due mainly to faults in the sampling process which are likely to create room for the potential sources of bias (sometimes also referred to as systematic errors) highlighted in Chapter 5. These errors are potentially serious since the bias they cause may lead to invalid conclusions being drawn. Increasing the size of a sample will not necessarily reduce the non-sampling errors.
For example, subjects may refuse to give a reply during interview or they may forget to fill in a
questionnaire. These non-respondents may differ systematically from those who respond.
Non-sampling errors also occur through equipment faults, observer errors and during data processing
through coding, data entry, etc.
However, in this section we will direct our attention to sampling (also known as random) errors.
1. The larger the sample size, the better the precision in estimating μ (i.e. large samples are more likely to produce close estimates than small samples).
2. If the variability of the observations in the parent (study) population is small, we would expect the error to be small also, and vice-versa. Thus the sampling error depends on the variability of observations in the population.
We mentioned earlier the idea of repeatedly taking a random sample of size n and calculating the sample mean x̄ each time. This would lead to a series of values of x̄, and the natural questions relating to this (new) variable x̄ concern its distribution as well as its mean and variance. It can be shown mathematically that:
i. The distribution of x̄ tends to the Normal distribution as the sample size increases, even if the parent population is not Normal.
ii. The mean of the distribution of x̄ is the same as that of X (i.e. the mean of the sample means is the same as the mean of the parent population).
iii. The variance of x̄ is σ²/n, where σ² is the variance of X. It is easy to see that as the sample size n increases, the variance of x̄ decreases. From an earlier explanation, this observation is expected.
iv. The standard deviation of x̄ is the square-root of its variance, and is often referred to as the standard error of the mean. That is, the standard error of the (sample) mean, usually written as SE(x̄), is given by σ/√n.
Note: In practice, the value of σ² will be unknown. It can be replaced by the sample value, s², and the expression for the standard error SE(x̄) applies accordingly.
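The standard error of the mean can be computed directly from a sample; the data below are hypothetical, for illustration only:

```python
import math
import statistics

data = [10, 12, 9, 14, 11, 13, 10, 12]   # hypothetical sample of n = 8 observations
n = len(data)
s = statistics.stdev(data)                # sample standard deviation s
se_mean = s / math.sqrt(n)                # SE of the mean = s / sqrt(n)
```

Note that `statistics.stdev` uses the n-1 divisor, i.e. it computes the sample estimate s rather than the population value σ.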
The fact that x̄ tends to follow a normal distribution is remarkable, since this implies that the properties of normal distributions apply to the distribution of the sample mean. In particular, we now know that x̄ follows a normal distribution with parameters μ and σ²/n as the mean and variance, respectively.
Hence, it follows, for example, that 95% of the sample means lie within the interval μ ± 1.96 SE(x̄). This implies that there is a 95% chance of getting a sample mean within the interval μ ± 1.96 SE(x̄). Equivalently, we are saying that the probability of having a sample mean in the interval μ ± 1.96 SE(x̄) is 0.95.
Note: The limits of the interval μ ± 1.96 SE(x̄) are μ - 1.96 SE(x̄) and μ + 1.96 SE(x̄). That is, alternatively, we are talking of the interval ranging from μ - 1.96 SE(x̄) to μ + 1.96 SE(x̄).
We can express the above statements mathematically as follows:
Pr{μ - 1.96 SE(x̄) < x̄ < μ + 1.96 SE(x̄)} = 0.95, where Pr{x} means "probability of x"
Re-arranging the left-hand side of the above equation, we obtain the following equivalent equation:
Pr{x̄ - 1.96 SE(x̄) < μ < x̄ + 1.96 SE(x̄)} = 0.95.
In words, this says that the probability that the interval x̄ - 1.96 SE(x̄) to x̄ + 1.96 SE(x̄) includes the population value μ is 0.95.
When the value of x̄ (and that of SE(x̄)) is known, then the interval x̄ - 1.96 SE(x̄) to x̄ + 1.96 SE(x̄), often written also as (x̄ - 1.96 SE(x̄), x̄ + 1.96 SE(x̄)), is called the 95% confidence interval of μ.
The logic of this is that, for known values of x̄ and SE(x̄), the interval (x̄ - 1.96 SE(x̄), x̄ + 1.96 SE(x̄)) is known and fixed. Hence, it no longer makes sense to talk of the interval including μ with 0.95 probability, since the probability is definitely either 1 or 0. That is, the interval either includes μ or does not include μ.
Wider intervals, and therefore higher "confidence", can be set if required. For example, the value 2.58 can be used in place of 1.96 to set 99% confidence intervals. Indeed, an appropriate standardized normal deviate, z, can be used to obtain any desired confidence interval.
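The confidence-interval computation is a one-liner; the x̄ and SE values below are hypothetical:

```python
def confidence_interval(xbar, se, z=1.96):
    """(xbar - z*SE, xbar + z*SE); z = 1.96 for 95%, 2.58 for 99%."""
    return (xbar - z * se, xbar + z * se)

# Hypothetical sample mean 10.0 with standard error 0.8:
lo95, hi95 = confidence_interval(10.0, 0.8)          # (8.432, 11.568)
lo99, hi99 = confidence_interval(10.0, 0.8, z=2.58)  # (7.936, 12.064)
```

Making z a default parameter lets the same function produce 90%, 95% or 99% intervals by supplying the appropriate standardized normal deviate.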
While we have used a property of the normal distribution (notably, the one which states that 95% of the values lie within 1.96 standard deviations about the mean) to define a confidence interval, it is important to distinguish between the 95% spread (or tolerance) interval/limits and the 95% confidence interval/limits. The former is a descriptive measure while the latter is used in estimation problems as a measure of precision. In particular, the limits μ ± 1.96σ include 95% of the values in the population, whereas the limits μ ± 1.96σ/√n include 95% of the sample means.
EXERCISE
1.
The distribution of the duration of stay in a hospital for a certain condition is known to be
skewed to the right. The mean length of stay is 10 days and the standard deviation is 8 days.
It is proposed to study a sample of 100 patients admitted in hospital for that condition.
(a)
(b)
(c)
(d)
2.
What kind of distribution will the duration of stay of the patients in the sample
follow?
Comment on the suitability of the use of the mean duration of stay as a summary
measure of central tendency in this case.
If you took many such samples (i.e. repeatedly) what kind of distribution would the
sample means follow?
What would be the mean and the standard deviation of the distribution of the sample
means in (c) above? Give a complete numerical answer.
In a random sample of 150 University of Dar-es-Salaam students it was found that 38 of them
received or needed to receive treatment for defective vision.
(a)
Estimate the proportion (in percentage) of students at the University who receive or
need to receive treatment for defective vision.
(b)
Estimate 90%, 95% and 99% confidence intervals for the true proportion of
University of Dar-es-Salaam students who receive or need to receive treatment for
defective vision.
Chapter 7
SIGNIFICANCE TESTS: ONE SAMPLE
INTRODUCTION
Chapter 6 dealt with the estimation of population parameters by sample statistics. These sample statistics may further be used to answer questions about the population parameters. In the framework of statistical inference the question is reduced to a hypothesis, and the answer to it is expressed as the result of a test of that hypothesis.
Definition of terms
1.
2.
Null hypothesis, Ho: This term relates to the particular hypothesis under test. In many instances it is formulated for the sole purpose of being rejected or nullified. It is often a hypothesis of 'no difference'.
3.
Alternative hypothesis, H1: This is a statistical hypothesis that disagrees with the
null hypothesis.
The null hypothesis H0 and the alternative hypothesis H1 concern populations but our
conclusions are based on samples taken from these populations. Generalization from
sample to population is dangerous since sampling errors are involved. Therefore we
are unable to say that H0 or H1 is definitely true, because of this sampling effect.
If sampling errors are taken into account, we can investigate how likely each of these hypotheses is. We have to measure the relevant information in the sampled data and weigh this information against the sampling errors involved.
4.
A statistic: is a value which depends on the outcomes on a variable for the sampled elements.
5.
A test statistic: is a statistic which represents the relevant sample information for the
question under investigation. It provides a basis for testing a statistical hypothesis and has a
known sampling distribution with tabulated percentage points (e.g. standard normal, 2, t
etc). The value of a test statistic differs from sample to sample.
6.
Significance level: This is a small probability, chosen in advance (commonly 5% or 1%), of rejecting H0 when it is in fact true.
7.
Critical value: This is the value of the test statistic corresponding to a given significance
level as determined from the sampling distribution of the test statistic (by using statistical
tables which will be explained later). The critical value is the boundary value such that if the
value of the test statistic is more extreme (i.e. more unlikely) than the critical value, then H0
is rejected and the probability of rejecting H0 when it is true is less than the significance
level.
CONCEPT OF P-VALUES
The p-value is a probability associated with the observed test statistic value.
The p-value of an observed test statistic value is the probability of obtaining a test statistic value as extreme as, or more extreme than, the observed one, if H0 is true. For example, in a clinical trial this statement refers to the observed difference between the treatment groups. We are therefore relating our data to the likely variation in a sample due to chance when the null hypothesis is true in the population.
Interpretation of p-value
Large p-values point towards the null hypothesis; small p-values are evidence for the alternative hypothesis.
A proposed guideline is:
p > 0.05            No evidence against Ho
0.01 < p < 0.05     Evidence in favour of H1, but be careful
0.001 < p < 0.01    Substantial evidence in favour of H1
p < 0.001           Very strong evidence in favour of H1; the possibility that Ho is true can be neglected.
However, for a proper interpretation of the p-value the sample size should be considered. If the sample size is too small the sampling error will be large. This will prevent us from finding evidence against Ho and result in high p-values, even if Ho is not true.
Relationship between p-values and sample size
Sample size is important in the interpretation of p-values.
p-value    Small sample size                        Large sample size
Small      - evidence against Ho                    - evidence against Ho
           - results point away from Ho             - results support H1
Large      - difficult to interpret                 - no evidence against Ho
           - can't distinguish between Ho and H1    - results point at Ho
The following results relating to malnutrition among underfives in Dodoma and Mwanza using
different sample sizes confirm the above explanation.
n        Dodoma    Mwanza    P         Conclusion
50       40%       30%       0.29      No significant difference
500      40%       30%       0.0098    Highly significant
50000    31%       30%       0.0012    Highly significant
SND = 8.6/4.33 = 2.0
This value just exceeds the 5% critical value of 1.96, and the difference is therefore significant, i.e. p < 0.05.
Thus we conclude that it is likely that there is an increase in the mean survival time among patients who were treated by the new technique.
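Using the values quoted in this example (sample mean 46.9 months, null-hypothesis mean 38.3 months, SE 4.33 months), the test reduces to a few lines of Python; this is an illustrative sketch:

```python
xbar, mu0, se = 46.9, 38.3, 4.33        # sample mean, null mean, SE (months)
snd = (xbar - mu0) / se                 # = 8.6 / 4.33, about 1.99
significant_at_5pct = abs(snd) > 1.96   # True, i.e. p < 0.05
```

The absolute value is compared with 1.96 because a two-sided 5% test rejects in either tail.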
Clearly these two approaches are related. If, for example, the 95% confidence interval includes the value of the parameter proposed by the null hypothesis, then the result of the test must be non-significant at the 5% level (i.e. p>0.05).
If, on the other hand, the 95% confidence interval does not include the value of the parameter
specified in the null hypothesis, then the result of the test must be significant at the 5% level (i.e.
p<0.05).
For example, in the test of a sample mean (the example on mean survival time of patients after being treated by a new technique), x̄ = 46.9 months and SE(x̄) = 4.33 months.
Thus a 95% confidence interval for the true mean survival time under this new technique is
x̄ ± 1.96 x SE(x̄)
= 46.9 ± 1.96 x 4.33
= 46.9 ± 8.49
= 38.4 to 55.4
The value proposed in the null hypothesis is 38.3 months and we note that it is not included in the
confidence interval. It would thus be concluded that 38.3 is an unlikely value for the mean survival
time of cancer patients after treatment. Equivalently, we are saying, the null hypothesis is rejected at
5% level (i.e. p<0.05).
The t-test
As already shown above, the standard normal deviate test involves the calculation of
SND = (x̄ - μ)/SE(x̄) = (x̄ - μ)/(σ/√n)
The SND is then compared with the critical values 1.96 or 2.58. This was applicable because the population standard deviation, σ, was known. If, as is usually the case, σ is unknown, the SND cannot be calculated. However, σ can be estimated from the sample by the standard deviation s. Replacing σ in the above formula by s, we obtain a new quantity t, given by
t = (x̄ - μ)/(s/√n)
t follows the t-distribution on n-1 degrees of freedom.
As the sample size increases, s will be nearly equal to σ and t will be very close to the standard normal deviate.
At the end of this manual, we find a table which shows the critical values of t, for each number of
degrees of freedom.
Example:
The following data are uterine weights (in mg) of each of 20 rats drawn at random from a large stock.
Is it likely that the mean weight for the whole stock could be 24 mg, a value observed in some
previous work?
9   15   18   19   21   22   26   29
14   15   18   19   22   24   27   30
16   24   20   32
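A sketch of the t calculation for these data in Python; the 5% critical value of t on 19 degrees of freedom (about 2.09) would be read from the table at the end of the manual:

```python
import math
import statistics

weights = [9, 15, 18, 19, 21, 22, 26, 29,
           14, 15, 18, 19, 22, 24, 27, 30,
           16, 24, 20, 32]                 # uterine weights (mg) of the 20 rats
mu0 = 24                                   # hypothesised mean for the whole stock
n = len(weights)
xbar = statistics.mean(weights)            # 21.0
s = statistics.stdev(weights)              # sample standard deviation
t = (xbar - mu0) / (s / math.sqrt(n))      # about -2.27 on n-1 = 19 df
```

Since |t| exceeds the tabulated 5% critical value, a stock mean of 24 mg appears unlikely for these data.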
EXERCISE
1. The mean level of prothrombin in the normal population is known to be 20.0 mg/100 ml of plasma and the standard deviation is 4 mg/100 ml. A sample of 40 patients showing vitamin K deficiency has a mean prothrombin level of 18.5 mg/100 ml.
(a). How reasonable is it to conclude that the true mean for patients with vitamin K deficiency is the same as that for a normal population?
(b). Within what limits would the mean prothrombin level be expected to lie for all patients with vitamin K deficiency? (Give the 95% confidence limits).
Chapter 8
SIGNIFICANCE TESTS: TWO SAMPLES
COMPARISON OF TWO MEANS
We shall distinguish between two situations: the paired case, in which the two samples are of equal size and the individual members of one sample are paired with particular members of the other sample; and the unpaired case, in which the samples are quite independent.
Matched/paired observations
So far, the problem arising from the comparison of a single sample mean with some value proposed under the null hypothesis has been considered. We had only one sample, which was compared with a fixed value μ, that is, a value which has no sampling error.
A common problem which normally arises in medical trials is the comparison of the responses to 2 or
more treatments. It is sometimes possible to reduce this problem of comparing 2 sets of responses to
treatments to a single sample problem previously described.
Suppose we have 10 patients as experimental units and they each have responses to 2 treatments, i.e
we are using the same patient as his own control assuming the order of administration has no effect.
Example: The following are anxiety scores recorded for 10 patients receiving a new drug and a placebo in random order.

Patient    Drug    Placebo    Difference d (drug - placebo)
1          19      22         -3
2          11      18         -7
3          14      17         -3
4          17      19         -2
5          23      22          1
6          11      12         -1
7          15      14          1
8          19      11          8
9          11      19         -8
10          8       7          1
Total                        -13
Null hypothesis: The mean difference in the anxiety scores in the population from which this sample
was taken is zero. (i.e the mean difference observed in the sample is merely due to sampling error).
n = 10,  Σd = -13,  d̄ = -1.3,  s = 4.548
Estimated standard error of d̄ = s/√n = 4.548/√10 = 1.438
Calculate t = (d̄ - 0)/SE(d̄) = -1.3/1.438 = -0.90 on n - 1 = 9 degrees of freedom.
Since |-0.90| is well below the 5% critical value of t on 9 df (2.26), the difference is not significant (p > 0.05): these data give no evidence that the drug changes anxiety scores.
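The paired calculation above can be verified with a short Python sketch:

```python
import math

drug    = [19, 11, 14, 17, 23, 11, 15, 19, 11, 8]
placebo = [22, 18, 17, 19, 22, 12, 14, 11, 19, 7]

d = [a - b for a, b in zip(drug, placebo)]  # paired differences
n = len(d)
d_bar = sum(d) / n                          # mean difference
s = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))
se = s / math.sqrt(n)                       # SE of the mean difference
t = (d_bar - 0) / se                        # t on n - 1 = 9 degrees of freedom
print(round(d_bar, 1), round(se, 3), round(t, 2))
```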
Independent (unpaired) observations
When the two samples are independent, the null hypothesis that the two population means are equal is tested (for large samples) by the standard normal deviate

SND = (x̄1 - x̄2) / SE(x̄1 - x̄2)

where

SE(x̄1 - x̄2) = √(σ1²/n1 + σ2²/n2) = √(SE²(x̄1) + SE²(x̄2))
Example:
In a study of the age of menarche in women in the USA the following distributions were observed for
samples of women aged 21-30 and 31-40 years.
Age of menarche            Women aged 21-30    Women aged 31-40
10                          3                   -
11                         11                   2
12                         28                   8
13                         23                  14
14                         12                  27
15                          1                   5
16                          -                   8
17                          -                   1
18                          -                   1

Total, n                   78                  66
Σx                        969                 916
x̄                       12.42               13.88
Σx²                     12127               12838
(Σx)²/n                 12038               12713
Σ(x - x̄)² = Σx² - (Σx)²/n  89                 125
s²                      1.156               1.923
s                       1.075               1.387
SE(x̄)                  0.122               0.171
SE(x̄1 - x̄2) = √(0.171² + 0.122²) = √0.04413 = 0.2101
SND = (13.88 - 12.42)/0.2101 = 1.46/0.2101 = 6.95, p < 0.001.
There is very strong evidence that, on average, the younger women's age of menarche is less than the older women's age.
The 95% confidence interval for the true difference in mean age of menarche is
(x̄1 - x̄2) ± 1.96 SE(x̄1 - x̄2)
i.e. 1.46 ± 1.96 × 0.2101
i.e. 1.05 to 1.87 years
The t-test for the comparison of two independent sample means
When the samples are small and the population standard deviations are unknown (but assumed equal), our two sample variances s1² and s2² are two separate estimates of σ². So they are combined to give a single best estimate of σ², namely s², with degrees of freedom equal to (n1 - 1) + (n2 - 1), or n1 + n2 - 2. The separate estimates are

s1² = Σ(x1 - x̄1)²/(n1 - 1)  and  s2² = Σ(x2 - x̄2)²/(n2 - 1)

and the pooled estimate is

s² = [Σ(x1 - x̄1)² + Σ(x2 - x̄2)²] / (n1 + n2 - 2)

Therefore

SE(x̄1 - x̄2) = s √(1/n1 + 1/n2)

t = (x̄1 - x̄2) / SE(x̄1 - x̄2),  d.f. = n1 + n2 - 2
Example:
The following data show the abrasiveness of two brush-on denture cleaners A and B, measured by
weight loss in mg.
A: 10.2, 11.0, 9.6, 9.8, 9.9, 10.5, 11.2, 9.5, 10.1, 11.8
B: 9.6, 8.5, 9.0, 9.8, 10.7, 9.0, 9.5, 9.9
                 A           B
Σx               103.6       76.0
n                10          8
x̄               10.36       9.50
Σx²              1078.44     725.20
(Σx)²/n          1073.296    722.0
Σ(x - x̄)²       5.144       3.20
61
s² = (5.144 + 3.20) / ((10 - 1) + (8 - 1)) = 0.5215
s = √0.5215 = 0.7221
The standard error of the difference between the means of the two groups, A and B, is estimated by
SE(x̄A - x̄B) = 0.7221 × √(1/10 + 1/8) = 0.3425
t = (10.36 - 9.50)/0.3425 = 2.51 on 16 degrees of freedom, p < 0.05
The 95% confidence interval for the true mean difference is
x̄A - x̄B ± 2.12 × SE(x̄A - x̄B)
i.e. 0.86 ± 2.12 × 0.3425 = 0.13 to 1.59 mg.
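The pooled two-sample calculation for the denture data can be reproduced as a Python sketch:

```python
import math

a = [10.2, 11.0, 9.6, 9.8, 9.9, 10.5, 11.2, 9.5, 10.1, 11.8]  # cleaner A
b = [9.6, 8.5, 9.0, 9.8, 10.7, 9.0, 9.5, 9.9]                 # cleaner B

def ss(x):
    """Sum of squared deviations about the sample mean."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x)

n1, n2 = len(a), len(b)
s2 = (ss(a) + ss(b)) / (n1 + n2 - 2)       # pooled variance
s = math.sqrt(s2)
se_diff = s * math.sqrt(1 / n1 + 1 / n2)   # SE of difference in means
t = (sum(a) / n1 - sum(b) / n2) / se_diff  # t on n1 + n2 - 2 = 16 d.f.
print(round(s, 4), round(se_diff, 4), round(t, 2))
```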
COMPARISON OF TWO PROPORTIONS
Suppose r1 out of n1 subjects in the first sample and r2 out of n2 in the second show the characteristic of interest, giving sample proportions p1 and p2. Under the null hypothesis that the population proportions are equal, the common proportion is estimated by pooling the two samples:

p = (r1 + r2)/(n1 + n2)

The standard error of p1 - p2 is

SE(p1 - p2) = √[ p(1 - p)(1/n1 + 1/n2) ]

The null hypothesis is thus tested approximately by using the standard normal deviate

SND = (p1 - p2)/SE(p1 - p2)
Example:
A clinical trial was undertaken to assess the value of a new method of treatment A, in comparison
with the old treatment B. The patients were divided into two groups randomly.
Of 257 patients treated with treatment A 41 died.
Of 244 patients treated with treatment B 64 died.
The two proportions of patients dying are:
p1 = 41/257= 0.1595 and p2 = 64/244= 0.2623
Null hypothesis: The two treatments are equally effective, i.e. the population proportions π1 and π2 are equal.
If the null hypothesis is true, then the two equal population proportions can be written simply as π, i.e. π1 = π2 = π.
We replace π by the best single estimate available. This estimate is the proportion p obtained by pooling the two samples.
This gives
p = (41 + 64)/(257 + 244) = 105/501 = 0.21
Therefore SE(p1 - p2) = √[0.21 × 0.79 × (1/257 + 1/244)] = √0.001327 = 0.0364
Standard normal deviate, SND = [(p1 - p2) - (π1 - π2)]/SE(p1 - p2) = (0.1595 - 0.2623 - 0)/0.0364
SND = -2.82, p < 0.01
The result is highly significant and suggests that treatment A (with a smaller proportion of patients
dying) is better than treatment B.
95% confidence limits for the true difference in the proportions dying are
p1 - p2 ± 1.96 × SE(p1 - p2)
= -0.1028 ± 1.96 × 0.0364
i.e. -0.174 to -0.031
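The whole comparison of the two proportions can be verified in a few lines of Python (the confidence interval here uses the same pooled standard error as the text):

```python
import math

r1, n1 = 41, 257   # deaths on treatment A
r2, n2 = 64, 244   # deaths on treatment B

p1, p2 = r1 / n1, r2 / n2
p = (r1 + r2) / (n1 + n2)                        # pooled proportion = 105/501
se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # SE of p1 - p2 under H0
snd = (p1 - p2) / se                             # standard normal deviate

# 95% confidence limits for the true difference
ci = (p1 - p2 - 1.96 * se, p1 - p2 + 1.96 * se)
print(round(snd, 2), round(ci[0], 3), round(ci[1], 3))
```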
EXERCISE
1. A clinical trial to test the effectiveness of a sleeping drug was conducted among 11 patients. They were observed during one night with the drug and one night with a placebo. One patient died before the placebo reading was taken.
The following are the results of testing the effectiveness of the drug:

Patient number    Hours of sleep
                  Drug      Placebo
1                 6.1       5.2
2                 7.0       7.9
3                 8.2       3.9
4                 7.6       4.7
5                 6.5       5.3
6                 8.4       5.4
7                 6.8       (died)
8                 6.9       4.2
9                 6.7       6.1
10                7.4       3.8
11                5.8       6.3
Σx                77.4      52.8
Σx²               551.16    292.88

(a). Establish whether there is or there is no real difference in sleeping time between the drug and the placebo.
(b). Determine the 95% confidence interval for the difference in the mean sleeping time.

2. Comparison of birth weights of children born to 15 non-smokers with those born to 14 heavy smokers gave the following results:

                      Non-smokers    Heavy smokers
Mean                  3.5933         3.2029
Standard deviation    0.3707         0.4927

Is there enough evidence that on average children born to non-smokers are heavier than children born to heavy smokers? Confirm your results by a 95% confidence interval of the difference in birth weights.
3. In a study of the cariostatic properties of dentifrices, 423 children were issued with dentifrice A and 408 were issued with dentifrice D. After 3 years, 163 of the children on A and 119 of the children on D had withdrawn from the trial. The authors suggest that the main reason for withdrawal from the trial was that the children disliked the taste of the dentifrices. Do these data indicate that one of the dentifrices is disliked more than the other?
Chapter 9
THE CHI-SQUARED (χ²) TESTS
INTRODUCTION
The χ² (the Greek letter chi, pronounced kye, squared) test is used to determine whether a set of frequencies follows a particular distribution (e.g. Binomial, Normal, Poisson, etc). In its basic form it tests whether the observed frequencies of individuals with some characteristic are significantly different from those expected under some hypothesis.
Consider again the data from the clinical trial of treatments A and B in Chapter 8:

               Outcome
               Died    Survived    Total
Treatment A     41       216        257
Treatment B     64       180        244
Total          105       396        501
Such a table is called a 2×2 contingency table, since there are 2 rows and 2 columns. (In general we can have an r×c contingency table, i.e. a table with r rows and c columns.)
From the above table, the observed frequencies are 41, 216, 64 and 180. We need to obtain the expected frequencies under the null hypothesis that "the two treatments have the same effect on the outcome".
The expected frequencies are calculated in the following way:
Expected frequency, E = (row total × column total)/grand total
For example, in the top left cell, where we observe 41 deaths, the expected frequency under the null hypothesis is
(105 × 257)/501 = 53.86
These expected frequencies are shown in the table below. They add up to the same grand total as the observed frequencies.
We can then compare the observed and the expected frequencies by looking at their differences. We also need to consider the relative magnitude of the differences (e.g. a difference of 5 between 995 and 1000 is not as important as a "discrepancy" of size 5 between 2 and 7).
Cell            O      E         O - E     (O - E)²/E
A, Died          41    53.86     -12.86    3.07
A, Survived     216   203.14      12.86    0.81
B, Died          64    51.14      12.86    3.24
B, Survived     180   192.86     -12.86    0.86
Total           501   501.00       0.00    7.98
The chi-squared value is obtained by calculating (observed - expected)²/expected for each of the four cells in the contingency table and then summing them.
The general formula for χ² is

χ² = Σ (O - E)²/E

The percentage points of the chi-squared distribution are given at the back of this manual. The values depend on the degrees of freedom.
If a contingency table has r rows and c columns, then the degrees of freedom are given by df = (r - 1)(c - 1). From our example the degrees of freedom are df = (2 - 1)(2 - 1) = 1.
Therefore, from the above table, χ² = 7.98 on 1 df.
The Chi-Squared Table at the end of this manual shows that the observed value of 7.98 is beyond the 0.01 point of the chi-squared distribution. Therefore p < 0.01. We conclude as before that the difference between the two treatments is highly significant.
Note that the previous analysis yielded Z = -2.82. It can be shown that for d.f. = 1, Z² = χ², i.e. 2.82² ≈ 7.98. A short-cut formula for computing χ² for a 2×2 table is given as follows.
                       Variable x
Variable y          x1            x2            Row total
y1                  a             b             r1 = a + b
y2                  c             d             r2 = c + d
Column total        s1 = a + c    s2 = b + d    n

χ² = (ad - bc)² n / (r1 r2 s1 s2)
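As a check, the general (O - E)²/E route and the short-cut formula give the same value for the trial table (a Python sketch):

```python
# Observed 2x2 table from the clinical trial: rows = treatments, cols = outcome
a, b = 41, 216    # treatment A: died, survived
c, d = 64, 180    # treatment B: died, survived
n = a + b + c + d

# General method: sum of (O - E)^2 / E over the four cells
r1, r2 = a + b, c + d          # row totals
s1, s2 = a + c, b + d          # column totals
observed = [a, b, c, d]
expected = [r1 * s1 / n, r1 * s2 / n, r2 * s1 / n, r2 * s2 / n]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Short-cut formula for a 2x2 table
chi2_short = (a * d - b * c) ** 2 * n / (r1 * r2 * s1 * s2)
print(round(chi2, 2), round(chi2_short, 2))
```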
Example: The following table classifies 521 schoolchildren according to oral hygiene status and the type of school attended.

                            Oral hygiene
Type of school      Good    Fair+    Fair-    Bad    Total
Below average        62     103       57      11     233
Average              50      36       26       7     119
Above average        80      69       18       2     169
Total               192     208      101      20     521
Null hypothesis (Ho): There is no association between oral hygiene classification and type of school
attended. i.e the proportions of children attending below average, average and
above average schools are the same in children with good, fair+, fair- or bad
oral hygiene.
The expected number of children attending below average schools in a sample of 192 children with good oral hygiene is
(233 × 192)/521 = 85.9
Similarly, the expected number of children attending below average schools out of 208 children with fair+ oral hygiene is
(233 × 208)/521 = 93.0
Thus the expected frequencies are given in the table below:

                            Oral hygiene
Type of school      Good    Fair+    Fair-    Bad    Total
Below average       85.9    93.0     45.2     8.9    233.0
Average             43.9    47.5     23.1     4.6    119.1
Above average       62.3    67.5     32.8     6.5    169.1
Total              192.1   208.1    101.1    20.0    521.2

Summing (O - E)²/E over the twelve cells gives χ² = 31.4. The degrees of freedom are (3 - 1)(4 - 1) = 6, and the Chi-Squared Table shows that p < 0.001: there is strong evidence of an association between oral hygiene and type of school attended. To see the direction of the association, consider the proportion of children with good oral hygiene in each type of school:
Below average:  62/233 = 0.27
Average:        50/119 = 0.42
Above average:  80/169 = 0.47
From the above, we note that children attending above average schools were more likely to have good oral hygiene than those attending below average schools.
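For a general r×c table the same observed-versus-expected comparison applies cell by cell; a minimal Python sketch for the oral hygiene table above is:

```python
# Observed counts: rows = type of school,
# columns = oral hygiene (Good, Fair+, Fair-, Bad)
observed = [
    [62, 103, 57, 11],   # below average
    [50,  36, 26,  7],   # average
    [80,  69, 18,  2],   # above average
]

row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
grand = sum(row_tot)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_tot[i] * col_tot[j] / grand   # expected = row x col / grand
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi2, 1), df)
```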
Comments regarding the use of χ² tests
1. The Chi-squared test is only valid for comparing observed and expected frequencies (counts). It is not valid for other quantities such as percentages, means, rates, etc.
2. The Chi-squared test is not valid when cells have expected frequencies less than 5. With very small frequencies in a 2×2 table, Fisher's exact test should be used.
EXERCISE
1.
                     Types of placenta
Miscarriage        Normal    Minor    Major    Total
Threatened           10        18       14       42
Not threatened       36        12        8       56
Total                46        30       22       98

Investigate the association between threatened miscarriage and the degree to which the placenta is circumvallate at delivery.
Chapter 10
ASSOCIATION BETWEEN QUANTITATIVE VARIABLES
INTRODUCTION
Examples of quantitative variables have been seen in Chapter 2. The methods for analyzing the
relationships between two or more of such variables are linear regression and correlation.
In order to illustrate the methods of linear regression and correlation, we will use data on body weight
and plasma volume of eight healthy men.
The objective of the analysis is to see whether a change in plasma volume is associated with a change
in body weight.
Table 10.1: Plasma volume and body weight in eight healthy men.

Subject    Body weight (kg)    Plasma volume (litres)
1          58.0                2.75
2          70.0                2.86
3          74.0                3.37
4          63.5                2.76
5          62.0                2.62
6          70.5                3.49
7          71.0                3.05
8          66.0                3.12
SCATTER DIAGRAM
When two related variables, also called bivariate data, are plotted on a graph in the form of points or
dots, the graph is called a scatter diagram. Each point on the diagram represents a pair of values, one
based on X-scale and the other based on Y-scale. Usually, making a scatter diagram is the first step in
investigating the relationship between two variables, because the diagram shows visually the shape
and degree of closeness of the relationship.
Values on the X-scale refer to the explanatory or independent variable and on the Y-scale refer to the
response or dependent variable. In situations where it is not clear which is the response variable, the
choice of axes is arbitrary.
In the above example, take the independent variable (x) to be body weight and the response variable
(y) to be plasma volume. The scatter diagram would look like the one drawn below.
[Figure: scatter diagram of plasma volume (y-axis, 2.0 to 3.6 litres) against body weight (x-axis, 56 to 76 kg).]
LINEAR REGRESSION
When a response variable appears to change with a change in values of the explanatory variable, we
may wish to summarize this relationship by a line drawn through the scatter of points.
Geometrically, any straight line drawn on a graph can be represented by the equation:
y = a + bx
y refers to the values of the response (dependent) variable and x to values of the explanatory
(independent) variable. The equation tells us how these variables, x and y, are related. The constant 'a'
is the intercept, the point at which the line crosses the y-axis; that is, the value of y when x = 0.
The coefficient of x variable ('b') is the slope of the line. It tells us the average change (increase or
decrease) due to a unit change in x. It is sometimes called the regression coefficient.
Although we could draw a line through these points 'by eye', this would be a subjective approach and therefore unsatisfactory. An objective, and therefore better, way of determining the position of the straight line is to use the method of least squares. Through this method, we choose a and b such that the sum of squares of the vertical distances of the points from the line is minimized - hence the term 'least squares'.
b is computed as follows:

b = Sxy/Sxx = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
For the plasma volume data, Σ(x - x̄)(y - ȳ) = 8.96 and Σ(x - x̄)² = 205.38 (these quantities are calculated in the correlation section below), so that
b = 8.96/205.38 = 0.0436
and a = ȳ - b x̄ = 3.00 - 0.0436 × 66.88 = 0.085.
The fitted regression line is therefore y = 0.085 + 0.0436x.
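The slope, intercept and correlation can be recovered from the summary quantities alone (a Python sketch; the intercept formula a = ȳ - b x̄ is the standard least squares result):

```python
import math

# Summary quantities from the plasma volume example
n = 8
sum_x, sum_y = 535.0, 24.02   # body weight (kg), plasma volume
sxy = 8.96                    # sum of (x - x_bar)(y - y_bar)
sxx = 205.38                  # sum of (x - x_bar)^2
syy = 0.678                   # sum of (y - y_bar)^2

b = sxy / sxx                 # slope (regression coefficient)
a = sum_y / n - b * sum_x / n # intercept: y_bar - b * x_bar
r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
print(round(b, 4), round(a, 3), round(r, 2))
```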
[Figure: axes as in the previous scatter diagram.]
Fig. 11.2 Scatter diagram of plasma volume and body weight showing the linear regression line.
CORRELATION
Linear regression provides us with a straight line with which to summarize the relationship between
two variables. However, it does not tell us how closely the data lie on a straight line. The closeness
with which the points lie along the straight line is measured by the (Pearson's) correlation coefficient,
r.
r = Sxy/√(Sxx Syy) = Σ(x - x̄)(y - ȳ) / √[ Σ(x - x̄)² Σ(y - ȳ)² ]

As we noted with the regression coefficient's calculation, here also further simplification can be made when calculating the terms in the denominator:
Σ(x - x̄)² = Σx² - (Σx)²/n
Σ(y - ȳ)² = Σy² - (Σy)²/n
Considering the above example,
Σ(x - x̄)(y - ȳ) = 1615.295 - (535 × 24.02)/8 = 8.96
Σ(x - x̄)² = 205.38
Σ(y - ȳ)² = 0.678
Therefore,
r = 8.96 / √(205.38 × 0.678) = 0.76
LOGISTIC REGRESSION
Introduction
We have so far dealt with simple linear regression with a continuous dependent variable. We can extend the methods of simple linear regression to deal with more than one independent variable, in the form of multiple linear regression. That is, the multiple regression model yields an equation in which the dependent (outcome) variable is expressed as a combination of the independent (explanatory) variables.
This takes the following form:
y = β0 + β1x1 + ... + βkxk, where
y is the dependent variable,
x1, x2, ..., xk are the k explanatory variables (sometimes called predictor variables or covariates), and
β0, β1, ..., βk are the regression coefficients.
As stated earlier on, these methods assume that the outcome variable of interest is numerical (and measured on a continuous scale), although the explanatory variables do not necessarily have to be continuous.
It is very common, however, in many kinds of medical research for the outcome variable of interest to be a proportion (or a percentage) rather than a continuous measurement.
We cannot use ordinary multiple linear regression for the analysis of the individual and joint effects of a set of explanatory variables on an outcome variable which is a proportion. Two features of proportions based on counts (proportions based on measurements do not come in here) are important when considering a statistical analysis:
(a) if the denominator of the proportion is n and the population value is π, the variance of this proportion is π(1 - π)/n; for a given n this depends on the value of π, being largest when π = 1/2 and smaller when π is in the neighbourhood of 0 or 1. Hence the usual assumption of constant variance σ² can no longer hold.
(b) when we relate a proportion variable to other quantities by some form of a regression model, we
need to take seriously the fact that the true proportion cannot go outside the range 0 to 1. Because of
this the parameters have a limited interpretation and range of validity. We can instead use a similar
approach known as multiple linear logistic regression or just logistic regression.
Transformed proportions
We can overcome some of the problems in (b) above by looking at the response proportion on a transformed scale which does not have the fixed boundaries at 0 and 1. Suppose p is the proportion of individuals with some characteristic of interest; equivalently, let p be the probability that a subject has a disease. Then 1 - p is the probability that the individual does not have the disease, and the odds of having the disease are p/(1 - p). As p changes from 0 to 1, the corresponding odds (i.e. the ratio p/(1 - p)) change from 0 to ∞. So this transformation removes one of the boundaries. To remove the other, we consider the odds on a logarithm (log) scale: the log odds will go from -∞ to +∞ as p goes from 0 to 1. If we use natural logs (i.e. logarithms to the base e), the transformation loge(p/(1 - p)) is called the logit of p.
Thus we write
logit(p) = loge( p/(1 - p) )
and this is the log odds. The estimated value of p can be derived back from the logit, since p = e^logit(p)/(1 + e^logit(p)).
For two proportions p1 and p0, the difference between their logits is
logit(p1) - logit(p0) = loge( p1/(1 - p1) ) - loge( p0/(1 - p0) ) = loge[ p1(1 - p0) / (p0(1 - p1)) ]
which is the log of the odds ratio.
Table 10.2: Number of mosquitoes killed in a batch by the dose of insecticide used.

Dose of insecticide    Number of mosquitoes killed    Number of mosquitoes in a batch
10.2                   44                             50
7.7                    42                             49
5.1                    24                             46
3.8                    16                             48
2.6                     6                             50
Plotting the proportion killed in each batch against the dose of insecticide (a log scale for the dose or concentration is usually appropriate) is a recommended starting point. The simple linear regression model will not fit the data very well, and it will lead us to expect responses which are negative for very low doses or greater than 1 for high doses. Fitting a logistic regression model to these data gives

logit(p) = loge( p/(1 - p) ) = -4.887 + 3.104 ln(dose)
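Inverting the logit gives the predicted proportion killed at each dose, which can be compared with the observed proportions (a Python sketch using the coefficients quoted above):

```python
import math

# Fitted model for the insecticide data: logit(p) = -4.887 + 3.104 ln(dose)
def predicted_kill(dose):
    logit = -4.887 + 3.104 * math.log(dose)
    return 1 / (1 + math.exp(-logit))   # invert the logit transform

for dose, killed, batch in [(10.2, 44, 50), (7.7, 42, 49), (5.1, 24, 46),
                            (3.8, 16, 48), (2.6, 6, 50)]:
    print(dose, round(killed / batch, 2), round(predicted_kill(dose), 2))
```

Note that the predicted proportions always lie between 0 and 1, unlike those from a straight-line fit.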
Table 10.4: Logistic regression analysis of the hypertension data shown above in Table 10.3.

                Regression coefficient (b)    Standard error se(b)    z = b/se(b)    p-value
Constant        -2.378                        0.380
Smoking (x1)    -0.068                        0.278                   0.24           0.810
Obesity (x2)     0.695                        0.285                   2.44           0.015
Snoring (x3)     0.872                        0.398                   2.19           0.028
The significance of each variable can be tested by treating z = b/se(b) as a standard normal deviate. We can see that the p-value for smoking is very large (0.81), and hence we can say that smoking has no association with hypertension. Obesity and snoring each have a significant association with hypertension (in both cases p < 0.05).
The analyses presented relate only to the main effects of obesity, smoking and snoring. We need to
consider also the possible presence of any important interaction between two of these factors. That is,
we should investigate whether the effect of a factor depends on the level of another factor. In fact this
was done, and no interaction term was found to be statistically significant at any interesting level.
Omission of smoking from the model produced only minimal changes in the other coefficients. Hence the regression equation for this model is
logit(p) = -2.378 - 0.068x1 + 0.695x2 + 0.872x3, where
x1, x2 and x3 are codes for smoking, obesity and snoring, respectively.
The above equation enables us to calculate the estimated probability of having hypertension, given
values of the three variables. In particular, we can obtain the odds ratio of hypertension associated
with any of the three factors. For example, let us consider variable x2, obesity:
putting x2 = 1 (for presence of obesity), gives:
logit(p1) = -2.378 - 0.068x1 + 0.695 + 0.872x3, and
putting x2 = 0 (for non-obese), gives:
logit(p0) = -2.378 - 0.068x1 + 0.872x3.
As discussed earlier, the difference logit(p1) - logit(p0) = 0.695 is the log odds ratio. Hence the odds ratio for hypertension associated with obesity = e^0.695 = 2.00. In general, for any binary variable the odds ratio (OR) can be estimated directly from the regression coefficient b as OR = e^b. Confidence limits follow immediately from the standard error of b, on taking b to have an approximately Normal distribution.
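The odds ratio for obesity and its 95% confidence limits can be recovered from the coefficient and its standard error (a Python sketch, assuming b is approximately Normal):

```python
import math

# Obesity coefficient and its standard error from Table 10.4
b, se = 0.695, 0.285

odds_ratio = math.exp(b)          # OR = e^b
lo = math.exp(b - 1.96 * se)      # lower 95% confidence limit
hi = math.exp(b + 1.96 * se)      # upper 95% confidence limit
print(round(odds_ratio, 2), round(lo, 2), round(hi, 2))
```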
EXERCISE
1. In the following data, four doses (on a log scale) of vitamin D were tested, each on a number of rats, and the results were assessed by means of a line test on bones in terms of arbitrary scores.

Dose, x      (Mean) response, y
-0.45        2.64
 0.25
 0.77
 1.46

Σx = 2.032;  Σy = 38.85
x̄ = 0.51;  ȳ = 9.71
Σx² = 2.9895;  Σy² = 486.4169
Σxy = 34.2929

(a)
(b)
Chapter 11
VITAL STATISTICS AND DEMOGRAPHY
SOURCES OF DEMOGRAPHIC INFORMATION
The quality of data depends on many factors, one of which is the source of the data. The source has direct implications for quality in terms of coverage, completeness and cost.
In this chapter we will concentrate on the following sources of demographic data:
(a)
Census
(b)
Vital registration
(c)
Sample surveys
Census
A census is a systematic, routine way of counting subjects within a defined boundary or limit of land. A census produces reports on individuals and on population size and structure at a point in time.
Originally, censuses were limited to people only, but more recently there are censuses of agriculture, business, livestock, housing, etc., sometimes done concurrently with the population census.
The main characteristic of a census is that it covers the whole population: no sampling is involved, and each person is enumerated separately. A census must have a legal basis to make it complete and compulsory. It reflects a single point in time, although the whole process can take much longer.
Basic questions which should appear on the questionnaire are name, age, sex, relationship with the
head of household, marital status, race/religion/ethnicity, education, occupation, employment status,
migration and amenities. Additional questions would depend on the availability and quality of vital
registration.
A population census can be carried out using either of the following methods:
1. De facto method: This method assigns persons to the area or location in which they are found during enumeration: the population "in fact" there. Where a person normally lives does not matter here. For example, in the 1988 Tanzania Population Census, Zanzibar had a population of 641,000; this means that these people spent the night in Zanzibar before the census night. Tanzania follows this method of enumeration.
2. De jure method: The de jure method of enumeration allocates persons to their normal residence, meaning "people who belong to the area or have the right to live there through citizenship, legal residence or whatever". For example, a businessman working in Dar es Salaam but living in Arusha would be assigned to Arusha in a de jure enumeration.
In Tanzania a census is normally conducted every ten years (decennial). This is a drawback for planning, in the sense that the population is changing rapidly because of births, deaths and movements. To overcome this problem, inter-censal surveys or mini-surveys are normally conducted; an example of such a survey is the 1991 Tanzania Demographic and Health Survey (TDHS). Further surveys on morbidity and on specific diseases can be conducted whenever a need arises.
Vital registration
The vital registration system is most common in developed countries, where information on births, marriages, deaths and migrations is collected. In developing countries the system, where it is employed at all, is prone to incompleteness; otherwise it is non-existent.
Questions in a vital registration system are always very simple and few. Consider hospital or health service data here in Tanzania: examples of such registrations are information on deaths found in hospitals (death certificates), birth and marriage data found in churches, mosques and Area Commissioners' offices, and migration data found at airports and borders.
The shortfall of vital registration systems is that they are normally incomplete, selective samples, diverse and in practice unreliable. This does not mean that the system should be discarded; instead it should be improved to remove these errors.
Sample surveys
Sample surveys give the same information, often in more detailed form, where a vital registration system does not exist. Only a sample of the population is involved; sample surveys are thus less costly than a census.
The other advantages of surveys include the pace of collecting the information: they are relatively quicker and can be more detailed than systems such as the census. The cost of surveys is the error introduced through sampling.
Measures of fertility:
There are four common measures of fertility. These are crude birth rate, general fertility rate,
gross reproductive rate and the total fertility rate.
i. The crude birth rate (CBR) relates the births in a year to the whole population:
(number of livebirths in a year × 1000) / (total mid-year population)
ii. The modern, conventional and much more acceptable 'rate' is the general fertility rate, or simply the 'fertility rate'. The denominator is restricted to women at risk of child-bearing rather than the general population. It is thus defined by:
(number of livebirths in a year × 1000) / (mid-year population of women aged 15-49)
iii. The total fertility rate is based on age specific fertility rates (ASFRs): the number of livebirths to women in an age group divided by the number of women in that group. Table 11.1 illustrates the calculation.

Table 11.1: Calculation of age specific fertility rates.

Age      Number of women    Number of livebirths    Age specific fertility rate
15-19    665000             21000                   0.0316
20-24    516000             114000                  0.2209
25-29    459000             118000                  0.2571
30-34    344000             123000                  0.3576
35-39    310000             37000                   0.1194
40-44    229000             6000                    0.0262
45-49    218000             5000                    0.0229
Total    2741000            424000                  1.0357
The total fertility rate (TFR) equals the sum of all age specific fertility rates, multiplied by the width of the age interval. In this case,
TFR = 1.0357 × 5 = 5.1785.
The sum of all ASFRs is multiplied by 5 because of the 5-year age group interval. If ages are in single years, there is no need to multiply the sum by 5.
The figure 5.1785 means that, on average, each woman will have about 5 children during her reproductive period, given that these age specific fertility rates still apply until she finishes her reproductive life.
Unlike the CBR and GFR, the calculation of the TFR requires age specific data, but as a measure it is independent of the age distribution of the population.
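The ASFR and TFR calculation from Table 11.1 can be reproduced in a few lines (a Python sketch):

```python
women  = [665000, 516000, 459000, 344000, 310000, 229000, 218000]
births = [21000, 114000, 118000, 123000, 37000, 6000, 5000]

# Age specific fertility rate for each 5-year group, 15-19 to 45-49
asfr = [b / w for b, w in zip(births, women)]
tfr = 5 * sum(asfr)   # multiply by 5: each age group spans 5 years
print([round(r, 4) for r in asfr], round(tfr, 2))
```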
iv. The gross reproductive rate (GRR) is similar to the total fertility rate except that it considers female livebirths rather than all births. This implies that the ASFRs for the GRR are based on female births only.
The GRR is interpreted as the average number of daughters a woman would have if she survived to at least age 50 and experienced the given female ASFRs. A figure of 1 means that women are able to replace themselves, while a figure of 2.0 means that the population is doubling itself: each woman is on average producing two daughters.
Like the TFR, the GRR is also a hypothetical measure. It is a period measure which does not take into account the effect of female mortality either before age 15 or between ages 15 and 50.
Referring to Table 11.1 above, given the number of female livebirths in each age group, the GRR would be computed in the same way as the TFR: sum the female ASFRs over the age groups 15-19 to 45-49 and multiply by 5.
Measures of morbidity:
i. Incidence rates:
Incidence measures the occurrence of new cases of a disease in a population relative to the number of persons at risk of contracting the disease. Therefore, the incidence rate is the rate of contracting the disease among those still at risk. Note the difference between being at risk of contracting the disease at the beginning of a period and being at risk during the entire period: the former refers to the incidence risk, the latter to the incidence rate. The incidence rate is expressed as:
(number of new cases of disease in a period of time × 10^k) / (number of person-years of exposure in the period)
where k = 2, 3, 4, 5 or 6 depending on convenience or convention.
ii. Prevalence rates:
Prevalence measures the extent to which a disease exists in a population. It is based on the total number of existing cases in the entire population. It can be measured at a point in time (point prevalence) or over a stretch of time (period prevalence).
Point prevalence 'rate' = (number of subjects with the disease at time t × 10^k) / (number of subjects in the population at time t)
iv. Specific rates:
These are rates which apply to particular subgroups: different geographical areas, specific age groups, each sex separately, educational or marital strata, etc. They are named according to that specification (e.g. age specific death rates).
Measures of mortality:
ii. Infant mortality rate = (number of deaths in a year under 1 year of age × 1000) / (number of livebirths in the same period)
The infant mortality rate is often broken down into several indices depending on the age categories of the infant:
Neonatal mortality rate = (number of deaths in a year under 28 days of age × 1000) / (number of livebirths in the same period)
STANDARDIZATION OF RATES
There are situations in which one intends to compare two or more different populations (geographical areas, different hospital populations, experimental groups, etc.) using the already mentioned crude rates (mortality, morbidity, fertility, etc.). Consider, for instance, the crude mortality rate. The risk of dying depends very much on age, and often differs according to sex: age specific death rates are high for infants and very old people, and low for the middle age groups.
The crude mortality rate and overall incidence rates will therefore depend on the age-sex composition of the population concerned. Crude rates may be misleading indicators of the level of mortality, morbidity, fertility, etc. when comparing two populations that do not have the same age and sex structure.
Standardization provides an overall summary measure of the event occurrence which does not depend on the age, sex, race or other distribution of the group. It therefore permits comparisons of event occurrence in two or more study groups which are adjusted for differences in the variable of interest.
Two methods of standardization which are commonly used are: (1) Direct standardization and
(2) Indirect standardization.
1. Direct standardization:
In direct standardization, the age (and sex) specific rates from each of the populations under study are applied to a standard population. The outcome is an age-sex adjusted mortality, morbidity or fertility rate.
In the indirect method, the age and sex specific rates of the standard population are applied to the study populations, to give standardized mortality, morbidity or fertility ratios.
The choice of which method to use depends very much on the availability of data. However, in general, direct standardization is used for prevalence while the indirect method is used for incidence.
The following information should be available when one intends to use the direct standardization method:
(a) The study population(s)' characteristics, e.g. age-sex specific rates.
(b) The standard population's composition.
Once these two sets of data have been obtained, (a) is applied to (b) to get, say, an age-sex adjusted rate.
Since the standard population may or may not be one of the populations to be compared, it has to be defined, sometimes arbitrarily. A common choice of standard population is the larger population from which the index (study) population(s) came.
The detailed steps in calculating the standardized rate for the index population are:
(a) Define your standard population.
(b) Apply the age- and sex- (or any other characteristic) specific rates of the index population to
the standard population to get the cases we would expect if the index population's rates
were operating in the standard population.
(c) Add these cases to get the total expected number of cases in all age groups.
(d) Divide the total expected number of cases by the total standard population to get
a crude rate known as the standardized incidence rate for the index population.
Table 11.2a: Results of a malaria survey in two villages

Age        VILLAGE A                VILLAGE B                Total
           Examined    Cases        Examined    Cases        examined
0-4           71         3             31         2            102
5-9           94         8             43         6            137
10-14         27         6             19        13             46
15-29         30        18             22        21             52
30-49         36        28             28        28             64
50+           29        23             15        15             44
Total        287        86            158        85            445

The expected malaria cases for each age group are obtained by multiplying the proportion of villagers
diseased (in each village separately) by the "standard" population (the two villages' total examined
in that age group). The results are the expected cases of malaria that would occur if the prevalence
rates of village A and village B, respectively, were operating in the standard population.
The age-adjusted prevalence of malaria = Expected cases / Total standard population. Thus,
Village A: 142.06 / 445 = 0.3192 = 31.92%
Village B: 214.98 / 445 = 0.4831 = 48.31%
Conclusion: Village B has higher prevalence (%) of malaria adjusted for age.
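The direct calculation above can be reproduced in a few lines of Python. This is a sketch using the figures from Table 11.2a; the variable and function names are ours, not part of any standard library.

```python
# Direct standardization of the malaria prevalence data in Table 11.2a.
# The age-specific rates of each village (index population) are applied to
# the combined population of both villages (the standard population).

examined_a = [71, 94, 27, 30, 36, 29]
cases_a    = [3, 8, 6, 18, 28, 23]
examined_b = [31, 43, 19, 22, 28, 15]
cases_b    = [2, 6, 13, 21, 28, 15]

# Standard population: total examined in both villages per age group.
standard = [a + b for a, b in zip(examined_a, examined_b)]   # 102, 137, 46, ...

def direct_adjusted_rate(cases, examined, standard):
    """Apply the index population's age-specific rates to the standard
    population, then divide the expected cases by the standard total."""
    expected = sum(c / n * s for c, n, s in zip(cases, examined, standard))
    return 100 * expected / sum(standard)        # rate per 100

print(round(direct_adjusted_rate(cases_a, examined_a, standard), 1))   # 31.9
print(round(direct_adjusted_rate(cases_b, examined_b, standard), 1))   # 48.3
```

The results agree with the hand calculation: village B has the higher age-adjusted prevalence.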
Considerations on the direct standardization method:
(a) The direct method of standardization requires stratum-specific (e.g. age-specific) rates in
the index population(s), which are sometimes not available. In this case the method cannot
be applied.
(b) The number of cases observed in the study population should be large enough to give
meaningful stratum-specific rates, which are necessary for direct standardization. Short of this, the
method cannot be used.
(c) In general, comparing disease rates in two or more groups via direct standardization is
subject to less bias than the indirect method. The reasons for this will not be discussed
here.
2. Indirect standardization:

Table 11.2b: Expected malaria cases obtained by applying the standard (pooled) age-specific rates to each village's population

Age        Rate per 100    VILLAGE A                  VILLAGE B
           (standard)      Population   Expected      Population   Expected
00-04          4.92            71          3.5            31          1.5
05-09         10.22            94          9.6            43          4.4
10-14         41.30            27         11.2            19          7.8
15-29         75.00            30         22.5            22         16.5
30-49         87.50            36         31.5            28         24.5
50+           86.50            29         25.0            15         13.0
Total         38.43           287        103.3           158         67.7
Dividing the observed number of malaria cases by the expected number gives the
standardized morbidity ratio:
Village A: 86 / 103.3 = 0.83 = 83%
Village B: 85 / 67.7 = 1.25 = 125%
Multiplying the crude rate of the standard population (38.43 per 100) by these standardized ratios gives
the actual age-adjusted morbidity rates for each group, controlling for the effect of age:
Village A: 38.43/100 × 0.83 = 0.32 = 32%
Village B: 38.43/100 × 1.25 = 0.48 = 48%
The primary advantage of indirect standardization lies in the fact that it does not
require knowledge of the stratum-specific rates of the index population(s), which are sometimes
not available.
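A matching sketch of the indirect method, again in Python with our own variable names, using the same village data. The small discrepancy with the hand calculation for village B (1.26 versus 1.25) comes only from rounding the intermediate rates.

```python
# Indirect standardization of the malaria data: the age-specific rates of the
# standard population (both villages pooled) are applied to each village, and
# observed cases are divided by expected cases to give the SMR.

examined_a = [71, 94, 27, 30, 36, 29]
cases_a    = [3, 8, 6, 18, 28, 23]
examined_b = [31, 43, 19, 22, 28, 15]
cases_b    = [2, 6, 13, 21, 28, 15]

# Standard age-specific rates: pooled cases / pooled examined.
std_rates = [(ca + cb) / (na + nb)
             for ca, cb, na, nb in zip(cases_a, cases_b, examined_a, examined_b)]

def smr(cases, examined, std_rates):
    """Standardized morbidity ratio = observed cases / expected cases."""
    expected = sum(r * n for r, n in zip(std_rates, examined))
    return sum(cases) / expected

# Crude rate of the standard population, per 100.
crude_std = 100 * (sum(cases_a) + sum(cases_b)) / (sum(examined_a) + sum(examined_b))

for name, cases, examined in [("A", cases_a, examined_a), ("B", cases_b, examined_b)]:
    ratio = smr(cases, examined, std_rates)
    # SMR, then the age-adjusted rate per 100 (crude standard rate x SMR).
    print(name, round(ratio, 2), round(crude_std * ratio, 1))   # A 0.83 32.0 / B 1.26 48.2
```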
EXERCISE:
Consider the following data for cancer mortality in the US in 1940 and 1986:
Age          1940                        1986
             Population (000)  Deaths    Population (000)  Deaths
00-04             10,541          494        18,152            666
05-14             22,431          667        33,860          1,165
15-24             23,922        1,287        39,021          2,115
25-34             21,339        3,696        42,779          5,604
35-44             18,333       11,198        33,070         14,991
45-54             15,512       26,180        22,815         37,800
55-64             10,572       39,071        22,232         98,805
65-74              6,377       44,328        17,332        146,805
75+                2,643       31,279        11,836        161,381
All ages         131,670      158,200       241,097        469,330
(a) Compute the crude cancer mortality rates for 1940 and 1986 and compare these rates.
(b) Using the US population in 1940 as the standard population, apply the direct method of
standardization. What are the age-adjusted cancer mortality rates for 1940 and 1986?
(c) Using the age-specific cancer mortality rates for 1940 as the standard, apply the indirect method to
compute the standardized mortality ratios for 1940 and 1986.
(d) How does the 1986 population compare with the 1940 population in terms of cancer mortality rate?
LIFE TABLES
Standardized death rates, which have been discussed above, can be used to study the level of mortality of
a population and to compare the mortality experience of two or more populations.
A standardized death rate is, however, a single-figure index of the level of mortality; it contains no
direct information about mortality at different ages. Life tables, on the other hand, can
summarize the mortality experience of a population at every age. They provide answers to questions
like: suppose 100,000 babies are born in a population on the same day; how many will survive to
celebrate their 1st, 2nd, etc. birthdays, assuming that they die at the current rates of mortality?
The use of current mortality rates for this calculation is, of course, hypothetical, since the babies
would in fact die at the rates prevailing at the times when they reach each age.
There are two distinct ways in which a life table may be constructed from mortality data:
In the current life table, the survival pattern of a group of individuals is described as if they were
subject throughout life to the age-specific death rates currently observed in a particular community.
This kind of life table is more often used for actuarial purposes and is less common in medical research.
The cohort life table, on the other hand, describes the actual survival experience of a group or 'cohort' of
individuals through time. The cohort may be babies born at the same time, an occupational group,
patients following a particular treatment, etc. This type of life table has its most useful application in
medical research in follow-up studies, e.g. an IUD retention study or, more generally, survivorship studies.
There are two types of life tables:
1. Full life table: includes every single year of age from 0 to the highest age to which any person
survives.
2. Abridged life table: usually considers only 5-year age groups, except that the first five years of life
may be considered singly.
Example:
The intra-uterine device (IUD) is a method of contraception which is not well tolerated by all women
because of medical side effects such as abdominal pain, excessive bleeding, infection, etc. If such side
effects occur the IUD is removed, although it may also be removed for non-medical reasons, such
as the woman wanting to become pregnant.
In an IUD retention study, 2,479 women who had an IUD inserted during the month of January were
interviewed. They were asked whether they had retained their IUD until the 24th month, during which
it was the practice to arrange a special medical check-up and remove the IUD. For those whose
IUDs were removed, the reasons for removal and the duration of use were determined. The results
of the survey indicated that 180 women lost their IUD during the first month after insertion and 162
during the second month. The corresponding figures for the third to the twenty-third
month were: 90, 85, 76, 180, 162, 90, 85, 76, 63, 51, 72, 85, 87, 72, 78, 70, 65, 90, 92, 89, 88.
This information can be represented in a life table as follows:

x     qx      px      lx      dx     ex
0     0.073   0.927   2479    180    11.75
1     0.07    0.93    2299    162    11.63
2     0.04    0.96    2137     90    11.47
3     0.04    0.96    2047     85    10.95
4     0.04    0.96    1962     76    10.41
5     0.09    0.91    1886    180     9.79
6     0.09    0.91    1706    162     9.79
7     0.06    0.94    1544     90     9.76
8     0.06    0.94    1454     85     9.34
9     0.06    0.94    1369     76     8.88
10    0.05    0.95    1293     63     8.38
11    0.04    0.96    1230     51     7.78
12    0.06    0.94    1179     72     7.09
13    0.08    0.92    1107     85     6.52
14    0.09    0.91    1022     87     6.02
15    0.08    0.92     935     72     5.73
16    0.09    0.91     863     78     4.96
17    0.09    0.91     785     70     4.40
18    0.09    0.91     715     65     3.78
19    0.13    0.87     650     90     3.11
20    0.16    0.84     560     92     2.50
21    0.19    0.81     468     89     1.93
22    0.23    0.77     379     88     1.27
23    1.00    0.00     291    291     0.50
We can use the life table above to calculate the following probabilities:
a) What is the probability that a woman who retained an IUD for the first six months will have it by the
end of the 20th month?
= l20 / l6 = 560 / 1706 = 0.33
b) What is the probability that a woman who retained an IUD up to the beginning of the 10th month will
lose it after the 18th month?
= l18 / l9 = 715 / 1369 = 0.52
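The lx column and the conditional probabilities above can be generated directly from the monthly removal counts reported in the study. The following Python sketch (variable names ours) rebuilds the table:

```python
# Cohort life table for the IUD retention study: 2,479 insertions, with the
# number of removals/losses (dx) recorded for each month x after insertion.

removals = [180, 162, 90, 85, 76, 180, 162, 90, 85, 76, 63, 51,
            72, 85, 87, 72, 78, 70, 65, 90, 92, 89, 88]       # months 0..22
removals.append(2479 - sum(removals))                          # month 23: all remaining IUDs removed

lx = [2479]                  # number still retaining the IUD at the start of month x
for dx in removals[:-1]:
    lx.append(lx[-1] - dx)

qx = [d / l for d, l in zip(removals, lx)]   # conditional probability of loss in month x

# Conditional retention probabilities are simply ratios of the lx column:
print(round(lx[20] / lx[6], 2))   # retained 6 months, still has it at month 20 -> 0.33
print(round(lx[18] / lx[9], 2))   # retained 9 months, still has it at month 18 -> 0.52
```

Note that building lx this way also exposes transcription errors: the survivors at each month must equal the previous month's survivors minus the previous month's removals.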
Example:
The following is an abridged life table for a certain country in a given year.

x      lx        10dx      10qx     10px     e°x
0      100000     2938     0.029    0.971    68.03
10      97062      847     0.009    0.991    59.94
20      96215     1489     0.015    0.985    50.42
30      94726     1867     0.020    0.980    41.13
40      92859     4386     0.047    0.953    32.86
50      88473    11017     0.124    0.876    23.15
60      77456    22512     0.291    0.709    15.70
70      54944    30275     0.551    0.449    10.20
80      24669    20869     0.846    0.154     6.57
90       3800     3720     0.979    0.021     5.21
100        80       80     1.000    0.000     5.00
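The internal consistency of an abridged life table can be checked mechanically: within each row 10qx should equal 10dx / lx, and each lx should equal the previous lx minus the previous 10dx. A short Python sketch using the lx and 10dx columns of the table above:

```python
# Consistency checks on the abridged life table: survivors carry forward
# (l_{x+10} = l_x - 10dx), and 10qx can be recomputed as 10dx / lx.

lx = [100000, 97062, 96215, 94726, 92859, 88473, 77456, 54944, 24669, 3800, 80]
dx = [2938, 847, 1489, 1867, 4386, 11017, 22512, 30275, 20869, 3720, 80]

# Each lx is the previous lx minus the previous decade's deaths.
for i in range(len(lx) - 1):
    assert lx[i] - dx[i] == lx[i + 1]

# Recompute 10qx from first principles; the values agree with the printed
# column to rounding.
qx = [round(d / l, 3) for d, l in zip(dx, lx)]
print(qx[0], qx[-1])   # 0.029 1.0
```

Running such a check is a quick way to catch misprinted entries before using a published life table.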
POPULATION PYRAMIDS
A population pyramid displays the age-sex composition of a population as two sets of horizontal bars,
one for each sex, placed back to back, with the youngest age group at the bottom. Since cohorts
normally lose part of their number each year through death or emigration, each bar is
usually shorter than the one below it, which gives the figure the appearance of a pyramid. A vertical comparison
of the bars shows the relative proportion of each age or age group in the population, while a horizontal
comparison shows the proportions of males and females in each age or age group.
The population pyramid can be based on absolute numbers or on percentages; the latter is more
common. The percentages are calculated using the total population of both sexes combined as the
denominator. If the percentages were calculated separately for males and females, the pyramid would
present a false picture.
Types of Pyramids
There are several types of population pyramids, but we will discuss the three most frequent forms.
The first class of population pyramid looks like an ordinary triangle. It reflects a population with
relatively high vital rates and a low median age. The age structure of the Netherlands in 1849 (Figure
11.2) fits this category.
The second variety has a broader base than the first. The 0-14 group is larger because this population is
beginning to control mortality but not fertility, and the most impressive gains in mortality reduction are
made in the younger age groups. The steeply sloping sides reflect the large proportion of young people
and the small percentage of aged people. The population structure of Hai District, Tanzania in 1994 fits
this description (Figure 11.3).
The third class of pyramid looks like a beehive. The numbers in this age-sex profile are roughly equal
for all age groups, decreasing gradually towards the apex. Many Western populations conformed to this
pattern in the 1930s, as seen in Figure 11.4.
EXERCISE
1.
Discuss the advantages and disadvantages of each of the systems of collecting data.
2.
During an epidemic of gastro-enteritis, the numbers of cases and deaths in a city hospital and in
all hospitals were as shown below:

Age group (years)   CITY HOSPITAL          ALL HOSPITALS
                    Cases     Deaths       Cases     Deaths
Under 1              240        41          1550       341
1-4                  140        21          1880       235
Above 4               20         8           500        16
Total                400        70          3930       592

(a) Calculate for the city hospital and all hospitals the case mortalities in each age group
and for all ages combined.
(b) Find the standardized mortality rate (Comparative Mortality Ratio) for the city
hospital by the direct method, using the case mortalities by age group of all hospitals
as the standard rates.
BIBLIOGRAPHY
1.
Armitage, P. and Berry, G. (1994). Statistical Methods in Medical Research, 3rd Edition.
Oxford: Blackwell Scientific Publications. (older versions are just as good for most topics).
2.
Brownlee, A., Pathmanathan, I., Varkevisser, C. (1991). Health Systems Research Training
Series, Volume 2 (Part 1): Designing and Conducting Health Systems Research Projects.
Canada: IDRC.
3.
Healy, M.J.R., Hills, M. and Osborn, J. (1987). Manual of Medical Statistics. Volume II.
London: London School of Hygiene and Tropical Medicine.
4.
Hill, A. Bradford (1984). A Short Textbook of Medical Statistics, 11th Edition. London:
Hodder and Stoughton.
5.
Kirkwood, B.R. (1988). Essentials of Medical Statistics, 1st Edition. London: Blackwell
Scientific Publications.
6.
Petrie, Aviva (1990). Lecture Notes on Medical Statistics, 2nd Edition. Oxford: Blackwell
Scientific Publications.