Biostatistics
CONTENTS
Chapter 1
Introduction
Background
Definition of Biostatistics
Need for Biostatistics
Application of Biostatistical Methods
Chapter 2
Descriptive Statistics
Introduction
Descriptive Methods for Qualitative Data
Descriptive Methods for Quantitative Data
Chapter 3
Probability
Introduction
Probability Calculation (Addition and Multiplication Rules)
Chapter 4
Chapter 5
Chapter 6
Estimation
Statistic and Parameter
The Standard Error of a Mean
The Standard Error of a Proportion
Chapter 7
Chapter 8
Chapter 9
Chapter 10
Linear Regression
Correlation
Logistic Regression
Chapter 11
FOREWORD
The study of statistics deals with the collection, processing and interpretation of data. The concepts of
statistics are applied in many scientific fields that include agriculture, business, engineering and health.
When focus is on biological and health sciences, the term biostatistics is used. This manual of
biostatistics was written for students of the health sciences and serves as an introduction to the study of
biostatistics. The contents of the manual are based on the requirements for the biostatistics courses
offered at the Muhimbili University College of Health Sciences for both undergraduates and
postgraduates.
Textbooks on mathematical statistics usually include theoretical examples and exercises. The task of
finding relevant data is so enormous that even textbooks on applied statistics rarely include practical
examples and exercises. In particular, a course in biostatistics which is not introduced via numerous
examples of real data renders a restrictive view of the subject and hence tends to discourage the
uninitiated student. This manual is intended to provide substantial contact with a variety of statistical
methods and data sets so that the student can appreciate their application and the contexts in which they
are used. In the process the manual will facilitate the student's learning and provide handy notes and
references for further reading.
The authors have performed a valuable service in compiling the present manual. Many of the examples
and exercises given in this collection are based on health-related data, and the techniques which the
student is expected to apply cover a wide range of commonly used techniques. The manual will be of
great value both as the basis for a taught course and for private study.
ACKNOWLEDGEMENT
This work would have been impossible without the generous financial support of SIDA (SAREC) as
part of Research Capability Strengthening in the Department of Epidemiology/Biostatistics.
Japhet Z. J. Killewo
Associate Professor and Head of Department
Department of Epidemiology/Biostatistics
Chapter 1
INTRODUCTION
BACKGROUND
Biostatistics can be defined as the application of statistics to biological problems. To many
biomedical scientists, however, the term is considered to mean the application of statistics
specifically to medical problems. For this group of people, therefore, biostatistics and medical
statistics are synonymous. Indeed the kind of (bio)statistics taught in University Medical Schools is
medical statistics in which some applications which are specific for agricultural sciences, for
example, are not included.
Conversely, in Universities of Agriculture the term BIOMETRY is preferred to biostatistics.
Biometry (literally meaning measurement of life), refers to the application of statistical methods to
the analysis of biological data. In strict terms this should include analysis of data from (human)
medical sciences as well, but in practice less weight is attached to this.
Whether biometry or biostatistics (and in some places biomathematics is used) the word statistics is
implied. We attempt in the following section to define statistics by describing what it is.
DEFINITION OF BIOSTATISTICS:
We can define statistics in two forms:
First, statistics as a noun is the plural of the word statistic, which simply means numerical statements (i.e. information that is available in numbers). Examples of this include:
(i) hospital data on the number of admissions for some condition in a defined time period;
(ii) how much drug (e.g. chloroquine tablets) is distributed to health units (hospitals, health centres, dispensaries, etc.).
Secondly, statistics as a discipline is a field of study concerned in broad terms with:
(i) Collecting, organizing and summarizing data in a systematic way.
(ii) Drawing inferences about a population on the basis of only a part of the population targeted.
Note: A singular form of statistics in this sense is as meaningless as putting mathematic or physic as singulars for mathematics or physics, respectively.
The first part of the subject is usually referred to as Descriptive Statistics, while the second part,
which provides objective means of drawing conclusions, constitutes Inferential Statistics.
In this course we are concerned mainly with the second sense of the meaning of statistics - that is, as
a discipline. Moreover, from the above background, the kind of BIOSTATISTICS here will be
specifically that of medical statistics.
With these results one may be tempted to conclude that the new treatment is better than the old (standard).
But an analysis which looks at the results for male patients separately from the female patients revealed the
following:
Table 1.2:
FEMALES

TREATMENT    Improved    Did not Improve    Total    % Improved
Standard        32              8             40         80
New             96             64            160         60
Total          128             72            200
From this table, we note that for female patients it is the standard treatment that is doing better. This is
exactly the opposite of what we saw in the overall assessment, and one might expect the new treatment to fare
better among the male patients.
If this holds, the conclusion would be: give the old (standard) treatment to female patients and the new treatment to male patients. In practical terms, the decision following this controversial conclusion would be undesirable. However, when we look at the results relating to the male patients we see the following:
Table 1.3:
MALES

TREATMENT    Improved    Did not Improve    Total    % Improved
Standard        48            112            160         30
New              4             36             40         10
Total           52            148            200
Just as in female patients, it is the standard treatment that produces the higher percentage of improvement. You
should check and verify that all this hangs together; the overall rate of improvement for the standard
treatment, for example, is (32+48)/(40+160) = 80/200 = 40% as shown above. With a proper statistical
method of analysis it becomes clear that the difference in improvement between the two treatments, once
sex has been taken into account, is 20% in favour of the standard treatment. Such features are common in
medical surveys and are a typical aspect of observational studies. The situation would have been kept under
control in an experimental study design. These arguments emphasize the need for biostatistical methods not
only for data analysis but also for study design.
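This reversal (the aggregate comparison contradicting every sex stratum) can be verified with a short computation. The following sketch, not part of the manual itself, uses the counts from Tables 1.2 and 1.3:

```python
# Improvement counts from Tables 1.2 (females) and 1.3 (males):
# each entry is (improved, total) for a treatment within a sex stratum.
data = {
    "female": {"standard": (32, 40), "new": (96, 160)},
    "male":   {"standard": (48, 160), "new": (4, 40)},
}

def pct_improved(improved, total):
    """Percent improved, e.g. 32 out of 40 -> 80.0."""
    return 100.0 * improved / total

# Stratum-specific rates: the standard treatment wins in BOTH sexes...
for sex, arms in data.items():
    std = pct_improved(*arms["standard"])
    new = pct_improved(*arms["new"])
    print(f"{sex}: standard {std:.0f}%, new {new:.0f}%")

# ...yet the pooled (crude) rates point the other way, because sex is
# associated with both treatment received and outcome (Simpson's paradox).
crude = {}
for arm in ("standard", "new"):
    improved = sum(data[sex][arm][0] for sex in data)
    total = sum(data[sex][arm][1] for sex in data)
    crude[arm] = pct_improved(improved, total)
print(f"pooled: standard {crude['standard']:.0f}%, new {crude['new']:.0f}%")
```

The pooled rates (40% standard vs 50% new) favour the new treatment even though each stratum favours the standard one, which is exactly the trap the text warns about.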
Chapter 2
DESCRIPTIVE STATISTICS
INTRODUCTION
Numerical information needs to be summarized before it can be used. The methods of summarizing data
(methods of descriptive statistics) vary with different types of data which are generated from different types
of variables. We first define what a variable is and then distinguish the different types of data. A variable is
an observation, characteristic or phenomenon that can take different values for different persons, times,
places, etc.
Examples of variables
VARIABLE                  POSSIBLE VALUES
Height (cm)               158, 169.3, 170, 200.6, etc.
Weight (kg)               10.2, 50, 69.4, 84, etc.
Parity                    0, 1, 6, 8, 10, etc.
Outcome of disease        Recovery, Chronic illness, Death
Marital status            Single, Married, Widowed, Separated
Age (years)               1, 5, 30, 36, etc.
Haemoglobin (g/dl)        8.9, 14.2, 12.7, etc.
Number of AIDS cases      278, 301, 313, 350, etc.
Types of variables
There are two types of variables:
(1) Qualitative (categorical) variables
(2) Quantitative (numerical) variables.
(a) Nominal measurement
These are used for identifying various categories that make up a given variable.
Example:
(1) Religion: 1= Muslim , 2 = Christian, 3 = Other
(2) Sex: 1=male, 2=Female
Note that the numbering (codes) does not signify ranking.
The categories comprising a nominal variable are mutually exclusive (they cannot occur together) and carry no order.
(b) Ordinal measurement
These are used to reflect a rank order among categories comprising a variable.
Example: perceived level of pain
1=No pain, 2=moderate pain, 3=severe pain
The numbers used have no meaning other than to indicate rank order.
Ordinal measurement enables one to make a qualitative comparison (such as more/less pain) but not a
quantitative comparison such as how much more.
(c) Interval measurement
Numbers used for this level of measurement are more meaningful than in the former levels.
Arithmetic operations (+ and -) can be performed. The distance between any two consecutive
points is the same along the scale.
Examples:
i.
ii.
(d) Ratio measurement
This is the most sophisticated level of measurement. This level has all the characteristics of interval
measurement but it has an absolute zero point that represents an absence of the measured quantity.
Example, weight, length, height, age, etc.
Note: Measurement at ratio level can be converted to lower levels.
Example, the following data shows a qualitative variable "Result of sputum examination".
If: 1 stands for smear -ve, culture -ve.
2 stands for smear -ve, not cultured.
3 stands for smear or culture +ve.
1 1 1 3 1 2 2 1 2 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 3
2 1 1 1 1 1 2 2 1 1 1 1 3 1 3 1 1 3 1 1 3 2 1 3 1
1 1 1 1 2 2 1 1 1 1 1 1 1 3 3 3 2 1 1 1 1 1 2 1 1
1 1 2 1 2 1 2 2 1 3 1 1 1 3 1 1 1 3 2 1 2 3 1 3 3
1 3 1 1 1 3 3 1 2 2 1 1 1 1 3 1 3 1 3 1 3 1 1 3 1
1 3 1 1 3 1 2 3 1 1 1 1 1 1 1 2 1 1 3 3 1 3 2 3 1
1 2 1 1 3 1 1 1 2 2 3 1 3 1 3 2 3 1 1 2 3 1 1 2 1
3 3 2 1 1 3 1 1 1 3 1 1 1 1 1 1 1 3 1 1 1 2 1 1 2
2 1 1 1 1 1 1 3 1 1 1 2 1 1 2 1 1 2 1 3 2 1 3 2 1
1 1 1 1

Frequency: 1 -> 144, 2 -> 40, 3 -> 45 (total 229)
Table 2.1: A frequency distribution for the variable "Result of sputum examination".

VALUE                      Frequency    Relative frequency    Cumulative relative frequency
Smear -ve, culture -ve        144             62.9                      62.9
Smear -ve, not cultured        40             17.5                      80.4
Smear or culture +ve           45             19.6                     100.0
Total                         229            100.0
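A frequency distribution like Table 2.1 can be tallied programmatically. The sketch below uses a short hypothetical sample of the 1/2/3 codes rather than the full 229 observations:

```python
from collections import Counter

def frequency_table(values):
    """Return (value, frequency, relative %, cumulative relative %) rows,
    in sorted order of the coded values."""
    counts = Counter(values)
    n = len(values)
    rows, cum = [], 0.0
    for value in sorted(counts):
        rel = 100.0 * counts[value] / n
        cum += rel
        rows.append((value, counts[value], round(rel, 1), round(cum, 1)))
    return rows

# Hypothetical shortened sample of sputum-examination codes (1/2/3):
sample = [1, 1, 2, 3, 1, 1, 2, 1, 3, 1]
for row in frequency_table(sample):
    print(row)
```

Applied to the full data set above, the same function reproduces the 144/40/45 tally of Table 2.1.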
Use of diagrams:
Frequency distributions can be illustrated visually by means of statistical diagrams. These diagrams serve two main purposes:
(i) Presentation of information/data (e.g. in reports or articles) for ease of appreciation.
(ii) To serve as a private aid for further statistical analysis.
Two types of diagrams are commonly used to illustrate qualitative data. These are pie charts and bar charts.
1. Pie chart
Pie charts are used to express the distribution of individual observations into different categories (Note: The
frequencies should be converted into percentages totalling 100 for a pie chart to be used).
Example, below is a pie chart showing the distribution of first year students at Muhimbili University
College of Health Sciences (MUCHS) by course of study.
Fig 1: Distribution of First Year Students at MUCHS by Course of Study (MD 49.0%, BPharm 25.0%, DDS 14.0%, BSc N 12.0%).
2. Bar chart
These are the simplest and most effective means of illustrating qualitative data. The various categories of a
variable are represented on one axis and the frequency or relative frequency is represented on the other axis.
The length of each bar represents the number of observations (frequency) in each category or the relative
frequency in percentage. Example, consider the following birth control method mix in a certain population:

Abstinence              3%
Oral contraceptive     32%
Depo Provera            9%
Loop                   17%
Spermicides             7%
Condoms                26%
Vasectomy               3%
Hysterectomy            2%
Norplant                1%
In this example, use of a pie chart for this variable would not be suitable because, with nine categories, the diagram would be overcrowded and the smaller segments hard to distinguish.

Fig. 2: Bar chart showing the birth control method mix (percentage on the vertical axis, method on the horizontal axis).
In a two way table data are presented in rows and columns. The format for a table depends upon the data and
the aspects of the data which are important to portray.
A two-way table should include the following:
1. A clear title.
2. A caption for the rows and columns with units of measurement of the variable.
3. Labels for each individual row or column, i.e. the values taken by the variable concerned.
4. Marginal and grand totals.
Consider the following example:
In a study to investigate whether or not HIV1 infection is a risk factor to pulmonary tuberculosis (PTB), a
total of 2165 individuals were examined. Blood samples were also collected from these individuals for
laboratory diagnosis of HIV1 infection.
The following results were obtained:
Of the 2165 individuals examined, 639 were found to be negative for HIV1 infection. Of those who were
negative, 57 were found to have PTB. Of the 1526 who were HIV1 positive, 875 were found to have PTB.
Table 2.2:

                            PTB STATUS
HIV STATUS      Positive         Negative         Total
Positive        875 (57.0)       651 (43.0)       1526 (100.0)
Negative         57 (8.9)        582 (91.1)        639 (100.0)
Total           932 (43.0)      1233 (57.0)       2165 (100.0)
Numbers in brackets show the row percentages.
The cells of a two way table may contain percentages instead of the real counts. Calculation of percentages
may be row-wise or column-wise depending on the purpose of the table.
Example: In the above table our interest is to investigate whether HIV1 infection is a risk factor to PTB. So
our aim is to see whether PTB is higher in HIV1 positives than in HIV1 negatives. Hence, the row
percentages are more appropriate in this case.
(ii) Proportion of girls in the first year at MUCHS = (Number of girls in 1st year) / (Total number of 1st year students)

Proportion of male births = (Number of male births) / (Total number of births)
That is, crude death rate = (Number of deaths in one year / Total population) x 1000
Rates may be expressed per 1000, per 100,000 or per 1,000,000 population depending on convention and
convenience.
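As a small illustration of the formula, the function and the figures below are hypothetical, not from the manual:

```python
def crude_death_rate(deaths, population, per=1000):
    """Deaths in one year per `per` population (per 1000 by default)."""
    return deaths * per / population

# Illustrative figures: 8,250 deaths in a population of 1,500,000.
rate = crude_death_rate(8250, 1_500_000)
print(f"{rate:.1f} per 1000")            # 5.5 per 1000
print(crude_death_rate(8250, 1_500_000, per=100_000))  # same rate per 100,000
```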
Table 2.4: Frequency distribution of number of lesions caused by small pox virus in egg membranes.

NUMBER OF LESIONS    FREQUENCY (NUMBER OF MEMBRANES)
0-                           1
10-                          6
20-                         14
30-                         14
40-                         17
50-                          8
60-                          9
70-                          3
80-                          6
90-                          1
100-                         0
110-119                      1
Total                       80
Note: "-" means up to but not including the next tabulated value. Example, 10- means 10 is the lower limit
while 19 is the upper limit. 14.5 is the mid point for the class interval 10- .
The following rules are used to make frequency distribution for a grouped data.
1. Determine the range, R, of values. (R=largest value -smallest value)
2. Decide on the number, I, of classes. This number depends on the form of data and the requirements of the
frequency distribution. But usually they should be between 5 and 20 for convenience.
3. Determine the width of the class interval, W, such that W=R/I. A constant width for all classes is
preferable.
4. Choose the upper and lower limits of the class intervals carefully to avoid ambiguities.
5. List the intervals in order. Use tallies to allocate each observation into the class in which it falls. Add the
tally marks to obtain class frequencies.
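The five steps above can be sketched in code. The function name and the equal-width, rounded-up choice of W are illustrative assumptions, not part of the manual:

```python
def grouped_frequencies(data, n_classes):
    """Steps 1-5: range, number of classes, width, limits, and tally."""
    lo, hi = min(data), max(data)
    r = hi - lo                          # step 1: the range R
    width = -(-r // n_classes) or 1      # step 3: W = R/I, rounded up
    # step 4: lower limits, starting at the smallest value
    limits = [lo + i * width for i in range(n_classes)]
    freqs = [0] * n_classes
    for x in data:                       # step 5: tally each observation
        i = min((x - lo) // width, n_classes - 1)
        freqs[i] += 1
    return [(limits[i], limits[i] + width, freqs[i]) for i in range(n_classes)]

data = [3, 7, 12, 15, 18, 21, 22, 25, 30, 31, 34, 38]
for lower, upper, f in grouped_frequencies(data, 5):
    print(f"{lower}- (below {upper}): {f}")
```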
Use of diagrams in quantitative data:
A: Histograms:
A histogram is a familiar bar-type diagram. Values of a variable are represented on a horizontal scale and the
vertical scale represents the frequency or relative frequency at each value. Each bar centres at the mid point
of the class. Example, using data on Table 2.3,
Fig 3: Histogram representing the frequency distribution of counts of trypanosomes in the tail blood of a rat.
If the frequency distribution is made of class intervals which are not equal, it is necessary to calculate the
average frequency per standard interval.
Example:
Table 2.5: Frequency distribution of age at loss of last tooth

Age      Frequency    Interval width    Average No/year of age
11-15        1              5                  0.20
16-19        7              4                  1.75
20-24       21              5                  4.20
25-29       35              5                  7.00
30-34       40              5                  8.00
35-44       58             10                  5.80
45-54       28             10                  2.80
55-74       10             20                  0.50
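The adjustment for unequal intervals can be computed directly. The triples below are read from Table 2.5, and the inclusive integer-age convention for interval widths (e.g. "11-15" spans 5 years) is an assumption about how the manual obtained its widths:

```python
# (lower age, upper age, frequency), taken from Table 2.5.
classes = [(11, 15, 1), (16, 19, 7), (20, 24, 21), (25, 29, 35),
           (30, 34, 40), (35, 44, 58), (45, 54, 28), (55, 74, 10)]

for lower, upper, freq in classes:
    width = upper - lower + 1            # inclusive integer-age intervals
    density = freq / width               # average number per year of age
    print(f"{lower}-{upper}: width {width}, avg/year {density:.2f}")
```

The printed densities match the "Average No/year of age" column of the table, and these (not the raw frequencies) are what the adjusted histogram should plot.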
Fig. 4: Histogram of the data in Table 2.5, with the average number per year of age on the vertical axis.
B: Line diagrams:
These are often used to express the change in some quantity over a period of time or to illustrate the
relationship between continuous quantities. Each point on the graph represents a pair of values, i.e. a
value on the x-axis and a corresponding value on the y-axis. The adjacent points are then connected by
straight lines.
Fig. 5 A line diagram showing cumulative number of AIDS cases in Tanzania from 1983 to 1992.
C: Frequency polygons
Frequency polygons are a series of points (located at the mid-point of the interval) connected by straight
lines. The height of these points is equal to the frequency or relative frequency associated with the values of
the variable (or the interval). The end points are joined to the horizontal axis at the mid points of the groups
immediately below and above the lowest and highest non-zero frequencies respectively.
Frequency polygons are not as popular as histograms but are also a visual equivalent of a frequency distribution. They can
easily be superimposed and are therefore superior to histograms for comparing sets of data.
Fig. 6: Frequency polygon for the number of trypanosomes in the tail blood of a rat.
Fig. 7: Cumulative frequency curve for the number of trypanosomes in the tail blood of a rat.
Generally: x̄ = ΣXi / n

where ΣXi = X1 + X2 + X3 + ... + Xn and n is the number of observations.
With grouped data, the class midpoint should be used when calculating the mean. Consider the data in Table
2.4. The mean number of lesions caused by small pox virus in egg membranes is:
[(5x1) + (15x6) + (25x14) + ... + (95x1) + (105x0) + (115x1)] / 80 = 3670/80 = 45.9
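The grouped-mean calculation can be checked with a few lines of code, using the midpoints and frequencies as read from Table 2.4:

```python
# Class midpoints and frequencies from Table 2.4.
midpoints   = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95, 105, 115]
frequencies = [1,  6, 14, 14, 17,  8,  9,  3,  6,  1,   0,   1]

total = sum(m * f for m, f in zip(midpoints, frequencies))  # sum of midpoint x frequency
n = sum(frequencies)                                        # total number of membranes
mean = total / n
print(f"mean number of lesions = {total}/{n} = {mean:.1f}")
```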
The arithmetic mean is a preferred measure since it uses more information from each observation. However it
tends to be pulled by extreme values. Example, the following are duration of stay in hospital (in days) for
some condition.
5 5 5 7 10 20 102
The mean duration of stay is x̄ = 154/7 = 22 days, which does not reflect the typical duration of stay.
2. Median:
The median is the middle observations when all the observations are listed in increasing or decreasing order.
Example, below is a series of duration (in days) of absence from classes due to sickness.
1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6, 6, 7, 8, 10, 10, 38, 80
The median is 5.
Generally, when n (the number of observations) is odd, the median is the ½(n+1)th observation. When n is
even there is no single middle observation, so the median is the mean of the two middle observations, i.e. the
(n/2)th and (n/2 + 1)th observations.
In frequency distributions, the median can be obtained by accumulating the frequencies and noting the value
of the variable which divides the data into two equal halves, i.e. the value below which half (n/2) of the
observations lie.
Note:
1. The median is less efficient than the mean because it takes no account of the magnitude of most of the
observations.
2. If two groups of observations are pooled, the median of the combined group can not be expressed in terms
of the medians of the two component groups.
3. The median is much less amenable than the mean to mathematical treatments and so it is less used in more
elaborate statistical techniques.
However if the data are distributed asymmetrically, the median is more stable than the mean. Consider the
example on the duration of stay in hospital where the median is 7; this is more realistic than the calculated
mean of 22 days.
3. Mode:
The mode is the value with the highest frequency. i.e. The value which occurs most frequently. The modal
value (days) for the duration of stay in hospital, example given above, is 5.
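Python's standard statistics module gives all three measures for the hospital-stay data used above:

```python
from statistics import mean, median, mode

stays = [5, 5, 5, 7, 10, 20, 102]   # duration of stay (days), from the text

print(mean(stays))    # pulled up to 22 by the extreme value 102
print(median(stays))  # 7, more representative for these skewed data
print(mode(stays))    # 5, the most frequently occurring value
```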
Measures of variability:
These measures express the degree of variation or scatter of a series of observations. Common measures of
variation are range, variance and standard deviation.
1. The range:
This is the difference between the maximum value and the minimum value.
Example: if the lowest and highest of a series of diastolic blood pressure are 65 mm Hg and 95 mm Hg.
Then, the range = 95 mm Hg - 65 mm Hg = 30 mm Hg.
The range is seldom used in statistical analysis because:
a) It wastes information, since it uses only the two extreme values.
b) The two extreme values are more likely to be faulty.
c) The range increases with increasing number of observations.
2. Variance and standard deviation:
The variance is a measure of variability which makes use of the difference between each observation and the
mean, i.e. (Xi - x̄). If all the differences were added together and their mean calculated, this might seem to
indicate the overall variability of the observations. But Σ(Xi - x̄) is always zero, since some differences are
positive while others are negative. Because of this, the differences are squared.

The variance is the mean value of the squared deviations from the mean, i.e.

variance = Σ(Xi - x̄)² / n

and the numerator, Σ(Xi - x̄)², is called the sum of squares about the mean.
Since these differences are squared, the variance is measured in the square of the units in which the variable
X is measured. For example, if X is height in cm. The variance will be in cm2.
A measure of variation which is measured in the original units of the variable is the standard deviation
which is the square root of the variance.
Standard deviation = √[Σ(Xi - x̄)² / n]

The standard deviation shows the average deviation of observations from the mean, and the interval
x̄ ± 2SD covers roughly 95% of all the observations.
The population variance is in most cases unknown because data are normally not available for the whole
population. When this is the case, the population variance is estimated by the sample variance, S2.
S² = Σ(Xi - x̄)² / (n - 1)
Note a change in the denominator from n to n-1. When n-1 is used in the denominator, it gives a better
estimate of the population variance than when n is used.
Calculation of variance and standard deviation:
To calculate the variance and standard deviation for the following data:

Xi      Xi - x̄     (Xi - x̄)²
 8         0            0
 5        -3            9
 4        -4           16
12        +4           16
15        +7           49
 5        -3            9
 7        -1            1
56                    100

ΣXi = 56, n = 7, x̄ = 56/7 = 8
Σ(Xi - x̄)² = 100

S² = 100/6 = 16.67
S = √16.67 = 4.08
Variance and standard deviation can be calculated using the shortcut formula for Σ(Xi - x̄)² (don't forget to
divide by n-1 afterwards):

Σ(Xi - x̄)² = ΣXi² - (ΣXi)²/n

So using the same data above:

Xi       Xi²
 8        64
 5        25
 4        16
12       144
15       225
 5        25
 7        49
ΣXi = 56    ΣXi² = 548

Σ(Xi - x̄)² = 548 - (56)²/7 = 548 - 3136/7 = 548 - 448 = 100

S² = 100/6 = 16.67
S = √16.67 = 4.08
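Both the definitional and the shortcut formulas can be verified on the same seven observations:

```python
from math import sqrt

x = [8, 5, 4, 12, 15, 5, 7]
n = len(x)
xbar = sum(x) / n                                    # 56/7 = 8

# Definitional form: sum of squared deviations from the mean.
ss_dev = sum((xi - xbar) ** 2 for xi in x)

# Shortcut form: sum(Xi^2) - (sum Xi)^2 / n.
ss_short = sum(xi * xi for xi in x) - sum(x) ** 2 / n

s2 = ss_dev / (n - 1)        # sample variance (note the n-1 denominator)
s = sqrt(s2)                 # sample standard deviation
print(round(s2, 2), round(s, 2))
```

Both sums of squares come out to 100, giving S² = 16.67 and S = 4.08 as in the worked example.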
We defined the mode as the value of the variable which occurs most frequently. In other words it is the value
at which the frequency curve reaches a peak. When the frequency distribution has one peak (one mode), it is
called a unimodal distribution.
Table 2.6:

NUMBER OF MALES    FREQUENCY (NUMBER OF SIBSHIPS)
0                        161
1                      1,152
2                      3,951
3                      7,603
4                     10,262
5                      8,498
6                      4,948
7                      1,655
8                        264
Total                 38,495
The mode of this distribution is 4 and the distribution is Unimodal as seen in Fig. 8.
In some of the unimodal distributions, the frequency curve is "BELL SHAPED". i.e. the mode is somewhere
between the two extremes of the distribution. Such distributions are said to be symmetric.
In symmetric distributions the mean, mode and median coincide. Other Unimodal distributions are
asymmetric. Asymmetric distributions have the mode (peak) not at the centre of the distribution curve. An
asymmetric distribution is called skew distribution.
The distribution is positively skew if the upper tail is longer than the lower tail and is negatively skew if the
lower tail is longer than the upper tail.
Some distributions have more than one mode. If a distribution has two modes, it is called a bimodal
distribution. If the distribution is symmetric but bimodal, the mean and the median are approximately the
same, but this common value can lie somewhere between the two peaks.
26
12
10
NO.OF MALES
Fig.8:
Normally a bimodal distribution indicates that within the population under study there are two distinct groups
which differ in the variable being measured.
Examples of variables that follow a Bimodal distribution are:
i. Body temperature of malaria patients
ii. Distribution of values of dilution levels of phenylthiourea solution to determine tasters and non-tasters.
The table below shows data for 104 medical students who determined their taste threshold to phenylthiourea
(PTC):
Table 2.7:

Solution number    Concentration (mg/l)    Number of students
1                        1.27                    11
2                        2.54                    16
3                        5.08                    23
4                       10.20                    12
5                       20.30                     3
6                       40.60                     0
7                       81.20                     3
8                      182.00                     5
9                      325.00                     8
10                     650.00                    10
11                    1300.00                     8
12                   >1300.00                     5
Fig. 9: Frequency distribution of taste thresholds to PTC (data of Table 2.7), showing the bimodal pattern.
EXERCISE
1.
The following table shows the numbers of viral infected patients not in hospital and in hospital
subdivided by sex and age.
                 NOT IN HOSPITAL        IN HOSPITAL
Age (years)      Males     Females      Males     Females
0 - 14             43        42           25          9
15 - 29            59        49           55         27
30+                65        28           39         14
Obtain a two way summary table to show how the proportion (in percent) of patients who are in hospital
varies with : i. Age ii. Sex
2.
The following table shows the numbers of accidental deaths by place of death in selected years.

                                 YEAR
PLACE OF DEATH     1971     1976     1980     1982     1983
Transport          8401     7306     6945     6407     6138
Work                860      712      630      457      443
Home               6917     6250     6009     5468     5514
Other              3068     2831     2516     2781     2459
Construct:
a. A bar chart showing accidental deaths by place for each year shown.
b. A pie chart showing accidental deaths by place for 1983.
3.
A sample of 11 patients admitted for diagnosis and evaluation to a newly opened psychiatric ward of
a general hospital experienced the following lengths of stay.
PATIENT NUMBER     1    2    3    4    5    6    7    8    9   10   11
LENGTH OF STAY    29   14   11   24   14   14   14   28   14   18   22

Find:
a. The mean length of stay for these patients.
b. The variance.
c. The mode.
4.
The following are the fasting blood glucose levels of 100 children.
56 57 62 63 64 60 61 67 69 68
65 65 68 65 75 66 69 72 75 81
69 66 65 65 65 68 72 73 73 81
65 65 66 68 66 72 73 75 66 73
73 68 69 67 67 67 61 67 77 65
75 62 55 63 60 57 57 62 67 59
72 61 73 63 80 61 76 57 68 64
64 65 76 58 56 71 58 55 79 71
60 80 80 55 65 73 75 74 68 63
74 59 55 65 59 56 52 75 63 74

5.
The following are the number of babies born during a year in 60 community hospitals.
30 37 32 39 52 55 55 26 56 57 45 43 28 58 46 27 52 40 59 43
56 54 53 49 54 48 42 54 53 31 45 32 29 30 22 49 59 42 53 31
32 35 42 21 24 57 46 54 34 24 47 24 53 28 57 56 57 59 50 29
From these data find:
(a) The mean, (b) The median, (c) The variance, (d) The standard deviation.
6.
The following are the haemoglobin values (g/100ml) of 10 children receiving treatment for
haemolytic anaemia.
9.1 10.1 11.4 12.4 9.8 8.3 9.9 9.1 7.5 6.7
Compute:
The sample mean, median, variance and the standard deviation.
Chapter 3
PROBABILITY
INTRODUCTION
The theory of probability underlies the methods for drawing statistical inferences in medicine. The
knowledge of probability will therefore help you to set the groundwork for the development of statistical
inference.
Definition:
Probability of an event is defined to be the proportion of times the event occurs in a long series of random
trials.
Examples:
1. If an unbiased coin is tossed many times, roughly 50% of the results will be heads. Thus when it is tossed once,
the probability of a Head (H) or a Tail (T) is 1/2.
2. Suppose that in a certain country 10% of the population are HIV positive. If a person is selected from
this population at random, it can be said that the probability that he/she is HIV positive is 1/10, since this
event occurs on average for one person in 10.
3. A die has six sides numbered 1, 2, 3, 4, 5, 6. If an unbiased die is tossed once, the probability of any of the
sides showing is 1/6 i.e. P(1) = 1/6, P(2)= 1/6, P(3)= 1/6, P(4)= 1/6, P(5)= 1/6, P(6)= 1/6.
Note:
(1) Probabilities are proportions and so they take values between 0 and 1.
(2) A probability of 0 means that the event never occurs whereas the probability of 1 means the event
certainly occurs.
(3) The sum of probabilities of all possible outcomes is 1.
Example: in tossing a coin P(H) + P(T) = 1 and in tossing a die,
P(1) + P(2) + P(3) + P(4) + P(5) + P(6) = 1.
The Addition Law:
If A and B are possible events with known probabilities of occurrence, then

P(A or B or both) = P(A) + P(B) - P(A and B),

where P(A and B) is the probability of the double event for non-mutually-exclusive events.
Consider a doctor's name being chosen haphazardly from the Tanzania medical register. If the probability that
this doctor is a male is 0.9 and the probability that the doctor qualified at Muhimbili Medical School is about
0.8,
what is the probability that the doctor is either a male or the doctor qualified at Muhimbili medical
school?
Let A be the event that the doctor is a male and B be the event that the doctor qualified at Muhimbili medical
school.
P(A or B) = P(A) + P(B) - P(A and B)
=0.9 + 0.8 -0.72
=0.98
Note that A and B are not mutually exclusive events because a doctor can be a male and qualified at
Muhimbili Medical School. In this case if the probability of the double event is not subtracted the probability
will exceed 1.
But if the two events are mutually exclusive, the probability of the double event is 0 and so the probability of
either A or B is given by the sum of the probabilities of two events.
That is, P(A or B) =P(A) + P(B)
Example: in a single throw of a die, consider the events "a 3 shows" and "a 5 shows".
These are mutually exclusive because you cannot have a 3 and a 5 at the same time.
So P(3 or 5) = P(3) + P(5).
= 1/6 + 1/6
= 1/3
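Both cases of the addition law, using the figures from the examples above:

```python
def p_a_or_b(p_a, p_b, p_a_and_b=0.0):
    """Addition law; p_a_and_b is 0 for mutually exclusive events."""
    return p_a + p_b - p_a_and_b

# Non-mutually-exclusive: male doctor OR qualified at Muhimbili
# (the double event is taken as 0.72, as in the text).
print(round(p_a_or_b(0.9, 0.8, 0.72), 2))   # 0.98

# Mutually exclusive: a 3 or a 5 on one throw of a die.
print(round(p_a_or_b(1/6, 1/6), 4))         # 1/3, i.e. 0.3333
```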
Multiplication rule:
Suppose there are two random sequences of trials proceeding simultaneously, e.g. at each stage a coin
may be tossed and a die thrown. How can we get the probability of a particular combination of results,
e.g. P(H and 5)? We need to use the multiplication rule.
P(H and 5) = P(H) x P(5, given H) can be written as
P(H and 5) = P(H) x P(5/H).
The second term on the right side, P(5/H) is called Conditional Probability i.e. Probability of a 5 showing on
a die given that a Head appeared on the coin.
Take the example of playing cards. A pack has 52 cards: 13 Spades, 13 Diamonds, 13 Hearts and
13 Clubs. If you draw two cards (one at a time, without replacement) from a pack of cards, what is the probability that both the 1st and
2nd cards will be Spades?
NOTE: P(spade on 1st draw) = 13/52
P(spade on 2nd draw / spade on 1st draw) = 12/51
This is because the first draw has already removed one spade, decreasing both the number of spades and
the pack by 1. So P(spade on 1st and 2nd draw) = 13/52 x 12/51 = 0.0588.
Definition:
Independent events:
Two events are independent if the occurrence of one does not affect in any way the occurrence of the other.
Thus if A and B are independent events, P(B/A) = P(B). When a coin is tossed repeatedly, the outcome of the 1st trial
does not affect the outcome of the 2nd trial.
In independent trials, the multiplication rule assumes a simple form P(A and B) = P(A) P(B).
e.g. P(H and 5) = P(H) x P(5)
= 1/2 x 1/6 = 1/12.
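The multiplication rule for independent events can also be checked by simulation. The sketch below estimates P(H and 5) empirically; the seed and trial count are arbitrary choices:

```python
import random

random.seed(42)
trials = 100_000
hits = 0
for _ in range(trials):
    coin = random.choice("HT")       # fair coin
    die = random.randint(1, 6)       # fair die, independent of the coin
    if coin == "H" and die == 5:
        hits += 1

estimate = hits / trials
print(f"estimated P(H and 5) = {estimate:.4f}, theory = {1/12:.4f}")
```

The relative frequency settles near 1/12 ≈ 0.0833, as the rule predicts.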
EXERCISE
1.
2.
The following table shows 1000 nursing school applicants classified according to scores made on a
college entrance examination and the quality of the high school from which they graduated, as rated
by a group of educators.
SCORE        Low (L)    Medium (M)    High (H)    Total
TOTAL          220         390           390       1000
a)
Calculate the probability that, an applicant picked at random from this group:
i) Made a low score on the examination.
ii) Graduated from a superior high school.
iii) Made a low score on the examination and graduated from a superior high school.
iv) Made a high score or graduated from a superior high school.
b)
(v) P(H/S).
Chapter 4
THE NORMAL DISTRIBUTION
INTRODUCTION
The breakdown of the total probability into the probabilities of each of the possible events is called a probability
distribution. A variable whose different values follow a probability distribution is known as a
random variable.
In a genetics experiment, an example of a probability distribution is obtained when we cross two
heterozygotes with genotypes Aa. The progeny will be homozygotes (aa or AA) or heterozygotes (Aa) with the
probabilities shown below:
No. of A genes in the genotype    Genotype    Probability
0                                    aa           1/4
1                                    Aa           1/2
2                                    AA           1/4
Total                                              1
This probability distribution can be presented graphically as a bar chart, with the number of A genes (genotypes aa, Aa and AA) on the horizontal axis and the probability (from 0 to 0.6) on the vertical axis.

Fig.10: Probability distribution for a random variable: the number of A genes in the genotype of progeny of an Aa x Aa cross.
However, for continuous random variables, the probabilities of particular values of the variable are negligible (indeed zero). So to obtain the probability distribution of a continuous random variable, the concept of probability must be applied to a specified interval on the continuous scale.
For example: while the probability that a man selected at random is exactly 70.2876 inches in height is presumably zero, the probability that his height is between 70 and 72 inches might be 0.12.
Continuous probability distribution.
Different random variables have different probability distributions, but the one which we will discuss here is
the Normal distribution.
The normal distribution:
The Normal (or Gaussian) distribution is the most important continuous probability distribution.
Characteristics of the normal distribution
1. It is bell-shaped and symmetrical about the mean.
2. The mean, median and mode are all equal.
3. It is completely determined by two parameters: the mean (μ) and the standard deviation (σ).
4. The total area under the curve is equal to 1.
5. About 95% of the observations lie within 1.96 standard deviations of the mean, and about 99% within 2.58 standard deviations.
Normally, the probability distribution of the variables we observe are unknown. But if the smooth curve
depicting the probability distribution is bell shaped and reasonably symmetrical about the mean, use can be
made of the normal distribution.
The normal distribution, as we have seen above, is determined by its mean and its standard deviation. These quantities are different for different problems and so it is not possible to make tables of the Normal distribution for all values of μ and σ. So calculations are made by referring to the Standard Normal distribution, which has μ = 0 and σ = 1.
Thus an observation X from the normal distribution with mean μ and standard deviation σ can be related to a standard normal deviate (SND) by calculating:

SND = (X - μ)/σ

Thus, for any normal distribution with mean μ and standard deviation σ, the probability between X1 and X2 is the same as the probability between SND1 and SND2 in the standard Normal distribution, where

SND1 = (X1 - μ)/σ and SND2 = (X2 - μ)/σ
The table showing probabilities of the standard normal distribution is found at the end of this manual.
The first two digits of the SND are shown in the 1st column and the third digit is given by the other column
headings. The figures in the body of the table for particular values of the SND show the area under the
standard normal curve to the right of the SND.
If SND = 0.00 , the area to the right of SND is 0.5.
If SND = 1.14, the area to the right of SND is 0.12714 and the area to the left is given by 1-0.12714 =
0.87286.
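The same right-tail areas can also be computed from the error function rather than read from the printed table; this Python sketch is an illustration only:

```python
import math

def area_right_of(snd):
    """Area under the standard normal curve to the right of snd."""
    return 0.5 * (1.0 - math.erf(snd / math.sqrt(2.0)))

print(round(area_right_of(0.00), 5))  # 0.5
print(round(area_right_of(1.14), 5))  # about 0.12714
```

The identity used is that the standard normal CDF equals (1 + erf(z/√2))/2.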
Examples of applications of the standard normal distribution
1.
A study of blood pressure of Negro schoolboys gave a distribution of systolic blood pressure (SBP) close to Normal with μ = 105.8 mmHg and σ = 13.4 mmHg.
a)
What proportion of boys would be expected to have SBP greater than 120 mmHg?
SND = (120 - 105.8)/13.4 = 1.06; from the table, the area to the right of 1.06 is about 0.145, i.e. 14.5% of the boys.
b)
What proportion of boys would be expected to have SBP less than 120 mmHg?
If 14.5% have SBP greater than 120 mmHg, then 100 - 14.5 = 85.5% will have SBP less than 120 mmHg.
c)
What proportion of boys would be expected to have SBP between 85 and 120 mmHg?
Calculate SND1 = (85 - 105.8)/13.4 = -1.55 and SND2 = (120 - 105.8)/13.4 = 1.06
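The blood-pressure example can be checked numerically with the standard normal cumulative distribution function; this is an illustrative sketch using only Python's math library:

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 105.8, 13.4
snd_120 = (120 - mu) / sigma            # about 1.06
snd_85 = (85 - mu) / sigma              # about -1.55

p_above_120 = 1 - phi(snd_120)          # part a: about 0.145 (14.5%)
p_below_120 = phi(snd_120)              # part b: about 0.855 (85.5%)
p_between = phi(snd_120) - phi(snd_85)  # part c: about 0.79
```

The probability between two values is the difference of the CDF at the two SNDs, exactly as the table-based method prescribes.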
EXERCISE
1.
Suppose the average length of stay in a chronic disease hospital of a certain type of patient is 60 days
with a standard deviation of 15. If it is reasonable to assume an approximately normal distribution of
lengths of stay,
Find the probability that a randomly selected patient from this group will have a length of stay :
a) Greater than 50 days.
b) Less than 30 days.
c) Between 30 and 50.
d) Greater than 90 days.
2.
If the total cholesterol values for a certain population are approximately normally distributed with a
mean of 200 mg/100 ml and a standard deviation of 20mg/100 ml,
Find the probability that an individual picked at random from this population will have a cholesterol
value:
a) Between 180 and 200 mg/100 ml.
b) Greater than 225 mg/100 ml.
c) Less than 150 mg/100 ml
d) Between 190 and 210 mg/100 ml.
Chapter 5
INTRODUCTION TO SAMPLING TECHNIQUES
INTRODUCTION
Often in research work we are dealing with groups which are effectively infinite, such as the underfives in a district. In sampling, part of a group (population) is chosen to provide information which can be generalized to the whole, even though in theory it might be possible to investigate the whole group. Sampling is adopted to reduce labour and hence costs.
Definition:
Sampling is the process of selecting a number of study units from a defined study population. Otherwise, if
the whole population is studied the process is referred to as taking a census. We can illustrate the process of
sampling and the important activities involved with the following diagram:

    (1) Study population (N)
            |
            | (2) draw a sample
            v
        Sample (n)
            |
            | (3) calculate statistics
            v
        Statistics
            |
            | (4) make inferences about the parameters
            v
        Parameters
            |
            | (5) draw conclusions about the study population
            v
    Study population
The diagram depicts drawing a sample of size n using a particular sampling method from a study population
with N units (subjects). Inferential statistics techniques are then used to make inferences about the study
population on the basis of results from the sample.
The steps:
1) Identifying the study population (note: it is possible to have different study populations in one study).
2) Drawing a sample from the study population.
3) Describing the sample (e.g. by calculating relevant statistics).
4) Making inferences about the parameters.
5) Drawing conclusions about the study population.
Random versus biased sampling
Selection of the study units can be purposive or random. When it is purposive, no valid assessment of sampling error can be made, and in many instances this will lead to some bias. We will come back to "bias" in detail later under "other aspects of sampling".
If conclusions that are valid for the whole population are to be drawn on the basis of a sample, then the
sample should be representative of that population. A representative sample is one that has all the important
characteristics of the population from which it is drawn. Selection of the sample on a random basis is a necessary but not always sufficient condition for achieving representativeness.
We shall consider two main aspects of sampling, namely:
i. the sampling methods
ii. sample size.
Moreover, in the discussion, we shall confine ourselves to surveys designed to provide estimates (particularly
the mean and proportion) of certain characteristics of populations as opposed to other study types.
SAMPLING METHODS
The choice of a particular sampling method is influenced by the availability of a list of all the units that
compose the study population. This is called the sampling frame.
Examples could be a list of villages, a list of eligible users of family planning methods, a list of University
students, etc.
Types of sampling:
We can classify sampling methods into two types:
i. non-probability sampling and
ii. probability sampling.
Non-probability sampling:
There are two common methods which fall under this type: (i) convenience sampling and (ii) quota sampling.
i. Convenience sampling - sample is obtained on convenience basis, e.g. the study units that happen
to be available at the time of data collection are selected (many hospital based studies use
convenience samples).
A major limitation of this approach is that the sample drawn may be quite unrepresentative of the
study population.
ii. Quota sampling - a fixed predetermined number of sample units from different categories of the
study population is obtained. Obtaining a sample in this manner ensures that a certain number of
sample units from different categories with specific characteristics (such as sex, religion, age) are
represented in the sample. It is useful when one desires to provide a balance of study units
according to some characteristics of interest. Convenience sampling would not achieve this sort of
balance.
Probability sampling:
In this type of sampling the selection procedure has some element of probability/chance. In particular, a study
unit has some known probability of being selected into the sample. We shall discuss five forms of probability
(also known as random) sampling.
(1)
Simple random sampling:
In simple random sampling every unit in the sampling frame has the same chance of being selected. A common way of drawing such a sample is to use a table of random numbers, as follows:
i. First, determine how many digits you need; that is, see whether the size of the sampling frame is one, two, or more digits long. For example, if your sampling frame consists of 10 units, you must use two digits from the random number table to choose from the numbers 1-10.
If, however, your sampling frame is of three digits in size, then you obviously need to
choose from three digits. For example, the number 43 in columns 10, 11, and row 27, would
become 431. Going down the next numbers would be 107, 365, etc.
You would follow the same reasoning if you needed a four-digit number, for a sampling frame which is four digits in size. In our example of the number 431 in columns 10, 11, 12, row 27, this would now become 4316, the next down being 1075, and so on.
ii. Decide beforehand whether you are going to go across the page to the right, down the page, across the page to the left, or up the page.
iii. Without looking at the table, and using a pencil, pen, or any sharp-ending object, pin-point a
number to establish your starting point.
iv. If this number (in step iii) is among those on your sampling frame, take it. If not, continue to the next number in the direction you decided beforehand in step ii until you find a number that is within the range you need. This process goes on until you have enough units for your sample.
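In practice the table-lookup procedure above is usually delegated to a computer's pseudo-random number generator. A minimal Python sketch, where the frame of 100 numbered units is hypothetical:

```python
import random

random.seed(42)                      # fixed seed so the illustration is reproducible
frame = list(range(1, 101))          # hypothetical sampling frame: units numbered 1-100
sample = random.sample(frame, 10)    # draw n = 10 units without replacement
print(sorted(sample))
```

`random.sample` guarantees distinct units, which mirrors skipping duplicates in the manual table procedure.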
(2)
Systematic sampling:
As the name suggests, this sampling method is such that elements in the sample are obtained in a
systematic way.
In carrying out systematic sampling, the following steps are important:
i) Obtain the sampling frame (and the size of the study population, N say)
ii) Decide on the sample size, n
iii) Calculate the sampling interval, k = N/n
iv) Select the first element at random from the first k units
v) Include every kth unit from the frame into the sample.
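The systematic-sampling steps can be sketched in Python as follows; the frame of 100 units is hypothetical:

```python
import random

def systematic_sample(frame, n):
    """Steps iii-v: compute the interval k, pick a random start within the
    first k units, then take every kth unit from the frame."""
    k = len(frame) // n              # sampling interval k = N/n
    start = random.randrange(k)      # random start within the first k units
    return frame[start::k][:n]

random.seed(1)
frame = list(range(1, 101))          # hypothetical frame, N = 100
sample = systematic_sample(frame, 10)  # interval k = 10
```

Each selected unit is exactly k positions after the previous one, which is the defining property of a systematic sample.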
(3)
Stratified sampling:
In this method the population is divided into subgroups, or strata whereby each stratum is sampled
randomly with a known sample size. Strata may be defined according to some characteristics of
importance in the survey. These could be occupation, religion, age groups or even locality whereby
regions of the country may be taken as strata in a national health survey.
The steps involved in stratified sampling are as follows:
i. Divide the population into subgroups (strata)
ii. Draw a sample (of predetermined size) randomly from each stratum.
An important stratification principle is that the between-strata variability should be as high as possible, or equivalently that each stratum should be as homogeneous as possible (i.e. units within a stratum should be as much alike as possible and units in different strata should be as different as possible).
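The two stratified-sampling steps can be sketched in Python; the strata and sample sizes below are hypothetical:

```python
import random

def stratified_sample(strata, sizes):
    """Draw a simple random sample of the requested size from each stratum."""
    return {name: random.sample(units, sizes[name]) for name, units in strata.items()}

random.seed(7)
strata = {"urban": list(range(1, 61)),    # hypothetical stratum of 60 units
          "rural": list(range(61, 101))}  # hypothetical stratum of 40 units
sample = stratified_sample(strata, {"urban": 6, "rural": 4})
```

Because each stratum is sampled separately, every stratum is guaranteed representation in the final sample.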
(4)
Cluster sampling:
There are situations in which obtaining a complete list of individuals in the study population is
practically not feasible or a complete sampling frame is not available before the investigation starts.
In such cases it would be easy and convenient to talk of a sampling frame in which the sampling units
are a collection (cluster) of study units.
Examples of such clusters would be schools, hospital wards, villages, etc. Since in this case the
sampling unit is a cluster (e.g. a school) the sampling method is known as cluster sampling. The
selection steps will be exactly the same as those for any of the above random sampling methods, but with the cluster as the sampling unit.
Unlike in stratified sampling, an important principle in cluster sampling is that units within a cluster should be as heterogeneous as possible while the between-cluster variability should be as low as possible.
(5)
Multistage sampling:
Multi-stage (originating from the Latin word "multus" meaning "many") sampling is carried out in
many (more than 1) stages, and different sampling techniques can be employed at every stage. In this
method the sampling frame is divided into a population of first-stage sampling units, of which a
first-stage sample is taken. Each first-stage unit selected is subdivided into second-stage sampling
units, which are then sampled. The process continues till it is convenient to stop.
To illustrate multistage sampling consider a health survey of primary school children in Tanzania
mainland. An immediate problem to taking a sample of these children is that it is almost impossible
to construct a complete sampling frame. A multistage sample might be:
(a) to take a sample of regions;
(b) within each selected region take a sample of districts;
(c) within each selected district, take a sample of schools;
(d) within each selected school, take a sample of school children, and carry out the
investigation.
The sampling would thus be accomplished in four stages; notice that the construction of a complete sampling frame for each stage is relatively easy.
Apart from this advantage (of coming up easily with complete sampling frames), multistage sampling
procedure is likely to result in an appreciable saving in cost by concentrating resources at selected
schools instead of a sample made up of children scattered in all parts of the country.
Sometimes, in the final stage of sampling, complete enumeration of the available units is undertaken.
In the above example, once a survey team has reached the level of a school it may cost little extra to
examine all the children in the school; it may indeed be useful to avoid complaints from children not
included in the study within the same school.
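The four-stage school survey can be sketched as nested random selections; the region, district and school names below are purely illustrative:

```python
import random

random.seed(3)
# Hypothetical frame: regions contain districts, districts contain schools.
frame = {
    "Region A": {"District 1": ["School 1", "School 2", "School 3"],
                 "District 2": ["School 4", "School 5"]},
    "Region B": {"District 3": ["School 6", "School 7"],
                 "District 4": ["School 8", "School 9", "School 10"]},
}
selected_schools = []
for region in random.sample(list(frame), 1):                # stage (a): regions
    for district in random.sample(list(frame[region]), 1):  # stage (b): districts
        selected_schools += random.sample(frame[region][district], 1)  # stage (c)
# Stage (d) would then sample (or completely enumerate) children
# within each selected school.
```

Note that a complete frame is only ever needed one level at a time, which is the practical advantage described above.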
Bias in sampling
Bias in sampling refers to the systematic error in sampling procedures that may lead to distortion in
results. Sources of bias in sampling include the following:
i)
Non-response:
This is encountered mainly when subjects refuse to give a reply during interview, or when
they (the subjects) forget to fill in a questionnaire. The non-respondents (particularly those
due to refusal) may differ systematically from those who respond.
ii)
iii)
iv)
v)
vi)
(b)
Ethical considerations
If recommendations from a study are intended for the entire study population (e.g. all relevant individuals in a region), then one is bound ethically to ensure that the sample studied is representative of that population.
Remember that random selection of a sample does not guarantee representativeness.
SAMPLE SIZE
(NOTE: This sub-section can be skipped without loss of continuity until variance and standard deviation have been covered).
In the planning of a study in almost any subject, one of the first and fundamental questions to be considered is the size of the study. The trivial answer to the question 'how big a sample do I need?' would be 'make as large a sample as possible, since in a given study an increase in sample size will increase the precision of the sample results'.
Clearly issues of cost of collection and processing of data come in with a potential limiting effect on the
sample size. We shall discuss the aspect of sample size in the simplest situation whereby a study is designed
to estimate a parameter such as the mean or the proportion and confine ourselves to the statistical problems
involved in the calculations.
(1)
Suppose we want a 95% chance that the sample mean will lie within a distance d of the true mean. Since approximately 95% of sample means lie within 2 standard errors of the mean, we require d = 2σ/√n. Thus d² = 4σ²/n. Hence the required sample size, n, is given by:

n = 4σ²/d²
This formula implies knowledge of the population standard deviation σ, and in almost all surveys this is unknown. It is necessary to replace σ with an estimate. This estimate may be obtained from the results of previous studies on the variable, or alternatively as a direct result from a pilot study.
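The formula n = 4σ²/d² can be wrapped in a small function; the σ value below is a hypothetical pilot-study estimate, not a figure from the text:

```python
import math

def sample_size_for_mean(sigma, d):
    """n = 4*sigma^2 / d^2: sample size giving roughly a 95% chance that
    the sample mean lies within d of the true mean."""
    return math.ceil(4 * sigma ** 2 / d ** 2)

# Hypothetical example: sigma estimated at 510 g from a pilot study,
# tolerated error d = 50 g.
n = sample_size_for_mean(510, 50)   # 417
```

Rounding up with `math.ceil` is conventional, since a fractional subject cannot be sampled and rounding down would fall short of the required precision.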
(2)
When no such formula can be applied, the following considerations are important in coming up with a reasonable sample size:-
(a) The type of study: -in exploratory studies, you usually have relatively small
samples.
(b) The number of variables to be used in the study: -the more variables, the smaller
the sample size, for practical reasons.
(c) The expected variation in the study population with respect to the most important
variables: -the bigger the variation, the larger the sample one needs.
(d) The scale on which the findings and recommendations from the study will be
used: -the larger the scale, the larger the sample.
Finally, we wish to point out that it is not generally true that the bigger the sample size, the better the study becomes! In general, it is much better to increase the accuracy of data collection (e.g. through careful pre-testing of the tools, or improving the quality of interviewers, if any) than to increase the sample size.
EXERCISE:
1.
A study is being planned to determine the mean birthweight of babies born at Muhimbili
Medical Centre. Birthweights are approximately normally distributed and 95% of the
weights are probably between 2000g and 4000g.
Determine the required sample size so that there is a 95% chance that the estimated mean
birthweight does not differ from the true value by more than 50g. (Hint: calculate the
standard deviation of the birthweights, first).
2.
You have been assigned to conduct a study in order to determine the prevalence (i.e.
proportion of people affected with) of bancroftian filaria infection in Dar-es-Salaam region.
A review of literature on the subject reveals that, studies done along the East African coastal
strip some years back, showed the prevalence to be in the order of 30%. What sample size do
you require in order to come up with a reasonable estimate in your study? Give a complete
answer including describing any assumptions or prior decisions that you undertake.
Chapter 6
ESTIMATION
In Chapter 5 we mentioned that we study a sample with the view to learning something about the
population as a whole.
In general, we wish to estimate characteristics of the population such as:
i. the mean value of some measurement;
ii. the proportion of the population with some characteristic.
            Sample (Statistic)    Population (Parameter)
mean        x̄                     μ
variance    s²                    σ²
proportion  p                     π

Thus, the sample mean x̄ estimates the population mean μ, for example.
In general, the sample mean or sample proportion is unlikely to be exactly equal to the mean or
proportion in the population, although the former is intended to estimate the latter. If the two are
exactly equal to one another, it is just by coincidence.
This amounts to saying that almost always our conclusion about a population on the basis of the
sample we have taken will have some error.
We distinguish between two sorts of error:
(i) Sampling errors and
(ii) non-sampling errors
Sampling errors are those which arise due to the fact that we have observed only part of the whole
population, and they get less important as the sample size increases.
For example, an estimate of the mean number of children per household in a certain district based on
two households only (in the district) will certainly be poorer than that based on a sample of say 100
households.
We say there is less sampling error in the latter situation than in the former. If we investigated the
whole population (i.e. all households in the district) the sampling error would be zero because we
would know the population mean exactly.
Non-sampling errors are due mainly to faults in the sampling process which are likely to create room for the potential sources of bias (sometimes also referred to as systematic errors) highlighted in Chapter 5. These errors are potentially serious since the bias they cause may lead to invalid conclusions being drawn. Increasing the size of a sample will not necessarily reduce the non-sampling errors.
For example, subjects may refuse to give a reply during interview or they may forget to fill in a
questionnaire. These non-respondents may differ systematically from those who respond.
Non-sampling errors also occur through equipment faults, observer errors and during data processing
through coding, data entry, etc.
However, in this section we will direct our attention to sampling (also known as random) errors.
1. The larger the sample size, the better the precision in estimating μ (i.e. large samples are more likely to produce close estimates than small samples).
2. If the variability of the observations in the parent (study) population is small, we would expect the error to be small also, and vice-versa. Thus the sampling error depends on the variability of observations in the population.
We mentioned earlier the idea of repeatedly taking a random sample of size n and calculating the sample mean x̄ each time. This would lead to a series of values of x̄, and the natural questions relating to this (new) variable x̄ concern its distribution as well as its mean and variance. It can be shown mathematically that:
i. The distribution of x̄ tends to the Normal distribution as the sample size increases, even if the parent population is not Normal.
ii. The mean of the distribution of x̄ is the same as that of X (i.e. the mean of the sample means is the same as the mean of the parent population).
iii. The variance of x̄ is σ²/n, where σ² is the variance of X. It is easy to see that as the sample size n increases, the variance of x̄ decreases. From an earlier explanation, this observation is expected.
iv. The standard deviation of x̄ is the square-root of its variance, and is often referred to as the standard error of the mean. That is, the standard error of the (sample) mean, usually written as SE(x̄), is given by σ/√n.
Note: In practice, the value of σ² will be unknown. It can be replaced by the sample value, s², and the expression for the standard error SE(x̄) applies accordingly.
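The standard error of the mean can be computed directly from a sample; the data below are hypothetical, for illustration only:

```python
import math
import statistics

data = [10, 12, 9, 14, 11, 13, 10, 12]   # hypothetical sample of n = 8 observations
n = len(data)
s = statistics.stdev(data)                # sample standard deviation s
se_mean = s / math.sqrt(n)                # SE of the mean = s / sqrt(n)
```

Note that `statistics.stdev` uses the n-1 divisor, i.e. it computes the sample estimate s rather than the population value σ.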
The fact that x̄ tends to follow a normal distribution is remarkable, since this implies that the properties of normal distributions apply to the distribution of the sample mean. In particular, we now know that x̄ follows a normal distribution with parameters μ and σ²/n as the mean and variance, respectively.
Hence, it follows, for example, that 95% of the sample means lie within the interval μ ± 1.96 SE(x̄). This implies that there is a 95% chance of getting a sample mean within the interval μ ± 1.96 SE(x̄). Equivalently, we are saying that the probability of having a sample mean in the interval μ ± 1.96 SE(x̄) is 0.95.
Note: The limits of the interval μ ± 1.96 SE(x̄) are μ - 1.96 SE(x̄) and μ + 1.96 SE(x̄). That is, alternatively, we are talking of the interval ranging from μ - 1.96 SE(x̄) to μ + 1.96 SE(x̄).
We can express the above statements mathematically as follows:
Pr{μ - 1.96 SE(x̄) < x̄ < μ + 1.96 SE(x̄)} = 0.95, where Pr{x} means "probability of x"
Re-arranging the left-hand side of the above equation, we obtain the following equivalent equation:
Pr{x̄ - 1.96 SE(x̄) < μ < x̄ + 1.96 SE(x̄)} = 0.95.
In words, this says that the probability that the interval x̄ - 1.96 SE(x̄) to x̄ + 1.96 SE(x̄) includes the population value μ is 0.95.
When the value of x̄ (and that of SE(x̄)) is known, then the interval x̄ - 1.96 SE(x̄) to x̄ + 1.96 SE(x̄), often written also as (x̄ - 1.96 SE(x̄), x̄ + 1.96 SE(x̄)), is called the 95% confidence interval of μ.
The logic of this is that, for known values of x̄ and SE(x̄), the interval (x̄ - 1.96 SE(x̄), x̄ + 1.96 SE(x̄)) is known and fixed. Hence, it no longer makes sense to talk of the interval including μ with 0.95 probability, since the probability is definitely either 1 or 0. That is, the interval either includes μ or does not include μ.
Wider intervals, and therefore higher "confidence", can be set if required. For example, the value 2.58 can be used in place of 1.96 to set 99% confidence intervals. Indeed, an appropriate standardized normal deviate, z, can be used to obtain any desired confidence interval.
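The confidence-interval computation is a one-liner; the x̄ and SE values below are hypothetical:

```python
def confidence_interval(xbar, se, z=1.96):
    """(xbar - z*SE, xbar + z*SE); z = 1.96 for 95%, 2.58 for 99%."""
    return (xbar - z * se, xbar + z * se)

# Hypothetical sample mean 10.0 with standard error 0.8:
lo95, hi95 = confidence_interval(10.0, 0.8)          # (8.432, 11.568)
lo99, hi99 = confidence_interval(10.0, 0.8, z=2.58)  # (7.936, 12.064)
```

Making z a default parameter lets the same function produce 90%, 95% or 99% intervals by supplying the appropriate standardized normal deviate.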
While we have used a property of the normal distribution (notably, the one which states that 95% of the values lie within 1.96 standard deviations about the mean) to define a confidence interval, it is important to distinguish between the 95% spread (or tolerance) interval/limits and the 95% confidence interval/limits. The former is a descriptive measure while the latter is used in estimation problems as a measure of precision. In particular, the limits μ ± 1.96σ include 95% of the values in the population, whereas the limits μ ± 1.96σ/√n include 95% of the sample means.
EXERCISE
1.
The distribution of the duration of stay in a hospital for a certain condition is known to be
skewed to the right. The mean length of stay is 10 days and the standard deviation is 8 days.
It is proposed to study a sample of 100 patients admitted in hospital for that condition.
(a)
(b)
(c)
(d)
2.
What kind of distribution will the duration of stay of the patients in the sample
follow?
Comment on the suitability of the use of the mean duration of stay as a summary
measure of central tendency in this case.
If you took many such samples (i.e. repeatedly) what kind of distribution would the
sample means follow?
What would be the mean and the standard deviation of the distribution of the sample
means in (c) above? Give a complete numerical answer.
In a random sample of 150 University of Dar-es-Salaam students it was found that 38 of them
received or needed to receive treatment for defective vision.
(a)
Estimate the proportion (in percentage) of students at the University who receive or
need to receive treatment for defective vision.
(b)
Estimate 90%, 95% and 99% confidence intervals for the true proportion of
University of Dar-es-Salaam students who receive or need to receive treatment for
defective vision.
Chapter 7
SIGNIFICANCE TESTS: ONE SAMPLE
INTRODUCTION
Chapter 6 dealt with the estimation of population parameters by sample statistics. These sample statistics may further be used to answer questions about the population parameters. In the framework of statistical inference the question is reduced to a hypothesis, and the answer to it is expressed as the result of a test of that hypothesis.
Definition of terms
1.
2.
Null hypothesis, Ho: This term relates to the particular hypothesis under test. In many instances it is formulated for the sole purpose of being rejected or nullified. It is often a hypothesis of 'no difference'.
3.
Alternative hypothesis, H1: This is a statistical hypothesis that disagrees with the
null hypothesis.
The null hypothesis H0 and the alternative hypothesis H1 concern populations but our
conclusions are based on samples taken from these populations. Generalization from
sample to population is dangerous since sampling errors are involved. Therefore we
are unable to say that H0 or H1 is definitely true, because of this sampling effect.
If sampling errors are taken into account, we can investigate how likely each of these hypotheses is. We have to measure the relevant information in the sampled data and weigh this information against the sampling errors involved.
4.
A statistic: is a value which depends on the outcomes on a variable for the sampled elements.
5.
A test statistic: is a statistic which represents the relevant sample information for the
question under investigation. It provides a basis for testing a statistical hypothesis and has a
known sampling distribution with tabulated percentage points (e.g. standard normal, 2, t
etc). The value of a test statistic differs from sample to sample.
6.
Significance level: This is a small probability, chosen in advance (commonly 5% or 1%), of rejecting H0 when it is in fact true.
7.
Critical value: This is the value of the test statistic corresponding to a given significance
level as determined from the sampling distribution of the test statistic (by using statistical
tables which will be explained later). The critical value is the boundary value such that if the
value of the test statistic is more extreme (i.e. more unlikely) than the critical value, then H0
is rejected and the probability of rejecting H0 when it is true is less than the significance
level.
CONCEPT OF P-VALUES
The p-value is a probability associated with the observed test statistic value.
The p-value of an observed test statistic value is the probability of obtaining a test statistic value as extreme as, or more extreme than, the observed one, if H0 is true. For example, in a clinical trial this statement refers to the observed difference between the treatment groups. We are therefore relating our data to the likely variation in a sample due to chance when the null hypothesis is true in the population.
Interpretation of p-value
Large p-values point towards the null hypothesis; small p-values are evidence for the alternative hypothesis.
A proposed guideline is:
p > 0.05            No evidence against Ho
0.01 < p < 0.05     Evidence in favour of H1, but be careful
0.001 < p < 0.01    Substantial evidence in favour of H1
p < 0.001           Very strong evidence in favour of H1; the possibility that Ho is true can be neglected.
However, for a proper interpretation of the p-value the sample size should be considered. If the sample size is too small the sampling error will be large. This will prevent us from finding evidence against Ho and result in high p-values, even if Ho is not true.
Relationship between p-values and sample size
Sample size is important in the interpretation of p-values.
p-value    Small sample size                        Large sample size
Small      - evidence against Ho                    - evidence against Ho
           - results point away from Ho             - results support H1
Large      - difficult to interpret                 - no evidence against Ho
           - can't distinguish between Ho and H1    - results point at Ho
The following results relating to malnutrition among underfives in Dodoma and Mwanza using
different sample sizes confirm the above explanation.
n        Dodoma    Mwanza    P         Conclusion
50       40%       30%       0.29      No significant difference
500      40%       30%       0.0098    Highly significant
50000    31%       30%       0.0012    Highly significant
SND = 8.6/4.33 = 2.0
This value just exceeds the 5% critical value of 1.96, and the difference is therefore significant, i.e. p < 0.05.
Thus we conclude that it is likely that there is an increase in the mean survival time among patients who were treated by the new technique.
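Using the values quoted in this example (sample mean 46.9 months, null-hypothesis mean 38.3 months, SE 4.33 months), the test reduces to a few lines of Python; this is an illustrative sketch:

```python
xbar, mu0, se = 46.9, 38.3, 4.33        # sample mean, null mean, SE (months)
snd = (xbar - mu0) / se                 # = 8.6 / 4.33, about 1.99
significant_at_5pct = abs(snd) > 1.96   # True, i.e. p < 0.05
```

The absolute value is compared with 1.96 because a two-sided 5% test rejects in either tail.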
Clearly these two approaches are related. If, for example, the 95% confidence interval includes the value of the parameter proposed by the null hypothesis, then the result of the test must be non-significant at the 5% level (i.e. p>0.05).
If, on the other hand, the 95% confidence interval does not include the value of the parameter
specified in the null hypothesis, then the result of the test must be significant at the 5% level (i.e.
p<0.05).
For example, in the test of a sample mean (the example on mean survival time of patients after being treated by a new technique), x̄ = 46.9 months and SE(x̄) = 4.33 months.
Thus a 95% confidence interval for the true mean survival time under this new technique is
x̄ ± 1.96 x SE(x̄)
= 46.9 ± 1.96 x 4.33
= 46.9 ± 8.49
= 38.4 to 55.4
The value proposed in the null hypothesis is 38.3 months and we note that it is not included in the
confidence interval. It would thus be concluded that 38.3 is an unlikely value for the mean survival
time of cancer patients after treatment. Equivalently, we are saying, the null hypothesis is rejected at
5% level (i.e. p<0.05).
The t-test
As already shown above, the standard normal deviate test involves the calculation of
SND = (x̄ - μ)/SE(x̄) = (x̄ - μ)/(σ/√n)
The SND is then compared with the critical values 1.96 or 2.58. This was applicable because the population standard deviation, σ, was known. If, as is usually the case, σ is unknown, the SND cannot be calculated. However, σ can be estimated from the sample by the standard deviation s. Replacing σ in the above formula by s, we obtain a new quantity t, given by
t = (x̄ - μ)/(s/√n)
t follows the t-distribution on n-1 degrees of freedom.
As the sample size increases, s will be nearly equal to σ and t will be very close to the standard normal deviate.
At the end of this manual, we find a table which shows the critical values of t, for each number of
degrees of freedom.
Example:
The following data are uterine weights (in mg) of each of 20 rats drawn at random from a large stock.
Is it likely that the mean weight for the whole stock could be 24 mg, a value observed in some
previous work?
9   15   18   19   21   22   26   29
14   15   18   19   22   24   27   30
16   24   20   32
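A sketch of the t calculation for these data in Python; the 5% critical value of t on 19 degrees of freedom (about 2.09) would be read from the table at the end of the manual:

```python
import math
import statistics

weights = [9, 15, 18, 19, 21, 22, 26, 29,
           14, 15, 18, 19, 22, 24, 27, 30,
           16, 24, 20, 32]                 # uterine weights (mg) of the 20 rats
mu0 = 24                                   # hypothesised mean for the whole stock
n = len(weights)
xbar = statistics.mean(weights)            # 21.0
s = statistics.stdev(weights)              # sample standard deviation
t = (xbar - mu0) / (s / math.sqrt(n))      # about -2.27 on n-1 = 19 df
```

Since |t| exceeds the tabulated 5% critical value, a stock mean of 24 mg appears unlikely for these data.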
EXERCISE
1. The mean level of prothrombin in the normal population is known to be 20.0 mg/100 ml of plasma and the standard deviation is 4 mg/100 ml. A sample of 40 patients showing vitamin K deficiency has a mean prothrombin level of 18.5 mg/100 ml.
(a). How reasonable is it to conclude that the true mean for patients with vitamin K deficiency is the same as that for a normal population?
(b). Within what limits would the mean prothrombin level be expected to lie for all patients with vitamin K deficiency? (Give the 95% confidence limits).
Chapter 8
SIGNIFICANCE TESTS: TWO SAMPLES
COMPARISON OF TWO MEANS
We shall distinguish between two situations: the paired case, in which the two samples are of equal size and the individual members of one sample are paired with particular members of the other sample; and the unpaired case, in which the samples are quite independent.
Matched/paired observations
So far, the problem arising from the comparison of a single sample mean with some value proposed under the null hypothesis has been considered. We had only one sample, which was compared with a fixed value μ, that is, a value which has no sampling error.
A common problem which normally arises in medical trials is the comparison of the responses to 2 or
more treatments. It is sometimes possible to reduce this problem of comparing 2 sets of responses to
treatments to a single sample problem previously described.
Suppose we have 10 patients as experimental units and they each have responses to 2 treatments, i.e
we are using the same patient as his own control assuming the order of administration has no effect.
Example: The following are anxiety scores recorded for 10 patients receiving a new drug and a placebo in random order.

Patient    Drug    Placebo    Difference d (drug - placebo)
1          19      22         -3
2          11      18         -7
3          14      17         -3
4          17      19         -2
5          23      22          1
6          11      12         -1
7          15      14          1
8          19      11          8
9          11      19         -8
10          8       7          1
Total                        -13
Null hypothesis: The mean difference in the anxiety scores in the population from which this sample
was taken is zero. (i.e the mean difference observed in the sample is merely due to sampling error).
n = 10,  Σd = -13,  d̄ = -1.3,  s = 4.548
Estimated standard error of d̄ = s/√n = 4.548/√10 = 1.438
Calculate t = (d̄ - 0)/SE(d̄) = -1.3/1.438 = -0.90 on n - 1 = 9 degrees of freedom.
Since |-0.90| is well below the 5% critical value of t on 9 df (2.26), the difference is not significant (p > 0.05): these data give no evidence that the drug changes anxiety scores.
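The paired calculation above can be verified with a short Python sketch:

```python
import math

drug    = [19, 11, 14, 17, 23, 11, 15, 19, 11, 8]
placebo = [22, 18, 17, 19, 22, 12, 14, 11, 19, 7]

d = [a - b for a, b in zip(drug, placebo)]  # paired differences
n = len(d)
d_bar = sum(d) / n                          # mean difference
s = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))
se = s / math.sqrt(n)                       # SE of the mean difference
t = (d_bar - 0) / se                        # t on n - 1 = 9 degrees of freedom
print(round(d_bar, 1), round(se, 3), round(t, 2))
```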
Independent (unpaired) observations
When the two samples are independent, the null hypothesis that the two population means are equal is tested (for large samples) by the standard normal deviate

SND = (x̄1 - x̄2) / SE(x̄1 - x̄2)

where

SE(x̄1 - x̄2) = √(σ1²/n1 + σ2²/n2) = √(SE²(x̄1) + SE²(x̄2))
Example:
In a study of the age of menarche in women in the USA the following distributions were observed for
samples of women aged 21-30 and 31-40 years.
Age of menarche            Women aged 21-30    Women aged 31-40
10                          3                   -
11                         11                   2
12                         28                   8
13                         23                  14
14                         12                  27
15                          1                   5
16                          -                   8
17                          -                   1
18                          -                   1

Total, n                   78                  66
Σx                        969                 916
x̄                       12.42               13.88
Σx²                     12127               12838
(Σx)²/n                 12038               12713
Σ(x - x̄)² = Σx² - (Σx)²/n  89                 125
s²                      1.156               1.923
s                       1.075               1.387
SE(x̄)                  0.122               0.171
SE(x̄1 - x̄2) = √(0.171² + 0.122²) = √0.04413 = 0.2101
SND = (13.88 - 12.42)/0.2101 = 1.46/0.2101 = 6.95, p < 0.001.
There is very strong evidence that, on average, the younger women's age of menarche is less than the older women's age.
The 95% confidence interval for the true difference in mean age of menarche is
(x̄1 - x̄2) ± 1.96 SE(x̄1 - x̄2)
i.e. 1.46 ± 1.96 × 0.2101
i.e. 1.05 to 1.87 years
The t-test for the comparison of two independent sample means
When the samples are small and the population standard deviations are unknown (but assumed equal), our two sample variances s1² and s2² are two separate estimates of σ². So they are combined to give a single best estimate of σ², namely s², with degrees of freedom equal to (n1 - 1) + (n2 - 1), or n1 + n2 - 2. The separate estimates are

s1² = Σ(x1 - x̄1)²/(n1 - 1)  and  s2² = Σ(x2 - x̄2)²/(n2 - 1)

and the pooled estimate is

s² = [Σ(x1 - x̄1)² + Σ(x2 - x̄2)²] / (n1 + n2 - 2)

Therefore

SE(x̄1 - x̄2) = s √(1/n1 + 1/n2)

t = (x̄1 - x̄2) / SE(x̄1 - x̄2),  d.f. = n1 + n2 - 2
Example:
The following data show the abrasiveness of two brush-on denture cleaners A and B, measured by
weight loss in mg.
A: 10.2, 11.0, 9.6, 9.8, 9.9, 10.5, 11.2, 9.5, 10.1, 11.8
B: 9.6, 8.5, 9.0, 9.8, 10.7, 9.0, 9.5, 9.9
                 A           B
Σx               103.6       76.0
n                10          8
x̄               10.36       9.50
Σx²              1078.44     725.20
(Σx)²/n          1073.296    722.0
Σ(x - x̄)²       5.144       3.20
61
s² = (5.144 + 3.20) / ((10 - 1) + (8 - 1)) = 0.5215
s = √0.5215 = 0.7221
The standard error of the difference between the means of the two groups, A and B, is estimated by
SE(x̄A - x̄B) = 0.7221 × √(1/10 + 1/8) = 0.3425
t = (10.36 - 9.50)/0.3425 = 2.51 on 16 degrees of freedom, p < 0.05
The 95% confidence interval for the true mean difference is
x̄A - x̄B ± 2.12 × SE(x̄A - x̄B)
i.e. 0.86 ± 2.12 × 0.3425 = 0.13 to 1.59 mg.
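The pooled two-sample calculation for the denture data can be reproduced as a Python sketch:

```python
import math

a = [10.2, 11.0, 9.6, 9.8, 9.9, 10.5, 11.2, 9.5, 10.1, 11.8]  # cleaner A
b = [9.6, 8.5, 9.0, 9.8, 10.7, 9.0, 9.5, 9.9]                 # cleaner B

def ss(x):
    """Sum of squared deviations about the sample mean."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x)

n1, n2 = len(a), len(b)
s2 = (ss(a) + ss(b)) / (n1 + n2 - 2)       # pooled variance
s = math.sqrt(s2)
se_diff = s * math.sqrt(1 / n1 + 1 / n2)   # SE of difference in means
t = (sum(a) / n1 - sum(b) / n2) / se_diff  # t on n1 + n2 - 2 = 16 d.f.
print(round(s, 4), round(se_diff, 4), round(t, 2))
```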
COMPARISON OF TWO PROPORTIONS
Suppose r1 out of n1 subjects in the first sample and r2 out of n2 in the second show the characteristic of interest, giving sample proportions p1 and p2. Under the null hypothesis that the population proportions are equal, the common proportion is estimated by pooling the two samples:

p = (r1 + r2)/(n1 + n2)

The standard error of p1 - p2 is

SE(p1 - p2) = √[ p(1 - p)(1/n1 + 1/n2) ]

The null hypothesis is thus tested approximately by using the standard normal deviate

SND = (p1 - p2)/SE(p1 - p2)
Example:
A clinical trial was undertaken to assess the value of a new method of treatment A, in comparison
with the old treatment B. The patients were divided into two groups randomly.
Of 257 patients treated with treatment A 41 died.
Of 244 patients treated with treatment B 64 died.
The two proportions of patients dying are:
p1 = 41/257= 0.1595 and p2 = 64/244= 0.2623
Null hypothesis: The two treatments are equally effective, i.e. the population proportions π1 and π2 are equal.
If the null hypothesis is true, then the two equal population proportions can be written simply as π, i.e. π1 = π2 = π.
We replace π by the best single estimate available. This estimate is the proportion p obtained by pooling the two samples.
This gives
p = (41 + 64)/(257 + 244) = 105/501 = 0.21
Therefore SE(p1 - p2) = √[0.21 × 0.79 × (1/257 + 1/244)] = √0.001327 = 0.0364
Standard normal deviate, SND = [(p1 - p2) - (π1 - π2)]/SE(p1 - p2) = (0.1595 - 0.2623 - 0)/0.0364
SND = -2.82, p < 0.01
The result is highly significant and suggests that treatment A (with a smaller proportion of patients
dying) is better than treatment B.
95% confidence limits for the true difference in the proportions dying are
p1 - p2 ± 1.96 × SE(p1 - p2)
= -0.1028 ± 1.96 × 0.0364
i.e. -0.174 to -0.031
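The whole comparison of the two proportions can be verified in a few lines of Python (the confidence interval here uses the same pooled standard error as the text):

```python
import math

r1, n1 = 41, 257   # deaths on treatment A
r2, n2 = 64, 244   # deaths on treatment B

p1, p2 = r1 / n1, r2 / n2
p = (r1 + r2) / (n1 + n2)                        # pooled proportion = 105/501
se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # SE of p1 - p2 under H0
snd = (p1 - p2) / se                             # standard normal deviate

# 95% confidence limits for the true difference
ci = (p1 - p2 - 1.96 * se, p1 - p2 + 1.96 * se)
print(round(snd, 2), round(ci[0], 3), round(ci[1], 3))
```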
EXERCISE
1. A clinical trial to test the effectiveness of a sleeping drug was conducted among 11 patients. They were observed during one night with the drug and one night with a placebo. One patient died before the placebo reading was taken.
The following are the results of testing the effectiveness of the drug:

Patient number    Hours of sleep
                  Drug      Placebo
1                 6.1       5.2
2                 7.0       7.9
3                 8.2       3.9
4                 7.6       4.7
5                 6.5       5.3
6                 8.4       5.4
7                 6.8       (died)
8                 6.9       4.2
9                 6.7       6.1
10                7.4       3.8
11                5.8       6.3
Σx                77.4      52.8
Σx²               551.16    292.88

(a). Establish whether there is or there is no real difference in sleeping time between the drug and the placebo.
(b). Determine the 95% confidence interval for the difference in the mean sleeping time.

2. Comparison of birth weights of children born to 15 non-smokers with those born to 14 heavy smokers gave the following results:

                      Non-smokers    Heavy smokers
Mean                  3.5933         3.2029
Standard deviation    0.3707         0.4927

Is there enough evidence that on average children born to non-smokers are heavier than children born to heavy smokers? Confirm your results by a 95% confidence interval of the difference in birth weights.
3. In a study of the cariostatic properties of dentifrices, 423 children were issued with dentifrice A and 408 were issued with dentifrice D. After 3 years, 163 of the children on A and 119 of the children on D had withdrawn from the trial. The authors suggest that the main reason for withdrawal from the trial was that the children disliked the taste of the dentifrices. Do these data indicate that one of the dentifrices is disliked more than the other?
Chapter 9
THE CHI-SQUARED (χ²) TESTS
INTRODUCTION
The χ² (the Greek letter chi, pronounced kye, squared) test is used to determine whether a set of frequencies follows a particular distribution (e.g. Binomial, Normal, Poisson, etc). In its basic form it tests whether the observed frequencies of individuals with some characteristic are significantly different from those expected under some hypothesis.
Consider again the data from the clinical trial of treatments A and B in Chapter 8:

               Outcome
               Died    Survived    Total
Treatment A     41       216        257
Treatment B     64       180        244
Total          105       396        501
Such a table is called a 2×2 contingency table, since there are 2 rows and 2 columns. (In general we can have an r×c contingency table, i.e. a table with r rows and c columns.)
From the above table, the observed frequencies are 41, 216, 64 and 180. We need to obtain the expected frequencies under the null hypothesis that "the two treatments have the same effect on the outcome".
The expected frequencies are calculated in the following way:
Expected frequency, E = (row total × column total)/grand total
For example, in the top left cell, where we observe 41 deaths, the expected frequency under the null hypothesis is
(105 × 257)/501 = 53.86
These expected frequencies are shown in the table below. They add up to the same grand total as the observed frequencies.
We can then compare the observed and the expected frequencies by looking at their differences. We also need to consider the relative magnitude of the differences (e.g. a difference of 5 between 995 and 1000 is not as important as a "discrepancy" of size 5 between 2 and 7).
Cell            O      E         O - E     (O - E)²/E
A, Died          41    53.86     -12.86    3.07
A, Survived     216   203.14      12.86    0.81
B, Died          64    51.14      12.86    3.24
B, Survived     180   192.86     -12.86    0.86
Total           501   501.00       0.00    7.98
The chi-squared value is obtained by calculating (observed - expected)²/expected for each of the four cells in the contingency table and then summing them.
The general formula for χ² is

χ² = Σ (O - E)²/E

The percentage points of the chi-squared distribution are given at the back of this manual. The values depend on the degrees of freedom.
If a contingency table has r rows and c columns, then the degrees of freedom are given by df = (r - 1)(c - 1). From our example the degrees of freedom are df = (2 - 1)(2 - 1) = 1.
Therefore, from the above table, χ² = 7.98 on 1 df.
The Chi-Squared Table at the end of this manual shows that the observed value of 7.98 is beyond the 0.01 point of the chi-squared distribution. Therefore p < 0.01. We conclude as before that the difference between the two treatments is highly significant.
Note that the previous analysis yielded Z = -2.82. It can be shown that for d.f. = 1, Z² = χ², i.e. 2.82² ≈ 7.98. A short-cut formula for computing χ² for a 2×2 table is given as follows.
                       Variable x
Variable y          x1            x2            Row total
y1                  a             b             r1 = a + b
y2                  c             d             r2 = c + d
Column total        s1 = a + c    s2 = b + d    n

χ² = (ad - bc)² n / (r1 r2 s1 s2)
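As a check, the general (O - E)²/E route and the short-cut formula give the same value for the trial table (a Python sketch):

```python
# Observed 2x2 table from the clinical trial: rows = treatments, cols = outcome
a, b = 41, 216    # treatment A: died, survived
c, d = 64, 180    # treatment B: died, survived
n = a + b + c + d

# General method: sum of (O - E)^2 / E over the four cells
r1, r2 = a + b, c + d          # row totals
s1, s2 = a + c, b + d          # column totals
observed = [a, b, c, d]
expected = [r1 * s1 / n, r1 * s2 / n, r2 * s1 / n, r2 * s2 / n]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Short-cut formula for a 2x2 table
chi2_short = (a * d - b * c) ** 2 * n / (r1 * r2 * s1 * s2)
print(round(chi2, 2), round(chi2_short, 2))
```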
Example: The following table classifies 521 schoolchildren according to oral hygiene status and the type of school attended.

                            Oral hygiene
Type of school      Good    Fair+    Fair-    Bad    Total
Below average        62     103       57      11     233
Average              50      36       26       7     119
Above average        80      69       18       2     169
Total               192     208      101      20     521
Null hypothesis (Ho): There is no association between oral hygiene classification and type of school
attended. i.e the proportions of children attending below average, average and
above average schools are the same in children with good, fair+, fair- or bad
oral hygiene.
The expected number of children attending below average schools in a sample of 192 children with good oral hygiene is
(233 × 192)/521 = 85.9
Similarly, the expected number of children attending below average schools out of 208 children with fair+ oral hygiene is
(233 × 208)/521 = 93.0
Thus the expected frequencies are given in the table below:

                            Oral hygiene
Type of school      Good    Fair+    Fair-    Bad    Total
Below average       85.9    93.0     45.2     8.9    233.0
Average             43.9    47.5     23.1     4.6    119.1
Above average       62.3    67.5     32.8     6.5    169.1
Total              192.1   208.1    101.1    20.0    521.2

Summing (O - E)²/E over the twelve cells gives χ² = 31.4. The degrees of freedom are (3 - 1)(4 - 1) = 6, and the Chi-Squared Table shows that p < 0.001: there is strong evidence of an association between oral hygiene and type of school attended. To see the direction of the association, consider the proportion of children with good oral hygiene in each type of school:
Below average:  62/233 = 0.27
Average:        50/119 = 0.42
Above average:  80/169 = 0.47
From the above, we note that children attending above average schools were more likely to have good oral hygiene than those attending below average schools.
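For a general r×c table the same observed-versus-expected comparison applies cell by cell; a minimal Python sketch for the oral hygiene table above is:

```python
# Observed counts: rows = type of school,
# columns = oral hygiene (Good, Fair+, Fair-, Bad)
observed = [
    [62, 103, 57, 11],   # below average
    [50,  36, 26,  7],   # average
    [80,  69, 18,  2],   # above average
]

row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
grand = sum(row_tot)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_tot[i] * col_tot[j] / grand   # expected = row x col / grand
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi2, 1), df)
```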
Comments regarding the use of χ² tests
1. The Chi-squared test is only valid for comparing observed and expected frequencies (counts). It is not valid for other quantities such as percentages, means, rates, etc.
2. The Chi-squared test is not valid when cells have expected frequencies less than 5. With very small frequencies in a 2×2 table, Fisher's exact test should be used.
EXERCISE
1.
                     Types of placenta
Miscarriage        Normal    Minor    Major    Total
Threatened           10        18       14       42
Not threatened       36        12        8       56
Total                46        30       22       98

Investigate the association between threatened miscarriage and the degree to which the placenta is circumvallate at delivery.
Chapter 10
ASSOCIATION BETWEEN QUANTITATIVE VARIABLES
INTRODUCTION
Examples of quantitative variables have been seen in Chapter 2. The methods for analyzing the
relationships between two or more of such variables are linear regression and correlation.
In order to illustrate the methods of linear regression and correlation, we will use data on body weight
and plasma volume of eight healthy men.
The objective of the analysis is to see whether a change in plasma volume is associated with a change
in body weight.
Table 10.1: Plasma volume and body weight in eight healthy men.

Subject    Body weight (kg)    Plasma volume (litres)
1          58.0                2.75
2          70.0                2.86
3          74.0                3.37
4          63.5                2.76
5          62.0                2.62
6          70.5                3.49
7          71.0                3.05
8          66.0                3.12
SCATTER DIAGRAM
When two related variables, also called bivariate data, are plotted on a graph in the form of points or
dots, the graph is called a scatter diagram. Each point on the diagram represents a pair of values, one
based on X-scale and the other based on Y-scale. Usually, making a scatter diagram is the first step in
investigating the relationship between two variables, because the diagram shows visually the shape
and degree of closeness of the relationship.
Values on the X-scale refer to the explanatory or independent variable and on the Y-scale refer to the
response or dependent variable. In situations where it is not clear which is the response variable, the
choice of axes is arbitrary.
In the above example, take the independent variable (x) to be body weight and the response variable
(y) to be plasma volume. The scatter diagram would look like the one drawn below.
[Figure: scatter diagram of plasma volume (y-axis, 2.0 to 3.6 litres) against body weight (x-axis, 56 to 76 kg).]
LINEAR REGRESSION
When a response variable appears to change with a change in values of the explanatory variable, we
may wish to summarize this relationship by a line drawn through the scatter of points.
Geometrically, any straight line drawn on a graph can be represented by the equation:
y = a + bx
y refers to the values of the response (dependent) variable and x to values of the explanatory
(independent) variable. The equation tells us how these variables, x and y, are related. The constant 'a'
is the intercept, the point at which the line crosses the y-axis; that is, the value of y when x = 0.
The coefficient of x variable ('b') is the slope of the line. It tells us the average change (increase or
decrease) due to a unit change in x. It is sometimes called the regression coefficient.
Although we could draw a line through these points 'by eye', this would be a subjective approach and therefore unsatisfactory. An objective, and therefore better, way of determining the position of the straight line is to use the method of least squares. Through this method, we choose a and b such that the sum of squares of the vertical distances of the points from the line is minimized - hence the term 'least squares'.
b is computed as follows:

b = Sxy/Sxx = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
For the plasma volume data, Σ(x - x̄)(y - ȳ) = 8.96 and Σ(x - x̄)² = 205.38 (these quantities are calculated in the correlation section below), so that
b = 8.96/205.38 = 0.0436
and a = ȳ - b x̄ = 3.00 - 0.0436 × 66.88 = 0.085.
The fitted regression line is therefore y = 0.085 + 0.0436x.
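The slope, intercept and correlation can be recovered from the summary quantities alone (a Python sketch; the intercept formula a = ȳ - b x̄ is the standard least squares result):

```python
import math

# Summary quantities from the plasma volume example
n = 8
sum_x, sum_y = 535.0, 24.02   # body weight (kg), plasma volume
sxy = 8.96                    # sum of (x - x_bar)(y - y_bar)
sxx = 205.38                  # sum of (x - x_bar)^2
syy = 0.678                   # sum of (y - y_bar)^2

b = sxy / sxx                 # slope (regression coefficient)
a = sum_y / n - b * sum_x / n # intercept: y_bar - b * x_bar
r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
print(round(b, 4), round(a, 3), round(r, 2))
```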
[Figure: axes as in the previous scatter diagram.]
Fig. 11.2 Scatter diagram of plasma volume and body weight showing the linear regression line.
CORRELATION
Linear regression provides us with a straight line with which to summarize the relationship between
two variables. However, it does not tell us how closely the data lie on a straight line. The closeness
with which the points lie along the straight line is measured by the (Pearson's) correlation coefficient,
r.
r = Sxy/√(Sxx Syy) = Σ(x - x̄)(y - ȳ) / √[ Σ(x - x̄)² Σ(y - ȳ)² ]

As we noted with the regression coefficient's calculation, here also further simplification can be made when calculating the terms in the denominator:
Σ(x - x̄)² = Σx² - (Σx)²/n
Σ(y - ȳ)² = Σy² - (Σy)²/n
Considering the above example,
Σ(x - x̄)(y - ȳ) = 1615.295 - (535 × 24.02)/8 = 8.96
Σ(x - x̄)² = 205.38
Σ(y - ȳ)² = 0.678
Therefore,
r = 8.96 / √(205.38 × 0.678) = 0.76
LOGISTIC REGRESSION
Introduction
We have so far dealt with simple linear regression with a continuous dependent variable. We can extend the methods of simple linear regression to deal with more than one independent variable, in the form of multiple linear regression. That is, the multiple regression model yields an equation in which the dependent (outcome) variable is expressed as a combination of the independent (explanatory) variables.
This takes the following form:
y = β0 + β1x1 + ... + βkxk, where
y is the dependent variable,
x1, x2, ..., xk are the k explanatory variables (sometimes called predictor variables or covariates), and
β0, β1, ..., βk are the regression coefficients.
As stated earlier on, these methods assume that the outcome variable of interest is numerical (and measured on a continuous scale), although the explanatory variables do not necessarily have to be continuous.
It is very common, however, in many kinds of medical research for the outcome variable of interest to be a proportion (or a percentage) rather than a continuous measurement.
We cannot use ordinary multiple linear regression for the analysis of the individual and joint effects of a set of explanatory variables on an outcome variable which is a proportion. Two features of proportions based on counts (proportions based on measurements do not come in here) are important when considering a statistical analysis:
(a) if the denominator of the proportion is n and the population value is π, the variance of this proportion is π(1 - π)/n; for a given n this depends on the value of π, being largest when π = 1/2 and smaller when π is in the neighbourhood of 0 or 1. Hence the usual assumption of constant variance σ² can no longer hold.
(b) when we relate a proportion variable to other quantities by some form of a regression model, we
need to take seriously the fact that the true proportion cannot go outside the range 0 to 1. Because of
this the parameters have a limited interpretation and range of validity. We can instead use a similar
approach known as multiple linear logistic regression or just logistic regression.
Transformed proportions
We can overcome some of the problems in (b) above by looking at the response proportion on a transformed scale which does not have the fixed boundaries at 0 and 1. Suppose p is the proportion of individuals with some characteristic of interest; equivalently, let p be the probability that a subject has a disease. Then 1 - p is the probability that the individual does not have the disease, and the odds of having the disease are p/(1 - p). As p changes from 0 to 1, the corresponding odds (i.e. the ratio p/(1 - p)) change from 0 to ∞. So this transformation removes one of the boundaries. To remove the other, we consider the odds on a logarithm (log) scale: the log odds will go from -∞ to +∞ as p goes from 0 to 1. If we use natural logs (i.e. logarithms to the base e), the transformation loge(p/(1 - p)) is called the logit of p.
Thus we write
logit(p) = loge( p/(1 - p) )
and this is the log odds. The estimated value of p can be derived back from the logit, since p = e^logit(p)/(1 + e^logit(p)).
For two proportions p1 and p0, the difference between their logits is
logit(p1) - logit(p0) = loge( p1/(1 - p1) ) - loge( p0/(1 - p0) ) = loge[ p1(1 - p0) / (p0(1 - p1)) ]
which is the log of the odds ratio.
Table 10.2: Number of mosquitoes killed in a batch by the dose of insecticide used.

Dose of insecticide    Number of mosquitoes killed    Number of mosquitoes in a batch
10.2                   44                             50
7.7                    42                             49
5.1                    24                             46
3.8                    16                             48
2.6                     6                             50
Plotting the proportion killed in each batch against the dose of insecticide (a log scale for the dose or concentration is usually appropriate) is a recommended starting point. The simple linear regression model will not fit the data very well, and it will lead us to expect responses which are negative for very low doses or greater than 1 for high doses. Fitting a logistic regression model to these data gives

logit(p) = loge( p/(1 - p) ) = -4.887 + 3.104 ln(dose)
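Inverting the logit gives the predicted proportion killed at each dose, which can be compared with the observed proportions (a Python sketch using the coefficients quoted above):

```python
import math

# Fitted model for the insecticide data: logit(p) = -4.887 + 3.104 ln(dose)
def predicted_kill(dose):
    logit = -4.887 + 3.104 * math.log(dose)
    return 1 / (1 + math.exp(-logit))   # invert the logit transform

for dose, killed, batch in [(10.2, 44, 50), (7.7, 42, 49), (5.1, 24, 46),
                            (3.8, 16, 48), (2.6, 6, 50)]:
    print(dose, round(killed / batch, 2), round(predicted_kill(dose), 2))
```

Note that the predicted proportions always lie between 0 and 1, unlike those from a straight-line fit.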
Table 10.4: Logistic regression analysis of the hypertension data shown above in Table 10.3.

                Regression coefficient (b)    Standard error se(b)    z = b/se(b)    p-value
Constant        -2.378                        0.380
Smoking (x1)    -0.068                        0.278                   0.24           0.810
Obesity (x2)     0.695                        0.285                   2.44           0.015
Snoring (x3)     0.872                        0.398                   2.19           0.028
The significance of each variable can be tested by treating z = b/se(b) as a standard normal deviate. We can see that the p-value for smoking is very large (0.81), and hence we can say that smoking has no association with hypertension. Obesity and snoring each have a significant association with hypertension (in both cases p < 0.05).
The analyses presented relate only to the main effects of obesity, smoking and snoring. We need to
consider also the possible presence of any important interaction between two of these factors. That is,
we should investigate whether the effect of a factor depends on the level of another factor. In fact this
was done, and no interaction term was found to be statistically significant at any interesting level.
Omission of smoking from the model produced only minimal changes in the other coefficients. Hence the regression equation for this model is
logit(p) = -2.378 - 0.068x1 + 0.695x2 + 0.872x3, where
x1, x2 and x3 are codes for smoking, obesity and snoring, respectively.
The above equation enables us to calculate the estimated probability of having hypertension, given
values of the three variables. In particular, we can obtain the odds ratio of hypertension associated
with any of the three factors. For example, let us consider variable x2, obesity:
putting x2 = 1 (for presence of obesity), gives:
logit(p1) = -2.378 - 0.068x1 + 0.695 + 0.872x3, and
putting x2 = 0 (for non-obese), gives:
logit(p0) = -2.378 - 0.068x1 + 0.872x3.
As discussed earlier, the difference logit(p1) - logit(p0) = 0.695 is the log odds ratio. Hence the odds ratio for hypertension associated with obesity = e^0.695 = 2.00. In general, for any binary variable the odds ratio (OR) can be estimated directly from the regression coefficient b as OR = e^b. Confidence limits follow immediately from the standard error of b, on taking b to have an approximately Normal distribution.
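The odds ratio for obesity and its 95% confidence limits can be recovered from the coefficient and its standard error (a Python sketch, assuming b is approximately Normal):

```python
import math

# Obesity coefficient and its standard error from Table 10.4
b, se = 0.695, 0.285

odds_ratio = math.exp(b)          # OR = e^b
lo = math.exp(b - 1.96 * se)      # lower 95% confidence limit
hi = math.exp(b + 1.96 * se)      # upper 95% confidence limit
print(round(odds_ratio, 2), round(lo, 2), round(hi, 2))
```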
EXERCISE
1. In the following data, four doses (on a log scale) of vitamin D were tested, each on a number of rats, and the results were assessed by means of a line test on bones in terms of arbitrary scores.

Dose, x      (Mean) response, y
-0.45        2.64
 0.25
 0.77
 1.46

Σx = 2.032;  Σy = 38.85
x̄ = 0.51;  ȳ = 9.71
Σx² = 2.9895;  Σy² = 486.4169
Σxy = 34.2929

(a)
(b)
Chapter 11
VITAL STATISTICS AND DEMOGRAPHY
SOURCES OF DEMOGRAPHIC INFORMATION
The quality of data depends on many factors, one of which is the source of the data. The source has direct implications for quality in terms of coverage, completeness and cost.
In this chapter we will concentrate on the following sources of demographic data:
(a)
Census
(b)
Vital registration
(c)
Sample surveys
Census
A census is a systematic, routine way of counting subjects within a defined boundary or limit of land. A census produces reports on individuals and on population size and structure at a point in time.
Originally, censuses were limited to people only, but more recently there are censuses of agriculture, business, livestock, housing, etc., sometimes done concurrently with the population census.
The main characteristic of a census is that it covers the whole population: no sampling is involved, and each person is enumerated separately. A census must have a legal basis to make it complete and compulsory. It reflects a single point in time, although the whole process can take much longer.
Basic questions which should appear on the questionnaire are name, age, sex, relationship with the
head of household, marital status, race/religion/ethnicity, education, occupation, employment status,
migration and amenities. Additional questions would depend on the availability and quality of vital
registration.
A population census can be carried out using either of the following methods:
1. De facto method: This method assigns persons to the area or location in which they are found during enumeration: the population "in fact" there. Where a person normally lives does not matter here. For example, in the 1988 Tanzania Population Census, Zanzibar had a population of 641,000; this means that these people spent the night in Zanzibar before the census night. Tanzania follows this method of enumeration.
2. De jure method: The de jure method of enumeration allocates persons to their normal residence, meaning "people who belong to the area or have the right to live there through citizenship, legal residence or whatever". For example, a businessman working in Dar es Salaam but living in Arusha would be assigned to Arusha in a de jure enumeration.
In Tanzania a census is normally conducted every ten years (decennial). This is a drawback for planning, in the sense that the population is changing rapidly because of births, deaths and movements. To overcome this problem, inter-censal surveys or mini-surveys are normally conducted; an example of such a survey is the 1991 Tanzania Demographic and Health Survey (TDHS). Further surveys on morbidity and on specific diseases can be conducted whenever a need arises.
Vital registration
The vital registration system is most common in developed countries, where information on births, marriages, deaths and migrations is collected. In developing countries the system, where it is employed at all, is prone to incompleteness; otherwise it is non-existent.
Questions in a vital registration system are always very simple and few. Consider hospital or health service data here in Tanzania: examples of such registrations are information on deaths found in hospitals (death certificates), birth and marriage data found in churches, mosques and Area Commissioners' offices, and migration data found at airports and borders.
The shortfall of vital registration systems is that they are normally incomplete, selective samples, diverse and in practice unreliable. This does not mean that the system should be discarded; instead it should be improved to remove these errors.
Sample surveys
Sample surveys give the same information, often in more detailed form, where a vital registration system does not exist. Only a sample of the population is involved; sample surveys are thus less costly than a census.
The other advantages of surveys include the pace of collecting the information: they are relatively quicker and can be more detailed than systems such as the census. The cost of surveys is the error introduced through sampling.
Measures of fertility:
There are four common measures of fertility. These are crude birth rate, general fertility rate,
gross reproductive rate and the total fertility rate.
i. The crude birth rate (CBR) relates the births in a year to the whole population:
(number of livebirths in a year × 1000) / (total mid-year population)
ii. The modern, conventional and much more acceptable 'rate' is the general fertility rate, or simply the 'fertility rate'. The denominator is restricted to women at risk of child-bearing rather than the general population. It is thus defined by:
(number of livebirths in a year × 1000) / (mid-year population of women aged 15-49)
iii. The total fertility rate is based on age specific fertility rates (ASFRs): the number of livebirths to women in an age group divided by the number of women in that group. Table 11.1 illustrates the calculation.

Table 11.1: Calculation of age specific fertility rates.

Age      Number of women    Number of livebirths    Age specific fertility rate
15-19    665000             21000                   0.0316
20-24    516000             114000                  0.2209
25-29    459000             118000                  0.2571
30-34    344000             123000                  0.3576
35-39    310000             37000                   0.1194
40-44    229000             6000                    0.0262
45-49    218000             5000                    0.0229
Total    2741000            424000                  1.0357
The total fertility rate (TFR) equals the sum of all age specific fertility rates, multiplied by the width of the age interval. In this case,
TFR = 1.0357 × 5 = 5.1785.
The sum of all ASFRs is multiplied by 5 because of the 5-year age group interval. If ages are in single years, there is no need to multiply the sum by 5.
The figure 5.1785 means that, on average, each woman will have about 5 children during her reproductive period, given that these age specific fertility rates still apply until she finishes her reproductive life.
Unlike the CBR and GFR, the calculation of the TFR requires age specific data, but as a measure it is independent of the age distribution of the population.
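The ASFR and TFR calculation from Table 11.1 can be reproduced in a few lines (a Python sketch):

```python
women  = [665000, 516000, 459000, 344000, 310000, 229000, 218000]
births = [21000, 114000, 118000, 123000, 37000, 6000, 5000]

# Age specific fertility rate for each 5-year group, 15-19 to 45-49
asfr = [b / w for b, w in zip(births, women)]
tfr = 5 * sum(asfr)   # multiply by 5: each age group spans 5 years
print([round(r, 4) for r in asfr], round(tfr, 2))
```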
iv. The gross reproductive rate (GRR) is similar to the total fertility rate except that it considers female livebirths rather than all births. This implies that the ASFRs for the GRR are based on female births only.
The GRR is interpreted as the average number of daughters a woman would have if she survived to at least age 50 and experienced the given female ASFRs. A figure of 1 means that women are able to replace themselves, while a figure of 2.0 means that the population is doubling itself: each woman is on average producing two daughters.
Like the TFR, the GRR is also a hypothetical measure. It is a period measure which does not take into account the effect of female mortality either before age 15 or between ages 15 and 50.
Referring to Table 11.1 above, given the number of female livebirths in each age group, the GRR would be computed in the same way as the TFR: sum the female ASFRs over the age groups 15-19 to 45-49 and multiply by 5.
Measures of morbidity:
i. Incidence rates:
Incidence measures the occurrence of new cases of a disease in a population relative to the number of persons at risk of contracting the disease. Therefore, the incidence rate is the rate of contracting the disease among those still at risk. Note the difference between being at risk of contracting the disease at the beginning of a period and being at risk during the entire period: the former refers to the incidence risk, the latter to the incidence rate. The incidence rate is expressed as:
(number of new cases of disease in a period of time × 10^k) / (number of person-years of exposure in the period)
where k = 2, 3, 4, 5 or 6 depending on convenience or convention.
ii. Prevalence rates:
Prevalence measures the extent to which a disease exists in a population. It is based on the total number of existing cases in the entire population. It can be measured at a point in time (point prevalence) or over a stretch of time (period prevalence).
Point prevalence 'rate' = (number of subjects with the disease at time t × 10^k) / (number of subjects in the population at time t)
iv. Specific rates:
These are rates which apply to particular subgroups: different geographical areas, specific age groups, each sex separately, educational or marital strata, etc. They are named according to that specification (e.g. age specific death rates).
Measures of mortality:
ii. Infant mortality rate = (number of deaths in a year under 1 year of age × 1000) / (number of livebirths in the same period)
The infant mortality rate is often broken down into several indices depending on the age categories of the infant:
Neonatal mortality rate = (number of deaths in a year under 28 days of age × 1000) / (number of livebirths in the same period)
STANDARDIZATION OF RATES
There are situations in which one intends to compare two or more different populations (geographical areas, different hospital populations, experimental groups, etc.) using the already mentioned crude rates (mortality, morbidity, fertility, etc.). Consider, for instance, the crude mortality rate. The risk of dying depends very much on age, and often differs according to sex: age specific death rates are high for infants and very old people, and low for the middle age groups.
The crude mortality rate and overall incidence rates will therefore depend on the age-sex composition of the population concerned. Crude rates may be misleading indicators of the level of mortality, morbidity, fertility, etc. when comparing two populations that do not have the same age and sex structure.
Standardization provides an overall summary measure of the event occurrence which does not depend on the age, sex, race or other distribution of the group. It therefore permits comparisons of event occurrence in two or more study groups which are adjusted for differences in the variable of interest.
Two methods of standardization which are commonly used are: (1) Direct standardization and
(2) Indirect standardization.
1. Direct standardization:
In direct standardization, the age (and sex) specific rates from each of the populations under study are applied to a standard population. The outcome is an age-sex adjusted mortality, morbidity or fertility rate.
In the indirect method, the age and sex specific rates of the standard population are applied to the study populations, to give standardized mortality, morbidity or fertility ratios.
The choice of which method to use depends very much on the availability of data. However, in general, direct standardization is used for prevalence while the indirect method is used for incidence.
The following information should be available when one intends to use the direct standardization method:
(a) The study population(s)' characteristics, e.g. age-sex specific rates.
(b) The standard population's composition.
Once these two sets of data have been obtained, (a) is applied to (b) to get, say, an age-sex adjusted rate.
Since the standard population may or may not be one of the populations to be compared, it has to be defined, sometimes arbitrarily. A common choice of standard population is the larger population from which the index (study) population(s) came.
The detailed steps in calculating the standardized rate for the index population are:
(a) Define your standard population.
(b) Apply the age- and sex- (or any other characteristic) specific rates of the index population to
the standard population to get the cases we would expect if the index population's rates
were operating in the standard population.
(c) Add these cases to get the total expected number of cases in all age groups.
(d) Divide the total expected number of cases by the total standard population to get
a crude rate known as the standardized incidence rate for the index population.
Table 11.2a: Results of a malaria survey in two villages

Age        VILLAGE A                VILLAGE B                Total
           Examined    Cases        Examined    Cases        examined
0-4           71         3             31         2            102
5-9           94         8             43         6            137
10-14         27         6             19        13             46
15-29         30        18             22        21             52
30-49         36        28             28        28             64
50+           29        23             15        15             44
Total        287        86            158        85            445

The expected malaria cases for each age group are obtained by multiplying the proportion of villagers
diseased (in each village separately) by the "standard" population (the two villages' total examined
in that age group). The results are the expected cases of malaria that would occur if the prevalence
rates of village A and village B, respectively, were operating in the standard population.
The age-adjusted prevalence of malaria = Expected cases / Total standard population. Thus,
Village A: 142.06 / 445 = 0.3192 = 31.92%
Village B: 214.98 / 445 = 0.4831 = 48.31%
Conclusion: Village B has higher prevalence (%) of malaria adjusted for age.
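The direct calculation above can be reproduced in a few lines of Python. This is a sketch using the figures from Table 11.2a; the variable and function names are ours, not part of any standard library.

```python
# Direct standardization of the malaria prevalence data in Table 11.2a.
# The age-specific rates of each village (index population) are applied to
# the combined population of both villages (the standard population).

examined_a = [71, 94, 27, 30, 36, 29]
cases_a    = [3, 8, 6, 18, 28, 23]
examined_b = [31, 43, 19, 22, 28, 15]
cases_b    = [2, 6, 13, 21, 28, 15]

# Standard population: total examined in both villages per age group.
standard = [a + b for a, b in zip(examined_a, examined_b)]   # 102, 137, 46, ...

def direct_adjusted_rate(cases, examined, standard):
    """Apply the index population's age-specific rates to the standard
    population, then divide the expected cases by the standard total."""
    expected = sum(c / n * s for c, n, s in zip(cases, examined, standard))
    return 100 * expected / sum(standard)        # rate per 100

print(round(direct_adjusted_rate(cases_a, examined_a, standard), 1))   # 31.9
print(round(direct_adjusted_rate(cases_b, examined_b, standard), 1))   # 48.3
```

The results agree with the hand calculation: village B has the higher age-adjusted prevalence.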
Considerations on the direct standardization method:
(a) The direct method of standardization requires stratum-specific (e.g. age-specific) rates in
the index population(s), which are sometimes not available. In this case the method cannot
be applied.
(b) The number of cases observed in the study population should be large enough to give
meaningful stratum-specific rates, which are necessary for direct standardization. Short of this, the
method cannot be used.
(c) In general, comparing disease rates in two or more groups via direct standardization is
subject to less bias than the indirect method. The reasons for this will not be discussed
here.
2. Indirect standardization:

Table 11.2b: Expected malaria cases obtained by applying the standard (pooled) age-specific rates to each village's population

Age        Rate per 100    VILLAGE A                  VILLAGE B
           (standard)      Population   Expected      Population   Expected
00-04          4.92            71          3.5            31          1.5
05-09         10.22            94          9.6            43          4.4
10-14         41.30            27         11.2            19          7.8
15-29         75.00            30         22.5            22         16.5
30-49         87.50            36         31.5            28         24.5
50+           86.50            29         25.0            15         13.0
Total         38.43           287        103.3           158         67.7
Dividing the observed number of malaria cases by the expected number gives the
standardized morbidity ratio:
Village A: 86 / 103.3 = 0.83 = 83%
Village B: 85 / 67.7 = 1.25 = 125%
Multiplying the crude rate of the standard population (38.43 per 100) by these standardized ratios gives
the actual age-adjusted morbidity rates for each group, controlling for the effect of age:
Village A: 38.43/100 × 0.83 = 0.32 = 32%
Village B: 38.43/100 × 1.25 = 0.48 = 48%
The primary advantage of indirect standardization lies in the fact that it does not
require knowledge of the stratum-specific rates of the index population(s), which are sometimes
not available.
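A matching sketch of the indirect method, again in Python with our own variable names, using the same village data. The small discrepancy with the hand calculation for village B (1.26 versus 1.25) comes only from rounding the intermediate rates.

```python
# Indirect standardization of the malaria data: the age-specific rates of the
# standard population (both villages pooled) are applied to each village, and
# observed cases are divided by expected cases to give the SMR.

examined_a = [71, 94, 27, 30, 36, 29]
cases_a    = [3, 8, 6, 18, 28, 23]
examined_b = [31, 43, 19, 22, 28, 15]
cases_b    = [2, 6, 13, 21, 28, 15]

# Standard age-specific rates: pooled cases / pooled examined.
std_rates = [(ca + cb) / (na + nb)
             for ca, cb, na, nb in zip(cases_a, cases_b, examined_a, examined_b)]

def smr(cases, examined, std_rates):
    """Standardized morbidity ratio = observed cases / expected cases."""
    expected = sum(r * n for r, n in zip(std_rates, examined))
    return sum(cases) / expected

# Crude rate of the standard population, per 100.
crude_std = 100 * (sum(cases_a) + sum(cases_b)) / (sum(examined_a) + sum(examined_b))

for name, cases, examined in [("A", cases_a, examined_a), ("B", cases_b, examined_b)]:
    ratio = smr(cases, examined, std_rates)
    # SMR, then the age-adjusted rate per 100 (crude standard rate x SMR).
    print(name, round(ratio, 2), round(crude_std * ratio, 1))   # A 0.83 32.0 / B 1.26 48.2
```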
EXERCISE:
Consider the following data for cancer mortality in the US in 1940 and 1986:
Age          1940                        1986
             Population (000)  Deaths    Population (000)  Deaths
00-04             10,541          494        18,152            666
05-14             22,431          667        33,860          1,165
15-24             23,922        1,287        39,021          2,115
25-34             21,339        3,696        42,779          5,604
35-44             18,333       11,198        33,070         14,991
45-54             15,512       26,180        22,815         37,800
55-64             10,572       39,071        22,232         98,805
65-74              6,377       44,328        17,332        146,805
75+                2,643       31,279        11,836        161,381
All ages         131,670      158,200       241,097        469,330
(a) Compute the crude cancer mortality rates for 1940 and 1986 and compare these rates.
(b) Using the US population in 1940 as the standard population, apply the direct method of
standardization. What are the age-adjusted cancer mortality rates for 1940 and 1986?
(c) Using the age-specific cancer mortality rates for 1940 as the standard, apply the indirect method to
compute the standardized mortality ratios for 1940 and 1986.
(d) How does the 1986 population compare with the 1940 population in terms of cancer mortality rate?
LIFE TABLES
Standardized death rates, which have been discussed above, can be used to study the level of mortality of
a population and to compare the mortality experience of two or more populations.
A standardized death rate is, however, a single-figure index of the level of mortality; it contains no
direct information about mortality at different ages. Life tables, on the other hand, can
summarize the mortality experience of a population at every age. They provide answers to questions
like: suppose 100,000 babies are born in a population on the same day; how many will survive to
celebrate their 1st, 2nd, etc. birthdays, assuming that they die at the current rates of mortality?
The use of current mortality rates for this calculation is, of course, hypothetical, since the babies
would in fact die at the rates prevailing at the times when they reach each age.
There are two distinct ways in which a life table may be constructed from mortality data:
In the current life table, the survival pattern of a group of individuals is described as if they were
subject throughout life to the age-specific death rates currently observed in a particular community.
This kind of life table is more often used for actuarial purposes and is less common in medical research.
The cohort life table, on the other hand, describes the actual survival experience of a group or 'cohort' of
individuals through time. The cohort may be babies born at the same time, an occupational group,
patients following a particular treatment, etc. This type of life table has its most useful application in
medical research in follow-up studies, e.g. an IUD retention study or, more generally, survivorship studies.
There are two types of life tables:
1. Full life table: includes every single year of age from 0 to the highest age to which any person
survives.
2. Abridged life table: usually considers only 5-year age groups, except that the first five years of life
may be considered singly.
Example:
The intra-uterine device (IUD) is a method of contraception which is not well tolerated by all women
because of medical side effects such as abdominal pain, excessive bleeding, infection, etc. If such side
effects occur the IUD is removed, although it may also be removed for non-medical reasons, such
as the woman wanting to become pregnant.
In an IUD retention study, 2,479 women who had an IUD inserted during the month of January were
interviewed. They were asked whether they had retained their IUD until the 24th month, during which
it was the practice to arrange a special medical check-up and remove the IUD. For those whose
IUDs were removed, the reasons for removal and the duration of use were determined. The results
of the survey indicated that 180 women lost their IUD during the first month after insertion and 162
during the second month. The corresponding figures for the third to the twenty-third
month were: 90, 85, 76, 180, 162, 90, 85, 76, 63, 51, 72, 85, 87, 72, 78, 70, 65, 90, 92, 89, 88.
This information can be represented in a life table as follows:

x     qx      px      lx      dx     ex
0     0.073   0.927   2479    180    11.75
1     0.07    0.93    2299    162    11.63
2     0.04    0.96    2137     90    11.47
3     0.04    0.96    2047     85    10.95
4     0.04    0.96    1962     76    10.41
5     0.09    0.91    1886    180     9.79
6     0.09    0.91    1706    162     9.79
7     0.06    0.94    1544     90     9.76
8     0.06    0.94    1454     85     9.34
9     0.06    0.94    1369     76     8.88
10    0.05    0.95    1293     63     8.38
11    0.04    0.96    1230     51     7.78
12    0.06    0.94    1179     72     7.09
13    0.08    0.92    1107     85     6.52
14    0.09    0.91    1022     87     6.02
15    0.08    0.92     935     72     5.73
16    0.09    0.91     863     78     4.96
17    0.09    0.91     785     70     4.40
18    0.09    0.91     715     65     3.78
19    0.13    0.87     650     90     3.11
20    0.16    0.84     560     92     2.50
21    0.19    0.81     468     89     1.93
22    0.23    0.77     379     88     1.27
23    1.00    0.00     291    291     0.50
We can use the life table above to calculate the following probabilities:
a) What is the probability that a woman who retained an IUD for the first six months will have it by the
end of the 20th month?
= l20 / l6 = 560 / 1706 = 0.33
b) What is the probability that a woman who retained an IUD up to the beginning of the 10th month will
lose it after the 18th month?
= l18 / l9 = 715 / 1369 = 0.52
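The lx column and the conditional probabilities above can be generated directly from the monthly removal counts reported in the study. The following Python sketch (variable names ours) rebuilds the table:

```python
# Cohort life table for the IUD retention study: 2,479 insertions, with the
# number of removals/losses (dx) recorded for each month x after insertion.

removals = [180, 162, 90, 85, 76, 180, 162, 90, 85, 76, 63, 51,
            72, 85, 87, 72, 78, 70, 65, 90, 92, 89, 88]       # months 0..22
removals.append(2479 - sum(removals))                          # month 23: all remaining IUDs removed

lx = [2479]                  # number still retaining the IUD at the start of month x
for dx in removals[:-1]:
    lx.append(lx[-1] - dx)

qx = [d / l for d, l in zip(removals, lx)]   # conditional probability of loss in month x

# Conditional retention probabilities are simply ratios of the lx column:
print(round(lx[20] / lx[6], 2))   # retained 6 months, still has it at month 20 -> 0.33
print(round(lx[18] / lx[9], 2))   # retained 9 months, still has it at month 18 -> 0.52
```

Note that building lx this way also exposes transcription errors: the survivors at each month must equal the previous month's survivors minus the previous month's removals.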
Example:
The following is an abridged life table for a certain country in a given year.

x      lx        10dx      10qx     10px     e°x
0      100000     2938     0.029    0.971    68.03
10      97062      847     0.009    0.991    59.94
20      96215     1489     0.015    0.985    50.42
30      94726     1867     0.020    0.980    41.13
40      92859     4386     0.047    0.953    32.86
50      88473    11017     0.124    0.876    23.15
60      77456    22512     0.291    0.709    15.70
70      54944    30275     0.551    0.449    10.20
80      24669    20869     0.846    0.154     6.57
90       3800     3720     0.979    0.021     5.21
100        80       80     1.000    0.000     5.00
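The internal consistency of an abridged life table can be checked mechanically: within each row 10qx should equal 10dx / lx, and each lx should equal the previous lx minus the previous 10dx. A short Python sketch using the lx and 10dx columns of the table above:

```python
# Consistency checks on the abridged life table: survivors carry forward
# (l_{x+10} = l_x - 10dx), and 10qx can be recomputed as 10dx / lx.

lx = [100000, 97062, 96215, 94726, 92859, 88473, 77456, 54944, 24669, 3800, 80]
dx = [2938, 847, 1489, 1867, 4386, 11017, 22512, 30275, 20869, 3720, 80]

# Each lx is the previous lx minus the previous decade's deaths.
for i in range(len(lx) - 1):
    assert lx[i] - dx[i] == lx[i + 1]

# Recompute 10qx from first principles; the values agree with the printed
# column to rounding.
qx = [round(d / l, 3) for d, l in zip(dx, lx)]
print(qx[0], qx[-1])   # 0.029 1.0
```

Running such a check is a quick way to catch misprinted entries before using a published life table.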
POPULATION PYRAMIDS
A population pyramid displays the age-sex composition of a population as two sets of horizontal bars,
one for each sex, placed back to back, with the youngest age group at the bottom. Since cohorts
normally lose part of their number each year through death or emigration, each bar is
usually shorter than the one below it, which gives the figure the appearance of a pyramid. A vertical comparison
of the bars shows the relative proportion of each age or age group in the population, while a horizontal
comparison shows the proportions of males and females in each age or age group.
The population pyramid can be based on absolute numbers or on percentages; the latter is more
common. The percentages are calculated using the total population of both sexes combined as the
denominator. If the percentages were calculated separately for males and females, the pyramid would
present a false picture.
Types of Pyramids
There are several types of population pyramids, but we will discuss the three most frequent forms.
The first class of population pyramid looks like an ordinary triangle. It reflects a population with
relatively high vital rates and a low median age. The age structure of the Netherlands in 1849 (Figure
11.2) fits this category.
The second variety has a broader base than the first. The 0-14 group is larger because this population is
beginning to control mortality but not fertility, and the most impressive gains in mortality reduction are
made in the younger age groups. The steeply sloping sides reflect the large proportion of young people
and the small percentage of aged people. The population structure of Hai District, Tanzania in 1994 fits
this description (Figure 11.3).
The third class of pyramid looks like a beehive. The numbers in this age-sex profile are roughly equal
for all age groups, decreasing gradually towards the apex. Many Western populations conformed to this
pattern in the 1930s, as seen in Figure 11.4.
EXERCISE
1.
Discuss the advantages and disadvantages of each of the systems of collecting data.
2.
During an epidemic of gastro-enteritis, the numbers of cases and deaths in a city hospital and in
all hospitals were as shown below:

Age group (years)   CITY HOSPITAL          ALL HOSPITALS
                    Cases     Deaths       Cases     Deaths
Under 1              240        41          1550       341
1-4                  140        21          1880       235
Above 4               20         8           500        16
Total                400        70          3930       592

(a) Calculate for the city hospital and all hospitals the case mortalities in each age group
and for all ages combined.
(b) Find the standardized mortality rate (Comparative Mortality Ratio) for the city
hospital by the direct method, using the case mortalities by age group of all hospitals
as the standard rates.
BIBLIOGRAPHY
1.
Armitage, P. and Berry, G. (1994). Statistical Methods in Medical Research, 3rd Edition.
Oxford: Blackwell Scientific Publications. (older versions are just as good for most topics).
2.
Brownlee, A., Pathmanathan, I., Varkevisser, C. (1991). Health Systems Research Training
Series, Volume 2 (Part 1): Designing and Conducting Health Systems Research Projects.
Canada: IDRC.
3.
Healy, M.J.R., Hills, M. and Osborn, J. (1987). Manual of Medical Statistics. Volume II.
London: London School of Hygiene and Tropical Medicine.
4.
Hill, A. Bradford (1984). A Short Textbook of Medical Statistics, 11th Edition. London:
Hodder and Stoughton.
5.
Kirkwood, B.R. (1988). Essentials of Medical Statistics, 1st Edition. London: Blackwell
Scientific Publications.
6.
Petrie, Aviva (1990). Lecture Notes on Medical Statistics, 2nd Edition. Oxford: Blackwell
Scientific Publications.