Mekonnen Assefa
BSc, MPH (Epidemiology & Biostatistics)
June 2016
Descriptive statistics
HANDLING OF DATA
Learning Objectives
Understand the basic concepts and terminology of
biostatistics, including the various kinds of variables,
measurement, and measurement scales.
Understand how data can be appropriately organized
and displayed.
Understand how to reduce data sets into a few useful,
descriptive measures.
Be able to calculate and interpret measures of central
tendency, such as the mean, median, and mode.
Be able to calculate and interpret measures of
dispersion, such as the range, variance, and standard
deviation.
Understand classical, relative frequency, and subjective
probability.
Understand the properties of probability and selected
probability rules.
Be able to calculate the probability of an event.
Understand selected discrete distributions and how to use
them to calculate probabilities.
Understand selected continuous distributions and how to
use them to calculate probabilities.
Be able to explain the similarities and differences between
distributions of the discrete type and the continuous type
and when the use of each is appropriate.
Be able to construct a sampling distribution of a statistic.
Understand how to use a sampling distribution to
calculate basic probabilities.
Understand the basic concepts of sampling with
replacement and without replacement.
Understand the importance and basic principles of
estimation.
Understand how to correctly state a null and alternative
hypothesis and carry out a structured hypothesis test.
Understand the concepts of type I error, type II error.
Be able to calculate and interpret z and chi-square test
statistics for making statistical inferences.
Define terms
Statistics?
Biostatistics?
Descriptive statistics?
Inferential statistics?
Variables?
Variable
Variable: A characteristic which takes different
values in different persons, places, or things.
Any aspect of an individual or object that is
measured (e.g., BP) or recorded (e.g., age,
sex) and takes any value.
There may be one variable in a study or many.
E.g., A study of treatment outcome of TB
patients
Before summarization and organization, we
need to know the types of variables and
measurement scales of our data.
Before displaying or analyzing data, we
should classify the variables into their
different types.
Variables can be broadly classified into:
Categorical (or qualitative) variables, or
Quantitative (or numerical) variables.
Categorical variable: A variable or
characteristic which cannot be measured in
quantitative form but can only be sorted by
name or categories.
Quantitative variable: A variable that can be
measured (or counted) and expressed
numerically.
Quantitative variables are divided into two types:
1. Discrete: It can only have a limited number of
discrete values (usually whole numbers).
E.g., the number of episodes of diarrhoea a child has had
in a year. You can't have 12.5 episodes of diarrhoea.
Characterized by gaps or interruptions in the
values.
Both the order and magnitude of the values matter.
The values aren't just labels, but are actual
measurable quantities.
2. Continuous variable: It can have an
infinite number of possible values in any given
interval.
Both the magnitude and the order of the values
matter.
Does not have gaps or interruptions between values.
Weight is continuous since it can take on any
value within a range.
SUMMARY
Variable
Types of variables: Quantitative; Qualitative (or categorical)
Measurement scales
Measurement and Measurement Scales
1. Nominal scale:
The simplest type of data, in which the
values fall into unordered categories or
classes.
Uses names, labels, or symbols to assign
each measurement to a category.
Example of nominal Scale:
If nominal data can take on only two possible
values, they are called dichotomous or
binary.
So sex is not just nominal, it is dichotomous
(male or female).
Yes/no questions
E.g., cured from TB at 6 months of Rx
2. Ordinal scale:
Assigns each measurement to one of a limited
number of categories that are ranked in terms
of order.
Although non-numerical, can be considered to
have a natural ordering
Example of ordinal scale:
Pain level:
4. Ratio scale:
- Measurement begins at a true zero point and
the scale has equal intervals.
- For example, age is ratio data: someone
who is 40 is twice as old as someone who is
20.
Scales of measurement
1. Blood group
2. Temperature (Celsius)
3. Ethnic group
4. Job satisfaction index (1-5)
5. Number of heart attacks
6. Serum uric acid (mg/100ml)
7. Number of accidents in 3 - year period
8. Number of cases of each reportable disease reported by a health worker
9. The average weight gain of 61-year old dogs (with a special diet
supplement) was 950 grams last month.
10. Injury severity (a score between 1 and 3 is allocated depending on the
severity); scores 1 and 3 indicate mild and very severe injury, respectively.
11. Gender of babies born in a hospital
A) Identify the types of data which are qualitative
and quantitative.
B) Identify the types of data which are numerical
discrete and numerical continuous.
C) Identify the type of data/ measurement scale
(nominal, ordinal, interval or ratio). Confirm your
answers by giving your own examples.
D) Which nominal scales are dichotomous? Which
ones are multichotomous?
Methods Of Data Collection,
Organization And Presentation
Before any statistical work can be done data
must be collected.
1. Data Collection Methods
Observation
Face-to-face and self-administered interviews
Postal or mail method and telephone interviews
Using available information
Focus group discussions (FGD) and in-depth interviews (qualitative study)
1. Observation:
Disadvantages:
The investigator's or observer's own biases,
prejudices, and desires
Needs more resources and skilled human power
when high-level machines are used.
2. Interviews and self-administered questionnaire
Questions to consider before
designing our questioning tools
include:
What exactly do we want to know?
Unstructured questionnaire:
Is flexible
Structured questionnaire
Advantages and Disadvantages
Self-administered questionnaires:
Are simpler and cheaper.
Can be administered to many persons
simultaneously (e.g. to a class of students).
Unlike interviews, can be sent by post.
On the other hand:
They demand a certain level of education and skill
on the part of the respondents.
People of a low socio-economic status are less likely
to respond to a mailed questionnaire.
Interviewing using questionnaire
can be face-to-face or telephone interviews.
Types of Questions
Depending on how questions are asked
and recorded they can be classified as:
1. Open-ended questions
Permit free responses that should be recorded
in the respondent's own words.
Such questions are useful to obtain
information on:
2. Closed Questions
Offer a list of possible options or answers from
which the respondents must choose.
When designing closed questions one should try to:
Offer a list of options that are exhaustive and
mutually exclusive
Keep the number of options as few as possible.
Closed questions are useful:
if the range of possible responses is known
if one is only interested in certain aspects of the
issue
E.g. What is your religion? (Muslim, Orthodox,
Protestant, others)
Closed questions may be used as well to get
the respondents to express their opinions by
choosing rating points on a scale.
For example: How useful would you say
insecticide-treated bed nets are in the prevention of
malaria?
1. Extremely useful
2. Very useful
3. Useful
4. Not very useful
5. Not useful at all
Requirements of questions
Steps in Designing a Questionnaire
Step 3: SEQUENCING OF QUESTIONS
Sequence in logical order
Place sensitive questions last
3. Use of documentary sources:
Examples include:
Although the use of data from documents is
less time-consuming and relatively
low in cost,
Common Problems in gathering data
include:
Language barriers
Lack of adequate time
Expense
Inadequately trained and experienced staff
Invasion of privacy
Suspicion
Bias
Cultural norms (e.g. which may preclude men
interviewing women)
The statistical data may be classified under
two categories, depending upon the
sources.
1. Primary Data:
Are those data, which are collected by the investigator
himself for the purpose of a specific inquiry or study.
On the other hand, secondary data may also be full of errors
because:
the purpose for which the primary agency collected the
data may have been different from the purpose of the
user of these secondary data,
bias may have been introduced,
the size of the sample may have been inadequate,
or
there may have been arithmetic or definition errors.
Hence, it is necessary to critically investigate
the validity of the secondary data.
Even though the choice of methods of data
collection is largely based on the accuracy of the
information they yield, it is also based on practical
considerations, such as:
The need for personnel, skills, equipment, etc. in
relation to what is available and the urgency with
which results are needed.
The acceptability of the procedures to the
subjects.
The probability that the method will provide a
good coverage
The investigator's familiarity with a study
procedure.
Methods of data organization and
presentation
The data collected in a survey is called raw data.
1. Frequency Distributions
A table which involves a listing of all
observed values of the variable being
studied and how many times each value is
observed.
Number of movies   Number of persons   Relative frequency (%)
0                  72                  18.0
1                  106                 26.5
2                  153                 38.3
3                  40                  10.0
4                  18                  4.5
5                  7                   1.8
6                  3                   0.8
7                  0                   0.0
8                  1                   0.3
Total              400                 100.0
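The relative frequency column can be reproduced from the counts alone. The following is a minimal sketch, assuming the counts are held in a plain dictionary; the names `counts` and `relative_frequencies` are illustrative and do not come from the slides.

```python
# Observed counts from the table: number of movies -> number of persons.
counts = {0: 72, 1: 106, 2: 153, 3: 40, 4: 18, 5: 7, 6: 3, 7: 0, 8: 1}


def relative_frequencies(freq):
    """Return each value's share of the total, as a percentage."""
    total = sum(freq.values())
    return {value: 100 * n / total for value, n in freq.items()}


rel = relative_frequencies(counts)
for value, n in counts.items():
    # Print one row per observed value: value, frequency, relative frequency.
    print(f"{value:>2}  {n:>4}  {rel[value]:6.2f}%")
print("Total", sum(counts.values()))
```

Each percentage is the count divided by the total (400) and multiplied by 100; the table rounds these to one decimal place.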
In the above distribution:
Number of movies represents the variable
under consideration,
A categorical distribution:
SENIORS PLAN NUMBER OF
SENIORS
Total 548
Grouped frequency distribution.
E.g. The age of persons arrested in a country.
Age (years)    Number of persons
Under 18       1,748
18-24          3,325
25-34          3,149
35-44          1,323
45-54          512
55 and over    335
Total          10,392
The construction of a grouped frequency
distribution consists of four steps:
1. Choosing the classes,
2. Sorting (or tallying) the data into these classes,
3. Counting the number of items in each class,
and
4. Displaying the results in the form of a chart or
table.
No. of classes and class interval (width)
Choices are arbitrary to some extent, but they
depend on:
The nature of the data
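1. The number of classes is usually between 6 and 20 (average
15).
Sturges' formula, given by:
$$K = 1 + 3.322 \log_{10} n$$
where K is the number of classes and n is the number of observations.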
3. Determination of class limits:
Example:
The birth weights (in kilograms) of 30 children
were recorded as follows:
2.0, 2.1, 2.3, 3.0, 2.7, 2.8, 3.5, 3.1, 3.7, 4.0,
2.3, 3.5, 4.2, 3.7, 3.2, 2.7, 2.5, 2.7, 3.8, 3.1,
3.0, 2.6, 2.8, 2.9, 3.5, 4.1, 3.9, 2.8, 2.2, 3.1.
$$K = 1 + 3.322 \log_{10} 30 = 5.91 \approx 6$$
$$W = \frac{4.2 - 2.0}{5.91} = 0.37 \approx 0.4$$
Birth weight (kg)   Tally         No. of children   %       Cumulative freq.
2.0 - 2.3           |||||         5                 16.7    5
2.4 - 2.7           |||||         5                 16.7    10
2.8 - 3.1           ||||| ||||    9                 30.0    19
3.2 - 3.5           ||||          4                 13.3    23
3.6 - 3.9           ||||          4                 13.3    27
4.0 - 4.3           |||           3                 10.0    30
Total                             30                100.0
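The birth-weight example can be checked end to end with a short script. This is a sketch only: it assumes a class width of 0.4 kg starting at 2.0 kg, as in the table, and works in tenths of a kilogram so that binning is not disturbed by floating-point rounding. The function names are illustrative.

```python
import math
from itertools import accumulate

# Birth weights (kg) of the 30 children listed in the example.
weights = [
    2.0, 2.1, 2.3, 3.0, 2.7, 2.8, 3.5, 3.1, 3.7, 4.0,
    2.3, 3.5, 4.2, 3.7, 3.2, 2.7, 2.5, 2.7, 3.8, 3.1,
    3.0, 2.6, 2.8, 2.9, 3.5, 4.1, 3.9, 2.8, 2.2, 3.1,
]


def sturges_classes(n):
    """Suggested number of classes, K = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)


def grouped_frequencies(values, start_tenths, width_tenths, n_classes):
    """Tally values into classes, using integer tenths to avoid float error."""
    counts = [0] * n_classes
    for v in values:
        index = (round(v * 10) - start_tenths) // width_tenths
        counts[index] += 1
    return counts


k = sturges_classes(len(weights))           # about 5.91, so 6 classes
width = (max(weights) - min(weights)) / k   # about 0.37, rounded up to 0.4

counts = grouped_frequencies(weights, start_tenths=20, width_tenths=4, n_classes=6)
cumulative = list(accumulate(counts))
percent = [round(100 * c / len(weights), 1) for c in counts]

for i, (c, p, cf) in enumerate(zip(counts, percent, cumulative)):
    lower = 2.0 + 0.4 * i
    print(f"{lower:.1f} - {lower + 0.3:.1f}  {c:>2}  {p:>5}  {cf:>2}")
```

Tallying the 30 listed weights this way gives six classes, the last being 4.0 - 4.3 with 3 children, and the cumulative frequency reaches 30.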
When frequencies of two or more classes are added up,
such total frequencies are called Cumulative
Frequencies.
Two types:
Lower and upper class limits: the smallest and largest values that
can go into any class, respectively.
Mid-point or class mark (Xc):
$$X_c = \frac{\text{Upper class limit} + \text{Lower class limit}}{2}$$
Class Boundaries (true class limits):
Are those limits, which are determined mathematically
to make an interval of a continuous variable
continuous in both directions, and no gap exists
between classes.
Example: Frequency distribution of weights (in Ounces) of Malignant Tumors
Removed from the Abdomen of 57 subjects
Construction of tables
Although there are no hard and fast rules to follow, the
following general principles should be addressed in constructing
tables.
1. Tables should be as simple as possible.
2. Tables should be self-explanatory:
The title should answer the questions what? when? where?
how classified? and should be placed above the table.
Each row and column should be labeled.
Numerical entries of zero should be explicitly written rather
than indicated by a dash. Dashes are reserved for missing
or unobserved data.
Totals should be shown.
3. If data are not original, their source should be given in
a footnote.
A) Simple or one-way table:
Is used when the individual observations
involve only a single variable.
Table 1: Overall immunization status of children in Adami
Tullu Woreda, Feb. 1995
B. Two-way table:
Table 2: TT immunization by marital status of the women of
childbearing age, Assendabo town, Jimma Zone, 1996
Table 3: Distribution of Health Professionals by Sex and
Residence
                      Residence
Profession   Sex      Urban        Rural         Total
Doctors      Male     8 (10.0)     35 (21.0)     43 (17.7)
             Female   2 (3.0)      16 (10.0)     18 (7.4)
Nurses       Male     46 (58.0)    36 (22.0)     82 (33.7)
             Female   23 (29.0)    77 (47.0)     100 (41.2)
Total                 79 (100.0)   164 (100.0)   243 (100.0)
3. Diagrammatic Representation of Data:
Importance:
1. They have greater attraction than mere figures.
2. They help in deriving the required information in
less time and without any mental strain.
3. They facilitate comparison.
4. They may reveal unsuspected patterns in a
complex set of data and may suggest directions
in which changes are occurring.
This warns us to take an immediate action.
Limitations:
1. Use only for purposes of comparison.
2. Diagrammatic representation is not an alternative
to tabulation. It only strengthens the textual
exposition of a subject, and cannot serve as a
complete substitute for statistical data.
3. It can give only an approximate idea and as such
where greater accuracy is needed diagrams will
not be suitable.
4. They fail to bring to light small differences
Specific types of graphs include:
Bar graph and pie chart (nominal, ordinal data)
Histogram, frequency polygon, scatter plot, and line graph (quantitative data)
Others
N.B: The choice of the particular form among the
different possibilities will depend on personal
choices and/or the type of the data.
1. Every graph should be self-explanatory and as simple
as possible.
2. Titles are usually placed below the graph and
should again answer the questions what? where? when? how
classified?
3. Legends or keys should be used to differentiate
variables if more than one is shown.
4. The axis labels should be placed to read from the left
side and from the bottom.
5. The units into which the scale is divided should be
clearly indicated.
6. The numerical scale representing frequency must
start at zero, or a break in the line should be shown.
1. Bar Chart
The categories are represented on the base line (X-axis)
at regular intervals, and the corresponding
frequencies or relative frequencies are represented on
the Y-axis (ordinate) in the case of a vertical bar
diagram, and vice versa in the case of a horizontal bar
diagram.
All the bars must have equal width and the distance
between bars must be equal.
All the bars should rest on the same line, called the base.
A. Simple bar chart:
Fig 1. Distribution of patients in hospital X by source of
referral, 1999
B. Multiple bar chart:
[Figure: multiple bar chart. Y-axis: number of women (0 to 400); X-axis: marital status (Single, Married, Divorced, Widowed, Separated); legend: Immunized, Not Immunized]
Fig. 2. TT immunization status by marital status
of women 15-49 years, Town X, 1990
C. Component (or sub-divided) Bar Diagram:
[Figure: component bar chart. Y-axis: number of women (0 to 500); X-axis: marital status (Single, Married, Divorced, Widowed, Separated); legend: Immunized, Not Immunized]
Fig. 3. TT immunization status by marital
status of women 15-49 years, Town X, 1990
ii) Percentage Component Bar Diagram:
[Figure: percentage component bar chart. Y-axis labeled number of women, scaled 0% to 100%; X-axis: marital status (Single, Married, Divorced, Widowed, Separated); legend: Immunized, Not Immunized]
Fig. 4. TT immunization status by marital status of
women 15-49 years, Town X, 1990
2) Pie-chart (qualitative or quantitative discrete data):
It is a circle divided into sectors so that the areas of the
sectors are proportional to the frequencies.
Example: Distribution of cause of death for females,
in England and Wales, 1989.
Fig 5. Distribution of cause of death for females, in England and
Wales, 1989
3. Histograms (quantitative continuous data)
[Figure: histogram. Y-axis: number of women (0 to 40); X-axis: age group, with class boundaries 14.5-19.5, 19.5-24.5, 24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5, 44.5-49.5]
Example: Age of women at the time of marriage
[Figure: Y-axis: number of women (0 to 40); X-axis: age, marked at 12, 17, 22, 27, 32, 37, 42, 47]
5. Ogive or cumulative frequency curve:
Sometimes it may be necessary to know the number of
items whose values are more or less than a certain
amount.
When the cumulative frequencies of a distribution are
graphed, the resulting curve is called an ogive curve.
To construct an ogive curve:
i) Compute the cumulative frequency of the
distribution.
ii) Prepare a graph with the cumulative frequency
on the vertical axis and the true upper class
limits (class boundaries) of the intervals scaled
along the X-axis (horizontal axis).
Example: Heart rate of patients admitted in hospital Y, 1998
[Figure: ogive curves. Y-axis: cumulative frequency (0 to 60); X-axis: heart rate, marked at class boundaries from 54.5 upward; legend: LM, MM]
Response to administration of zidovudine in two groups of AIDS
patients in hospital X, 1999
[Figure: line graph. Y-axis: blood zidovudine concentration (0 to 8); X-axis: time since administration (min.), 10 to 360]
7. Scatter plot
A scatter diagram is constructed by drawing X-
and Y-axes.
Each observation is represented by a point
or dot.
[Figure: scatter plot. Y-axis: saturation of bile (0 to 160); X-axis: age (0 to 80)]
Fig 10. The relationship between age and percentage saturation of bile
for women patients in hospital Z, 1998
Summarizing Data
1. Measures of Central Tendency:
Characteristics of a good measure of central
tendency:
1. It should be based on all the observations
2. It should not be affected by the extreme values
3. It should be as close to the maximum number of
values as possible
4. It should have a definite value
5. It should not be subjected to complicated and tedious
calculations
6. It should be capable of further algebraic treatment
7. It should be stable with regard to sampling
1. The Arithmetic Mean or simple Mean
The most familiar measure of central tendency (MCT).
It is also popularly known as the average.
a) Ungrouped data:
If x1, x2, ..., xn are n observed values, then
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
b) Grouped data:
In calculating the mean from grouped data, we assume that
all values falling into a particular class interval are
located at the mid-point of the interval. It is calculated
as follows:
$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$
where,
k = the number of class intervals
mi = the mid-point of the ith class interval
fi = the frequency of the ith class interval
Example: Find the arithmetic mean of the following
data (age distribution of 100 pregnant mothers, at H X)
Age      fi     mi    mi fi
15-19    11     17    187
20-24    36     22    792
25-29    28     27    756
30-34    13     32    416
35-39    7      37    259
40-44    3      42    126
45-49    2      47    94
Total    100          2630
Solution
$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i} = \frac{2630}{100} = 26.3 \text{ years}$$
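The grouped-mean calculation for the pregnant mothers example can be sketched in a few lines. The representation of each class as a (lower limit, upper limit, frequency) tuple and the name `grouped_mean` are assumptions made for illustration.

```python
# Age classes of the 100 pregnant mothers: (lower limit, upper limit, frequency).
age_classes = [
    (15, 19, 11), (20, 24, 36), (25, 29, 28), (30, 34, 13),
    (35, 39, 7), (40, 44, 3), (45, 49, 2),
]


def grouped_mean(classes):
    """Mean of grouped data, treating every value as lying at its class mid-point."""
    total_f = sum(f for _, _, f in classes)
    total_mf = sum((lower + upper) / 2 * f for lower, upper, f in classes)
    return total_mf / total_f


mean_age = grouped_mean(age_classes)
print(f"Sum of frequencies: {sum(f for _, _, f in age_classes)}")
print(f"Grouped mean age: {mean_age:.1f} years")
```

The mid-points (17, 22, ..., 47) are computed from the class limits rather than typed in, which avoids transcription slips when the table changes.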
Characteristics:-
Uniqueness
Simplicity: the arithmetic mean is easily
understood and easy to compute
The value of the arithmetic mean is
determined by every item in the series.
It is greatly affected by extreme values.
The sum of the deviations about it is zero.
Advantages:
It is based on all values given in the distribution.
It is most easily understood.
It is most amenable to algebraic treatment.
Disadvantages
It may be greatly affected by extreme items and its
usefulness as a summary of the whole may be
considerably reduced.
When the distribution has open-end classes, its
computation would be based on assumption, and
therefore may not be valid.
2. Median
The median of a finite set of values is that value which
divides the set into two equal parts, such that
the number of values greater than the median is equal to
the number of values less than the median.
a) Ungrouped data
If the number of values is odd, the median is the
middle value when all values have been arranged in order
of magnitude, i.e. the [(n+1)/2]th value.
When the number of observations is even, there is no
single middle observation but two middle
observations.
In this case the median is taken to be the mean of these two
middle observations when all observations have been
arranged in order of magnitude, i.e. the
average of the (n/2)th and [(n/2)+1]th values.
Example:
Compute the sample median for the birth weight
data
Solution:
First arrange the sample in ascending order
2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101,
3200, 3245, 3248, 3260, 3265, 3314, 3323, 3484,
3541, 3609, 3649, 4146
Since n=20 is even,
Median = average of the 10th and 11th
observation = (3245 + 3248)/2 = 3246.5 g
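The odd and even rules for ungrouped data translate directly into code. The sketch below is illustrative (the function name `median` is chosen here, and Python's built-in `statistics.median` would do the same job); it is applied to the 20 birth weights listed in the solution.

```python
# Birth weights (g) from the example, in the order given in the solution.
birth_weights = [
    2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
    3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146,
]


def median(values):
    """Middle value for odd n; mean of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # The [(n+1)/2]th value, which is index n // 2 when counting from zero.
        return ordered[mid]
    # Average of the (n/2)th and [(n/2)+1]th values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median(birth_weights))
```

Sorting inside the function means the caller does not have to arrange the data first.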
b) Grouped data
In calculating the median from grouped data, we
assume that the values within a class-interval are
evenly distributed through the interval.
Example: The age distribution of 75 malaria cases
attending the OPD of HC X was as follows:
Age      fi    Cum. freq
5-14     5     5
15-24    10    15
25-34    20    35
35-44    22    57
45-54    13    70
55-64    5     75
Solution
n/2 = 75/2 = 37.5
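The remaining steps of this solution are not shown in the text, so the sketch below is an assumption rather than the author's working. It uses the standard interpolation formula, Median = L + ((n/2 - F) / f) x W, where L is the true lower boundary of the median class, F the cumulative frequency before that class, f its frequency and W the class width. This matches the stated assumption that values are evenly distributed within a class. Class limits are taken to be whole years, so boundaries lie 0.5 below the lower limit.

```python
# Malaria cases by age class: (lower limit, upper limit, frequency).
malaria_classes = [
    (5, 14, 5), (15, 24, 10), (25, 34, 20),
    (35, 44, 22), (45, 54, 13), (55, 64, 5),
]


def grouped_median(classes):
    """Interpolated median for grouped data with whole-number class limits."""
    n = sum(f for _, _, f in classes)
    half = n / 2
    cumulative = 0
    for lower, upper, f in classes:
        if f > 0 and cumulative + f >= half:
            boundary = lower - 0.5      # true lower class boundary
            width = upper - lower + 1   # distance between class boundaries
            return boundary + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("distribution has no observations")


print(f"Grouped median age: {grouped_median(malaria_classes):.2f} years")
```

With n/2 = 37.5, the median class is 35-44 (cumulative frequency 57), and the interpolation is 34.5 + (2.5 / 22) x 10.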
Characteristics
1) It is an average of position.
2) It is affected more by the number of items than by
extreme values.
Advantages
It is easily calculated
It is a positional average and hence it is not
Disadvantages
The median is not so well suited to algebraic
treatment as the arithmetic, geometric and
harmonic means.
It is not so generally familiar as the arithmetic
mean
It is less sensitive to the actual numerical values
of the remaining data points.
3. Mode
It is a value which occurs most frequently in a
set of values.
Any observation of a variable at which the distribution
reaches a peak is called a mode.
a) Ungrouped data
If all the values are different there is no mode,
On the other hand, a set of values may have
more than one mode.
E.g. a) 22, 66, 69, 70, 73. (no modal value)
b) 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2,
3.5 (modal value = 3.0 kg)
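Both cases above (no mode, and a single mode) can be handled by counting occurrences. This is a minimal sketch; the function name `modes` and the choice to return an empty list when every value occurs once are conventions adopted here to mirror the text.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list if no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # All values are different, so there is no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)


set_a = [22, 66, 69, 70, 73]
set_b = [1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5]

print(modes(set_a))
print(modes(set_b))
```

Returning a list also covers data sets with more than one mode, since all values tied for the highest count are reported.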
b) Grouped data
Advantages
Since it is the most typical value it is the most
descriptive average
It is not affected by extreme values
It can be calculated for distributions with open
end classes
Disadvantages:
It is not capable of mathematical treatment
In a small number of items the mode may not
exist.
4. Geometric mean (GM)
It is obtained by taking the nth root of the product of n
values, i.e., if the values of the observations are denoted by
x1, x2, ..., xn, then
$$GM = \sqrt[n]{x_1 x_2 \cdots x_n}$$
Equivalently, using logarithms,
$$\log GM = \frac{\sum_{i=1}^{n} \log x_i}{n}$$
Example: The geometric mean may be calculated for
the following parasite counts per 100 fields of thick
films.
7, 8, 3, 14, 2, 1, 440, 15, 52, 6, 2, 1, 1, 25
12, 6, 9, 2, 1, 6, 7, 3, 4, 70, 20, 200, 2, 50
21, 15, 10, 120, 8, 4, 70, 3, 1, 103, 20, 90, 1, 237
$$GM = \sqrt[42]{7 \times 8 \times 3 \times \cdots \times 1 \times 237}$$
log GM = 1/42 (log 7 + log 8 + log 3 + ... + log 237)
= 1/42 (0.8451 + 0.9031 + 0.4771 + ... + 2.3747)
= 1/42 (41.9985)
= 0.9999 ≈ 1.0000
The anti-log of 0.9999 is 9.9977 ≈ 10, and this is the
geometric mean.
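The logarithm route used in this example is also the practical way to compute a geometric mean in code, because multiplying 42 counts directly produces a very large product. The following sketch assumes base-10 logarithms, as in the worked example; the function name is illustrative.

```python
import math

# Parasite counts per 100 fields of thick film (42 values from the example).
parasite_counts = [
    7, 8, 3, 14, 2, 1, 440, 15, 52, 6, 2, 1, 1, 25,
    12, 6, 9, 2, 1, 6, 7, 3, 4, 70, 20, 200, 2, 50,
    21, 15, 10, 120, 8, 4, 70, 3, 1, 103, 20, 90, 1, 237,
]


def geometric_mean(values):
    """Anti-log of the mean of the base-10 logarithms of the values."""
    if any(v <= 0 for v in values):
        # Logarithms are undefined for zero or negative values.
        raise ValueError("geometric mean requires strictly positive values")
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log


gm = geometric_mean(parasite_counts)
print(f"n = {len(parasite_counts)}, geometric mean = {gm:.2f}")
```

The guard clause reflects the limitation that the geometric mean cannot be determined when any value is zero or negative.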
Advantages:-
It is less affected by extreme values than the arithmetic mean.
It is capable of algebraic treatment.
It is based on all values given in the distribution.
Disadvantages:-
Its computation is relatively difficult.
It cannot be determined if there is any negative value in the
distribution, or where one of the items has a zero value.
Skewness
If extremely low or extremely high observations are
present in a distribution, then the mean tends to shift
towards those scores. Based on the type of
skewness, distributions can be:
Negatively skewed distribution: occurs when
majority of scores are at the right end of the curve and
a few small scores are scattered at the left end.
Positively skewed distribution: Occurs when the
majority of scores are at the left end of the curve and
a few extreme large scores are scattered at the right
end.
Symmetrical distribution: It is neither positively nor
negatively skewed. A curve is symmetrical if one half
of the curve is the mirror image of the other half.
The following guidelines help an investigator to decide which
measure of central tendency (MCT) is best for a given set
of data.
1. The arithmetic mean is used for interval and ratio data
and for symmetric distributions.
2. The median and quartiles are used for ordinal data, and for
interval and ratio data whose distribution is skewed.
3. For nominal data, the mode is the appropriate MCT.
4. The geometric mean is used primarily for observations
measured on a logarithmic scale.
2. Measures of variation
The measure of central tendency alone is not
enough to have a clear idea about the
distribution of the data.
Moreover, two or more sets may have the same
mean and/or median but they may be quite
different.
Thus to have a clear picture of data, one needs
to have a measure of dispersion or variability
(scatterness) amongst observations in the set.
E.g. Set A: 1, 2, 3, 4, 5 (mean = 3, median = 3)
Set B: 0, 1, 2, 3, 4, 5, 6 (mean = 3, median = 3)
1. Range
It is the difference between the largest and smallest
observations in the data.
$$\text{Range} = X_{max} - X_{min}$$
It is the crudest measure of dispersion.
Since it is based upon two extreme cases, it may be
considerably changed if either of the extreme cases
happens to drop out.
The extreme values may be unreliable; that is, they are
the most likely to be faulty.
It is not suitable with regard to the mathematical treatment
required in deriving the techniques of statistical inference.
Example: 60 40 30 50 60 40 70 50
Range = 70 - 30 = 40
2. INTERQUARTILE RANGE
Another approach to quantifying the spread in a
data set, which addresses some of the
shortcomings of the range, is the use of quantiles.
E.g. quartiles, deciles, percentiles, etc.
The pth percentile is the value Vp such that p
percent of the sample points are less than or
equal to Vp.
The median, being the 50th percentile, is a
special case of a quantile.
The calculation of quantiles is not as simple as it might
seem.
The data should be ranked from 1 to n in order of
increasing size.
The kth percentile is obtained by calculating
q=k(n+1)/100 and then interpolating between the two
values with ranks either side of the qth.
For example, for the 5th centile of a sample of 145
observations we have q=5 x 146/100=7.3.
We estimate the 5th centile as the value 0.3 of the way
between the 7th and 8th ranked observations.
If these data values are 11.4 and 14.9 the estimated
centile is 12.45.
The quartiles divide the distribution into four
equal parts.
a) The first quartile (Q1): the value such that
25% of all the ranked observations are less than or
equal to Q1.
b) The second quartile (Q2): the value such that
50% of all the ranked observations are less than or
equal to Q2. The second quartile is the median.
c) The third quartile (Q3): the value such that 75%
of all the ranked observations are less than or equal
to Q3.
The inter-quartile range (IQR) is the difference
between the third and the first quartiles: IQR = Q3 - Q1.
E.g. Given the following data set (age of patients):
18,59,24,42,21,23,24,32
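The worked solution for this data set is not shown in the text. As a sketch only, the quartiles can be computed with the rank rule given earlier for percentiles, q = k(n+1)/100, interpolating between the two observations on either side of rank q. Other quartile conventions exist and can give slightly different values; the function name `percentile` is illustrative.

```python
# Ages of patients from the example.
ages = [18, 59, 24, 42, 21, 23, 24, 32]


def percentile(values, k):
    """kth percentile using rank q = k(n + 1)/100 with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    q = k * (n + 1) / 100
    lower_rank = int(q)          # 1-based rank just below (or at) q
    fraction = q - lower_rank    # how far q lies towards the next rank
    if lower_rank < 1:
        return ordered[0]
    if lower_rank >= n:
        return ordered[-1]
    below = ordered[lower_rank - 1]
    above = ordered[lower_rank]
    return below + fraction * (above - below)


q1 = percentile(ages, 25)
q2 = percentile(ages, 50)
q3 = percentile(ages, 75)
iqr = q3 - q1
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
```

For these eight ages the ranks are 2.25, 4.5 and 6.75, so Q1 lies a quarter of the way between the 2nd and 3rd ranked values and Q3 three quarters of the way between the 6th and 7th.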
The inter-quartile range is a preferable
measure to the range because:
it is less prone to distortion by a single large
or small value, i.e., outliers in the data do not
affect the inter-quartile range.
it can be computed when the distribution has
open-end classes.
5. Mean deviation (MD)
Mean deviation is the average of the
absolute deviations taken from a central
value, generally the mean or median.
Consider a set of n observations x1, x2, ...,
xn. Then:
$$MD = \frac{1}{n} \sum_{i=1}^{n} |x_i - A|$$
where A is a central value (arithmetic mean or
median).
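As a small illustration of the formula, the sketch below computes the mean deviation about the arithmetic mean for the two illustrative sets used earlier in these notes (A: 1 to 5, B: 0 to 6), which share the same mean and median. The function name and the optional `center` argument are assumptions made for this example.

```python
def mean_deviation(values, center=None):
    """Average absolute deviation from a central value A (the mean by default)."""
    if center is None:
        center = sum(values) / len(values)
    return sum(abs(x - center) for x in values) / len(values)


set_a = [1, 2, 3, 4, 5]
set_b = [0, 1, 2, 3, 4, 5, 6]

print(f"MD of A about its mean: {mean_deviation(set_a):.3f}")
print(f"MD of B about its mean: {mean_deviation(set_b):.3f}")
# Passing center explicitly gives the deviation about another value, e.g. the median.
print(f"MD of A about 3: {mean_deviation(set_a, center=3):.3f}")
```

Although both sets have mean 3 and median 3, set B has the larger mean deviation, which is the kind of difference a measure of central tendency alone cannot show.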
Properties of mean deviation:
6. Variance (σ², s²)
The main objection to the mean deviation, that
the negative signs are ignored, is removed by
taking the square of the deviations from the
mean.
The deviations are squared because the sum of the
deviations of the individual observations of a
sample about the sample mean is always 0:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$
Variance is used to measure the dispersion
of values relative to the mean.
When values are close to their mean (narrow
range) the dispersion is less than when there
is scattering over a wide range.
Population variance = σ²
Sample variance = S²
a) Ungrouped data
Let X1, X2, ..., XN be the measurements on N
population units, then:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
where
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$
is the population mean.
A sample variance is calculated for a sample of
individual values (X1, X2, ..., Xn) and uses the sample
mean ($\bar{x}$) rather than the population mean ($\mu$).
b) Grouped data
$$S^2 = \frac{\sum_{i=1}^{k} (m_i - \bar{x})^2 f_i}{\sum_{i=1}^{k} f_i - 1}$$
where
mi = the mid-point of the ith class interval
fi = the frequency of the ith class interval
x̄ = the sample mean
k = the number of class intervals
Properties of Variance:
The main disadvantage of variance is that
its unit is the square of the unit of the
original measurement values.
The variance gives more weight to the
extreme values than to those
which are near the mean value, because
the differences are squared.
The drawbacks of variance are
overcome by the standard deviation.
7. Standard deviation (σ, s)
It is the square root of the variance.
This produces a measure having the same
scale as that of the individual values.
$$\sigma = \sqrt{\sigma^2} \quad \text{and} \quad S = \sqrt{S^2}$$
Following are the survival times of n=11
patients after heart transplant surgery.
Example. Compute the variance and SD of the age of
169 subjects from the grouped data.
Mean = 5810.5/169 = 34.38 years
S² = 20197.63/(169 - 1) = 120.22
SD = √S² = √120.22 = 10.96
Class interval   mi     fi    mi - Mean   (mi - Mean)²   (mi - Mean)² fi
10-19            14.5   4     -19.88      395.21         1580.86
20-29            24.5   66    -9.88       97.61          6442.55
30-39            34.5   47    0.12        0.0144         0.68
40-49            44.5   36    10.12       102.41         3686.92
50-59            54.5   12    20.12       404.81         4857.77
60-69            64.5   4     30.12       907.21         3628.86
Total                   169               1907.29        20197.63
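The grouped variance and SD can be recomputed directly from the class mid-points and frequencies. The sketch below assumes the sample formula with divisor (sum of frequencies - 1); the names are illustrative. Computed this way, the sum of mi fi is 5810.5, giving a mean of about 34.38 years, a variance of about 120.22 and an SD of about 10.96.

```python
import math

# (mid-point, frequency) for each age class of the 169 subjects.
age_groups = [(14.5, 4), (24.5, 66), (34.5, 47), (44.5, 36), (54.5, 12), (64.5, 4)]


def grouped_mean_and_variance(classes):
    """Sample mean and variance of grouped data, with divisor (sum of f) - 1."""
    n = sum(f for _, f in classes)
    mean = sum(m * f for m, f in classes) / n
    sum_sq = sum((m - mean) ** 2 * f for m, f in classes)
    return mean, sum_sq / (n - 1)


mean_age, variance = grouped_mean_and_variance(age_groups)
sd = math.sqrt(variance)
print(f"Mean = {mean_age:.2f} years")
print(f"Variance = {variance:.2f}")
print(f"SD = {sd:.2f}")
```

Keeping the mean unrounded inside the function avoids the small rounding drift that appears when squared deviations are built from a two-decimal mean.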
Properties of SD
The SD has the advantage of being expressed in
the same units of measurement as the mean
8. Coefficient of variation (CV)
When two data sets have different units of
measurements, or their means differ
sufficiently in size, the CV should be used
as a measure of dispersion.
It is the best measure for comparing the
variability of two series or sets of
observations.
A data set with a smaller coefficient of variation is
considered more consistent.
CV is the ratio of the SD to the mean, multiplied by
100.
$$CV = \frac{S}{\bar{x}} \times 100$$
              SD         Mean        CV (%)
SBP           15 mmHg    130 mmHg    11.5
Cholesterol   40 mg/dl   200 mg/dl   20.0
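The comparison in the table can be reproduced with a one-line function. This is a sketch; the function name is illustrative, and the guard for a zero mean is an added safety check rather than something stated in the slides.

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: (SD / mean) * 100."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100


cv_sbp = coefficient_of_variation(15, 130)           # systolic blood pressure
cv_cholesterol = coefficient_of_variation(40, 200)   # cholesterol

print(f"SBP CV = {cv_sbp:.1f}%")
print(f"Cholesterol CV = {cv_cholesterol:.1f}%")
```

Because the units cancel in the ratio, the two CVs can be compared directly even though the variables are measured on different scales; here cholesterol is relatively more variable than SBP.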
Thank you